Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing

Paperback

Price: $62.99 (original price $69.99; save 10%)

Product Details

ISBN-13: 9781491983874
Publisher: O'Reilly Media, Incorporated
Publication date: 08/09/2018
Pages: 352
Sales rank: 566,740
Product dimensions: 6.90(w) x 9.10(h) x 0.80(d)

About the Author

Tyler Akidau is a senior staff software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He is also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O’Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Slava Chernyak is a senior software engineer at Google Seattle. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Windmill, Google Cloud Dataflow's next-generation streaming backend, from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.

Reuven Lax is a senior staff software engineer at Google Seattle, and has spent the past nine years helping to shape Google's data processing and analysis strategy. For much of that time he has focused on Google's low-latency, streaming data processing efforts, first as a long-time member and lead of the MillWheel team, and more recently founding and leading the team responsible for Windmill, the next-generation stream processing engine powering Google Cloud Dataflow. He's very excited to bring Google's data-processing experience to the world at large, and proud to have been a part of publishing both the MillWheel paper in 2013 and the Dataflow Model paper in 2015. When not at work, Reuven enjoys swing dancing, rock climbing, and exploring new parts of the world.

Table of Contents

Preface Or: What Are You Getting Yourself Into Here? vii

Part I The Beam Model

1 Streaming 101 3

Terminology: What Is Streaming? 4

On the Greatly Exaggerated Limitations of Streaming 6

Event Time Versus Processing Time 9

Data Processing Patterns 12

Bounded Data 12

Unbounded Data: Batch 13

Unbounded Data: Streaming 14

Summary 22

2 The What, Where, When, and How of Data Processing 25

Roadmap 26

Batch Foundations: What and Where 28

What: Transformations 28

Where: Windowing 32

Going Streaming: When and How 34

When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things! 34

When: Watermarks 39

When: Early/On-Time/Late Triggers FTW! 44

When: Allowed Lateness (i.e., Garbage Collection) 47

How: Accumulation 51

Summary 55

3 Watermarks 59

Definition 59

Source Watermark Creation 62

Perfect Watermark Creation 64

Heuristic Watermark Creation 65

Watermark Propagation 67

Understanding Watermark Propagation 69

Watermark Propagation and Output Timestamps 75

The Tricky Case of Overlapping Windows 80

Percentile Watermarks 81

Processing-Time Watermarks 84

Case Studies 86

Case Study: Watermarks in Google Cloud Dataflow 87

Case Study: Watermarks in Apache Flink 88

Case Study: Source Watermarks for Google Cloud Pub/Sub 90

Summary 93

4 Advanced Windowing 95

When/Where: Processing-Time Windows 95

Event-Time Windowing 97

Processing-Time Windowing via Triggers 98

Processing-Time Windowing via Ingress Time 100

Where: Session Windows 103

Where: Custom Windowing 107

Variations on Fixed Windows 108

Variations on Session Windows 115

One Size Does Not Fit All 119

Summary 119

5 Exactly-Once and Side Effects 121

Why Exactly Once Matters 121

Accuracy Versus Completeness 122

Side Effects 123

Problem Definition 123

Ensuring Exactly Once in Shuffle 125

Addressing Determinism 126

Performance 127

Graph Optimization 127

Bloom Filters 128

Garbage Collection 129

Exactly Once in Sources 130

Exactly Once in Sinks 131

Use Cases 133

Example Source: Cloud Pub/Sub 133

Example Sink: Files 134

Example Sink: Google BigQuery 135

Other Systems 136

Apache Spark Streaming 136

Apache Flink 136

Summary 138

Part II Streams and Tables

6 Streams and Tables 141

Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity 142

Toward a General Theory of Stream and Table Relativity 143

Batch Processing Versus Streams and Tables 144

A Streams and Tables Analysis of MapReduce 144

Reconciling with Batch Processing 150

What, Where, When, and How in a Streams and Tables World 150

What: Transformations 150

Where: Windowing 154

When: Triggers 157

How: Accumulation 165

A Holistic View of Streams and Tables in the Beam Model 166

A General Theory of Stream and Table Relativity 171

Summary 172

7 The Practicalities of Persistent State 175

Motivation 175

The Inevitability of Failure 176

Correctness and Efficiency 177

Implicit State 178

Raw Grouping 179

Incremental Combining 181

Generalized State 184

Case Study: Conversion Attribution 186

Conversion Attribution with Apache Beam 189

Summary 199

8 Streaming SQL 201

What Is Streaming SQL? 201

Relational Algebra 202

Time-Varying Relations 203

Streams and Tables 207

Looking Backward: Stream and Table Biases 214

The Beam Model: A Stream-Biased Approach 214

The SQL Model: A Table-Biased Approach 218

Looking Forward: Toward Robust Streaming SQL 226

Stream and Table Selection 227

Temporal Operators 228

Summary 249

9 Streaming Joins 253

All Your Joins Are Belong to Streaming 253

Unwindowed Joins 254

Full Outer 255

Left Outer 258

Right Outer 259

Inner 259

Anti 261

Semi 262

Windowed Joins 266

Fixed Windows 267

Temporal Validity 269

Summary 282

10 The Evolution of Large-Scale Data Processing 283

MapReduce 284

Hadoop 288

Flume 289

Storm 294

Spark 297

MillWheel 300

Kafka 304

Cloud Dataflow 307

Flink 309

Beam 313

Summary 316

Index 319
