
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
352
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
352eBook
Related collections and offers
Overview
Streaming data is a big deal in big data these days. As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way.
Expanded from Tyler Akidau’s popular blog posts "Streaming 101" and "Streaming 102", this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You’ll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax.
You’ll explore:
- How streaming and batch data processing patterns compare
- The core principles and concepts behind robust out-of-order data processing
- How watermarks track progress and completeness in infinite datasets
- How exactly-once data processing techniques ensure correctness
- How the concepts of streams and tables form the foundations of both batch and streaming data processing
- The practical motivations behind a powerful persistent state mechanism, driven by a real-world example
- How time-varying relations provide a link between stream processing and the world of SQL and relational algebra
Product Details
ISBN-13: | 9781491983829 |
---|---|
Publisher: | O'Reilly Media, Incorporated |
Publication date: | 07/16/2018 |
Sold by: | Barnes & Noble |
Format: | eBook |
Pages: | 352 |
File size: | 16 MB |
Note: | This product may take a few minutes to download. |
About the Author
Tyler Akidau is a senior staff software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. His also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems the seamless merging between the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O’Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.
Slava Chernyak is a senior software engineer at Google Seattle. Slava spent over five years working on Google’s internal massive-scale streaming data processing systems and has since become involved with designing and building Windmill, Google Cloud Dataflow's next-generation streaming backend, from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.
Reuven Lax is a senior staff software engineer at Google Seattle, and has spent the past nine years helping to shape Google's data processing and analysis strategy. For much of that time he has focused on Google's low-latency, streaming data processing efforts, first as a long-time member and lead of the MillWheel team, and more recently founding and leading the team responsible for Windmill, the next-generation stream processing engine powering Google Cloud Dataflow. He's very excited to bring Google's data-processing experience to the world at large, and proud to have been a part of publishing both the MillWheel paper in 2013 and the Dataflow Model paper in 2015. When not at work, Reuven enjoys swing dancing, rock climbing, and exploring new parts of the world.
Table of Contents
Preface Or: What Are You Getting Yourself Into Here? vii
Part I The Beam Model
1 Streaming 101 3
Terminology: What Is Streaming? 4
On the Greatly Exaggerated Limitations of Streaming 6
Event Time Versus Processing Time 9
Data Processing Patterns 12
Bounded Data 12
Unbounded Data: Batch 13
Unbounded Data: Streaming 14
Summary 22
2 The What, Where, When, and How of Data Processing 25
Roadmap 26
Batch Foundations: What and Where 28
When: Transformations 28
Where: Windowing 32
Going Streaming: When and How 34
When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things! 34
When: Watermarks 39
When: Early/On-Time/Late Triggers FTW! 44
When: Allowed Lateness (i.e., Garbage Collection) 47
How: Accumulation 51
Summary 55
3 Watermarks 59
Definition 59
Source Watermark Creation 62
Perfect Watermark Creation 64
Heuristic Watermark Creation 65
Watermark Propagation 67
Understanding Watermark Propagation 69
Watermark Propagation and Output Timestamps 75
The Tricky Case of Overlapping Windows 80
Percentile Watermarks 81
Processing-Time Watermarks 84
Case Studies 86
Case Study: Watermarks in Google Cloud Dataflow 87
Case Study: Watermarks in Apache Flink 88
Case Study: Source Watermarks for Google Cloud Pub/Sub 90
Summary 93
4 Advanced Windowing 95
When/Where: Processing-Time Windows 95
Event-Time Windowing 97
Processing-Time Windowing via Triggers 98
Processing-Time Windowing via Ingress Time 100
Where: Session Windows 103
Where: Custom Windowing 107
Variations on Fixed Windows 108
Variations on Session Windows 115
One Size Does Not Fit All 119
Summary 119
5 Exactly-Once and Side Effects 121
Why Exactly Once Matters 121
Accuracy Versus Completeness 122
Side Effects 123
Problem Definition 123
Ensuring Exactly Once in Shuffle 125
Addressing Determinism 126
Performance 127
Graph Optimization 127
Bloom Filters 128
Garbage Collection 129
Exactly Once in Sources 130
Exactly Once in Sinks 131
Use Cases 133
Example Source: Cloud Pub/Sub 133
Example Sink: Files 134
Example Sink: Google BigQuery 135
Other Systems 136
Apache Spark Streaming 136
Apache Flink 136
Summary 138
Part II Streams and Tables
6 Streams and Tables 141
Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity 142
Toward a General Theory of Stream and Table Relativity 143
Batch Processing Versus Streams and Tables 144
A Streams and Tables Analysis of MapReduce 144
Reconciling with Batch Processing 150
What, Where, When, and How in a Streams and Tables World 150
What: Transformations 150
Where: Windowing 154
When: Triggers 157
How: Accumulation 165
A Holistic View of Streams and Tables in the Beam Model 166
A General Theory of Stream and Table Relativity 171
Summary 172
7 The Practicalities of Persistent State 175
Motivation 175
The Inevitability of Failure 176
Correctness and Efficiency 177
Implicit State 178
Raw Grouping 179
Incremental Combining 181
Generalized State 184
Case Study: Conversion Attribution 186
Conversion Attribution with Apache Beam 189
Summary 199
8 Streaming SQL 201
What Is Streaming SQL? 201
Relational Algebra 202
Time-Varying Relations 203
Streams and Tables 207
Looking Backward: Stream and Table Biases 214
The Beam Model: A Stream-Biased Approach 214
The SQL Model: A Table-Biased Approach 218
Looking Forward: Toward Robust Streaming SQL 226
Stream and Table Selection 227
Temporal Operators 228
Summary 249
9 Streaming Joins 253
All Your Joins Are Belong to Streaming 253
Unwindowed Joins 254
Full Outer 255
Left Outer 258
Right Outer 259
Inner 259
Anti 261
Semi 262
Windowed Joins 266
Fixed Windows 267
Temporal Validity 269
Summary 282
10 The Evolution of Large-Scale Data Processing 283
MapReduce 284
Hadoop 288
Flume 289
Storm 294
Spark 297
MillWheel 300
Kafka 304
Cloud Dataflow 307
Flink 309
Beam 313
Summary 316
Index 319