Table of Contents
Preface ix
1 Introduction to Stateful Stream Processing 1
Traditional Data Infrastructures 1
Transactional Processing 2
Analytical Processing 3
Stateful Stream Processing 4
Event-Driven Applications 6
Data Pipelines 8
Streaming Analytics 8
The Evolution of Open Source Stream Processing 9
A Bit of History 10
A Quick Look at Flink 12
Running Your First Flink Application 13
Summary 15
2 Stream Processing Fundamentals 17
Introduction to Dataflow Programming 17
Dataflow Graphs 17
Data Parallelism and Task Parallelism 18
Data Exchange Strategies 19
Processing Streams in Parallel 20
Latency and Throughput 20
Operations on Data Streams 22
Time Semantics 27
What Does One Minute Mean in Stream Processing? 27
Processing Time 29
Event Time 29
Watermarks 30
Processing Time Versus Event Time 31
State and Consistency Models 32
Task Failures 33
Result Guarantees 34
Summary 36
3 The Architecture of Apache Flink 37
System Architecture 37
Components of a Flink Setup 38
Application Deployment 39
Task Execution 40
Highly Available Setup 42
Data Transfer in Flink 44
Credit-Based Flow Control 45
Task Chaining 46
Event-Time Processing 47
Times tamps 47
Watermarks 48
Watermark Propagation and Event Time 49
Timestamp Assignment and Watermark Generation 52
State Management 53
Operator State 54
Keyed State 54
State Backends 55
Scaling Stateful Operators 56
Checkpoints, Savepoints, and State Recovery 58
Consistent Checkpoints 59
Recovery from a Consistent Checkpoint 60
Flink's Checkpointing Algorithm 61
Perform ace Implications of Checkpointing 65
Savepoints 66
Summary 69
4 Setting Up a Development Environment for Apache Flink 71
Required Software 71
Run and Debug Flink Applications in an IDE 72
Import the Book's Examples in an IDE 72
Run Flink Applications in an IDE 75
Debug Flink Applications in an IDE 76
Bootstrap a Flink Maven Project 76
Summary 77
5 The DataStream API (v1.7) 79
Hello, Flink! 79
Set Up the Execution Environment 81
Read an Input Stream 81
Apply Transformations 82
Output the Result 82
Execute 83
Transformations 83
Basic Transformations 84
KeyedStream Transformations 87
Multistream Transformations 90
Distribution Transformations 94
Setting the Parallelism 96
Types 97
Supported Data Types 98
Creating Type Information for Data Types 100
Explicitly Providing Type Information 102
Defining Keys and Referencing Fields 102
Field Positions 103
Field Expressions 103
Key Selectors 104
Implementing Functions 105
Function Classes 105
Lambda Functions 106
Rich Functions 106
Including External and Flink Dependencies 107
Summary 108
6 Time-Based and Window Operators 109
Configuring Time Characteristics 109
Assigning Timestamps and Generating Watermarks 111
Watermarks, Latency, and Completeness 115
Process Functions 116
TimerService and Timers 117
Emitting to Side Outputs 119
CoProcessFunction 120
Window Operators 122
Defining Window Operators 122
Built-in Window Assigners 123
Applying Functions on Windows 127
Customizing Window Operators 134
Joining Streams on Time 145
Interval Join 145
Window Join 146
Handling Late Data 148
Dropping Late Events 148
Redirecting Late Events 148
Updating Results by including Late Events 150
Summary 152
7 Stateful Operators and Applications 153
Implementing Stateful Functions 154
Declaring Keyed State at RuntimeContext 154
Implementing Operator List State with the ListCheckpointed Interface 158
Using Connected Broadcast State 160
Losing the CheckpointedFunction Interface 164
Receiving Notifications About Completed Checkpoints 166
Enabling Failure Recovery for Stateful Applications 166
Ensuring the Maintainability of Stateful Applications 167
Specifying Unique Operator Identifiers 168
Defining the Maximum Parallelism of Keyed State Operators 168
Performance and Robustness of Stateful Applications 169
Choosing a State Backend 169
Choosing a State Primitive 171
Preventing Leaking State 171
Evolving Stateful Applications 174
Updating an Application without Modifying Existing State 175
Removing State from an Application 175
Modifying the State of an Operator 175
Queryable State 177
Architecture and Enabling Queryable Stale 177
Exposing Queryable State 179
Querying State from External Applications 180
Summary 182
8 Reading from and Writing to External Systems 183
Application Consistency Guarantees 184
Idempotent Writes 184
Transactional Writes 185
Provided Connectors 186
Apache Kafka Source Connector 187
Apache Kafka Sink Connector 190
Filesystem Source Connector 194
Filesystem Sink Connector 196
Apache Cassandra Sink Connector 199
Implementing a Custom Source Function 202
Resettable Source Functions 203
Source Functions, Timestamps, and Watermarks 204
Implementing a Custom Sink Function 206
Idempotent Sink Connectors 207
Transactional Sink Connectors 209
Asynchronously Accessing External Systems 216
Summary 219
9 Setting Up Flink for Streaming Applications 221
Deployment Modes 221
Standalone Cluster 221
Docker 223
Apache Hadoop YARN 225
Kubernetes 228
Highly Available Setups 232
HA Standalone Setup 233
HA YARN Setup 234
HA Kubernetes Setup 235
Integration with Hadoop Components 236
Filesystem Configuration 237
System Configuration 239
Java and Classloading 239
CPU 240
Main Memory and Network Buffers 240
Disk Storage 242
Checkpointing and State Backends 243
Security 243
Summary 244
10 Operating Flink and Streaming Applications 245
Running and Managing Streaming Applications 245
Savepoints 246
Managing Applications with the Command-Line Client 247
Managing Applications with the REST API 252
Bundling and Deploying Applications in Containers 258
Controlling Task Scheduling 260
Controlling Task Chaining 261
Defining Slot-Sharing Groups 262
Tuning Checkpointing and Recovery 263
Configuring Checkpointing 264
Configuring State Backends 266
Configuring Recovery 268
Monitoring Flink Clusters and Applications 270
Flink Web UI 270
Metric System 273
Monitoring Latency 278
Configuring the Logging Behavior 279
Summary 280
11 Where to Go from Here? 281
The Rest of the Flink Ecosystem 281
The DataSet API for Batch Processing 281
Table API and SQL for Relational Analysis 282
FlinkCEP for Complex Event Processing and Pattern Matching 282
Gelly for Graph Processing 282
A Welcoming Community 283
Index 285