Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

Get started with Apache Flink, the open source framework that powers some of the world’s largest stream processing applications. With this practical book, you’ll explore the fundamental concepts of parallel stream processing and discover how this technology differs from traditional batch data processing.

Longtime Apache Flink committers Fabian Hueske and Vasia Kalavri show you how to implement scalable streaming applications with Flink’s DataStream API and continuously run and maintain these applications in operational environments. Stream processing is ideal for many use cases, including low-latency ETL, streaming analytics, and real-time dashboards as well as fraud detection, anomaly detection, and alerting. You can process continuous data of any kind, including user interactions, financial transactions, and IoT data, as soon as you generate them.

  • Learn concepts and challenges of distributed stateful stream processing
  • Explore Flink’s system architecture, including its event-time processing mode and fault-tolerance model
  • Understand the fundamentals and building blocks of the DataStream API, including its time-based and statefuloperators
  • Read data from and write data to external systems with exactly-once consistency
  • Deploy and configure Flink clusters
  • Operate continuously running streaming applications
1125864957
Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

Get started with Apache Flink, the open source framework that powers some of the world’s largest stream processing applications. With this practical book, you’ll explore the fundamental concepts of parallel stream processing and discover how this technology differs from traditional batch data processing.

Longtime Apache Flink committers Fabian Hueske and Vasia Kalavri show you how to implement scalable streaming applications with Flink’s DataStream API and continuously run and maintain these applications in operational environments. Stream processing is ideal for many use cases, including low-latency ETL, streaming analytics, and real-time dashboards as well as fraud detection, anomaly detection, and alerting. You can process continuous data of any kind, including user interactions, financial transactions, and IoT data, as soon as you generate them.

  • Learn concepts and challenges of distributed stateful stream processing
  • Explore Flink’s system architecture, including its event-time processing mode and fault-tolerance model
  • Understand the fundamentals and building blocks of the DataStream API, including its time-based and statefuloperators
  • Read data from and write data to external systems with exactly-once consistency
  • Deploy and configure Flink clusters
  • Operate continuously running streaming applications
59.99 In Stock
Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications

eBook

$59.99 

Available on Compatible NOOK devices, the free NOOK App and in My Digital Library.
WANT A NOOK?  Explore Now

Related collections and offers


Overview

Get started with Apache Flink, the open source framework that powers some of the world’s largest stream processing applications. With this practical book, you’ll explore the fundamental concepts of parallel stream processing and discover how this technology differs from traditional batch data processing.

Longtime Apache Flink committers Fabian Hueske and Vasia Kalavri show you how to implement scalable streaming applications with Flink’s DataStream API and continuously run and maintain these applications in operational environments. Stream processing is ideal for many use cases, including low-latency ETL, streaming analytics, and real-time dashboards as well as fraud detection, anomaly detection, and alerting. You can process continuous data of any kind, including user interactions, financial transactions, and IoT data, as soon as you generate them.

  • Learn concepts and challenges of distributed stateful stream processing
  • Explore Flink’s system architecture, including its event-time processing mode and fault-tolerance model
  • Understand the fundamentals and building blocks of the DataStream API, including its time-based and statefuloperators
  • Read data from and write data to external systems with exactly-once consistency
  • Deploy and configure Flink clusters
  • Operate continuously running streaming applications

Product Details

ISBN-13: 9781491974247
Publisher: O'Reilly Media, Incorporated
Publication date: 04/11/2019
Sold by: Barnes & Noble
Format: eBook
Pages: 310
File size: 7 MB

About the Author

Fabian Hueske is a committer to and PMC member of the Apache Flink project and has been contributing to Flink since its earliest days. Fabian is cofounder, software engineer, and community evangelist at data Artisans (now Ververica), a Berlin-based startup that fosters Flink and its community. He holds a PhD in computer science from TU Berlin.


Vasiliki (Vasia) Kalavri is a postdoctoral fellow in the Systems Group at ETH Zurich, where she uses Apache Flink extensively for streaming systems research and teaching. Vasia is a PMC member of the Apache Flink project. An early contributor to Flink, she has worked on its graph processing library, Gelly, and on early versions of the Table API and streaming SQL.

Table of Contents

Preface ix

1 Introduction to Stateful Stream Processing 1

Traditional Data Infrastructures 1

Transactional Processing 2

Analytical Processing 3

Stateful Stream Processing 4

Event-Driven Applications 6

Data Pipelines 8

Streaming Analytics 8

The Evolution of Open Source Stream Processing 9

A Bit of History 10

A Quick Look at Flink 12

Running Your First Flink Application 13

Summary 15

2 Stream Processing Fundamentals 17

Introduction to Dataflow Programming 17

Dataflow Graphs 17

Data Parallelism and Task Parallelism 18

Data Exchange Strategies 19

Processing Streams in Parallel 20

Latency and Throughput 20

Operations on Data Streams 22

Time Semantics 27

What Does One Minute Mean in Stream Processing? 27

Processing Time 29

Event Time 29

Watermarks 30

Processing Time Versus Event Time 31

State and Consistency Models 32

Task Failures 33

Result Guarantees 34

Summary 36

3 The Architecture of Apache Flink 37

System Architecture 37

Components of a Flink Setup 38

Application Deployment 39

Task Execution 40

Highly Available Setup 42

Data Transfer in Flink 44

Credit-Based Flow Control 45

Task Chaining 46

Event-Time Processing 47

Times tamps 47

Watermarks 48

Watermark Propagation and Event Time 49

Timestamp Assignment and Watermark Generation 52

State Management 53

Operator State 54

Keyed State 54

State Backends 55

Scaling Stateful Operators 56

Checkpoints, Savepoints, and State Recovery 58

Consistent Checkpoints 59

Recovery from a Consistent Checkpoint 60

Flink's Checkpointing Algorithm 61

Perform ace Implications of Checkpointing 65

Savepoints 66

Summary 69

4 Setting Up a Development Environment for Apache Flink 71

Required Software 71

Run and Debug Flink Applications in an IDE 72

Import the Book's Examples in an IDE 72

Run Flink Applications in an IDE 75

Debug Flink Applications in an IDE 76

Bootstrap a Flink Maven Project 76

Summary 77

5 The DataStream API (v1.7) 79

Hello, Flink! 79

Set Up the Execution Environment 81

Read an Input Stream 81

Apply Transformations 82

Output the Result 82

Execute 83

Transformations 83

Basic Transformations 84

KeyedStream Transformations 87

Multistream Transformations 90

Distribution Transformations 94

Setting the Parallelism 96

Types 97

Supported Data Types 98

Creating Type Information for Data Types 100

Explicitly Providing Type Information 102

Defining Keys and Referencing Fields 102

Field Positions 103

Field Expressions 103

Key Selectors 104

Implementing Functions 105

Function Classes 105

Lambda Functions 106

Rich Functions 106

Including External and Flink Dependencies 107

Summary 108

6 Time-Based and Window Operators 109

Configuring Time Characteristics 109

Assigning Timestamps and Generating Watermarks 111

Watermarks, Latency, and Completeness 115

Process Functions 116

TimerService and Timers 117

Emitting to Side Outputs 119

CoProcessFunction 120

Window Operators 122

Defining Window Operators 122

Built-in Window Assigners 123

Applying Functions on Windows 127

Customizing Window Operators 134

Joining Streams on Time 145

Interval Join 145

Window Join 146

Handling Late Data 148

Dropping Late Events 148

Redirecting Late Events 148

Updating Results by including Late Events 150

Summary 152

7 Stateful Operators and Applications 153

Implementing Stateful Functions 154

Declaring Keyed State at RuntimeContext 154

Implementing Operator List State with the ListCheckpointed Interface 158

Using Connected Broadcast State 160

Losing the CheckpointedFunction Interface 164

Receiving Notifications About Completed Checkpoints 166

Enabling Failure Recovery for Stateful Applications 166

Ensuring the Maintainability of Stateful Applications 167

Specifying Unique Operator Identifiers 168

Defining the Maximum Parallelism of Keyed State Operators 168

Performance and Robustness of Stateful Applications 169

Choosing a State Backend 169

Choosing a State Primitive 171

Preventing Leaking State 171

Evolving Stateful Applications 174

Updating an Application without Modifying Existing State 175

Removing State from an Application 175

Modifying the State of an Operator 175

Queryable State 177

Architecture and Enabling Queryable Stale 177

Exposing Queryable State 179

Querying State from External Applications 180

Summary 182

8 Reading from and Writing to External Systems 183

Application Consistency Guarantees 184

Idempotent Writes 184

Transactional Writes 185

Provided Connectors 186

Apache Kafka Source Connector 187

Apache Kafka Sink Connector 190

Filesystem Source Connector 194

Filesystem Sink Connector 196

Apache Cassandra Sink Connector 199

Implementing a Custom Source Function 202

Resettable Source Functions 203

Source Functions, Timestamps, and Watermarks 204

Implementing a Custom Sink Function 206

Idempotent Sink Connectors 207

Transactional Sink Connectors 209

Asynchronously Accessing External Systems 216

Summary 219

9 Setting Up Flink for Streaming Applications 221

Deployment Modes 221

Standalone Cluster 221

Docker 223

Apache Hadoop YARN 225

Kubernetes 228

Highly Available Setups 232

HA Standalone Setup 233

HA YARN Setup 234

HA Kubernetes Setup 235

Integration with Hadoop Components 236

Filesystem Configuration 237

System Configuration 239

Java and Classloading 239

CPU 240

Main Memory and Network Buffers 240

Disk Storage 242

Checkpointing and State Backends 243

Security 243

Summary 244

10 Operating Flink and Streaming Applications 245

Running and Managing Streaming Applications 245

Savepoints 246

Managing Applications with the Command-Line Client 247

Managing Applications with the REST API 252

Bundling and Deploying Applications in Containers 258

Controlling Task Scheduling 260

Controlling Task Chaining 261

Defining Slot-Sharing Groups 262

Tuning Checkpointing and Recovery 263

Configuring Checkpointing 264

Configuring State Backends 266

Configuring Recovery 268

Monitoring Flink Clusters and Applications 270

Flink Web UI 270

Metric System 273

Monitoring Latency 278

Configuring the Logging Behavior 279

Summary 280

11 Where to Go from Here? 281

The Rest of the Flink Ecosystem 281

The DataSet API for Batch Processing 281

Table API and SQL for Relational Analysis 282

FlinkCEP for Complex Event Processing and Pattern Matching 282

Gelly for Graph Processing 282

A Welcoming Community 283

Index 285

From the B&N Reads Blog

Customer Reviews