Data Pipelines with Apache Airflow
Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines.

Summary
A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task.

About the book
Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs.
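
To give a flavor of what such a DAG looks like in code, here is a minimal sketch, assuming Airflow 2.x; the DAG id, tasks, and commands are illustrative placeholders rather than examples from the book:

    # Minimal sketch of an Airflow DAG (illustrative; not code from the book).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def _transform():
        # Placeholder for a Python transformation step.
        print("transforming data")


    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        fetch = BashOperator(task_id="fetch", bash_command="echo 'fetching data'")
        transform = PythonOperator(task_id="transform", python_callable=_transform)

        # Declare the dependency: fetch must complete before transform runs.
        fetch >> transform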

What's inside
Build, test, and deploy Airflow pipelines as DAGs
Automate moving and transforming data
Analyze historical datasets using backfilling
Develop custom components (see the sketch after this list)
Set up Airflow in production environments
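
As a companion to the list above, here is a minimal sketch of a custom operator, the kind of component covered in chapter 8, assuming Airflow 2.x; the class name and behavior are hypothetical, not taken from the book:

    # Minimal sketch of a custom operator (hypothetical; not code from the book).
    from airflow.models.baseoperator import BaseOperator


    class GreetOperator(BaseOperator):
        """Toy operator: prints a greeting for the run's date stamp."""

        def __init__(self, name, **kwargs):
            super().__init__(**kwargs)
            self.name = name

        def execute(self, context):
            # The task context provides runtime values such as the date stamp "ds".
            print(f"Hello {self.name}, running for {context['ds']}")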

About the reader
For DevOps, data engineers, machine learning engineers, and sysadmins with intermediate Python skills.

About the author
Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies. Bas is also an Airflow committer.

Table of Contents

PART 1 - GETTING STARTED

1 Meet Apache Airflow
2 Anatomy of an Airflow DAG
3 Scheduling in Airflow
4 Templating tasks using the Airflow context
5 Defining dependencies between tasks

PART 2 - BEYOND THE BASICS

6 Triggering workflows
7 Communicating with external systems
8 Building custom components
9 Testing
10 Running tasks in containers

PART 3 - AIRFLOW IN PRACTICE

11 Best practices
12 Operating Airflow in production
13 Securing Airflow
14 Project: Finding the fastest way to get around NYC

PART 4 - IN THE CLOUDS

15 Airflow in the clouds
16 Airflow on AWS
17 Airflow on Azure
18 Airflow in GCP

Product Details

ISBN-13: 9781617296901
Publisher: Manning
Publication date: 04/27/2021
Pages: 480
Product dimensions: 7.38(w) x 9.25(h) x 1.10(d) inches
Format: Paperback
Price: $49.99

About the Author

Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies including Heineken, Unilever, and Booking.com. Bas is a committer, and both Bas and Julian are active contributors to Apache Airflow.

Table of Contents

Preface xv

Acknowledgments xvii

About this book xix

About the authors xxiii

About the cover illustration xxiv

Part 1 Getting Started 1

1 Meet Apache Airflow 3

1.1 Introducing data pipelines 4

Data pipelines as graphs 4

Executing a pipeline graph 6

Pipeline graphs vs. sequential scripts 6

Running pipelines using workflow managers 9

1.2 Introducing Airflow 10

Defining pipelines flexibly in (Python) code 10

Scheduling and executing pipelines 11

Monitoring and handling failures 13

Incremental loading and backfilling 15

1.3 When to use Airflow 17

Reasons to choose Airflow 17

Reasons not to choose Airflow 17

1.4 The rest of this book 18

2 Anatomy of an Airflow DAG 20

2.1 Collecting data from numerous sources 21

Exploring the data 21

2.2 Writing your first Airflow DAG 22

Tasks vs. operators 26

Running arbitrary Python code 27

2.3 Running a DAG in Airflow 29

Running Airflow in a Python environment 29

Running Airflow in Docker containers 30

Inspecting the Airflow UI 31

2.4 Running at regular intervals 33

2.5 Handling failing tasks 36

3 Scheduling in Airflow 40

3.1 An example: Processing user events 41

3.2 Running at regular intervals 42

Defining scheduling intervals 42

Cron-based intervals 44

Frequency-based intervals 46

3.3 Processing data incrementally 46

Fetching events incrementally 46

Dynamic time references using execution dates 48

Partitioning your data 50

3.4 Understanding Airflow's execution dates 52

Executing work in fixed-length intervals 52

3.5 Using backfilling to fill in past gaps 54

Executing work back in time 54

3.6 Best practices for designing tasks 55

Atomicity 55

Idempotency 57

4 Templating tasks using the Airflow context 60

4.1 Inspecting data for processing with Airflow 61

Determining how to load incremental data 61

4.2 Task context and Jinja templating 63

Templating operator arguments 64

What is available for templating? 66

Templating the PythonOperator 68

Providing variables to the PythonOperator 73

Inspecting templated arguments 75

4.3 Hooking up other systems 77

5 Defining dependencies between tasks 85

5.1 Basic dependencies 86

Linear dependencies 86

Fan-in/-out dependencies 87

5.2 Branching 90

Branching within tasks 90

Branching within the DAG 92

5.3 Conditional tasks 97

Conditions within tasks 97

Making tasks conditional 98

Using built-in operators 100

5.4 More about trigger rules 100

What is a trigger rule? 101

The effect of failures 102

Other trigger rules 103

5.5 Sharing data between tasks 104

Sharing data using XComs 104

When (not) to use XComs 107

Using custom XCom backends 108

5.6 Chaining Python tasks with the Taskflow API 108

Simplifying Python tasks with the Taskflow API 109

When (not) to use the Taskflow API 111

Part 2 Beyond the Basics 113

6 Triggering workflows 115

6.1 Polling conditions with sensors 116

Polling custom conditions 119

Sensors outside the happy flow 120

6.2 Triggering other DAGs 122

Backfilling with the TriggerDagRunOperator 126

Polling the state of other DAGs 127

6.3 Starting workflows with REST/CLI 131

7 Communicating with external systems 135

7.1 Connecting to cloud services 136

Installing extra dependencies 137

Developing a machine learning model 137

Developing locally with external systems 143

7.2 Moving data between systems 150

Implementing a PostgresToS3Operator 151

Outsourcing the heavy work 155

8 Building custom components 157

8.1 Starting with a PythonOperator 158

Simulating a movie rating API 158

Fetching ratings from the API 161

Building the actual DAG 164

8.2 Building a custom hook 166

Designing a custom hook 166

Building our DAG with the MovielensHook 172

8.3 Building a custom operator 173

Defining a custom operator 174

Building an operator for fetching ratings 175

8.4 Building custom sensors 178

8.5 Packaging your components 181

Bootstrapping a Python package 182

Installing your package 184

9 Testing 186

9.1 Getting started with testing 187

Integrity testing all DAGs 187

Setting up a CI/CD pipeline 193

Writing unit tests 195

Pytest project structure 196

Testing with files on disk 201

9.2 Working with DAGs and task context in tests 203

Working with external systems 208

9.3 Using tests for development 215

Testing complete DAGs 217

9.4 Emulate production environments with Whirl 218

9.5 Create DTAP environments 219

10 Running tasks in containers 220

10.1 Challenges of many different operators 221

Operator interfaces and implementations 221

Complex and conflicting dependencies 222

Moving toward a generic operator 223

10.2 Introducing containers 223

What are containers? 223

Running our first Docker container 224

Creating a Docker image 225

Persisting data using volumes 227

10.3 Containers and Airflow 230

Tasks in containers 230

Why use containers? 231

10.4 Running tasks in Docker 232

Introducing the DockerOperator 232

Creating container images for tasks 233

Building a DAG with Docker tasks 236

Docker-based workflow 239

10.5 Running tasks in Kubernetes 240

Introducing Kubernetes 240

Setting up Kubernetes 242

Using the KubernetesPodOperator 245

Diagnosing Kubernetes-related issues 248

Differences with Docker-based workflows 250

Part 3 Airflow in Practice 253

11 Best practices 255

11.1 Writing clean DAGs 256

Use style conventions 256

Manage credentials centrally 260

Specify configuration details consistently 261

Avoid doing any computation in your DAG definition 263

Use factories to generate common patterns 265

Group related tasks using task groups 269

Create new DAGs for big changes 270

11.2 Designing reproducible tasks 270

Always require tasks to be idempotent 271

Task results should be deterministic 271

Design tasks using functional paradigms 272

11.3 Handling data efficiently 272

Limit the amount of data being processed 272

Incremental loading/processing 274

Cache intermediate data 275

Don't store data on local file systems 275

Offload work to external/source systems 276

11.4 Managing your resources 276

Managing concurrency using pools 276

Detecting long-running tasks using SLAs and alerts 278

12 Operating Airflow in production 281

12.1 Airflow architectures 282

Which executor is right for me? 284

Configuring a metastore for Airflow 284

A closer look at the scheduler 286

12.2 Installing each executor 290

Setting up the SequentialExecutor 291

Setting up the LocalExecutor 292

Setting up the CeleryExecutor 293

Setting up the KubernetesExecutor 296

12.3 Capturing logs of all Airflow processes 302

Capturing the webserver output 303

Capturing the scheduler output 303

Capturing task logs 304

Sending logs to remote storage 305

12.4 Visualizing and monitoring Airflow metrics 305

Collecting metrics from Airflow 306

Configuring Airflow to send metrics 307

Configuring Prometheus to collect metrics 308

Creating dashboards with Grafana 310

What should you monitor? 312

12.5 How to get notified of a failing task 314

Alerting within DAGs and operators 314

Defining service-level agreements 316

12.6 Scalability and performance 318

Controlling the maximum number of running tasks 318

System performance configurations 319

Running multiple schedulers 320

13 Securing Airflow 322

13.1 Securing the Airflow web interface 323

Adding users to the RBAC interface 324

Configuring the RBAC interface 327

13.2 Encrypting data at rest 327

Creating a Fernet key 328

13.3 Connecting with an LDAP service 330

Understanding LDAP 330

Fetching users from an LDAP service 333

13.4 Encrypting traffic to the webserver 333

Understanding HTTPS 334

Configuring a certificate for HTTPS 336

13.5 Fetching credentials from secret management systems 339

14 Project: Finding the fastest way to get around NYC 344

14.1 Understanding the data 347

Yellow Cab file share 348

Citi Bike REST API 348

Deciding on a plan of approach 350

14.2 Extracting the data 350

Downloading Citi Bike data 351

Downloading Yellow Cab data 353

14.3 Applying similar transformations to data 356

14.4 Structuring a data pipeline 360

14.5 Developing idempotent data pipelines 361

Part 4 In the Clouds 365

15 Airflow in the clouds 367

15.1 Designing (cloud) deployment strategies 368

15.2 Cloud-specific operators and hooks 369

15.3 Managed services 370

Astronomer.io 371

Google Cloud Composer 371

Amazon Managed Workflows for Apache Airflow 372

15.4 Choosing a deployment strategy 372

16 Airflow on AWS 375

16.1 Deploying Airflow in AWS 375

Picking cloud services 376

Designing the network 377

Adding DAG syncing 378

Scaling with the CeleryExecutor 378

Further steps 380

16.2 AWS-specific hooks and operators 381

16.3 Use case: Serverless movie ranking with AWS Athena 383

Overview 383

Setting up resources 384

Building the DAG 387

Cleaning up 393

17 Airflow on Azure 394

17.1 Deploying Airflow in Azure 394

Picking services 395

Designing the network 395

Scaling with the CeleryExecutor 397

Further steps 398

17.2 Azure-specific hooks/operators 398

17.3 Example: Serverless movie ranking with Azure Synapse 400

Overview 400

Setting up resources 401

Building the DAG 404

Cleaning up 410

18 Airflow in GCP 412

18.1 Deploying Airflow in GCP 413

Picking services 413

Deploying on GKE with Helm 415

Integrating with Google services 417

Designing the network 419

Scaling with the CeleryExecutor 419

18.2 GCP-specific hooks and operators 422

18.3 Use case: Serverless movie ranking on GCP 427

Uploading to GCS 428

Getting data into BigQuery 429

Extracting top ratings 432

Appendix A Running code samples 436

Appendix B Package structures Airflow 1 and 2 439

Appendix C Prometheus metric mapping 443

Index 445
