Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark

by Russell Jurney

Paperback


Overview


Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools.

Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, Elasticsearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. You’ll also learn how to publish data science work as a web application and effect meaningful change in your organization.

  • Build value from your data in a series of agile sprints, using the data-value pyramid
  • Extract features for statistical models from a single dataset
  • Visualize data with charts, and expose different aspects through interactive reports
  • Use historical data to predict the future via classification and regression
  • Translate predictions into actions
  • Get feedback from users after each sprint to keep your project on track
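To illustrate the "predict the future via classification and regression" bullet above, here is a minimal sketch in plain Python. The book itself uses scikit-learn and Spark MLlib on real flight data; the tiny dataset and the `predict_arrival_delay` helper below are made up for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of a line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical historical data: departure delay vs. arrival delay (minutes)
departure_delays = [0, 5, 10, 20, 30, 60]
arrival_delays = [2, 6, 12, 18, 33, 58]

slope, intercept = fit_line(departure_delays, arrival_delays)

def predict_arrival_delay(departure_delay):
    """Predict an arrival delay from a departure delay using the fitted line."""
    return slope * departure_delay + intercept

print(round(predict_arrival_delay(15), 1))
```

The same idea scales up in the book: features are extracted with PySpark, models are trained with scikit-learn or Spark MLlib, and predictions are served back to users through a web application.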

Product Details

ISBN-13: 9781491960110
Publisher: O'Reilly Media, Incorporated
Publication date: 06/24/2017
Pages: 352
Sales rank: 634,918
Product dimensions: 7.00(w) x 9.10(h) x 0.90(d)

About the Author

Russell Jurney runs a boutique consultancy, Data Syndrome, specializing in building analytics products. He cut his data teeth in casino gaming, building web apps to analyze the performance of slot machines in the US and Mexico. After dabbling in entrepreneurship, interactive media, and journalism, he moved to Silicon Valley to build analytics applications at scale at Ning and LinkedIn. He lives on the ocean, in the fog, in Pacifica, California, with Bella the Data Dog.

Table of Contents

Preface ix

Part I Setup

1 Theory 3

Introduction 3

Definition 5

Methodology as Tweet 5

Agile Data Science Manifesto 6

The Problem with the Waterfall 10

Research Versus Application Development 11

The Problem with Agile Software 14

Eventual Quality: Financing Technical Debt 14

The Pull of the Waterfall 15

The Data Science Process 16

Setting Expectations 17

Data Science Team Roles 18

Recognizing the Opportunity and the Problem 19

Adapting to Change 21

Notes on Process 23

Code Review and Pair Programming 25

Agile Environments: Engineering Productivity 25

Realizing Ideas with Large-Format Printing 27

2 Agile Tools 29

Scalability = Simplicity 30

Agile Data Science Data Processing 30

Local Environment Setup 32

System Requirements 33

Setting Up Vagrant 33

Downloading the Data 33

EC2 Environment Setup 34

Downloading the Data 38

Getting and Running the Code 38

Getting the Code 38

Running the Code 38

Jupyter Notebooks 39

Touring the Toolset 39

Agile Stack Requirements 39

Python 3 39

Serializing Events with JSON Lines and Parquet 42

Collecting Data 45

Data Processing with Spark 45

Publishing Data with MongoDB 48

Searching Data with Elasticsearch 50

Distributed Streams with Apache Kafka 54

Processing Streams with PySpark Streaming 57

Machine Learning with scikit-learn and Spark MLlib 58

Scheduling with Apache Airflow (Incubating) 59

Reflecting on Our Workflow 70

Lightweight Web Applications 70

Presenting Our Data 73

Conclusion 75

3 Data 77

Air Travel Data 77

Flight On-Time Performance Data 78

OpenFlights Database 79

Weather Data 80

Data Processing in Agile Data Science 81

Structured Versus Semistructured Data 81

SQL Versus NoSQL 82

SQL 83

NoSQL and Dataflow Programming 83

Spark: SQL + NoSQL 84

Schemas in NoSQL 84

Data Serialization 85

Extracting and Exposing Features in Evolving Schemas 85

Conclusion 86

Part II Climbing the Pyramid

4 Collecting and Displaying Records 89

Putting It All Together 90

Collecting and Serializing Flight Data 91

Processing and Publishing Flight Records 94

Publishing Flight Records to MongoDB 95

Presenting Flight Records in a Browser 96

Serving Flights with Flask and pymongo 97

Rendering HTML5 with Jinja2 98

Agile Checkpoint 102

Listing Flights 103

Listing Flights with MongoDB 103

Paginating Data 106

Searching for Flights 112

Creating Our Index 112

Publishing Flights to Elasticsearch 113

Searching Flights on the Web 114

Conclusion 117

5 Visualizing Data with Charts and Tables 119

Chart Quality: Iteration Is Essential 120

Scaling a Database in the Publish/Decorate Model 120

First Order Form 121

Second Order Form 122

Third Order Form 123

Choosing a Form 123

Exploring Seasonality 124

Querying and Presenting Flight Volume 124

Extracting Metal (Airplanes [Entities]) 132

Extracting Tail Numbers 132

Assessing Our Airplanes 139

Data Enrichment 140

Reverse Engineering a Web Form 140

Gathering Tail Numbers 142

Automating Form Submission 143

Extracting Data from HTML 144

Evaluating Enriched Data 147

Conclusion 148

6 Exploring Data with Reports 149

Extracting Airlines (Entities) 150

Defining Airlines as Groups of Airplanes Using PySpark 150

Querying Airline Data in Mongo 151

Building an Airline Page in Flask 151

Linking Back to Our Airline Page 152

Creating an All Airlines Home Page 153

Curating Ontologies of Semi-structured Data 154

Improving Airlines 155

Adding Names to Carrier Codes 156

Incorporating Wikipedia Content 158

Publishing Enriched Airlines to Mongo 159

Enriched Airlines on the Web 160

Investigating Airplanes (Entities) 162

SQL Subqueries Versus Dataflow Programming 164

Dataflow Programming Without Subqueries 164

Subqueries in Spark SQL 165

Creating an Airplanes Home Page 166

Adding Search to the Airplanes Page 167

Creating a Manufacturers Bar Chart 172

Iterating on the Manufacturers Bar Chart 174

Entity Resolution: Another Chart Iteration 177

Conclusion 183

7 Making Predictions 185

The Role of Predictions 186

Predict What? 186

Introduction to Predictive Analytics 187

Making Predictions 187

Exploring Flight Delays 189

Extracting Features with PySpark 193

Building a Regression with scikit-learn 198

Loading Our Data 198

Sampling Our Data 199

Vectorizing Our Results 200

Preparing Our Training Data 201

Vectorizing Our Features 201

Sparse Versus Dense Matrices 203

Preparing an Experiment 204

Training Our Model 204

Testing Our Model 205

Conclusion 207

Building a Classifier with Spark MLlib 208

Loading Our Training Data with a Specified Schema 208

Addressing Nulls 210

Replacing FlightNum with Route 210

Bucketizing a Continuous Variable for Classification 211

Feature Vectorization with pyspark.ml.feature 219

Classification with Spark ML 221

Conclusion 223

8 Deploying Predictive Systems 225

Deploying a scikit-learn Application as a Web Service 225

Saving and Loading scikit-learn Models 226

Groundwork for Serving Predictions 227

Creating Our Flight Delay Regression API 228

Testing Our API 232

Pulling Our API into Our Product 232

Deploying Spark ML Applications in Batch with Airflow 234

Gathering Training Data in Production 235

Training, Storing, and Loading Spark ML Models 237

Creating Prediction Requests in Mongo 239

Fetching Prediction Requests from MongoDB 245

Making Predictions in a Batch with Spark ML 248

Storing Predictions in MongoDB 252

Displaying Batch Prediction Results in Our Web Application 253

Automating Our Workflow with Apache Airflow (Incubating) 256

Conclusion 264

Deploying Spark ML via Spark Streaming 264

Gathering Training Data in Production 265

Training, Storing, and Loading Spark ML Models 265

Sending Prediction Requests to Kafka 266

Making Predictions in Spark Streaming 277

Testing the Entire System 283

Conclusion 285

9 Improving Predictions 287

Fixing Our Prediction Problem 287

When to Improve Predictions 288

Improving Prediction Performance 288

Experimental Adhesion Method: See What Sticks 288

Establishing Rigorous Metrics for Experiments 289

Time of Day as a Feature 298

Incorporating Airplane Data 302

Extracting Airplane Features 302

Incorporating Airplane Features into Our Classifier Model 305

Incorporating Flight Time 310

Conclusion 313

A Manual Installation 315

Index 323
