Data Analysis with Python and PySpark

by Jonathan Rioux

Paperback

$59.99

Overview

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:

Manage your data as it scales across multiple machines
Scale up your data programs with full confidence
Read and write data to and from a variety of sources and formats
Deal with messy data with PySpark’s data manipulation functionality
Discover new data sets and perform exploratory data analysis
Build automated data pipelines that transform, summarize, and get insights from data
Troubleshoot common PySpark errors
Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you've learned and start implementing PySpark in your own data systems right away. No previous knowledge of Spark is required.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.
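
To give a feel for that Python-based API, here is a minimal, hypothetical sketch (not code from the book): it starts a local SparkSession, reads a CSV file, and runs a simple aggregation. The file name and column names are invented for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start (or reuse) a Spark session: the entry point to the DataFrame API.
spark = SparkSession.builder.appName("quick-look").getOrCreate()

# Read a CSV file into a data frame (the path and columns are hypothetical).
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Group, aggregate, and sort; Spark is lazy, so the plan only runs when show() is called.
(
    orders.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy(F.desc("total_amount"))
    .show(5)
)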

About the book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.
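
As a taste of the Python/pandas/PySpark blend the book covers, the sketch below (an illustration under assumed names, not an excerpt from the book) uses a pandas Series-to-Series UDF to apply pandas logic to a Spark column.

import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# A pandas Series-to-Series UDF: Spark sends column chunks to pandas and collects the results.
@F.pandas_udf(DoubleType())
def fahrenheit_to_celsius(degrees: pd.Series) -> pd.Series:
    return (degrees - 32) * 5.0 / 9.0

# Hypothetical data frame with a temperature column in degrees Fahrenheit.
weather = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
weather.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()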

What's inside

Organizing your PySpark code
Managing your data, no matter the size
Scaling up your data programs with full confidence
Troubleshooting common data pipeline problems
Creating reliable long-running jobs

About the reader
Written for data scientists and data engineers comfortable with Python.

About the author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Table of Contents

1 Introduction
PART 1 GET ACQUAINTED: FIRST STEPS IN PYSPARK
2 Your first data program in PySpark
3 Submitting and scaling your first PySpark program
4 Analyzing tabular data with pyspark.sql
5 Data frame gymnastics: Joining and grouping
PART 2 GET PROFICIENT: TRANSLATE YOUR IDEAS INTO CODE
6 Multidimensional data frames: Using PySpark with JSON data
7 Bilingual PySpark: Blending Python and SQL code
8 Extending PySpark with Python: RDD and UDFs
9 Big data is just a lot of small data: Using pandas UDFs
10 Your data under a different lens: Window functions
11 Faster PySpark: Understanding Spark’s query planning
PART 3 GET CONFIDENT: USING MACHINE LEARNING WITH PYSPARK
12 Setting the stage: Preparing features for machine learning
13 Robust machine learning with ML Pipelines
14 Building custom ML transformers and estimators

Product Details

ISBN-13: 9781617297205
Publisher: Manning
Publication date: 03/22/2022
Pages: 456
Sales rank: 543,204
Product dimensions: 7.38(w) x 9.25(h) x 0.90(d)

About the Author

As a data scientist for an engineering consultancy, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Table of Contents

Preface xi

Acknowledgments xiii

About this book xv

About the author xviii

About the cover illustration xix

1 Introduction 1

1.1 What is PySpark? 2

Taking it from the start: What is Spark? 2

PySpark = Spark + Python 3

Why PySpark? 4

1.2 Your very own factory: How PySpark works 6

Some physical planning with the cluster manager 7

A factory made efficient through a lazy leader 10

1.3 What will you learn in this book? 13

1.4 What do I need to get started? 14

Part 1 Get Acquainted: First steps in PySpark 15

2 Your first data program in PySpark 17

2.1 Setting up the PySpark shell 18

The SparkSession entry point 20

Configuring how chatty Spark is: The log level 22

2.2 Mapping our program 23

2.3 Ingest and explore: Setting the stage for data transformation 24

Reading data into a data frame with spark.read 25

From structure to content: Exploring our data frame with show() 28

2.4 Simple column transformations: Moving from a sentence to a list of words 31

Selecting specific columns using select() 32

Transforming columns: Splitting a string into a list of words 33

Renaming columns: alias and withColumnRenamed 35

Reshaping your data: Exploding a list into rows 36

Working with words: Changing case and removing punctuation 37

2.5 Filtering rows 40

3 Submitting and scaling your first PySpark program 45

3.1 Grouping records: Counting word frequencies 46

3.2 Ordering the results on the screen using orderBy 48

3.3 Writing data from a data frame 50

3.4 Putting it all together: Counting 52

Simplifying your dependencies with PySpark's import conventions 53

Simplifying our program via method chaining 54

3.5 Using spark-submit to launch your program in batch mode 56

3.6 What didn't happen in this chapter 58

3.7 Scaling up our word frequency program 58

4 Analyzing tabular data with pyspark.sql 62

4.1 What is tabular data? 63

How does PySpark represent tabular data? 64

4.2 PySpark for analyzing and processing tabular data 65

4.3 Reading and assessing delimited data in PySpark 67

A first pass at the SparkReader specialized for CSV files 67

Customizing the SparkReader object to read CSV data files 69

Exploring the shape of our data universe 72

4.4 The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing 73

Knowing what we want: Selecting columns 73

Keeping what we need: Deleting columns 76

Creating what's not there: New columns with withColumn() 78

Tidying our data frame: Renaming and reordering columns 81

Diagnosing a data frame with describe() and summary() 83

5 Data frame gymnastics: Joining and grouping 87

5.1 From many to one: Joining data 88

What's what in the world of joins 88

Knowing our left from our right 89

The rules to a successful join: The predicates 90

How do you do it: The join method 92

Naming conventions in the joining world 96

5.2 Summarizing the data via groupby and GroupedData 100

A simple groupby blueprint 101

A column is a column: Using agg() with custom column definitions 105

5.3 Taking care of null values: Drop and fill 106

Dropping it like it's hot: Using dropna() to remove records with null values 107

Filling values to our heart's content using fillna() 108

5.4 What was our question again? Our end-to-end program 109

Part 2 Get proficient: Translate your ideas into code 115

6 Multidimensional data frames: Using PySpark with JSON data 117

6.1 Reading JSON data: Getting ready for the schemapocalypse 118

Starting small: JSON data as a limited Python dictionary 119

Going bigger: Reading JSON data in PySpark 121

6.2 Breaking the second dimension with complex data types 123

When you have more than one value: The array 125

The map type: Keys and values within a column 129

6.3 The struct: Nesting columns within columns 131

Navigating structs as if they were nested columns 132

6.4 Building and using the data frame schema 135

Using Spark types as the base blocks of a schema 135

Reading a JSON document with a strict schema in place 138

Going full circle: Specifying your schemas in JSON 141

6.5 Putting it all together: Reducing duplicate data with complex data types 144

Getting to the "just right" data frame: Explode and collect 146

Building your own hierarchies: Struct as a function 148

7 Bilingual PySpark: Blending Python and SQL code 151

7.1 Banking on what we know: pyspark.sql vs. plain SQL 152

7.2 Preparing a data frame for SQL 154

Promoting a data frame to a Spark table 154

Using the Spark catalog 156

7.3 SQL and PySpark 157

7.4 Using SQL-like syntax within data frame methods 159

Get the rows and columns you want: select and where 159

Grouping similar records together: group by and order by 160

Filtering after grouping using having 161

Creating new tables/views using the CREATE keyword 163

Adding data to our table using UNION and JOIN 164

Organizing your SQL code better through subqueries and common table expressions 166

A quick summary of PySpark vs. SQL syntax 168

7.5 Simplifying our code: Blending SQL and Python 169

Using Python to increase resiliency and simplify the data reading stage 169

Using SQL-style expressions in PySpark 170

7.6 Conclusion 172

8 Extending PySpark with Python: RDD and UDFs 175

8.1 PySpark, freestyle: The RDD 176

Manipulating data the RDD way: map(), filter(), and reduce() 177

8.2 Using Python to extend PySpark via UDFs 185

It all starts with plain Python: Using typed Python functions 186

From Python function to UDFs using udf() 188

9 Big data is just a lot of small data: Using pandas UDFs 192

9.1 Column transformations with pandas: Using Series UDF 194

Connecting Spark to Google's BigQuery 194

Series to Series UDF: Column functions, but with pandas 199

Scalar UDF + cold start = Iterator of Series UDF 202

9.2 UDFs on grouped data: Aggregate and apply 205

Group aggregate UDFs 207

Group map UDF 208

9.3 What to use, when 210

10 Your data under a different lens: Window functions 215

10.1 Growing and using a simple window function 216

Identifying the coldest day of each year, the long way 217

Creating and using a simple window function to get the coldest days 219

Comparing both approaches 223

10.2 Beyond summarizing: Using ranking and analytical functions 224

Ranking functions: Quick, who's first? 225

Analytic functions: Looking back and ahead 230

10.3 Flex those windows! Using row and range boundaries 232

Counting, window style: Static, growing, unbounded 233

What you are vs. where you are: Range vs. rows 235

10.4 Going full circle: Using UDFs within windows 239

10.5 Look in the window: The main steps to a successful window function 240

11 Faster PySpark: Understanding Spark's query planning 244

11.1 Open sesame: Navigating the Spark UI to understand the environment 245

Reviewing the configuration: The environment tab 247

Greater than the sum of its parts: The Executors tab and resource management 249

Look at what you've done: Diagnosing a completed job via the Spark UI 254

Mapping the operations via Spark query plans: The SQL tab 257

The core of Spark: The parsed, analyzed, optimized, and physical plans 260

11.2 Thinking about performance: Operations and memory 263

Narrow vs. wide operations 264

Caching a data frame: Powerful, but often deadly (for perf) 269

Part 3 Get confident: Using machine learning with PySpark 275

12 Setting the stage: Preparing features for machine learning 277

12.1 Reading, exploring, and preparing our machine learning data set 278

Standardizing column names using toDF() 279

Exploring our data and getting our first feature columns 281

Addressing data mishaps and building our first feature set 283

Weeding out useless records and imputing binary features 286

Taking care of extreme values: Cleaning continuous columns 287

Weeding out the rare binary occurrence columns 290

12.2 Feature creation and refinement 291

Creating custom features 292

Removing highly correlated features 293

12.3 Feature preparation with transformers and estimators 296

Imputing continuous features using the Imputer estimator 298

Scaling our features using the MinMaxScaler estimator 300

13 Robust machine learning with ML Pipelines 303

13.1 Transformers and estimators: The building blocks of ML in Spark 304

Data comes in, data comes out: The Transformer 305

Data comes in, transformer comes out: The Estimator 310

13.2 Building a (complete) machine learning pipeline 312

Assembling the final data set with the vector column type 314

Training an ML model using a LogisticRegression classifier 316

13.3 Evaluating and optimizing our model 319

Assessing model accuracy: Confusion matrix and evaluator object 320

True positives vs. false positives: The ROC curve 323

Optimizing hyperparameters with cross-validation 325

13.4 Getting the biggest drivers from our model: Extracting the coefficients 328

14 Building custom ML transformers and estimators 331

14.1 Creating your own transformer 332

Designing a transformer: Thinking in terms of Params and transformation 333

Creating the Params of a transformer 335

Getters and setters: Being a nice PySpark citizen 337

Creating a custom transformer's initialization function 340

Creating our transformation function 341

Using our transformer 343

14.2 Creating your own estimator 344

Designing our estimator: From model to params 345

Implementing the companion model: Creating our own Mixin 347

Creating the ExtremeValueCapper estimator 350

Trying out our custom estimator 352

14.3 Using our transformer and estimator in an ML pipeline 353

Dealing with multiple inputCols 353

In practice: Inserting custom components into an ML pipeline 356

Appendix A Solutions to the exercises 361

Appendix B Installing PySpark 389

Appendix C Some useful Python concepts 408

Index 423
