Paperback
Overview
In Data Analysis with Python and PySpark you will learn how to:
Manage your data as it scales across multiple machines
Scale up your data programs with full confidence
Read and write data to and from a variety of sources and formats
Deal with messy data with PySpark’s data manipulation functionality
Discover new data sets and perform exploratory data analysis
Build automated data pipelines that transform, summarize, and get insights from data
Troubleshoot common PySpark errors
Create reliable long-running jobs
Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.
About the book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.
What's inside
Organizing your PySpark code
Managing your data, no matter the size
Scaling up your data programs with full confidence
Troubleshooting common data pipeline problems
Creating reliable long-running jobs
About the reader
Written for data scientists and data engineers comfortable with Python.
About the author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.
Table of Contents
1 Introduction
PART 1 GET ACQUAINTED: FIRST STEPS IN PYSPARK
2 Your first data program in PySpark
3 Submitting and scaling your first PySpark program
4 Analyzing tabular data with pyspark.sql
5 Data frame gymnastics: Joining and grouping
PART 2 GET PROFICIENT: TRANSLATE YOUR IDEAS INTO CODE
6 Multidimensional data frames: Using PySpark with JSON data
7 Bilingual PySpark: Blending Python and SQL code
8 Extending PySpark with Python: RDD and UDFs
9 Big data is just a lot of small data: Using pandas UDFs
10 Your data under a different lens: Window functions
11 Faster PySpark: Understanding Spark’s query planning
PART 3 GET CONFIDENT: USING MACHINE LEARNING WITH PYSPARK
12 Setting the stage: Preparing features for machine learning
13 Robust machine learning with ML Pipelines
14 Building custom ML transformers and estimators
Product Details
| ISBN-13: | 9781617297205 |
| --- | --- |
| Publisher: | Manning |
| Publication date: | 03/22/2022 |
| Pages: | 456 |
| Sales rank: | 543,204 |
| Product dimensions: | 7.38(w) x 9.25(h) x 0.90(d) |
Table of Contents
Preface xi
Acknowledgments xiii
About this book xv
About the author xviii
About the cover illustration xix
1 Introduction 1
1.1 What is PySpark? 2
Taking it from the start: What is Spark? 2
PySpark = Spark + Python 3
Why PySpark? 4
1.2 Your very own factory: How PySpark works 6
Some physical planning with the cluster manager 7
A factory made efficient through a lazy leader 10
1.3 What will you learn in this book? 13
1.4 What do I need to get started? 14
Part 1 Get Acquainted: First steps in PySpark 15
2 Your first data program in PySpark 17
2.1 Setting up the PySpark shell 18
The SparkSession entry point 20
Configuring how chatty Spark is: The log level 22
2.2 Mapping our program 23
2.3 Ingest and explore: Setting the stage for data transformation 24
Reading data into a data frame with spark.read 25
From structure to content: Exploring our data frame with show() 28
2.4 Simple column transformations: Moving from a sentence to a list of words 31
Selecting specific columns using select() 32
Transforming columns: Splitting a string into a list of words 33
Renaming columns: alias and withColumnRenamed 35
Reshaping your data: Exploding a list into rows 36
Working with words: Changing case and removing punctuation 37
2.5 Filtering rows 40
3 Submitting and scaling your first PySpark program 45
3.1 Grouping records: Counting word frequencies 46
3.2 Ordering the results on the screen using orderBy 48
3.3 Writing data from a data frame 50
3.4 Putting it all together: Counting 52
Simplifying your dependencies with PySpark's import conventions 53
Simplifying our program via method chaining 54
3.5 Using spark-submit to launch your program in batch mode 56
3.6 What didn't happen in this chapter 58
3.7 Scaling up our word frequency program 58
4 Analyzing tabular data with pyspark.sql 62
4.1 What is tabular data? 63
How does PySpark represent tabular data? 64
4.2 PySpark for analyzing and processing tabular data 65
4.3 Reading and assessing delimited data in PySpark 67
A first pass at the SparkReader specialized for CSV files 67
Customizing the SparkReader object to read CSV data files 69
Exploring the shape of our data universe 72
4.4 The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing 73
Knowing what we want: Selecting columns 73
Keeping what we need: Deleting columns 76
Creating what's not there: New columns with withColumn() 78
Tidying our data frame: Renaming and reordering columns 81
Diagnosing a data frame with describe() and summary() 83
5 Data frame gymnastics: Joining and grouping 87
5.1 From many to one: Joining data 88
What's what in the world of joins 88
Knowing our left from our right 89
The rules to a successful join: The predicates 90
How do you do it: The join method 92
Naming conventions in the joining world 96
5.2 Summarizing the data via groupby and GroupedData 100
A simple groupby blueprint 101
A column is a column: Using agg() with custom column definitions 105
5.3 Taking care of null values: Drop and fill 106
Dropping it like it's hot: Using dropna() to remove records with null values 107
Filling values to our heart's content using fillna() 108
5.4 What was our question again? Our end-to-end program 109
Part 2 Get proficient: Translate your ideas into code 115
6 Multidimensional data frames: Using PySpark with JSON data 117
6.1 Reading JSON data: Getting ready for the schemapocalypse 118
Starting small: JSON data as a limited Python dictionary 119
Going bigger: Reading JSON data in PySpark 121
6.2 Breaking the second dimension with complex data types 123
When you have more than one value: The array 125
The map type: Keys and values within a column 129
6.3 The struct: Nesting columns within columns 131
Navigating structs as if they were nested columns 132
6.4 Building and using the data frame schema 135
Using Spark types as the base blocks of a schema 135
Reading a JSON document with a strict schema in place 138
Going full circle: Specifying your schemas in JSON 141
6.5 Putting it all together: Reducing duplicate data with complex data types 144
Getting to the "just right" data frame: Explode and collect 146
Building your own hierarchies: Struct as a function 148
7 Bilingual PySpark: Blending Python and SQL code 151
7.1 Banking on what we know: pyspark.sql vs. plain SQL 152
7.2 Preparing a data frame for SQL 154
Promoting a data frame to a Spark table 154
Using the Spark catalog 156
7.3 SQL and PySpark 157
7.4 Using SQL-like syntax within data frame methods 159
Get the rows and columns you want: select and where 159
Grouping similar records together: group by and order by 160
Filtering after grouping using having 161
Creating new tables/views using the CREATE keyword 163
Adding data to our table using UNION and JOIN 164
Organizing your SQL code better through subqueries and common table expressions 166
A quick summary of PySpark vs. SQL syntax 168
7.5 Simplifying our code: Blending SQL and Python 169
Using Python to increase resiliency and simplify the data reading stage 169
Using SQL-style expressions in PySpark 170
7.6 Conclusion 172
8 Extending PySpark with Python: RDD and UDFs 175
8.1 PySpark, freestyle: The RDD 176
Manipulating data the RDD way: map(), filter(), and reduce() 177
8.2 Using Python to extend PySpark via UDFs 185
It all starts with plain Python: Using typed Python functions 186
From Python function to UDFs using udf() 188
9 Big data is just a lot of small data: Using pandas UDFs 192
9.1 Column transformations with pandas: Using Series UDF 194
Connecting Spark to Google's BigQuery 194
Series to Series UDF: Column functions, but with pandas 199
Scalar UDF + cold start = Iterator of Series UDF 202
9.2 UDFs on grouped data: Aggregate and apply 205
Group aggregate UDFs 207
Group map UDF 208
9.3 What to use, when 210
10 Your data under a different lens: Window functions 215
10.1 Growing and using a simple window function 216
Identifying the coldest day of each year, the long way 217
Creating and using a simple window function to get the coldest days 219
Comparing both approaches 223
10.2 Beyond summarizing: Using ranking and analytical functions 224
Ranking functions: Quick, who's first? 225
Analytic functions: Looking back and ahead 230
10.3 Flex those windows! Using row and range boundaries 232
Counting, window style: Static, growing, unbounded 233
What you are vs. where you are: Range vs. rows 235
10.4 Going full circle: Using UDFs within windows 239
10.5 Look in the window: The main steps to a successful window function 240
11 Faster PySpark: Understanding Spark's query planning 244
11.1 Open sesame: Navigating the Spark UI to understand the environment 245
Reviewing the configuration: The environment tab 247
Greater than the sum of its parts: The Executors tab and resource management 249
Look at what you've done: Diagnosing a completed job via the Spark UI 254
Mapping the operations via Spark query plans: The SQL tab 257
The core of Spark: The parsed, analyzed, optimized, and physical plans 260
11.2 Thinking about performance: Operations and memory 263
Narrow vs. wide operations 264
Caching a data frame: Powerful, but often deadly (for performance) 269
Part 3 Get confident: Using machine learning with PySpark 275
12 Setting the stage: Preparing features for machine learning 277
12.1 Reading, exploring, and preparing our machine learning data set 278
Standardizing column names using toDF() 279
Exploring our data and getting our first feature columns 281
Addressing data mishaps and building our first feature set 283
Weeding out useless records and imputing binary features 286
Taking care of extreme values: Cleaning continuous columns 287
Weeding out the rare binary occurrence columns 290
12.2 Feature creation and refinement 291
Creating custom features 292
Removing highly correlated features 293
12.3 Feature preparation with transformers and estimators 296
Imputing continuous features using the Imputer estimator 298
Scaling our features using the MinMaxScaler estimator 300
13 Robust machine learning with ML Pipelines 303
13.1 Transformers and estimators: The building blocks of ML in Spark 304
Data comes in, data comes out: The Transformer 305
Data comes in, transformer comes out: The Estimator 310
13.2 Building a (complete) machine learning pipeline 312
Assembling the final data set with the vector column type 314
Training an ML model using a LogisticRegression classifier 316
13.3 Evaluating and optimizing our model 319
Assessing model accuracy: Confusion matrix and evaluator object 320
True positives vs. false positives: The ROC curve 323
Optimizing hyperparameters with cross-validation 325
13.4 Getting the biggest drivers from our model: Extracting the coefficients 328
14 Building custom ML transformers and estimators 331
14.1 Creating your own transformer 332
Designing a transformer: Thinking in terms of Params and transformation 333
Creating the Params of a transformer 335
Getters and setters: Being a nice PySpark citizen 337
Creating a custom transformer's initialization function 340
Creating our transformation function 341
Using our transformer 343
14.2 Creating your own estimator 344
Designing our estimator: From model to params 345
Implementing the companion model: Creating our own Mixin 347
Creating the ExtremeValueCapper estimator 350
Trying out our custom estimator 352
14.3 Using our transformer and estimator in an ML pipeline 353
Dealing with multiple inputCols 353
In practice: Inserting custom components into an ML pipeline 356
Appendix A Solutions to the exercises 361
Appendix B Installing PySpark 389
Appendix C Some useful Python concepts 408
Index 423