Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code

by John T. Wolohan

Paperback (1st Edition), $49.99

Overview

Modern data science solutions need to be clean, easy to read, and scalable. In Mastering Large Datasets with Python, author J.T. Wolohan teaches you how to take a small project and scale it up using a functionally influenced approach to Python coding. You’ll explore methods and built-in Python tools that lend themselves to clarity and scalability, like high-performing parallelism, as well as distributed technologies that allow for high data throughput. The abundant hands-on exercises in this practical tutorial will lock in these essential skills for any large-scale data science project.



Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

Programming techniques that work well on laptop-sized data can slow to a crawl—or fail altogether—when applied to massive files or distributed datasets. By mastering the powerful map and reduce paradigm, along with the Python-based tools that support it, you can write data-centric applications that scale efficiently without requiring codebase rewrites as your requirements change.

Mastering Large Datasets with Python teaches you to write code that can handle datasets of any size. You’ll start with laptop-sized datasets that teach you to parallelize data analysis by breaking large tasks into smaller ones that can run simultaneously. You’ll then scale those same programs to industrial-sized datasets on a cluster of cloud servers. With the map and reduce paradigm firmly in place, you’ll explore tools like Hadoop and PySpark to efficiently process massive distributed datasets, speed up decision-making with machine learning, and simplify your data storage with AWS S3.

  • An introduction to the map and reduce paradigm
  • Parallelization with the multiprocessing module and pathos framework
  • Hadoop and Spark for distributed computing
  • Running AWS jobs to process large datasets
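
As a quick taste of the map and reduce style the book is built around, the sketch below runs a map step in parallel with the standard-library multiprocessing module and then folds the results with functools.reduce. The word-counting task and the in-memory input are illustrative stand-ins, not an example taken from the book.

    from functools import reduce
    from multiprocessing import Pool

    def count_words(line):
        # Map step: transform one record (a line of text) into a number.
        return len(line.split())

    def add(total, value):
        # Reduce step: fold the mapped values into a single accumulated result.
        return total + value

    if __name__ == "__main__":
        # Illustrative stand-in for a large dataset; in practice these records
        # would be read lazily from large files or remote storage.
        lines = ["the quick brown fox jumps over the lazy dog"] * 10000

        # Break the work into chunks and run the map step across worker processes.
        with Pool(processes=4) as pool:
            counts = pool.map(count_words, lines, chunksize=500)

        # Accumulate the per-line counts into one total.
        total = reduce(add, counts, 0)
        print(total)  # 90000

The same two small functions scale without modification: swapping Pool.map for a distributed map (for example, in PySpark) leaves count_words and add untouched.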


For Python programmers who need to work faster with more data.

J.T. Wolohan is a lead data scientist at Booz Allen Hamilton and a PhD researcher at Indiana University, Bloomington.



Table of Contents:

PART 1

1 ¦ Introduction

2 ¦ Accelerating large dataset work: Map and parallel computing

3 ¦ Function pipelines for mapping complex transformations

4 ¦ Processing large datasets with lazy workflows

5 ¦ Accumulation operations with reduce

6 ¦ Speeding up map and reduce with advanced parallelization

PART 2

7 ¦ Processing truly big datasets with Hadoop and Spark

8 ¦ Best practices for large data with Apache Streaming and mrjob

9 ¦ PageRank with map and reduce in PySpark

10 ¦ Faster decision-making with machine learning and PySpark

PART 3

11 ¦ Large datasets in the cloud with Amazon Web Services and S3

12 ¦ MapReduce in the cloud with Amazon’s Elastic MapReduce

Product Details

ISBN-13: 9781617296239
Publisher: Manning
Publication date: 01/21/2020
Edition description: 1st Edition
Pages: 312
Product dimensions: 7.30(w) x 9.20(h) x 0.80(d) inches

About the Author

J.T. Wolohan is a lead data scientist at Booz Allen Hamilton and a PhD researcher at Indiana University, Bloomington, affiliated with the Department of Information and Library Science and the School of Informatics and Computing. His professional work focuses on rapid prototyping and scalable AI. His research focuses on computational analysis of social uses of language online.

Table of Contents

Preface xiii

Acknowledgments xiv

About this book xv

About the author xix

About the cover illustration xx

Part 1 1

1 Introduction 3

1.1 What you'll learn in this book 4

1.2 Why large datasets? 5

1.3 What is parallel computing? 6

Understanding parallel computing 6

Scalable computing with the map and reduce style 7

When to program in a map and reduce style 9

1.4 The map and reduce style 9

The map function for transforming data 9

The reduce function for advanced transformations 11

Map and reduce for data transformation pipelines 12

1.5 Distributed computing for speed and scale 12

1.6 Hadoop: A distributed framework for map and reduce 14

1.7 Spark for high-powered map, reduce, and more 15

1.8 AWS Elastic MapReduce: Large datasets in the cloud 15

2 Accelerating large dataset work: Map and parallel computing 17

2.1 An introduction to map 18

Retrieving URLs with map 22

The power of lazy functions (like map) for large datasets 24

2.2 Parallel processing 25

Processors and processing 27

Parallelization and pickling 28

Order and parallelization 32

State and parallelization 33

2.3 Putting it all together: Scraping a Wikipedia network 36

Visualizing our graph 40

Returning to map 42

2.4 Exercises 43

Problems of parallelization 43

Map function 43

Parallelization and speed 44

Pickling storage 44

Web scraping data 44

Heterogeneous map transformations 44

3 Function pipelines for mapping complex transformations 46

3.1 Helper functions and function chains 47

3.2 Unmasking hacker communications 49

Creating helper functions 52

Creating a pipeline 53

3.3 Twitter demographic projections 56

Tweet-level pipeline 57

User-level pipeline 63

Applying the pipeline 64

3.4 Exercises 66

Helper functions and function pipelines 66

Math teacher trick 66

Caesar's cipher 66

4 Processing large datasets with lazy workflows 68

4.1 What is laziness? 69

4.2 Some lazy functions to know 69

Shrinking sequences with the filter function 70

Combining sequences with zip 72

Lazy file searching with iglob 73

4.3 Understanding iterators: The magic behind lazy Python 73

The backbone of lazy Python: Iterators 73

Generators: Functions for creating data 75

4.4 The poetry puzzle: Lazily processing a large dataset 77

Generating data for this example 77

Reading poems in with iglob 78

A poem-cleaning regular expression class 79

Calculating the ratio of articles 80

4.5 Lazy simulations: Simulating fishing villages 83

Creating a village class 85

Designing the simulation class for our fishing simulation 86

4.6 Exercises 92

Lazy functions 92

Fizz buzz generator 92

Repeat access 92

Parallel simulations 92

Scrabble words 92

5 Accumulation operations with reduce 94

5.1 N-to-X with reduce 95

5.2 The three parts of reduce 96

Accumulation functions in reduce 97

Concise accumulations using lambda functions 99

Initializers for complex start behavior in reduce 101

5.3 Reductions you're familiar with 103

Creating a filter with reduce 103

Creating frequencies with reduce 104

5.4 Using map and reduce together 106

5.5 Analyzing car trends with reduce 108

Using map to clean our car data 109

Using reduce for sums and counts 112

Applying the map and reduce pattern to cars data 113

5.6 Speeding up map and reduce 114

5.7 Exercises 116

Situations to use reduce 116

Lambda functions 116

Largest numbers 116

Group words by length 117

6 Speeding up map and reduce with advanced parallelization 118

6.1 Getting the most out of parallel map 119

Chunk sizes and getting the most out of parallel map 119

Parallel map runtime with variable sequence and chunk size 121

More parallel maps: .imap and starmap 124

6.2 Solving the parallel map and reduce paradox 126

Parallel reduce for faster reductions 126

Combination functions and the parallel reduce workflow 128

Implementing parallel summation, filter, and frequencies with fold 134

Part 2 139

7 Processing truly big datasets with Hadoop and Spark 141

7.1 Distributed computing 142

7.2 Hadoop for batch processing 144

Getting to know the five Hadoop modules 144

7.3 Using Hadoop to find high-scoring words 147

MapReduce jobs using Python and Hadoop Streaming 148

Scoring words using Hadoop Streaming 150

7.4 Spark for interactive workflows 152

Big datasets in memory with Spark 152

PySpark for mixing Python and Spark 153

Enterprise data analytics using Spark SQL 154

Columns of data with Spark DataFrame 154

7.5 Document word scores in Spark 155

Setting up Spark 156

MapReduce Spark jobs with spark-submit 157

7.6 Exercises 159

Hadoop streaming scripts 159

Spark interface 159

RDDs 159

Passing data between steps 159

8 Best practices for large data with Apache Streaming and mrjob 161

8.1 Unstructured data: Logs and documents 162

8.2 Tennis analytics with Hadoop 163

A mapper for reading match data 163

Reducer for calculating tennis player ratings 166

8.3 mrjob for Pythonic Hadoop streaming 169

The Pythonic structure of a mrjob job 170

Counting errors with mrjob 172

8.4 Tennis match analysis with mrjob 175

Counting Serena's dominance by court type 175

Sibling rivalry for the ages 177

8.5 Exercises 180

Hadoop data formats 180

More Hadoop data formats 180

Hadoop's native tongue 180

Designing common patterns in MRJob 180

9 PageRank with map and reduce in PySpark 182

9.1 A closer look at PySpark 183

Map-like methods in PySpark 183

Reduce-like methods in PySpark 187

Convenience methods in PySpark 189

9.2 Tennis rankings with Elo and PageRank in PySpark 192

Revisiting Elo ratings with PySpark 192

Introducing the PageRank algorithm 194

Ranking tennis players with PageRank 197

9.3 Exercises 204

SumByKey 204

SumByKey with took 204

Spark and took 204

Wikipedia PageRank 204

10 Faster decision-making with machine learning and PySpark 206

10.1 What is machine learning? 207

Machine learning as self-adjusting judgmental algorithms 208

Common applications of machine learning 210

10.2 Machine learning basics with decision tree classifiers 214

Designing decision tree classifiers 214

Implementing a decision tree in PySpark 217

10.3 Fast random forest classifications in PySpark 224

Understanding random forest classifiers 225

Implementing a random forest classifier 226

Part 3 231

11 Large datasets in the cloud with Amazon Web Services and S3 233

11.1 AWS Simple Storage Service: A solution for large datasets 234

Limitless storage with S3 234

Cloud-based storage for scalability 235

Objects for convenient heterogeneous storage 236

Managed service for conveniently managing large datasets 237

Life cycle policies for managing large datasets over time 237

11.2 Storing data in the cloud with S3 239

Storing data with S3 through the browser 239

Programmatic access to S3 with Python and boto 249

11.3 Exercises 254

S3 Storage classes 254

S3 storage region 254

Object storage 255

12 MapReduce in the cloud with Amazon's Elastic MapReduce 256

12.1 Running Hadoop on EMR with mrjob 257

Convenient cloud clusters with EMR 257

Starting EMR clusters with mrjob 258

The AWS EMR browser interface 261

12.2 Machine learning in the cloud with Spark on EMR 265

Writing our machine learning model 265

Setting up an EMR cluster for Spark 270

Running PySpark jobs from our cluster 275

12.3 Exercises 278

R-series cluster 278

Back-to-back Hadoop jobs 278

Instance types 279

Index 281
