Data Science with Python and Dask
Summary

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-learn. With Dask you can crunch and work with huge datasets, using the tools you already have. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work!

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications. You'll find registration instructions inside the print book.

About the Technology

An efficient data pipeline means everything for the success of a data science project. Dask is a flexible library for parallel computing in Python that makes it easy to build intuitive workflows for ingesting and analyzing large, distributed datasets. Dask provides dynamic task scheduling and parallel collections that extend the functionality of NumPy, Pandas, and Scikit-learn, enabling users to scale their code from a single laptop to a cluster of hundreds of machines with ease.

About the Book

Data Science with Python and Dask teaches you to build scalable projects that can handle massive datasets. After meeting the Dask framework, you'll analyze data in the NYC Parking Ticket database and use DataFrames to streamline your process. Then, you'll create machine learning models using Dask-ML, build interactive visualizations, and build clusters using AWS and Docker.

What's inside

  • Working with large, structured and unstructured datasets
  • Visualization with Seaborn and Datashader
  • Implementing your own algorithms
  • Building distributed apps with Dask Distributed
  • Packaging and deploying Dask apps

About the Reader

For data scientists and developers with experience using Python and the PyData stack.

About the Author

Jesse Daniel is an experienced Python developer. He taught Python for Data Science at the University of Denver and leads a team of data scientists at a Denver-based media technology company.

Table of Contents

    PART 1 - The Building Blocks of Scalable Computing
  1. Why scalable computing matters
  2. Introducing Dask
    PART 2 - Working with Structured Data Using Dask DataFrames
  3. Introducing Dask DataFrames
  4. Loading data into DataFrames
  5. Cleaning and transforming DataFrames
  6. Summarizing and analyzing DataFrames
  7. Visualizing DataFrames with Seaborn
  8. Visualizing location data with Datashader
    PART 3 - Extending and Deploying Dask
  9. Working with Bags and Arrays
  10. Machine learning with Dask-ML
  11. Scaling and deploying Dask
Data Science with Python and Dask

by Jesse Daniel

eBook

$36.99 

Product Details

ISBN-13: 9781638353546
Publisher: Manning
Publication date: 07/08/2019
Sold by: SIMON & SCHUSTER
Format: eBook
Pages: 296
File size: 19 MB

Table of Contents

Preface ix

Acknowledgments xi

About this book xiii

About the author xvi

About the cover illustration xvii

Part 1 The Building Blocks of Scalable Computing 1

1 Why scalable computing matters 3

1.1 Why Dask? 5

1.2 Cooking with DAGs 10

1.3 Scaling out, concurrency, and recovery 14

Scaling up vs. scaling out 15

Concurrency and resource management 18

Recovering from failures 18

1.4 Introducing a companion dataset 19

2 Introducing Dask 21

2.1 Hello Dask: A first look at the DataFrame API 22

Examining the metadata of Dask objects 22

Running computations with the compute method 26

Making complex computations more efficient with persist 28

2.2 Visualizing DAGs 29

Visualizing a simple DAG using Dask Delayed objects 29

Visualizing more complex DAGs with loops and collections 30

Reducing DAG complexity with persist 33

2.3 Task scheduling 36

Lazy computations 36

Data locality 37

Part 2 Working with Structured Data Using Dask DataFrames 41

3 Introducing Dask DataFrames 43

3.1 Why use DataFrames? 44

3.2 Dask and Pandas 46

Managing DataFrame partitioning 47

What is the shuffle? 50

3.3 Limitations of Dask DataFrames 52

4 Loading data into DataFrames 54

4.1 Reading data from text files 55

Using Dask datatypes 61

Creating schemas for Dask DataFrames 63

4.2 Reading data from relational databases 67

4.3 Reading data from HDFS and S3 70

4.4 Reading data in Parquet format 74

5 Cleaning and transforming DataFrames 78

5.1 Working with indexes and axes 80

Selecting columns from a DataFrame 80

Dropping columns from a DataFrame 83

Renaming columns in a DataFrame 84

Selecting rows from a DataFrame 85

5.2 Dealing with missing values 87

Counting missing values in a DataFrame 87

Dropping columns with missing values 88

Imputing missing values 89

Dropping rows with missing data 90

Imputing multiple columns with missing values 91

5.3 Recoding data 93

5.4 Elementwise operations 97

5.5 Filtering and reindexing DataFrames 99

5.6 Joining and concatenating DataFrames 101

Joining two DataFrames 103

Unioning two DataFrames 105

5.7 Writing data to text files and Parquet files 107

Writing to delimited text files 107

Writing to Parquet files 108

6 Summarizing and analyzing DataFrames 110

6.1 Descriptive statistics 111

What are descriptive statistics? 111

Calculating descriptive statistics with Dask 114

Using the describe method for descriptive statistics 119

6.2 Built-in aggregate functions 119

What is correlation? 120

Calculating correlations with Dask DataFrames 121

6.3 Custom aggregate functions 126

Testing categorical variables with the t-test 126

Using custom aggregates to implement the Brown-Forsythe test 128

6.4 Rolling (window) functions 141

Preparing data for a rolling function 141

Using the rolling method to apply a window function 142

7 Visualizing DataFrames with Seaborn 145

7.1 The prepare-reduce-collect-plot pattern 147

7.2 Visualizing continuous relationships with scatterplot and regplot 150

Creating a scatterplot with Dask and Seaborn 150

Adding a linear regression line to the scatterplot 153

Adding a nonlinear regression line to a scatterplot 154

7.3 Visualizing categorical relationships with violinplot 156

Creating a violinplot with Dask and Seaborn 157

Randomly sampling data from a Dask DataFrame 160

7.4 Visualizing two categorical relationships with heatmap 161

8 Visualizing location data with Datashader 165

8.1 What is Datashader and how does it work? 166

The five stages of the Datashader rendering pipeline 167

Creating a Datashader Visualization 171

8.2 Plotting location data as an interactive heatmap 173

Preparing geographic data for map tiling 173

Creating the interactive heatmap 174

Part 3 Extending and Deploying Dask 179

9 Working with Bags and Arrays 181

9.1 Reading and parsing unstructured data with Bags 183

Selecting and viewing data from a Bag 184

Common parsing issues and how to overcome them 185

Working with delimiters 186

9.2 Transforming, filtering, and folding elements 192

Transforming elements with the map method 193

Filtering Bags with the filter method 195

Calculating descriptive statistics on Bags 198

Creating aggregate functions using the foldby method 199

9.3 Building Arrays and DataFrames from Bags 201

9.4 Using Bags for parallel text analysis with NLTK 203

The basics of bigram analysis 203

Extracting tokens and filtering stopwords 204

Analyzing the bigrams 208

10 Machine learning with Dask-ML 211

10.1 Building linear models with Dask-ML 213

Preparing the data with binary vectorization 214

Building a logistic regression model with Dask-ML 220

10.2 Evaluating and tuning Dask-ML models 222

Evaluating Dask-ML models with the score method 222

Building a naive Bayes classifier with Dask-ML 223

Automatically tuning hyperparameters 224

10.3 Persisting Dask-ML models 227

11 Scaling and deploying Dask 229

11.1 Building a Dask cluster on Amazon AWS with Docker 230

Getting started 232

Creating a security key 233

Creating the ECS cluster 234

Configuring the cluster's networking 237

Creating a shared data drive in Elastic File System 241

Allocating space for Docker images in Elastic Container Repository 246

Building and deploying images for scheduler, worker, and notebook 246

Connecting to the cluster 253

11.2 Running and monitoring Dask jobs on a cluster 256

11.3 Cleaning up the Dask cluster on AWS 261

Appendix: Software Installation 263

Index 266
