Data Science from Scratch: First Principles with Python

by Joel Grus

Paperback

$35.99 (original price $39.99; save 10%)

Overview

Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch.

If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with the hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.

  • Get a crash course in Python
  • Learn the basics of linear algebra, statistics, and probability—and understand how and when they're used in data science
  • Collect, explore, clean, munge, and manipulate data
  • Dive into the fundamentals of machine learning
  • Implement models such as k-nearest neighbors, Naive Bayes, linear and logistic regression, decision trees, neural networks, and clustering
  • Explore recommender systems, natural language processing, network analysis, MapReduce, and databases

Product Details

ISBN-13: 9781491901427
Publisher: O'Reilly Media, Incorporated
Publication date: 05/08/2015
Pages: 330
Sales rank: 90,149
Product dimensions: 6.90(w) x 9.00(h) x 0.80(d)

About the Author

Joel Grus is a software engineer at Google. Before that he worked as a data scientist at multiple startups. He lives in Seattle, where he regularly attends data science happy hours. He blogs infrequently at joelgrus.com.

Table of Contents

Preface xi

1 Introduction 1

The Ascendance of Data 1

What Is Data Science? 1

Motivating Hypothetical: DataSciencester 2

Finding Key Connectors 3

Data Scientists You May Know 6

Salaries and Experience 8

Paid Accounts 11

Topics of Interest 11

Onward 13

2 A Crash Course in Python 15

The Basics 15

Getting Python 15

The Zen of Python 16

Whitespace Formatting 16

Modules 17

Arithmetic 18

Functions 18

Strings 19

Exceptions 19

Lists 20

Tuples 21

Dictionaries 21

Sets 24

Control Flow 25

Truthiness 25

The Not-So-Basics 26

Sorting 27

List Comprehensions 27

Generators and Iterators 28

Randomness 29

Regular Expressions 30

Object-Oriented Programming 30

Functional Tools 31

Enumerate 32

Zip and Argument Unpacking 33

Args and kwargs 34

Welcome to DataSciencester! 35

For Further Exploration 35

3 Visualizing Data 37

Matplotlib 37

Bar Charts 39

Line Charts 43

Scatterplots 44

For Further Exploration 47

4 Linear Algebra 49

Vectors 49

Matrices 53

For Further Exploration 55

5 Statistics 57

Describing a Single Set of Data 57

Central Tendencies 59

Dispersion 61

Correlation 62

Simpson's Paradox 65

Some Other Correlational Caveats 66

Correlation and Causation 67

For Further Exploration 68

6 Probability 69

Dependence and Independence 69

Conditional Probability 70

Bayes's Theorem 72

Random Variables 73

Continuous Distributions 74

The Normal Distribution 75

The Central Limit Theorem 78

For Further Exploration 80

7 Hypothesis and Inference 81

Statistical Hypothesis Testing 81

Example: Flipping a Coin 81

Confidence Intervals 85

P-hacking 86

Example: Running an A/B Test 87

Bayesian Inference 88

For Further Exploration 92

8 Gradient Descent 93

The Idea Behind Gradient Descent 93

Estimating the Gradient 94

Using the Gradient 97

Choosing the Right Step Size 97

Putting It All Together 98

Stochastic Gradient Descent 99

For Further Exploration 100

9 Getting Data 103

Stdin and stdout 103

Reading Files 105

The Basics of Text Files 105

Delimited Files 106

Scraping the Web 108

HTML and the Parsing Thereof 108

Example: O'Reilly Books About Data 110

Using APIs 114

JSON (and XML) 114

Using an Unauthenticated API 115

Finding APIs 116

Example: Using the Twitter APIs 117

Getting Credentials 117

For Further Exploration 120

10 Working with Data 121

Exploring Your Data 121

Exploring One-Dimensional Data 121

Two Dimensions 123

Many Dimensions 125

Cleaning and Munging 127

Manipulating Data 129

Rescaling 132

Dimensionality Reduction 134

For Further Exploration 139

11 Machine Learning 141

Modeling 141

What Is Machine Learning? 142

Overfitting and Underfitting 142

Correctness 145

The Bias-Variance Trade-off 147

Feature Extraction and Selection 148

For Further Exploration 150

12 k-Nearest Neighbors 151

The Model 151

Example: Favorite Languages 153

The Curse of Dimensionality 156

For Further Exploration 163

13 Naive Bayes 165

A Really Dumb Spam Filter 165

A More Sophisticated Spam Filter 166

Implementation 168

Testing Our Model 169

For Further Exploration 172

14 Simple Linear Regression 173

The Model 173

Using Gradient Descent 176

Maximum Likelihood Estimation 177

For Further Exploration 177

15 Multiple Regression 179

The Model 179

Further Assumptions of the Least Squares Model 180

Fitting the Model 181

Interpreting the Model 182

Goodness of Fit 183

Digression: The Bootstrap 183

Standard Errors of Regression Coefficients 184

Regularization 186

For Further Exploration 188

16 Logistic Regression 189

The Problem 189

The Logistic Function 192

Applying the Model 194

Goodness of Fit 195

Support Vector Machines 196

For Further Investigation 200

17 Decision Trees 201

What Is a Decision Tree? 201

Entropy 203

The Entropy of a Partition 205

Creating a Decision Tree 206

Putting It All Together 208

Random Forests 211

For Further Exploration 212

18 Neural Networks 213

Perceptrons 213

Feed-Forward Neural Networks 215

Backpropagation 218

Example: Defeating a CAPTCHA 219

For Further Exploration 224

19 Clustering 225

The Idea 225

The Model 226

Example: Meetups 227

Choosing k 230

Example: Clustering Colors 231

Bottom-up Hierarchical Clustering 233

For Further Exploration 238

20 Natural Language Processing 239

Word Clouds 239

n-gram Models 241

Grammars 244

An Aside: Gibbs Sampling 246

Topic Modeling 247

For Further Exploration 253

21 Network Analysis 255

Betweenness Centrality 255

Eigenvector Centrality 260

Matrix Multiplication 260

Centrality 262

Directed Graphs and PageRank 264

For Further Exploration 266

22 Recommender Systems 267

Manual Curation 268

Recommending What's Popular 268

User-Based Collaborative Filtering 269

Item-Based Collaborative Filtering 272

For Further Exploration 274

23 Databases and SQL 275

Create Table and Insert 275

Update 277

Delete 278

Select 278

Group By 280

Order By 282

Join 283

Subqueries 285

Indexes 285

Query Optimization 286

NoSQL 287

For Further Exploration 287

24 MapReduce 289

Example: Word Count 289

Why MapReduce? 291

MapReduce More Generally 292

Example: Analyzing Status Updates 293

Example: Matrix Multiplication 294

An Aside: Combiners 296

For Further Exploration 296

25 Go Forth and Do Data Science 299

IPython 299

Mathematics 300

Not from Scratch 300

NumPy 301

Pandas 301

Scikit-learn 301

Visualization 301

R 302

Find Data 302

Do Data Science 303

Hacker News 303

Fire Trucks 303

T-shirts 304

And You? 304

Index 305
