Practical Statistics for Data Scientists: 50 Essential Concepts

by Peter Bruce, Andrew Bruce




Statistical methods are a key part of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

  • Why exploratory data analysis is a key preliminary step in data science
  • How random sampling can reduce bias and yield a higher-quality dataset, even with big data
  • How the principles of experimental design yield definitive answers to questions
  • How to use regression to estimate outcomes and detect anomalies
  • Key classification techniques for predicting which categories a record belongs to
  • Statistical machine learning methods that "learn" from data
  • Unsupervised learning methods for extracting meaning from unlabeled data

Product Details

ISBN-13: 9781491952962
Publisher: O'Reilly Media, Incorporated
Publication date: 06/06/2017
Pages: 318
Product dimensions: 6.90(w) x 9.10(h) x 0.80(d)

About the Author

Peter Bruce founded and grew the Institute for Statistics Education, which now offers about 100 courses in statistics, roughly a third of which are aimed at the data scientist. In recruiting top authors as instructors and forging a marketing strategy to reach professional data scientists, Peter has developed both a broad view of the target market and his own expertise to reach it.

Andrew Bruce has over 30 years of experience in statistics and data science in academia, government, and business. He has a Ph.D. in statistics from the University of Washington and has published numerous papers in refereed journals. He has developed statistics-based solutions to a wide range of problems faced by a variety of industries, from established financial firms to internet startups, and offers a deep understanding of the practice of data science.

Table of Contents

Preface xiii

1 Exploratory Data Analysis 1

Elements of Structured Data 2

Further Reading 4

Rectangular Data 5

Data Frames and Indexes 6

Nonrectangular Data Structures 7

Further Reading 8

Estimates of Location 8

Mean 9

Median and Robust Estimates 10

Example: Location Estimates of Population and Murder Rates 12

Further Reading 13

Estimates of Variability 13

Standard Deviation and Related Estimates 15

Estimates Based on Percentiles 17

Example: Variability Estimates of State Population 18

Further Reading 19

Exploring the Data Distribution 19

Percentiles and Boxplots 20

Frequency Table and Histograms 21

Density Estimates 24

Further Reading 26

Exploring Binary and Categorical Data 26

Mode 28

Expected Value 28

Further Reading 29

Correlation 29

Scatterplots 32

Further Reading 34

Exploring Two or More Variables 34

Hexagonal Binning and Contours (Plotting Numeric versus Numeric Data) 34

Two Categorical Variables 37

Categorical and Numeric Data 38

Visualizing Multiple Variables 40

Further Reading 42

Summary 42

2 Data and Sampling Distributions 43

Random Sampling and Sample Bias 44

Bias 46

Random Selection 47

Size versus Quality: When Does Size Matter? 48

Sample Mean versus Population Mean 49

Further Reading 49

Selection Bias 50

Regression to the Mean 51

Further Reading 53

Sampling Distribution of a Statistic 53

Central Limit Theorem 55

Standard Error 56

Further Reading 57

The Bootstrap 57

Resampling versus Bootstrapping 60

Further Reading 60

Confidence Intervals 61

Further Reading 63

Normal Distribution 64

Standard Normal and QQ-Plots 65

Long-Tailed Distributions 67

Further Reading 69

Student's t-Distribution 69

Further Reading 72

Binomial Distribution 72

Further Reading 74

Poisson and Related Distributions 74

Poisson Distributions 75

Exponential Distribution 75

Estimating the Failure Rate 76

Weibull Distribution 76

Further Reading 77

Summary 77

3 Statistical Experiments and Significance Testing 79

A/B Testing 80

Why Have a Control Group? 82

Why Just A/B? Why Not C, D…? 83

For Further Reading 84

Hypothesis Tests 85

The Null Hypothesis 86

Alternative Hypothesis 86

One-Way, Two-Way Hypothesis Test 87

Further Reading 88

Resampling 88

Permutation Test 88

Example: Web Stickiness 89

Exhaustive and Bootstrap Permutation Test 92

Permutation Tests: The Bottom Line for Data Science 93

For Further Reading 93

Statistical Significance and P-Values 93

P-Value 96

Alpha 96

Type 1 and Type 2 Errors 98

Data Science and P-Values 98

Further Reading 99

t-Tests 99

Further Reading 101

Multiple Testing 101

Further Reading 104

Degrees of Freedom 104

Further Reading 106


F-Statistic 109

Two-Way ANOVA 110

Further Reading 111

Chi-Square Test 111

Chi-Square Test: A Resampling Approach 112

Chi-Squared Test: Statistical Theory 114

Fisher's Exact Test 115

Relevance for Data Science 117

Further Reading 118

Multi-Arm Bandit Algorithm 119

Further Reading 122

Power and Sample Size 122

Sample Size 123

Further Reading 125

Summary 125

4 Regression and Prediction 127

Simple Linear Regression 127

The Regression Equation 129

Fitted Values and Residuals 131

Least Squares 132

Prediction versus Explanation (Profiling) 133

Further Reading 134

Multiple Linear Regression 134

Example: King County Housing Data 135

Assessing the Model 136

Cross-Validation 138

Model Selection and Stepwise Regression 139

Weighted Regression 141

Prediction Using Regression 142

The Dangers of Extrapolation 143

Confidence and Prediction Intervals 143

Factor Variables in Regression 145

Dummy Variables Representation 145

Factor Variables with Many Levels 147

Ordered Factor Variables 149

Interpreting the Regression Equation 150

Correlated Predictors 150

Multicollinearity 151

Confounding Variables 152

Interactions and Main Effects 153

Testing the Assumptions: Regression Diagnostics 155

Outliers 156

Influential Values 158

Heteroskedasticity, Non-Normality and Correlated Errors 161

Partial Residual Plots and Nonlinearity 164

Polynomial and Spline Regression 166

Polynomial 167

Splines 168

Generalized Additive Models 170

Further Reading 172

Summary 172

5 Classification 173

Naive Bayes 174

Why Exact Bayesian Classification Is Impractical 175

The Naive Solution 176

Numeric Predictor Variables 178

Further Reading 178

Discriminant Analysis 179

Covariance Matrix 180

Fisher's Linear Discriminant 180

A Simple Example 181

Further Reading 183

Logistic Regression 184

Logistic Response Function and Logit 184

Logistic Regression and the GLM 186

Generalized Linear Models 187

Predicted Values from Logistic Regression 188

Interpreting the Coefficients and Odds Ratios 188

Linear and Logistic Regression: Similarities and Differences 190

Assessing the Model 191

Further Reading 194

Evaluating Classification Models 194

Confusion Matrix 195

The Rare Class Problem 196

Precision, Recall, and Specificity 197

ROC Curve 198

AUC 200

Lift 201

Further Reading 202

Strategies for Imbalanced Data 203

Undersampling 204

Oversampling and Up/Down Weighting 204

Data Generation 205

Cost-Based Classification 206

Exploring the Predictions 206

Further Reading 208

Summary 208

6 Statistical Machine Learning 209

K-Nearest Neighbors 210

A Small Example: Predicting Loan Default 211

Distance Metrics 213

One Hot Encoder 214

Standardization (Normalization, Z-Scores) 215

Choosing K 217

KNN as a Feature Engine 218

Tree Models 219

A Simple Example 221

The Recursive Partitioning Algorithm 222

Measuring Homogeneity or Impurity 224

Stopping the Tree from Growing 225

Predicting a Continuous Value 227

How Trees Are Used 227

Further Reading 228

Bagging and the Random Forest 228

Bagging 230

Random Forest 230

Variable Importance 233

Hyperparameters 236

Boosting 237

The Boosting Algorithm 238

XGBoost 239

Regularization: Avoiding Overfitting 241

Hyperparameters and Cross-Validation 245

Summary 247

7 Unsupervised Learning 249

Principal Components Analysis 250

A Simple Example 251

Computing the Principal Components 254

Interpreting Principal Components 254

Further Reading 257

K-Means Clustering 257

A Simple Example 258

K-Means Algorithm 260

Interpreting the Clusters 261

Selecting the Number of Clusters 263

Hierarchical Clustering 265

A Simple Example 266

The Dendrogram 266

The Agglomerative Algorithm 268

Measures of Dissimilarity 268

Model-Based Clustering 270

Multivariate Normal Distribution 270

Mixtures of Normals 272

Selecting the Number of Clusters 274

Further Reading 276

Scaling and Categorical Variables 276

Scaling the Variables 277

Dominant Variables 278

Categorical Data and Gower's Distance 280

Problems with Clustering Mixed Data 283

Summary 284

Bibliography 285

Index 287
