Practical Statistics for Data Scientists: 50 Essential Concepts
Paperback
Overview
Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.
With this book, you’ll learn:
- Why exploratory data analysis is a key preliminary step in data science
- How random sampling can reduce bias and yield a higher quality dataset, even with big data
- How the principles of experimental design yield definitive answers to questions
- How to use regression to estimate outcomes and detect anomalies
- Key classification techniques for predicting which categories a record belongs to
- Statistical machine learning methods that "learn" from data
- Unsupervised learning methods for extracting meaning from unlabeled data
Product Details
ISBN-13: | 9781491952962 |
---|---|
Publisher: | O'Reilly Media, Incorporated |
Publication date: | 06/06/2017 |
Pages: | 315 |
Product dimensions: | 6.90(w) x 9.10(h) x 0.80(d) |
About the Author
Andrew Bruce has over 30 years of experience in statistics and data science in academia, government, and business. He has a Ph.D. in statistics from the University of Washington and has published numerous papers in refereed journals. He has developed statistics-based solutions to a wide range of problems faced by a variety of industries, from established financial firms to internet startups, and offers a deep understanding of the practice of data science.
Table of Contents
Preface
1 Exploratory Data Analysis
Elements of Structured Data
Further Reading
Rectangular Data
Data Frames and Indexes
Nonrectangular Data Structures
Further Reading
Estimates of Location
Mean
Median and Robust Estimates
Example: Location Estimates of Population and Murder Rates
Further Reading
Estimates of Variability
Standard Deviation and Related Estimates
Estimates Based on Percentiles
Example: Variability Estimates of State Population
Further Reading
Exploring the Data Distribution
Percentiles and Boxplots
Frequency Table and Histograms
Density Estimates
Further Reading
Exploring Binary and Categorical Data
Mode
Expected Value
Further Reading
Correlation
Scatterplots
Further Reading
Exploring Two or More Variables
Hexagonal Binning and Contours (Plotting Numeric versus Numeric Data)
Two Categorical Variables
Categorical and Numeric Data
Visualizing Multiple Variables
Further Reading
Summary
2 Data and Sampling Distributions
Random Sampling and Sample Bias
Bias
Random Selection
Size versus Quality: When Does Size Matter?
Sample Mean versus Population Mean
Further Reading
Selection Bias
Regression to the Mean
Further Reading
Sampling Distribution of a Statistic
Central Limit Theorem
Standard Error
Further Reading
The Bootstrap
Resampling versus Bootstrapping
Further Reading
Confidence Intervals
Further Reading
Normal Distribution
Standard Normal and QQ-Plots
Long-Tailed Distributions
Further Reading
Student's t-Distribution
Further Reading
Binomial Distribution
Further Reading
Poisson and Related Distributions
Poisson Distributions
Exponential Distribution
Estimating the Failure Rate
Weibull Distribution
Further Reading
Summary
3 Statistical Experiments and Significance Testing
A/B Testing
Why Have a Control Group?
Why Just A/B? Why Not C, D…?
For Further Reading
Hypothesis Tests
The Null Hypothesis
Alternative Hypothesis
One-Way, Two-Way Hypothesis Test
Further Reading
Resampling
Permutation Test
Example: Web Stickiness
Exhaustive and Bootstrap Permutation Test
Permutation Tests: The Bottom Line for Data Science
For Further Reading
Statistical Significance and P-Values
P-Value
Alpha
Type 1 and Type 2 Errors
Data Science and P-Values
Further Reading
t-Tests
Further Reading
Multiple Testing
Further Reading
Degrees of Freedom
Further Reading
ANOVA
F-Statistic
Two-Way ANOVA
Further Reading
Chi-Square Test
Chi-Square Test: A Resampling Approach
Chi-Squared Test: Statistical Theory
Fisher's Exact Test
Relevance for Data Science
Further Reading
Multi-Arm Bandit Algorithm
Further Reading
Power and Sample Size
Sample Size
Further Reading
Summary
4 Regression and Prediction
Simple Linear Regression
The Regression Equation
Fitted Values and Residuals
Least Squares
Prediction versus Explanation (Profiling)
Further Reading
Multiple Linear Regression
Example: King County Housing Data
Assessing the Model
Cross-Validation
Model Selection and Stepwise Regression
Weighted Regression
Prediction Using Regression
The Dangers of Extrapolation
Confidence and Prediction Intervals
Factor Variables in Regression
Dummy Variables Representation
Factor Variables with Many Levels
Ordered Factor Variables
Interpreting the Regression Equation
Correlated Predictors
Multicollinearity
Confounding Variables
Interactions and Main Effects
Testing the Assumptions: Regression Diagnostics
Outliers
Influential Values
Heteroskedasticity, Non-Normality and Correlated Errors
Partial Residual Plots and Nonlinearity
Polynomial and Spline Regression
Polynomial
Splines
Generalized Additive Models
Further Reading
Summary
5 Classification
Naive Bayes
Why Exact Bayesian Classification Is Impractical
The Naive Solution
Numeric Predictor Variables
Further Reading
Discriminant Analysis
Covariance Matrix
Fisher's Linear Discriminant
A Simple Example
Further Reading
Logistic Regression
Logistic Response Function and Logit
Logistic Regression and the GLM
Generalized Linear Models
Predicted Values from Logistic Regression
Interpreting the Coefficients and Odds Ratios
Linear and Logistic Regression: Similarities and Differences
Assessing the Model
Further Reading
Evaluating Classification Models
Confusion Matrix
The Rare Class Problem
Precision, Recall, and Specificity
ROC Curve
AUC
Lift
Further Reading
Strategies for Imbalanced Data
Undersampling
Oversampling and Up/Down Weighting
Data Generation
Cost-Based Classification
Exploring the Predictions
Further Reading
Summary
6 Statistical Machine Learning
K-Nearest Neighbors
A Small Example: Predicting Loan Default
Distance Metrics
One Hot Encoder
Standardization (Normalization, Z-Scores)
Choosing K
KNN as a Feature Engine
Tree Models
A Simple Example
The Recursive Partitioning Algorithm
Measuring Homogeneity or Impurity
Stopping the Tree from Growing
Predicting a Continuous Value
How Trees Are Used
Further Reading
Bagging and the Random Forest
Bagging
Random Forest
Variable Importance
Hyperparameters
Boosting
The Boosting Algorithm
XGBoost
Regularization: Avoiding Overfitting
Hyperparameters and Cross-Validation
Summary
7 Unsupervised Learning
Principal Components Analysis
A Simple Example
Computing the Principal Components
Interpreting Principal Components
Further Reading
K-Means Clustering
A Simple Example
K-Means Algorithm
Interpreting the Clusters
Selecting the Number of Clusters
Hierarchical Clustering
A Simple Example
The Dendrogram
The Agglomerative Algorithm
Measures of Dissimilarity
Model-Based Clustering
Multivariate Normal Distribution
Mixtures of Normals
Selecting the Number of Clusters
Further Reading
Scaling and Categorical Variables
Scaling the Variables
Dominant Variables
Categorical Data and Gower's Distance
Problems with Clustering Mixed Data
Summary
Bibliography
Index