Table of Contents
Foreword vii
Preface ix
1 Analyzing Big Data 1
The Challenges of Data Science 3
Introducing Apache Spark 4
About This Book 6
The Second Edition 7
2 Introduction to Data Analysis with Scala and Spark 9
Scala for Data Scientists 10
The Spark Programming Model 11
Record Linkage 12
Getting Started: The Spark Shell and SparkContext 13
Bringing Data from the Cluster to the Client 19
Shipping Code from the Client to the Cluster 22
From RDDs to Data Frames 23
Analyzing Data with the DataFrame API 26
Fast Summary Statistics for DataFrames 32
Pivoting and Reshaping DataFrames 33
Joining DataFrames and Selecting Features 37
Preparing Models for Production Environments 38
Model Evaluation 40
Where to Go from Here 41
3 Recommending Music and the Audioscrobbler Data Set 43
Data Set 44
The Alternating Least Squares Recommender Algorithm 45
Preparing the Data 48
Building a First Model 51
Spot Checking Recommendations 54
Evaluating Recommendation Quality 57
Computing AUC 58
Hyperparameter Selection 60
Making Recommendations 62
Where to Go from Here 64
4 Predicting Forest Cover with Decision Trees 67
Fast Forward to Regression 67
Vectors and Features 68
Training Examples 69
Decision Trees and Forests 70
Covtype Data Set 73
Preparing the Data 73
A First Decision Tree 76
Decision Tree Hyperparameters 82
Tuning Decision Trees 84
Categorical Features Revisited 88
Random Decision Forests 91
Making Predictions 93
Where to Go from Here 94
5 Anomaly Detection in Network Traffic with K-means Clustering 97
Anomaly Detection 98
K-means Clustering 98
Network Intrusion 99
KDD Cup 1999 Data Set 100
A First Take on Clustering 101
Choosing k 103
Visualization with SparkR 106
Feature Normalization 110
Categorical Variables 112
Using Labels with Entropy 114
Clustering in Action 115
Where to Go from Here 117
6 Understanding Wikipedia with Latent Semantic Analysis 119
The Document-Term Matrix 120
Getting the Data 122
Parsing and Preparing the Data 122
Lemmatization 124
Computing the TF-IDFs 125
Singular Value Decomposition 127
Finding Important Concepts 129
Querying and Scoring with a Low-Dimensional Representation 133
Term-Term Relevance 134
Document-Document Relevance 136
Document-Term Relevance 137
Multiple-Term Queries 138
Where to Go from Here 140
7 Analyzing Co-Occurrence Networks with GraphX 141
The MEDLINE Citation Index: A Network Analysis 143
Getting the Data 144
Parsing XML Documents with Scala's XML Library 146
Analyzing the MeSH Major Topics and Their Co-Occurrences 147
Constructing a Co-Occurrence Network with GraphX 150
Understanding the Structure of Networks 154
Connected Components 154
Degree Distribution 157
Filtering Out Noisy Edges 159
Processing EdgeTriplets 160
Analyzing the Filtered Graph 162
Small-World Networks 163
Cliques and Clustering Coefficients 164
Computing Average Path Length with Pregel 165
Where to Go from Here 170
8 Geospatial and Temporal Data Analysis on New York City Taxi Trip Data 173
Getting the Data 174
Working with Third-Party Libraries in Spark 175
Geospatial Data with the Esri Geometry API and Spray 176
Exploring the Esri Geometry API 176
Intro to GeoJSON 178
Preparing the New York City Taxi Trip Data 180
Handling Invalid Records at Scale 182
Geospatial Analysis 186
Sessionization in Spark 189
Building Sessions: Secondary Sorts in Spark 190
Where to Go from Here 193
9 Estimating Financial Risk Through Monte Carlo Simulation 195
Terminology 196
Methods for Calculating VaR 197
Variance-Covariance 197
Historical Simulation 197
Monte Carlo Simulation 197
Our Model 198
Getting the Data 199
Preprocessing 199
Determining the Factor Weights 202
Sampling 205
The Multivariate Normal Distribution 208
Running the Trials 209
Visualizing the Distribution of Returns 212
Evaluating Our Results 213
Where to Go from Here 215
10 Analyzing Genomics Data and the BDG Project 217
Decoupling Storage from Modeling 218
Ingesting Genomics Data with the ADAM CLI 221
Parquet Format and Columnar Storage 227
Predicting Transcription Factor Binding Sites from ENCODE Data 229
Querying Genotypes from the 1000 Genomes Project 236
Where to Go from Here 239
11 Analyzing Neuroimaging Data with PySpark and Thunder 241
Overview of PySpark 242
PySpark Internals 243
Overview and Installation of the Thunder Library 245
Loading Data with Thunder 245
Thunder Core Data Types 252
Categorizing Neuron Types with Thunder 253
Where to Go from Here 258
Index 259