Table of Contents
Preface xiii
I Natural Language Features 1
1 Language and modeling 3
1.1 Linguistics for text analysis 3
1.2 A glimpse into one area: morphology 5
1.3 Different languages 6
1.4 Other ways text can vary 7
1.5 Summary 8
1.5.1 In this chapter, you learned: 8
2 Tokenization 9
2.1 What is a token? 9
2.2 Types of tokens 13
2.2.1 Character tokens 16
2.2.2 Word tokens 18
2.2.3 Tokenizing by n-grams 19
2.2.4 Line, sentence, and paragraph tokens 22
2.3 Where does tokenization break down? 25
2.4 Building your own tokenizer 26
2.4.1 Tokenize to characters, only keeping letters 27
2.4.2 Allow for hyphenated words 29
2.4.3 Wrapping it in a function 32
2.5 Tokenization for non-Latin alphabets 33
2.6 Tokenization benchmark 34
2.7 Summary 35
2.7.1 In this chapter, you learned: 35
3 Stop words 37
3.1 Using premade stop word lists 38
3.1.1 Stop word removal in R 41
3.2 Creating your own stop word list 43
3.3 All stop word lists are context-specific 48
3.4 What happens when you remove stop words 49
3.5 Stop words in languages other than English 50
3.6 Summary 52
3.6.1 In this chapter, you learned: 52
4 Stemming 53
4.1 How to stem text in R 54
4.2 Should you use stemming at all? 58
4.3 Understand a stemming algorithm 61
4.4 Handling punctuation when stemming 63
4.5 Compare some stemming options 65
4.6 Lemmatization and stemming 68
4.7 Stemming and stop words 70
4.8 Summary 71
4.8.1 In this chapter, you learned: 72
5 Word embeddings 73
5.1 Motivating embeddings for sparse, high-dimensional data 73
5.2 Understand word embeddings by finding them yourself 77
5.3 Exploring CFPB word embeddings 81
5.4 Use pre-trained word embeddings 88
5.5 Fairness and word embeddings 93
5.6 Using word embeddings in the real world 95
5.7 Summary 96
5.7.1 In this chapter, you learned: 97
II Machine Learning Methods 99
Overview 101
6 Regression 105
6.1 A first regression model 106
6.1.1 Building our first regression model 107
6.1.2 Evaluation 112
6.2 Compare to the null model 117
6.3 Compare to a random forest model 119
6.4 Case study: removing stop words 122
6.5 Case study: varying n-grams 126
6.6 Case study: lemmatization 129
6.7 Case study: feature hashing 133
6.7.1 Text normalization 137
6.8 What evaluation metrics are appropriate? 139
6.9 The full game: regression 142
6.9.1 Preprocess the data 142
6.9.2 Specify the model 143
6.9.3 Tune the model 144
6.9.4 Evaluate the model 146
6.10 Summary 153
6.10.1 In this chapter, you learned: 153
7 Classification 155
7.1 A first classification model 156
7.1.1 Building our first classification model 158
7.1.2 Evaluation 161
7.2 Compare to the null model 166
7.3 Compare to a lasso classification model 167
7.4 Tuning lasso hyperparameters 170
7.5 Case study: sparse encoding 179
7.6 Two-class or multiclass? 183
7.7 Case study: including non-text data 191
7.8 Case study: data censoring 195
7.9 Case study: custom features 201
7.9.1 Detect credit cards 202
7.9.2 Calculate percentage censoring 204
7.9.3 Detect monetary amounts 205
7.10 What evaluation metrics are appropriate? 206
7.11 The full game: classification 208
7.11.1 Feature selection 209
7.11.2 Specify the model 210
7.11.3 Evaluate the model 212
7.12 Summary 220
7.12.1 In this chapter, you learned: 221
III Deep Learning Methods 223
Overview 225
8 Dense neural networks 231
8.1 Kickstarter data 232
8.2 A first deep learning model 237
8.2.1 Preprocessing for deep learning 237
8.2.2 One-hot sequence embedding of text 240
8.2.3 Simple flattened dense network 244
8.2.4 Evaluation 248
8.3 Using bag-of-words features 253
8.4 Using pre-trained word embeddings 257
8.5 Cross-validation for deep learning models 263
8.6 Compare and evaluate DNN models 267
8.7 Limitations of deep learning 271
8.8 Summary 272
8.8.1 In this chapter, you learned: 272
9 Long short-term memory (LSTM) networks 273
9.1 A first LSTM model 273
9.1.1 Building an LSTM 275
9.1.2 Evaluation 279
9.2 Compare to a recurrent neural network 283
9.3 Case study: bidirectional LSTM 286
9.4 Case study: stacking LSTM layers 288
9.5 Case study: padding 289
9.6 Case study: training a regression model 292
9.7 Case study: vocabulary size 295
9.8 The full game: LSTM 297
9.8.1 Preprocess the data 297
9.8.2 Specify the model 298
9.9 Summary 301
9.9.1 In this chapter, you learned: 302
10 Convolutional neural networks 303
10.1 What are CNNs? 303
10.1.1 Kernel 304
10.1.2 Kernel size 304
10.2 A first CNN model 305
10.3 Case study: adding more layers 309
10.4 Case study: byte pair encoding 317
10.5 Case study: explainability with LIME 324
10.6 Case study: hyperparameter search 330
10.7 Cross-validation for evaluation 334
10.8 The full game: CNN 337
10.8.1 Preprocess the data 337
10.8.2 Specify the model 338
10.9 Summary 341
10.9.1 In this chapter, you learned: 342
IV Conclusion 343
Text models in the real world 345
Appendices 347
A Regular expressions 347
A.1 Literal characters 347
A.1.1 Metacharacters 349
A.2 Full stop, the wildcard 349
A.3 Character classes 350
A.3.1 Shorthand character classes 352
A.4 Quantifiers 353
A.5 Anchors 355
A.6 Additional resources 355
B Data 357
B.1 Hans Christian Andersen fairy tales 357
B.2 Opinions of the Supreme Court of the United States 358
B.3 Consumer Financial Protection Bureau (CFPB) complaints 359
B.4 Kickstarter campaign blurbs 359
C Baseline linear classifier 361
C.1 Read in the data 361
C.2 Split into test/train and create resampling folds 362
C.3 Recipe for data preprocessing 363
C.4 Lasso regularized classification model 363
C.5 A model workflow 364
C.6 Tune the workflow 366
References 369
Index 379