Table of Contents
Preface xiii
I Natural Language Features 1
1 Language and modeling 3
1.1 Linguistics for text analysis 3
1.2 A glimpse into one area: morphology 5
1.3 Different languages 6
1.4 Other ways text can vary 7
1.5 Summary 8
1.5.1 In this chapter, you learned: 8
2 Tokenization 9
2.1 What is a token? 9
2.2 Types of tokens 13
2.2.1 Character tokens 16
2.2.2 Word tokens 18
2.2.3 Tokenizing by n-grams 19
2.2.4 Line, sentence, and paragraph tokens 22
2.3 Where does tokenization break down? 25
2.4 Building your own tokenizer 26
2.4.1 Tokenize to characters, only keeping letters 27
2.4.2 Allow for hyphenated words 29
2.4.3 Wrapping it in a function 32
2.5 Tokenization for non-Latin alphabets 33
2.6 Tokenization benchmark 34
2.7 Summary 35
2.7.1 In this chapter, you learned: 35
3 Stop words 37
3.1 Using premade stop word lists 38
3.1.1 Stop word removal in R 41
3.2 Creating your own stop word list 43
3.3 All stop word lists are context-specific 48
3.4 What happens when you remove stop words 49
3.5 Stop words in languages other than English 50
3.6 Summary 52
3.6.1 In this chapter, you learned: 52
4 Stemming 53
4.1 How to stem text in R 54
4.2 Should you use stemming at all? 58
4.3 Understand a stemming algorithm 61
4.4 Handling punctuation when stemming 63
4.5 Compare some stemming options 65
4.6 Lemmatization and stemming 68
4.7 Stemming and stop words 70
4.8 Summary 71
4.8.1 In this chapter, you learned: 72
5 Word embeddings 73
5.1 Motivating embeddings for sparse, high-dimensional data 73
5.2 Understand word embeddings by finding them yourself 77
5.3 Exploring CFPB word embeddings 81
5.4 Use pre-trained word embeddings 88
5.5 Fairness and word embeddings 93
5.6 Using word embeddings in the real world 95
5.7 Summary 96
5.7.1 In this chapter, you learned: 97
II Machine Learning Methods 99
Overview 101
6 Regression 105
6.1 A first regression model 106
6.1.1 Building our first regression model 107
6.1.2 Evaluation 112
6.2 Compare to the null model 117
6.3 Compare to a random forest model 119
6.4 Case study: removing stop words 122
6.5 Case study: varying n-grams 126
6.6 Case study: lemmatization 129
6.7 Case study: feature hashing 133
6.7.1 Text normalization 137
6.8 What evaluation metrics are appropriate? 139
6.9 The full game: regression 142
6.9.1 Preprocess the data 142
6.9.2 Specify the model 143
6.9.3 Tune the model 144
6.9.4 Evaluate the model 146
6.10 Summary 153
6.10.1 In this chapter, you learned: 153
7 Classification 155
7.1 A first classification model 156
7.1.1 Building our first classification model 158
7.1.2 Evaluation 161
7.2 Compare to the null model 166
7.3 Compare to a lasso classification model 167
7.4 Tuning lasso hyperparameters 170
7.5 Case study: sparse encoding 179
7.6 Two-class or multiclass? 183
7.7 Case study: including non-text data 191
7.8 Case study: data censoring 195
7.9 Case study: custom features 201
7.9.1 Detect credit cards 202
7.9.2 Calculate percentage censoring 204
7.9.3 Detect monetary amounts 205
7.10 What evaluation metrics are appropriate? 206
7.11 The full game: classification 208
7.11.1 Feature selection 209
7.11.2 Specify the model 210
7.11.3 Evaluate the model 212
7.12 Summary 220
7.12.1 In this chapter, you learned: 221
III Deep Learning Methods 223
Overview 225
8 Dense neural networks 231
8.1 Kickstarter data 232
8.2 A first deep learning model 237
8.2.1 Preprocessing for deep learning 237
8.2.2 One-hot sequence embedding of text 240
8.2.3 Simple flattened dense network 244
8.2.4 Evaluation 248
8.3 Using bag-of-words features 253
8.4 Using pre-trained word embeddings 257
8.5 Cross-validation for deep learning models 263
8.6 Compare and evaluate DNN models 267
8.7 Limitations of deep learning 271
8.8 Summary 272
8.8.1 In this chapter, you learned: 272
9 Long short-term memory (LSTM) networks 273
9.1 A first LSTM model 273
9.1.1 Building an LSTM 275
9.1.2 Evaluation 279
9.2 Compare to a recurrent neural network 283
9.3 Case study: bidirectional LSTM 286
9.4 Case study: stacking LSTM layers 288
9.5 Case study: padding 289
9.6 Case study: training a regression model 292
9.7 Case study: vocabulary size 295
9.8 The full game: LSTM 297
9.8.1 Preprocess the data 297
9.8.2 Specify the model 298
9.9 Summary 301
9.9.1 In this chapter, you learned: 302
10 Convolutional neural networks 303
10.1 What are CNNs? 303
10.1.1 Kernel 304
10.1.2 Kernel size 304
10.2 A first CNN model 305
10.3 Case study: adding more layers 309
10.4 Case study: byte pair encoding 317
10.5 Case study: explainability with LIME 324
10.6 Case study: hyperparameter search 330
10.7 Cross-validation for evaluation 334
10.8 The full game: CNN 337
10.8.1 Preprocess the data 337
10.8.2 Specify the model 338
10.9 Summary 341
10.9.1 In this chapter, you learned: 342
IV Conclusion 343
Text models in the real world 345
Appendices 347
A Regular expressions 347
A.1 Literal characters 347
A.1.1 Metacharacters 349
A.2 Full stop, the wildcard 349
A.3 Character classes 350
A.3.1 Shorthand character classes 352
A.4 Quantifiers 353
A.5 Anchors 355
A.6 Additional resources 355
B Data 357
B.1 Hans Christian Andersen fairy tales 357
B.2 Opinions of the Supreme Court of the United States 358
B.3 Consumer Financial Protection Bureau (CFPB) complaints 359
B.4 Kickstarter campaign blurbs 359
C Baseline linear classifier 361
C.1 Read in the data 361
C.2 Split into test/train and create resampling folds 362
C.3 Recipe for data preprocessing 363
C.4 Lasso regularized classification model 363
C.5 A model workflow 364
C.6 Tune the workflow 366
References 369
Index 379