Supervised Machine Learning for Text Analysis in R

by Emil Hvitfeldt, Julia Silge

Paperback

$66.99 

Overview

Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features for machine learning from language. Supervised Machine Learning for Text Analysis in R explains how to preprocess text data for modeling, train models, and evaluate model performance using tools from the tidyverse and tidymodels ecosystem. Models like these can be used to make predictions for new observations, to understand what natural language features or characteristics contribute to differences in the output, and more. If you are already familiar with the basics of predictive modeling, use the comprehensive, detailed examples in this book to extend your skills to the domain of natural language processing.

This book provides practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate unstructured text data into their modeling pipelines. Learn how to use text data for both regression and classification tasks, and how to apply more straightforward algorithms like regularized regression or support vector machines as well as deep learning approaches. Natural language must be dramatically transformed to be ready for computation, so we explore typical text preprocessing and feature engineering steps like tokenization and word embeddings from the ground up. These steps influence model results in ways we can measure, both in terms of model metrics and other tangible consequences such as how fair or appropriate model results are.
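
To give a flavor of the workflow the book teaches, here is a minimal sketch (not taken from the book's text) of a tidymodels pipeline that tokenizes text, builds tf-idf features with textrecipes, and fits a lasso-regularized regression; the complaints data frame and its text and outcome columns are hypothetical placeholders.

library(tidymodels)
library(textrecipes)

# `complaints`, `text`, and `outcome` are hypothetical stand-ins for your own data
text_rec <- recipe(outcome ~ text, data = complaints) %>%
  step_tokenize(text) %>%                        # split raw text into word tokens
  step_tokenfilter(text, max_tokens = 1000) %>%  # keep only the most frequent tokens
  step_tfidf(text)                               # weight token counts by tf-idf

lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%  # lasso-regularized regression
  set_engine("glmnet")

text_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(lasso_spec)

text_fit <- fit(text_wf, data = complaints)      # fit preprocessing and model together

The same recipe-plus-model workflow extends naturally to the classification and deep learning models covered later in the book.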


Product Details

ISBN-13: 9780367554194
Publisher: CRC Press
Publication date: 10/22/2021
Series: Chapman & Hall/CRC Data Science Series
Pages: 402
Product dimensions: 6.12(w) x 9.19(h)

About the Author

Emil Hvitfeldt is a clinical data analyst working in healthcare and an adjunct professor at American University, where he teaches statistical machine learning with tidymodels. He is also an open source R developer and the author of the textrecipes package.

Julia Silge is a data scientist and software engineer at RStudio PBC, where she works on open source modeling tools. She is an author, international keynote speaker, and educator, as well as a real-world practitioner focused on data analysis and machine learning.

Table of Contents

Preface xiii

I Natural Language Features 1

1 Language and modeling 3

1.1 Linguistics for text analysis 3

1.2 A glimpse into one area: morphology 5

1.3 Different languages 6

1.4 Other ways text can vary 7

1.5 Summary 8

1.5.1 In this chapter, you learned: 8

2 Tokenization 9

2.1 What is a token? 9

2.2 Types of tokens 13

2.2.1 Character tokens 16

2.2.2 Word tokens 18

2.2.3 Tokenizing by n-grams 19

2.2.4 Lines, sentence, and paragraph tokens 22

2.3 Where does tokenization break down? 25

2.4 Building your own tokenizer 26

2.4.1 Tokenize to characters, only keeping letters 27

2.4.2 Allow for hyphenated words 29

2.4.3 Wrapping it in a function 32

2.5 Tokenization for non-Latin alphabets 33

2.6 Tokenization benchmark 34

2.7 Summary 35

2.7.1 In this chapter, you learned: 35

3 Stop words 37

3.1 Using premade stop word lists 38

3.1.1 Stop word removal in R 41

3.2 Creating your own stop words list 43

3.3 All stop word lists are context-specific 48

3.4 What happens when you remove stop words 49

3.5 Stop words in languages other than English 50

3.6 Summary 52

3.6.1 In this chapter, you learned: 52

4 Stemming 53

4.1 How to stem text in R 54

4.2 Should you use stemming at all? 58

4.3 Understand a stemming algorithm 61

4.4 Handling punctuation when stemming 63

4.5 Compare some stemming options 65

4.6 Lemmatization and stemming 68

4.7 Stemming and stop words 70

4.8 Summary 71

4.8.1 In this chapter, you learned: 72

5 Word Embeddings 73

5.1 Motivating embeddings for sparse, high-dimensional data 73

5.2 Understand word embeddings by finding them yourself 77

5.3 Exploring CFPB word embeddings 81

5.4 Use pre-trained word embeddings 88

5.5 Fairness and word embeddings 93

5.6 Using word embeddings in the real world 95

5.7 Summary 96

5.7.1 In this chapter, you learned: 97

II Machine Learning Methods 99

Overview 101

6 Regression 105

6.1 A first regression model 106

6.1.1 Building our first regression model 107

6.1.2 Evaluation 112

6.2 Compare to the null model 117

6.3 Compare to a random forest model 119

6.4 Case study: removing stop words 122

6.5 Case study: varying n-grams 126

6.6 Case study: lemmatization 129

6.7 Case study: feature hashing 133

6.7.1 Text normalization 137

6.8 What evaluation metrics are appropriate? 139

6.9 The full game: regression 142

6.9.1 Preprocess the data 142

6.9.2 Specify the model 143

6.9.3 Tune the model 144

6.9.4 Evaluate the modeling 146

6.10 Summary 153

6.10.1 In this chapter, you learned: 153

7 Classification 155

7.1 A first classification model 156

7.1.1 Building our first classification model 158

7.1.2 Evaluation 161

7.2 Compare to the null model 166

7.3 Compare to a lasso classification model 167

7.4 Tuning lasso hyperparameters 170

7.5 Case study: sparse encoding 179

7.6 Two-class or multiclass? 183

7.7 Case study: including non-text data 191

7.8 Case study: data censoring 195

7.9 Case study: custom features 201

7.9.1 Detect credit cards 202

7.9.2 Calculate percentage censoring 204

7.9.3 Detect monetary amounts 205

7.10 What evaluation metrics are appropriate? 206

7.11 The full game: classification 208

7.11.1 Feature selection 209

7.11.2 Specify the model 210

7.11.3 Evaluate the modeling 212

7.12 Summary 220

7.12.1 In this chapter, you learned: 221

III Deep Learning Methods 223

Overview 225

8 Dense neural networks 231

8.1 Kickstarter data 232

8.2 A first deep learning model 237

8.2.1 Preprocessing for deep learning 237

8.2.2 One-hot sequence embedding of text 240

8.2.3 Simple flattened dense network 244

8.2.4 Evaluation 248

8.3 Using bag-of-words features 253

8.4 Using pre-trained word embeddings 257

8.5 Cross-validation for deep learning models 263

8.6 Compare and evaluate DNN models 267

8.7 Limitations of deep learning 271

8.8 Summary 272

8.8.1 In this chapter, you learned: 272

9 Long short-term memory (LSTM) networks 273

9.1 A first LSTM model 273

9.1.1 Building an LSTM 275

9.1.2 Evaluation 279

9.2 Compare to a recurrent neural network 283

9.3 Case study: bidirectional LSTM 286

9.4 Case study: stacking LSTM layers 288

9.5 Case study: padding 289

9.6 Case study: training a regression model 292

9.7 Case study: vocabulary size 295

9.8 The full game: LSTM 297

9.8.1 Preprocess the data 297

9.8.2 Specify the model 298

9.9 Summary 301

9.9.1 In this chapter, you learned: 302

10 Convolutional neural networks 303

10.1 What are CNNs? 303

10.1.1 Kernel 304

10.1.2 Kernel size 304

10.2 A first CNN model 305

10.3 Case study: adding more layers 309

10.4 Case study: byte pair encoding 317

10.5 Case study: explainability with LIME 324

10.6 Case study: hyperparameter search 330

10.7 Cross-validation for evaluation 334

10.8 The full game: CNN 337

10.8.1 Preprocess the data 337

10.8.2 Specify the model 338

10.9 Summary 341

10.9.1 In this chapter, you learned: 342

IV Conclusion 343

Text models in the real world 345

Appendix 347

A Regular expressions 347

A.1 Literal characters 347

A.1.1 Meta characters 349

A.2 Full stop, the wildcard 349

A.3 Character classes 350

A.3.1 Shorthand character classes 352

A.4 Quantifiers 353

A.5 Anchors 355

A.6 Additional resources 355

B Data 357

B.1 Hans Christian Andersen fairy tales 357

B.2 Opinions of the Supreme Court of the United States 358

B.3 Consumer Financial Protection Bureau (CFPB) complaints 359

B.4 Kickstarter campaign blurbs 359

C Baseline linear classifier 361

C.1 Read in the data 361

C.2 Split into test/train and create resampling folds 362

C.3 Recipe for data preprocessing 363

C.4 Lasso regularized classification model 363

C.5 A model workflow 364

C.6 Tune the workflow 366

References 369

Index 379
