Natural Language Processing with Spark NLP: Learning to Understand Text at Scale

If you want to build an enterprise-quality application that uses natural language text but aren’t sure where to begin or what tools to use, this practical guide will help get you started. Alex Thomas, principal data scientist at Wisecube, shows software engineers and data scientists how to build scalable natural language processing (NLP) applications using deep learning and the Apache Spark NLP library.

Through concrete examples, practical and theoretical explanations, and hands-on exercises for using NLP on the Spark processing framework, this book teaches you everything from basic linguistics and writing systems to sentiment analysis and search engines. You’ll also explore special concerns for developing text-based applications, such as performance.

In four sections, you’ll learn NLP basics and building blocks before diving into application and system building:

  • Basics: Understand the fundamentals of natural language processing, NLP on Apache Spark, and deep learning
  • Building blocks: Learn techniques for building NLP applications—including tokenization, sentence segmentation, and named-entity recognition—and discover how and why they work
  • Applications: Explore the design, development, and experimentation process for building your own NLP applications
  • Building NLP systems: Consider options for productionizing and deploying NLP models, including which human languages to support

by Alex Thomas

eBook

$67.99 

Available on Compatible NOOK devices, the free NOOK App and in My Digital Library.

Product Details

ISBN-13: 9781492047711
Publisher: O'Reilly Media, Incorporated
Publication date: 06/25/2020
Sold by: Barnes & Noble
Format: eBook
Pages: 366
File size: 4 MB

About the Author

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Table of Contents

Preface

Part I Basics

1 Getting Started
Introduction
Other Tools
Setting Up Your Environment
Prerequisites
Starting Apache Spark
Checking Out the Code
Getting Familiar with Apache Spark
Starting Apache Spark with Spark NLP
Loading and Viewing Data in Apache Spark
Hello World with Spark NLP

2 Natural Language Basics
What Is Natural Language?
Origins of Language
Spoken Language Versus Written Language
Linguistics
Phonetics and Phonology
Morphology
Syntax
Semantics
Sociolinguistics: Dialects, Registers, and Other Varieties
Formality
Context
Pragmatics
Roman Jakobson
How To Use Pragmatics
Writing Systems
Origins
Alphabets
Abjads
Abugidas
Syllabaries
Logographs
Encodings
ASCII
Unicode
UTF-8
Exercises: Tokenizing
Tokenize English
Tokenize Greek
Tokenize Ge'ez (Amharic)
Resources

3 NLP on Apache Spark
Parallelism, Concurrency, Distributing Computation
Parallelization Before Apache Hadoop
MapReduce and Apache Hadoop
Apache Spark
Architecture of Apache Spark
Physical Architecture
Logical Architecture
Spark SQL and Spark MLlib
Transformers
Estimators and Models
Evaluators
NLP Libraries
Functionality Libraries
Annotation Libraries
NLP in Other Libraries
Spark NLP
Annotation Library
Stages
Pretrained Pipelines
Finisher
Exercises: Build a Topic Model
Resources

4 Deep Learning Basics
Gradient Descent
Backpropagation
Convolutional Neural Networks
Filters
Pooling
Recurrent Neural Networks
Backpropagation Through Time
Elman Nets
LSTMs
Exercise 1
Exercise 2
Resources

Part II Building Blocks

5 Processing Words
Tokenization
Vocabulary Reduction
Stemming
Lemmatization
Stemming Versus Lemmatization
Spelling Correction
Normalization
Bag-of-Words
CountVectorizer
N-Gram
Visualizing: Word and Document Distributions
Exercises
Resources

6 Information Retrieval
Inverted Indices
Building an Inverted Index
Vector Space Model
Stop-Word Removal
Inverse Document Frequency
In Spark
Exercises
Resources

7 Classification and Regression
Bag-of-Words Features
Regular Expression Features
Feature Selection
Modeling
Naïve Bayes
Linear Models
Decision/Regression Trees
Deep Learning Algorithms
Iteration
Exercises

8 Sequence Modeling with Keras
Sentence Segmentation
(Hidden) Markov Models
Section Segmentation
Part-of-Speech Tagging
Conditional Random Field
Chunking and Syntactic Parsing
Language Models
Recurrent Neural Networks
Exercise: Character N-Grams
Exercise: Word Language Model
Resources

9 Information Extraction
Named-Entity Recognition
Coreference Resolution
Assertion Status Detection
Relationship Extraction
Summary
Exercises

10 Topic Modeling
K-Means
Latent Semantic Indexing
Nonnegative Matrix Factorization
Latent Dirichlet Allocation
Exercises

11 Word Embeddings
Word2vec
GloVe
FastText
Transformers
ELMo, BERT, and XLNet
doc2vec
Exercises

Part III Applications

12 Sentiment Analysis and Emotion Detection
Problem Statement and Constraints
Plan the Project
Design the Solution
Implement the Solution
Test and Measure the Solution
Business Metrics
Model-Centric Metrics
Infrastructure Metrics
Process Metrics
Offline Versus Online Model Measurement
Review
Initial Deployment
Fallback Plans
Next Steps
Conclusion

13 Building Knowledge Bases
Problem Statement and Constraints
Plan the Project
Design the Solution
Implement the Solution
Test and Measure the Solution
Business Metrics
Model-Centric Metrics
Infrastructure Metrics
Process Metrics
Review
Conclusion

14 Search Engine
Problem Statement and Constraints
Plan the Project
Design the Solution
Implement the Solution
Test and Measure the Solution
Business Metrics
Model-Centric Metrics
Review
Conclusion

15 Chatbot
Problem Statement and Constraints
Plan the Project
Design the Solution
Implement the Solution
Test and Measure the Solution
Business Metrics
Model-Centric Metrics
Review
Conclusion

16 Object Character Recognition
Kinds of OCR Tasks
Images of Printed Text and PDFs to Text
Images of Handwritten Text to Text
Images of Text in Environment to Text
Images of Text to Target
Note on Different Writing Systems
Problem Statement and Constraints
Plan the Project
Implement the Solution
Test and Measure the Solution
Model-Centric Metrics
Review
Conclusion

Part IV Building NLP Systems

17 Supporting Multiple Languages
Language Typology
Scenario: Academic Paper Classification
Text Processing in Different Languages
Compound Words
Morphological Complexity
Transfer Learning and Multilingual Deep Learning
Search Across Languages
Checklist
Conclusion

18 Human Labeling
Guidelines
Scenario: Academic Paper Classification
Inter-Labeler Agreement
Iterative Labeling
Labeling Text
Classification
Tagging
Checklist
Conclusion

19 Productionizing NLP Applications
SparkNLP Model Cache
Spark NLP and TensorFlow Integration
Spark Optimization Basics
Design-Level Optimization
Profiling Tools
Monitoring
Managing Data Resources
Testing NLP-Based Applications
Unit Tests
Integration Tests
Smoke and Sanity Tests
Performance Tests
Usability Tests
Demoing NLP-Based Applications
Checklists
Model Deployment Checklist
Scaling and Performance Checklist
Testing Checklist
Conclusion

Glossary

Index
