Practical Weak Supervision: Doing More with Less Data

Most data scientists and engineers today rely on quality labeled data to train machine learning models. But building a training set manually is time-consuming and expensive, leaving many companies with unfinished ML projects. There's a more practical approach. In this book, Wee Hyong Tok, Amit Bahree, and Senja Filipi show you how to create products using weakly supervised learning models.

You'll learn how to build natural language processing and computer vision projects using weakly labeled datasets from Snorkel, a spin-off from the Stanford AI Lab. Because so many companies have pursued ML projects that never go beyond their labs, this book also provides a guide on how to ship the deep learning models you build.

  • Get up to speed on the field of weak supervision, including ways to use it as part of the data science process
  • Use Snorkel AI for weak supervision and data programming
  • Get code examples for using Snorkel to label text and image datasets
  • Use a weakly labeled dataset for text and image classification
  • Learn practical considerations for using Snorkel with large datasets and using Spark clusters to scale labeling

eBook

$76.99 

Product Details

ISBN-13: 9781492077015
Publisher: O'Reilly Media, Incorporated
Publication date: 09/30/2021
Sold by: Barnes & Noble
Format: eBook
Pages: 192
File size: 7 MB

About the Author

Wee Hyong Tok is a product and AI leader with a background in product management, machine learning and deep learning, research, and complex technical engagements with customers. Over the years, the tech trends he described in his early thought-leadership whitepapers have become reality and are now deeply integrated into many products. Wee Hyong has worn many hats in his career (developer, program/product manager, data scientist, researcher, and strategist), and this range of experience has given him unique superpowers to lead and define the strategy for high-performing data and AI innovation teams.

Table of Contents

Foreword by Xuedong Huang vii

Foreword by Alex Ratner ix

Preface xiii

1 Introduction to Weak Supervision 1

What Is Weak Supervision? 1

Real-World Weak Supervision with Snorkel 2

Approaches to Weak Supervision 6

Incomplete Supervision 6

Inexact Supervision 9

Inaccurate Supervision 10

Data Programming 11

Getting Training Data 13

How Data Programming Is Helping Accelerate Software 2.0 14

Summary 16

2 Diving into Data Programming with Snorkel 17

Snorkel, a Data Programming Framework 18

Getting Started with Labeling Functions 19

Applying the Labels to the Datasets 21

Analyzing the Labeling Performance 22

Using a Validation Set 27

Reaching Labeling Consensus with LabelModel 29

Intuition Behind LabelModel 30

LabelModel Parameter Estimation 30

Strategies to Improve the Labeling Functions 32

Data Augmentation with Snorkel Transformers 33

Data Augmentation Through Word Removal 36

Snorkel Preprocessors 38

Data Augmentation Through GPT-2 Prediction 39

Data Augmentation Through Translation 42

Applying the Transformation Functions to the Dataset 45

Summary 47

3 Labeling in Action 49

Labeling a Text Dataset: Identifying Fake News 50

Exploring the Fake News Detection (FakeNewsNet) Dataset 51

Importing Snorkel and Setting Up Representative Constants 52

Fact-Checking Sites 52

Is the Speaker a "Liar"? 61

Twitter Profile and Botometer Score 63

Generating Agreements Between Weak Classifiers 64

Labeling an Images Dataset: Determining Indoor Versus Outdoor Images 67

Creating a Dataset of Images from Bing 71

Defining and Training Weak Classifiers in TensorFlow 71

Training the Various Classifiers 74

Weak Classifiers out of Image Tags 76

Deploying the Computer Vision Service 77

Interacting with the Computer Vision Service 78

Preparing the DataFrame 80

Learning a LabelModel 81

Summary 85

4 Using the Snorkel-Labeled Dataset for Text Classification 87

Getting Started with Natural Language Processing (NLP) 88

Transformers 89

Hard Versus Probabilistic Labels 91

Using ktrain for Performing Text Classification 91

Data Preparation 92

Dealing with an Imbalanced Dataset 93

Training the Model 95

Using the Text Classification Model for Prediction 97

Finding a Good Learning Rate 99

Using Hugging Face and Transformers 100

Loading the Relevant Python Packages 101

Dataset Preparation 101

Checking Whether GPU Hardware Is Available 102

Performing Tokenization 102

Model Training 104

Testing the Fine-Tuned Model 108

Summary 109

5 Using the Snorkel-Labeled Dataset for Image Classification 111

Visual Object Recognition Overview 111

Representing Image Features 112

Transfer Learning for Computer Vision 113

Using PyTorch for Image Classification 114

Loading the Indoor/Outdoor Dataset 115

Utility Functions 118

Visualizing the Training Data 119

Fine-Tuning the Pretrained Model 120

Summary 130

6 Scalability and Distributed Training 131

The Need for Scalability 132

Distributed Training 133

Apache Spark: An Introduction 135

Spark Application Design 137

Using Azure Databricks to Scale 138

Cluster Setup for Weak Supervision 141

Fake News Detection Dataset on Databricks 143

Labeling Functions for Snorkel 145

Setting Up Dependencies 147

Loading the Data 149

Fact-Checking Sites 151

Transfer Learning Using the LIAR Dataset 153

Weak Classifiers: Generating Agreement 154

Type Conversions Needed for Spark Runtime 156

Summary 159

Index 161
