Transformers for Machine Learning: A Deep Dive

Transformers are becoming a core part of many neural network architectures, employed in a wide range of applications such as NLP, Speech Recognition, Time Series, and Computer Vision. Transformers have gone through many adaptations and alterations, resulting in newer techniques and methods. Transformers for Machine Learning: A Deep Dive is the first comprehensive book on transformers.

Key Features:

A comprehensive reference book for detailed explanations for every algorithm and techniques related to the transformers.
60+ transformer architectures covered in a comprehensive manner.
A book for understanding how to apply the transformer techniques in speech, text, time series, and computer vision.
Practical tips and tricks for each architecture and how to use it in the real world.
Hands-on case studies and code snippets for theory and practical real-world analysis using the tools and libraries, all ready to run in Google Colab.

The theoretical explanations of the state-of-the-art transformer architectures will appeal to postgraduate students and researchers (academic and industry) as it will provide a single entry point with deep discussions of a quickly moving field. The practical hands-on case studies and code will appeal to undergraduate students, practitioners, and professionals as it allows for quick experimentation and lowers the barrier to entry into the field.

1140559234

Transformers for Machine Learning: A Deep Dive

Key Features:

A comprehensive reference book for detailed explanations for every algorithm and techniques related to the transformers.
60+ transformer architectures covered in a comprehensive manner.
A book for understanding how to apply the transformer techniques in speech, text, time series, and computer vision.
Practical tips and tricks for each architecture and how to use it in the real world.
Hands-on case studies and code snippets for theory and practical real-world analysis using the tools and libraries, all ready to run in Google Colab.

59.99 In Stock

Transformers for Machine Learning: A Deep Dive

Add to Wishlist

Transformers for Machine Learning: A Deep Dive

Paperback

$59.99

View All Available Formats & Editions

Paperback
$59.99

View All Available Formats & Editions

SHIP THIS ITEM

In stock. Ships in 1-2 days.
PICK UP IN STORE

Your local store may have stock of this item.

Available within 2 business hours

Want it Today?
Check Store Availability

Related collections and offers

Overview

Key Features:

A comprehensive reference book for detailed explanations for every algorithm and techniques related to the transformers.
60+ transformer architectures covered in a comprehensive manner.
A book for understanding how to apply the transformer techniques in speech, text, time series, and computer vision.
Practical tips and tricks for each architecture and how to use it in the real world.
Hands-on case studies and code snippets for theory and practical real-world analysis using the tools and libraries, all ready to run in Google Colab.

Product Details

ISBN-13:	9780367767341
Publisher:	CRC Press
Publication date:	05/25/2022
Series:	Chapman & Hall/CRC Machine Learning & Pattern Recognition
Pages:	283
Product dimensions:	6.12(w) x 9.19(h) x (d)

About the Author

Uday Kamath has spent more than two decades developing analytics products and combines this experience with learning in statistics, optimization, machine learning, bioinformatics, and evolutionary computing. Uday has contributed to many journals, conferences, and books, is the author of books like XAI: An Introduction to Interpretable XAI, Deep Learning for NLP and Speech Recognition, Mastering Java Machine Learning, and Machine Learning: End-to-End guide for Java developers. He held many senior roles: Chief Analytics Officer for Digital Reasoning, Advisor for
Falkonry, and Chief Data Scientist for BAE Systems Applied Intelligence. Uday has many patents and has built commercial products using AI in domains such as compliance, cybersecurity, financial crime, and bioinformatics. Uday currently works as the Chief Analytics Officer for Smarsh. He is responsible for data science, research of analytical products employing deep learning, transformers, explainable AI, and modern techniques in speech and text for the financial domain and healthcare.

Wael Emara has two decades of experience in academia and industry. Wael has a PhD in Computer Engineering and Computer Science with emphasis on machine learning and artificial intelligence. His technical background and research spans signal and image processing, computer vision, medical imaging, social media analytics, machine learning, and natural language processing. Wael has 10 research publications in various machine learning topics and he is active in the technical community in the greater New York area. Wael currently works as a Senior Research Engineer for Digital Reasoning where he is doing research on state-of-the-art artificial intelligence NLP systems.

Kenneth L. Graham has two decades experience solving quantitative problems in multiple domains, including Monte Carlo simulation, NLP, anomaly detection, cybersecurity, and behavioral profiling. For the past nine years, he has focused on building scalable solutions in NLP for government and industry, including entity coreference resolution, text classification, active learning, and temporal normalization. Kenneth currently works at Smarsh as a Principal Research Engineer, researching how to move state-of the-art NLP methods out of research and into production. Kenneth has five patents for his work in natural language processing, seven research publications, and a Ph.D. in Condensed Matter Physics.

Foreword xvii

Preface xix

Authors xxiii

Contributors xxv

Chapter 1 Deep Learning and Transformers: An Introduction 1

1.1 Deep Learning: A Historic Perspective 1

1.2 Transformers and Taxonomy 4

1.2.1 Modified Transformer Architecture 4

1.2.1.1 Transformer block changes 4

1.2.1.2 Transformer sublayer changes 5

1.2.2 Pre-training Methods and Applications 8

1.3 Resources 8

1.3.1 Libraries and Implementations 8

1.3.2 Books 9

1.3.3 Courses, Tutorials, and Lectures 9

1.3.4 Case Studies and Details 10

Chapter 2 Transformers: Basics and Introduction 11

2.1 Encoder-Decoder Architecture 11

2.2 Sequence-to-Sequence 12

2.2.1 Encoder 12

2.2.2 Decoder 13

2.2.3 Training 14

2.2.4 Issues with RNN-Based Encoder-Decoder 14

2.3 Attention Mechanism 14

2.3.1 Background 14

2.3.2 Types of Score-Based Attention 16

2.3.2.1 Dot product (multiplicative) 17

2.3.2.2 Scaled dot product or multiplicative 17

2.3.2.3 Linear, MLP, or Additive 17

2.3.3 Attention-Based Sequence-to-Sequence 18

2.4 Transformer 19

2.4.1 Source and Target Representation 20

2.4.1.1 Word embedding 20

2.4.1.2 Positional encoding 20

2.4.2 Attention Layers 22

2.4.2.1 Self-attention 22

2.4.2.2 Multi-head attention 24

2.4.2.3 Masked multi-head attention 25

2.4.2.4 Encoder-decoder multi-head attention 26

2.4.3 Residuals and Layer Normalization 26

2.4.4 Positionwise Feed-forward Networks 26

2.4.5 Encoder 27

2.4.6 Decoder 27

2.5 Case Study: Machine Translation 27

2.5.1 Goal 27

2.5.2 Data, Tools, and Libraries 27

2.5.3 Experiments, Results, and Analysis 28

2.5.3.1 Exploratory data analysis 28

2.5.3.2 Attention 29

2.5.3.3 Transformer 35

2.5.3.4 Results and analysis 38

2.5.3.5 Explainability 38

Chapter 3 Bidirectional Encoder Representations from Transformers (BERT) 43

3.1 BERT 43

3.1.1 Architecture 43

3.1.2 Pre-Training 45

3.1.3 Fine-Tuning 46

3.2 Bert Variants 48

3.2.1 RoBERTa 48

3.3 Applications 49

3.3.1 TaBERT 49

3.3.2 BERTopic 50

3.4 Bert Insights 51

3.4.1 BERT Sentence Representation 51

3.4.2 BERTology 52

3.5 Case Study: Topic Modeling with Transformers 53

3.5.1 Goal 53

3.5.2 Data, Tools, and Libraries 53

3.5.2.1 Data 54

3.5.2.2 Compute embeddings 54

3.5.3 Experiments, Results, and Analysis 55

3.5.3.1 Building topics 55

3.5.3.2 Topic size distribution 55

3.5.3.3 Visualization of topics 56

3.5.3.4 Content of topics 57

3.6 Case Study: Fine-Tuning Bert 63

3.6.1 Goal 63

3.6.2 Data, Tools, and Libraries 63

3.6.3 Experiments, Results, and Analysis 64

Chapter 4 Multilingual Transformer Architectures 71

4.1 Multilingual Transformer Architectures 72

4.1.1 Basic Multilingual Transformer 72

4.1.2 Single-Encoder Multilingual NLU 74

4.1.2.1 mBERT 74

4.1.2.2 XLM 75

4.1.2.3 XLM-RoBERTa 77

4.1.2.4 ALM 77

4.1.2.5 Unicoder 78

4.1.2.6 INFOXLM 80

4.1.2.7 AMBER 81

4.1.2.8 ERNIE-M 82

4.1.2.9 HITCL 84

4.1.3 Dual-Encoder Multilingual NLU 85

4.1.3.1 LaBSE 85

4.1.3.2 mUSE 87

4.1.4 Multilingual NLG 89

4.2 Multilingual Data 90

4.2.1 Pre-Training Data 90

4.2.2 Multilingual Benchmarks 91

4.2.2.1 Classification 91

4.2.2.2 Structure prediction 92

4.2.2.3 Question answering 92

4.2.2.4 Semantic retrieval 92

4.3 Multilingual Transfer Learning Insights 93

4.3.1 Zero-Shot Cross-Lingual Learning 93

4.3.1.1 Data factors 93

4.3.1.2 Model architecture factors 94

4.3.1.3 Model tasks factors 95

4.3.2 Language-Agnostic Cross-Lingual Representations 96

4.4 Case Study 97

4.4.1 Goal 97

4.4.2 Data, Tools, and Libraries 98

4.4.3 Experiments, Results, and Analysis 98

4.4.3.1 Data preprocessing 99

4.4.3.2 Experiments 101

Chapter 5 Transformer Modifications 109

5.1 Transformer Block Modifications 109

5.1.1 Lightweight Transformers 109

5.1.1.1 Funnel-transformer 109

5.1.1.2 DeLighT 112

5.1.2 Connections between Transformer Blocks 114

5.1.2.1 RealFormer 114

5.1.3 Adaptive Computation Time 115

5.1.3.1 Universal transformers (UT) 115

5.1.4 Recurrence Relations between Transformer Blocks 116

5.1.4.1 Transformer-XL 116

5.1.5 Hierarchical Transformers 120

5.2 Transformers with Modified Multi-Head Self-Attention 120

5.2.l Structure of Multi-Head Self-Attention 120

5.2.1.1 Multi-head self-attention 122

5.2.1.2 Space and time complexity 123

5.2.2 Reducing Complexity of Self-Attention 124

5.2.2.1 Longformer 124

5.2.2.2 Reformer 126

5.2.2.3 Performer 131

5.2.2.4 Big Bird 132

5.2.3 Improving Multi-Head-Attention 137

5.2.3.1 Talking-heads attention 137

5.2.4 Biasing Attention with Priors 140

5.2.5 Prototype Queries 140

5.2.5.1 Clustered attention 140

5.2.6 Compressed Key-Value Memory 141

5.2.6.1 Luna: Linear Unified Nested Attention 141

5.2.7 Low-Rank Approximations 143

5.2.7.1 Linformer 143

5.3 Modifications for Training Task Efficiency 145

5.3.1 Electra 145

5.3.1.1 Replaced token detection 145

5.3.2 T5 146

5.4 Transformer Submodule Changes 146

5.4.1 Switch Transformer 146

5.5 Case Study: Sentiment Analysis 148

5.5.1 Goal 148

5.5.2 Data, Tools, and Libraries 148

5.5.3 Experiments, Results, and Analysis 150

5.5.3.1 Visualizing attention head weights 150

5.5.3.2 Analysis 152

Chapter 6 Pre-trained and Application-Specific Transformers 155

6.1 Text Processing 155

6.1.1 Domain-Specific Transformers 155

6.1.1.1 BioBERT 155

6.1.1.2 SciBERT 156

6.1.1.3 FinBERT 156

6.1.2 Text-to-Text Transformers 157

6.1.2.1 ByT5 157

6.1.3 Text Generation 158

6.1.3.1 GPT: Generative pre-training 158

6.1.3.2 GPT-2 160

6.1.3.3 GPT-3 161

6.2 Computer Vision 163

6.2.1 Vision Transformer 163

6.3 Automatic Speech Recognition 164

6.3.1 Wav2vec 2.0 165

6.3.2 Speech2Text2 165

6.3.3 HuBERT: Hidden Units BERT 166

6.4 Multimodal and Multitasking Transformer 166

6.4.1 Vision-and-Language BERT (VilBERT) 167

6.4.2 Unified Transformer (UniT) 168

6.5 Video Processing with Timesformer 169

6.5.1 Patch Embeddings 169

6.5.2 Self-Attention 170

6.5.2.1 Spatiotemporal self-attention 171

6.5.2.2 Spatiotemporal attention blocks 171

6.6 Graph Transformers 172

6.6.1 Positional Encodings in a Graph 173

6.6.1.1 Laplacian positional encodings 173

6.6.2 Graph Transformer Input 173

6.6.2.1 Graphs without edge attributes 174

6.6.2.2 Graphs with edge attributes 175

6.7 Reinforcement Learning 177

6.7.1 Decision Transformer 178

6.8 Case Study: Automatic Speech Recognition 180

6.8.1 Goal 180

6.8.2 Data, Tools, and Libraries 180

6.8.3 Experiments, Results, and Analysis 180

6.8.3.1 Preprocessing speech data 180

6.8.3.2 Evaluation 181

Chapter 7 Interpratability and Explainability Techniques for Transformers 187

7.1 Traits of Explainable Systems 187

7.2 Related Areas that Impact Explainability 189

7.3 Explainable Methods Taxonomy 190

7.3.1 Visualization Methods 190

7.3.1.1 Backpropagation-based 190

7.3.1.2 Perturbation-based 194

7.3.2 Model Distillation 195

7.3.2.1 Local approximation 195

7.3.2.2 Model translation 198

7.3.3 Intrinsic Methods 198

7.3.3.1 Probing mechanism 198

7.3.3.2 Joint training 201

7.4 Attention and Explanation 202

7.4.1 Attention is Not an Explanation 202

7.4.1.1 Attention weights and feature importance 202

7.4.1.2 Counterfactual experiments 204

7.4.2 Attention is Not Not an Explanation 205

7.4.2.1 Is attention necessary for all tasks? 206

7.4.2.2 Searching for adversarial models 207

7.4.2.3 Attention probing 208

7.5 Quantifying Attention Flow 208

7.5.1 Information Flow as DAG 208

7.5.2 Attention Rollout 209

7.5.3 Attention Flow 209

7.6 Case Study: Text Classification with Explainability 210

7.6.1 Goal 210

7.6.2 Data, Tools, and Libraries 211

7.6.3 Experiments, Results, and Analysis 211

7.6.3.1 Exploratory data analysis 211

7.6.3.2 Experiments 211

7.6.3.3 Error analysis and explainability 212

Bibliography 221

Index 255

From the B&N Reads Blog

Page 1 of

Transformers for Machine Learning: A Deep Dive

Transformers for Machine Learning: A Deep Dive

Paperback

Paperback

Related collections and offers

Overview

Product Details

About the Author

Table of Contents

Customer Reviews