Table of Contents
List of Figures
List of Tables
Author Bios
Foreword
Preface
Contributors
1 Deep Learning and Transformers: An Introduction
1.1 DEEP LEARNING: A HISTORIC PERSPECTIVE
1.2 TRANSFORMERS AND TAXONOMY
1.2.1 Modified Transformer Architecture
1.2.1.1 Transformer block changes
1.2.1.2 Transformer sublayer changes
1.2.2 Pretraining Methods and Applications
1.3 RESOURCES
1.3.1 Libraries and Implementations
1.3.2 Books
1.3.3 Courses, Tutorials, and Lectures
1.3.4 Case Studies and Details
2 Transformers: Basics and Introduction
2.1 ENCODER-DECODER ARCHITECTURE
2.2 SEQUENCE TO SEQUENCE
2.2.1 Encoder
2.2.2 Decoder
2.2.3 Training
2.2.4 Issues with RNN-based Encoder-Decoder
2.3 ATTENTION MECHANISM
2.3.1 Background
2.3.2 Types of Score-Based Attention
2.3.2.1 Dot Product (multiplicative)
2.3.2.2 Scaled Dot Product or multiplicative
2.3.2.3 Linear, MLP, or additive
2.3.3 Attention-based Sequence to Sequence
2.4 TRANSFORMER
2.4.1 Source and Target Representation
2.4.1.1 Word Embedding
2.4.1.2 Positional Encoding
2.4.2 Attention Layers
2.4.2.1 Self-Attention
2.4.2.2 Multi-Head Attention
2.4.2.3 Masked Multi-Head Attention
2.4.2.4 Encoder-Decoder Multi-Head Attention
2.4.3 Residuals and Layer Normalization
2.4.4 Position-wise Feed-Forward Networks
2.4.5 Encoder
2.4.6 Decoder
2.5 CASE STUDY: MACHINE TRANSLATION
2.5.1 Goal
2.5.2 Data, Tools, and Libraries
2.5.3 Experiments, Results, and Analysis
2.5.3.1 Exploratory Data Analysis
2.5.3.2 Attention
2.5.3.3 Transformer
2.5.3.4 Results and Analysis
2.5.3.5 Explainability
3 Bidirectional Encoder Representations from Transformers (BERT)
3.1 BERT
3.1.1 Architecture
3.1.2 Pre-training
3.1.3 Fine-tuning
3.2 BERT VARIANTS
3.2.1 RoBERTa
3.3 APPLICATIONS
3.3.1 TaBERT
3.3.2 BERTopic
3.4 BERT INSIGHTS
3.4.1 BERT Sentence Representation
3.4.2 BERTology
3.5 CASE STUDY: TOPIC MODELING WITH TRANSFORMERS
3.5.1 Goal
3.5.2 Data, Tools, and Libraries
3.5.2.1 Data
3.5.2.2 Compute embeddings
3.5.3 Experiments, Results, and Analysis
3.5.3.1 Building Topics
3.5.3.2 Topic size distribution
3.5.3.3 Visualization of topics
3.5.3.4 Content of topics
3.6 CASE STUDY: FINE-TUNING BERT
3.6.1 Goal
3.6.2 Data, Tools, and Libraries
3.6.3 Experiments, Results, and Analysis
4 Multilingual Transformer Architectures
4.1 MULTILINGUAL TRANSFORMER ARCHITECTURES
4.1.1 Basic Multilingual Transformer
4.1.2 Single-Encoder Multilingual NLU
4.1.2.1 mBERT
4.1.2.2 XLM
4.1.2.3 XLM-RoBERTa
4.1.2.4 ALM
4.1.2.5 Unicoder
4.1.2.6 InfoXLM
4.1.2.7 AMBER
4.1.2.8 ERNIE-M
4.1.2.9 HICTL
4.1.3 Dual-Encoder Multilingual NLU
4.1.3.1 LaBSE
4.1.3.2 mUSE
4.1.4 Multilingual NLG
4.2 MULTILINGUAL DATA
4.2.1 Pre-training Data
4.2.2 Multilingual Benchmarks
4.2.2.1 Classification
4.2.2.2 Structure Prediction
4.2.2.3 Question Answering
4.2.2.4 Semantic Retrieval
4.3 MULTILINGUAL TRANSFER LEARNING INSIGHTS
4.3.1 Zero-shot Cross-lingual Learning
4.3.1.1 Data Factors
4.3.1.2 Model Architecture Factors
4.3.1.3 Model Tasks Factors
4.3.2 Language-agnostic Cross-lingual Representations
4.4 CASE STUDY
4.4.1 Goal
4.4.2 Data, Tools, and Libraries
4.4.3 Experiments, Results, and Analysis
4.4.3.1 Data Preprocessing
4.4.3.2 Experiments
5 Transformer Modifications
5.1 TRANSFORMER BLOCK MODIFICATIONS
5.1.1 Lightweight Transformers
5.1.1.1 Funnel-Transformer
5.1.1.2 DeLighT
5.1.2 Connections between Transformer Blocks
5.1.2.1 RealFormer
5.1.3 Adaptive Computation Time
5.1.3.1 Universal Transformers (UT)
5.1.4 Recurrence Relations between Transformer Blocks
5.1.4.1 Transformer-XL
5.1.5 Hierarchical Transformers
5.2 TRANSFORMERS WITH MODIFIED MULTI-HEAD SELF-ATTENTION
5.2.1 Structure of Multi-head Self-Attention
5.2.1.1 Multi-head self-attention
5.2.1.2 Space and time complexity
5.2.2 Reducing Complexity of Self-attention
5.2.2.1 Longformer
5.2.2.2 Reformer
5.2.2.3 Performer
5.2.2.4 Big Bird
5.2.3 Improving Multi-head Attention
5.2.3.1 Talking-Heads Attention
5.2.4 Biasing Attention with Priors
5.2.5 Prototype Queries
5.2.5.1 Clustered Attention
5.2.6 Compressed Key-Value Memory
5.2.6.1 Luna: Linear Unified Nested Attention
5.2.7 Low-rank Approximations
5.2.7.1 Linformer
5.3 MODIFICATIONS FOR TRAINING TASK EFFICIENCY
5.3.1 ELECTRA
5.3.1.1 Replaced token detection
5.3.2 T5
5.4 TRANSFORMER SUBMODULE CHANGES
5.4.1 Switch Transformer
5.5 CASE STUDY: SENTIMENT ANALYSIS
5.5.1 Goal
5.5.2 Data, Tools, and Libraries
5.5.3 Experiments, Results, and Analysis
5.5.3.1 Visualizing attention head weights
5.5.3.2 Analysis
6 Pretrained and Application-Specific Transformers
6.1 TEXT PROCESSING
6.1.1 Domain-Specific Transformers
6.1.1.1 BioBERT
6.1.1.2 SciBERT
6.1.1.3 FinBERT
6.1.2 Text-to-text Transformers
6.1.2.1 ByT5
6.1.3 Text generation
6.1.3.1 GPT: Generative Pre-training
6.1.3.2 GPT-2
6.1.3.3 GPT-3
6.2 COMPUTER VISION
6.2.1 Vision Transformer
6.3 AUTOMATIC SPEECH RECOGNITION
6.3.1 Wav2vec 2.0
6.3.2 Speech2Text2
6.3.3 HuBERT: Hidden Units BERT
6.4 MULTIMODAL AND MULTITASKING TRANSFORMER
6.4.1 Vision-and-Language BERT (ViLBERT)
6.4.2 Unified Transformer (UniT)
6.5 VIDEO PROCESSING WITH TIMESFORMER
6.5.1 Patch embeddings
6.5.2 Self-attention
6.5.2.1 Spatiotemporal self-attention
6.5.2.2 Spatiotemporal attention blocks
6.6 GRAPH TRANSFORMERS
6.6.1 Positional encodings in a graph
6.6.1.1 Laplacian positional encodings
6.6.2 Graph transformer input
6.6.2.1 Graphs without edge attributes
6.6.2.2 Graphs with edge attributes
6.7 REINFORCEMENT LEARNING
6.7.1 Decision Transformer
6.8 CASE STUDY: AUTOMATIC SPEECH RECOGNITION
6.8.1 Goal
6.8.2 Data, Tools, and Libraries
6.8.3 Experiments, Results, and Analysis
6.8.3.1 Preprocessing speech data
6.8.3.2 Evaluation
7 Interpretability and Explainability Techniques for Transformers
7.1 TRAITS OF EXPLAINABLE SYSTEMS
7.2 RELATED AREAS THAT IMPACT EXPLAINABILITY
7.3 EXPLAINABLE METHODS TAXONOMY
7.3.1 Visualization Methods
7.3.1.1 Backpropagation-based
7.3.1.2 Perturbation-based
7.3.2 Model Distillation
7.3.2.1 Local Approximation
7.3.2.2 Model Translation
7.3.3 Intrinsic Methods
7.3.3.1 Probing Mechanism
7.3.3.2 Joint Training
7.4 ATTENTION AND EXPLANATION
7.4.1 Attention is not Explanation
7.4.1.1 Attention Weights and Feature Importance
7.4.1.2 Counterfactual Experiments
7.4.2 Attention is not not Explanation
7.4.2.1 Is attention necessary for all tasks?
7.4.2.2 Searching for Adversarial Models
7.4.2.3 Attention Probing
7.5 QUANTIFYING ATTENTION FLOW
7.5.1 Information flow as DAG
7.5.2 Attention Rollout
7.5.3 Attention Flow
7.6 CASE STUDY: TEXT CLASSIFICATION WITH EXPLAINABILITY
7.6.1 Goal
7.6.2 Data, Tools, and Libraries
7.6.3 Experiments, Results, and Analysis
7.6.3.1 Exploratory Data Analysis
7.6.3.2 Experiments
7.6.3.3 Error Analysis and Explainability
Bibliography
Alphabetical Index