Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation

by Masashi Sugiyama, Motoaki Kawanabe

Hardcover

$49.50 (list price $55.00; save 10%)

Overview

Theory, algorithms, and applications of machine learning techniques to overcome “covariate shift” non-stationarity.

As the power of computing has grown over the past few decades, the field of machine learning has advanced rapidly in both theory and practice. Machine learning methods are usually based on the assumption that the data generation mechanism does not change over time. Yet real-world applications of machine learning, including image recognition, natural language processing, speech recognition, robot control, and bioinformatics, often violate this common assumption. Dealing with non-stationarity is one of modern machine learning's greatest challenges. This book focuses on a specific non-stationary environment known as covariate shift, in which the distributions of inputs (queries) change but the conditional distribution of outputs (answers) is unchanged, and presents machine learning theory, algorithms, and applications to overcome this variety of non-stationarity.
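The covariate shift setting described above can be illustrated with a small, self-contained sketch: the input distribution changes between training and test, the conditional distribution of outputs does not, and importance weighting by the density ratio corrects a misspecified model. All numbers here are illustrative, and the true densities are used directly to form the weights; in practice the density ratio would have to be estimated, for example by the importance estimation procedures covered in chapter 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2); usable here because both input densities are known.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def f(x):
    # The conditional mean of y given x is fixed: covariate shift leaves p(y|x) intact.
    return x ** 3 - x

# The inputs shift: training queries come from N(0.5, 0.5^2),
# test queries from N(0.0, 0.3^2); the output mechanism is the same.
x_tr = rng.normal(0.5, 0.5, 200)
y_tr = f(x_tr) + rng.normal(0.0, 0.3, 200)
x_te = rng.normal(0.0, 0.3, 2000)
y_te = f(x_te) + rng.normal(0.0, 0.3, 2000)

# Importance weights w(x) = p_test(x) / p_train(x).
w = gauss_pdf(x_tr, 0.0, 0.3) / gauss_pdf(x_tr, 0.5, 0.5)

# Fit a deliberately misspecified straight line two ways.
X = np.column_stack([np.ones_like(x_tr), x_tr])
beta_ols = np.linalg.lstsq(X, y_tr, rcond=None)[0]                    # ordinary least squares
beta_iw = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))   # importance-weighted LS

Xte = np.column_stack([np.ones_like(x_te), x_te])
mse_ols = np.mean((Xte @ beta_ols - y_te) ** 2)
mse_iw = np.mean((Xte @ beta_iw - y_te) ** 2)
print(f"test MSE, unweighted: {mse_ols:.3f}  importance-weighted: {mse_iw:.3f}")
```

Because the linear model cannot represent the cubic relationship, the unweighted fit is pulled toward the training region around x = 0.5, while the weighted fit concentrates on the test region around x = 0 and achieves lower test error.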

After reviewing the state-of-the-art research in the field, the authors discuss topics that include learning under covariate shift, model selection, importance estimation, and active learning. They describe such real-world applications of covariate shift adaptation as brain-computer interfaces, speaker identification, and age prediction from facial images. With this book, they aim to encourage future research in machine learning, statistics, and engineering that strives to create truly autonomous learning machines able to learn under non-stationarity.

Product Details

ISBN-13: 9780262017091
Publisher: MIT Press
Publication date: 03/30/2012
Series: Adaptive Computation and Machine Learning series
Pages: 280
Product dimensions: 6.20(w) x 9.10(h) x 0.70(d)
Age Range: 18 Years

About the Author

Masashi Sugiyama is Associate Professor in the Department of Computer Science at Tokyo Institute of Technology.

Motoaki Kawanabe is a Postdoctoral Researcher in Intelligent Data Analysis at the Fraunhofer FIRST Institute, Berlin.

Table of Contents

Foreword xi

Preface xiii

I Introduction

1 Introduction and Problem Formulation 3

1.1 Machine Learning under Covariate Shift 3

1.2 Quick Tour of Covariate Shift Adaptation 5

1.3 Problem Formulation 7

1.3.1 Function Learning from Examples 7

1.3.2 Loss Functions 8

1.3.3 Generalization Error 9

1.3.4 Covariate Shift 9

1.3.5 Models for Function Learning 10

1.3.6 Specification of Models 13

1.4 Structure of This Book 14

1.4.1 Part II: Learning under Covariate Shift 14

1.4.2 Part III: Learning Causing Covariate Shift 17

II Learning under Covariate Shift

2 Function Approximation 21

2.1 Importance-Weighting Techniques for Covariate Shift Adaptation 22

2.1.1 Importance-Weighted ERM 22

2.1.2 Adaptive IWERM 23

2.1.3 Regularized IWERM 23

2.2 Examples of Importance-Weighted Regression Methods 25

2.2.1 Squared Loss: Least-Squares Regression 26

2.2.2 Absolute Loss: Least-Absolute Regression 30

2.2.3 Huber Loss: Huber Regression 31

2.2.4 Deadzone-Linear Loss: Support Vector Regression 33

2.3 Examples of Importance-Weighted Classification Methods 35

2.3.1 Squared Loss: Fisher Discriminant Analysis 36

2.3.2 Logistic Loss: Logistic Regression Classifier 38

2.3.3 Hinge Loss: Support Vector Machine 39

2.3.4 Exponential Loss: Boosting 40

2.4 Numerical Examples 40

2.4.1 Regression 40

2.4.2 Classification 41

2.5 Summary and Discussion 45

3 Model Selection 47

3.1 Importance-Weighted Akaike Information Criterion 47

3.2 Importance-Weighted Subspace Information Criterion 50

3.2.1 Input Dependence vs. Input Independence in Generalization Error Analysis 51

3.2.2 Approximately Correct Models 53

3.2.3 Input-Dependent Analysis of Generalization Error 54

3.3 Importance-Weighted Cross-Validation 64

3.4 Numerical Examples 66

3.4.1 Regression 66

3.4.2 Classification 69

3.5 Summary and Discussion 70

4 Importance Estimation 73

4.1 Kernel Density Estimation 73

4.2 Kernel Mean Matching 75

4.3 Logistic Regression 76

4.4 Kullback-Leibler Importance Estimation Procedure 78

4.4.1 Algorithm 78

4.4.2 Model Selection by Cross-Validation 81

4.4.3 Basis Function Design 82

4.5 Least-Squares Importance Fitting 83

4.5.1 Algorithm 83

4.5.2 Basis Function Design and Model Selection 84

4.5.3 Regularization Path Tracking 85

4.6 Unconstrained Least-Squares Importance Fitting 87

4.6.1 Algorithm 87

4.6.2 Analytic Computation of Leave-One-Out Cross-Validation 88

4.7 Numerical Examples 88

4.7.1 Setting 90

4.7.2 Importance Estimation by KLIEP 90

4.7.3 Covariate Shift Adaptation by IWLS and IWCV 92

4.8 Experimental Comparison 94

4.9 Summary 101

5 Direct Density-Ratio Estimation with Dimensionality Reduction 103

5.1 Density Difference in Hetero-Distributional Subspace 103

5.2 Characterization of Hetero-Distributional Subspace 104

5.3 Identifying Hetero-Distributional Subspace 106

5.3.1 Basic Idea 106

5.3.2 Fisher Discriminant Analysis 108

5.3.3 Local Fisher Discriminant Analysis 109

5.4 Using LFDA for Finding Hetero-Distributional Subspace 112

5.5 Density-Ratio Estimation in the Hetero-Distributional Subspace 113

5.6 Numerical Examples 113

5.6.1 Illustrative Example 113

5.6.2 Performance Comparison Using Artificial Data Sets 117

5.7 Summary 121

6 Relation to Sample Selection Bias 125

6.1 Heckman's Sample Selection Model 125

6.2 Distributional Change and Sample Selection Bias 129

6.3 The Two-Step Algorithm 131

6.4 Relation to Covariate Shift Approach 134

7 Applications of Covariate Shift Adaptation 137

7.1 Brain-Computer Interface 137

7.1.1 Background 137

7.1.2 Experimental Setup 138

7.1.3 Experimental Results 140

7.2 Speaker Identification 142

7.2.1 Background 142

7.2.2 Formulation 142

7.2.3 Experimental Results 144

7.3 Natural Language Processing 149

7.3.1 Formulation 149

7.3.2 Experimental Results 151

7.4 Perceived Age Prediction from Face Images 152

7.4.1 Background 152

7.4.2 Formulation 153

7.4.3 Incorporating Characteristics of Human Age Perception 153

7.4.4 Experimental Results 155

7.5 Human Activity Recognition from Accelerometric Data 157

7.5.1 Background 157

7.5.2 Importance-Weighted Least-Squares Probabilistic Classifier 157

7.5.3 Experimental Results 160

7.6 Sample Reuse in Reinforcement Learning 165

7.6.1 Markov Decision Problems 165

7.6.2 Policy Iteration 166

7.6.3 Value Function Approximation 167

7.6.4 Sample Reuse by Covariate Shift Adaptation 168

7.6.5 On-Policy vs. Off-Policy 169

7.6.6 Importance Weighting in Value Function Approximation 170

7.6.7 Automatic Selection of the Flattening Parameter 174

7.6.8 Sample Reuse Policy Iteration 175

7.6.9 Robot Control Experiments 176

III Learning Causing Covariate Shift

8 Active Learning 183

8.1 Preliminaries 183

8.1.1 Setup 183

8.1.2 Decomposition of Generalization Error 185

8.1.3 Basic Strategy of Active Learning 188

8.2 Population-Based Active Learning Methods 188

8.2.1 Classical Method of Active Learning for Correct Models 189

8.2.2 Limitations of Classical Approach and Countermeasures 190

8.2.3 Input-Independent Variance-Only Method 191

8.2.4 Input-Dependent Variance-Only Method 193

8.2.5 Input-Independent Bias-and-Variance Approach 195

8.3 Numerical Examples of Population-Based Active Learning Methods 198

8.3.1 Setup 198

8.3.2 Accuracy of Generalization Error Estimation 200

8.3.3 Obtained Generalization Error 202

8.4 Pool-Based Active Learning Methods 204

8.4.1 Classical Active Learning Method for Correct Models and Its Limitations 204

8.4.2 Input-Independent Variance-Only Method 205

8.4.3 Input-Dependent Variance-Only Method 206

8.4.4 Input-Independent Bias-and-Variance Approach 207

8.5 Numerical Examples of Pool-Based Active Learning Methods 209

8.6 Summary and Discussion 212

9 Active Learning with Model Selection 215

9.1 Direct Approach and the Active Learning/Model Selection Dilemma 215

9.2 Sequential Approach 216

9.3 Batch Approach 218

9.4 Ensemble Active Learning 219

9.5 Numerical Examples 220

9.5.1 Setting 220

9.5.2 Analysis of Batch Approach 221

9.5.3 Analysis of Sequential Approach 222

9.5.4 Comparison of Obtained Generalization Error 222

9.6 Summary and Discussion 223

10 Applications of Active Learning 225

10.1 Design of Efficient Exploration Strategies in Reinforcement Learning 225

10.1.1 Efficient Exploration with Active Learning 225

10.1.2 Reinforcement Learning Revisited 226

10.1.3 Decomposition of Generalization Error 228

10.1.4 Estimating Generalization Error for Active Learning 229

10.1.5 Designing Sampling Policies 230

10.1.6 Active Learning in Policy Iteration 231

10.1.7 Robot Control Experiments 232

10.2 Wafer Alignment in Semiconductor Exposure Apparatus 234

IV Conclusions

11 Conclusions and Future Prospects 241

11.1 Conclusions 241

11.2 Future Prospects 242

Appendix: List of Symbols and Abbreviations 243

Bibliography 247

Index 259

From the Publisher

Though important in practice and conceptually intriguing, the topic of covariate shift adaptation has only recently begun to attract significant attention in machine learning. Building on their sample reweighting methods, the authors assay a core problem of robust empirical inference. This timely book should be recommended to researchers and practitioners in a range of disciplines.

Bernhard Schölkopf, Max Planck Institute for Intelligent Systems

In machine learning we often assume that the characteristics of the data used to design a system will remain the same once the system is deployed. When this assumption is violated, and it does happen often, a system's accuracy may suffer significantly. This book provides the first in-depth look at how one can prepare for and cope with a frequently occurring instance of the above problem (covariate shift) both from theoretical and practical perspectives.

Neil Rubens, University of Electro-Communications, Japan

Written by two active researchers in the area, this book provides a highly accessible and self-contained exposition to some of the most important and recent advancements for tackling the covariate-shift problem. Students, researchers, and practitioners in related fields will benefit greatly from its huge collection of algorithms, numerical examples, and real-life applications.

Lihong Li, Yahoo! Research

This book provides a clear and practical guide to the problem of learning when the training and test data are drawn from different distributions. Of particular value are the many worked examples, illustrating the operation of the described techniques on real-life problems, and demonstrating their strengths, limitations, and areas of application.

Arthur Gretton, Gatsby Computational Neuroscience Unit, CSML, University College London
