Table of Contents
1 Introduction 1
1.1 A brief history 1
1.1.1 Probabilistic methods in computational linguistics 1
1.1.2 Supervised and unsupervised training 2
1.1.3 Semisupervised learning 3
1.2 Semisupervised learning 4
1.2.1 Major varieties of learning problem 4
1.2.2 Motivation 6
1.2.3 Evaluation 7
1.2.4 Active learning 8
1.3 Organization and assumptions 8
1.3.1 Leading ideas 8
1.3.2 Mathematical background 10
1.3.3 Notation 11
2 Self-training and Co-training 13
2.1 Classification 13
2.1.1 The standard setting 13
2.1.2 Features and rules 14
2.1.3 Decision lists 16
2.2 Self-training 18
2.2.1 The algorithm 19
2.2.2 Parameters and variants 20
2.2.3 Evaluation 23
2.2.4 Symmetry of features and instances 25
2.2.5 Related algorithms 27
2.3 Co-training 28
3 Applications of Self-Training and Co-Training 31
3.1 Part-of-speech tagging 31
3.2 Information extraction 33
3.3 Parsing 35
3.4 Word senses 36
3.4.1 WordNet 36
3.4.2 Word-sense disambiguation 38
3.4.3 Taxonomic inference 40
4 Classification 43
4.1 Two simple classifiers 43
4.1.1 Naive Bayes 43
4.1.2 k-nearest-neighbor classifier 45
4.2 Abstract setting 48
4.2.1 Function approximation 48
4.2.2 Defining success 50
4.2.3 Fit and simplicity 52
4.3 Evaluating detectors and classifiers that abstain 53
4.3.1 Confidence-rated classifiers 53
4.3.2 Measures for detection 54
4.3.3 Idealized performance curves 57
4.3.4 The multiclass case 59
4.4 Binary classifiers and ECOC 62
5 Mathematics for Boundary-Oriented Methods 67
5.1 Linear separators 67
5.1.1 Representing a hyperplane 67
5.1.2 Eliminating the threshold 69
5.1.3 The point-normal form 70
5.1.4 Naive Bayes decision boundary 72
5.2 The gradient 74
5.2.1 Graphs and domains 74
5.2.2 Convexity 76
5.2.3 Differentiation of vector and matrix expressions 79
5.2.4 An example: linear regression 81
5.3 Constrained optimization 83
5.3.1 Optimization 83
5.3.2 Equality constraints 84
5.3.3 Inequality constraints 87
5.3.4 The Wolfe dual 91
6 Boundary-Oriented Methods 95
6.1 The perceptron 97
6.1.1 The algorithm 97
6.1.2 An example 99
6.1.3 Convergence 100
6.1.4 The perceptron algorithm as gradient descent 101
6.2 Game self-teaching 103
6.3 Boosting 105
6.3.1 Abstention 110
6.3.2 Semisupervised boosting 111
6.3.3 Co-boosting 113
6.4 Support Vector Machines (SVMs) 114
6.4.1 The margin 114
6.4.2 Maximizing the margin 116
6.4.3 The nonseparable case 119
6.4.4 Slack in the separable case 121
6.4.5 Multiple slack points 123
6.4.6 Transductive SVMs 125
6.4.7 Training a transductive SVM 127
6.5 Null-category noise model 129
7 Clustering 131
7.1 Cluster and label 131
7.2 Clustering concepts 132
7.2.1 Objective 132
7.2.2 Distance and similarity 133
7.2.3 Graphs 136
7.3 Hierarchical clustering 137
7.4 Self-training revisited 139
7.4.1 k-means clustering 139
7.4.2 Pseudo relevance feedback 140
7.5 Graph mincut 143
7.6 Label propagation 146
7.6.1 Clustering by propagation 146
7.6.2 Self-training as propagation 147
7.6.3 Co-training as propagation 150
7.7 Bibliographic notes 152
8 Generative Models 153
8.1 Gaussian mixtures 153
8.1.1 Definition and geometric interpretation 153
8.1.2 The linear discriminant decision boundary 156
8.1.3 Decision-directed approximation 159
8.1.4 McLachlan's algorithm 162
8.2 The EM algorithm 163
8.2.1 Maximizing likelihood 163
8.2.2 Relative frequency estimation 164
8.2.3 Divergence 166
8.2.4 The EM algorithm 169
9 Agreement Constraints 175
9.1 Co-training 175
9.1.1 The conditional independence assumption 176
9.1.2 The power of conditional independence 178
9.2 Agreement-based self-teaching 182
9.3 Random fields 184
9.3.1 Applied to self-training and co-training 184
9.3.2 Gibbs sampling 186
9.3.3 Markov chains and random walks 187
9.4 Bibliographic notes 192
10 Propagation Methods 193
10.1 Label propagation 194
10.2 Random walks 196
10.3 Harmonic functions 198
10.4 Fluids 203
10.4.1 Flow 203
10.4.2 Pressure 205
10.4.3 Conservation of energy 209
10.4.4 Thomson's principle 210
10.5 Computing the solution 213
10.6 Graph mincuts revisited 215
10.7 Bibliographic notes 220
11 Mathematics for Spectral Methods 221
11.1 Some basic concepts 221
11.1.1 The norm of a vector 221
11.1.2 Matrices as linear operators 222
11.1.3 The column space 222
11.2 Eigenvalues and eigenvectors 224
11.2.1 Definition of eigenvalues and eigenvectors 224
11.2.2 Diagonalization 225
11.2.3 Orthogonal diagonalization 226
11.3 Eigenvalues and the scaling effects of a matrix 227
11.3.1 Matrix norms 227
11.3.2 The Rayleigh quotient 228
11.3.3 The 2×2 case 230
11.3.4 The general case 232
11.3.5 The Courant-Fischer minimax theorem 234
11.4 Bibliographic notes 236
12 Spectral Methods 237
12.1 Simple harmonic motion 237
12.1.1 Harmonics 237
12.1.2 Mixtures of harmonics 239
12.1.3 An oscillating particle 241
12.1.4 A vibrating string 243
12.2 Spectra of matrices and graphs 251
12.2.1 The spectrum of a matrix 252
12.2.2 Relating matrices and graphs 253
12.2.3 The Laplacian matrix and graph spectrum 256
12.3 Spectral clustering 257
12.3.1 The second smallest eigenvector of the Laplacian 257
12.3.2 The cut size and the Laplacian 259
12.3.3 Approximating cut size 260
12.3.4 Minimizing cut size 262
12.3.5 Ratiocut 263
12.4 Spectral methods for semisupervised learning 265
12.4.1 Harmonics and harmonic functions 265
12.4.2 Eigenvalues and energy 267
12.4.3 The Laplacian and random fields 268
12.4.4 Harmonic functions and the Laplacian 270
12.4.5 Using the Laplacian for regularization 272
12.4.6 Transduction to induction 274
12.5 Bibliographic notes 275
Bibliography 277
Index 301