The Minimum Description Length Principle (Paperback)

- ISBN-10: 0262529637
- ISBN-13: 9780262529631
- Pub. Date: 03/23/2007
- Publisher: MIT Press

Overview
The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern. This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology.
Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
Product Details
- ISBN-13: 9780262529631
- Publisher: MIT Press
- Publication date: 03/23/2007
- Series: Adaptive Computation and Machine Learning series
- Pages: 736
- Product dimensions: 6.90(w) x 9.10(h) x 1.60(d)
- Age Range: 18 Years
Table of Contents
List of Figures xix
Series Foreword xxi
Foreword xxiii
Preface xxv
Introductory Material 1
Learning, Regularity, and Compression 3
Regularity and Learning 4
Regularity and Compression 4
Solomonoff's Breakthrough - Kolmogorov Complexity 8
Making the Idea Applicable 10
Crude MDL, Refined MDL and Universal Coding 12
From Crude to Refined MDL 14
Universal Coding and Refined MDL 17
Refined MDL for Model Selection 18
Refined MDL for Prediction and Hypothesis Selection 20
Some Remarks on Model Selection 23
Model Selection among Non-Nested Models 23
Goals of Model vs. Point Hypothesis Selection 25
The MDL Philosophy 26
MDL, Occam's Razor, and the "True Model" 29
Answer to Criticism No. 1 30
Answer to Criticism No. 2 32
History and Forms of MDL 36
What Is MDL? 37
MDL Literature 38
Summary and Outlook 40
Probabilistic and Statistical Preliminaries 41
General Mathematical Preliminaries 41
Probabilistic Preliminaries 46
Definitions; Notational Conventions 46
Probabilistic Sources 53
Limit Theorems and Statements 55
Probabilistic Models 57
Probabilistic Model Classes 60
Kinds of Probabilistic Models 62
Terminological Preliminaries 69
Modeling Preliminaries: Goals and Methods for Inductive Inference 71
Consistency 71
Basic Concepts of Bayesian Statistics 74
Summary and Outlook 78
Information-Theoretic Preliminaries 79
Coding Preliminaries 79
Restriction to Prefix Coding Systems; Descriptions as Messages 83
Different Kinds of Codes 86
Assessing the Efficiency of Description Methods 90
The Most Important Section of This Book: Probabilities and Code Lengths 90
The Kraft Inequality 91
Code Lengths "Are" Probabilities 95
Immediate Insights and Consequences 99
Probabilities and Code Lengths, Part II 101
(Relative) Entropy and the Information Inequality 103
Uniform Codes, Maximum Entropy, and Minimax Codelength 106
Summary, Outlook, Further Reading 106
Information-Theoretic Properties of Statistical Models 109
Introduction 109
Likelihood and Observed Fisher Information 111
KL Divergence and Expected Fisher Information 117
Maximum Likelihood: Data vs. Parameters 124
Summary and Outlook 130
Crude Two-Part Code MDL 131
Introduction: Making Two-Part MDL Precise 132
Two-Part Code MDL for Markov Chain Selection 133
The Code C₂ 135
The Code C₁ 137
Crude Two-Part Code MDL for Markov Chains 138
Simplistic Two-Part Code MDL Hypothesis Selection 139
Two-Part MDL for Tasks Other Than Hypothesis Selection 141
Behavior of Two-Part Code MDL 142
Two-Part Code MDL and Maximum Likelihood 144
The Maximum Likelihood Principle 144
MDL vs. ML 147
MDL as a Maximum Probability Principle 148
Computing and Approximating Two-Part MDL in Practice 150
Justifying Crude MDL: Consistency and Code Design 152
A General Consistency Result 153
Code Design for Two-Part Code MDL 157
Summary and Outlook 163
Appendix: Proof of Theorem 5.1 163
Universal Coding 165
Universal Coding with Countable Models 171
Universal Coding: The Basic Idea 172
Two-part Codes as Simple Universal Codes 174
From Universal Codes to Universal Models 175
Formal Definition of Universality 177
The Finite Case 178
Minimax Regret and Normalized ML 179
NML vs. Two-Part vs. Bayes 182
The Countably Infinite Case 184
The Two-Part and Bayesian Codes 184
The NML Code 187
Prequential Universal Models 190
Distributions as Prediction Strategies 190
Bayes Is Prequential; NML and Two-part Are Not 193
The Prequential Plug-In Model 197
Individual vs. Stochastic Universality 199
Stochastic Redundancy 199
Uniformly Universal Models 201
Summary, Outlook and Further Reading 204
Parametric Models: Normalized Maximum Likelihood 207
Introduction 207
Preliminaries 208
Asymptotic Expansion of Parametric Complexity 211
The Meaning of COMP(M) 216
Complexity and Functional Form 217
KL Divergence and Distinguishability 219
Complexity and Volume 222
Complexity and the Number of Distinguishable Distributions 224
Explicit and Simplified Computations 226
Parametric Models: Bayes 231
The Bayesian Regret 231
Basic Interpretation of Theorem 8.1 233
Bayes Meets Minimax - Jeffreys' Prior 234
Jeffreys' Prior and the Boundary 237
How to Prove the Bayesian and NML Regret Theorems 239
Proof Sketch of Theorem 8.1 239
Beyond Exponential Families 241
Proof Sketch of Theorem 7.1 243
Stochastic Universality 244
Appendix: Proofs of Theorem 8.1 and Theorem 8.2 248
Parametric Models: Prequential Plug-in 257
Prequential Plug-in for Exponential Families 257
The Plug-in vs. the Bayes Universal Model 262
More Precise Asymptotics 265
Summary 269
Parametric Models: Two-Part 271
The Ordinary Two-Part Universal Model 271
Derivation of the Two-Part Code Regret 274
Proof Sketch of Theorem 10.1 277
Discussion 282
The Conditional Two-Part Universal Code 284
Conditional Two-Part Codes for Discrete Exponential Families 286
Distinguishability and the Phase Transition 290
Summary and Outlook 293
NML With Infinite Complexity 295
Introduction 295
Examples of Undefined NML Distribution 298
Examples of Undefined Jeffreys' Prior 299
Metauniversal Codes 301
Constrained Parametric Complexity 302
Meta-Two-Part Coding 303
Renormalized Maximum Likelihood 306
NML with Luckiness 308
Asymptotic Expansion of LNML 312
Conditional Universal Models 316
Bayesian Approach with Jeffreys' Prior 317
Conditional NML 320
Liang and Barron's Approach 325
Summary and Remarks 329
Appendix: Proof of Theorem 11.4 329
Linear Regression 335
Introduction 336
Prelude: The Normal Location Family 338
Least-Squares Estimation 340
The Normal Equations 342
Composition of Experiments 345
Penalized Least-Squares 346
The Linear Model 348
Bayesian Linear Model M^X with Gaussian Prior 354
Bayesian Linear Models M^X and S^X with Noninformative Priors 359
Universal Models for Linear Regression 363
NML 363
Bayes and LNML 364
Bayes-Jeffreys and CNML 365
Beyond Parametrics 369
Introduction 370
CUP: Unions of Parametric Models 372
CUP vs. Parametric Models 375
Universal Codes Based on Histograms 376
Redundancy of Universal CUP Histogram Codes 380
Nonparametric Redundancy 383
Standard CUP Universal Codes 384
Minimax Nonparametric Redundancy 387
Gaussian Process Regression 390
Kernelization of Bayesian Linear Regression 390
Gaussian Processes 394
Gaussian Processes as Universal Models 396
Conclusion and Further Reading 402
Refined MDL 403
MDL Model Selection 409
Introduction 409
Simple Refined MDL Model Selection 411
Compression Interpretation 415
Counting Interpretation 416
Bayesian Interpretation 418
Prequential Interpretation 419
General Parametric Model Selection 420
Models with Infinite Complexities 420
Comparing Many or Infinitely Many Models 422
The General Picture 425
Practical Issues in MDL Model Selection 428
Calculating Universal Codelengths 428
Computational Efficiency and Practical Quality of Non-NML Universal Codes 429
Model Selection with Conditional NML and Plug-in Codes 431
General Warnings about Model Selection 435
MDL Model Selection for Linear Regression 438
Rissanen's RNML Approach 439
Hansen and Yu's gMDL Approach 443
Liang and Barron's Approach 446
Discussion 448
Worst Case vs. Average Case 451
MDL Prediction and Estimation 459
Introduction 459
MDL for Prediction and Predictive Estimation 460
Prequential MDL Estimators 461
Prequential MDL Estimators Are Consistent 465
Parametric and Nonparametric Examples 469
Cesàro KL Consistency vs. KL Consistency 472
Two-Part Code MDL for Point Hypothesis Selection 476
Discussion of Two-Part Consistency Theorem 478
MDL Parameter Estimation 483
MDL Estimators vs. Luckiness ML Estimators 487
What Estimator To Use? 491
Comparison to Bayesian Estimators 493
Summary and Outlook 498
Appendix: Proof of Theorem 15.3 499
MDL Consistency and Convergence 501
Introduction 501
The Scenarios Considered 501
Consistency: Prequential and Two-Part MDL Estimators 502
Consistency: MDL Model Selection 505
Selection between a Union of Parametric Models 505
Nonparametric Model Selection Based on CUP Model Class 508
MDL Consistency Peculiarities 511
Risks and Rates 515
Relations between Divergences and Risk Measures 517
Minimax Rates 519
MDL Rates of Convergence 520
Prequential and Two-Part MDL Estimators 520
MDL Model Selection 522
MDL in Context 523
MDL and Frequentist Paradigms 524
Sanity Check or Design Principle? 525
The Weak Prequential Principle 528
MDL vs. Frequentist Principles: Remaining Issues 529
MDL and Bayesian Inference 531
Luckiness Functions vs. Prior Distributions 534
MDL, Bayes, and Occam 539
MDL and Brands of Bayesian Statistics 544
Conclusion: a Common Future after All? 548
MDL, AIC and BIC 549
BIC 549
AIC 550
Combining the Best of AIC and BIC 552
MDL and MML 555
Strict Minimum Message Length 556
Comparison to MDL 558
The Wallace-Freeman Estimator 560
MDL and Prequential Analysis 562
MDL and Cross-Validation 565
MDL and Maximum Entropy 567
Kolmogorov Complexity and Structure Function 570
MDL and Individual Sequence Prediction 573
MDL and Statistical Learning Theory 579
Structural Risk Minimization 581
PAC-Bayesian Approaches 585
PAC-Bayes and MDL 588
The Road Ahead 592
Additional Background 597
The Exponential or "Maximum Entropy" Families 599
Introduction 600
Definition and Overview 601
Basic Properties 605
Mean-Value, Canonical, and Other Parameterizations 609
The Mean Value Parameterization 609
Other Parameterizations 611
Relating Mean-Value and Canonical Parameters 613
Exponential Families of General Probabilistic Sources 617
Fisher Information Definitions and Characterizations 619
Information-Theoretic Properties of Exponential Families 623
Introduction 624
Robustness of Exponential Family Codes 624
If Θ_mean Does Not Contain the Mean 627
Behavior at the ML Estimate β̂ 629
Behavior of the ML Estimate β̂ 632
Central Limit Theorem 633
Large Deviations 634
Maximum Entropy and Minimax Codelength 637
Exponential Families and Maximum Entropy 638
Exponential Families and Minimax Codelength 641
The Compression Game 643
Likelihood Ratio Families and Rényi Divergences 645
The Likelihood Ratio Family 647
Summary 650
References 651
List of Symbols 675
Subject Index 679