
The Minimum Description Length Principle

by Peter D. Grünwald

ISBN-10: 0262072815

ISBN-13: 9780262072816

Pub. Date: 03/23/2007

Publisher: MIT Press

The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well suited for model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex and overfitting the data is a serious concern.

This extensive, step-by-step introduction to the MDL principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining; to philosophers interested in the foundations of statistics; and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
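The "best explanation compresses the data most" idea can be made concrete with the crude two-part form of MDL that the book's Part I introduces: the total codelength of a hypothesis is the bits needed to describe the hypothesis plus the bits needed to describe the data given the hypothesis, and we prefer the hypothesis minimizing that sum. The sketch below is illustrative only (function names and the 8-bit parameter precision are assumptions, not from the book): it compares a Bernoulli model against a plain uniform code for a binary sequence.

```python
import math

def bernoulli_codelength(data, precision_bits=8):
    """Crude two-part codelength: first encode the Bernoulli parameter
    theta to a fixed precision, then encode the data using theta."""
    n = len(data)
    ones = sum(data)
    theta = ones / n
    if theta in (0.0, 1.0):
        data_bits = 0.0  # degenerate case: data is fully determined by theta
    else:
        # Shannon codelength of the data under Bernoulli(theta), in bits
        data_bits = -(ones * math.log2(theta) + (n - ones) * math.log2(1 - theta))
    return precision_bits + data_bits

def uniform_codelength(data):
    """Baseline: one bit per symbol, no model to encode."""
    return float(len(data))

# A highly regular sequence compresses well under the Bernoulli model,
# so two-part MDL prefers it over the uniform code.
regular = [1] * 95 + [0] * 5
print(bernoulli_codelength(regular) < uniform_codelength(regular))  # True
```

For data with no exploitable regularity (a fair-coin sequence), the Bernoulli model's parameter cost buys almost nothing, and the uniform code wins; this is the sense in which MDL guards against overfitting.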

Product Details

Publisher: MIT Press
Publication date: 03/23/2007
Series: Adaptive Computation and Machine Learning series
Edition description: New Edition
Product dimensions: 7.00(w) x 9.00(h) x 1.50(d)
Age Range: 18 Years

Table of Contents

List of Figures     xix
Series Foreword     xxi
Foreword     xxiii
Preface     xxv
Introductory Material     1
Learning, Regularity, and Compression     3
Regularity and Learning     4
Regularity and Compression     4
Solomonoff's Breakthrough - Kolmogorov Complexity     8
Making the Idea Applicable     10
Crude MDL, Refined MDL and Universal Coding     12
From Crude to Refined MDL     14
Universal Coding and Refined MDL     17
Refined MDL for Model Selection     18
Refined MDL for Prediction and Hypothesis Selection     20
Some Remarks on Model Selection     23
Model Selection among Non-Nested Models     23
Goals of Model vs. Point Hypothesis Selection     25
The MDL Philosophy     26
MDL, Occam's Razor, and the "True Model"     29
Answer to Criticism No. 1     30
Answer to Criticism No. 2     32
History and Forms of MDL     36
What Is MDL?     37
MDL Literature     38
Summary and Outlook     40
Probabilistic and Statistical Preliminaries     41
General Mathematical Preliminaries     41
Probabilistic Preliminaries     46
Definitions; Notational Conventions     46
Probabilistic Sources     53
Limit Theorems and Statements     55
Probabilistic Models     57
Probabilistic Model Classes     60
Kinds of Probabilistic Models     62
Terminological Preliminaries     69
Modeling Preliminaries: Goals and Methods for Inductive Inference     71
Consistency     71
Basic Concepts of Bayesian Statistics     74
Summary and Outlook     78
Information-Theoretic Preliminaries     79
Coding Preliminaries     79
Restriction to Prefix Coding Systems; Descriptions as Messages     83
Different Kinds of Codes     86
Assessing the Efficiency of Description Methods     90
The Most Important Section of This Book: Probabilities and Code Lengths     90
The Kraft Inequality     91
Code Lengths "Are" Probabilities     95
Immediate Insights and Consequences     99
Probabilities and Code Lengths, Part II     101
(Relative) Entropy and the Information Inequality     103
Uniform Codes, Maximum Entropy, and Minimax Codelength     106
Summary, Outlook, Further Reading     106
Information-Theoretic Properties of Statistical Models     109
Introduction     109
Likelihood and Observed Fisher Information     111
KL Divergence and Expected Fisher Information     117
Maximum Likelihood: Data vs. Parameters     124
Summary and Outlook     130
Crude Two-Part Code MDL     131
Introduction: Making Two-Part MDL Precise     132
Two-Part Code MDL for Markov Chain Selection     133
The Code C[subscript 2]     135
The Code C[subscript 1]     137
Crude Two-Part Code MDL for Markov Chains     138
Simplistic Two-Part Code MDL Hypothesis Selection     139
Two-Part MDL for Tasks Other Than Hypothesis Selection     141
Behavior of Two-Part Code MDL     142
Two-Part Code MDL and Maximum Likelihood     144
The Maximum Likelihood Principle     144
MDL vs. ML     147
MDL as a Maximum Probability Principle     148
Computing and Approximating Two-Part MDL in Practice     150
Justifying Crude MDL: Consistency and Code Design     152
A General Consistency Result      153
Code Design for Two-Part Code MDL     157
Summary and Outlook     163
Appendix: Proof of Theorem 5.1     163
Universal Coding     165
Universal Coding with Countable Models     171
Universal Coding: The Basic Idea     172
Two-part Codes as Simple Universal Codes     174
From Universal Codes to Universal Models     175
Formal Definition of Universality     177
The Finite Case     178
Minimax Regret and Normalized ML     179
NML vs. Two-Part vs. Bayes     182
The Countably Infinite Case     184
The Two-Part and Bayesian Codes     184
The NML Code     187
Prequential Universal Models     190
Distributions as Prediction Strategies     190
Bayes Is Prequential; NML and Two-part Are Not     193
The Prequential Plug-In Model     197
Individual vs. Stochastic Universality     199
Stochastic Redundancy     199
Uniformly Universal Models     201
Summary, Outlook and Further Reading     204
Parametric Models: Normalized Maximum Likelihood     207
Introduction     207
Preliminaries      208
Asymptotic Expansion of Parametric Complexity     211
The Meaning of [Characters not reproducible]     216
Complexity and Functional Form     217
KL Divergence and Distinguishability     219
Complexity and Volume     222
Complexity and the Number of Distinguishable Distributions     224
Explicit and Simplified Computations     226
Parametric Models: Bayes     231
The Bayesian Regret     231
Basic Interpretation of Theorem 8.1     233
Bayes Meets Minimax - Jeffreys' Prior     234
Jeffreys' Prior and the Boundary     237
How to Prove the Bayesian and NML Regret Theorems     239
Proof Sketch of Theorem 8.1     239
Beyond Exponential Families     241
Proof Sketch of Theorem 7.1     243
Stochastic Universality     244
Appendix: Proofs of Theorem 8.1 and Theorem 8.2     248
Parametric Models: Prequential Plug-in     257
Prequential Plug-in for Exponential Families     257
The Plug-in vs. the Bayes Universal Model     262
More Precise Asymptotics     265
Summary     269
Parametric Models: Two-Part     271
The Ordinary Two-Part Universal Model     271
Derivation of the Two-Part Code Regret     274
Proof Sketch of Theorem 10.1     277
Discussion     282
The Conditional Two-Part Universal Code     284
Conditional Two-Part Codes for Discrete Exponential Families     286
Distinguishability and the Phase Transition     290
Summary and Outlook     293
NML With Infinite Complexity     295
Introduction     295
Examples of Undefined NML Distribution     298
Examples of Undefined Jeffreys' Prior     299
Metauniversal Codes     301
Constrained Parametric Complexity     302
Meta-Two-Part Coding     303
Renormalized Maximum Likelihood     306
NML with Luckiness     308
Asymptotic Expansion of LNML     312
Conditional Universal Models     316
Bayesian Approach with Jeffreys' Prior     317
Conditional NML     320
Liang and Barron's Approach     325
Summary and Remarks     329
Appendix: Proof of Theorem 11.4     329
Linear Regression     335
Introduction     336
Prelude: The Normal Location Family      338
Least-Squares Estimation     340
The Normal Equations     342
Composition of Experiments     345
Penalized Least-Squares     346
The Linear Model     348
Bayesian Linear Model M[superscript X] with Gaussian Prior     354
Bayesian Linear Models M[superscript X] and S[superscript X] with Noninformative Priors     359
Universal Models for Linear Regression     363
NML     363
Bayes and LNML     364
Bayes-Jeffreys and CNML     365
Beyond Parametrics     369
Introduction     370
CUP: Unions of Parametric Models     372
CUP vs. Parametric Models     375
Universal Codes Based on Histograms     376
Redundancy of Universal CUP Histogram Codes     380
Nonparametric Redundancy     383
Standard CUP Universal Codes     384
Minimax Nonparametric Redundancy     387
Gaussian Process Regression     390
Kernelization of Bayesian Linear Regression     390
Gaussian Processes     394
Gaussian Processes as Universal Models     396
Conclusion and Further Reading     402
Refined MDL     403
MDL Model Selection     409
Introduction     409
Simple Refined MDL Model Selection     411
Compression Interpretation     415
Counting Interpretation     416
Bayesian Interpretation     418
Prequential Interpretation     419
General Parametric Model Selection     420
Models with Infinite Complexities     420
Comparing Many or Infinitely Many Models     422
The General Picture     425
Practical Issues in MDL Model Selection     428
Calculating Universal Codelengths     428
Computational Efficiency and Practical Quality of Non-NML Universal Codes     429
Model Selection with Conditional NML and Plug-in Codes     431
General Warnings about Model Selection     435
MDL Model Selection for Linear Regression     438
Rissanen's RNML Approach     439
Hansen and Yu's gMDL Approach     443
Liang and Barron's Approach     446
Discussion     448
Worst Case vs. Average Case     451
MDL Prediction and Estimation     459
Introduction     459
MDL for Prediction and Predictive Estimation     460
Prequential MDL Estimators      461
Prequential MDL Estimators Are Consistent     465
Parametric and Nonparametric Examples     469
Cesàro KL Consistency vs. KL Consistency     472
Two-Part Code MDL for Point Hypothesis Selection     476
Discussion of Two-Part Consistency Theorem     478
MDL Parameter Estimation     483
MDL Estimators vs. Luckiness ML Estimators     487
What Estimator To Use?     491
Comparison to Bayesian Estimators     493
Summary and Outlook     498
Appendix: Proof of Theorem 15.3     499
MDL Consistency and Convergence     501
Introduction     501
The Scenarios Considered     501
Consistency: Prequential and Two-Part MDL Estimators     502
Consistency: MDL Model Selection     505
Selection between a Union of Parametric Models     505
Nonparametric Model Selection Based on CUP Model Class     508
MDL Consistency Peculiarities     511
Risks and Rates     515
Relations between Divergences and Risk Measures     517
Minimax Rates     519
MDL Rates of Convergence     520
Prequential and Two-Part MDL Estimators     520
MDL Model Selection     522
MDL in Context     523
MDL and Frequentist Paradigms     524
Sanity Check or Design Principle?     525
The Weak Prequential Principle     528
MDL vs. Frequentist Principles: Remaining Issues     529
MDL and Bayesian Inference     531
Luckiness Functions vs. Prior Distributions     534
MDL, Bayes, and Occam     539
MDL and Brands of Bayesian Statistics     544
Conclusion: a Common Future after All?     548
MDL, AIC and BIC     549
BIC     549
AIC     550
Combining the Best of AIC and BIC     552
MDL and MML     555
Strict Minimum Message Length     556
Comparison to MDL     558
The Wallace-Freeman Estimator     560
MDL and Prequential Analysis     562
MDL and Cross-Validation     565
MDL and Maximum Entropy     567
Kolmogorov Complexity and Structure Function     570
MDL and Individual Sequence Prediction     573
MDL and Statistical Learning Theory     579
Structural Risk Minimization     581
PAC-Bayesian Approaches     585
PAC-Bayes and MDL     588
The Road Ahead     592
Additional Background     597
The Exponential or "Maximum Entropy" Families     599
Introduction     600
Definition and Overview     601
Basic Properties     605
Mean-Value, Canonical, and Other Parameterizations     609
The Mean Value Parameterization     609
Other Parameterizations     611
Relating Mean-Value and Canonical Parameters     613
Exponential Families of General Probabilistic Sources     617
Fisher Information Definitions and Characterizations     619
Information-Theoretic Properties of Exponential Families     623
Introduction     624
Robustness of Exponential Family Codes     624
If [Characters not reproducible][subscript mean] Does Not Contain the Mean     627
Behavior at the ML Estimate [Beta]     629
Behavior of the ML Estimate [Beta]     632
Central Limit Theorem     633
Large Deviations     634
Maximum Entropy and Minimax Codelength     637
Exponential Families and Maximum Entropy     638
Exponential Families and Minimax Codelength     641
The Compression Game      643
Likelihood Ratio Families and Renyi Divergences     645
The Likelihood Ratio Family     647
Summary     650
References     651
List of Symbols     675
Subject Index     679
