Benford's law states that the leading digits of many data sets are not uniformly distributed from one through nine, but rather exhibit a profound bias. This bias is evident in everything from electricity bills and street addresses to stock prices, population numbers, mortality rates, and the lengths of rivers. Here, Steven Miller brings together many of the world’s leading experts on Benford’s law to demonstrate the many useful techniques that arise from the law, show how truly multidisciplinary it is, and encourage collaboration.
Beginning with the general theory, the contributors explain the prevalence of the bias, highlighting explanations for when systems should and should not follow Benford’s law and how quickly such behavior sets in. They go on to discuss important applications in disciplines ranging from accounting and economics to psychology and the natural sciences. The contributors describe how Benford’s law has been successfully used to expose fraud in elections, medical tests, tax filings, and financial reports. Additionally, numerous problems, background materials, and technical details are available online to help instructors create courses around the book.
Emphasizing common challenges and techniques across the disciplines, this accessible book shows how Benford’s law can serve as a productive meeting ground for researchers and practitioners in diverse fields.
Publisher: Princeton University Press
Product dimensions: 6.50(w) x 9.40(h) x 1.50(d)
About the Author
Steven J. Miller is associate professor of mathematics at Williams College. He is the coauthor of An Invitation to Modern Number Theory (Princeton).
Read an Excerpt
Benford's Law: Theory and Applications
By Steven J. Miller
PRINCETON UNIVERSITY PRESS
Copyright © 2015 Princeton University Press
All rights reserved.
A Quick Introduction to Benford's Law
Steven J. Miller
The history of Benford's Law is a fascinating and unexpected story of the interplay between theory and applications. From its beginnings in understanding the distribution of digits in tables of logarithms, the subject has grown enormously. Currently hundreds of papers are being written by accountants, computer scientists, engineers, mathematicians, statisticians and many others. In this chapter we start by stating Benford's Law of digit bias and describing its history. We discuss its origins and give numerous examples of data sets that follow this law, as well as some that do not. From these examples we extract several explanations as to the prevalence of Benford's Law, which are described in greater detail later in the book. We end by quickly summarizing many of the diverse situations in which Benford's Law holds, and why an observation that began in looking at the wear and tear in tables of logarithms has become a major tool in subjects as diverse as detecting tax fraud and building efficient computers. We then continue in the next chapters with rigorous derivations, and then launch into a survey of some of the many applications. In particular, in the next chapter we put Benford's Law on a solid foundation. There we explore several different categorizations of Benford's Law, and rigorously prove that certain systems satisfy these conditions.
We live in an age when we are constantly bombarded with massive amounts of data. Satellites orbiting the Earth daily transmit more information than is in the entire Library of Congress; researchers must quickly sort through these data sets to find the relevant pieces. It is thus not surprising that people are interested in patterns in data. One of the more interesting, and initially surprising, is Benford's Law on the distribution of the first or the leading digits.
In this chapter we concentrate on a mostly non-technical introduction to the subject, saving the details for later. Before we can describe the law, we must first set notation. At some point in secondary school, we are introduced to scientific notation: any positive number $x$ may be written as $S(x) \cdot 10^k$, where $S(x) \in [1, 10)$ is the significand and $k$ is an integer (called the exponent). The integer part of the significand is called the leading digit or the first digit. Some people prefer to call $S(x)$ the mantissa and not the significand; unfortunately this can lead to confusion, as the mantissa is the fractional part of the logarithm, and this quantity too will be important in our investigations. As always, examples help clarify the notation. The number 1701.24601 would be written as $1.70124601 \cdot 10^3$ in scientific notation. The significand is 1.70124601, the exponent is 3 and the leading digit is 1. If we take the logarithm base 10, we find $\log_{10} 1701.24601 \approx 3.2307671196444460726$, so the mantissa is approximately .2307671196444460726.
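The notation can be made concrete with a short Python sketch. It is an illustration only: the function name `scientific_parts` is hypothetical, and ordinary floating-point arithmetic limits the precision to roughly 15 significant digits.

```python
import math


def scientific_parts(x):
    """Split a positive number into (significand, exponent, leading digit, mantissa).

    significand lies in [1, 10); mantissa is the fractional part of log10(x).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    exponent = math.floor(math.log10(x))
    significand = x / 10.0 ** exponent
    # Floating-point rounding can push the significand just outside [1, 10);
    # nudge it back so the leading digit is always one of 1, ..., 9.
    if significand >= 10.0:
        significand /= 10.0
        exponent += 1
    elif significand < 1.0:
        significand *= 10.0
        exponent -= 1
    leading_digit = int(significand)
    mantissa = math.log10(significand)
    return significand, exponent, leading_digit, mantissa


significand, exponent, leading_digit, mantissa = scientific_parts(1701.24601)
print(significand, exponent, leading_digit, mantissa)
```

For 1701.24601 this gives an exponent of 3, a leading digit of 1, and a mantissa that agrees with the value above to floating-point accuracy.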
There are many advantages to studying the first digits of a data set. One reason is that it helps us compare apples and apples and not apples and oranges. By this we mean the following: two different data sets could have very different scales; one could be masses of subatomic particles while another could be closing stock prices. While the units are different and the magnitudes differ greatly, every number has a unique leading digit, and thus we can compare the distribution of the first digits of the two data sets.
The most natural guess would be to assert that for a generic data set, all numbers are equally likely to be the leading digit. We would then posit that we should observe about 11% of the time a leading digit of 1, 2, ..., 9 (note that we would guess each number occurs one-ninth of the time and not one-tenth of the time, as 0 is the leading digit for only one number, namely 0). The content of Benford's Law is that this is frequently not so; specifically, in many situations we expect the leading digit to be $d$ with probability approximately $\log_{10}\left(\frac{d+1}{d}\right)$, which means the probability of a first digit of 1 is about 30% while a first digit of 9 happens about 4.6% of the time.
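To see how far the Benford probabilities sit from the uniform guess of one-ninth, the nine values can be tabulated directly. This is a minimal sketch; the function name `benford_first_digit_probs` is chosen here for illustration.

```python
import math


def benford_first_digit_probs():
    """Return {d: log10((d + 1) / d)} for the leading digits d = 1, ..., 9."""
    return {d: math.log10((d + 1) / d) for d in range(1, 10)}


probs = benford_first_digit_probs()
uniform = 1 / 9
for d, p in probs.items():
    # Print each Benford probability next to the naive uniform guess.
    print(f"digit {d}: Benford {p:.4f}   uniform {uniform:.4f}")
```

The nine probabilities sum to 1 because the sum of the logarithms telescopes to the logarithm base 10 of 10.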
Though it is called Benford's Law, he was not the first to observe this digit bias. Our story begins with the astronomer–mathematician Simon Newcomb, who observed this behavior more than 50 years before Benford. Newcomb was born in Nova Scotia in 1835 and died in Washington, DC in 1909. In 1881 he published a short article in the American Journal of Mathematics, Note on the Frequency of Use of the Different Digits in Natural Numbers (see [New]). The article begins,
That the ten digits do not occur with equal frequency must be evident to any one making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9. The question naturally arises whether the reverse would be true of logarithms. That is, in a table of anti-logarithms, would the last part be more used than the first, or would every part be used equally? The law of frequency in the one case may be deduced from that in the other. The question we have to consider is, what is the probability that if a natural number be taken at random its first significant digit will be n, its second n', etc.
As natural numbers occur in nature, they are to be considered as the ratios of quantities. Therefore, instead of selecting a number at random, we must select two numbers, and inquire what is the probability that the first significant digit of their ratio is the digit n. To solve the problem we may form an indefinite number of such ratios, taken independently; and then must make the same inquiry respecting their quotients, and continue the process so as to find the limit towards which the probability approaches.
In this short article two very important properties of the distribution of digits are noted. The first is that all digits are not equally likely. The article ends with a quantification of how oftener the first digit is a 1 than a 9, with Newcomb stating,
The law of probability of the occurrence of numbers is such that all mantissæ of their logarithms are equally probable.
Specifically, Newcomb gives a table (see Table 1.1) for the probabilities of first and second digits.
The second key observation of his paper is noting the importance of scale. The numerical value of a physical quantity clearly depends on the scale used, and thus Newcomb suggests that the correct items to study are ratios of measurements.
The next step forward in studying the distribution of the leading digits of numbers was Frank Benford's The Law of Anomalous Numbers, published in the Proceedings of the American Philosophical Society in 1938 (see [Ben]). In addition to advancing explanations as to why digits have this distribution, he also presents some justification as to why this is a problem worthy of study.
It has been observed that the pages of a much used table of common logarithms show evidences of a selective use of the natural numbers. The pages containing the logarithms of the low numbers 1 and 2 are apt to be more stained and frayed by use than those of the higher numbers 8 and 9. Of course, no one could be expected to be greatly interested in the condition of a table of logarithms, but the matter may be considered more worthy of study when we recall that the table is used in the building up of our scientific, engineering, and general factual literature. There may be, in the relative cleanliness of the pages of a logarithm table, data on how we think and how we react when dealing with things that can be described by means of numbers.
Benford studied the distribution of leading digits of 20 sets of data, including rivers, areas, populations, physical constants, mathematical sequences (such as $\sqrt{n}$, $n!$, $n^2$, ...), sports, an issue of Reader's Digest and the first 342 street addresses given in the (then) current American Men of Science. We reproduce his observations in Table 1.2.
Benford's paper contains many of the key observations in the subject. One of the most important is that while individual data sets may fail to satisfy Benford's Law, amalgamating many different sets of data leads to a new sequence whose behavior is typically closer to Benford's Law. This is seen both in the row corresponding to $n$, $n^2$, ... (where we can prove that each of these is non-Benford) as well as in the average over all data sets.
Benford's article suffered a much better fate than Newcomb's paper, possibly in part because it immediately preceded a physics article by Bethe, Rose and Smith on the multiple scattering of electrons. Whereas it was decades before there was another article building on Newcomb's work, the next article after Benford's paper appeared six years later (by S. A. Goudsmit and W. H. Furry, Significant Figures of Numbers in Statistical Tables, in Nature), and after that the papers started appearing more and more frequently. See Hürlimann's extensive bibliography [Hu] for a list of papers, books and reports on Benford's Law from 1881 to 2006, as well as the online bibliography maintained by Arno Berger and Ted Hill [BerH2].
1.4 STATEMENT OF BENFORD'S LAW
We are now ready to give precise statements of Benford's Law.
Definition 1.4.1 (Benford's Law for the Leading Digit). A set of numbers satisfies Benford's Law for the Leading Digit if the probability of observing a first digit of $d$ is $\log_{10}\left(\frac{d+1}{d}\right)$.
While clean and easy to state, the above definition has several problems when we apply it to real data sets. The most glaring is that the numbers $\log_{10}\left(\frac{d+1}{d}\right)$ are irrational. If we have a data set with $N$ observations, then the number of times the first digit is $d$ must be an integer, and hence the observed frequencies are always rational numbers.
One solution to this issue is to consider only infinite sets. Unfortunately this is not possible in many cases of interest, as most real-world data sets are finite (i.e., there are only finitely many counties or finitely many trading days). Thus, while Definition 1.4.1 is fine for mathematical investigations of sequences and functions, it is not practical for many sets of interest. We therefore adjust the definition to
Definition 1.4.2 (Benford's Law for the Leading Digit (Working Definition)). We say a data set satisfies Benford's Law for the Leading Digit if the probability of observing a first digit of $d$ is approximately $\log_{10}\left(\frac{d+1}{d}\right)$.
Note that the above definition is vague, as we need to clarify what is meant by "approximately." It is a non-trivial task to find good statistical tests for large data sets. The famous and popular chi-square test, for example, frequently cannot be used with extensive data sets, as it becomes very sensitive to small deviations when there are many observations. For now, we shall use the above definition and interpret "approximately" to mean a good visual fit. This approach works quite well for many applications. For example, in Chapter 8 we shall see that many corporate and other financial data sets follow Benford's Law, and thus if the distribution is visually far from Benford, it is quite likely that the data's integrity has been compromised.
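The sensitivity of the chi-square test to sample size can be illustrated with a small sketch. Everything here is an assumption made for illustration: the "observed" proportions are a made-up distribution that moves half a percentage point of probability from digit 1 to digit 9, and the proportions are held fixed while the number of observations grows. The critical value 15.507 is the standard 5% cutoff for a chi-square distribution with 8 degrees of freedom.

```python
import math

# Benford probabilities for leading digits 1, ..., 9.
BENFORD = [math.log10((d + 1) / d) for d in range(1, 10)]

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_5_PERCENT_DF8 = 15.507


def chi_square_statistic(observed_proportions, n):
    """Pearson chi-square statistic against Benford for n observations.

    observed_proportions holds the observed first-digit frequencies as
    fractions of n, in the order of digits 1, ..., 9.
    """
    return n * sum(
        (obs - exp) ** 2 / exp for obs, exp in zip(observed_proportions, BENFORD)
    )


# Hypothetical data set: visually almost indistinguishable from Benford.
shifted = BENFORD.copy()
shifted[0] -= 0.005  # slightly fewer leading 1s
shifted[8] += 0.005  # slightly more leading 9s

for n in (1_000, 10_000, 100_000, 1_000_000):
    stat = chi_square_statistic(shifted, n)
    verdict = "reject" if stat > CRITICAL_5_PERCENT_DF8 else "do not reject"
    print(f"N = {n:>9,}: chi-square = {stat:10.2f} -> {verdict}")
```

Because the statistic grows linearly in the number of observations for a fixed set of proportions, the same tiny deviation passes the test for the smaller samples and fails it for the larger ones.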
Finally, instead of studying just the leading digit we could study the entire significand. Thus in place of asking for the probability of a first digit of 1 or 2 or 3, we now ask for the probability of observing a significand between 1 and 2, or between $\pi$ and $e$. This generalization is frequently called the Strong Benford's Law.
Definition 1.4.3 (Strong Benford's Law for the Leading Digits (Working Definition)). We say a data set satisfies the Strong Benford's Law if the probability of observing a significand in $[1, s)$ is $\log_{10} s$.
Note that Strong Benford behavior implies Benford behavior; the probability of a first digit of $d$ is just the probability the significand is in $[d, d+1)$. Writing $[d, d+1)$ as $[1, d+1) \setminus [1, d)$, we see this probability is just $\log_{10}(d+1) - \log_{10} d = \log_{10}\left(\frac{d+1}{d}\right)$.
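The same subtraction works for any interval of significands, which a few lines of Python can check numerically. This is a sketch under the Strong Benford assumption; the function name `strong_benford_prob` is hypothetical, and the interval from e to pi is just one illustrative choice.

```python
import math


def strong_benford_prob(a, b):
    """Probability that the significand lies in [a, b), for 1 <= a <= b <= 10,
    assuming the Strong Benford's Law: P(significand in [1, s)) = log10(s)."""
    if not 1 <= a <= b <= 10:
        raise ValueError("need 1 <= a <= b <= 10")
    return math.log10(b) - math.log10(a)


# Recover the first-digit law from the strong law.
for d in range(1, 10):
    from_strong_law = strong_benford_prob(d, d + 1)
    first_digit_law = math.log10((d + 1) / d)
    print(d, round(from_strong_law, 6), round(first_digit_law, 6))

# An interval that is not bounded by integers: significand between e and pi.
print(strong_benford_prob(math.e, math.pi))
```

The loop prints matching pairs for every digit, and the last line gives a probability of roughly 6.3% for a significand between e and pi.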
1.5 EXAMPLES AND EXPLANATIONS
In this section we briefly give some explanations for why so many different and diverse data sets satisfy Benford's Law, saving for later chapters more detailed explanation. It's worthwhile to take a few minutes to reflect on how Benford's Law was discovered, and to see whether or not similar behavior might be lurking in other systems. The story is that Newcomb was led to the law by observing that the pages in logarithm tables corresponding to numbers beginning with 1 were significantly more worn than the pages corresponding to numbers with a higher first digit. A reasonable explanation for the additional wear and tear is that numbers with a low first digit are more common than those with a higher first digit. It is thus quite fortunate for the field that there were no calculators back then, as otherwise the law could easily have been missed. Though few (if any) of us still use logarithm tables, it is possible to see a similar phenomenon in the real world today. Our analysis of this leads to one of the most important theorems in probability and statistics, the Central Limit Theorem, which plays a role in understanding the ubiquity of Benford's Law.
Instead of looking at logarithm tables, we can look at the steps in an old building, or how worn the grass is on college campuses. Assuming the steps haven't been replaced and that there is a reasonable amount of traffic in and out of the building, then lots of people will walk up and down these stairs. Each person causes a small amount of wear and tear on the steps; though each person's contribution is small, if there are enough people over a long enough time period then the cumulative effect will be visually apparent. Typically the steps are significantly more worn towards the center and less so as one moves towards the edges. A little thought suggests the obvious answer: people typically walk up the middle of a flight of stairs unless someone else is coming down. Similar to carbon dating, one could attempt to determine the age of a building by the indentation of the steps. Looking at these patterns, we would probably see something akin to the normal distribution, and if we were fortunate we might "discover" the Central Limit Theorem. There are many other examples from everyday life. We can also observe this in looking at lawns. Everyone knows the shortest distance between two points is a line, and people frequently leave the sidewalks and paths and cut across the grass, wearing it down to dirt in some places and leaving it untouched in others. Another example is to look at keyboards, and compare the well-worn "E" to the almost pristine "Q." Or the wear and tear on doors. The list is virtually endless.
In Figure 1.1 we look at the leading digits of several "natural" data sets. Four arise from the real world, coming from the 2000 census in the United States (population and area in square miles of U.S. counties), daily volumes of transactions on the New York Stock Exchange (NYSE) from 2000 through 2003 and the physical constants posted on the homepage of the National Institute of Standards and Technology (NIST); the remaining two data sets are popular mathematical sequences: the first 3219 Fibonacci numbers and factorials (we chose this number so that we would have as many entries as we do counties).
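The Fibonacci data set is easy to reproduce. The following sketch tallies the leading digits of the first 3219 Fibonacci numbers and prints them beside the Benford probabilities. It assumes the sequence starts 1, 1, 2, 3, ...; the excerpt does not say which indexing the figure uses, and the function name is hypothetical.

```python
import math
from collections import Counter


def fibonacci_leading_digit_freqs(count):
    """Relative frequency of each leading digit among the first `count`
    Fibonacci numbers, taking the sequence to start 1, 1, 2, 3, ..."""
    counts = Counter()
    a, b = 1, 1
    for _ in range(count):
        # Python integers are exact, so the first character of the decimal
        # string is the true leading digit.
        counts[int(str(a)[0])] += 1
        a, b = b, a + b
    return {d: counts[d] / count for d in range(1, 10)}


freqs = fibonacci_leading_digit_freqs(3219)
for d in range(1, 10):
    print(f"digit {d}: observed {freqs[d]:.4f}   Benford {math.log10((d + 1) / d):.4f}")
```

The factorials could be handled the same way, although their decimal expansions grow much faster, so working with logarithms rather than exact integers may be more practical there.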
If these are "generic" data sets, then we see that no one law describes the behavior of each set. Some of the sets are quite close to following Benford's Law, others are far off; none are close to having each digit equally likely to be the leading digit. Except for the second and third sets, the rest of the data behaves similarly; this is easier to see if we remove these two examples, which we do in Figure 1.2.
Before launching into explanations of why so many data sets are Benford (or at least close to it), it's worth briefly remarking why many are not. There are several reasons and ways a data set can fail to be Benford; we quickly introduce some of these reasons now, and expand on them more when we advance explanations for Benford's Law below. For example, imagine we are recording hourly temperatures in May at London Heathrow Airport. In Fahrenheit the temperatures range from lows of around 40 degrees to highs of around 80. As all digits are not accessible, it's impossible to be Benford, though perhaps given this restriction, the relative probabilities of the digits are Benford.
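To make the temperature example concrete, here is a sketch of what "Benford relative to the accessible digits" would look like. It is purely illustrative: it assumes the only possible leading digits are 4 through 8 (temperatures from 40 to 80 degrees) and rescales the Benford probabilities of those digits so that they sum to 1. It makes no claim about actual Heathrow data, and the function name is hypothetical.

```python
import math


def conditional_benford(accessible_digits):
    """Benford probabilities restricted to the given leading digits and
    renormalized so that they sum to 1."""
    raw = {d: math.log10((d + 1) / d) for d in accessible_digits}
    total = sum(raw.values())
    return {d: p / total for d, p in raw.items()}


# Leading digits available to temperatures between 40 and 80 degrees.
restricted = conditional_benford(range(4, 9))
for d, p in restricted.items():
    print(f"digit {d}: {p:.4f}")
```

Under this assumption a leading 4 would still be the most likely digit and a leading 8 the least likely, even though digits 1 through 3 and 9 never occur.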
Excerpted from Benford's Law: Theory and Applications by Steven J. Miller. Copyright © 2015 Princeton University Press. Excerpted by permission of PRINCETON UNIVERSITY PRESS.
Table of Contents
Part I General Theory I: Basis of Benford's Law
Chapter 1 A Quick Introduction to Benford's Law
1.1 Overview
1.2 Newcomb
1.3 Benford
1.4 Statement of Benford's Law
1.5 Examples and Explanations
1.6 Questions
Chapter 2 A Short Introduction to the Mathematical Theory of Benford's Law
2.1 Introduction
2.2 Significant Digits and the Significand
2.3 The Benford Property
2.4 Characterizations of Benford's Law
2.5 Benford's Law for Deterministic Processes
2.6 Benford's Law for Random Processes
Chapter 3 Fourier Analysis and Benford's Law
3.1 Introduction
3.2 Benford-Good Processes
3.3 Products of Independent Random Variables
3.4 Chains of Random Variables
3.5 Weibull Random Variables, Survival Distributions, and Order Statistics
3.6 Benfordness of Cauchy Distributions
Part II General Theory II: Distributions and Rates of Convergence
Chapter 4 Benford's Law Geometry
4.1 Introduction
4.2 Common Probability Distributions
4.3 Probability Distributions Satisfying Benford's Law
4.4 Conclusions
Chapter 5 Explicit Error Bounds via Total Variation
5.1 Introduction
5.2 Preliminaries
5.3 Error Bounds in Terms of TV($f$)
5.4 Error Bounds in Terms of TV($f^{(k)}$)
5.5 Proofs
Chapter 6 Lévy Processes and Benford's Law
6.1 Overview, Basic Definitions, and Examples
6.2 Expectations of Normalized Functionals
6.3 A.S. Convergence of Normalized Functionals
6.4 Necessary and Sufficient Conditions for (D) or (SC)
6.5 Statistical Applications
6.6 Appendix 1: Another Variant of Poisson Summation
6.7 Appendix 2: An Elementary Property of Conditional Expectations
Part III Applications I: Accounting and Vote Fraud
Chapter 7 Benford's Law as a Bridge between Statistics and Accounting
7.1 The Case for Accountants to Learn Statistics
7.2 The Financial Statement Auditor's Work Environment
7.3 Practical and Statistical Hypotheses
7.4 From Statistical Hypothesis to Decision Making
7.5 Example for Classroom Use
7.6 Conclusion and Recommendations
Chapter 8 Detecting Fraud and Errors Using Benford's Law
8.1 Introduction
8.2 Benford's Original Paper
8.3 Case Studies with Authentic Data
8.4 Case Studies with Fraudulent Data
8.5 Discussion
Chapter 9 Can Vote Counts' Digits and Benford's Law Diagnose Elections?
9.1 Introduction
9.2 2BL and Precinct Vote Counts
9.3 An Example of Strategic Behavior by Voters
9.4 Discussion
Chapter 10 Complementing Benford's Law for Small N: A Local Bootstrap
10.1 The 2009 Iranian Presidential Election
10.2 Applicability of Benford's Law and the K7 Anomaly
10.3 A Conservative Alternative to Benford's Law: A Small N, Empirical, Local Bootstrap Model
10.4 Using a Suspected Anomaly to Select Subsets of the Data
10.5 When Local Bootstraps Complement Benford's Law
Part IV Applications II: Economics
Chapter 11 Measuring the Quality of European Statistics
11.1 Introduction
11.2 Macroeconomic Statistics in the EU
11.3 Benford's Law and Macroeconomic Data
11.4 Conclusion
Chapter 12 Benford's Law and Fraud in Economic Research
12.1 Introduction
12.2 On Benford's Law
12.3 Benford's Law in Macroeconomic Data and Forecasts
12.4 Benford's Law in Published Economic Research
12.5 Replication and Benford's Law
12.6 Conclusions
Chapter 13 Testing for Strategic Manipulation of Economic and Financial Data
13.1 Benford in Economics
13.2 An Application to Value-at-Risk Data
Part V Applications III: Sciences
Chapter 14 Psychology and Benford's Law
14.1 A Behavioral Approach
14.2 Early Behavioral Research
14.3 Recent Research
14.4 Why Do People Approximate Benford's Law?
14.5 Conclusions and Future Directions
Chapter 15 Managing Risk in Numbers Games: Benford's Law and the Small-Number Phenomenon
15.1 Introduction
15.2 Patterns in Number Selection: The Small-Number Phenomenon
15.3 Modeling Number Selection with Benford's Law
15.4 Managerial Implications
15.5 Conclusions
Chapter 16 Benford's Law in the Natural Sciences
16.1 Introduction
16.2 Origins of Benford's Law in Scientific Data
16.3 Examples of Benford's Law in Scientific Data Sets
16.4 Applications of Benford's Law in the Natural Sciences
16.5 Conclusion
Chapter 17 Generalizing Benford's Law: A Reexamination of Falsified Clinical Data
17.1 Introduction
17.2 Connecting Benford's Law to Stigler's Distribution
17.3 Connecting Stigler's Law to Information-Theoretic Methods
17.4 Clinical Data
17.5 Summary and Implications
Part VI Applications IV: Images
Chapter 18 Partial Volume Modeling of Medical Imaging Systems Using the Benford Distribution
18.1 Introduction
18.2 The Partial Volume Effect
18.3 Modeling of the PV Effect
18.4 Materials and Methods
18.5 Results and Discussion
18.6 Conclusions
Chapter 19 Application of Benford's Law to Images
19.1 Introduction
19.2 Background
19.3 Application of Benford's Law to Images
19.4 A Fourier-Series-Based Model
19.5 Results Concerning Ensembles of DCT Coefficients
19.6 Jolion's Results Revisited
19.7 Image Forensics
19.8 Summary
19.9 Appendix
Part VII Exercises
Chapter 20 Exercises
20.1 A Quick Introduction to Benford's Law
20.2 A Short Introduction to the Mathematical Theory of Benford's Law
20.3 Fourier Analysis and Benford's Law
20.4 Benford's Law Geometry
20.5 Explicit Error Bounds via Total Variation
20.6 Lévy Processes and Benford's Law
20.7 Benford's Law as a Bridge between Statistics and Accounting
20.8 Detecting Fraud and Errors Using Benford's Law
20.9 Can Vote Counts' Digits and Benford's Law Diagnose Elections?
20.10 Complementing Benford's Law for Small N: A Local Bootstrap
20.11 Measuring the Quality of European Statistics
20.12 Benford's Law and Fraud in Economic Research
20.13 Testing for Strategic Manipulation of Economic and Financial Data
20.14 Psychology and Benford's Law
20.15 Managing Risk in Numbers Games: Benford's Law and the Small-Number Phenomenon
20.16 Benford's Law in the Natural Sciences
20.17 Generalizing Benford's Law: A Reexamination of Falsified Clinical Data
20.18 Partial Volume Modeling of Medical Imaging Systems Using the Benford Distribution
20.19 Application of Benford's Law to Images