The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives

by Deirdre Nansen McCloskey, Steve Ziliak

Overview

“McCloskey and Ziliak have been pushing this very elementary, very correct, very important argument through several articles over several years and for reasons I cannot fathom it is still resisted. If it takes a book to get it across, I hope this book will do it. It ought to.”

—Thomas Schelling, Distinguished University Professor, School of Public Policy, University of Maryland, and 2005 Nobel Prize Laureate in Economics

“With humor, insight, piercing logic and a nod to history, Ziliak and McCloskey show how economists—and other scientists—suffer from a mass delusion about statistical analysis. The quest for statistical significance that pervades science today is a deeply flawed substitute for thoughtful analysis. . . . Yet few participants in the scientific bureaucracy have been willing to admit what Ziliak and McCloskey make clear: the emperor has no clothes.”

—Kenneth Rothman, Professor of Epidemiology, Boston University School of Public Health

The Cult of Statistical Significance shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots.



Product Details

ISBN-13: 9780472026104
Publisher: University of Michigan Press
Publication date: 02/11/2010
Series: Economics, Cognition, And Society
Sold by: Barnes & Noble
Format: eBook
Pages: 352
File size: 839 KB

About the Author

Stephen T. Ziliak is the author or editor of many articles and two books. He currently lives in Chicago, where he is Professor of Economics at Roosevelt University.

Deirdre N. McCloskey, Distinguished Professor of Economics, History, English, and Communication at the University of Illinois at Chicago, is the author of twenty books and three hundred scholarly articles. She has held Guggenheim and National Humanities Fellowships. She is best known for How to Be Human* Though an Economist (University of Michigan Press, 2000), and her most recent book, The Bourgeois Virtues: Ethics for an Age of Commerce (2006).

Read an Excerpt


THE CULT OF STATISTICAL SIGNIFICANCE

How the Standard Error Costs Us Jobs, Justice, and Lives



By Stephen T. Ziliak and Deirdre N. McCloskey
The University of Michigan Press
Copyright © 2008

University of Michigan
All rights reserved.



ISBN: 978-0-472-07007-7



Chapter One Dieting "Significance" and the Case of Vioxx

The rationale for the 5% "accept-reject syndrome" which afflicts econometrics and other areas requires immediate attention. ARNOLD ZELLNER 1984, 277

The harm from the common misinterpretation of p = 0.05 as an error probability is apparent. JAMES O. BERGER 2003, 4

Precision Is Nice but Oomph Is the Bomb

Suppose you want to help your mother lose weight and are considering two diet pills with identical prices and side effects. You are determined to choose one of the two pills for her.

The first pill, named Oomph, will on average take off twenty pounds. But it is very uncertain in its effects, at plus or minus ten pounds (you can if you wish take "plus or minus" here to signify technically "two standard errors around the mean"). Oomph gives a big effect, you see, but with a high variance.

Alternatively the pill Precision will take off five pounds on average. But it is much more certain in its effects. Choosing Precision entails a probable error of plus or minus a mere one-half pound. Pill Precision is estimated, in other words, much more precisely than is Oomph, at any rate in view of the sampling schemes that measured the amount of variation in each.

So which pill for Mother, whose goal is to lose weight?

The problem we are describing is that the sizeless sciences, from agronomy to zoology, choose Precision over Oomph every time.

Being precise is not, we repeat, a bad thing. Statistical significance at some arbitrary level, the favored instrument of precision lovers, reports on a particular sort of "signal-to-noise ratio," the ratio of the music you can hear clearly relative to the static interference. Clear signals are nice, especially so in the rare cases in which the noise of small samples and not of misspecification or other "real" errors (as Gosset put it) is your chief problem. A high signal-to-noise ratio in the matter of random samples is helpful if your biggest problem is that your sample is too small, though the clarity of the signal itself is a radically incomplete criterion for making a rational decision.

The signal-to-noise ratio is calculated by dividing a measure of what one wants (the sound of a Miles Davis number, the losing of body fat, the impact of the interest rate on capital investment) by a measure of the uncertainty of the signal, such as the variability caused by static interference on the radio or the random variation from a smallish sample. In diet pill terms the noise (the uncertainty of the signal, the variability) is the random effects, such as the way one person reacts to the pill by contrast with the way another person does, or the way one unit of capital input interacts with the financial sector compared with some other. In formal hypothesis-testing terms, the signal (the observed effect) is typically compared to a "null hypothesis," an alternative belief. The null hypothesis is a belief used to test against the data on hand, allowing one to find a difference from it if there really is one.

In the weight loss example one can choose the null hypothesis to be a literal zero effect, which is a very common choice of a null. That is, the average weight loss afforded by each diet pill is being tested against the null hypothesis, or alternative belief, that the pill in question will not take any weight at all off Mom. The formula for the signal-to-noise ratio is:

(Observed Effect - Hypothesized Null Effect) / (Variation of Observed Effect)

Plugging in the numbers from the example yields for pill Oomph (20 - 0)/10 = 2 and for pill Precision (5 - 0)/0.5 = 10. In other words, the signal-to-noise ratio of pill Oomph is 2 to 1 and of pill Precision 10 to 1. Precision, we find, gives a much clearer signal, five times clearer.
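The arithmetic can be put in a few lines of code. This is a minimal sketch of the example's own numbers, nothing more; the function name and the use of the quoted plus-or-minus spread as the "variation" term are our own illustration.

```python
def signal_to_noise(effect, null, variation):
    """(Observed Effect - Hypothesized Null Effect) / Variation of Observed Effect."""
    return (effect - null) / variation

# Pill Oomph: twenty pounds on average, plus or minus ten
oomph = signal_to_noise(effect=20, null=0, variation=10)      # 2 to 1
# Pill Precision: five pounds on average, plus or minus one-half
precision = signal_to_noise(effect=5, null=0, variation=0.5)  # 10 to 1

print(oomph, precision)  # Precision's signal is five times clearer
```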

All right, then, once more: which pill for Mother? Recall: the pills are identical in every other way, including price and side effects. "Well," say our significance-testing, sizeless scientific colleagues, "the pill with the highest signal-to-noise ratio is Precision. Precision is what scientists want and what the people, such as your mother, need. So, of course, choose Precision."

But Precision is obviously the wrong choice. Wrong for Mother's weight management program and wrong for the many other victims of the sizeless scientist. The sizeless scientist decides whether something is important or not (she decides "whether there exists an effect," as she puts it) by looking not at the something's oomph but at how precisely it is estimated. Diet pill Oomph is potent, she admits. But, after all, it is very imprecise, promising to shed anything from 10 to 30 pounds. Diet pill Precision will, by contrast, shed only 4.5 to 5.5 pounds, she concedes, but, goodness, it is very precise: in Fisher's terms, very statistically significant. From 1925 to 1962, Ronald A. Fisher instructed scientists in many fields to choose Precision over Oomph every time. Now they do.

Common sense, like Gosset himself, would of course recommend Oomph. Mom wants to lose weight, not gain precision. Mom cares about the spread around her waist. She cares little, or not at all, for the spread around the average of an imaginary, infinitely repeated, random sample. The minimax solution (to pick one type of loss function) is obvious: in all states of the world, Oomph dominates Precision. Oomph wins. Choosing the inferior pill, that is, pill Precision, instead maximizes failure: the failure to lose up to an additional 25.5 (30 - 4.5) pounds. You should have picked Oomph.
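The dominance argument is mechanical enough to check directly. A minimal sketch using only the intervals quoted above; the variable names are ours:

```python
# Intervals implied by the text: mean effect plus or minus the quoted spread.
oomph_low, oomph_high = 20 - 10, 20 + 10          # 10 to 30 pounds
precision_low, precision_high = 5 - 0.5, 5 + 0.5  # 4.5 to 5.5 pounds

# In every state of the world the worst draw from Oomph still beats
# the best draw from Precision, so Oomph dominates (the minimax point).
assert oomph_low > precision_high

# The failure maximized by choosing Precision: up to 25.5 forgone pounds.
print(oomph_high - precision_low)  # 25.5
```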

Statistical significance, or sampling precision, says nothing about the oomph of a variable or model. Yet scientists in economics and medicine and the other statistical fields are deciding about oomph on the basis of this one kind of precision. A lottery is a lottery is a lottery, they seem to be saying. A pile of hay is a pile of hay; a mustard packet is a child.

The attention lavished on the signal-to-noise ratio is difficult to fathom, even for acoustical purists such as the noted violinist Stefan Hersh. "Even I get the point about the phoniness of statistical significance," he said to Ziliak one day over lunch. It seems to be hard for scientists trained in Fisherian methods to see how bizarre the methods in fact are and increasingly harder the better trained in Fisherian methods they are.

The level of significance, precision so defined, says what? That "one in a hundred times in samples like this one, if random, the signals will be confused." Or "Nine times out of ten, if the problem is a sampling problem, the data will line up this way relative to the assumed hypothesis without specifying how important the deviations or signal confusions are." Logically speaking, a measurement of sampling precision can't possibly be the end of the inquiry. In the sizeless sciences, from economics to medicine, though, it is. If a result is "precise" in the narrow sense of sampling, then it is hailed as "significant."

Rarely do the sizeless scientists speak in Neyman's sampling terms about confidence intervals or in Gosset's non-sampling terms about real "error bars" (Student 1927). Even more rarely do they speak of the relevant range of effects in the manner of Leamer's (1982) "extreme bounds analysis." And still more rarely do they attend to all the different kinds of errors, errors more dangerous, Gosset insisted, than mere error from sampling, which is merely the easiest error to know and to control. They focus and stare fixedly at tests on the single-point percentage of red balls and white balls drawn hypothetically repeatedly and independently from an urn of nature. (Fisherians do not literally conduct repeated experiments. The brewer did.) But the test of "significance" defined this way, a number (a single point in a distribution) without a scale on which to judge its relevance, says almost nothing. It says nothing at all about what people want unless they want only insurance against a particular kind of sampling error (Type I error, the error of undue skepticism) along a scale on which every red ball or white ball has the same impact on life and judgment.

A century and a half ago Charles Darwin said he had "no Faith in anything short of actual Measurement and the Rule of Three," by which he appeared to mean the peak of arithmetical accomplishment in a nineteenth-century gentleman, solving for x in "6 is to 3 as 9 is to x." Some decades later, in the early 1900s, Karl Pearson shifted the meaning of the Rule of Three ("take 3σ [three standard deviations] as definitely significant") and claimed it for his new journal of significance testing, Biometrika. Even Darwin late in life seems to have fallen into the confusion. Francis Galton (1822-1911), Darwin's first cousin, mailed Darwin a variety of plants. Darwin had been thinking about point estimates on the heights of self- and cross-fertilized plants that depart three "probable errors" or more from the assumed hypothesis, a difference in height significant at about the 1 percent level.

But the gentlemanly faith in the New Rule of Three was misplaced. A statistically significant difference at the 1 percent level (an estimate departing three or more standard deviations from what after Fisher we call the null) may for purposes of botanical or evolutionary significance be of zero importance (cf. Fisher 1935, 27-41). That is, some cause of natural selection may have a high probability of replicability in additional samples but be trivial. Yet, on the other hand, a cause may have a low probability of replicability but be important. This is what we mean when we say that a test of significance is neither necessary nor sufficient for a finding of importance. In significance testing the substantive question of what matters and how much has been translated into a 0 to 1.0 probability, regardless of the nature of the substance, probabilistically measured.
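A toy calculation shows how replicability and importance come apart. With t = effect / (sd / sqrt(n)), a trivial effect measured on an enormous sample sails past any significance threshold, while a substantively large effect measured on a small sample fails it. The numbers below are invented purely for illustration:

```python
from math import sqrt

def t_stat(effect, sd, n):
    # Signal-to-noise ratio: the effect over its standard error, sd / sqrt(n)
    return effect / (sd / sqrt(n))

# Trivial cause, huge sample: highly replicable, substantively near zero.
trivial_but_significant = t_stat(effect=0.01, sd=1.0, n=1_000_000)  # t = 10

# Large cause, small sample: substantively big, "insignificant" by the Rule of Two.
important_but_insignificant = t_stat(effect=5.0, sd=10.0, n=9)      # t = 1.5

print(trivial_but_significant, important_but_insignificant)
```

Neither number says anything about how much the effect matters, which is the point: the test is neither necessary nor sufficient for a finding of importance.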

After Fisher, the loss function intuited by Gosset has been mislaid. It has been mislaid by scientists wandering our academic hallways transfixed in a sizeless stare. That economists have lost it is particularly baffling. Economists would call the missing value of oomph the "reservation price" of a possible course of action, the opportunity cost at the margin of individual-level or groupwise decision. Without it our actual measurements, our economic decisions, come up short (fig. 1.1). As W. Edwards Deming put it, "Statistical 'significance' by itself is not a rational basis for action" (1938, 30).

Yet excellent publishing scientists in the sizeless sciences talk as though they think otherwise. They talk as though establishing the statistical significance of a number in the Fisherian sense is the same thing as establishing the significance of the number in the common sense. Here, for example, is a sentence from an article in economic science coauthored by a scientist we regard as among the best of his generation, Gary Becker (b. 1930), a Nobel laureate of 1992. Becker's article was published in a leading journal in 1994: "The absolute t ratio [the signal-to-noise ratio, using Student's t] associated with the coefficients of this variable is 5.06 in model (i), 5.54 in model (ii), and 6.45 in model (iii).... These results suggest [because Student's t exceeds 2.0] that decisions about current consumption depend on future price" (Becker, Grossman, and Murphy 1994, 404; italics supplied). Notice the rhetoric of depend/not-depend, exist/not-exist, whether/not, and significant/insignificant even from such a splendid economic scientist as Becker. He has confused a measurement of sampling precision (that is, the size of the t statistics) with a quantitative/behavioral demonstration (that is, the size of the coefficients). Something is wrong.

"Significance" and Merck

Merck was in 2005 the third-largest drug manufacturer in the United States. Its painkiller Vioxx was first distributed in the United States in 1999 and by 2003 had been marketed in over eighty countries. At its peak in 2003 Vioxx (also known as Ceoxx) brought in some $2.5 billion. In that year a seventy-three-year-old woman died suddenly of a heart attack while taking as directed her prescribed Vioxx pills. Anticipating a lawsuit the senior scientists and company officials at Merck, newspaper accounts have said, huddled over the statistical significance of the original clinical trial.

From what an outsider can infer, the report of the clinical trial appears to have been fudged. Data that made Vioxx look bad were allegedly simply omitted from the report. A rheumatologist at the University of Arizona and lead author of the 2003 Vioxx study, Jeffrey Lisse, admitted later that not he but Merck "actually wrote the report." Perhaps there is some explanation of the Vioxx study consistent with a more reputable activity than data fudging. We don't know.

"Data fudging and significance testing are not the same," you will say. "Most of us do not commit fraud." True. But listen.

The clinical trial was conducted in 2000, and the findings were published three years later in the Annals of Internal Medicine (Lisse et al. 2003). The scientific article reported that "five [note the number, five] patients taking Vioxx had suffered heart attacks during the trial, compared with one [note the number, one] taking naproxen [the generic drug, such as Aleve, given to a control group], a difference that did not reach statistical significance." The signal-to-noise ratio did not rise to 1.96, the 5 percent level of significance that the Annals of Internal Medicine uses as a strict line of demarcation, discriminating the "significant" from the insignificant, the scientific from the nonscientific, in Fisher's and today's conventional way of thinking.

Therefore, Merck claimed, given the lack of statistical significance at the 5 percent level, there was no difference in the effects of the two pills. No difference in oomph on the human heart, they said, despite a Vioxx disadvantage of about 5 to 1. Then the alleged fraud: the published article neglected to mention that in the same clinical trial three additional takers of Vioxx, including the seventy-three-year-old woman whose survivors brought the problem to public attention, suffered heart attacks. Eight, in fact, suffered or died in the clinical trial, not five. It appears that the scientists, or the Merck employees who wrote the report, simply dropped the three observations.

Why? Why did they drop the three? We do not know for sure. The courts are deciding. But an outsider could be forgiven for inferring that they dropped the three observations in order to get an amount of statistical significance low enough to claim (illogically, but this is the usual procedure) a zero effect. That's the pseudo-qualitative problem created by the backward logic of Fisher's method. Statistical significance, as the authors of the Vioxx study were well aware, is used as an on-off switch for establishing scientific credibility. No significance, no risk to the heart. That appears to have been their logic.

Fisher would not have approved of data fudging. But it was he who developed and legislated the on-off switch that the Vioxx scientists and the Annals (and, to repeat, many courts themselves) mechanically indulged. In this case, as in many others, the reasoning is that if you can keep your sample small enough-by dropping parts of it, for example, especially, as in this apparently fraudulent case, the unfavorable results-you can claim insignificance and continue marketing. In the published article on Vioxx you can see that the authors believed they were testing, with that magic formula, whether an effect existed. "The Fisher exact test," they wrote in typical sizeless scientific fashion, and in apparent ignorance of the scientific values of Gosset, "was used to compare incidence of confirmed perforations, ulcers, bleeding, thrombotic events, and cardiovascular events.... All statistical tests ... were performed at an α level of 0.05" (Lisse et al. 2003, 541).
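The arithmetic of that on-off switch can be sketched. The trial's actual enrollment differed, so the equal arms of 1,000 patients below are purely hypothetical, and for simplicity we compute a one-sided Fisher exact p-value from first principles rather than reproduce the paper's own analysis; only the event counts (5 vs. 1 reported, 8 vs. 1 in full) come from the text.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    the probability, with all margins held fixed, of a or more events in group 1."""
    n = a + b + c + d
    group1 = a + b      # patients in group 1
    events = a + c      # total events across both groups
    p = 0.0
    for k in range(a, min(group1, events) + 1):
        if events - k > c + d:
            continue    # more events than patients left in group 2: impossible
        # Hypergeometric probability of exactly k events landing in group 1
        p += comb(group1, k) * comb(n - group1, events - k) / comb(n, events)
    return p

# Hypothetical arms of 1,000 patients each; event counts are the ones in the text.
p_reported = fisher_one_sided(5, 995, 1, 999)  # 5 vs. 1 heart attacks
p_full     = fisher_one_sided(8, 992, 1, 999)  # 8 vs. 1 heart attacks

print(p_reported > 0.05, p_full < 0.05)
```

Under these hypothetical arm sizes, dropping the three observations is exactly what moves the result from one side of the 5 percent line to the other, which is the on-off logic the chapter describes; and in neither case does the p-value say anything about the oomph of a 5-to-1 or 8-to-1 disadvantage on the human heart.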

(Continues...)





Table of Contents

\rrhp\ \lrrh: Contents\ \1h\ Contents \xt\ \comp: set page numbers on page proof\ Preface Acknowledgments A Significant Problem In many of the life and human sciences the existence/whether question of the philosophical disciplines has substituted for the size-matters/how-much question of the scientific disciplines. The substitution is causing a loss of jobs, justice, profits, environmental quality, and even life. The substitution we are worrying about here is called "statistical significance"- -a qualitative, philosophical rule that has substituted for a quantitative, scientific magnitude and judgment. Chapter 1. Dieting "Significance" and the Case of Vioxx Since R. A. Fisher (1890---1962) the sciences that have put statistical significance at their centers have misused it. They have lost interest in estimating and testing for the actual effects of drugs or fertilizers or economic policies. The big problem began when Fisher ignored the size-matters/how-much question central to a statistical test invented by William Sealy Gosset (1876---1937), so-called Student's t. Fisher substituted for it a qualitative question concerning the "existence" of an effect, by which he meant "low sampling error by an arbitrary standard of variance." Forgetting after Fisher what is known in statistics as a "minimax strategy," or other "loss function," many sciences have fallen into a sizeless stare. They seek sampling precision only. And they end by asserting that sampling precision just is oomph, magnitude, practical significance. The minke and sperm whales of Antarctica and the users and makers of Vioxx are some of the recent victims of this bizarre ritual. Chapter 2. The Sizeless Stare of Statistical Significance Crossing frantically a busy street to save your child from certain death is a good gamble. Crossing frantically to get another mustard packet for your hot dog is not. 
The size of the potential loss if you don't hurry to save your child is larger, most will agree, than the potential loss if you don't get the mustard. But a majority of scientists in economics, medicine, and other statistical fields appear not to grasp the difference. If they have been trained in exclusively Fisherian methods (and nearly all of them have) they look only for a probability of success in the crossing--the existence of a probability of success better than .99 or .95 or .90, and this within the restricted frame of sampling--ignoring in any spiritual or financial currency the value of the prize and the expected cost of pursuing it. In the life and human sciences a majority of scientists look at the world with what we have dubbed "the sizeless stare of statistical significance." Chapter 3. What the Sizeless Scientists Say in Defense The sizeless scientists act as if they believe the size of an effect does not matter. In their hearts they do care about size, magnitude, oomph. But strangely they don't measure it. They substitute "significance" measured in Fisher's way. Then they take the substitution a step further by limiting their concern for error to errors in sampling only. And then they take it a step further still, reducing all errors in sampling to one kind of error--that of excessive skepticism, "Type I error." Their main line of defense for this surprising and unscientific procedure is that, after all, "statistical significance," which they have calculated, is "objective." But so too are the digits in the New York City telephone directory, objective, and the spins of a roulette wheel. These are no more relevant to the task of finding out the sizes and properties of viruses or star clusters or investment rates of return than is statistical significance. In short, statistical scientists after Fisher neither test nor estimate, really, truly. They "testimate." \comp: lowercase Greek beta and lowercase Greek alpha in chapter 4\ Chapter 4. 
Better Practice: -Importance vs. à-"Significance" The most popular test was invented, we've noted, by Gosset, better known by his pen name "Student," a chemist and brewer at Guinness in Dublin. Gosset didn't think his test was very important to his main goal, which was of course brewing a good beer at a good price. The test, Gosset warned right from the beginning, does not deal with substantive importance. It does not begin to measure what Gosset called "real error" and "pecuniary advantage," two terms worth reviving in current statistical practice. But Karl Pearson and especially the amazing Ronald Fisher didn't listen. In two great books written and revised during the 1920s and 1930s, Fisher imposed a Rule of Two: if a result departs from an assumed hypothesis by two or more standard deviations of its own sampling variation, regardless of the size of the prize and the expected cost of going for it, then it is to be called a "significant" scientific finding. If not, not. Fisher told the subjectivity-phobic scientists that if they wanted to raise their studies "to the rank of sciences" they must employ his rule. He later urged them to ignore the size-matters/how-much approaches of Gosset, Neyman, Egon Pearson, Wald, Jeffreys, Deming, Shewhart, and Savage. Most statistical scientists listened to Fisher. Chapter 5. A Lot Can Go Wrong in the Use of Significance Tests in Economics We ourselves in our home field of economics were long enchanted by Fisherian significance and the Rule of Two. But at length we came to wonder why the correlation of prices at home with prices abroad must be "within two standard deviations of 1.0 in the sample" before one could speak about the integration of world markets. And we came to think it strange that the U.S. Department of Labor refused to discuss black teenage unemployment rates of 30 or 40 percent because they were, by Fisher's circumscribed definition, "insignificant." 
After being told repeatedly, if implausibly, that such mistakes in the use of Gosset's test were not common in economics, we developed in the 1990s a questionnaire to test in economics articles for economic as against statistical significance. We applied it to the behavior of our tribe during the 1980s. Chapter 6. A Lot Did Go Wrong in the American Economic Review during the 1980s We did not study the scientific writings of amateurs. On the contrary, we studied the American Economic Review (known to its friends as the AER), a leading journal of economics. With questionnaire in hand we read every full-length article it published that used a test of statistical significance from January 1980 to December 1989. As we expected, in the 1980s more than 70 percent of the articles made the significant mistake of R. A. Fisher. Chapter 7. Is Economic Practice Improving? We published our article in 1996. Some of our colleagues replied, "In the old days [of the 1980s] people made that mistake, but [in the 1990s] we modern sophisticates do not." So in 2004 we published a follow-up study, reading all the articles published in the AER in the next decade, the 1990s. Sadly, our colleagues were again mistaken. Since the 1980s the practice in important respects got worse, not better. About 80 percent of the articles made the mistaken Fisherian substitution, failing to examine the magnitudes of their results. And less than 10 percent showed full concern for oomph. In a leading journal of economics, in other words, nine out of ten articles in the 1990s acted as if size doesn't matter for deciding whether a number is big or small, whether an effect is big or small enough to matter. The significance asterisk, the flickering star of *, has become a totem of economic belief. Chapter 8. How Big Is Big in Economics? Does globalization hurt the poor, does the minimum wage increase unemployment, does world money cause inflation, does public welfare undermine self-reliance? 
Such scientific questions are always matters of economic significance. How much hurt, increase, cause, undermining? Size matters. Oomph is what we seek. But that is not what is found by the statistical methods of modern economics. Chapter 9. What the Sizeless Stare Costs, Economically Speaking Sizeless economic research has produced mistaken findings about purchasing power parity, unemployment programs, monetary policy, rational addiction, and the minimum wage. In truth, it has vitiated most econometric findings since the 1920s and virtually all of them since the significance error was institutionalized in the 1940s. The conclusions of Fisherian studies might occasionally be correct. But only by accident. Chapter 10. How Economics Stays That Way: The Textbooks and the Referees New assistant professors are not to blame. Look rather at the report card of their teachers and editors and referees--notwithstanding cries of anguish from the wise Savages, Zellners, Grangers, and Leamers of the economics profession. Economists received a quiet warning by F. Y. Edgeworth in 1885--too quiet, it seems--that sampling precision is not the same as oomph. They ignored it and have ignored other warnings, too. Chapter 11. The Not-Boring Rise of Significance in Psychology Did other fields, such as psychology, do the same? Yes. In 1919 Edwin Boring warned his fellow psychologists about confusing so-called statistical with actual significance. Boring was a famous experimentalist at Harvard. But during his lectures on scientific inference his colleagues appear to have dozed off. Fisher's 5 percent philosophy was eventually codified by the Publication Manual of the American Psychological Association, which dictated the erroneous method worldwide to thousands of academic journals in psychology, education, and related sciences, including forensics. Chapter 12. 
Psychometrics Lacks Power "Power" is a neglected statistical offset to the "first kind of error" of null-hypothesis significance testing. Power assigns a likelihood to the "second kind of error," that of undue gullibility. The leading journals of psychometrics have had their power examined by insiders to the field. The power of most psychological science in the age of Fisher turns out to have been embarrassingly low or, in more than a few cases, spuriously "high"--as was found in a seventy-thousand-observation examination of the matter. Like economists the psychologists developed a fetish for testimation and wandered away from powerful measures of oomph. Chapter 13. The Psychology of Psychological Significance Testing Psychologists and economists have said for decades that people are "Bayesian learners" or "Neyman-Pearson signal detectors." We learn by doing and staying alert to the signals. But when psychologists and others propose to test those very hypotheses they use Fisher's Rule of Two. That is, they erase their own learning and power to detect the signal. They seek a foundation in a Popperian falsificationism long known to be philosophically dubious. What in logic is called the "fallacy of the transposed conditional" has grossly misled psychology and other sizeless sciences. An example is the overdiagnosis of schizophrenia. Chapter 14. Medicine Seeks a Magic Pill We found that medicine and epidemiology, too, are doing damage with Student's t--more in human terms perhaps than are economics and psychology. The scale along which one would measure oomph is very clear in medicine: life or death. Cardiovascular epidemiology, to take one example, combines with gusto the fallacy of the transposed conditional and the sizeless stare of statistical significance. Your mother, with her weak heart, needs to know the oomph of a treatment. Medical testimators aren't saying. Chapter 15. Rothman's Revolt Some medical editors have battled against the 5 percent philosophy. 
But even the New England Journal of Medicine could not lead medical research back to William Sealy Gosset and the promised land of real science. Neither could the International Committee of Medical Journal Editors, though it covers hundreds of journals worldwide. Kenneth Rothman, the founder of Epidemiology, forced change in his journal. But only in his journal. Decades ago a sensible few in education, ecology, and sociology initiated a "significance test controversy." But grantors, journal referees, and tenure committees in the statistical sciences had faith that probability spaces can judge--the "judgment" merely that p < .05 is "better" for variable X than p < .11 for variable Y. It's not. It depends on the oomph of X and Y.

Chapter 16. On Drugs, Disability, and Death

The upshot is that because of Fisher's standard error you are being given dangerous medicines and denied the best medicines. The Centers for Disease Control is infected with p-values; witness, for example, a grant to study drug use in Atlanta. Public health has been infected, too. An outbreak of salmonella in South Carolina was studied using significance tests. In consequence a good deal of the outbreak was ignored. In 1995 a Cancer Trialists' Collaborative Group came to a rare consensus on effect size: ten different studies agreed that a certain drug for treating prostate cancer can increase patient survival by 12 percent. An eleventh study, published in the New England Journal of Medicine, dismissed the drug. The dismissal was based not on effect size bounded by confidence intervals grounded in what Gosset called "real" error but on a single p-value only, indicating, the Fisherian authors believed, "no clinically meaningful improvement" in survival.

Chapter 17. Edgeworth's Significance

The history of this persistent but mistaken practice is a social study of science. In 1885 an eccentric and brilliant Oxford don, Francis Ysidro Edgeworth, coined the very term significance.
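The contrast between a bare p-value and an effect size bounded by a confidence interval can be sketched in a few lines of Python. The trial numbers below (62 versus 50 percent survival in two arms of 100 patients each) are hypothetical, chosen only to illustrate the logic, not taken from the studies chapter 16 describes.

```python
from math import sqrt

def diff_with_ci(p1, n1, p2, n2, z=1.96):
    """Difference of two proportions with a 95% confidence interval
    (normal approximation)."""
    d = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, d - z * se, d + z * se

# Hypothetical small trial: 62% survival with the drug, 50% without.
d, lo, hi = diff_with_ci(0.62, 100, 0.50, 100)
print(round(d, 2), round(lo, 3), round(hi, 3))  # 0.12 -0.017 0.257
```

A Fisherian would note only that the interval crosses zero (p > .05) and dismiss the drug; the interval itself says the data are also consistent with a clinically large 12-point gain in survival, which is the oomph question a bare p-value conceals.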
Edgeworth was prolific in science and philosophy, but was especially interested in watching bees and wasps. In measuring their behavioral differences, though, he focused on the sizes and meanings of the differences. He never depended on statistical significance.

Chapter 18. "Take 3σ as Definitely Significant": Pearson's Rule

By contrast, Edgeworth's younger colleague in London, the great and powerful Karl Pearson, used "significance" very heavily indeed. As such things were defined in 1900, Pearson was an advanced thinker--for example, he was an imperialist and a racist and one of the founding fathers of neopositivism and eugenics. Seeking to resolve a tension between passion and science, ethics and rationality, Pearson mistook significance for "revelations about the objective world." In 1901 he believed 1.5 to 3 standard deviations were "definitely significant." By 1906 he had tried to codify the sizeless stare with a Rule of Three and tried to teach it to Gosset.

Chapter 19. Who Sits on the Egg of Cuculus Canorus? Not Karl Pearson

Pearson's journal, Biometrika (1901-), was for decades a major nest for the significance mistake. An article on the brooding habits of the cuckoo bird, published in the inaugural volume, shows the sizeless stare at its beginnings.

Chapter 20. Gosset: The Fable of the Bee

Gosset revolutionized statistics in 1908 with two articles published in this same Pearson's journal, "The Probable Error of a Mean" and "The Probable Error of a Correlation Coefficient." Gosset also independently invented Monte Carlo analysis and the economic design of experiments. He conceived in 1926 the ideas, if not the words, of "power" and "loss," which he gave to Egon Pearson and Jerzy Neyman to complete. Yet most statistical workers know nothing about Gosset. He was exceptionally humble, kindly to other scientists, a good father and husband, altogether a paragon.
As suits an amiable worker bee, he planted edible berries, blew a pennywhistle, repaired entire, functioning fishing boats with a penknife, and--though a great scientist--was for thirty-eight years a businessman brewing Guinness. Gosset always wanted to answer the how-much question. Guinness needed to know. Karl Pearson couldn't understand.

Chapter 21. Fisher: The Fable of the Wasp

The tragedy in the fable arose from Gosset the bee losing out to R. A. Fisher the wasp. All agree that Fisher was a genius. Richard Dawkins calls him "the greatest of Darwin's successors." But Fisher was a genius at a certain kind of academic rhetoric and politics as much as at mathematical statistics and genetics. His ascent came at a cost to science--and to Gosset.

Chapter 22. How the Wasp Stung the Bee and Took Over Some Sciences

Fisher asked Gosset to calculate Gosset's tables of t for him, gratis. He then took Gosset's tables, copyrighted them for himself, and--in the journal Metron and in his Statistical Methods for Research Workers, later published in thirteen editions and many languages--promoted his own circumscribed version of Gosset's test. The new assignment of authorship and the faux machinery for science were spread by disciples and by Fisher himself to America and beyond. For decades Harold Hotelling, an important statistician and economist, enthusiastically carried the Fisherian flag. P. C. Mahalanobis, the great Indian scientist, was spellbound.

Chapter 23. Eighty Years of Trained Incapacity: How Such a Thing Could Happen

R. A. Fisher was a necessary condition for the standard error of regressions. No Fisher, no lasting error. But for null-hypothesis significance testing to persist in the face of its logical and practical difficulties, something else must be operating.
Perhaps it is what Thorstein Veblen called "trained incapacity," to which might be added what Robert Merton called the "bureaucratization of knowledge" and what Friedrich Hayek called the "scientistic prejudice." We suggest that the sizeless sciences need to reform their scientistic bureaucracies.

Chapter 24. What to Do

What, then? Get back to size in science, and to "real error" seriously considered. That is more difficult than the Fisherian routine and cannot be reduced to mechanical procedures. How big is big is a necessary question in any science, and it has no answer independent of the conversation of scientists. But it has the merit at least of being relevant to science, business, and life. The Fisherian procedures are not.

A Reader's Guide
Notes
Works Cited
Index