Read an Excerpt
HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS
By Robert Nisbet John Elder Gary Miner
Copyright © 2009 Elsevier Inc.
All right reserved.
Chapter One The Background for Data Mining Practice
A Short History of Statistics and Data Mining 4
Modern Statistics: A Duality? 5
Two Views of Reality 8
The Rise of Modern Statistical Analysis: The Second Generation 10
Machine Learning Methods: The Third Generation 11
Statistical Learning Theory: The Fourth Generation 12
You must be interested in learning how to practice data mining; otherwise, you would not be reading this book. We know that there are many books available that will give a good introduction to the process of data mining. Most books on data mining focus on the features and functions of various data mining tools or algorithms. Some books do focus on the challenges of performing data mining tasks. This book is designed to give you an introduction to the practice of data mining in the real world of business.
One of the first things considered in building a business data mining capability in a company is the selection of the data mining tool. It is difficult to penetrate the hype erected around the description of these tools by the vendors. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost effective in your organization than a more expensive tool. How do you choose your data mining tool? Few reviews are available. The best listing of tools by popularity is maintained and updated yearly by KDNuggets.com. Some detailed reviews available in the literature go beyond just a discussion of the features and functions of the tools (see Nisbet, 2006, Parts 1–3). The interest in an unbiased and detailed comparison is great. We are told the "most downloaded document in data mining" is the comprehensive but decade-old tool review by Elder and Abbott (1998).
The other considerations in building a business's data mining capability are forming the data mining team, building the data mining platform, and forming a foundation of good data mining practice. This book will not discuss the building of the data mining platform. This subject is discussed in many other books, some in great detail. A good overview of how to build a data mining platform is presented in Data Mining: Concepts and Techniques (Han and Kamber, 2006). The primary focus of this book is to present a practical approach to building cost-effective data mining models aimed at increasing company profitability, using tutorials and demo versions of common data mining tools.
Just as important as these considerations in practice is the background against which they must be performed. We must not imagine that the background doesn't matter ... it does matter, whether or not we recognize it initially. The reason it matters is that the capabilities of statistical and data mining methodology were not developed in a vacuum. Analytical methodology was developed in the context of prevailing statistical and analytical theory. But the major driver in this development was a very pressing need to provide a simple and repeatable analysis methodology in medical science. From this beginning developed modern statistical analysis and data mining. To understand the strengths and limitations of this body of methodology and use it effectively, we must understand the strengths and limitations of the statistical theory from which they developed. This theory was developed by scientists and mathematicians who "thought" it out. But this thinking was not one sided or unidirectional; there arose several views on how to solve analytical problems. To understand how to approach the solving of an analytical problem, we must understand the different ways different people tend to think. This history of statistical theory behind the development of various statistical techniques bears strongly on the ability of the technique to serve the tasks of a data mining project.
A SHORT HISTORY OF STATISTICS AND DATA MINING
Analysis of patterns in data is not new. The concepts of average and grouping can be dated back to the 6th century BC in Ancient China, following the invention of the bamboo rod abacus (Goodman, 1968). In Ancient China and Greece, statistics were gathered to help heads of state govern their countries in fiscal and military matters. (This makes you wonder if the words statistic and state might have sprung from the same root.) In the sixteenth and seventeenth centuries, games of chance were popular among the wealthy, prompting many questions about probability to be addressed to famous mathematicians (Fermat, Leibnitz, etc.). These questions led to much research in mathematics and statistics during the ensuing years.
MODERN STATISTICS: A DUALITY?
Two branches of statistical analysis developed in the eighteenth century: Bayesian and classical statistics. (See Figure 1.1.) To treat both fairly in the context of history, we will consider both in the First Generation of statistical analysis. For the Bayesians, the probability of an event's occurrence is equal to the probability of its past occurrence times the likelihood of its occurrence in the future. Analysis proceeds based on the concept of conditional probability: the probability of an event occurring given that another event has already occurred. Bayesian analysis begins with the quantification of the investigator's existing state of knowledge, beliefs, and assumptions. These subjective priors are combined with observed data quantified probabilistically through an objective function of some sort. The classical statistical approach (that flowed out of mathematical works of Gauss and Laplace) considered that the joint probability, rather than the conditional probability, was the appropriate basis for analysis. The joint probability function expresses the probability that simultaneously X takes the specific values x and Y takes value y, as a function of x and y.
Interest in probability picked up early among biologists following Mendel in the latter part of the nineteenth century. Sir Francis Galton, founder of the School of Eugenics in England, and his successor, Karl Pearson, developed the concepts of regression and correlation for analyzing genetic data. Later, Pearson and colleagues extended their work to the social sciences. Following Pearson, Sir R. A. Fisher in England developed his system for inference testing in medical studies based on his concept of standard deviation. While the development of probability theory flowed out of the work of Galton and Pearson, early predictive methods followed Bayes's approach. Bayesian approaches to inference testing could lead to widely different conclusions by different medical investigators because they used different sets of subjective priors. Fisher's goal in developing his system of statistical inference was to provide medical investigators with a common set of tools for use in comparison studies of effects of different treatments by different investigators. But to make his system work even with large samples, Fisher had to make a number of assumptions to define his "Parametric Model."
Assumptions of the Parametric Model
1. Data Fits a Known Distribution (e.g., Normal, Logistic, Poisson, etc.)
Fisher's early work was based on calculation of the parameter standard deviation, which assumes that data are distributed in a normal distribution. The normal distribution is bell-shaped, with the mean (average) at the top of the bell, with "tails" falling off evenly at the sides. Standard deviation is simply the "average" of the absolute deviation of a value from the mean. In this calculation, however, averaging is accomplished by dividing the sum of the absolute deviations by the total – 1. This subtraction expresses (to some extent) the increased uncertainty of the result due to grouping (summing the absolute deviations). Subsequent developments used modified parameters based on the logistic or Poisson distributions. The assumption of a particular known distribution is necessary in order to draw upon the characteristics of the distribution function for making inferences. All of these parametric methods run the gauntlet of dangers related to force-fitting data from the real world into a mathematical construct that does not fit.
2. Factor Independency
In parametric predictive systems, the variable to be predicted (Y) is considered as a function of predictor variables (X's) that are assumed to have independent effects on Y. That is, the effect on Y of each X-variable is not dependent on effects on Y of any other X- variable. This situation could be created in the laboratory by allowing only one factor (e.g., a treatment) to vary, while keeping all other factors constant (e.g., temperature, moisture, light, etc.). But, in the real world, such laboratory control is absent. As a result, some factors that do affect other factors are permitted to have a joint effect on Y. This problem is called collinearity. When it occurs between more than two factors, it is termed multicollinearity. The multicollinearity problem led statisticians to use an interaction term in the relationship that supposedly represented the combined effects. Use of this interaction term functioned as a magnificent kluge, and the reality of its effects was seldom analyzed. Later development included a number of interaction terms, one for each interaction the investigator might be presenting.
3. Linear Additivity
Not only must the X-variables be independent, their effects on Y must be cumulative and linear. That means the effect of each factor is added to or subtracted from the combined effects of all X-variables on Y. But what if the relationship between Y and the predictors (X-variables) is not additive, but multiplicative or divisive? Such functions can be expressed only by exponential equations that usually generate very nonlinear relationships. Assumption of linear additivity for these relationships may cause large errors in the predicted outputs. This is often the case with their use in business data systems.
4. Constant Variance (Homoscedasticity)
The variance throughout the range of each variable is assumed to be constant. This means that if you divided the range of a variable into bins, the variance across all records for bin #1 is the same as the range for all the other bins in the range of that variable. If the variance throughout the range of a variable differs significantly from constancy, it is said to be heteroscedastic. The error in the predicted value caused by the combined heteroscedasticity among all variables can be quite significant.
5. Variables Must Be Numerical and Continuous
The assumption that variables must be numerical and continuous means that data must be numeric (or it must be transformable to a number before analysis) and the number must be part of a distribution that is inherently continuous. Integer values in a string are not continuous; they are discrete. Classical parametric statistical methods are not valid for use with discrete data, because the probability distributions for continuous and discrete data are different. But both scientists and business analysts have used them anyway.
In his landmark paper, Fisher (1921; see Figure 1.2) began with the broad definition of probability as the intrinsic probability of an event's occurrence divided by the probability of occurrence of all competing events (very Bayesian). By the end of his paper, Fisher modified his definition of probability for use in medical analysis (the goal of his research) as the intrinsic probability of an event's occurrence period. He named this quantity likelihood. From that foundation, he developed the concepts of standard deviation based on the normal distribution. Those who followed Fisher began to refer to likelihood as probability. The concept of likelihood approaches the classical concept of probability only as the sample size becomes very large and the effects of subjective priors approach zero (von Mises, 1957). In practice, these two conditions may be satisfied sufficiently if the initial distribution of the data is known and the sample size is relatively large (following the Law of Large Numbers).
Why did this duality of thought arise in the development of statistics? Perhaps it is because of the broader duality that pervades all of human thinking. This duality can be traced all the way back to the ancient debate between Plato and Aristotle.
TWO VIEWS OF REALITY
Whenever we consider solving a problem or answering a question, we start by conceptualizing it. That means we do one of two things: (1) try to reduce it to key elements or (2) try to conceive of it in general terms. We call people who take each of these approaches "detail people" and "big picture people," respectively. What we don't consider is that this distinction has its roots deep in Greek philosophy in the works of Aristotle and Plato.
Aristotle (Figure 1.3) believed that the true being of things (reality) could be discerned only by what the eye could see, the hand could touch, etc. He believed that the highest level of intellectual activity was the detailed study of the tangible world around us. Only in that way could we understand reality. Based on this approach to truth, Aristotle was led to believe that you could break down a complex system into pieces, describe the pieces in detail, put the pieces together and understand the whole. For Aristotle, the "whole" was equal to the sum of its parts. This nature of the whole was viewed by Aristotle in a manner that was very machine-like.
Science gravitated toward Aristotle very early. The nature of the world around us was studied by looking very closely at the physical elements and biological units (species) that composed it. As our understanding of the natural world matured into the concept of the ecosystem, it was discovered that many characteristics of ecosystems could not be explained by traditional (Aristotelian) approaches. For example, in the science of forestry, we discovered that when a tropical rain forest is cut down on the periphery of its range, it may take a very long time to regenerate (if it does at all). We learned that the reason for this is that in areas of relative stress (e.g., peripheral areas), the primary characteristics necessary for the survival and growth of tropical trees are maintained by the forest itself! High rainfall leaches nutrients down beyond the reach of the tree roots, so almost all of the nutrients for tree growth must come from recently fallen leaves and branches. When you cut down the forest, you remove that source of nutrients. The forest canopy also maintains favorable conditions of light, moisture, and temperature required by the trees. Removing the forest removes the very factors necessary for it to exist at all in that location. These factors emerge only when the system is whole and functioning. Many complex systems are like that, even business systems. In fact, these emergent properties may be the major drivers of system stability and predictability.
To understand the failure of Aristotelian philosophy for completely defining the world, we must return to Ancient Greece and consider Aristotle's rival, Plato.
Plato (Figure 1.4) was Aristotle's teacher for 20 years, and they both agreed to disagree on the nature of being. While Aristotle focused on describing tangible things in the world by detailed studies, Plato focused on the world of ideas that lay behind these tangibles. For Plato, the only thing that had lasting being was an idea. He believed that the most important things in human existence were beyond what the eye could see and the hand could touch. Plato believed that the influence of ideas transcended the world of tangible things that commanded so much of Aristotle's interest. For Plato, the "whole" of reality was greater than the sum of its tangible parts.
Excerpted from HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS by Robert Nisbet John Elder Gary Miner Copyright © 2009 by Elsevier Inc.. Excerpted by permission of Academic Press. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.