Handbook of Statistical Analysis and Data Mining Applications

by Robert Nisbet, Gary Miner, John Elder IV

The Handbook of Statistical Analysis and Data Mining Applications is a comprehensive professional reference book that guides business analysts, scientists, engineers and researchers (both academic and industrial) through all stages of data analysis, model building and implementation. The Handbook helps one discern the technical and business problem, understand the strengths and weaknesses of modern data mining algorithms, and employ the right statistical methods for practical application. Use this book to address massive and complex datasets with novel statistical approaches and be able to objectively evaluate analyses and solutions. It has clear, intuitive explanations of the principles and tools for solving problems using modern analytic techniques, and discusses their application to real problems, in ways accessible and beneficial to practitioners across industries - from science and engineering, to medicine, academia and commerce. This handbook brings together, in a single resource, all the information a beginner will need to understand the tools and issues in data mining to build successful data mining solutions.

  • Written "By Practitioners for Practitioners"
  • Non-technical explanations build understanding without jargon and equations
  • Tutorials in numerous fields of study provide step-by-step instruction on how to use supplied tools to build models
  • Practical advice from successful real-world implementations
  • Includes extensive case studies, examples, MS PowerPoint slides and datasets
  • CD-DVD with valuable fully-working 90-day software included: "Complete Data Miner - QC-Miner - Text Miner" bound with book

Editorial Reviews

From the Publisher
“Great introduction to the real-world process of data mining. The overviews, practical advice, tutorials, and extra CD material make this book an invaluable resource for both new and experienced data miners.”

- Karl Rexer, PhD (President & Founder of Rexer Analytics, Boston, Massachusetts)

“If you want to roll up your sleeves and execute on predictive analytics, this is your definite, go-to resource. To put it lightly, if this book isn't on your shelf, you're not a data miner.”

- Eric Siegel, Ph.D., President, Prediction Impact, Inc. and Founding Chair, Predictive Analytics World

Product Details

Publisher: Elsevier Science
Sold by: Barnes & Noble
File size: 25 MB

Read an Excerpt


By Robert Nisbet, John Elder, Gary Miner

Academic Press

Copyright © 2009 Elsevier Inc.
All rights reserved.

ISBN: 978-0-08-091203-5

Chapter One

The Background for Data Mining Practice


Preamble

A Short History of Statistics and Data Mining

Modern Statistics: A Duality?

Two Views of Reality

The Rise of Modern Statistical Analysis: The Second Generation

Machine Learning Methods: The Third Generation

Statistical Learning Theory: The Fourth Generation

Postscript


You must be interested in learning how to practice data mining; otherwise, you would not be reading this book. We know that there are many books available that will give a good introduction to the process of data mining. Most books on data mining focus on the features and functions of various data mining tools or algorithms. Some books do focus on the challenges of performing data mining tasks. This book is designed to give you an introduction to the practice of data mining in the real world of business.

One of the first things considered in building a business data mining capability in a company is the selection of the data mining tool. It is difficult to penetrate the hype erected around the description of these tools by the vendors. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost effective in your organization than a more expensive tool. How do you choose your data mining tool? Few reviews are available. The best listing of tools by popularity is maintained and updated yearly by KDNuggets.com. Some detailed reviews available in the literature go beyond just a discussion of the features and functions of the tools (see Nisbet, 2006, Parts 1–3). The interest in an unbiased and detailed comparison is great. We are told the "most downloaded document in data mining" is the comprehensive but decade-old tool review by Elder and Abbott (1998).

The other considerations in building a business's data mining capability are forming the data mining team, building the data mining platform, and forming a foundation of good data mining practice. This book will not discuss the building of the data mining platform. This subject is discussed in many other books, some in great detail. A good overview of how to build a data mining platform is presented in Data Mining: Concepts and Techniques (Han and Kamber, 2006). The primary focus of this book is to present a practical approach to building cost-effective data mining models aimed at increasing company profitability, using tutorials and demo versions of common data mining tools.

Just as important as these considerations in practice is the background against which they must be performed. We must not imagine that the background doesn't matter ... it does matter, whether or not we recognize it initially. The reason it matters is that the capabilities of statistical and data mining methodology were not developed in a vacuum. Analytical methodology was developed in the context of prevailing statistical and analytical theory. But the major driver in this development was a very pressing need to provide a simple and repeatable analysis methodology in medical science. From this beginning developed modern statistical analysis and data mining. To understand the strengths and limitations of this body of methodology and use it effectively, we must understand the strengths and limitations of the statistical theory from which they developed. This theory was developed by scientists and mathematicians who "thought" it out. But this thinking was not one sided or unidirectional; there arose several views on how to solve analytical problems. To understand how to approach the solving of an analytical problem, we must understand the different ways different people tend to think. This history of statistical theory behind the development of various statistical techniques bears strongly on the ability of the technique to serve the tasks of a data mining project.


Analysis of patterns in data is not new. The concepts of average and grouping can be dated back to the 6th century BC in Ancient China, following the invention of the bamboo rod abacus (Goodman, 1968). In Ancient China and Greece, statistics were gathered to help heads of state govern their countries in fiscal and military matters. (This makes you wonder if the words statistic and state might have sprung from the same root.) In the sixteenth and seventeenth centuries, games of chance were popular among the wealthy, prompting many questions about probability to be addressed to famous mathematicians (Fermat, Leibnitz, etc.). These questions led to much research in mathematics and statistics during the ensuing years.


Two branches of statistical analysis developed in the eighteenth century: Bayesian and classical statistics. (See Figure 1.1.) To treat both fairly in the context of history, we will consider both in the First Generation of statistical analysis. For the Bayesians, the probability of an event's occurrence is equal to the probability of its past occurrence times the likelihood of its occurrence in the future. Analysis proceeds based on the concept of conditional probability: the probability of an event occurring given that another event has already occurred. Bayesian analysis begins with the quantification of the investigator's existing state of knowledge, beliefs, and assumptions. These subjective priors are combined with observed data quantified probabilistically through an objective function of some sort. The classical statistical approach (that flowed out of mathematical works of Gauss and Laplace) considered that the joint probability, rather than the conditional probability, was the appropriate basis for analysis. The joint probability function expresses the probability that X takes the specific value x and, simultaneously, Y takes the value y, as a function of x and y.
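The distinction between joint and conditional probability can be made concrete with a small table of counts. The sketch below is only an illustration: the events A and B and all the counts are made up, and the last step uses the standard statement of Bayes' theorem rather than anything specific to this chapter.

```python
# Hypothetical counts for two yes/no events A and B over 100 trials.
# The numbers are invented purely for illustration.
counts = {
    (True, True): 30,    # A and B both occurred
    (True, False): 10,   # A occurred, B did not
    (False, True): 20,   # B occurred, A did not
    (False, False): 40,  # neither occurred
}
total = sum(counts.values())

# Joint probability: A and B occur together.
p_joint = counts[(True, True)] / total

# Marginal probabilities of each event on its own.
p_a = (counts[(True, True)] + counts[(True, False)]) / total
p_b = (counts[(True, True)] + counts[(False, True)]) / total

# Conditional probability: A given that B has already occurred.
p_a_given_b = p_joint / p_b
p_b_given_a = p_joint / p_a

# Bayes' theorem recovers P(A|B) from P(B|A) and the marginals.
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(f"P(A, B)  = {p_joint:.2f}")
print(f"P(A | B) = {p_a_given_b:.2f}")
print(f"via Bayes: {p_a_given_b_bayes:.2f}")
```

The joint probability describes how often both events occur together across all trials, while the conditional probability restricts attention to the trials in which B occurred.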

Interest in probability picked up early among biologists following Mendel in the latter part of the nineteenth century. Sir Francis Galton, founder of the School of Eugenics in England, and his successor, Karl Pearson, developed the concepts of regression and correlation for analyzing genetic data. Later, Pearson and colleagues extended their work to the social sciences. Following Pearson, Sir R. A. Fisher in England developed his system for inference testing in medical studies based on his concept of standard deviation. While the development of probability theory flowed out of the work of Galton and Pearson, early predictive methods followed Bayes's approach. Bayesian approaches to inference testing could lead to widely different conclusions by different medical investigators because they used different sets of subjective priors. Fisher's goal in developing his system of statistical inference was to provide medical investigators with a common set of tools for use in comparison studies of effects of different treatments by different investigators. But to make his system work even with large samples, Fisher had to make a number of assumptions to define his "Parametric Model."

Assumptions of the Parametric Model

1. Data Fits a Known Distribution (e.g., Normal, Logistic, Poisson, etc.)

Fisher's early work was based on calculation of the parameter standard deviation, which assumes that data are distributed in a normal distribution. The normal distribution is bell-shaped, with the mean (average) at the top of the bell, with "tails" falling off evenly at the sides. Standard deviation is, loosely, a kind of "average" deviation of a value from the mean: the square root of the averaged squared deviations. In this calculation, however, averaging is accomplished by dividing the sum of the squared deviations by the total – 1. This subtraction expresses (to some extent) the increased uncertainty of the result due to grouping (summing the squared deviations). Subsequent developments used modified parameters based on the logistic or Poisson distributions. The assumption of a particular known distribution is necessary in order to draw upon the characteristics of the distribution function for making inferences. All of these parametric methods run the gauntlet of dangers related to force-fitting data from the real world into a mathematical construct that does not fit.
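For reference, the conventional sample standard deviation is the square root of the summed squared deviations divided by the total - 1, which is not the same quantity as a plain average of absolute deviations. A minimal sketch with made-up values computes both so the difference is visible; the data and variable names are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = sum(values) / n

# Mean absolute deviation: the plain average of |value - mean|.
mean_abs_dev = sum(abs(v - mean) for v in values) / n

# Sample standard deviation: sum the squared deviations,
# divide by (n - 1), then take the square root.
sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
sample_sd = math.sqrt(sample_var)

print(f"mean                      = {mean:.3f}")
print(f"mean absolute deviation   = {mean_abs_dev:.3f}")
print(f"sample standard deviation = {sample_sd:.3f}")

# The standard library's stdev uses the same (n - 1) divisor.
print(f"statistics.stdev          = {statistics.stdev(values):.3f}")
```

Dividing by the total - 1 rather than the total makes the estimate slightly larger, which matters most when the sample is small.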

2. Factor Independency

In parametric predictive systems, the variable to be predicted (Y) is considered as a function of predictor variables (X's) that are assumed to have independent effects on Y. That is, the effect on Y of each X-variable is not dependent on effects on Y of any other X-variable. This situation could be created in the laboratory by allowing only one factor (e.g., a treatment) to vary, while keeping all other factors constant (e.g., temperature, moisture, light, etc.). But, in the real world, such laboratory control is absent. As a result, some factors that do affect other factors are permitted to have a joint effect on Y. This problem is called collinearity. When it occurs between more than two factors, it is termed multicollinearity. The multicollinearity problem led statisticians to use an interaction term in the relationship that supposedly represented the combined effects. Use of this interaction term functioned as a magnificent kluge, and the reality of its effects was seldom analyzed. Later development included a number of interaction terms, one for each interaction the investigator might be presenting.
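One simple way to screen for collinearity is to look at the pairwise correlation between predictors before modeling. The sketch below is an assumption-laden illustration, not a method described in this chapter: the temperature, moisture, and light readings are invented, and the Pearson correlation is only a first-pass check that cannot detect every form of multicollinearity.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical field readings (not from the book).
temperature = [10, 15, 20, 25, 30, 35]
moisture = [80, 72, 61, 49, 41, 30]   # falls steadily as temperature rises
light = [5, 9, 4, 8, 6, 7]            # varies with no clear pattern

r_temp_moisture = pearson(temperature, moisture)
r_temp_light = pearson(temperature, light)

print(f"temperature vs moisture: r = {r_temp_moisture:.3f}")
print(f"temperature vs light:    r = {r_temp_light:.3f}")
```

A correlation close to +1 or -1 between two predictors suggests that their effects on Y will be hard to separate, which is the situation the independence assumption rules out.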

3. Linear Additivity

Not only must the X-variables be independent, their effects on Y must be cumulative and linear. That means the effect of each factor is added to or subtracted from the combined effects of all X-variables on Y. But what if the relationship between Y and the predictors (X-variables) is not additive, but multiplicative or divisive? Such functions can be expressed only by exponential equations that usually generate very nonlinear relationships. Assumption of linear additivity for these relationships may cause large errors in the predicted outputs. This is often the case with their use in business data systems.
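A short numerical sketch shows why a multiplicative relationship breaks the additivity assumption. The function below is hypothetical (Y equal to 3 times X1 times X2), and the log transform used at the end is a standard workaround that we introduce here as an assumption; it is not a technique taken from the text above.

```python
import math

def y_multiplicative(x1, x2):
    """Hypothetical response in which the predictors multiply."""
    return 3.0 * x1 * x2

# Effect on Y of raising x1 from 1 to 2, measured at two levels of x2.
effect_at_low_x2 = y_multiplicative(2, 1) - y_multiplicative(1, 1)
effect_at_high_x2 = y_multiplicative(2, 10) - y_multiplicative(1, 10)

# Under linear additivity these two effects would be identical.
print(f"effect of x1 when x2 = 1:  {effect_at_low_x2}")
print(f"effect of x1 when x2 = 10: {effect_at_high_x2}")

# On a log scale the same relationship is additive:
# log(y) = log(3) + log(x1) + log(x2)
log_effect_low = math.log(y_multiplicative(2, 1)) - math.log(y_multiplicative(1, 1))
log_effect_high = math.log(y_multiplicative(2, 10)) - math.log(y_multiplicative(1, 10))

print(f"log-scale effect when x2 = 1:  {log_effect_low:.4f}")
print(f"log-scale effect when x2 = 10: {log_effect_high:.4f}")
```

Because the effect of X1 depends on the level of X2, an additive linear model fitted to the raw values would misstate the response over much of the range.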

4. Constant Variance (Homoscedasticity)

The variance throughout the range of each variable is assumed to be constant. This means that if you divided the range of a variable into bins, the variance across all records for bin #1 is the same as the variance for all the other bins in the range of that variable. If the variance throughout the range of a variable differs significantly from constancy, it is said to be heteroscedastic. The error in the predicted value caused by the combined heteroscedasticity among all variables can be quite significant.
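The binning idea described above can be sketched directly: split the records by bins of a predictor and compare the variance of the response within each bin. The records, the bin width of 10, and the use of a simple max-to-min variance ratio are all assumptions made for illustration; formal tests of constant variance exist but are not shown here.

```python
import statistics

# Hypothetical (x, y) records in which the spread of y grows with x.
records = [
    (1, 10.0), (2, 11.0), (3, 9.0), (4, 10.5),
    (11, 20.0), (12, 26.0), (13, 15.0), (14, 23.0),
    (21, 30.0), (22, 45.0), (23, 18.0), (24, 38.0),
]

BIN_WIDTH = 10  # assumed bin width over the range of x

# Group the y values by the bin that their x value falls into.
bins = {}
for x, y in records:
    bins.setdefault(x // BIN_WIDTH, []).append(y)

# Sample variance of y within each bin.
variances = {b: statistics.variance(ys) for b, ys in sorted(bins.items())}
for b, var in variances.items():
    low = b * BIN_WIDTH
    print(f"x in [{low}, {low + BIN_WIDTH}): variance of y = {var:.2f}")

# A ratio far above 1 suggests the variance is not constant.
variance_ratio = max(variances.values()) / min(variances.values())
print(f"largest / smallest bin variance = {variance_ratio:.1f}")
```

With constant variance the per-bin values would be roughly equal; a steadily increasing sequence like this one is a sign of heteroscedasticity.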

5. Variables Must Be Numerical and Continuous

The assumption that variables must be numerical and continuous means that data must be numeric (or it must be transformable to a number before analysis) and the number must be part of a distribution that is inherently continuous. Integer values in a string are not continuous; they are discrete. Classical parametric statistical methods are not valid for use with discrete data, because the probability distributions for continuous and discrete data are different. But both scientists and business analysts have used them anyway.

In his landmark paper, Fisher (1921; see Figure 1.2) began with the broad definition of probability as the intrinsic probability of an event's occurrence divided by the probability of occurrence of all competing events (very Bayesian). By the end of his paper, Fisher modified his definition of probability for use in medical analysis (the goal of his research) as the intrinsic probability of an event's occurrence, period. He named this quantity likelihood. From that foundation, he developed the concepts of standard deviation based on the normal distribution. Those who followed Fisher began to refer to likelihood as probability. The concept of likelihood approaches the classical concept of probability only as the sample size becomes very large and the effects of subjective priors approach zero (von Mises, 1957). In practice, these two conditions may be satisfied sufficiently if the initial distribution of the data is known and the sample size is relatively large (following the Law of Large Numbers).
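The fading influence of a subjective prior as the sample grows can be illustrated with a simple success/failure setting. This sketch rests on assumptions that are not in the text: it models the prior as a Beta(8, 2) distribution, uses the standard Beta-binomial posterior mean, and compares it with the maximum-likelihood estimate k/n for an invented true rate of 0.3.

```python
def likelihood_estimate(successes, trials):
    """Maximum-likelihood estimate of a success rate: k / n."""
    return successes / trials

def posterior_mean(successes, trials, prior_a=8.0, prior_b=2.0):
    """Posterior mean under an assumed Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

TRUE_RATE = 0.3  # hypothetical underlying rate

gaps = {}
for trials in (10, 100, 1000, 10000):
    # Use the expected number of successes to keep the example deterministic.
    successes = round(TRUE_RATE * trials)
    mle = likelihood_estimate(successes, trials)
    bayes = posterior_mean(successes, trials)
    gaps[trials] = abs(bayes - mle)
    print(f"n = {trials:>5}: likelihood = {mle:.4f}, "
          f"posterior mean = {bayes:.4f}, gap = {gaps[trials]:.4f}")
```

The prior here deliberately favors a high rate, so with ten trials the two estimates disagree sharply; by ten thousand trials the data dominate and the gap is negligible.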

Why did this duality of thought arise in the development of statistics? Perhaps it is because of the broader duality that pervades all of human thinking. This duality can be traced all the way back to the ancient debate between Plato and Aristotle.


Whenever we consider solving a problem or answering a question, we start by conceptualizing it. That means we do one of two things: (1) try to reduce it to key elements or (2) try to conceive of it in general terms. We call people who take each of these approaches "detail people" and "big picture people," respectively. What we don't consider is that this distinction has its roots deep in Greek philosophy in the works of Aristotle and Plato.


Aristotle (Figure 1.3) believed that the true being of things (reality) could be discerned only by what the eye could see, the hand could touch, etc. He believed that the highest level of intellectual activity was the detailed study of the tangible world around us. Only in that way could we understand reality. Based on this approach to truth, Aristotle was led to believe that you could break down a complex system into pieces, describe the pieces in detail, put the pieces together and understand the whole. For Aristotle, the "whole" was equal to the sum of its parts. This nature of the whole was viewed by Aristotle in a manner that was very machine-like.

Science gravitated toward Aristotle very early. The nature of the world around us was studied by looking very closely at the physical elements and biological units (species) that composed it. As our understanding of the natural world matured into the concept of the ecosystem, it was discovered that many characteristics of ecosystems could not be explained by traditional (Aristotelian) approaches. For example, in the science of forestry, we discovered that when a tropical rain forest is cut down on the periphery of its range, it may take a very long time to regenerate (if it does at all). We learned that the reason for this is that in areas of relative stress (e.g., peripheral areas), the primary characteristics necessary for the survival and growth of tropical trees are maintained by the forest itself! High rainfall leaches nutrients down beyond the reach of the tree roots, so almost all of the nutrients for tree growth must come from recently fallen leaves and branches. When you cut down the forest, you remove that source of nutrients. The forest canopy also maintains favorable conditions of light, moisture, and temperature required by the trees. Removing the forest removes the very factors necessary for it to exist at all in that location. These factors emerge only when the system is whole and functioning. Many complex systems are like that, even business systems. In fact, these emergent properties may be the major drivers of system stability and predictability.

To understand the failure of Aristotelian philosophy for completely defining the world, we must return to Ancient Greece and consider Aristotle's rival, Plato.


Plato (Figure 1.4) was Aristotle's teacher for 20 years, and they both agreed to disagree on the nature of being. While Aristotle focused on describing tangible things in the world by detailed studies, Plato focused on the world of ideas that lay behind these tangibles. For Plato, the only thing that had lasting being was an idea. He believed that the most important things in human existence were beyond what the eye could see and the hand could touch. Plato believed that the influence of ideas transcended the world of tangible things that commanded so much of Aristotle's interest. For Plato, the "whole" of reality was greater than the sum of its tangible parts.


Excerpted from HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS by Robert Nisbet, John Elder, Gary Miner. Copyright © 2009 by Elsevier Inc. Excerpted by permission of Academic Press. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Meet the Author

Dr. Robert Nisbet was trained initially in Ecology and Ecosystems Analysis. He has over 30 years’ experience in complex systems analysis and modeling, most recently as a Researcher (University of California, Santa Barbara). In business, he pioneered the design and development of configurable data mining applications for retail sales forecasting, and for Churn, Propensity-to-buy, and Customer Acquisition in the Telecommunications, Insurance, Banking, and Credit industries. In addition to data mining, he has expertise in data warehousing technology for Extract, Transform, and Load (ETL) operations, Business Intelligence reporting, and data quality analyses. He is lead author of the “Handbook of Statistical Analysis & Data Mining Applications” (Academic Press, 2009), and a co-author of “Practical Text Mining” (Academic Press, 2012). Currently, he serves as an Instructor in the University of California, Irvine Predictive Analytics Certification Program, teaching online courses in Effective Data Preparation, and co-teaching Introduction to Predictive Analytics.
Dr. Gary Miner received a B.S. from Hamline University, St. Paul, MN, with biology, chemistry, and education majors; an M.S. in zoology and population genetics from the University of Wyoming; and a Ph.D. in biochemical genetics from the University of Kansas as the recipient of a NASA pre-doctoral fellowship. He pursued additional National Institutes of Health postdoctoral studies at the University of Minnesota and the University of Iowa, eventually becoming immersed in the study of affective disorders and Alzheimer's disease.

In 1985, he and his wife, Dr. Linda Winters-Miner, founded the Familial Alzheimer's Disease Research Foundation, which became a leading force in organizing both local and international scientific meetings, bringing together all the leaders in the field of genetics of Alzheimer's from several countries, resulting in the first major book on the genetics of Alzheimer’s disease. In the mid-1990s, Dr. Miner turned his data analysis interests to the business world, joining the team at StatSoft and deciding to specialize in data mining. He started developing what eventually became the Handbook of Statistical Analysis and Data Mining Applications (co-authored with Drs. Robert A. Nisbet and John Elder), which received the 2009 American Publishers Award for Professional and Scholarly Excellence (PROSE). Their follow-up collaboration, Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, also received a PROSE award in February of 2013. Overall, Dr. Miner’s career has focused on medicine and health issues, so serving as the ‘project director’ for this current book on ‘Predictive Analytics of Medicine - Healthcare Issues’ fit his knowledge and skills perfectly.

Gary also serves as VP & Scientific Director of Healthcare Predictive Analytics Corp, and as Merit Reviewer for PCORI (Patient Centered Outcomes Research Institute), which awards grants for predictive analytics research into the comparative effectiveness and heterogeneous treatment effects of medical interventions, including drugs, among different genetic groups of patients. Additionally, he teaches online classes in ‘Introduction to Predictive Analytics’, ‘Text Analytics’, and ‘Risk Analytics’ for the University of California-Irvine, and other classes in medical predictive analytics for the University of California-San Diego. He spends most of his time in his primary role as Senior Analyst-Healthcare Applications Specialist for Dell | Information Management Group, Dell Software (through Dell’s acquisition of StatSoft in April 2014).
Dr. John Elder heads the United States’ leading data mining consulting team, with offices in Charlottesville, Virginia; Washington, D.C.; and Baltimore, Maryland (www.datamininglab.com). Founded in 1995, Elder Research, Inc. focuses on investment, commercial, and security applications of advanced analytics, including text mining, image recognition, process optimization, cross-selling, biometrics, drug efficacy, credit scoring, market sector timing, and fraud detection. John obtained a B.S. and an M.E.E. in electrical engineering from Rice University and a Ph.D. in systems engineering from the University of Virginia, where he’s an adjunct professor teaching Optimization or Data Mining. Prior to 16 years at ERI, he spent five years in aerospace defense consulting, four years heading research at an investment management firm, and two years in Rice's Computational & Applied Mathematics Department.

Customer Reviews

Handbook of Statistical Analysis and Data Mining Applications 5 out of 5 based on 0 ratings. 2 reviews.
Ruben57 More than 1 year ago
This is an excellent book on data mining statistics and procedures - for both beginners and professionals in the field. The book comes with Statistica, SPSS, and SAS data mining software for use with the many tutorials given as practice. The book is thorough and easy to read and follow. The learn by doing approach is wonderful, and is much more effective (and a lot less boring!) than simply reading alone. Very highly recommended - you won't be disappointed.
JMH1944 More than 1 year ago
The "Handbook of Statistical Analysis & Data Mining Applications" is the finest book I have seen on the subject. It is not only a beautifully crafted book, with numerous color graphs, charts, tables, and screen shots, but the statistical discussion is both clear and comprehensive. The text does not use only one statistical data mining application to display examples, but provides a rather thorough training in the use of both SAS-Enterprise Miner and STATISTICA Data Miner. A section on SPSS Clementine is also provided, giving comparisons between the various packages. Also employed are STATISTICA's C&RT, CHAID, MARSpline, and other data mining and graphical analytic tools. The text does not burden the typical data mining researcher with the internals of how the various tools work. It is therefore not steeped in equations. Some are to be found, of course, but the emphasis is on understanding the concepts involved and on how to apply these concepts to real data - which is provided to the reader in terms of data tutorials. Specialized datasets have been prepared by both authors and outside experts in various areas of inquiry ranging from entertainment, financial, engineering, clinical psychology, dentistry, demographics, medical informatics, meteorology, astronomy, and more. Each tutorial is associated with data stored on either the associated CD that comes with the book, or which can be downloaded from a companion web site. Worked-out examples of how to use data mining techniques on such data are provided to help the reader gain a solid feel for the data mining enterprise. The final third of the book is devoted to a partial selection of the available tutorials. The two earlier chapters demonstrate how to use data mining software for the analysis of data. I highly recommend this work to anyone having an interest in data mining.
I might also add that the Barnes and Noble member price of $72 is truly excellent for an 864 page academic text, having full color tables and screen shots on some one-third of the pages, plus a CD. A bargain indeed.