Handbook of Statistical Analysis and Data Mining Applications

The Handbook of Statistical Analysis and Data Mining Applications is a comprehensive professional reference book that guides business analysts, scientists, engineers and researchers (both academic and industrial) through all stages of data analysis, model building and implementation. The Handbook helps one discern the technical and business problem, understand the strengths and weaknesses of modern data mining algorithms, and employ the right statistical methods for practical application. Use this book to address massive and complex datasets with novel statistical approaches and be able to objectively evaluate analyses and solutions. It has clear, intuitive explanations of the principles and tools for solving problems using modern analytic techniques, and discusses their application to real problems, in ways accessible and beneficial to practitioners across industries - from science and engineering, to medicine, academia and commerce. This handbook brings together, in a single resource, all the information a beginner will need to understand the tools and issues in data mining to build successful data mining solutions.

- Written "By Practitioners for Practitioners"
- Non-technical explanations build understanding without jargon and equations
- Tutorials in numerous fields of study provide step-by-step instruction on how to use supplied tools to build models
- Practical advice from successful real-world implementations
- Includes extensive case studies, examples, MS PowerPoint slides and datasets
- CD-DVD with valuable fully-working 90-day software included: "Complete Data Miner - QC-Miner - Text Miner" bound with book


Product Details

ISBN-13:	9780080912035
Publisher:	Elsevier Science & Technology Books
Publication date:	05/14/2009
Sold by:	Barnes & Noble
Format:	eBook
Pages:	864
File size:	182 MB
Note:	This product may take a few minutes to download.

About the Author

Bob Nisbet, PhD, is a Data Scientist, currently modeling precancerous colon polyp presence with clinical data at the UC-Irvine Medical Center. He has experience in predictive modeling in Telecommunications, Insurance, Credit, and Banking. His academic experience includes teaching in Ecology and in Data Science. His industrial experience includes predictive modeling at AT&T, NCR, and FICO. He has also worked in the Insurance, Credit, membership organization (e.g., AAA), Education, and Health Care industries. He retired as an Assistant Vice President of Santa Barbara Bank & Trust in charge of business intelligence reporting and customer relationship management (CRM) modeling.
Dr. John Elder heads the United States’ leading data mining consulting team, with offices in Charlottesville, Virginia; Washington, D.C.; and Baltimore, Maryland (www.datamininglab.com). Founded in 1995, Elder Research, Inc. focuses on investment, commercial, and security applications of advanced analytics, including text mining, image recognition, process optimization, cross-selling, biometrics, drug efficacy, credit scoring, market sector timing, and fraud detection. John obtained a B.S. and an M.E.E. in electrical engineering from Rice University and a Ph.D. in systems engineering from the University of Virginia, where he’s an adjunct professor teaching Optimization or Data Mining. Prior to 16 years at ERI, he spent five years in aerospace defense consulting, four years heading research at an investment management firm, and two years in Rice's Computational & Applied Mathematics Department.
Dr. Gary Miner received a B.S. from Hamline University, St. Paul, MN, with biology, chemistry, and education majors; an M.S. in zoology and population genetics from the University of Wyoming; and a Ph.D. in biochemical genetics from the University of Kansas as the recipient of a NASA pre-doctoral fellowship. He pursued additional National Institutes of Health postdoctoral studies at the U of Minnesota and U of Iowa, eventually becoming immersed in the study of affective disorders and Alzheimer's disease. In 1985, he and his wife, Dr. Linda Winters-Miner, founded the Familial Alzheimer's Disease Research Foundation, which became a leading force in organizing both local and international scientific meetings, bringing together all the leaders in the field of genetics of Alzheimer's from several countries, resulting in the first major book on the genetics of Alzheimer’s disease. In the mid-1990s, Dr. Miner turned his data analysis interests to the business world, joining the team at StatSoft and deciding to specialize in data mining. He started developing what eventually became the Handbook of Statistical Analysis and Data Mining Applications (co-authored with Drs. Robert A. Nisbet and John Elder), which received the 2009 American Publishers Award for Professional and Scholarly Excellence (PROSE). Their follow-up collaboration, Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, also received a PROSE award in February of 2013. Gary was also co-author of Practical Predictive Analytics and Decisioning Systems for Medicine (Academic Press, 2015).
Overall, Dr. Miner’s career has focused on medicine and health issues, and the use of data analytics (statistics and predictive analytics) in analyzing medical data to decipher fact from fiction. Gary has also served as Merit Reviewer for PCORI (Patient Centered Outcomes Research Institute), which awards grants for predictive analytics research into the comparative effectiveness and heterogeneous treatment effects of medical interventions, including drugs, among different genetic groups of patients; additionally, he teaches on-line classes in ‘Introduction to Predictive Analytics’, ‘Text Analytics’, ‘Risk Analytics’, and ‘Healthcare Predictive Analytics’ for the University of California-Irvine. Recently, until ‘official retirement’ 18 months ago, he spent most of his time in his primary role as Senior Analyst-Healthcare Applications Specialist for Dell | Information Management Group, Dell Software (through Dell’s acquisition of StatSoft (www.StatSoft.com) in April 2014). Currently Gary is working on two new short popular books on ‘Healthcare Solutions for the USA’ and ‘Patient-Doctor Genomics Stories’.

Read an Excerpt

HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS

By Robert Nisbet, John Elder, and Gary Miner

Academic Press

Copyright © 2009 Elsevier Inc.
All rights reserved.
ISBN: 978-0-08-091203-5

Chapter One

The Background for Data Mining Practice

OUTLINE
Preamble
A Short History of Statistics and Data Mining
Modern Statistics: A Duality?
Two Views of Reality
The Rise of Modern Statistical Analysis: The Second Generation
Machine Learning Methods: The Third Generation
Statistical Learning Theory: The Fourth Generation
Postscript

PREAMBLE

You must be interested in learning how to practice data mining; otherwise, you would not be reading this book. We know that there are many books available that will give a good introduction to the process of data mining. Most books on data mining focus on the features and functions of various data mining tools or algorithms. Some books do focus on the challenges of performing data mining tasks. This book is designed to give you an introduction to the practice of data mining in the real world of business.

One of the first things considered in building a business data mining capability in a company is the selection of the data mining tool. It is difficult to penetrate the hype erected around the description of these tools by the vendors. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost effective in your organization than a more expensive tool. How do you choose your data mining tool? Few reviews are available. The best listing of tools by popularity is maintained and updated yearly by KDNuggets.com. Some detailed reviews available in the literature go beyond just a discussion of the features and functions of the tools (see Nisbet, 2006, Parts 1–3). The interest in an unbiased and detailed comparison is great. We are told the "most downloaded document in data mining" is the comprehensive but decade-old tool review by Elder and Abbott (1998).

The other considerations in building a business's data mining capability are forming the data mining team, building the data mining platform, and forming a foundation of good data mining practice. This book will not discuss the building of the data mining platform. This subject is discussed in many other books, some in great detail. A good overview of how to build a data mining platform is presented in Data Mining: Concepts and Techniques (Han and Kamber, 2006). The primary focus of this book is to present a practical approach to building cost-effective data mining models aimed at increasing company profitability, using tutorials and demo versions of common data mining tools.

Just as important as these considerations in practice is the background against which they must be performed. We must not imagine that the background doesn't matter ... it does matter, whether or not we recognize it initially. The reason it matters is that the capabilities of statistical and data mining methodology were not developed in a vacuum. Analytical methodology was developed in the context of prevailing statistical and analytical theory. But the major driver in this development was a very pressing need to provide a simple and repeatable analysis methodology in medical science. From this beginning developed modern statistical analysis and data mining. To understand the strengths and limitations of this body of methodology and use it effectively, we must understand the strengths and limitations of the statistical theory from which they developed. This theory was developed by scientists and mathematicians who "thought" it out. But this thinking was not one sided or unidirectional; there arose several views on how to solve analytical problems. To understand how to approach the solving of an analytical problem, we must understand the different ways different people tend to think. This history of statistical theory behind the development of various statistical techniques bears strongly on the ability of the technique to serve the tasks of a data mining project.

A SHORT HISTORY OF STATISTICS AND DATA MINING

Analysis of patterns in data is not new. The concepts of average and grouping can be dated back to the 6th century BC in Ancient China, following the invention of the bamboo rod abacus (Goodman, 1968). In Ancient China and Greece, statistics were gathered to help heads of state govern their countries in fiscal and military matters. (This makes you wonder if the words statistic and state might have sprung from the same root.) In the sixteenth and seventeenth centuries, games of chance were popular among the wealthy, prompting many questions about probability to be addressed to famous mathematicians (Fermat, Leibnitz, etc.). These questions led to much research in mathematics and statistics during the ensuing years.

MODERN STATISTICS: A DUALITY?

Two branches of statistical analysis developed in the eighteenth century: Bayesian and classical statistics. (See Figure 1.1.) To treat both fairly in the context of history, we will consider both in the First Generation of statistical analysis. For the Bayesians, the probability of an event's occurrence is equal to the probability of its past occurrence times the likelihood of its occurrence in the future. Analysis proceeds based on the concept of conditional probability: the probability of an event occurring given that another event has already occurred. Bayesian analysis begins with the quantification of the investigator's existing state of knowledge, beliefs, and assumptions. These subjective priors are combined with observed data quantified probabilistically through an objective function of some sort. The classical statistical approach (that flowed out of mathematical works of Gauss and Laplace) considered that the joint probability, rather than the conditional probability, was the appropriate basis for analysis. The joint probability function expresses the probability that X takes the specific value x and, simultaneously, Y takes the value y, as a function of x and y.
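To make the contrast between conditional and joint probability concrete, here is a minimal Python sketch. It is not taken from the book: the function names and every number in it are hypothetical, chosen only to show how two investigators who start from different subjective priors reach different conditional (posterior) probabilities from the same evidence.

```python
def joint_probability(prior: float, likelihood: float) -> float:
    """P(H and E) = P(E | H) * P(H): hypothesis and evidence both occur."""
    return likelihood * prior


def posterior_probability(prior: float, likelihood: float,
                          likelihood_if_false: float) -> float:
    """P(H | E) by Bayes' rule for a yes/no hypothesis H and evidence E."""
    # Total probability of observing the evidence, whether or not H holds.
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return joint_probability(prior, likelihood) / evidence


# Hypothetical inputs: same evidence, two different subjective priors.
skeptic = posterior_probability(prior=0.01, likelihood=0.9,
                                likelihood_if_false=0.05)
believer = posterior_probability(prior=0.10, likelihood=0.9,
                                 likelihood_if_false=0.05)
print(round(skeptic, 3), round(believer, 3))
```

The joint probability is a single number for the pair (H, E), while the conditional probability rescales it by how likely the evidence is overall, which is where the prior enters.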

Interest in probability picked up early among biologists following Mendel in the latter part of the nineteenth century. Sir Francis Galton, founder of the School of Eugenics in England, and his successor, Karl Pearson, developed the concepts of regression and correlation for analyzing genetic data. Later, Pearson and colleagues extended their work to the social sciences. Following Pearson, Sir R. A. Fisher in England developed his system for inference testing in medical studies based on his concept of standard deviation. While the development of probability theory flowed out of the work of Galton and Pearson, early predictive methods followed Bayes's approach. Bayesian approaches to inference testing could lead to widely different conclusions by different medical investigators because they used different sets of subjective priors. Fisher's goal in developing his system of statistical inference was to provide medical investigators with a common set of tools for use in comparison studies of effects of different treatments by different investigators. But to make his system work even with large samples, Fisher had to make a number of assumptions to define his "Parametric Model."

Assumptions of the Parametric Model

1. Data Fits a Known Distribution (e.g., Normal, Logistic, Poisson, etc.)

Fisher's early work was based on calculation of the parameter standard deviation, which assumes that data are distributed in a normal distribution. The normal distribution is bell-shaped, with the mean (average) at the top of the bell, with "tails" falling off evenly at the sides. Standard deviation is, loosely, the "average" deviation of a value from the mean: the deviations are squared, summed, and averaged, and the square root of the result is taken. In this calculation, however, averaging is accomplished by dividing the sum of the squared deviations by the total – 1. This subtraction expresses (to some extent) the increased uncertainty of the result due to grouping (summing the squared deviations around a mean estimated from the same data). Subsequent developments used modified parameters based on the logistic or Poisson distributions. The assumption of a particular known distribution is necessary in order to draw upon the characteristics of the distribution function for making inferences. All of these parametric methods run the gauntlet of dangers related to force-fitting data from the real world into a mathematical construct that does not fit.
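The arithmetic behind the standard deviation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's own material: the data values are made up, and the sketch uses the usual definition based on squared deviations, with the mean absolute deviation shown alongside for comparison.

```python
import math
import statistics

# Hypothetical sample; any list of at least two numbers would do.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean.
sum_sq = sum((v - mean) ** 2 for v in values)

# Sample standard deviation: divide by (n - 1), then take the square root.
sample_sd = math.sqrt(sum_sq / (n - 1))

# Dividing by n instead gives the population standard deviation.
population_sd = math.sqrt(sum_sq / n)

# Mean absolute deviation: a different measure of spread, for comparison.
mean_abs_dev = sum(abs(v - mean) for v in values) / n

# The standard library agrees with the hand calculation.
assert math.isclose(sample_sd, statistics.stdev(values))
print(sample_sd, population_sd, mean_abs_dev)
```

Dividing by the total minus 1 makes the sample figure slightly larger than the population figure, and the gap shrinks as the sample grows.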

2. Factor Independency

In parametric predictive systems, the variable to be predicted (Y) is considered as a function of predictor variables (X's) that are assumed to have independent effects on Y. That is, the effect on Y of each X-variable is not dependent on effects on Y of any other X-variable. This situation could be created in the laboratory by allowing only one factor (e.g., a treatment) to vary, while keeping all other factors constant (e.g., temperature, moisture, light, etc.). But, in the real world, such laboratory control is absent. As a result, some factors that do affect other factors are permitted to have a joint effect on Y. This problem is called collinearity. When it occurs between more than two factors, it is termed multicollinearity. The multicollinearity problem led statisticians to use an interaction term in the relationship that supposedly represented the combined effects. Use of this interaction term functioned as a magnificent kluge, and the reality of its effects was seldom analyzed. Later development included a number of interaction terms, one for each interaction the investigator might be presenting.
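One simple way to look for collinearity before modeling is to compute the correlation between pairs of predictors. The sketch below is a hedged illustration, not the book's procedure: the predictor names echo the laboratory example above, but the values, the `pearson_r` helper, and the 0.9 warning threshold are all hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical field measurements (not laboratory-controlled).
temperature = [10, 15, 20, 25, 30, 35]
moisture = [80, 72, 65, 55, 48, 40]   # falls steadily as temperature rises
light = [5, 9, 4, 8, 6, 7]            # varies with no clear pattern

r_temp_moisture = pearson_r(temperature, moisture)
r_temp_light = pearson_r(temperature, light)

THRESHOLD = 0.9  # assumed cutoff for flagging a pair; adjust to taste
for name, r in [("temperature/moisture", r_temp_moisture),
                ("temperature/light", r_temp_light)]:
    flag = "possible collinearity" if abs(r) > THRESHOLD else "ok"
    print(f"{name}: r = {r:.3f} ({flag})")
```

Pairwise correlation only catches relationships between two predictors at a time; dependence that involves three or more predictors together needs other diagnostics.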

3. Linear Additivity

Not only must the X-variables be independent, their effects on Y must be cumulative and linear. That means the effect of each factor is added to or subtracted from the combined effects of all X-variables on Y. But what if the relationship between Y and the predictors (X-variables) is not additive, but multiplicative or divisive? Such functions can be expressed only by exponential equations that usually generate very nonlinear relationships. Assumption of linear additivity for these relationships may cause large errors in the predicted outputs. This is often the case with their use in business data systems.
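A small Python sketch can show what goes wrong when a multiplicative relationship is treated as additive. The model `y = 3 * x1 * x2` and its constant are hypothetical, and the log transform at the end is a common workaround added here for illustration; the text above does not prescribe it.

```python
import math


def y_multiplicative(x1: float, x2: float) -> float:
    """Hypothetical response in which the predictors multiply."""
    return 3.0 * x1 * x2


# Effect on y of raising x1 from 1 to 2, measured at two levels of x2.
effect_at_low_x2 = y_multiplicative(2, 1) - y_multiplicative(1, 1)
effect_at_high_x2 = y_multiplicative(2, 10) - y_multiplicative(1, 10)

# Under linear additivity these two effects would be identical.
print(effect_at_low_x2, effect_at_high_x2)

# On a log scale the same change in x1 has a constant effect, because
# log(y) = log(3) + log(x1) + log(x2) is additive in the logged predictors.
log_effect_low = (math.log(y_multiplicative(2, 1))
                  - math.log(y_multiplicative(1, 1)))
log_effect_high = (math.log(y_multiplicative(2, 10))
                   - math.log(y_multiplicative(1, 10)))
print(math.isclose(log_effect_low, log_effect_high))
```

The tenfold difference between the two raw effects is the kind of error an additive model would absorb into its predictions. The transform only applies when all values are positive.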

4. Constant Variance (Homoscedasticity)

The variance throughout the range of each variable is assumed to be constant. This means that if you divided the range of a variable into bins, the variance across all records for bin #1 is the same as the variance for all the other bins in the range of that variable. If the variance throughout the range of a variable differs significantly from constancy, it is said to be heteroscedastic. The error in the predicted value caused by the combined heteroscedasticity among all variables can be quite significant.
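The binning check described above can be sketched directly. This is a minimal illustration with made-up data; the helper name `variance_by_bin`, the equal-width binning, and the choice of three bins are assumptions, and the sketch expects at least two records in every bin and a variable that is not constant.

```python
import statistics


def variance_by_bin(xs, ys, n_bins):
    """Split the range of xs into equal-width bins; return the sample
    variance of the matching ys within each bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        # Clamp so the maximum x falls in the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        bins[index].append(y)
    return [statistics.variance(b) for b in bins]


xs = list(range(1, 13))
signs = [1 if i % 2 == 0 else -1 for i in range(len(xs))]

# Hypothetical values whose spread is the same everywhere.
steady = [float(s) for s in signs]
# Hypothetical values whose spread grows with x.
growing = [float(s * x) for s, x in zip(signs, xs)]

steady_variances = variance_by_bin(xs, steady, n_bins=3)
growing_variances = variance_by_bin(xs, growing, n_bins=3)
print(steady_variances)   # roughly equal across bins
print(growing_variances)  # increasing from bin to bin
```

Roughly equal per-bin variances are consistent with homoscedasticity; a clear trend across bins, as in the second series, suggests heteroscedasticity.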

5. Variables Must Be Numerical and Continuous

The assumption that variables must be numerical and continuous means that data must be numeric (or it must be transformable to a number before analysis) and the number must be part of a distribution that is inherently continuous. Integer values in a string are not continuous; they are discrete. Classical parametric statistical methods are not valid for use with discrete data, because the probability distributions for continuous and discrete data are different. But both scientists and business analysts have used them anyway.

In his landmark paper, Fisher (1921; see Figure 1.2) began with the broad definition of probability as the intrinsic probability of an event's occurrence divided by the probability of occurrence of all competing events (very Bayesian). By the end of his paper, Fisher modified his definition of probability for use in medical analysis (the goal of his research) as the intrinsic probability of an event's occurrence, period. He named this quantity likelihood. From that foundation, he developed the concepts of standard deviation based on the normal distribution. Those who followed Fisher began to refer to likelihood as probability. The concept of likelihood approaches the classical concept of probability only as the sample size becomes very large and the effects of subjective priors approach zero (von Mises, 1957). In practice, these two conditions may be satisfied sufficiently if the initial distribution of the data is known and the sample size is relatively large (following the Law of Large Numbers).

Why did this duality of thought arise in the development of statistics? Perhaps it is because of the broader duality that pervades all of human thinking. This duality can be traced all the way back to the ancient debate between Plato and Aristotle.

TWO VIEWS OF REALITY

Whenever we consider solving a problem or answering a question, we start by conceptualizing it. That means we do one of two things: (1) try to reduce it to key elements or (2) try to conceive of it in general terms. We call people who take each of these approaches "detail people" and "big picture people," respectively. What we don't consider is that this distinction has its roots deep in Greek philosophy in the works of Aristotle and Plato.

Aristotle

Aristotle (Figure 1.3) believed that the true being of things (reality) could be discerned only by what the eye could see, the hand could touch, etc. He believed that the highest level of intellectual activity was the detailed study of the tangible world around us. Only in that way could we understand reality. Based on this approach to truth, Aristotle was led to believe that you could break down a complex system into pieces, describe the pieces in detail, put the pieces together and understand the whole. For Aristotle, the "whole" was equal to the sum of its parts. This nature of the whole was viewed by Aristotle in a manner that was very machine-like.

Science gravitated toward Aristotle very early. The nature of the world around us was studied by looking very closely at the physical elements and biological units (species) that composed it. As our understanding of the natural world matured into the concept of the ecosystem, it was discovered that many characteristics of ecosystems could not be explained by traditional (Aristotelian) approaches. For example, in the science of forestry, we discovered that when a tropical rain forest is cut down on the periphery of its range, it may take a very long time to regenerate (if it does at all). We learned that the reason for this is that in areas of relative stress (e.g., peripheral areas), the primary characteristics necessary for the survival and growth of tropical trees are maintained by the forest itself! High rainfall leaches nutrients down beyond the reach of the tree roots, so almost all of the nutrients for tree growth must come from recently fallen leaves and branches. When you cut down the forest, you remove that source of nutrients. The forest canopy also maintains favorable conditions of light, moisture, and temperature required by the trees. Removing the forest removes the very factors necessary for it to exist at all in that location. These factors emerge only when the system is whole and functioning. Many complex systems are like that, even business systems. In fact, these emergent properties may be the major drivers of system stability and predictability.

To understand the failure of Aristotelian philosophy for completely defining the world, we must return to Ancient Greece and consider Aristotle's rival, Plato.

Plato

Plato (Figure 1.4) was Aristotle's teacher for 20 years, and they both agreed to disagree on the nature of being. While Aristotle focused on describing tangible things in the world by detailed studies, Plato focused on the world of ideas that lay behind these tangibles. For Plato, the only thing that had lasting being was an idea. He believed that the most important things in human existence were beyond what the eye could see and the hand could touch. Plato believed that the influence of ideas transcended the world of tangible things that commanded so much of Aristotle's interest. For Plato, the "whole" of reality was greater than the sum of its tangible parts.

(Continues...)

Excerpted from HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS by Robert Nisbet, John Elder, and Gary Miner. Copyright © 2009 by Elsevier Inc. Excerpted by permission of Academic Press. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

PART I: History of Phases of Data Analysis, Basic Theory, and the Data Mining Process
1. History – The Phases of Data Analysis throughout the Ages
2. Theory
3. The Data Mining Process
4. Data Understanding and Preparation
5. Feature Selection – Selecting the Best Variables
6. Accessory Tools and Advanced Features in Data

PART II: The Algorithms in Data Mining and Text Mining, and the Organization of the Three Most Common Data Mining Tools
7. Basic Algorithms
8. Advanced Algorithms
9. Text Mining
10. Organization of 3 Leading Data Mining Tools
11. Classification Trees = Decision Trees
12. Numerical Prediction (Neural Nets and GLM)
13. Model Evaluation and Enhancement
14. Medical Informatics
15. Bioinformatics
16. Customer Response Models
17. Fraud Detection

PART III: Tutorials - Step-by-Step Case Studies as a Starting Point to Learn How to Do Data Mining Analyses
Listing of Guest Authors of the Tutorials
Tutorials within the book pages:
- How to use the DMRecipe
- Aviation Safety using DMRecipe
- Movie Box-Office Hit Prediction using SPSS CLEMENTINE
- Bank Financial data – using SAS-EM
- Credit Scoring
- CRM Retention using CLEMENTINE
- Automobile – Cars – Text Mining
- Quality Control using Data Mining
Three integrated tutorials from different domains, but all using C&RT to predict and display possible structural relationships among data:
- Business Administration in a Medical Industry
- Clinical Psychology – Finding Predictors of Correct Diagnosis
- Education – Leadership Training: for Business and Education
Additional tutorials are available either on the accompanying CD-DVD, or the Elsevier Web site for this book
Listing of Tutorials on Accompanying CD

PART IV: Paradox of Complex Models; using the "right model for the right use", on-going development, and the Future
18. Paradox of Ensembles and Complexity
19. The Right Model for the Right Use
20. The Top 10 Data Mining Mistakes
21. Prospect for the Future – Developing Areas in Data Mining
22. Summary

GLOSSARY of STATISTICAL and DATA MINING TERMS
INDEX
CD – With Additional Tutorials, data sets, Power Points, and Data Mining software (STATISTICA Data Miner & Text Miner & QC-Miner – 90 day free trial)

What People are Saying About This

From the Publisher

The essential professional reference for data mining applications and statistical analysis
