Discovering Knowledge in Data: An Introduction to Data Mining / Edition 1

by Daniel T. Larose

ISBN-10: 0471666572

ISBN-13: 2900471666577

Pub. Date: 11/28/2004

Publisher: Wiley

Overview

Larose (statistics, Central Connecticut State University) walks readers step by step through the various algorithms and statistical structures that underlie data mining software and presents examples of their operation on actual large data sets. Chapter exercises are included, and screenshots and diagrams encourage graphical learning. The text is appropriate for an introductory course in data mining for advanced undergraduate or graduate students. No previous work in calculus, statistics, computer programming, or databases is required. Annotation ©2004 Book News, Inc., Portland, OR

Product Details

ISBN-13: 2900471666577
Publisher: Wiley
Publication date: 11/28/2004
Edition description: Older Edition
Pages: 240

Table of Contents

Preface
1 Introduction to Data Mining
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP-DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP-DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
Description
Estimation
Prediction
Classification
Clustering
Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises
2 Data Preprocessing
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min-Max Normalization
Z-Score Standardization
Numerical Methods for Identifying Outliers
References
Exercises
3 Exploratory Data Analysis
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Binning
Summary
References
Exercises
4 Statistical Approaches to Estimation and Prediction
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
How Confident Are We in Our Estimates?
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Confidence Intervals for the Mean Value of y Given x
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
References
Exercises
5 k-Nearest Neighbor Algorithm
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias-Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Reference
Exercises
6 Decision Trees
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises
7 Neural Networks
Input and Output Encoding
Neural Networks for Estimation and Prediction
Simple Example of a Neural Network
Sigmoid Activation Function
Back-Propagation
Gradient Descent Method
Back-Propagation Rules
Example of Back-Propagation
Termination Criteria
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
References
Exercises
8 Hierarchical and k-Means Clustering
Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict Churn
References
Exercises
9 Kohonen Networks
Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Cluster Validity
Application of Clustering Using Kohonen Networks
Interpreting the Clusters
Cluster Profiles
Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises
10 Association Rules
Affinity Analysis and Market Basket Analysis
Data Representation for Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension from Flag Data to General Categorical Data
Information-Theoretic Approach: Generalized Rule Induction Method
J-Measure
Application of Generalized Rule Induction
When Not to Use Association Rules
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns versus Global Models
References
Exercises
11 Model Evaluation Techniques
Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Techniques for the Classification Task
Error Rate, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
Reference
Exercises
Epilogue: "We've Only Just Begun"
Index
