Data Preparation for Data Mining Using SAS


Are you a data mining analyst who spends up to 80% of your time assuring data quality and then preparing that data for developing and deploying predictive models? Do you find plenty of literature on data mining theory and concepts, but little “how to” information when it comes to practical advice on developing good mining views? And are you, like most analysts, preparing the data in SAS?

This book is intended to fill this gap as your source of practical recipes. It introduces a framework for the process of data preparation for data mining, and presents the detailed implementation of each step in SAS. In addition, business applications of data mining modeling require you to deal with a large number of variables, typically hundreds if not thousands. Therefore, the book devotes several chapters to the methods of data transformation and variable selection.

• A complete framework for the data preparation process, including implementation details for each step.
• The complete SAS implementation code, which is readily usable by professional analysts and data miners.
• A unique and comprehensive approach for the treatment of missing values, optimal binning, and cardinality reduction.
• Assumes minimal proficiency in SAS and includes a quick-start chapter on writing SAS macros.
• CD includes dozens of SAS macros plus the sample data and the program for the book's case study.


Editorial Reviews

From the Publisher
It is easy to write books that address broad topics and ideas leaving the reader with the question “Yes, but how?” By combining a comprehensive guide to data preparation for data mining along with specific examples in SAS, Mamdouh's book is a rare find—a blend of theory and the practical at the same time. As anyone who has mined data will confess, 80% of the problem is in data preparation; Mamdouh addresses this difficult subject with strong practical techniques and methods.

If you are working on an SAS data mining project, this book is a must! If you are working on any data mining project, the techniques and methods will be a guiding light! —Frank Byrum, Chief Scientist, CorMine Intelligent Data, LLC


Meet the Author

Mamdouh Refaat is a data mining and business analytics consultant advising major organizations in North America and Europe. He has held several positions in consulting organizations and software vendors, including the director of consulting services at ANGOSS Software Corporation, a global data mining software and service provider. During his career, Mamdouh has managed numerous data mining consulting projects in marketing, CRM, and credit risk for Fortune 500 organizations in North America and Europe. In addition, he has delivered over 50 professional training courses in data mining and business analytics. Mamdouh holds a Ph.D. in Engineering from the University of Toronto, and an MBA from the University of Leeds.


Read an Excerpt

Data Preparation for Data Mining Using SAS

By Mamdouh Refaat


Copyright © 2007 Elsevier Inc.
All rights reserved.

ISBN: 978-0-08-049100-4

Chapter One

Introduction

1.1 The Data Mining Process

The procedure used to perform data mining modeling and analysis has undergone a long transformation from the domain of academic research to a systematic industrial process performed by business and quantitative analysts. Several methodologies have been proposed to cast the steps of developing and deploying data mining models into a standardized process.

This chapter summarizes the main features of these methodologies and highlights the role of the data preparation steps. Furthermore, it presents in detail the definition and contents of the mining view and the scoring view.

1.2 Methodologies of Data Mining

Different methodologies of data mining attempt to mold the activities the analyst performs in a typical data mining engagement into a set of logical steps or tasks. To date, two major methodologies dominate the practice of data mining: CRISP and SEMMA.

CRISP, which stands for Cross Industry Standard Process for data mining, is an initiative by a consortium of software vendors and industry users of data mining technology to standardize the data mining process. The original CRISP documents are available online. SEMMA, on the other hand, which stands for Sample, Explore, Modify, Model, Assess, has been championed by SAS Institute. SAS has launched a data mining software platform (SAS Enterprise Miner) that implements SEMMA. A complete description of SEMMA and SAS Enterprise Miner is available on the SAS web site.

In addition to SEMMA and CRISP, numerous other methodologies attempt to do the same thing, that is, to break the data mining process into a sequence of steps to be followed by analysts for the purpose of promoting best practices and standardizing the steps and results.

This book does not delve into the philosophical arguments about the advantages of each methodology. It extracts from them the basic steps to be performed in any data mining engagement to lay out a roadmap for the remaining chapters of this book.

All methodologies contain the following set of main tasks in one form or another.

1. Relevant data elements are extracted from a database or a data warehouse into one table containing all the variables needed for modeling. This table is commonly known as the mining view, or rollup file. In the case when the size of the data cannot be handled efficiently by available data modeling tools, which is frequently the case, sampling is used.

2. A set of data exploration steps are performed to gain some insight about the relationships among the data and to create a summary of the properties. This is known as EDA (Exploratory Data Analysis).

3. Based on the results of EDA, some transformation procedures are invoked to highlight and take advantage of the relationships among the variables in the planned models.

4. A set of data mining models are then developed using different techniques, depending on the objective of the exercise and the types of variables involved. Not all the available variables are used in the modeling phase. Therefore, a data reduction procedure is often invoked to select the most useful set of variables.

5. The data mining models are evaluated and the best performing model is selected according to some performance criteria.

6. The population of data intended for the application of the model is prepared in an identical process to that used in the preparation of the mining view to create what is known as the scoring view. The selected optimal (best) model is used to score the scoring view and produce the scores. These scores are used by the different business units to achieve the required business objective, such as selecting the targeted customers for marketing campaigns or to receive a loan or a credit card.

Typically, these steps are performed iteratively, and not necessarily in the presented linear order. For example, one might extract the mining view, perform EDA, build a set of models, and then, based on the evaluation results of these models, decide to introduce a set of transformations and data reduction steps in an attempt to improve the model performance.

Of the six steps in the data mining process, the data extraction and preparation steps could occupy up to 80% of the project time. In addition, to avoid a "garbage in, garbage out" situation, we have to make sure that we have extracted the right and most useful data. Therefore, data extraction and preparation should take priority in planning and executing data mining projects. Hence this book!

1.3 The Mining View

Most, if not all, data mining algorithms deal with the data in the form of a single matrix (a two-dimensional array). However, the raw data, which contains the information needed for modeling, is rarely stored in such form. Most data is stored in relational databases, where the data is scattered over several tables. Therefore, the first step in collecting the data is to roll up the different tables and aggregate the data to the required rectangular form in anticipation of using mining algorithms. This last table, with all the elements needed, or suspected to be needed, for the modeling work is known as the mining view, rollup file, or modeling table. The tools used to aggregate the data elements into the mining view are usually data management tools such as SQL queries, SAS procedures, or, in the case of legacy systems, custom programs (e.g., in C, C++, and Java).

The mining view is defined as the aggregated modeling data on the specified entity level. The data is assembled in the form of columns, with the entity being unique on the row level. The meaning of the entity in the preceding definition is related to the business objective of modeling. In most business applications, the entity level is the customer level. In this case, we assemble all the relevant data for each customer in the form of columns and ensure that each row represents a unique customer with all the data related to this customer included. Examples of customer-level mining views are customer acquisition, cross selling, customer retention, and customer lifetime value.

In other situations, the entity is defined as the transaction. For example, we try to create a fraud detection system, say for online credit card shopping. In this case, the entity level is the purchase transaction and not the customer. This is because we attempt to stop fraudulent transactions. Similarly, the entity level may be defined as the product level. This could be necessary, for example, in the case of segmentation modeling of products for a supermarket chain where hundreds, or even thousands, of products exist.
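As a rough illustration of the rollup idea described above, the following Python sketch aggregates several transaction rows per customer into one row per customer. It is not taken from the book, whose implementation is in SAS; the record layout, the field names (customer_id, amount), and the roll_up helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical transaction records: several rows per customer.
transactions = [
    {"customer_id": "C1", "amount": 20.0},
    {"customer_id": "C2", "amount": 5.0},
    {"customer_id": "C1", "amount": 30.0},
]


def roll_up(rows, entity_key="customer_id", value_key="amount"):
    """Aggregate many rows per entity into exactly one row per entity."""
    totals = defaultdict(lambda: {"n_txn": 0, "total_amount": 0.0})
    for row in rows:
        agg = totals[row[entity_key]]
        agg["n_txn"] += 1
        agg["total_amount"] += row[value_key]
    # One output row per entity, with the aggregates laid out as columns.
    return [
        {entity_key: key, **agg, "avg_amount": agg["total_amount"] / agg["n_txn"]}
        for key, agg in sorted(totals.items())
    ]


# Customer-level mining view: the customer is unique on the row level.
mining_view = roll_up(transactions)
```

Changing entity_key to a transaction or product identifier would give a view at that entity level instead, which is the choice the text ties to the business objective.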

The mining view usually undergoes a series of data cleaning and transformation steps before it is ready for use by the modeling algorithm. These operations achieve two purposes:

1. Clean the data of errors, missing values, and outliers.

2. Attempt to create new variables through a set of transformations, which could lead to better models.

Data errors and missing values always occur as a result of data collection and transformation from one data system to another. There are many techniques for cleaning the data from such errors and substituting or imputing the missing values.

Typically, the required data transformations are discovered over several iterations of data preparation, exploration, and pilot modeling. In other words, not all the needed data preparation steps are, or could be, known in advance. This is the nature of knowledge discovery in data mining modeling. However, once specific transformations have been established and tested for a particular dataset for a certain model, they must be recorded in order to be used again on the data to be scored by the model. This leads us to the next view: the scoring view.
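The point about recording transformations so they can be replayed on the data to be scored can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the book's SAS method: median imputation is used only as a stand-in transformation, and the function names and the income column are hypothetical.

```python
from statistics import median


def fit_imputation(rows, columns):
    """Learn one fill value per column (here: the median of observed values)."""
    params = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        params[col] = median(observed)
    return params


def apply_imputation(rows, params):
    """Return copies of the rows with missing values replaced by the recorded fill values."""
    return [
        {k: (params[k] if k in params and v is None else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical mining view with one missing value.
mining_rows = [{"income": 40.0}, {"income": None}, {"income": 60.0}, {"income": 50.0}]
recorded = fit_imputation(mining_rows, ["income"])  # estimated once, on the mining view

# Later, the recorded parameters are replayed unchanged on the data to be scored.
scoring_rows = [{"income": None}, {"income": 75.0}]
scored_input = apply_imputation(scoring_rows, recorded)
```

The fill value is estimated on the mining view only and stored; the scoring data never re-estimates it, so both views pass through the same transformation.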

1.4 The Scoring View

The scoring view is very similar to the mining view except that the dependent variable (variable to be predicted) is not included. The following are other differences between the mining view and the scoring view.

1. The scoring view is usually much larger than the mining view. The mining view is only a sample from the data population; the scoring view is the population itself. This has implications on the requirements of the hardware and software needed to manipulate the scoring view and perform the necessary transformations on it before using it in scoring.

2. The scoring view may contain only one record. This is the case of online scoring, in which one record is read at a time and its score is calculated. The mining view, for obvious reasons, must have many records to be useful in developing a model.

3. The variables needed to make the mining view are determined by attempting to collect all conceivable variables that may have association with the quantity being predicted or have a relationship to the problem being modeled. The scoring view, on the other hand, contains only the variables that were used to create the model. The model may contain derived and transformed variables. These variables must also be in the scoring view. It is expected, therefore, that the scoring view would have significantly fewer variables than the mining view.
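A small Python sketch may help make these differences concrete: a scoring record keeps only the variables the model uses, including a derived one, and a single record can be scored on its own. Everything here is an assumption made for illustration and not the book's code: the variable names, the logistic form of the model, and the coefficient values are invented.

```python
import math

# Hypothetical model definition: the inputs it needs and its coefficients.
MODEL_INPUTS = ["log_income", "n_txn"]
COEFFS = {"intercept": -1.0, "log_income": 0.5, "n_txn": 0.1}


def build_scoring_record(raw):
    """Keep only what the model uses, including the derived variable."""
    prepared = dict(raw)
    # Same derivation as the one assumed to have been applied to the mining view.
    prepared["log_income"] = math.log(raw["income"])
    return {name: prepared[name] for name in MODEL_INPUTS}


def score(record):
    """Logistic score for a single record, as in online scoring."""
    z = COEFFS["intercept"] + sum(COEFFS[n] * record[n] for n in MODEL_INPUTS)
    return 1.0 / (1.0 + math.exp(-z))


# One raw record with more fields than the model needs.
raw_record = {"customer_id": "C1", "income": 1.0, "n_txn": 10, "region": "N"}
record = build_scoring_record(raw_record)
p = score(record)
```

In practice the entity identifier would be carried alongside the score so it can be passed back to the business unit; it is dropped here only to show that unused variables do not need to enter the scoring view.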

The only special case in which the mining view becomes the scoring view as well is the development of time series models for forecasting. In this case, the mining view is used to fit the predictive model and simultaneously to predict future values, thus removing the distinction between the mining view and the scoring view. We do not deal with data preparation for time series modeling in this book.

The next chapter provides a more detailed description of both the mining view and the scoring view.

1.5 Notes on Data Mining Software

Many software packages are used to develop data mining models. The procedures developed in this text for data preparation are independent of the tool used for the actual model building. Some of these tools include data preparation capabilities, thus allowing analysts to perform functions similar to some of the procedures described in this book. Most analysts prefer to separate the procedures of data preparation and modeling. We have adopted this attitude by developing the procedures described in this book as SAS macros to be implemented independently of the modeling software.

However, the techniques and procedures described in the book could also be applied using many of the data manipulation capabilities of these modeling tools.

Chapter Two

Tasks and Data Flow

2.1 Data Mining Tasks

Data mining is often defined as a set of mathematical models and data manipulation techniques that perform functions aiming at the discovery of new knowledge in databases. The functions, or tasks, performed by these techniques can be classified in terms of either the analytical function they entail or their implementation focus. The first classification scheme takes the point of view of the data mining analyst. In this case, the analyst would classify the tasks on the basis of the problem type as one of the following.

1. Classification

In these problems, the operative is to assign each record in the database a particular class or a category label from a finite set of predefined class labels. For example, a bank would be interested in classifying each of its customers as potentially interested in a new credit card or not. All decisions involving Yes/No selection, such as classifying insurance claims according to the possibility of fraud, also belong to classification problems. Classification problems may involve three or more levels, such as "high," "medium," and "low." The main point is that the number of classes is finite. Note that there could be an implicit order relationship in the definition of the classes, such as "high," "medium," and "low."

2. Estimation

These problems are focused on estimating the unknown value of a continuous variable. For example, taxation authorities might be interested in estimating the real income of households. The number of possible outcomes of an estimation problem is infinite by definition.


Excerpted from Data Preparation for Data Mining Using SAS by Mamdouh Refaat. Copyright © 2007 by Elsevier Inc. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.


Table of Contents

1 Introduction
2 Tasks and Data Flow
3 Review of Data Mining Modeling Techniques
4 SAS Macros: A Quick Start
5 Data Acquisition and Integration
6 Integrity Checks
8 Sampling and Partitioning
9 Data Transformations
10 Binning and Reduction of Cardinality
11 Treatment of Missing Values
12 Predictive Power and Variable Reduction I
13 Analysis of Nominal and Ordinal Variables
14 Analysis of Continuous Variables
15 Principal Component Analysis (PCA)
16 Factor Analysis
17 Predictive Power and Variable Reduction II
18 Putting it All Together
A Listing of SAS Macros


Customer Reviews

Average Rating: 5 (1 review)
  • Anonymous

    Posted November 15, 2006

    The best data mining book so far

    I have been working in data mining and with SAS for the last 10 years. This is the best book without doubt. It is concise, to the point, without a lot of fluff and useless theory. It teaches you how to actually do it! The book took me step by step through the process of data preparation using SAS and let me write fantastic macros. All the macros are included on the CD and are ready to run. I strongly recommend this book to anyone who is using SAS to work the data either for reporting or for modeling. I attended many training courses on data mining, and even data preparation, but nothing is like this book. It reveals all the secrets. For example: how to bin variables using Gini; how to select the best modeling variables using Entropy, Gini, and Chi2; how to reduce the variables using principal component analysis; the treatment of missing values, which occupies a huge chapter that in my opinion has no competitors; mapping categorical variables into dummy variables; and reduction of cardinality using Gini (best grouping). All these things until now were the secrets of the 'gurus', not any more thanks to Dr. Refaat and his book! For example, I used to use a decision tree software to select the best variables, then use logistic regression to build the models. Not any more. With the SAS programs in the book, I can now select the best variables and build the model within one SAS script.... I only wish the author would also write a similar book on modeling... This book is a life saver ...

    1 out of 2 people found this review helpful.
