Data Mining: Concepts, Models, Methods, and Algorithms

Presents the latest techniques for analyzing and extracting information from large amounts of data in high-dimensional data spaces

The revised and updated third edition of Data Mining provides, in one volume, an introduction to a systematic approach to the analysis of large data sets that integrates results from disciplines such as statistics, artificial intelligence, databases, pattern recognition, and computer visualization. Advances in deep learning technology have opened an entirely new spectrum of applications. The author, a noted expert on the topic, explains the basic concepts, models, and methodologies that have been developed in recent years.

This new edition introduces and expands on many topics and provides revised sections on software tools and data mining applications. Additional changes include an updated list of references for further study and an extended list of problems and questions that relate to each chapter. This third edition presents new and expanded information that:

• Explores big data and cloud computing

• Examines deep learning

• Includes information on convolutional neural networks (CNN)

• Covers reinforcement learning

• Covers semi-supervised learning and S3VM

• Reviews model evaluation for unbalanced data

Written for graduate students in computer science, computer engineers, and computer information systems professionals, the updated third edition of Data Mining continues to provide an essential guide to the basic principles of the technology and the most recent developments in the field.

Hardcover (3rd ed.)

$146.75

Product Details

ISBN-13:	9781119516040
Publisher:	Wiley
Publication date:	11/12/2019
Edition description:	3rd ed.
Pages:	672
Product dimensions:	6.10(w) x 9.10(h) x 1.20(d)

About the Author

MEHMED KANTARDZIC, PHD, is a Professor in the Department of Computer Engineering and Computer Science (CECS) at the University of Louisville, and is Director of the Data Mining Lab and CECS Graduate Programs. He is a member of IEEE, ISCA, KAS, WSEAS, IEE, and SPIE.

Read an Excerpt

Data Mining

Concepts, Models, Methods, and Algorithms

By Mehmed Kantardzic

John Wiley & Sons

ISBN: 0-471-22852-4

Chapter One

Data-Mining Concepts

CHAPTER OBJECTIVES

Understand the need for analyses of large, complex, information-rich data sets.

Identify the goals and primary tasks of the data-mining process.

Describe the roots of data-mining technology.

Recognize the iterative character of a data-mining process and specify its basic steps.

Explain the influence of data quality on a data-mining process.

Establish the relation between data warehousing and data mining.

1.1 INTRODUCTION

Modern science and engineering are based on using first-principle models to describe physical, biological, and social systems. Such an approach starts with a basic scientific model, such as Newton's laws of motion or Maxwell's equations in electromagnetism, and then builds upon them various applications in mechanical engineering or electrical engineering. In this approach, experimental data are used to verify the underlying first-principle models and to estimate some of the parameters that are difficult or sometimes impossible to measure directly. However, in many domains the underlying first principles are unknown, or the systems under study are too complex to be mathematically formalized. With the growing use of computers, there is a great amount of data being generated by such systems. In the absence of first-principle models, such readily available data can be used to derive models by estimating useful relationships between a system's variables (i.e., unknown input-output dependencies). Thus there is currently a paradigm shift from classical modeling and analyses based on first principles to developing models and the corresponding analyses directly from data.

We have gradually grown accustomed to the fact that there are tremendous volumes of data filling our computers, networks, and lives. Government agencies, scientific institutions, and businesses have all dedicated enormous resources to collecting and storing data. In reality, only a small fraction of these data will ever be used because, in many cases, the volumes are simply too large to manage, or the data structures themselves are too complicated to be analyzed effectively. How could this happen? The primary reason is that the original effort to create a data set is often focused on issues such as storage efficiency; it does not include a plan for how the data will eventually be used and analyzed.

The need to understand large, complex, information-rich data sets is common to virtually all fields of business, science, and engineering. In the business world, corporate and customer data are becoming recognized as a strategic asset. The ability to extract useful knowledge hidden in these data and to act on that knowledge is becoming increasingly important in today's competitive world. The entire process of applying a computer-based methodology, including new techniques, for discovering knowledge from data is called data mining.

Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.

In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data-mining activities into one of two categories:

1) Predictive data mining, which produces the model of the system described by the given data set, or

2) Descriptive data mining, which produces new, nontrivial information based on the available data set.

On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed as executable code, which can be used to perform classification, prediction, estimation, or other similar tasks. On the other, descriptive, end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description are achieved by using data-mining techniques, explained later in this book, for the following primary data-mining tasks:

1. Classification - discovery of a predictive learning function that classifies a data item into one of several predefined classes.

2. Regression - discovery of a predictive learning function that maps a data item to a real-valued prediction variable.

3. Clustering - a common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.

4. Summarization - an additional descriptive task that involves methods for finding a compact description for a set (or subset) of data.

5. Dependency Modeling - finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.

6. Change and Deviation Detection - discovering the most significant changes in the data set.
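
As a rough illustration of how some of these tasks differ in practice, the following minimal Python sketch implements one deliberately simple technique for each of four tasks: a nearest-centroid rule for classification, a least-squares line for regression, basic statistics for summarization, and a z-score threshold for deviation detection. The data, function names, and the choice of techniques are assumptions made for illustration only; they are not the methods presented later in the book.

```python
from statistics import mean, pstdev

# Hypothetical toy data: (feature value, class label) pairs.
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
           (8.0, "high"), (9.0, "high"), (10.0, "high")]


def fit_nearest_centroid(samples):
    """Classification: learn one centroid per predefined class."""
    by_class = {}
    for x, label in samples:
        by_class.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_class.items()}


def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def fit_line(xs, ys):
    """Regression: least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_xx = sum((x - mx) ** 2 for x in xs)
    b = s_xy / s_xx
    return my - b * mx, b


def summarize(xs):
    """Summarization: a compact description of a set of values."""
    return {"n": len(xs), "min": min(xs), "max": max(xs), "mean": mean(xs)}


def deviations(xs, threshold=2.0):
    """Deviation detection: values more than `threshold` std devs from the mean."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        return []
    return [x for x in xs if abs(x - m) / s > threshold]


centroids = fit_nearest_centroid(labeled)
print(classify(centroids, 2.5))                      # predictive: class label
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))          # predictive: (intercept, slope)
print(summarize([x for x, _ in labeled]))            # descriptive: compact summary
print(deviations([10, 11, 9, 10, 12, 10, 11, 50]))   # descriptive: unusual values
```

The first two functions return something that can be applied to new data items, which is what makes them predictive; the last two only describe the data already at hand.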

The more formal approach, with graphical interpretation of data-mining tasks for complex and large data sets and illustrative examples, is given in Chapter 4. Current introductory classifications and definitions are given here only to give the reader a feeling of the wide spectrum of problems and tasks that may be solved using data-mining technology.

The success of a data-mining engagement depends largely on the amount of energy, knowledge, and creativity that the designer puts into it. In essence, data mining is like solving a puzzle. The individual pieces of the puzzle are not complex structures in and of themselves. Taken as a collective whole, however, they can constitute very elaborate systems. As you try to unravel these systems, you will probably get frustrated, start forcing parts together, and generally become annoyed at the entire process; but once you know how to work with the pieces, you realize that it was not really that hard in the first place. The same analogy can be applied to data mining. In the beginning, the designers of the data-mining process probably do not know much about the data sources; if they did, they would most likely not be interested in performing data mining. Individually, the data seem simple, complete, and explainable. But collectively, they take on a whole new appearance that is intimidating and difficult to comprehend, like the puzzle. Therefore, being an analyst and designer in a data-mining process requires, besides thorough professional knowledge, creative thinking and a willingness to see problems in a different light.

Data mining is one of the fastest growing fields in the computer industry. Once a small interest area within computer science and statistics, it has quickly expanded into a field of its own. One of the greatest strengths of data mining is reflected in its wide range of methodologies and techniques that can be applied to a host of problem sets. Since data mining is a natural activity to be performed on large data sets, one of the largest target markets is the entire data warehousing, data-mart, and decision-support community, encompassing professionals from such industries as retail, manufacturing, telecommunications, healthcare, insurance, and transportation. In the business community, data mining can be used to discover new purchasing trends, plan investment strategies, and detect unauthorized expenditures in the accounting system. It can improve marketing campaigns and the outcomes can be used to provide customers with more focused support and attention. Data-mining techniques can be applied to problems of business process reengineering, in which the goal is to understand interactions and relationships among business practices and organizations.

Many law enforcement and special investigative units, whose mission is to identify fraudulent activities and discover crime trends, have also used data mining successfully. For example, these methodologies can aid analysts in the identification of critical behavior patterns in the communication interactions of narcotics organizations, the monetary transactions of money laundering and insider trading operations, the movements of serial killers, and the targeting of smugglers at border crossings. Data-mining techniques have also been employed by people in the intelligence community who maintain many large data sources as a part of the activities relating to matters of national security. Appendix B of the book gives a brief overview of typical commercial applications of data-mining technology today.

1.2 DATA-MINING ROOTS

Looking at how different authors describe data mining, it is clear that we are far from a universal agreement on the definition of data mining or even what constitutes data mining. Is data mining a form of statistics enriched with learning theory or is it a revolutionary new concept? In our view, most data-mining problems and corresponding solutions have roots in classical data analysis. Data mining has its origins in various disciplines, of which the two most important are statistics and machine learning. Statistics has its roots in mathematics, and therefore, there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before testing it in practice. In contrast, the machine-learning community has its origins very much in computer practice. This has led to a practical orientation, a willingness to test something out to see how well it performs, without waiting for a formal proof of effectiveness.

If the place given to mathematics and formalizations is one of the major differences between statistical and machine-learning approaches to data mining, another is in the relative emphasis they give to models and algorithms. Modern statistics is almost entirely driven by the notion of a model. This is a postulated structure, or an approximation to a structure, which could have led to the data. In place of the statistical emphasis on models, machine learning tends to emphasize algorithms. This is hardly surprising; the very word "learning" contains the notion of a process, an implicit algorithm.

Basic modeling principles in data mining also have roots in control theory, which is primarily applied to engineering systems and industrial processes. The problem of determining a mathematical model for an unknown system (also referred to as the target system) by observing its input-output data pairs is generally referred to as system identification. The purposes of system identification are multiple and, from a standpoint of data mining, the most important are to predict a system's behavior and to explain the interaction and relationships between the variables of a system.

System identification generally involves two top-down steps:

1. Structure identification - In this step, we need to apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is to be conducted. Usually this class of models is denoted by a parametrized function y = f(u,t), where y is the model's output, u is an input vector, and t is a parameter vector. The determination of the function f is problem-dependent, and the function is based on the designer's experience, intuition, and the laws of nature governing the target system.

2. Parameter identification - In the second step, when the structure of the model is known, all we need to do is apply optimization techniques to determine parameter vector t such that the resulting model y = f(u,t) can describe the system appropriately.
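
The two steps can be sketched in a few lines of Python. In this minimal example the structure is simply assumed to be a straight line, y = f(u,t) = t[0] + t[1]*u, with a scalar input u rather than a vector, and the parameters are estimated by ordinary least squares. The observations, the linear structure, and the function names are hypothetical choices for illustration, not part of the original text.

```python
def model(u, t):
    """Structure identification (assumed): y = f(u, t) = t[0] + t[1] * u."""
    return t[0] + t[1] * u


def identify_parameters(us, ys):
    """Parameter identification: least-squares estimate of t for the line above."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    s_uy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    s_uu = sum((u - mean_u) ** 2 for u in us)
    t1 = s_uy / s_uu
    t0 = mean_y - t1 * mean_u
    return (t0, t1)


# Hypothetical input-output observations of the target system.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
outputs = [1.1, 2.9, 5.2, 7.1, 8.8]

t = identify_parameters(inputs, outputs)
squared_error = sum((y - model(u, t)) ** 2 for u, y in zip(inputs, outputs))
print("parameter vector t:", t)
print("sum of squared errors:", squared_error)
```

Choosing a different structure, for example a polynomial or an exponential, would change only `model` and the fitting routine; the division of labor between the two steps stays the same.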

In general, system identification is not a one-pass process: both structure and parameter identification need to be done repeatedly until a satisfactory model is found. This iterative process is represented graphically in Figure 1.1. Typical steps in every iteration are as follows:

1. Specify and parametrize a class of formalized (mathematical) models, y* = f(u,t), representing the system to be identified.

2. Perform parameter identification to choose the parameters that best fit the available data set (the difference y - y* is minimal).

3. Conduct validation tests to see if the identified model responds correctly to an unseen data set (often referred to as a test, validation, or checking data set).

4. Terminate the process once the results of the validation test are satisfactory.
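
The loop described by these steps can be sketched as follows. The candidate structures (a constant, a line in u, and a line in u squared), the data, the error measure, and the tolerance are all assumptions chosen to keep the example small; a real study would pick them from knowledge of the target system.

```python
def fit_constant(us, ys):
    """Structure y* = t0; the least-squares t0 is the mean output."""
    t0 = sum(ys) / len(ys)
    return lambda u: t0


def fit_line_in(feature):
    """Structure y* = t0 + t1 * feature(u); returns a fitting routine."""
    def fit(us, ys):
        xs = [feature(u) for u in us]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        s_xx = sum((x - mx) ** 2 for x in xs)
        t1 = s_xy / s_xx
        t0 = my - t1 * mx
        return lambda u: t0 + t1 * feature(u)
    return fit


def mean_squared_error(predict, us, ys):
    """Average squared difference between observed y and model output y*."""
    return sum((y - predict(u)) ** 2 for u, y in zip(us, ys)) / len(us)


def identify(candidates, train, validation, tolerance):
    """Try each candidate structure; stop at the first one that validates."""
    for name, fit in candidates:                           # step 1: pick a structure
        predict = fit(*train)                              # step 2: fit parameters
        error = mean_squared_error(predict, *validation)   # step 3: validate
        if error <= tolerance:                             # step 4: terminate
            return name, predict, error
    return None  # no candidate was satisfactory


# Hypothetical observations generated by y = 2 + 0.5 * u**2.
train = ([0.0, 1.0, 2.0, 3.0, 4.0], [2.0, 2.5, 4.0, 6.5, 10.0])
validation = ([0.5, 2.5, 5.0], [2.125, 5.125, 14.5])

candidates = [
    ("constant", fit_constant),
    ("linear in u", fit_line_in(lambda u: u)),
    ("linear in u squared", fit_line_in(lambda u: u * u)),
]

result = identify(candidates, train, validation, tolerance=1e-6)
if result is not None:
    print("selected structure:", result[0])
```

Validation uses data that played no part in fitting, so a structure that merely memorizes the training set is rejected and the search moves on to the next candidate.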

If we do not have any a priori knowledge about the target system, then structure identification becomes difficult, and we have to select the structure by trial and error. While we know a great deal about the structures of most engineering systems and industrial processes, in a vast majority of target systems where we apply data-mining techniques, these structures are totally unknown, or they are so complex that it is impossible to obtain an adequate mathematical model. Therefore, new techniques were developed for parameter identification, and they are today a part of the spectrum of data-mining techniques.

Finally, we can distinguish between how the terms "model" and "pattern" are interpreted in data mining. A model is a "large scale" structure, perhaps summarizing relationships over many (sometimes all) cases, whereas a pattern is a local structure, satisfied by few cases or in a small region of a data space. It is also worth noting here that the word "pattern", as it is used in pattern recognition, has a rather different meaning for data mining. In pattern recognition it refers to the vector of measurements characterizing a particular object, which is a point in a multidimensional data space. In data mining, a pattern is simply a local model. In this book we refer to n-dimensional vectors of data as samples.

1.3

Continues...

Excerpted from Data Mining by Mehmed Kantardzic. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

Preface
Preface to the Second Edition
Preface to the First Edition

1 Data-Mining Concepts
1.1 Introduction
1.2 Data-Mining Roots
1.3 Data-Mining Process
1.4 From Data Collection to Data Preprocessing
1.5 Data Warehouses for Data Mining
1.6 From Big Data to Data Science
1.7 Business Aspects of Data Mining: Why a Data-Mining Project Fails?
1.8 Organization of This Book
1.9 Review Questions and Problems
1.10 References for Further Study

2 Preparing the Data
2.1 Representation of Raw Data
2.2 Characteristics of Raw Data
2.3 Transformation of Raw Data
2.4 Missing Data
2.5 Time-Dependent Data
2.6 Outlier Analysis
2.7 Review Questions and Problems
2.8 References for Further Study

3 Data Reduction
3.1 Dimensions of Large Data Sets
3.2 Features Reduction
3.3 Relief Algorithm
3.4 Entropy Measure for Ranking Features
3.5 Principal Component Analysis
3.6 Value Reduction
3.7 Feature Discretization: ChiMerge Technique
3.8 Case Reduction
3.9 Review Questions and Problems
3.10 References for Further Study

4 Learning from Data
4.1 Learning Machine
4.2 Statistical Learning Theory
4.3 Types of Learning Methods
4.4 Common Learning Tasks
4.5 Support Vector Machines
4.6 Semi-Supervised Support Vector Machines (S3VM)
4.7 kNN: Nearest Neighbor Classifier
4.8 Model Selection vs. Generalization
4.9 Model Estimation
4.10 Imbalanced Data Classification
4.11 90% Accuracy … Now What?
4.12 Review Questions and Problems
4.13 References for Further Study

5 Statistical Methods
5.1 Statistical Inference
5.2 Assessing Differences in Data Sets
5.3 Bayesian Inference
5.4 Predictive Regression
5.5 Analysis of Variance
5.6 Logistic Regression
5.7 Log-Linear Models
5.8 Linear Discriminant Analysis
5.9 Review Questions and Problems
5.10 References for Further Study

6 Decision Trees and Decision Rules
6.1 Decision Trees
6.2 C4.5 Algorithm: Generating a Decision Tree
6.3 Unknown Attribute Values
6.4 Pruning Decision Trees
6.5 C4.5 Algorithm: Generating Decision Rules
6.6 CART Algorithm and Gini Index
6.7 Limitations of Decision Trees and Decision Rules
6.8 Review Questions and Problems
6.9 References for Further Study

7 Artificial Neural Networks
7.1 Model of an Artificial Neuron
7.2 Architectures of Artificial Neural Networks
7.3 Learning Process
7.4 Learning Tasks Using ANNs
7.5 Multilayer Perceptrons
7.6 Competitive Networks and Competitive Learning
7.7 Self-Organizing Maps
7.8 Deep Learning
7.9 Convolutional Neural Networks (CNNs)
7.10 Review Questions and Problems
7.11 References for Further Study

8 Ensemble Learning
8.1 Ensemble Learning Methodologies
8.2 Combination Schemes for Multiple Learners
8.3 Bagging and Boosting
8.4 AdaBoost
8.5 Review Questions and Problems
8.6 References for Further Study

9 Cluster Analysis
9.1 Clustering Concepts
9.2 Similarity Measures
9.3 Agglomerative Hierarchical Clustering
9.4 Partitional Clustering
9.5 Incremental Clustering
9.6 DBSCAN Algorithm
9.7 BIRCH Algorithm
9.8 Clustering Validation
9.9 Review Questions and Problems
9.10 References for Further Study

10 Association Rules
10.1 Market-Basket Analysis
10.2 Algorithm Apriori
10.3 From Frequent Itemsets to Association Rules
10.4 Improving the Efficiency of the Apriori Algorithm
10.5 Frequent Pattern Growth Method
10.6 Associative-Classification Method
10.7 Multidimensional Association Rule Mining
10.8 Review Questions and Problems
10.9 References for Further Study

11 Web Mining and Text Mining
11.1 Web Mining
11.2 Web Content, Structure, and Usage Mining
11.3 HITS and LOGSOM Algorithms
11.4 Mining Path-Traversal Patterns
11.5 PageRank Algorithm
11.6 Recommender Systems
11.7 Text Mining
11.8 Latent Semantic Analysis
11.9 Review Questions and Problems
11.10 References for Further Study

12 Advances in Data Mining
12.1 Graph Mining
12.2 Temporal Data Mining
12.3 Spatial Data Mining
12.4 Distributed Data Mining
12.5 Correlation Does Not Imply Causality!
12.6 Privacy, Security, and Legal Aspects of Data Mining
12.7 Cloud Computing Based on Hadoop and Map/Reduce
12.8 Reinforcement Learning
12.9 Review Questions and Problems
12.10 References for Further Study

13 Genetic Algorithms
13.1 Fundamentals of Genetic Algorithms
13.2 Optimization Using Genetic Algorithms
13.3 A Simple Illustration of a Genetic Algorithm
13.4 Schemata
13.5 Traveling Salesman Problem
13.6 Machine Learning Using Genetic Algorithms
13.7 Genetic Algorithms for Clustering
13.8 Review Questions and Problems
13.9 References for Further Study

14 Fuzzy Sets and Fuzzy Logic
14.1 Fuzzy Sets
14.2 Fuzzy Set Operations
14.3 Extension Principle and Fuzzy Relations
14.4 Fuzzy Logic and Fuzzy Inference Systems
14.5 Multifactorial Evaluation
14.6 Extracting Fuzzy Models from Data
14.7 Data Mining and Fuzzy Sets
14.8 Review Questions and Problems
14.9 References for Further Study

15 Visualization Methods
15.1 Perception and Visualization
15.2 Scientific Visualization and Information Visualization
15.3 Parallel Coordinates
15.4 Radial Visualization
15.5 Visualization Using Self-Organizing Maps
15.6 Visualization Systems for Data Mining
15.7 Review Questions and Problems
15.8 References for Further Study

Appendix A: Information on Data Mining
A.1 Data-Mining Journals
A.2 Data-Mining Conferences
A.3 Data-Mining Forums/Blogs
A.4 Data Sets
A.5 Commercially and Publicly Available Tools
A.6 Web Site Links

Appendix B: Data-Mining Applications
B.1 Data Mining for Financial Data Analyses
B.2 Data Mining for the Telecommunication Industry
B.3 Data Mining for the Retail Industry
B.4 Data Mining in Healthcare and Biomedical Research
B.5 Data Mining in Science and Engineering
B.6 Pitfalls of Data Mining

Bibliography

Index
