Knowledge Discovery from Data Streams

Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams.

The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets.

This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.

1101428631

Knowledge Discovery from Data Streams

115.0 In Stock

Knowledge Discovery from Data Streams

Add to Wishlist

Knowledge Discovery from Data Streams

Hardcover

$115.00

View All Available Formats & Editions

Hardcover
$115.00

View All Available Formats & Editions

SHIP THIS ITEM

In stock. Ships in 3-7 days. Typically arrives in 3 weeks.
PICK UP IN STORE

Your local store may have stock of this item.

Available within 2 business hours

Want it Today?
Check Store Availability

Related collections and offers

Overview

Product Details

ISBN-13:	9781439826119
Publisher:	Taylor & Francis
Publication date:	05/25/2010
Series:	Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Pages:	258
Product dimensions:	6.12(w) x 9.19(h) x (d)

About the Author

João Gama is an associate professor and senior researcher in the Laboratory of Artificial Intelligence and Decision Support (LIAAD) at the University of Porto in Portugal.

List of Tables xi

List of Figures xiii

List of Algorithms xv

Foreword xvii

Acknowledgments xix

1 Knowledge Discovery from Data Streams 1

1.1 Introduction 1

1.2 An Illustrative Example 2

1.3 A World in Movement 4

1.4 Data Mining and Data Streams 5

2 Introduction to Data Streams 7

2.1 Data Stream Models 7

2.1.1 Research Issues in Data Stream Management Systems 8

2.1.2 An Illustrative Problem 8

2.2 Basic Streaming Methods 9

2.2.1 Illustrative Examples 10

2.2.1.1 Counting the Number of Occurrences of the Elements in a Stream 10

2.2.1.2 Counting the Number of Distinct Values in a Stream 11

2.2.2 Bounds of Random Variables 11

2.2.3 Poisson Processes 13

2.2.4 Maintaining Simple Statistics from Data Streams 14

2.2.5 Sliding Windows 14

2.2.5.1 Computing Statistics over Sliding Windows: The ADWIN Algorithm 16

2.2.6 Data Synopsis 19

2.2.6.1 Sampling 19

2.2.6.2 Synopsis and Histograms 20

2.2.6.3 Wavelets 21

2.2.6.4 Discrete Fourier Transform 22

2.3 Illustrative Applications 23

2.3.1 A Data Warehouse Problem: Hot-Lists 23

2.3.2 Computing the Entropy in a Stream 24

2.3.3 Monitoring Correlations Between Data Streams 27

2.3.4 Monitoring Threshold Functions over Distributed Data Streams 29

2.4 Notes 30

3 Change Detection 33

3.1 Introduction 33

3.2 Tracking Drifting Concepts 34

3.2.1 The Nature of Change 35

3.2.2 Characterization of Drift Detection Methods 36

3.2.2.1 Data Management 37

3.2.2.2 Detection Methods 38

3.2.2.3 Adaptation Methods 40

3.2.2.4 Decision Model Management 41

3.2.3 A Note on Evaluating Change Detection Methods 41

3.3 Monitoring the Learning Process 42

3.3.1 Drift Detection Using Statistical Process Control 42

3.3.2 An Illustrative Example 45

3.4 Final Remarks 46

3.5 Notes 47

4 Maintaining Histograms from Data Streams 49

4.1 Introduction 49

4.2 Histograms from Data Streams 50

4.2.1 K-buckets Histograms 50

4.2.2 Exponential Histograms 51

4.2.2.1 An Illustrative Example 52

4.2.2.2 Discussion 52

4.3 The Partition Incremental Discretization Algorithm - PiD 53

4.3.1 Analysis of the Algorithm 56

4.3.2 Change Detection in Histograms 56

4.3.3 An Illustrative Example 57

4.4 Applications to Data Mining 59

4.4.1 Applying PiD in Supervised Learning 59

4.4.2 Time-Changing Environments 61

4.5 Notes 62

5 Evaluating Streaming Algorithms 63

5.1 Introduction 63

5.2 Learning from Data Streams 64

5.3 Evaluation Issues 65

5.3.1 Design of Evaluation Experiments 66

5.3.2 Evaluation Metrics 67

5.3.2.1 Error Estimators Using a Single Algorithm and a Single Dataset 68

5.3.2.2 An Illustrative Example 68

5.3.3 Comparative Assessment 69

5.3.3.1 The 0 - 1 Loss Function 70

5.3.3.2 Illustrative Example 71

5.3.4 Evaluation Methodology in Non-Stationary Environments 72

5.3.4.1 The Page-Hinkley Algorithm 72

5.3.4.2 Illustrative Example 73

5.4 Lessons Learned and Open Issues 75

5.5 Notes 77

6 Clustering from Data Streams 79

6.1 Introduction 79

6.2 Clustering Examples 80

6.2.1 Basic Concepts 80

6.2.2 Partitioning Clustering 82

6.2.2.1 The Leader Algorithm 82

6.2.2.2 Single Pass k-Means 82

6.2.3 Hierarchical Clustering 83

6.2.4 Micro Clustering 85

6.2.4.1 Discussion 86

6.2.4.2 Monitoring Cluster Evolution 86

6.2.5 Grid Clustering 87

6.2.5.1 Computing the Fractal Dimension 88

6.2.5.2 Fractal Clustering 88

6.3 Clustering Variables 90

6.3.1 A Hierarchical Approach 91

6.3.1.1 Growing the Hierarchy 91

6.3.1.2 Aggregating at Concept Drift Detection 94

6.3.1.3 Analysis of the Algorithm 96

6.4 Notes 96

7 Frequent Pattern Mining 97

7.1 Introduction to Frequent Itemset Mining 97

7.1.1 The Search Space 98

7.1.2 The FP-growth Algorithm 100

7.1.3 Summarizing Itemsets 100

7.2 Heavy Hitters 101

7.3 Mining Frequent Itemsets from Data Streams 103

7.3.1 Landmark Windows 104

7.3.1.1 The LossyCounting Algorithm 104

7.3.1.2 Frequent Itemsets Using LossyCounting 104

7.3.2 Mining Recent Frequent Itemsets 105

7.3.2.1 Maintaining Frequent Itemsets in Sliding Windows 105

7.3.2.2 Mining Closed Frequent Itemsets over Sliding Windows 106

7.3.3 Frequent Itemsets at Multiple Time Granularities 108

7.4 Sequence Pattern Mining 110

7.4.1 Reservoir Sampling for Sequential Pattern Mining over Data Streams 111

7.5 Notes 113

8 Decision Trees from Data Streams 115

8.1 Introduction 115

8.2 The Very Fast Decision Tree Algorithm 116

8.2.1 VFDT-The Base Algorithm 116

8.2.2 Analysis of the VFDT Algorithm 118

8.3 Extensions to the Basic Algorithm 119

8.3.1 Processing Continuous Attributes 119

8.3.1.1 Exhaustive Search 119

8.3.1.2 Discriminant Analysis 121

8.3.2 Functional Tree Leaves 123

8.3.3 Concept Drift 124

8.3.3.1 Detecting Changes 126

8.3.3.2 Reacting to Changes 127

8.3.4 Final Comments 128

8.4 OLIN: Info-Fuzzy Algorithms 129

8.5 Notes 132

9 Novelty Detection in Data Streams 133

9.1 Introduction 133

9.2 Learning and Novelty 134

9.2.1 Desiderata for Novelty Detection 135

9.3 Novelty Detection as a One-Class Classification Problem 135

9.3.1 Autoassociator Networks 136

9.3.2 The Positive Naive-Bayes 137

9.3.3 Decision Trees for One-Class Classification 138

9.3.4 The One-Class SVM 138

9.3.5 Evaluation of One-Class Classification Algorithms 139

9.4 Learning New Concepts 141

9.4.1 Approaches Based on Extreme Values 141

9.4.2 Approaches Based on the Decision Structure 142

9.4.3 Approaches Based on Frequency 143

9.4.4 Approaches Based on Distances 144

9.5 The Online Novelty and Drift Detection Algorithm 144

9.5.1 Initial Learning Phase 145

9.5.2 Continuous Unsupervised Learning Phase 146

9.5.2.1 Identifying Novel Concepts 147

9.5.2.2 Attempting to Determine the Nature of New Concepts 149

9.5.2.3 Merging Similar Concepts 149

9.5.2.4 Automatically Adapting the Number of Clusters 150

9.5.3 Computational Cost 150

9.6 Notes 151

10 Ensembles of Classifiers 153

10.1 Introduction 153

10.2 Linear Combination of Ensembles 155

10.3 Sampling from a Training Set 156

10.3.1 Online Bagging 157

10.3.2 Online Boosting 158

10.4 Ensembles of Trees 160

10.4.1 Option Trees 160

10.4.2 Forest of Trees 161

10.4.2.1 Generating forest of Trees 162

10.4.2.2 Classifying Test Examples 162

10.5 Adapting to Drift Using Ensembles of Classifiers 162

10.6 Mining Skewed Data Streams with Ensembles 165

10.7 Notes 166

11 Time Series Data Streams 167

11.1 Introduction to Time Series Analysis 167

11.1.1 Trend 167

11.1.2 Seasonality 169

11.1.3 Stationarity 169

11.2 Time-Series Prediction 169

11.2.1 The Kalman Filter 170

11.2.2 Least Mean Squares 173

11.2.3 Neural Nets and Data Streams 173

11.2.3.1 Stochastic Sequential Learning of Neural Networks 174

11.2.3.2 Illustrative Example: Load Forecast in Data Streams 175

11.3 Similarity between Time-Series 177

11.3.1 Euclidean Distance 177

11.3.2 Dynamic Time-Warping 178

11.4 Symbolic Approximation-SAX 180

11.4.1 The SAX Transform 180

11.4.1.1 Piecewise Aggregate Approximation (PAA) 181

11.4.1.2 Symbolic Discretization 181

11.4.1.3 Distance Measure 182

11.4.1.4 Discussion 182

11.4.2 Finding Motifs Using SAX 183

11.4.3 Finding Discords Using SAX 183

11.5 Notes 184

12 Ubiquitous Data Mining 185

12.1 Introduction to Ubiquitous Data Mining 185

12.2 Distributed Data Stream Monitoring 186

12.2.1 Distributed Computing of Linear Functions 187

12.2.1.1 A General Algorithm for Computing Linear Functions 188

12.2.2 Computing Sparse Correlation Matrices Efficiently 189

12.2.2.1 Monitoring Sparse Correlation Matrices 191

12.2.2.2 Detecting Significant Correlations 192

12.2.2.3 Dealing with Data Streams 192

12.3 Distributed Clustering 193

12.3.1 Conquering the Divide 193

12.3.1.1 Furthest Point Clustering 193

12.3.1.2 The Parallel Guessing Clustering 193

12.3.2 DGClust - Distributed Grid Clustering 194

12.3.2.1 Local Adaptive Grid 194

12.3.2.2 Frequent State Monitoring 195

12.3.2.3 Centralized Online Clustering 196

12.4 Algorithm Granularity 197

12.4.1 Algorithm Granularity Overview 199

12.4.2 Formalization of Algorithm Granularity 200

12.4.2.1 Algorithm Granularity Procedure 200

12.4.2.2 Algorithm Output Granularity 201

12.5 Notes 203

13 Final Comments 205

13.1 The Next Generation of Knowledge Discovery 205

13.1.1 Mining Spatial Data 206

13.1.2 The Time Situation of Data 206

13.1.3 Structured Data 206

13.2 Where We Want to Go 206

Appendix A Resources 209

A.1 Software 209

A.2 Datasets 209

Bibliography 211

Index 235

What People are Saying About This

From the Publisher

… this book is the first authored text (that is, not an edited collection) about the area … The book covers a lot of ground in just 200 pages, including discussion of relatively advanced methods such as wavelets, bagging, boosting, dynamic time warping, and symbolic representation of time series. There is also, I was pleased to see, a chapter on evaluating streaming algorithms … . Evaluation, in general, deserves more attention than it generally receives, so I was delighted to see the focus on it here. … a good introduction to an area of data analysis which is going to be very important indeed.
—David J. Hand, International Statistical Review, 2012

Gama is one of the leading investigators in the hottest research topic in machine learning and data mining: data streams. … This book is the first book to didactically cover in a clear, comprehensive and mathematically rigorous way the main machine learning related aspects of this relevant research field. … an up-to-date, broad and useful source of reference for all those interested in knowledge acquisition by learning techniques.
—From the Foreword by André Ponce de Leon Ferreira de Carvalho, University of São Paulo, Brazil

From the B&N Reads Blog

Page 1 of

Editorial Reviews

… this book is the first authored text (that is, not an edited collection) about the area … The book covers a lot of ground in just 200 pages, including discussion of relatively advanced methods such as wavelets, bagging, boosting, dynamic time warping, and symbolic representation of time series. There is also, I was pleased to see, a chapter on evaluating streaming algorithms … . Evaluation, in general, deserves more attention than it generally receives, so I was delighted to see the focus on it here. … a good introduction to an area of data analysis which is going to be very important indeed.
—David J. Hand, International Statistical Review, 2012

Gama is one of the leading investigators in the hottest research topic in machine learning and data mining: data streams. … This book is the first book to didactically cover in a clear, comprehensive and mathematically rigorous way the main machine learning related aspects of this relevant research field. … an up-to-date, broad and useful source of reference for all those interested in knowledge acquisition by learning techniques.
—From the Foreword by André Ponce de Leon Ferreira de Carvalho, University of São Paulo, Brazil

From the Publisher

Knowledge Discovery from Data Streams

Knowledge Discovery from Data Streams

Hardcover

Hardcover

Related collections and offers

Overview

Product Details

About the Author

Table of Contents

What People are Saying About This

Customer Reviews

Related collections and offers

Overview

Product Details

About the Author

Table of Contents

What People are Saying About This

Related Subjects

Customer Reviews