ISBN-10:
0262037793
ISBN-13:
9780262037792
Pub. Date:
03/02/2018
Publisher:
MIT Press
Machine Learning for Data Streams: with Practical Examples in MOA

Machine Learning for Data Streams: with Practical Examples in MOA

Current price is , Original price is $55.0. You

Temporarily Out of Stock Online

Please check back later for updated availability.

Temporarily Out of Stock Online

Overview

A hands-on approach to tasks and techniques in data stream mining and real-time analytics, with examples in MOA, a popular freely available open-source software framework.

Today many information sources—including sensor networks, financial markets, social networks, and healthcare monitoring—are so-called data streams, arriving sequentially and at high speed. Analysis must take place in real time, with partial data and without the capacity to store the entire data set. This book presents algorithms and techniques used in data stream mining and real-time analytics. Taking a hands-on approach, the book demonstrates the techniques using MOA (Massive Online Analysis), a popular, freely available open-source software framework, allowing readers to try out the techniques after reading the explanations.

The book first offers a brief introduction to the topic, covering big data mining, basic methodologies for mining data streams, and a simple example of MOA. More detailed discussions follow, with chapters on sketching techniques, change, classification, ensemble methods, regression, clustering, and frequent pattern mining. Most of these chapters include exercises, an MOA-based lab session, or both. Finally, the book discusses the MOA software, covering the MOA graphical user interface, the command line, use of its API, and the development of new methods within MOA. The book will be an essential reference for readers who want to use data stream mining as a tool, researchers in innovation or data stream mining, and programmers who want to create new algorithms for MOA.

Product Details

ISBN-13: 9780262037792
Publisher: MIT Press
Publication date: 03/02/2018
Series: Adaptive Computation and Machine Learning seriesSeries Series
Pages: 288
Product dimensions: 7.00(w) x 9.10(h) x 0.90(d)
Age Range: 18 Years

About the Author

Albert Bifet is Professor of Computer Science at Télécom ParisTech.

Ricard Gavaldà is Professor of Computer Science at the Politècnica de Catalunya, Barcelona.

Geoff Holmes is Professor and Dean of Computing at the University of Waikato in Hamilton, New Zealand.

Bernhard Pfahringer is Professor of Computer Science at the University of Auckland, New Zealand.

Table of Contents

List of Figures xiii

List of Tables xvii

Preface xix

I Introduction 1

1 Introduction 3

1.1 Big Data 3

1.1.1 Tools: Open-Source Revolution 5

1.1.2 Challenges in Big Data 6

1.2 Real-Time Analytics 8

1.2.1 Data Streams 8

1.2.2 Time and Memory 8

1.2.3 Applications 8

1.3 What This Book Is About 10

2 Big Data Stream Mining 11

2.1 Algorithms 11

2.2 Classification 12

2.2.1 Classifier Evaluation in Data Streams 14

2.2.2 Majority Class Classifier 15

2.2.3 No-Change Classifier 15

2.2.4 Lazy Classifier 15

2.2.5 Naive Bayes 16

2.2.6 Decision Trees 16

2.2.7 Ensembles 17

2.3 Regression 17

2.4 Clustering 17

2.5 Frequent Pattern Mining 18

3 Hands-on Introduction to MOA 21

3.1 Getting Started 21

3.2 The Graphical User Interface for Classification 23

3.2.1 Drift Stream Generators 25

3.3 Using the Command Line 29

II Stream Mining 33

4 Stream Mining 33

4.1 Streams and Sketches 35

4.2 Concentration Inequalities 37

4.3 Sampling 39

4.4 Counting Total Items 41

4.5 Counting Distinct Elements 42

4.5.1 Linear Counting 43

4.5.2 Cohen's Logarithmic Counter 44

4.5.3 The Flajolet-Martin Counter and HyperLogLog 45

4.5.4 An Application: Computing Distance Functions in Graphs 47

4.5.5 Discussion: Log vs. Linear 48

4.6 Frequency Problems 48

4.6.1 The SpaceSaving Sketch 49

4.6.2 The CM-Sketch Algorithm 51

4.6.3 CountSketch 54

4.6.4 Moment Computation 56

4.7 Exponential Histograms for Sliding Windows 57

4.8 Distributed Sketching: Mergeability 60

4.9 Some Technical Discussions and Additional Material 61

4.9.1 Hash Functions 61

4.9.2 Creating (ε, δ)-Approximation Algorithms 62

4.9.3 Other Sketching Techniques 63

4.10 Exercises 63

5 Dealing with Change 67

5.1 Notion of Change in Streams 67

5.2 Estimators 72

5.2.1 Sliding Windows and Linear Estimators 73

5.2.2 Exponentially Weighted Moving Average 73

5.2.3 Unidimensional Kalman Filter 74

5.3 Change Detection 75

5.3.1 Evaluating Change Detection 75

5.3.2 The CUSUM and Page-Hinkley Tests 75

5.3.3 Statistical Tests 76

5.3.4 Drift Detection Method 78

5.3.5 ADWIN 79

5.4 Combination with Other Sketches and Multidimensional Data 81

5.5 Exercises 81

6 Classification 85

6.1 Classifier Evaluation 86

6.1.1 Error Estimation 87

6.1.2 Distributed Evaluation 88

6.1.3 Performance Evaluation Measures 90

6.1.4 Statistical Significance 92

6.1.5 A Cost Measure for the Mining Process 93

6.2 Baseline Classifiers 94

6.2.1 Majority Class 94

6.2.2 No-change Classifier 94

6.2.3 Naive Bayes 95

6.2.4 Multinomial Naive Bayes 98

6.3 Decision Trees 99

6.3.1 Estimating Split Criteria 101

6.3.2 The Hoeffding Tree 102

6.3.3 CVFDT 105

6.3.4 VFDTc and UFFT 107

6.3.5 Hoeffding Adaptive Tree 108

6.4 Handling Numeric Attributes 109

6.4.1 VFML 110

6.4.2 Exhaustive Binary Tree 110

6.4.3 Greenwald and Khanna's Quantile Summaries 111

6.4.4 Gaussian Approximation 111

6.5 Perceptron 113

6.6 Lazy Learning 114

6.7 Multi-label Classification 115

6.7.1 Multi-label Hoeffding Trees 116

6.8 Active Learning 117

6.8.1 Random Strategy 119

6.8.2 Fixed Uncertainty Strategy 119

6.8.3 Variable Uncertainty Strategy 119

6.8.4 Uncertainty Strategy with Randomization 121

6.9 Concept Evolution 121

6.10 Lab Session with MOA 122

7 Ensemble Methods 129

7.1 Accuracy-Weighted Ensembles 129

7.2 Weighted Majority 130

7.3 Stacking 132

7.4 Bagging 133

7.4.1 Online Bagging Algorithm 133

7.4.2 Bagging with a Change Detector 133

7.4.3 Leveraging Bagging 134

7.5 Boosting 135

7.6 Ensembles of Hoeffding Trees 136

7.6.1 Hoeffding Option Trees 136

7.6.2 Random Forests 136

7.6.3 Perceptron Stacking of Restricted Hoeffding Trees 137

7.6.4 Adaptive-Size Hoeffding Trees 138

7.7 Recurrent Concepts 139

7.8 Lab Session with MOA 139

8 Regression 143

8.1 Introduction 143

8.2 Evaluation 144

8.3 Perceptron Learning 145

8.4 Lazy Learning 145

8.5 Decision Tree Learning 146

8.6 Decision Rules 146

8.7 Regression in MOA 148

9 Clustering 149

9.1 Evaluation Measures 150

9.2 The k-means Algorithm 151

9.3 BIRCH, BICO, and CluStream 152

9.4 Density-Based Methods: DBSCAN and den-Stream 154

9.5 ClusTree 156

9.6 StreamKM++: Coresets 158

9.7 Additional Material 159

9.8 Lab Session with MOA 160

10 Frequent Pattern Mining 165

10.1 An Introduction to Pattern Mining 165

10.1.1 Patterns: Definitions and Examples 165

10.1.2 Batch Algorithms for Frequent Pattern Mining 168

10.1.3 Closed and Maximal Patterns 169

10.2 Frequent Pattern Mining in Streams: Approaches 170

10.2.1 Coresets of Closed Patterns 172

10.3 Frequent Itemset Mining on Streams 174

10.3.1 Reduction to Heavy Hitters 174

10.3.2 Moment 174

10.3.3 FP-Stream 175

10.3.4 IncMine 176

10.4 Frequent Subgraph Mining on Streams 178

10.4.1 WinGraphMiner 179

10.4.2 AdaGraphMiner 179

10.5 Additional Material 181

10.6 Exercises 182

III The MOA Software 185

11 Introduction to MOA and Its Ecosystem 187

11.1 MOA Architecture 188

11.2 Installation 188

11.3 Recent Developments in MOA 188

11.4 Extensions to MOA 189

11.5 ADAMS 190

11.6 MEKA 193

11.7 OpenML 194

11.8 StreamDM 195

11.9 Streams 196

11.10 Apache SAMOA 196

12 The Graphical User Interface 201

12.1 Getting Started with the GUI 201

12.2 Classification and Regression 201

12.2.1 Tasks 203

12.2.2 Data Feeds and Data Generators 204

12.2.3 Bayesian Classifiers 208

12.2.4 Decision Trees 208

12.2.5 Meta Classifiers (Ensembles) 209

12.2.6 Function Classifiers 210

12.2.7 Drift Classifiers 210

12.2.8 Active Learning Classifiers 211

12.3 Clustering 211

12.3.1 Data Feeds and Data Generators 212

12.3.2 Stream Clustering Algorithms 212

12.3.3 Visualization and Analysis 212

13 Using the Command Line 217

13.1 Learning Task for Classification and Regression 217

13.2 Evaluation Tasks for Classification and Regression 217

13.3 Learning and Evaluation Tasks for Classification and Regression 218

13.4 Comparing Two Classifiers 219

14 Using the API 221

14.1 MOA Objects 221

14.2 Options 221

14.3 Prequential Evaluation Example 224

15 Developing New Methods in MOA 227

15.1 Main Classes in MOA 227

15.2 Creating a New Classifier 228

15.3 Compiling a Classifier 237

15.4 Good Programming Practices in MOA 237

Bibliography 239

Index 257

What People are Saying About This

Stan Matwin

An excellent introduction to stream data analytics from the Big Data perspective. Clear and lucid presentation of state of the art methods for working with data in motion. It brings a fresh, unique focus on sketches, often overlooked in monographs, as well as its highly practical, hands-on grounding in the open-source MOA system. Not to be missed by anyone with serious interest in Big Data and Data Science.

Endorsement

An excellent introduction to stream data analytics from the Big Data perspective. Clear and lucid presentation of state of the art methods for working with data in motion. It brings a fresh, unique focus on sketches, often overlooked in monographs, as well as its highly practical, hands-on grounding in the open-source MOA system. Not to be missed by anyone with serious interest in Big Data and Data Science.

Stan Matwin, Canada Research Chair and Director, Institute for Big Data Analytics, Dalhousie University; Distinguished Professor at the University of Ottawa, Canada; State Professor at the Institute for Computer Science of the Polish Academy of Sciences; Area Chair for Applications of the Springer Encyclopedia of Machine Learning

From the Publisher

An excellent introduction to stream data analytics from the Big Data perspective. Clear and lucid presentation of state of the art methods for working with data in motion. It brings a fresh, unique focus on sketches, often overlooked in monographs, as well as its highly practical, hands-on grounding in the open-source MOA system. Not to be missed by anyone with serious interest in Big Data and Data Science.

Stan Matwin, Canada Research Chair and Director, Institute for Big Data Analytics, Dalhousie University; Distinguished Professor at the University of Ottawa, Canada; State Professor at the Institute for Computer Science of the Polish Academy of Sciences; Area Chair for Applications of the Springer Encyclopedia of Machine Learning

Customer Reviews