Contents

List of Figures

List of Tables

Preface

I              INTRODUCTION

1              Introduction

1.1      Big Data

1.1.1     Tools: Open-Source Revolution

1.1.2     Challenges in Big Data

1.2      Real-Time Analytics

1.2.1     Data Streams

1.2.2     Time and Memory

1.2.3     Applications

1.3      What This Book Is About

2              Big Data Stream Mining

2.1      Algorithms

2.2      Classification

2.2.1     Classifier Evaluation in Data Streams

2.2.2     Majority Class Classifier

2.2.3     No-Change Classifier

2.2.4     Lazy Classifier

2.2.5     Naive Bayes

2.2.6     Decision Trees

2.2.7     Ensembles

2.3      Regression

2.4      Clustering

2.5      Frequent Pattern Mining

3              Hands-on Introduction to MOA

3.1      Getting Started

3.2      The Graphical User Interface for Classification

3.2.1     Drift Stream Generators

3.3      Using the Command Line

II             STREAM MINING

4              Streams and Sketches

4.1      Setting: Approximation Algorithms

4.2      Concentration Inequalities

4.3      Sampling

4.4      Counting Total Items

4.5      Counting Distinct Elements

4.5.1     Linear Counting

4.5.2     Cohen’s Logarithmic Counter

4.5.3     The Flajolet-Martin Counter and HyperLogLog

4.5.4     An Application: Computing Distance Functions in Graphs

4.5.5     Discussion: Log vs. Linear

4.6      Frequency Problems

4.6.1     The S PACE S AVING Sketch

4.6.2     The CM-Sketch Algorithm

4.6.3     CountSketch

4.6.4     Moment Computation

4.7      Exponential Histograms for Sliding Windows

4.8      Distributed Sketching: Mergeability

4.9      Some Technical Discussions and Additional Material

4.9.1     Hash Functions

4.9.2     Creating ( 𝜖 , δ )-Approximation Algorithms

4.9.3     Other Sketching Techniques

4.10    Exercises

5              Dealing with Change

5.1      Notion of Change in Streams

5.2      Estimators

5.2.1     Sliding Windows and Linear Estimators

5.2.2     Exponentially Weighted Moving Average

5.2.3     Unidimensional Kalman Filter

5.3      Change Detection

5.3.1     Evaluating Change Detection

5.3.2     The CUSUM and Page-Hinkley Tests

5.3.3     Statistical Tests

5.3.4     Drift Detection Method

5.3.5     ADWIN

5.4      Combination with Other Sketches and Multidimensional Data

5.5      Exercises

6              Classification

6.1      Classifier Evaluation

6.1.1     Error Estimation

6.1.2     Distributed Evaluation

6.1.3     Performance Evaluation Measures

6.1.4     Statistical Significance

6.1.5     A Cost Measure for the Mining Process

6.2      Baseline Classifiers

6.2.1     Majority Class

6.2.2     No-change Classifier

6.2.3     Naive Bayes

6.2.4     Multinomial Naive Bayes

6.3      Decision Trees

6.3.1     Estimating Split Criteria

6.3.2     The Hoeffding Tree

6.3.3     CVFDT

6.3.4     VFDTc and UFFT

6.3.5     Hoeffding Adaptive Tree

6.4      Handling Numeric Attributes

6.4.1     VFML

6.4.2     Exhaustive Binary Tree

6.4.3     Greenwald and Khanna’s Quantile Summaries

6.4.4     Gaussian Approximation

6.5      Perceptron

6.6      Lazy Learning

6.7      Multi-label Classification

6.7.1     Multi-label Hoeffding Trees

6.8      Active Learning

6.8.1     Random Strategy

6.8.2     Fixed Uncertainty Strategy

6.8.3     Variable Uncertainty Strategy

6.8.4     Uncertainty Strategy with Randomization

6.9      Concept Evolution

6.10    Lab Session with MOA

7              Ensemble Methods

7.1      Accuracy-Weighted Ensembles

7.2      Weighted Majority

7.3      Stacking

7.4      Bagging

7.4.1     Online Bagging Algorithm

7.4.2     Bagging with a Change Detector

7.4.3     Leveraging Bagging

7.5      Boosting

7.6      Ensembles of Hoeffding Trees

7.6.1     Hoeffding Option Trees

7.6.2     Random Forests

7.6.3     Perceptron Stacking of Restricted Hoeffding Trees

7.6.4     Adaptive-Size Hoeffding Trees

7.7      Recurrent Concepts

7.8      Lab Session with MOA

8              Regression

8.1      Introduction

8.2      Evaluation

8.3      Perceptron Learning

8.4      Lazy Learning

8.5      Decision Tree Learning

8.6      Decision Rules

8.7      Regression in MOA

9              Clustering

9.1      Evaluation Measures

9.2      The k -means Algorithm

9.3      BIRCH, BICO, and C LU S TREAM

9.4      Density-Based Methods: DBSCAN and Den-Stream

9.5      C LUS T REE

9.6      StreamKM++: Coresets

9.7      Additional Material

9.8      Lab Session with MOA

10            Frequent Pattern Mining

10.1    An Introduction to Pattern Mining

10.1.1   Patterns: Definitions and Examples

10.1.2   Batch Algorithms for Frequent Pattern Mining

10.1.3   Closed and Maximal Patterns

10.2    Frequent Pattern Mining in Streams: Approaches

10.2.1   Coresets of Closed Patterns

10.3    Frequent Itemset Mining on Streams

10.3.1   Reduction to Heavy Hitters

10.3.2   Moment

10.3.3   FP-S TREAM

10.3.4   IncMine

10.4    Frequent Subgraph Mining on Streams

10.4.1   W IN G RAPH M INER

10.4.2   A DA G RAPH M INER

10.5    Additional Material

10.6    Exercises

III          THE MOA SOFTWARE

11            Introduction to MOA and Its Ecosystem

11.1    MOA Architecture

11.2    Installation

11.3    Recent Developments in MOA

11.4    Extensions to MOA

11.5    ADAMS

11.6    MEKA

11.7    OpenML

11.8    StreamDM

11.9    Streams

11.10  Apache SAMOA

12            The Graphical User Interface

12.1    Getting Started with the GUI

12.2    Classification and Regression

12.2.1   Tasks

12.2.2   Data Feeds and Data Generators

12.2.3   Bayesian Classifiers

12.2.4   Decision Trees

12.2.5   Meta Classifiers (Ensembles)

12.2.6   Function Classifiers

12.2.7   Drift Classifiers

12.2.8   Active Learning Classifiers

12.3    Clustering

12.3.1   Data Feeds and Data Generators

12.3.2   Stream Clustering Algorithms

12.3.3   Visualization and Analysis

13            Using the Command Line

13.1    Learning Task for Classification and Regression

13.2    Evaluation Tasks for Classification and Regression

13.3    Learning and Evaluation Tasks for Classification and Regression

13.4    Comparing Two Classifiers

14            Using the API

14.1    MOA Objects

14.2    Options

14.3    Prequential Evaluation Example

15            Developing New Methods in MOA

15.1    Main Classes in MOA

15.2    Creating a New Classifier

15.3    Compiling a Classifier

15.4    Good Programming Practices in MOA

Bibliography

Index