Index

Accuracy-Weighted Ensembles, 129 , 209

AccuracyUpdatedEnsemble, 130 , 209

AccuracyWeightedEnsemble, 130 , 209

active learning, 13 , 117

Fixed Uncertainty Strategy, 119

in MOA, 211

Random Strategy, 119

Uncertainty Strategy with Randomization, 121

Variable Uncertainty Strategy, 119

ActiveClassifier, 211

Adaboost, 135

AdaGraphMiner algorithm, 179 , 189

AdaHoeffdingOptionTree, 209

ADAMS project, 190

adaptive bagging, see ADWIN Bagging

Adaptive Random Forests, 137

Adaptive-Size Hoeffding Trees, 138 , 209

AddNoiseFilter, 206

ADWIN Bagging, 17 , 133 , 200 , 209

ADWIN sketch, 79 , 82 , 108 , 179 , 211

AgrawalGenerator, 206

Agresti-Coull bound, 39

AMRules, 147 , 200

AMS (Alon-Matias-Szegedy) sketch, 57

Android operating system, 190

Apex, 196

approximation, 36

absolute, 36

( 𝜖 , δ )-approximation, 36 , 37 , 62 , 64

relative, 36

Apriori algorithm, 19 , 168

Area under the curve (AUC), 90

ARFF files, 22 , 203

ArffFileStream, 204

ARL, Average Run Length, 75

attributes, 85

AUC, 90

bagging, 17 , 133

Bayes’ theorem, 95

Bernstein’s inequality, 39

bias (in classifiers), 94

BICO algorithm, 154

Big Data, 3

challenges, 6

hidden, 7

Three V’s, 3

visualization, 7 , 212

BIRCH algorithms, 152

Bloom filter, 43

boosting, 135

bootstrap, 133

C++ language, 195

C4.5, 101 , 117

CART, 101

centers (clustering), 149

centroids (clustering), 149

CF trees, 153

change in data streams, see drift

CHARM algorithm, 170 , 178

Chebyshev’s inequality, 38 , 46 , 62 , 92

Chernoff’s bound, 38 , 92

classification, 11 , 85

comparing classifiers, 92

concept evolution, 121

CVFDT, 105

decision stump, 208

decision trees, 99 , 208

delayed feedback, 13

ensembles, 71 , 82 , see also ensembles

evaluation, 86

Hoeffding Adaptive Tree, 108

Hoeffding Tree, 102

in MOA, 190 , 201 , 208–210

k -NN, 114 , 190

lazy learning, see k -NN (nearest neighbors)

Majority Class classifier, 94

missing feedback, 13

multi-label, 115

Multinomial Naive Bayes, 98

Naive Bayes, 95

No-change classifier, 94

perceptron, 113

UFFT, 107

VFDT, 104

VFDTc, 107

closed pattern, 169

CloseGraph algorithm, 170 , 179 , 182

cluster mapping measure (CMM), 151

clustering, 11 , 17 , 149

BICO, 154

BIRCH, 152

centroids or centers, 149

CluStream, 154

ClusTree, 156

CobWeb, 212

cost functions, 149

DBSCAN, 155

Den-Stream, 155

density-based, 155

distance function, 149

distributed, 200

evaluation, 150

in MOA, 160 , 211

k -means, 18 , 151

k -means++, 152

microclusters, 152

other methods, 159

similarity, 149

StreamKM++, 158 , 212

surveys, 159

CluStream algorithm, 154 , 212 , 213

ClusTree algorithm, 156 , 212

CM-sketch, see Count-Min sketch

CMM (cluster mapping measure), 151

CobWeb algorithm, 212

Cohen’s counter, 44 , 60

cohesion measure (clustering), 150

communities, 18

comparing classifiers, 92

concentration inequalities, 37 , 101

concept drift, see drift

concept evolution, 121

ConceptDriftRealStream, 205

ConceptDriftStream, 204

confidence intervals, 37 , 92

confusion matrix, 91

coresets

coreset tree, 158

in clustering, 158

in pattern mining, 172 , 178 , 182

cost measures, 93

Count-Min sketch, 51 , 60 , 81 , 82

counting

distinct or unique items, 40 , 42 , 48

items, 41

CountSketch, 54

cross-validation, 87 , 204

distributed, 88

CUSUM test, 75 , 82 , 211

CVFDT, 105 , 110

data streams, 35

adversarial vs. stochastic, 35 , 69

change, see drift

definition, 8 , 11

distributed, 61 , 88 , 197

frequency moments, 56

in computer security, 9 , 121

in disaster management, 9

in e-commerce, 9

in healthcare, 9

in marketing, 9

in social media, 9 , 189 , 190

in utilities, 9

items, 36

Markovian, 69

scenarios, 8 , 85 , 121 , 143

dataset shift, 68

DBSCAN algorithm, 155

DDM, Drift Detection Method, 78 , 82 , 83 , 107 , 211

decay factor, 73

decision rules, 146 , 200

Decision Stump classifier, 208

decision trees, 16 , 99 , 208

split criteria, 101

delayed feedback, 13

δ , confidence parameter, 37

Δ-support, 178 , 183

Den-Stream algorithm, 155 , 212

density-based clustering, 155

discretization, 109 , 190

distinct items, see counting

distributed evaluation, 88

drift, 67

gradual, 69

in MOA, 190 , 210

recurrent concepts, 69 , 139

shift, 69

simulating in MOA, 22 , 25 , 204

strategies to manage, 70

types of, 69

Eclat algorithm, 19 , 169

ensembles, 17 , 71 , 82 , 129

Accuracy-Weighted, 129

Adaboost, 135

Adaptive Random Forests, 137

Adaptive Size Hoeffding Tree, 138

ADWIN Bagging, 17 , 133

bagging, 17 , 133

boosting, 135

exponentiated gradient, 132

Hoeffding Option Tree, 136

in MOA, 209

Leveraging Bagging, 134

Online Bagging, 133

Online Boosting, 135

random forests, 136

stacking, 132 , 137

Weighted Majority, 130

entropy, 101 , 117

𝜖 , accuracy parameter, 36

Equal-frequency discretization, 109

Equal-width discretization, 109

error-correcting output codes, 134

estimators, 72

evaluation, 14 , 86

AUC, 90

cross-validation, see cross-validation

distributed, see distributed evaluation

holdout, see holdout evaluation

in clustering, 150

in MOA, 22–31 , 203

interleaved chunks, see interleaved chunks evaluation

prequential, see prequential evaluation

statistical significance, 92

test-then-train, see test-then-train evaluation

EWMA estimator, 73 , 82 , 151 , 211

exhaustive binary tree, 110 , 146

Exponential Histograms, 57 , 61 , 64 , 73 , 80

exponentiated gradient algorithm, 132

Facebook graph, 48

fading factor, 73

Fayyad and Irani’s discretization, 109

feature extraction, 10

features, see attributes

FilteredStream, 205

FIMT-DD, 146

Flajolet-Martin counter, 45 , 60

Flink, 6 , 196

FP-Growth algorithm, 19 , 168 , 175

FP-Stream algorithm, 175

FP-Tree, 168

frequency moments (in streams), 56

frequency problems, 48

frequent elements, see heavy hitters

frequent pattern, see pattern mining

Frequent sketch, 49

FrugalStreaming sketch, 54

Gaussian distribution, 38 , 111

Gini impurity index, 101

gnuplot, 219

GPU computing, 137

graph mining, 10 , 178

graphical models, 94

GraphX, 6

Hadoop, 5 , 196

hash functions, 43 , 44 , 61

families of random, 61

fully independent, 61

in practice, 62

pairwise independent, 61

HDFS, 5

heavy hitters, 49 , 64

by sampling, 49

in itemset mining, 174

in pattern mining, 174

surveys, 49

Hoeffding Adaptive Tree classifier, 17 , 108 , 209

Hoeffding adaptive tree classifier, 195

Hoeffding Option Tree classifier, 136 , 146 , 209

Hoeffding Tree classifier, 16 , 102 , 190 , 208

multi-label, 117

vertical, 200

Hoeffding’s bound, 38 , 46 , 63 , 65 , 81 , 82 , 92 , 101 , 102 , 172 , 177

holdout evaluation, 14 , 87 , 204

Huawei, 195

HyperANF counter, 47

HyperLogLog counter, 46 , 47

HyperplaneGenerator, 206

hypothesis testing, see statistical tests

IBLStreams, 145 , 189

iceberg queries, 49

IID assumption, 69 , 86 , 91

IncMine algorithm, 19 , 176 , 183 , 189

information gain, 101 , 101 , 117

interleaved chunks evaluation, 88 , 204

Internet of Things, 3 , 8

items, 36

itemset, 165

Java language, 187 , 188 , 195 , 196 , 221 , 227

good practices, 238

Kalman filter estimator, 74

Kappa architecture, 6

Kappa M statistic, 90

Kappa statistic, 90

Kappa temporal statistic, 91

kernel methods, 94 , 148

k -grams, counting, 42

k -means algorithm, 18 , 151

k -means++ algorithm, 152

k -NN (nearest neighbors), 15 , 190

for classification, 114 , 122

for regression, 145

Lambda architecture, 6

Laplace correction, 97 , 99

large-deviation bounds, see concentration inequalities

lazy learning, see k -NN (nearest neighbors)

learning rate, 114

LEDGenerator, 206

LEDGeneratorDrift, 207

Leveraging Bagging, 134 , 210

LimAttClassifier, 138 , 210

Linear counting, 43 , 60

linear estimator, 73

linear regression, 143

Lossy Counting sketch, 49 , 174

Mahout, 6

Majority Class classifier, 15 , 94 , 210

Markov’s inequality, 38 , 53 , 92

maximal pattern, 169

McDiarmid’s inequality, 39 , 101

McNemar’s test, 93

MDL, Minimum Description Length, 109

MDR, Missed Detection Rate, 75

MEKA project, 193

Mergeability, 60

Merging sketches, 60

microclusters, 18 , 152 , 154 , 200

Milgram’s degrees of separation, 48

Misra-Gries counter, 49

missing data, 10

missing feedback, 13

MLIB, 6

MOA, 10 , 21 , 187

adding classes to, 227

API, 221

classification, 201 , 218

clustering, 160

Command Line Interface (CLI), 29 , 217

compiling code for, 237

discretization, 190

distributed, see SAMOA

evaluation, 22–31 , 203 , 218

extensions, 189

for Android, 190

for social media analysis, 189 , 190 , 192

for video processing, 193

generators, 160 , 204 , 204 , 212

good programming practices, 237

GUI, 22 , 23 , 201

Hadoop, 196

installing, 21 , 188

modifying the behavior of, 227

multi-target learning, 188

outlier detection, 188

platforms, 187 , 188 , 190

programming applications that use, 221

recent developments, 188

recommender systems, 189

regression, 148 , 218

running tasks, 22 , 123 , 201 , 217

SAMOA, 196

Spark, 195

tasks, 188 , 203 , 217

visualization, 212

MOA-TweetReader, 189

MOAReduction, 190

Moment algorithm, 19 , 174 , 189

moment computation, 56

Morris’s counter, 41 , 61 , 63

motif discovery, 10

MTD, Mean Time to Detection, 75

MTFA, Mean Time between False Alarms, 75

multi-label classification, 115

BR method, 115

in MOA, 193

LC method, 115

multi-label Hoeffding Tree, 116

PW method, 116

multi-target learning, 188

Multinomial Naive Bayes classifier, 98 , 208

Naive Bayes

Multinomial, see Multinomial Naive Bayes classifier

Naive Bayes classifier, 16 , 95 , 105 , 208

neighborhood function (in graphs), 47

No-change classifier, 15 , 94 , 210

normal approximation, 38 , 92 , 172

normal distribution, 38 , 111

numeric attributes, 109 , 143

in MOA, 190

OCBoost, 209

Online Bagging, 133

Online Bagging algorithm, 209

Online Boosting algorithm, 209

Onling Boosting algorithm, 135 , 209

OpenML project, 194

outliers, 70 , 81 , 109 , 113 , 188

overfitting (in classifiers), 94

OzaBag, 133 , 209

OzaBagADWIN, 133 , 209

OzaBagASHT, 138 , 209

OzaBoost, 135 , 209

PAC-learning, 37

Page-Hinkley test, 76 , 82 , 83 , 146 , 211

pattern mining, 11 , 18 , 165 , 167

AdaGraphMiner, 179

Apriori, 168

association rules, 182

candidate pattern, 168

CHARM, 170 , 178

closed pattern, 169 , 182 , 183

CloseGraph, 170 , 179 , 182

coresets, 172 , 178 , 182

Eclat, 169

FP-Growth, 168

FP-Stream, 175

generic algorithm on streams, 170

graph, 166 , 178 , 182

in MOA, 178 , 182 , 189

IncMine, 176 , 183

itemset, 18 , 165 , 181

maximal pattern, 169

Moment, 174

other algorithms, 170 , 181

pattern, 165 , 166

pattern size, 167

sequence, 165 , 181

SPMF, 178 , 182

subpattern, 166

superpattern, 166

support, 166

surveys, 181

tree, 166 , 181

WinGraphMiner, 179

Perceptron, 132 , 146 , 210

for regression, 145

stacking on Hoeffding Trees, 137 , 210

perceptron

for classification, 113

Poisson distribution, 133 , 134

prequential evaluation, 14 , 88 , 90 , 204

Probabilistic counter, see Flajolet-Martin counter

purity measure (clustering), 150

Python language, 195

quantiles, 54

FrugalStreaming sketch, 54

Greenwald and Khanna’s sketch, 111 , 190

in MOA, 190

R language, 191 , 195

RAM-hour, 94

random forests, 136

randomized algorithm, 36

RandomRBFGenerator, 207

RandomRBFGeneratorDrift, 207

RandomSEAGenerator, 207

RandomTreeGenerator, 207

range-sum queries, 53 , 64

ranking / learning to rank, 10

real-time analytics, see data streams

recommender systems, 10 , 189

recurrent concepts, 10 , 69 , 139

regression, 143

AMRules, 147

error measures, 144

FIMT-DD, 146

IBLStreams, 145

in MOA, 148 , 189 , 210

k -NN, 145

linear regression, 143

Perceptron, 145

Spegasos, 148

stochastic gradient descent, 148

reservoir sampling, 40

rule learners, 94 , 146

SAMOA, 196

sampling, 39 , 63

for heavy hitters, 49

reservoir, see reservoir sampling

Samza, 196

semi-supervised learning, 13

SGD, 210

sigmoid, 25 , 204

silhouette coefficient, 150

six degrees of separation, 48

sketches, 35 , 36

ADWIN, 79 , 82 , 108 , 179

AMS (Alon-Matias-Szegedy), 57

Cohen’s counter, 44

Count-Min, 51

CountSketch, 54

Exponential Histograms, 57 , 73

Flajolet-Martin counter, 45

for linear algebra, 63

for massive graphs, 48

Frequent, 49

FrugalStreaming, 54

HyperLogLog counter, 46

Linear counting, 43

Lossy Counting, 49 , 174

merging, 60

Misra-Gries, 49

Morris’s counter, 41

other sketches, 63

quantiles, 54 , 111

range-sum queries, 53

reservoir sampling, 40

Space Saving, 50 , 64 , 82 , 174 , 183

Sticky Sampling, 49

Stream-Summary, 51

skip counting, 41

sliding windows, 58 , 73 , 79 , 83 , 178

Space Saving sketch, 50 , 61 , 64 , 82 , 174 , 183

spam, 11 , 85 , 100

Spark, 6 , 195

Spark Streaming, 6 , 195

SPegasos, 148 , 210

split criteria, 101

split-validation, 89

SPMF framework, 178 , 182

SSQ measure (clustering), 150

stacking, 132

Perceptron on Hoeffding Trees, 137 , 210

STAGGERGenerator, 208

statistical significance, 92

McNemar’s test, 92

statistical tests, 76 , 81

Sticky Sampling sketch, 49

stochastic averaging, 46

stochastic gradient descent, 114 , 148 , 210

Storm, 196

stream cross-validation, 90

Stream-Summary structure, 51

StreamDM-C++ project, 195

streaming, see data streams

StreamKM++ algorithm, 158 , 212

Streams project, 196

subpattern, see pattern mining

summaries, see sketches

superpattern, see pattern mining

supervised learning, 11 , 85

support (of a pattern), 166

support vector machines (SVM), see kernel methods

TemporallyAugmentedClassifier, 95 , 210

TensorFlow, 6

test-then-train evaluation, 14 , 87 , 204

time series, 68

Twitter, 15 , 85 , 96 , 99 , 121 , 189 , 192

UFFT, 107 , 112

unique items, see counting

unsupervised learning, 11 , 149 , 165

Vertical Hoeffding Tree, 200

VFDT, 104 , 110

VFDTc, 107 , 110

VFML, 110

video processing, 193

WaveformGenerator, 208

WaveformGeneratorDrift, 208

Weighted Majority algorithm, 130

WEKA, 10 , 22 , 190 , 193 , 203

WinGraphMiner algorithm, 179