OVERVIEW

Data stream analysis: Foundations, major tasks and tools

Maroua Bahri¹ | Albert Bifet¹,² | João Gama³ | Heitor Murilo Gomes² | Silviu Maniu⁴

¹ LTCI, Télécom Paris, IP-Paris, Palaiseau, France
² Department of Computer Science, University of Waikato, Hamilton, New Zealand
³ INESC TEC, University of Porto, Porto, Portugal
⁴ LRI, Université Paris-Saclay, Orsay, France

Correspondence to: Maroua Bahri, LTCI, Télécom Paris, IP-Paris, Palaiseau, France. maroua.bahri@telecom-paris.fr

Funding information: Huawei Technologies France SASU and Télécom Paris, Grant/Award Number: YBN2018125164

Edited by: Sushmita Mitra, Associate Editor, and Witold Pedrycz, Editor-in-Chief
Abstract
The significant growth of interconnected Internet-of-Things (IoT) devices, the
use of social networks, along with the evolution of technology in different
domains, lead to a rise in the volume of data generated continuously from
multiple systems. Valuable information can be derived from these evolving
data streams by applying machine learning. In practice, several critical issues
emerge when extracting useful knowledge from these potentially infinite data, mainly because of their evolving nature and high arrival rate, which make it impossible to store them entirely. In this work, we provide a comprehensive survey that discusses the research constraints and the current state of the art in this vibrant framework. Moreover, we present an updated overview of the latest contributions proposed for the different stream mining tasks, particularly classification, regression, clustering, and frequent pattern mining.
This article is categorized under:
Fundamental Concepts of Data and Knowledge > Key Design Issues in Data
Mining
Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining
1 | INTRODUCTION
In recent decades, the world has been invaded by the ubiquity of technology in several sectors of society, such as healthcare, transport, and banking. This digital revolution involves progressively more sensors and systems that continually generate massive amounts of data in an open-ended way, as big data streams. A good example is the large system of interrelated devices and sensors known as the Internet of Things (IoT) (Caiming & Yong, 2020; Da Xu, He, & Li, 2014). The latter has become a key element of life automation, present for instance in cars, cellphones, airplanes, and drones. These devices create huge amounts of data streams that are expected to grow in the near future. By 2019, 26 billion such devices were connected, and this number is expected to increase to almost 80 billion devices in use all over the world by 2025.¹ Therefore, systems and algorithms to handle these vast flows of data must be explored.
Stream² data are defined as “unbounded sequences of multidimensional, sporadic, and transient observations made available along time” (Bahri, 2020). To automatically extract useful information from data streams, we need to consider stream computing, which analyzes data generated at high velocity in real time, as required in big data stream analytics (Kim, 2017). Hence, stream mining tasks have become crucial in multiple real-world applications, for example, social networks and spam email filters, that demand real-time (or near real-time) analysis since the data that they generate are drawn from evolving distributions.

Received: 15 August 2020 | Revised: 13 January 2021 | Accepted: 1 February 2021
DOI: 10.1002/widm.1405
WIREs Data Mining Knowl Discov. 2021;e1405. wires.wiley.com/dmkd © 2021 Wiley Periodicals LLC.
https://doi.org/10.1002/widm.1405
Mining data streams has attracted many researchers due to the importance of its applications (Aggarwal, 2007; Bifet, Gavaldà, Holmes, & Pfahringer, 2018; Gaber, Zaslavsky, & Krishnaswamy, 2005). Amini, Wah, and Saboohi (2014), Kokate, Deshpande, Mahalle, and Patil (2018), and Carnein and Trautmann (2019) reviewed works on unsupervised learning (clustering), presenting models that are mainly used for density-based clustering. Current incremental clustering algorithms rely mainly on techniques such as density micro-clustering and density grids that require several parameters to be effective (Kokate et al., 2018). On the other hand, diverse works on supervised learning have been proposed, especially in classification, which is perhaps the most commonly researched and active machine learning task.
Our goal in this paper is to provide the artificial intelligence audience with a brief literature overview of the most
important foundations when dealing with big data streams by shedding light on how the research in the corresponding
framework can progress. While several books (Aggarwal, 2007; Gama & Gaber, 2007) and articles (Gaber et al., 2005;
Gama, 2012; Nguyen, Woon, & Ng, 2015) provide an overview of the state-of-the-art in the stream context, many new
and promising algorithms have emerged since then. Also, each of these reviews generally covers only one machine learning task: for instance, Losing, Hammer, and Wersing (2018) study advances in classification, while Gomes et al. (2019b) do not discuss the clustering, regression, and frequent pattern mining tasks. This is a gap that the current paper addresses.
We argue that these most recent advances in the data stream (Bahri et al., 2020; Bahri, Maniu, & Bifet, 2018;
Besedin, Blanchart, Crucianu, & Ferecatu, 2017; Gomes et al., 2017; Gomes et al., 2019; Losing, Hammer, &
Wersing, 2016; Montiel, Read, Bifet, & Abdessalem, 2018; Rodrigues, Araújo, Gama, & Lopes, 2018; Veloso, Gama, &
Malheiro, 2018) make this research area worth revisiting with a more ambitious scope. We first provide the basic concepts of the stream setting while elucidating the challenges and how they could be addressed. Then, we review the progress in the different stream mining tasks, with a particular focus on the most active task, classification, and report reputed and recent approaches and frameworks for data streams.
2 | FOUNDATIONS
The unbounded nature of evolving data streams raises technical and practical limitations that make traditional algorithms fail, because of the high use of resources (such as time and memory) needed to process dynamic data distributions. In this section, we present the fundamental research issues encountered in the streaming framework.
2.1 | Challenges
The following challenges are mostly common across the different data stream mining tasks that will be presented in the
next sections (Aggarwal & Philip, 2007; Gama & Gaber, 2007; Kolajo, Daramola, & Adebiyi, 2019).
•Evolving data streams. To cope with the ever-growing size of data, stream algorithms must address the evolving, high-speed nature and complexity of data, because a stream usually delivers instances quickly. Therefore, stream mining algorithms should be scalable and process recent instances from the stream in a dynamic fashion. Moreover, we need scalable frameworks (Section 8) to handle big data streams by adopting efficient resource management strategies and parallelization.
•Running time. An online algorithm must process incoming observations as rapidly as possible; otherwise, the algorithm will not be efficient enough for applications where rapid processing is required.
•Memory usage. Because massive data streams would require limitless memory to be processed and stored, it is difficult and even impossible to store the entire stream. So, any stream algorithm must be able to operate under restricted memory constraints by storing only small synopses of the processed data and the current model(s).
•High-dimensionality. In some scenarios, streaming data may be high-dimensional, for example, text documents, where distances between instances grow exponentially due to the curse of dimensionality. The latter can potentially impact any algorithm's performance, mainly in terms of time and memory.
•Concept drifts. Since data streams are evolving, the underlying distribution may change at any time, an eventuality known as concept drift. This phenomenon can impact the predictive performance of algorithms because the currently learned model will no longer be representative of the next incoming data. To deal with new trends, learning algorithms use drift detectors to identify changes as soon as they appear. We redirect readers to Gama, Žliobaite, Bifet, Pechenizkiy, and Bouchachia (2014) for a review on concept drift.
•Delayed labeling. Stream mining algorithms mostly suppose that labels are available before the next instance arrives (immediate labeling). However, labels may arrive with a delay, which may be fixed or vary across instances. Thus, several algorithms that rely on concept drift detection will underperform when faced with a non-negligible delay in receiving the labeled data. This was illustrated in Gomes et al. (2017), where the authors present the same experiments using both an immediate labeling setting and a fixed delayed setting.
•Imbalanced labels. When a certain class label appears far more often than the other(s), it is referred to as the majority class; this may impact learning algorithms' performance because they are designed to optimize for generalization, and consequently the minority class may be ignored (López, Fernández, García, Palade, & Herrera, 2013).
The aforementioned challenges are significant across the different data stream mining tasks. To cope with these requirements, approaches should integrate incremental strategies suited to this setting. More challenges that arise in the case of distributed systems, such as integration and heterogeneity, are discussed in Kolajo et al. (2019).
2.2 | Processing
The above-mentioned stream setting requirements can be addressed by using some well-established methods, presented
in the following (Aggarwal & Philip, 2007; Gama & Gaber, 2007):
•Single-pass. Unlike with static datasets, it is no longer possible to analyze the data using several passes during the course of computation, because of its unbounded nature. Taking this constraint into account, algorithms work by processing each instance from the stream only once (or a couple of times) and using it to update the model, or the statistical information about the data, incrementally (instance-incremental algorithms, see Section 3.1). In the case of batch-incremental algorithms, we process a batch (or chunk) of instances at once instead of a single instance.
•Window models. Storing a data stream and scanning it several times is not allowed. In order to capture significant information from these evolving data, different kinds of moving windows have been proposed to continuously store a part of the stream. In the following, we mention the standard ones (Ng & Dash, 2010).
•Sliding window model: The window size is fixed to w and each instance is time-stamped, that is, the most recent instances from the stream are kept inside the window. This moving window slides over the stream while maintaining the same size (Figure 1a).
•Landmark window model: This model starts by predefining an instance as a landmark from which the window grows. Whenever the landmark changes, all the instances are removed from the window and the new ones are maintained starting from the new landmark instance (Figure 1b). The problem with this window model arises when the landmark instance is fixed at the beginning of the flow; in that case, the window will store the whole stream.
•Damped window model: The damped model uses a fading function that periodically updates the weights of instances inside the window. “The key idea consists in assigning a weight to each instance from the stream, which is inversely proportional to its age, that is, assign more weights to the recently arrived data. When the weight of an instance exceeds a given threshold, it will be removed from the model” (Bahri, 2020) (see Figure 1c).

FIGURE 1 Window models: (a) sliding window model; (b) landmark window of size 13; (c) damped window model
Table 1 briefly describes these windows and shows some of their pros and cons. Besides, different approaches have been developed in this manner, where the choice of the window model depends on the application's needs (Ng & Dash, 2010).
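As a concrete illustration, the sliding and damped models above can be sketched in a few lines of Python. These are hypothetical helper classes, not taken from any surveyed system, and the decay rate `lam` is an illustrative choice:

```python
import math
from collections import deque

class SlidingWindow:
    """Fixed-size sliding window: only the w most recent instances survive."""
    def __init__(self, w):
        self.window = deque(maxlen=w)  # fixed capacity w

    def add(self, instance):
        # appending beyond capacity silently evicts the oldest instance
        self.window.append(instance)

    def contents(self):
        return list(self.window)

def fading_weight(age, lam=0.5):
    # damped model: weight decays exponentially with the instance's age,
    # so recently arrived data weigh more than old data
    return math.pow(2.0, -lam * age)

win = SlidingWindow(3)
for x in range(5):           # stream: 0, 1, 2, 3, 4
    win.add(x)
print(win.contents())        # [2, 3, 4], the 3 most recent instances
print(fading_weight(0) > fading_weight(4))   # True
```

In the damped model, an instance would be dropped once its weight falls past a chosen threshold; the function above only shows the decay itself.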
2.3 | Summarization
To cope with resource constraints, brief information and synopses can be constructed from instances to reduce their size for processing, rather than, or in combination with, the aforementioned window models. This can be realized through the selection of some incoming instances or through synopsis construction. In what follows, we present a brief description of some techniques.
•Sampling. Sampling is the simplest way to keep information about data. Since storing unbounded evolving data streams is impractical, it is crucial to sample from the stream to maintain some “representative” instances and store a synopsis of the stream in memory (Babcock, Babu, Datar, Motwani, & Widom, 2002). However, this simplicity comes at a cost: the selected instances may not be representative enough, which can impact the results of mining algorithms.
•Histograms. Histograms are commonly used in the offline fashion where multiple passes are allowed, and their extension to the online setting remains challenging. Garofalakis, Gehrke, and Rastogi (2002) proposed some incremental histogram techniques to handle data streams, which fail in some cases where the data distribution is not uniform.
•Sketches. A sketch is a probabilistic data structure that stores summaries and approximations of data (Manku & Motwani, 2002). Different sketch-based methods exist that construct synopses of data using a limited amount of memory. For instance, the count-min sketch (Cormode & Muthukrishnan, 2005) was presented as a generalization of Bloom filters to approximate the counts of objects with strong theoretical guarantees.
•Micro-clusters. Micro-clusters are a method in stream clustering (Aggarwal, Han, Wang, & Yu, 2003) used to store synopsis information about the instances in the stream and the clusters.
•Grids. Grid-based methods partition the data space into small cubes, called grids, and instances from the data stream
are mapped to them.
•Dimension reduction. Dimension reduction is a well-known preprocessing method to tackle high-dimensional data that may increase the cost of any mining algorithm (Bahri et al., 2020). It consists in mapping high-dimensional instances onto a lower-dimensional representation while preserving the distances between instances (Sorzano, Vargas, & Montano, 2014).
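To make the sketch idea concrete, here is a toy count-min sketch in Python. It follows the standard textbook construction rather than any specific implementation cited above; the width, depth, and per-row salting scheme are illustrative choices:

```python
import random

class CountMinSketch:
    """Toy count-min sketch: d hash rows of w counters; the frequency
    estimate is the minimum counter over the d rows, so it can only
    over-estimate the true count, never under-estimate it."""
    def __init__(self, w=1000, d=4, seed=42):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # one random salt per row to simulate independent hash functions
        self.salts = [rng.getrandbits(32) for _ in range(d)]

    def _index(self, item, row):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._index(item, r)] += count

    def estimate(self, item):
        return min(self.table[r][self._index(item, r)] for r in range(self.d))

cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a"]:
    cms.update(token)
print(cms.estimate("a"))  # at least 3; exactly 3 unless rare collisions
```

A sketch-based Naive Bayes, as discussed in Section 3.1.1, would query such estimated frequencies in place of exact counts.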
3 | SUPERVISED LEARNING
In the following, we assume that the stream S contains an infinite number of instances X_1, …, X_N, …, where N is the number of instances seen so far in the stream and X_i is defined as a vector of a attributes (also called features).
TABLE 1 Window models comparison

Window model     | Definition                                 | Advantages                                                 | Disadvantages
-----------------|--------------------------------------------|------------------------------------------------------------|--------------------------------------
Sliding          | Processes the last received instances      | Suitable when the recent instances are of special interest | Ignores elements from the stream
Landmark         | Processes the entire history of the stream | Suitable for single-pass algorithms                        | Instances have the same significance
Damped (fading)  | Assigns weights to instances               | Suitable when the old instances may affect the results     | Unbounded time window
3.1 | Classification
Classification is a very popular supervised learning task that predicts the class label (or target category) y′ for an unlabeled observation X′ using a model built on labeled training instances (X, y), where y′ = f(X′) (Hand, Mannila, & Smyth, 2001). The goal of a classification algorithm is to accurately predict the class labels of instances.
Classification has commonly been used in the batch setting for static data, where a training set (labeled instances) is used to build a model, and then a test set (unlabeled instances) is used for prediction to evaluate the model. Thus, multiple accesses to the data are allowed in this traditional task. However, traditional (or batch) classifiers are unable to process evolving data streams due to the requirements of the stream framework (Bifet & Kirkby, 2009). Different from the learning and prediction phases of batch classifiers, stream algorithms update their models incrementally in a single pass; moreover, they should work within a limited amount of time to afford real-time processing without delay, and within a limited amount of memory to avoid storing massive quantities of data for the prediction task. Incremental approaches are commonly divided into two main categories (Read, Bifet, Pfahringer, & Holmes, 2012):
•Instance-incremental approaches (Bifet & Gavaldà, 2009; Domingos & Hulten, 2000; Klawonn & Angelov, 2006; Los-
ing et al., 2016) that update the learned models incrementally with each incoming observation once it arrives. More
approaches are provided in the following sections.
•Batch-incremental approaches that receive a batch of data (composed of multiple instances) and use it to update the
models (Bahri et al., 2020; Cortes & Vapnik, 1995; Holmes, Kirkby, & Bainbridge, 2004; Hosmer Jr, Lemeshow, &
Sturdivant, 2013). For instance, Holmes et al. proposed a batch-incremental ensemble method that divides the data
stream into small chunks and builds a global model that combines the models created by all the members (trees) inside the ensemble.
A relevant study by Read et al. compares the instance-incremental and batch-incremental methods on data streams commonly used in the literature, showing the difference in the results obtained by the two categories. They obtain similar results in terms of accuracy, but the instance-incremental approaches use fewer resources.
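The two update styles can be illustrated with a deliberately simple "model", a running mean, updated either per instance or per chunk. This is a hypothetical sketch for illustration, not taken from Read et al.:

```python
class RunningMean:
    """Toy incremental 'model': a running mean over a numeric stream."""
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def learn_one(self, x):
        # instance-incremental update: one instance, one model update
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def learn_batch(self, chunk):
        # batch-incremental update: a whole chunk updates the model at once
        for x in chunk:
            self.learn_one(x)

m = RunningMean()
for x in [1.0, 2.0, 3.0, 4.0]:   # instance-incremental pass over the stream
    m.learn_one(x)
print(m.mean)                    # 2.5
```

Both styles reach the same state here; for real learners, the batch variant trades update latency for the ability to apply chunk-level preprocessing.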
Several offline classifiers have been widely studied for the stream setting and turned out to be inefficient with evolv-
ing data streams. Thus, streaming versions of these classifiers have been implemented in this vibrant environment.
Generally, classification algorithms can be grouped into the five main categories shown in Figure 2: (i) frequency-based; (ii) neighborhood-based; (iii) tree-based; (iv) ensemble-based; and (v) neural network-based algorithms.
3.1.1 | Frequency-based classification
Naive Bayes (NB) (Friedman, Geiger, & Goldszmidt, 1997) is one of the simplest classifiers; it performs prediction (given a test instance) by computing the posterior probability using Bayes' theorem under the assumption that all attributes are independent given the class label. In practice, this naive independence assumption does not always hold, but the classifier achieves surprisingly good results on multiple datasets. NB is an incremental algorithm that needs no adaptation to handle data streams, thanks to its frequency-based scheme that only stores frequencies about instances.
FIGURE 2 Taxonomy of data stream classification with some well-known algorithms: frequency-based (Naive Bayes), neighborhood-based (NN), tree-based (Hoeffding tree), ensembles (Bagging), and neural-based (Perceptron)
In order to reduce the resource usage of the NB algorithm, a sketch-based NB algorithm (Bahri et al., 2018) has been proposed recently, which uses the count-min sketch (Cormode & Muthukrishnan, 2005) as a data structure to compactly store approximate frequencies of the data. To further improve the results of the sketch-based NB approach, Bahri et al. incorporated the hashing trick (Weinberger, Dasgupta, Langford, Smola, & Attenberg, 2009), a dimensionality reduction technique, to efficiently handle high-dimensional data streams incrementally.
The multinomial Naive Bayes (Mccallum & Nigam, 1998) is a specific instance of the NB classifier which uses a
multinomial distribution for each of the attributes. It is suitable for classification with discrete attributes (e.g., word
counts for text classification).
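The frequency-based scheme that makes NB naturally incremental can be sketched as follows, for discrete attributes with add-one smoothing. This is a minimal illustration, not MOA's or any cited implementation:

```python
import math
from collections import defaultdict

class StreamingNB:
    """Incremental Naive Bayes for discrete attributes: training only
    increments counters, so a single pass over the stream suffices."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.attr_counts = defaultdict(int)  # keyed by (class, attr index, value)
        self.total = 0

    def learn_one(self, x, y):
        self.total += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(y, i, v)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for y, cy in self.class_counts.items():
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(cy / self.total)
            for i, v in enumerate(x):
                score += math.log((self.attr_counts[(y, i, v)] + 1) / (cy + 2))
            if score > best_score:
                best, best_score = y, score
        return best

nb = StreamingNB()
nb.learn_one(("sunny", "hot"), "no")
nb.learn_one(("rainy", "cool"), "yes")
nb.learn_one(("sunny", "mild"), "no")
print(nb.predict_one(("sunny", "hot")))   # "no"
```

The sketch-based variant discussed above would replace the exact `attr_counts` dictionary with count-min sketch estimates to bound memory.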
3.1.2 | Neighborhood-based classification
k-Nearest Neighbors (kNN) is a lazy learning algorithm that does not build a model but uses the whole dataset to make predictions. Given a test instance, the kNN algorithm works by computing the distances, using a metric such as the Euclidean distance, between the unlabeled instance and all the others, and finishes by picking the k closest ones. Afterwards, by majority vote, kNN predicts for the test instance the most frequent class label among the k nearest neighbors. Since data streams are unbounded, it is not possible to store all the instances for the prediction phase. To cope with this issue, a stream version of the kNN algorithm has been proposed that uses a limited memory via a sliding window (Figure 1a) to store the most recent instances alongside the older ones already in the window. Nevertheless, the use of resources depends on the window size and the dimensionality of the data. Bifet, Pfahringer, Read, and Holmes (2013) showed that a small window will decrease the accuracy, while a large window will increase the use of resources: it is an accuracy-resources tradeoff.
To reduce the resource usage of kNN with high-dimensional data, Bahri et al. proposed a batch-incremental kNN that incrementally preprocesses the data with uniform manifold approximation and projection (UMAP) (McInnes, Healy, & Melville, 2018), a neighborhood and graph-based technique, to reduce the dimensionality without losing much accuracy. Besides the size of the sliding window, the batch-incremental kNN also has the batch size as an important hyperparameter to set, which poses an accuracy-resources tradeoff as well.
Self-adjusting memory kNN (samkNN) (Losing et al., 2016) is an incremental kNN algorithm that builds an ensemble with models targeting concept drifts in the stream and uses a dual memory: (i) a short-term memory for the current concept; and (ii) a long-term memory to keep track of past concepts.
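A minimal sliding-window kNN along the lines described above might look like this. It is a hypothetical sketch with illustrative values of k and window size, not the MOA implementation:

```python
import math
from collections import deque

class WindowedKNN:
    """kNN over a sliding window: only the w most recent labeled
    instances are kept, bounding memory at the cost of forgetting."""
    def __init__(self, k=3, w=1000):
        self.k = k
        self.window = deque(maxlen=w)   # (features, label) pairs

    def learn_one(self, x, y):
        # 'training' is just storing the instance; old ones fall out
        self.window.append((x, y))

    def predict_one(self, x):
        # k nearest stored instances by Euclidean distance
        nearest = sorted(
            (math.dist(x, xi), yi) for xi, yi in self.window
        )[: self.k]
        labels = [y for _, y in nearest]
        return max(set(labels), key=labels.count)   # majority vote

knn = WindowedKNN(k=3, w=5)
for x, y in [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b")]:
    knn.learn_one(x, y)
print(knn.predict_one((0.5, 0.5)))   # "a"
```

Shrinking `w` here makes the accuracy-resources tradeoff mentioned by Bifet et al. directly observable: less memory, but also less evidence per prediction.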
3.1.3 | Tree-based classification
Several tree-based algorithms have been proposed in the state of the art to handle evolving data streams (Domingos & Hulten, 2000; Gama, Rocha, & Medas, 2003; Jankowski, Jackowski, & Cyganek, 2016). For instance, the very fast decision tree (VFDT) (Domingos & Hulten, 2000), also called the Hoeffding tree (HT) algorithm, is an incremental decision tree learner that uses the Hoeffding bound to choose the best split at each node. However, HT does not use an explicit drift detection mechanism to address the changes that may occur in the data distribution.
To tackle this issue, an improved version of VFDT, the concept-adapting very fast decision tree (CVFDT), has been proposed (Hulten, Spencer, & Domingos, 2001) to adapt to changes by maintaining a sliding window with the most recent instances from the stream. CVFDT increments the counts corresponding to newly arrived instances and decrements the counts of the oldest instances that have fallen out of the moving window. Bifet and Gavaldà proposed an adaptive extension of the VFDT algorithm, called the Hoeffding adaptive tree (HAT), to handle changes in the data distribution using ADaptive WINdowing (ADWIN), a change detector and estimator that monitors the performance of branches of the tree. If a drift is detected, the tree will be updated by replacing old branches with new ones trained on the current concept (Bifet & Gavaldà, 2007). Tree-based algorithms demand more resources, particularly (i) memory as the tree grows, and (ii) time to choose the best split attributes.
Manapragada et al. introduced an incremental decision tree learning algorithm, the extremely fast decision tree (EFDT), that is similar to the Hoeffding tree (or VFDT) algorithm but uses the Hoeffding bound to select a useful split node and to replace it whenever a better alternative split is identified. The EFDT algorithm achieves better predictive performance than the HT algorithm on different data, but the latter is more computationally efficient (EFDT's running time exceeds that of HT). Besides, EFDT assumes a stationary distribution and is not designed as a learner to deal with drifts (Manapragada, Webb, & Salehi, 2018).
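The Hoeffding bound at the heart of these trees is simple to state: after n observations of a random variable with range R, the true mean lies within ε of the observed mean with probability 1 − δ, and a node splits once the gap between the two best candidate attributes exceeds ε. A sketch, with illustrative values of δ:

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that the true mean is within epsilon of the observed
    mean with probability 1 - delta, after n observations of range R."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# For information gain on a binary problem, R = log2(2) = 1.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
# epsilon shrinks as more instances arrive, so the tree can commit to
# a split with high confidence instead of waiting for all the data
```

This is what makes single-pass tree induction possible: the split decision is made on a statistically sufficient prefix of the stream rather than on the full dataset.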
3.1.4 | Ensemble-based classification
In its purest form, an ensemble is a set of individual models whose predictions are combined using majority vote or similar voting strategies. Ensemble learning is a relatively easy, off-the-shelf approach to improve predictive performance without building a single optimized model for the given learning problem. These gains come at the cost of the computational resources used for training the underlying models; while this cost is negligible for batch problems, it poses a serious concern in a streaming setting where algorithms must operate under constrained computational resources.
Several theoretical and empirical studies have demonstrated the benefit of combining various “weak” individual classifiers, which leads to better accuracy than a single classifier (Bifet & Gavaldà, 2009; Breiman, 1996; Dietterich, 2000; Kuncheva, 2014). Besides the predictive performance gains, ensemble-based methods are popular for data stream learning due to their flexibility to be coupled with concept drift detection strategies. Kuncheva (2014) mentioned three main reasons for adopting an ensemble-based method over a single learner: (i) Statistical. Assume that K samples D_k of a training data set D are used to train K classifiers, such that each of them obtains 100% accuracy on the subset D_k used for training. It is possible that the generalization capabilities of these classifiers will differ when they are applied to a test set disjoint from D, and therefore their accuracy will vary. From a statistical point of view, it is safer to use the mean of the individual predictions from these K classifiers instead of using only one of them, since the chance of selecting the classifier with the worst generalization capabilities is eliminated. There is a chance that the ensemble accuracy is worse than that of the best of its members, but at least we avoid selecting the worst classifier. (ii) Computational. Some classifiers may converge to a local maximum. Suppose that the local maxima of the K classifiers are close to the absolute maximum; then there may be a way of combining them into a model even closer to the absolute maximum (the optimal classifier) than any of them is capable of individually. (iii) Representational. The classifier used may not be able to represent the separation surface of the given problem. For example, a single linear classifier is unable to accurately represent nonlinear problems, although the combination of several linear classifiers can approximate a nonlinear separation surface.
In a streaming setting, two other reasons for using ensemble methods are worth noting: (iv) Scalability. In several ensemble methods, such as random forests (Breiman, 2001), the base models can be trained independently. These ensembles are embarrassingly parallel, which allows training them in a way that circumvents the computational resource constraints. (v) Concept drift adaptation. Ensemble methods adapt to drifts rapidly by updating or resetting the under-performing model(s) inside the ensemble (Gomes et al., 2017a). This does not require a complete reset of the whole model.
Many streaming ensemble-based methods (Beygelzimer, Kale, & Luo, 2015; Chen, Lin, & Lu, 2012; de Barros, de Carvalho Santos, & Júnior, 2016; Oza, 2005) are adapted versions of Boosting (Freund & Schapire, 1995) and Bagging³ (Breiman, 1996). Online Bagging (Oza, 2005) is a streaming version of Bagging that uses the Poisson distribution with λ = 1 to resample data, and generates models trained on diverse sets of samples. Unlike offline Bagging, stream Bagging assigns weights to samples using a Poisson distribution, which divides the samples as follows: 37% (sampled value 1) of instances are used for training only once, 26% (sampled value greater than 1) are used for training with repetition, and the remaining 37% (sampled value 0) are not used for training. In order to promote ensemble diversity and use training instances more often, Bifet et al. proposed leveraging bagging (LB), similar to online Bagging, which uses λ = 6 to resample instances. LB deals with concept drifts using ADWIN (Bifet & Gavaldà, 2007) by resetting the worst performing classifier whenever a change is detected.
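The Poisson(λ = 1) resampling behind Online Bagging can be checked empirically. In the full algorithm, each incoming instance would be fed k times to each base model (k = 0 means the model skips it), which approximates bootstrap sampling without storing the stream; the sampler below uses Knuth's algorithm, an illustrative choice:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler, adequate for small lambda."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(0)
counts = {0: 0, 1: 0, "2+": 0}
for _ in range(100000):
    k = poisson(1.0, rng)          # per-instance training weight
    counts[k if k <= 1 else "2+"] += 1
print({key: round(counts[key] / 100000, 2) for key in (0, 1, "2+")})
# roughly 0.37 / 0.37 / 0.26, matching the proportions quoted above
```

Leveraging bagging simply swaps λ = 1 for λ = 6, so almost every instance is used by every base model, usually several times.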
Streaming random forests (SRF) (Abdulsalam, Skillicorn, & Martin, 2007) is an incremental ensemble method that adapts the random forests (RF) algorithm (Breiman, 2001) to the stream framework. It grows binary Hoeffding trees and trains each tree on random samples without replacement by keeping a batch of instances. Adaptive random forests (ARF) (Gomes et al., 2017) is a recent version of the RF algorithm for evolving data streams that uses Hoeffding trees and randomly selects attributes for each node. ARF induces more diversity into the ensemble through the resampling technique used in LB (Bifet, Holmes, & Pfahringer, 2010). Besides, ARF includes one drift detection algorithm per ensemble member, which dictates when to start training a background tree (i.e., when a warning is signaled) and when to replace the current tree by the background tree (i.e., when a drift is signaled). Streaming random patches (SRP) (Gomes et al., 2019) is also a novel ensemble-based method that uses a drift detection mechanism similar to the one presented in ARF and integrates random subspaces (selected globally for each base model) with online Bagging.
3.1.5 | Neural network-based classification
Neural networks (NNs) are another category of models, inspired by the biological neurons that form the nervous system. In recent years, NNs have attracted the attention of the machine learning community and become one of its most active research directions. However, there is a growing demand for the incremental setting to analyze continuously evolving streams. The latter imposes several challenges that need to be handled because of the potentially infinite nature of the data and the real-time processing requirement, which raise memory and time issues (Besedin et al., 2017; Rutkowski, Jaworski, & Duda, 2020).
The Perceptron is the simplest NN, consisting of a single neuron. It is a linear classifier that typically requires several iterations over the training data, which is impossible in the stream learning setup. The Stream Perceptron maintains a set of weights that is updated for each new incoming instance from the stream using Stochastic Gradient Descent (SGD). The main difference with the batch setting is that, instead of doing multiple iterations to improve accuracy, in the stream setting we do only one pass over the data (Bifet et al., 2010c). Pratama et al. (2017) proposed a randomized neural network model that provides a scalable solution for adapting neural models to stream scenarios and is able to process the data one by one (instance-incremental) or in chunks (batch-incremental). In a different line of work, a new type of neural network called the recurrent fuzzy neural network has been proposed to model and capture the dynamical properties of fuzzy dynamical systems (Zhou & Da Xu, 2001). This network is suitable for describing dynamic systems because it can deal with time-varying input or output through its own natural temporal operation. Because of its dynamic nature, the recurrent neural network has been successfully applied to a wide variety of applications, such as speech processing and time series forecasting. However, training a recurrent neural network can be time consuming and thus inappropriate for evolving data streams (Chang, Chen, & Chang, 2012). Jain, Seera, Lim, and Balasubramaniam (2014) reviewed neural networks that work in the streaming environment by replacing epoch learning with one-by-one or mini-batch learning.
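The one-pass, per-instance SGD update described above can be sketched as follows; the `StreamPerceptron` class and the toy stream are illustrative, not taken from any particular library.

```python
import numpy as np

class StreamPerceptron:
    """Single-neuron linear classifier updated one instance at a time via SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b >= 0 else 0

    def partial_fit(self, x, y):
        # Single SGD step: the weights move only when the prediction is wrong.
        error = y - self.predict(x)
        self.w += self.lr * error * np.asarray(x, dtype=float)
        self.b += self.lr * error

# Simulate a stream: learn the (linearly separable) OR function one instance at a time.
model = StreamPerceptron(n_features=2)
stream = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)] * 20
for x, y in stream:
    model.partial_fit(x, y)
```

Each instance is seen exactly once, in arrival order, which is the key contrast with the batch Perceptron's repeated epochs.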
Deep learning. Recently, deep learning has attracted much attention for data stream processing. However, only limited progress on online deep learning has been made, because the networks should not be too deep to allow real-time processing. For instance, a generative adversarial network (GAN) has been proposed (Besedin et al., 2017) to derive a deep network on data streams without the need to store incoming data. This technique works by regenerating historical training data to compensate for the absence of a synopsis of the data. In Read, Perez-Cruz, and Bifet (2015), the authors explored two deep learning methods to classify semi-labeled data. These methods provide important advantages in the stream framework, such as learning incrementally with constant memory usage.
Deep learning is so far not commonly used with evolving data streams, due to the high computational cost of training and its sensitivity to hyper-parameter configurations (e.g., depth, number of neurons) (Marrón, Read, Bifet, & Navarro, 2017). Moreover, since deep learning methods are resource-intensive, they require powerful GPUs. Another challenging problem is that data streams may be non-stationary. To address this issue and speed up learning, it is more practical to train NNs on mini-batches, under the assumption that the non-stationary stream can be separated into chunks (data close in time are assumed to be stationary and to follow the same distribution) (Chen & Lin, 2014).
3.2 |Regression
Regression is a supervised learning task where the goal is to estimate the relationship between a dependent variable
and one or more independent variables. It differs from classification as the dependent variable is numeric. In the con-
text of data streams, regression is often associated with time series analysis and forecasting. However, in a data stream
setting, we assume that data points are independent and identically distributed (iid). Therefore, traditional univariate
and multivariate time series analysis is not applicable.
The most basic regression technique is simple linear regression, in which a line is fit to represent the relationship between a single independent variable and the dependent variable. Multiple linear regression is an extension to multiple independent variables. In a streaming setting, the process of fitting such a line to the data can be performed in an
8of17 BAHRI ET AL.
online fashion through SGD, as in a Perceptron. To fit more complex data, for which a linear relationship is not sufficient, one can rely on polynomial regression, where the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial. Alternatively, neural networks can be employed (as in classification), following the same scheme of training with SGD.
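Fitting a linear model online via SGD, as described above, can be sketched like this; the class and the synthetic, noiseless stream are illustrative, not from any specific stream library.

```python
import numpy as np

class OnlineLinearRegression:
    """Fits y ~ w.x + b with one stochastic-gradient step per arriving instance."""

    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return float(np.dot(self.w, x) + self.b)

    def partial_fit(self, x, y):
        # Gradient step on the squared error (y_hat - y)^2 w.r.t. w and b.
        x = np.asarray(x, dtype=float)
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

# Stream drawn from y = 2x + 1; the model is updated one point at a time.
rng = np.random.default_rng(42)
model = OnlineLinearRegression(n_features=1)
for _ in range(5000):
    x = rng.uniform(-1, 1)
    model.partial_fit([x], 2 * x + 1)
```

After one pass over the stream, the coefficients approach the generating slope and intercept without any instance ever being revisited.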
Incremental decision trees are often used in the streaming setting, most notably for classification with Hoeffding Trees (Domingos & Hulten, 2000), but also for regression. The Fast and Incremental Model Tree (FIMT-DD) (Ikonomovska, Gama, & Džeroski, 2011) builds incremental regression trees similarly to Hoeffding Trees: FIMT-DD starts with an empty tree whose leaves collect statistics from arriving data until a grace period is reached; the features are then ranked according to their variance w.r.t. the target variable to decide on splits, and if the two best-ranked features differ by at least the Hoeffding bound (Hoeffding, 1994), the leaf splits. Similarly to other incremental decision trees, FIMT-DD performs concept drift detection and adaptation by periodically resetting subbranches of the tree in which significant variance increases are observed.
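The split test above hinges on the Hoeffding bound: for n observations of a quantity with range R, the true mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the sample mean with probability 1 − δ. A minimal sketch of how such a split decision can be implemented (function names and default values are illustrative, not FIMT-DD's actual code):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean of n observations lies within epsilon
    of the sample mean with probability 1 - delta (R = range of the values)."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_merit, second_merit, value_range=1.0, delta=1e-7, n=200):
    # Split only when the merit gap between the two best-ranked features
    # exceeds the bound, i.e., the ranking is unlikely to be a sampling fluke.
    return (best_merit - second_merit) > hoeffding_bound(value_range, delta, n)
```

As n grows, ε shrinks towards zero, so even a small but genuine merit gap eventually triggers a split.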
In the search for predictive performance improvements of incremental regression trees, a common approach is to ensemble several trees, similarly to ensembles of classification models (Gomes et al., 2017a). Ikonomovska, Gama, and Džeroski (2015) proposed the online random forest (ORF) and online bagging (OBag) ensembles that use FIMT-DD as the base learner. Based on empirical experiments, the authors concluded that ORTO-A (online option trees with averaging) outperformed both OBag and ORF in terms of mean squared error (MSE). Gomes, Barddal, Ferreira, and Bifet (2018) proposed the Adaptive Random Forest regressor (ARF-Reg), an adaptation of the data stream classifier ARF (Gomes et al., 2017). Like ORF, ARF-Reg builds a forest of FIMT-DD trees; the main difference between the two algorithms is that ARF-Reg employs one instance of the ADWIN algorithm (Bifet & Gavalda, 2007) per tree to detect concept drifts. Even though there are some similarities between ensemble classification and regression models, there are also important differences, for example, w.r.t. how predictions are combined and how diversity is induced. These were recently empirically analyzed in Gomes, Montiel, Mastelini, Pfahringer, and Bifet (2020).
4|CLUSTERING
Unlike in supervised learning, the instances in clustering are not associated with a discrete class label, as in classification, or a continuous value, as in regression, because clustering methods aim to discover the possible classes from the content of the data (de Souza Viana, de Oliveira, da Silva, Falcão, & Gonçalves, 2018). The big data stream clustering task can be defined as the process that continuously maintains a consistent clustering of the data encountered thus far from the stream while using limited amounts of time and memory (Chen, Oliverio, Kim, & Shen, 2019; Silva et al., 2013). As mentioned before, the infinite nature of the data imposes several challenges and the need to process it in real time. Thus, incremental clustering algorithms are needed to maintain the evolving cluster structures. Moreover, due to the dynamics of the data stream, new clusters might appear, others might disappear, and some clusters can move in the instance space.
A recent study (Chen et al., 2019) presents the clustering categories and related methods while discussing the pros and cons of each. However, the authors reviewed methods applicable to big data in general and did not examine them in the streaming context.
The main approaches to clustering streaming data can be summarized as:
• Partitioning clustering organizes a set of instances into partitions, in such a way that each partition represents a cluster. These clusters are formed by minimizing some objective function, such as the sum of squared distances to the cluster centroids. Examples of well-known algorithms include k-means (Farnstrom, Lewis, & Elkan, 2000) and k-medoids (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003).
• Micro-cluster-based clustering divides the process into two principal phases: (i) the online phase summarizes the data stream in micro-clusters; and (ii) the offline phase builds a general cluster model using the local micro-clusters. The most representative streaming algorithms are CluStream (Aggarwal et al., 2003) and ClusTree (Kranen, Assent, Baldauf, & Seidl, 2011).
• Density-based clustering is based on the idea that a cluster should be built around instances with a significant number of points in their neighborhoods, that is, in dense areas of the instance space. The DBSCAN algorithm (Sander, Ester, Kriegel, & Xu, 1998) is the most representative density-based offline algorithm. Cao, Ester, Qian, and Zhou (2006) presented Den-Stream, a streaming density-based algorithm that extends the main concepts of DBSCAN to the streaming setting by using micro-clusters to compute summary statistics online. It uses a fading window model, where the weight of each data point decreases exponentially using the function f(t) = 2^(−λt), with decay rate λ > 0.
• Hierarchical clustering, also known as connectivity-based clustering, is mainly composed of agglomerative and divisive methods. Agglomerative clustering refers to the "bottom-up" methods, where each instance starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. On the other hand, divisive clustering is a "top-down" method that starts from the top, where all instances are grouped in one cluster, and splits into different clusters recursively as one moves down the hierarchy. However, in the stream setting, instances arrive one by one and are not all available at once. Therefore, hierarchical clustering for streams was introduced with the Online Divisive-Agglomerative Clustering (ODAC) system (Rodrigues, Gama, & Pedroso, 2008), which continuously maintains a hierarchical clustering structure while monitoring the diameter of the clusters. The hierarchy grows when more information is available, allowing a more detailed cluster structure. When a change in the correlation structure of the process that generates the data is detected, the hierarchy contracts by merging the clusters where the change was detected.
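The exponential fading used by Den-Stream above can be illustrated with a toy micro-cluster whose weight decays by f(t) = 2^(−λt) between updates; this is a simplified sketch, not the actual Den-Stream implementation.

```python
class FadingMicroCluster:
    """Toy micro-cluster whose weight decays as f(t) = 2 ** (-lam * t),
    so old points gradually lose influence (Den-Stream-style fading)."""

    def __init__(self, lam=0.25):
        self.lam = lam
        self.weight = 0.0
        self.last_update = 0.0

    def weight_at(self, now):
        # Weight faded from the last update time to `now`.
        return self.weight * 2.0 ** (-self.lam * (now - self.last_update))

    def insert(self, now):
        # Fade the accumulated weight, then count the newly absorbed point.
        self.weight = self.weight_at(now) + 1.0
        self.last_update = now

mc = FadingMicroCluster(lam=1.0)
mc.insert(0.0)   # one point at t = 0: weight is 1
```

With λ = 1, the weight of that point halves after one time unit, so micro-clusters that stop receiving points eventually become negligible and can be pruned.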
Distributed algorithms for clustering have been developed in the context of sensor networks. There are two main perspectives: (i) a cluster is a group of data points (Gama, Rodrigues, & Lopes, 2011); and (ii) a cluster is a group of sensors, as in Rodrigues et al. (2018).
There is no consensus on the evaluation of clustering algorithms. Kremer et al. presented the Cluster Mapping Measure, which accounts for multiple types of errors by considering the characteristics of evolving data. Spiliopoulou, Ntoutsi, Theodoridis, and Schult (2006) introduced the MONIC system, which aims to detect and track changes in clusters by assuming that a cluster represents an instance in some geometric space. MONIC works by encompassing changes that involve more than one cluster, enabling insights on cluster change in the entire clustering. The transition tracking mechanism depends on the degree of overlap between two clusters. The notion of overlapping
between any two clusters, C1 and C2, can be defined as the number of common instances weighted by the age of the records. Suppose that the clusters C1 and C2 are obtained at time instances t1 and t2, respectively. The degree of overlap between these two clusters is then computed as:

overlap(C1, C2) = ( Σ_{a ∈ C1 ∩ C2} age(a, t2) ) / ( Σ_{x ∈ C1} age(x, t2) ).
The latter permits the deduction of properties concerning the underlying structure of the data stream. A cluster transition at a given time instance is defined as a change in a cluster discovered at an earlier instance. The MONIC system considers internal and external transitions that reflect the dynamics and changes in the stream, such as: a cluster survives (it is not going to disappear); a cluster is absorbed (by another cluster); a cluster disappears (is totally removed); a cluster emerges (a new cluster is created). Tracking cluster evolution on panel and longitudinal data appears in Oliveira and Gama (2012).
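The overlap formula above can be computed directly; the clusters and age weights below are hypothetical, chosen only to exercise the definition.

```python
def overlap(c1, c2, age):
    """MONIC-style overlap: common instances of c1 and c2, weighted by age,
    normalized by the total age-weight of c1. `age` maps instance -> weight."""
    common = c1 & c2
    denom = sum(age[x] for x in c1)
    return sum(age[a] for a in common) / denom if denom else 0.0

# Hypothetical clusters found at t1 and t2, with newer records weighted higher.
c_old = {"a", "b", "c", "d"}
c_new = {"b", "c", "e"}
age_weight = {"a": 0.5, "b": 1.0, "c": 1.0, "d": 0.5, "e": 1.0}
```

Here overlap(c_old, c_new, age_weight) is (1 + 1) / (0.5 + 1 + 1 + 0.5) = 2/3, which MONIC would read as c_old largely surviving into c_new.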
5|FREQUENT PATTERNS
Frequent pattern mining is an important unsupervised learning task that can be employed to determine the structure of the data, to discover association rules, or to find discriminative attributes that can be exploited for classification or clustering tasks. Examples of pattern classes are itemsets, trees, graphs, and sequences (Bifet & Gavaldà, 2008).
The frequent pattern mining problem is stated as follows: given a batch dataset or a data stream that encompasses patterns, and a threshold σ, the task consists in finding all the patterns that appear as a subpattern in a fraction σ of the patterns in the data. For instance, if the input data is a stream of purchases in a supermarket and σ = 10%, we would call {cheese, wine} a frequent pattern if at least 10% of the purchases include, among other products, both cheese and wine. Another example involves graphs, where a triangle can be considered a graph pattern. Given a dataset of graphs, this pattern would be frequent if at least a fraction σ of the graphs contain at least one triangle.
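The support threshold can be illustrated with a naive, offline count over a toy purchase log; this is a direct reading of the definition, not a stream mining algorithm.

```python
from itertools import combinations

def frequent_itemsets(transactions, sigma, max_size=2):
    """Return every itemset (up to max_size items) appearing in at least a
    fraction sigma of the transactions -- a naive offline illustration."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    n = len(transactions)
    return {s for s, c in counts.items() if c / n >= sigma}

purchases = [("cheese", "wine", "bread"),
             ("cheese", "wine"),
             ("bread",),
             ("cheese", "wine", "milk"),
             ("milk",)]
frequent = frequent_itemsets(purchases, sigma=0.6)
```

{cheese, wine} appears in 3 of 5 purchases (support 0.6), so it is frequent at σ = 60%, whereas {milk} (support 0.4) is not. A stream algorithm must approximate these counts in a single pass.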
In the offline setting, Apriori, Eclat, and FP-growth are well-known algorithms for discovering frequent itemsets in datasets. Similar approaches for other data structures, for example sequences and graphs, can be found in the literature. Nevertheless, adapting these algorithms to the stream setting is not an easy task because they violate the single-pass requirement and maintain too much information.
To cope with the aforementioned issues, stream approaches for frequent pattern mining have been proposed that use a batch miner as a base, leading to approximate rather than exact results; thus, other purely online ideas still need to be explored. Examples of algorithms for frequent itemset mining in data streams are Moment (Chi, Wang, Yu, & Muntz, 2006) and IncMine (Cheng, Ke, & Ng, 2008).
6|TOWARDS AUTOML
Machine learning has benefited from tremendous research progress recently in many application areas, particularly in the stream setting. The growing number of machine learning algorithms and their respective hyperparameters gives rise to a large number of possible configurations, whose selection traditionally relies on qualified experts (i.e., human intervention and expertise).
There is no doubt that current algorithms (some of which are mentioned in the previous sections) are suitable for data streams, but they usually require the configuration to be set in advance (e.g., the ensemble size for ensemble-based methods, or the number of neighbors k for the kNN algorithm). Moreover, stream approaches are not fully automated; that is, the parameterization set at the beginning may not hold for all parts of the stream, since models may change over time because of concept drifts. How, then, can this issue be addressed?
Automated Machine Learning (AutoML) (Hutter, Kotthoff, & Vanschoren, 2019) is a new tool, receiving increased attention, that aims to tackle the parameter configuration issue using automatic monitoring models. Multiple systems^4 have been proposed in the offline setting that allow hyperparameter tuning by combining AutoML with well-known machine learning software, such as Weka and Scikit-learn (AutoWeka^5 and AutoSklearn,^6 respectively) (Feurer et al., 2015).
Yet, only a very limited number of contributions on AutoML for evolving data streams exist in the literature. For instance, Self Parameter Tuning (SPT) is an automated technique that controls the stream algorithm configuration by incrementally selecting the best hyperparameter(s), which may change over time (Veloso et al., 2018).
We consider that automatic algorithm configuration for data stream mining can be revolutionary. In fact, selecting
the best hyperparameter configuration for stream algorithms is a tedious task because it may change depending on the
characteristics (e.g., number of attributes) and contents of data. Hence, tuning incremental algorithms' configurations
automatically and continuously is a very promising direction in machine learning for data streams.
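A drastically simplified illustration of the idea behind such self-tuning (not the actual SPT algorithm): run one model per candidate hyperparameter value, score each one prequentially, and answer with the current best. All classes and values here are toy assumptions.

```python
class ThresholdModel:
    """Trivial 1-D classifier: predict 1 iff x >= threshold. It stands in
    for any stream learner exposing predict/partial_fit."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, x):
        return 1 if x >= self.threshold else 0
    def partial_fit(self, x, y):
        pass  # static toy model; a real stream learner would update here

class OnlineTuner:
    """Keeps one model per candidate configuration, scores each with a
    test-then-train hit count, and predicts with the best one so far."""
    def __init__(self, make_model, candidates):
        self.models = {c: make_model(c) for c in candidates}
        self.hits = {c: 0 for c in candidates}
    def learn_one(self, x, y):
        for c, m in self.models.items():
            if m.predict(x) == y:      # test first ...
                self.hits[c] += 1
            m.partial_fit(x, y)        # ... then train
    def best(self):
        return max(self.hits, key=self.hits.get)
    def predict(self, x):
        return self.models[self.best()].predict(x)

# Stream whose true decision boundary is x = 0.5.
tuner = OnlineTuner(ThresholdModel, candidates=[0.1, 0.5, 0.9])
for x in [0.0, 0.2, 0.4, 0.6, 0.8] * 10:
    tuner.learn_one(x, 1 if x >= 0.5 else 0)
```

Because the scores are accumulated over the stream, the selected configuration can change if a concept drift makes another candidate start winning.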
7|EVALUATION METRICS
In the stream setting, two main, strongly related evaluation axes are used to assess the efficiency of algorithms alongside their quality (e.g., accuracy for classification): (i) the execution time, which includes any preprocessing (such as dimension reduction), prediction, and learning steps; and (ii) the memory used by an algorithm, which comprises the storage needed to keep the current model(s) together with the statistical information required for incremental processing (e.g., the number of instances received so far from the stream).
In the context of supervised learning, it is important to evaluate the trained model and test its applicability in different scenarios on different data streams. The prequential evaluation (Dawid, 1984), also called interleaved test-then-train, is the most used evaluation method proposed specifically to assess the performance of data stream algorithms incrementally. This evaluation scheme consists in using each instance first to test (make a prediction with) the current model, and thereafter to update (train) the model. Another important evaluation approach is the holdout evaluation, which uses separate test and training datasets. Bifet et al. (2015) propose an evaluation methodology for big data streams that addresses different scenarios, including unbalanced data and data where change occurs on different time scales. Most notably, Bifet et al. (2015) introduce adaptations of cross-validation to the streaming setting.
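The test-then-train loop can be sketched as follows, using a trivial majority-class learner as a stand-in for any stream model; both pieces are illustrative.

```python
def prequential_accuracy(model, stream):
    """Interleaved test-then-train: each arriving instance is first used to
    test the current model, then immediately used to train it."""
    correct, n = 0, 0
    for x, y in stream:
        if model.predict(x) == y:   # 1) test on the incoming instance
            correct += 1
        model.partial_fit(x, y)     # 2) then train on the same instance
        n += 1
    return correct / n

class MajorityClass:
    """Baseline learner: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

acc = prequential_accuracy(MajorityClass(), [(0, "a"), (1, "a"), (2, "b"), (3, "a")])
```

Every instance contributes to both testing and training, so no held-out set is needed and the whole stream is used for evaluation.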
Several metrics exist to measure the performance of classification algorithms, and most of them are easily applied to data stream classification. Accuracy is an intuitive metric that measures the percentage of correctly classified instances with respect to all predictions made. If the distribution of examples across the class labels is imbalanced, accuracy can be misleading, as a model that always predicts the majority class will yield high accuracy. In such cases, metrics such as sensitivity, specificity, and the g-mean are better alternatives.
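A small illustration of why accuracy misleads under class imbalance, using the g-mean (the geometric mean of sensitivity and specificity); the data are synthetic.

```python
def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity; it stays low when the
    model neglects the minority class, unlike plain accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return (sensitivity * specificity) ** 0.5

# A majority-class predictor on a 90/10 imbalanced stream:
y_true = [0] * 9 + [1]
y_pred = [0] * 10
```

Here accuracy is 0.9, yet the g-mean is 0.0 because the single minority instance is never detected.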
The most common metrics used to evaluate regression predictions are (i) the root mean squared error (RMSE), the square root of the average squared difference between the target value and the value predicted by the model; and (ii) the mean absolute error (MAE), the average absolute difference between the target value and the value predicted by the model.
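Both metrics can be maintained incrementally from running sums, which suits the streaming constraint of not storing past predictions; this is a minimal sketch.

```python
import math

class RegressionMetrics:
    """Incrementally maintained MAE and RMSE, updated one prediction at a
    time as required in a streaming evaluation (no predictions are stored)."""

    def __init__(self):
        self.n = 0
        self.abs_sum = 0.0
        self.sq_sum = 0.0

    def update(self, y_true, y_pred):
        err = y_pred - y_true
        self.n += 1
        self.abs_sum += abs(err)
        self.sq_sum += err * err

    @property
    def mae(self):
        return self.abs_sum / self.n

    @property
    def rmse(self):
        return math.sqrt(self.sq_sum / self.n)

m = RegressionMetrics()
for yt, yp in [(1.0, 2.0), (3.0, 3.0), (5.0, 3.0)]:
    m.update(yt, yp)
```

Only three running scalars are kept, regardless of how many instances the stream has produced.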
Another topic of interest is the evaluation of data streams when there is a non-negligible delay between the arrival of an instance and that of its corresponding label. Grzenda, Gomes, and Bifet (2019) claim that besides “how” predictions affect the predictive performance, it is also essential to consider “when” labels are made available as part of the evaluation. This leads to the concept of continuous re-evaluation, introduced in Grzenda et al. (2019) and further explored in Grzenda, Gomes, and Bifet (2020). The goal of continuous re-evaluation is to observe if, and how fast, models can transform an initially incorrect prediction into a correct one before the true label arrives in a streaming setting.
Finally, a multitude of evaluation measures (also called validation measures) have been proposed to assess the quality of the resulting clusterings. We direct the reader to Kremer et al. (2011), which discusses and compares these measures in extensive experiments.
8|OPEN SOURCE SOFTWARE
Multiple frameworks for data stream mining have been proposed in the literature. The set of available open source software contains a multitude of state-of-the-art algorithms that can be extended to propose new approaches and/or to compare against. In the following, we cite some widely used software with active, growing communities, along with newer tools. Massive online analysis (MOA): MOA^7 (Bifet et al., 2010) is the most popular open source framework for machine learning on evolving data streams. It is written in Java, implemented on top of the Waikato Environment for Knowledge Analysis (WEKA),^8 and has a very active research community. MOA provides different data generators (e.g., the SEA and LED generators), stream mining algorithms (e.g., algorithms for classification, clustering, regression, and anomaly detection), evaluation methods (e.g., the prequential evaluation), and statistics to evaluate the performance of algorithms (e.g., memory, time, accuracy, kappa). The software can be used via a command line or a user interface. A recent book (Bifet et al., 2018) discusses MOA and how to use it, along with exercises and lab sessions. Generally, researchers working with MOA make their contributed code available as MOA extensions.^9
Scalable advanced massive online analysis (SAMOA): SAMOA^10 (Morales & Bifet, 2015) is presented as both a library and a framework, written in Java, that combines data stream analysis and distributed processing. SAMOA allows the creation of distributed stream machine learning algorithms and runs them on distributed stream processing engines in a fast and scalable manner. The framework provides a collection of distributed versions of some data stream algorithms (e.g., bagging, boosting).
StreamDM: StreamDM^11 is an open source framework for online machine learning that utilizes Spark Streaming to enable stream processing from a variety of sources. The main advantages of StreamDM are (i) the use of the Spark Streaming API, which enables scalable stream processing and can handle issues such as “out of order data” in data sources; and (ii) the fact that it allows combining batch processing algorithms with streaming algorithms (Bifet et al., 2015).
Scikit-multiflow: Scikit-multiflow^12 (Montiel et al., 2018) is a more recent open source software package designed for multi-label/multi-output and single-output stream learning algorithms, inspired by the popular frameworks scikit-learn (Pedregosa et al., 2011), MOA (Bifet et al., 2010), and MEKA,^13 to fill the void in Python for data stream mining. Similar to MOA, scikit-multiflow also contains stream generators and algorithms for data streams. More recently, scikit-multiflow has merged with the creme framework (https://maxhalford.github.io/) into a new Python project called River. Given the increasing popularity of the Python programming language, the advantage of the scikit-multiflow framework, which complements scikit-learn's focus on batch learning, is its similarity to the latter, which is widely used by researchers and practitioners. Moreover, it can be used within the popular Jupyter Notebook interface, often used by the data science community. On the other hand, a notable drawback is that this software may be slow in comparison with MOA, because Python code is expected to execute more slowly than Java code.
Comparative studies (Behera, Das, Jena, Rath, & Sahoo, 2017; Gomes et al., 2019b; Inoubli, Aridhi, Mezni, Maddouri, & Nguifo, 2018) of stream frameworks, evaluating their performance in terms of resource consumption, have been provided with a focus on distributed stream processing tools such as Apache Samza, Apache Spark, Apache Flink, and Apache Storm.
9|CONCLUSION
The aim of this survey paper is to present a holistic view of data stream mining by reviewing the stream mining challenges and foundations. We also conducted a comprehensive literature review of the baseline algorithms in data stream mining and discussed the most promising and recent ones. Moreover, this survey presents the principal metrics for algorithm evaluation and well-known, actively growing open-source software for the stream environment. We hope that this summary provides the AI community with a clear overview of the main challenges, basics, and recent advances, as well as some open directions in the stream setting.
ACKNOWLEDGEMENTS
This work has been carried out in the frame of a cooperation between Huawei Technologies France SASU and Télécom
Paris (Grant no. YBN2018125164).
CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
AUTHOR CONTRIBUTIONS
Albert Bifet: Conceptualization. Heitor Gomes: Conceptualization. Silviu Maniu: Conceptualization.
DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no data sets were generated or analyzed during the current study.
ENDNOTES
* Email: maroua.bahri@telecom-paris.fr
1 www.statista.com/statistics/976079/number-of-iot-connected-objects-worldwide-by-type/
2 In the sequel, we use the terms streaming, online, or incremental interchangeably.
3 The offline Bagging applies resampling, i.e., sampling with replacement, to train its ensemble members on different subsets of instances.
4 https://www.automl.org/automl/
5 https://www.automl.org/automl/autoweka/
6 https://www.automl.org/automl/auto-sklearn/
7 https://moa.cms.waikato.ac.nz/
8 https://www.cs.waikato.ac.nz/ml/weka/
9 https://moa.cms.waikato.ac.nz/moa-extensions/
10 http://samoa.incubator.apache.org
11 http://huawei-noah.github.io/streamDM/
12 https://scikit-multiflow.github.io
13 The MEKA project provides algorithms for multi-label learning and evaluation.
REFERENCES
Abdulsalam H, Skillicorn DB, Martin P. Streaming random forests. In: International Database Engineering and Applications Symposium
(IDEAS). Banff, Canada: IEEE; 2007, 225–232.
Aggarwal CC. Data streams: models and algorithms, vol. 31. New York: Springer Science & Business Media; 2007.
Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: International Conference on Very Large Data
Bases. VLDB Endowment; 2003, 81–92.
Aggarwal CC, Philip SY. A survey of synopsis construction in data streams. In: Data streams. Springer; 2007, 169–207.
Amini A, Wah TY, Saboohi H. On density-based data streams clustering algorithms: A survey. Journal of Computer Science and Technology
2014, 29:116–141.
Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: ACM SIGMOD. New York: ACM;
2002, 1–16.
Bahri, M. (2020). Improving IoT data stream analytics using summarization techniques (Ph.D. thesis). Institut Polytechnique de Paris.
Bahri M, Bifet A, Maniu S, Gomes HM. Survey on feature transformation techniques for data streams. In: International Joint Conference on
Artificial Intelligence (IJCAI). 2020. Yokohama.
Bahri M, Maniu S, Bifet A. Sketch-based naive bayes algorithms for evolving data streams. In: International Conference on Big Data. Seattle:
IEEE; 2018, 604–613.
Bahri M, Pfahringer B, Bifet A, Maniu S. Efficient batch-incremental classification for evolving data streams. In: Intelligent Data Analysis
(IDA). Konstanz: Springer; 2020.
Behera RK, Das S, Jena M, Rath SK, Sahoo B. A comparative study of distributed tools for analyzing streaming data. In: International Confer-
ence on Information Technology (ICIT). Toronto: IEEE; 2017, 79–84.
Besedin A, Blanchart P, Crucianu M, Ferecatu M. Evolutive deep models for online learning on data streams with no storage; 2017.
Beygelzimer A, Kale S, Luo H. Optimal and adaptive algorithms for online boosting. In: International Conference on Machine Learning
(ICML). 2015, 2323–2331. Lille.
Bifet A, de Francisci Morales G, Read J, Holmes G, Pfahringer B. Efficient online evaluation of big data stream classifiers. In: Proceedings of
the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 2015, 59–68. Sydney.
Bifet A, Gavalda R. Learning from time-changing data with adaptive windowing. In: International Conference on Data Mining (ICDM).
SIAM; 2007, 443–448.
Bifet A, Gavaldà R. Mining adaptively frequent closed unlabeled rooted trees in data streams. In: ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. 2008, 34–42. Las Vegas, Nevada, USA. https://doi.org/10.1145/1401890.1401900.
Bifet A, Gavaldà R. Adaptive learning from evolving data streams. In: Intelligent Data Analysis (IDA). Lyon: Springer; 2009, 249–260.
Bifet A, Gavaldà R, Holmes G, Pfahringer B. Machine learning for data streams: With practical examples in MOA. MIT Press; 2018.
Bifet A, Holmes G, Kirkby R, Pfahringer B. Moa: Massive online analysis. Journal of Machine Learning Research (JMLR) 2010, 11:1601–1604.
Bifet A, Holmes G, Pfahringer B. Leveraging bagging for evolving data streams. In: Joint European conference on machine learning and
knowledge discovery in databases. Barcelona: Springer; 2010, 135–150.
Bifet A, Holmes G, Pfahringer B, Frank E. Fast perceptron decision tree learning from evolving data streams. In: Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD). Springer; 2010, 299–310.
Bifet A, Kirkby R. Data stream mining: A practical approach; 2009.
Bifet A, Maniu S, Qian J, Tian G, He C, Fan W. Streamdm: Advanced data mining in spark streaming. In: International Conference on Data
Mining Workshop (ICDMW). Atlantic City: IEEE; 2015, 1608–1611.
Bifet A, Pfahringer B, Read J, Holmes G. Efficient data stream classification via probabilistic adaptive windows. In: Symposium On Applied
Computing (SIGAPP). Coimbra: ACM; 2013, 801–806.
Breiman L. Bagging predictors. Machine Learning 1996, 24:123–140.
Breiman L. Random forests. Machine Learning 2001, 45:5–32.
Caiming Z, Yong C. A review of research relevant to the emerging industry trends: Industry 4.0, iot, blockchain, and business analytics. Jour-
nal of Industrial Integration and Management 2020, 5:165–180.
Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In: Ghosh J, Lambert D, Skillicorn DB,
Srivastava J, eds. Sixth SIAM International Conference on Data Mining. Bethesda, MD, USA: SIAM; 2006, 328–339.
Carnein M, Trautmann H. Optimizing data stream representation: An extensive survey on stream clustering algorithms. Business & Informa-
tion Systems Engineering 2019, 61:277–297.
Chang L-C, Chen P-A, Chang F-J. Reinforced two-step-ahead weight adjustment technique for online training of recurrent neural networks.
IEEE Transactions on Neural Networks and Learning Systems 2012, 23:1269–1278.
Chen S-T, Lin H-T, Lu C-J. An online boosting algorithm with theoretical justifications. In: International Conference on Machine Learning
(ICML). 2012. Edinburgh.
Chen W, Oliverio J, Kim JH, Shen J. The modeling and simulation of data clustering algorithms in data mining with big data. Journal of
Industrial Integration and Management 2019, 4:1850017.
Chen X-W, Lin X. Big data deep learning: challenges and perspectives. IEEE Access 2014, 2:514–525.
Cheng J, Ke Y, Ng W. Maintaining frequent closed itemsets over a sliding window. Journal of Intelligent Information Systems 2008, 31:
191–215 https://doi.org/10.1007/s10844-007-0042-3.
Chi Y, Wang H, Yu PS, Muntz RR. Catch the moment: Maintaining closed frequent itemsets over a data stream sliding window. Knowledge
and Information Systems 2006, 10:265–294 https://doi.org/10.1007/s10115-006-0003-0.
Cormode G, Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 2005,
55:58–75.
Cortes C, Vapnik V. Support-vector networks. ML 1995, 20:273–297.
Da Xu L, He W, Li S. Internet of things in industries: A survey. IEEE Transactions on Industrial Informatics 2014, 10:2233–2243.
Dawid AP. Present position and potential developments: Some personal views statistical theory the prequential approach. Journal of the
Royal Statistical Society: Series A (General) 1984, 147:278–290.
de Barros RSM, de Carvalho Santos SGT, Júnior PMG. A boosting-like online learning ensemble. In: International Joint Conference on Neural
Networks (IJCNN). Vancouver: IEEE; 2016, 1871–1878.
de Souza Viana TS, de Oliveira M, da Silva TLC, Falcão MSR, Gonçalves EJT. A message classifier based on multinomial naive bayes for online social contexts. Journal of Management Analytics 2018, 5:213–229.
Dietterich TG. Ensemble methods in machine learning. In: International Workshop on Multiple Classifier Systems. Cagliari: Springer;
2000, 1–15.
Domingos P, Hulten G. Mining high-speed data streams. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Boston: ACM; 2000, 71–80.
Farnstrom F, Lewis J, Elkan C. Scalability for clustering algorithms revisited. ACM SIGKDD Explorations Newsletter 2000, 2:51–57.
Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F. Efficient and robust automated machine learning. Advances in Neu-
ral Information Processing Systems 2015, 28:2962–2970.
Freund Y, Schapire RE. A desicion-theoretic generalization of on-line learning and an application to boosting. In: European Conference on
Computational Learning Theory. Barcelona: Springer; 1995, 23–37.
Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997, 29:131–163.
Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: A review. ACM SIGMOD Record 2005, 34:18–26.
Gama J. A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence 2012, 1:45–55.
Gama J, Gaber MM. Learning from data streams: Processing techniques in sensor networks. Springer; 2007.
Gama J, Rocha R, Medas P. Accurate decision trees for mining high-speed data streams. In: ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. Washington DC: ACM; 2003, 523–528.
Gama J, Rodrigues PP, Lopes LMB. Clustering distributed sensor data streams using local processing and reduced communication. Intelligent
Data Analysis 2011, 15:3–28.
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A. A survey on concept drift adaptation. Computing Surveys (CSUR) 2014, 46:44.
Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look (a tutorial). In: ACM SIGMOD International
Conference on Management of Data. 2002, 635.
Gomes HM, Barddal JP, Enembreck F, Bifet A. A survey on ensemble learning for data stream classification. Computing Surveys (CSUR)
2017, 50:23.
Gomes HM, Barddal JP, Ferreira LEB, Bifet A. Adaptive random forests for data stream regression. In: European Symposium on Artificial
Neural Networks (ESANN). Bruges: Springer; 2018.
Gomes HM, Bifet A, Read J, Barddal JP, Enembreck F, Pfharinger B, Holmes G, Abdessalem T. Adaptive random forests for evolving data
stream classification. Machine Learning 2017, 106:1–27.
Gomes HM, Montiel J, Mastelini SM, Pfahringer B, Bifet A. On ensemble techniques for data stream regression. In: IEEE International Joint
Conference on Neural Networks. Glasgow: IEEE; 2020.
Gomes HM, Read J, Bifet A. Streaming random patches for evolving data stream classification. In: International Conference on Data Mining
(ICDM). Beijing: IEEE; 2019.
Gomes HM, Read J, Bifet A, Barddal JP, Gama J. Machine learning for streaming data: state of the art, challenges, and opportunities. ACM
SIGKDD Explorations Newsletter 2019, 21:6–22.
Grzenda M, Gomes HM, Bifet A. Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery 2019:1–30.
Grzenda M, Gomes HM, Bifet A. Performance measures for evolving predictions under delayed labelling classification. In: International Joint
Conference on Neural Networks (IJCNN). Glasgow: IEEE; 2020, 1–8.
Guha S, Meyerson A, Mishra N, Motwani R, O'Callaghan L. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge
and Data Engineering 2003, 15:515–528.
Hand DJ, Mannila H, Smyth P. Principles of data mining. London: MIT Press; 2001.
Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 1963,
58:13–30.
Holmes G, Kirkby RB, Bainbridge D. Batch-incremental learning for mining data streams. 2004.
Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression, vol. 398. Wiley; 2013.
Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. San Francisco: ACM; 2001, 97–106.
Hutter F, Kotthoff L, Vanschoren J. Automated machine learning. Springer; 2019.
Ikonomovska E, Gama J, Džeroski S. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery 2011, 23:
128–168.
Ikonomovska E, Gama J, Džeroski S. Online tree-based ensembles and option trees for regression on evolving data streams. Neurocomputing
2015, 150:458–470.
Inoubli W, Aridhi S, Mezni H, Maddouri M, Nguifo E. A comparative study on streaming frameworks for big data. In: Very Large Data Bases
(VLDB). Rio de Janeiro: Springer; 2018.
Jain LC, Seera M, Lim CP, Balasubramaniam P. A review of online learning in supervised neural networks. Neural Computing and Applica-
tions 2014, 25:491–509.
Jankowski D, Jackowski K, Cyganek B. Learning decision trees from data streams with concept drift. Procedia Computer Science 2016, 80:
1682–1691.
Kim JH. Integrating IoT with LQR-PID controller for online surveillance and control of flow and pressure in fluid transportation system.
Journal of Industrial Integration and Management 2017, 17:100–127.
Klawonn F, Angelov P. Evolving extended naive Bayes classifiers. In: International Conference on Data Mining Workshops. Hong Kong: IEEE;
2006, 643–647.
Kokate U, Deshpande A, Mahalle P, Patil P. Data stream clustering techniques, applications, and models: comparative analysis and discus-
sion. Big Data and Cognitive Computing 2018, 2:32.
Kolajo T, Daramola O, Adebiyi A. Big data stream analysis: a systematic literature review. Journal of Big Data 2019, 6:47.
Kranen P, Assent I, Baldauf C, Seidl T. The ClusTree: Indexing micro-clusters for anytime stream mining. Knowledge and Information Sys-
tems 2011, 29:249–272.
Kremer H, Kranen P, Jansen T, Seidl T, Bifet A, Holmes G, Pfahringer B. An effective evaluation measure for clustering on evolving data
streams. In: Apté C, Ghosh J, Smyth P, eds. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego:
ACM; 2011, 868–876.
Kuncheva LI. Combining pattern classifiers: methods and algorithms. Canada: John Wiley & Sons; 2014.
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current
trends on using data intrinsic characteristics. Information Sciences 2013, 250:113–141.
Losing V, Hammer B, Wersing H. KNN classifier with self-adjusting memory for heterogeneous concept drift. In: International Conference on
Data Mining (ICDM). Barcelona: IEEE; 2016, 291–300.
Losing V, Hammer B, Wersing H. Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing
2018, 275:1261–1274.
Manapragada C, Webb GI, Salehi M. Extremely fast decision tree. In: ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining. London: ACM; 2018, 1953–1962.
Manku GS, Motwani R. Approximate frequency counts over data streams. In: Very Large Data Bases (VLDB). Hong Kong: Elsevier; 2002,
346–357.
Marrón D, Read J, Bifet A, Navarro N. Data stream classification using random feature functions and novel method combinations. Journal of
Systems and Software 2017, 127:195–204.
McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In: AAAI Workshop on Learning for Text Categoriza-
tion. 1998, 41–48.
McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint
arXiv:1802.03426; 2018.
Montiel J, Read J, Bifet A, Abdessalem T. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research
(JMLR) 2018, 19:1–5.
Morales GDF, Bifet A. SAMOA: Scalable advanced massive online analysis. Journal of Machine Learning Research (JMLR) 2015, 16:149–153.
Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In: Transactions on Large-Scale Data- and Knowledge-Centered
Systems II. Berlin: Springer; 2010, 1–30.
Nguyen H-L, Woon Y-K, Ng W-K. A survey on data stream clustering and classification. Knowledge and Information Systems 2015, 45:
535–569.
Oliveira MDB, Gama J. A framework to monitor clusters evolution applied to economy and finance problems. Intelligent Data Analysis 2012,
16:93–111.
Oza NC. Online bagging and boosting. In: International Conference on Systems, Man and Cybernetics, vol. 3. Waikoloa, Hawaii: IEEE; 2005,
2340–2345.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn:
Machine learning in Python. Journal of Machine Learning Research (JMLR) 2011, 12:2825–2830.
Pratama M, Angelov PP, Lu J, Lughofer E, Seera M, Lim CP. A randomized neural network for data streams. In: International Joint Confer-
ence on Neural Networks (IJCNN). Anchorage: IEEE; 2017, 3423–3430.
Read J, Bifet A, Pfahringer B, Holmes G. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In: Intelli-
gent Data Analysis (IDA). Helsinki: Springer; 2012, 313–323.
Read J, Perez-Cruz F, Bifet A. Deep learning in partially-labeled data streams. In: Annual ACM Symposium on Applied Computing. Salamanca:
ACM; 2015, 954–959.
Rodrigues PP, Araújo J, Gama J, Lopes LMB. A local algorithm to approximate the global clustering of streams generated in ubiquitous sen-
sor networks. International Journal of Distributed Sensor Networks 2018, 14.
Rodrigues PP, Gama J, Pedroso JP. Hierarchical clustering of time-series data streams. IEEE Transactions on Knowledge and Data Engineer-
ing 2008, 20:615–627 https://doi.org/10.1109/TKDE.2007.190727.
Rutkowski L, Jaworski M, Duda P. Probabilistic neural networks for the streaming data classification. In: Stream Data Mining: Algorithms
and Their Probabilistic Properties. Springer; 2020, 245–277.
Sander J, Ester M, Kriegel H, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min-
ing and Knowledge Discovery 1998, 2:169–194 https://doi.org/10.1023/A:1009745219419.
Silva JA, Faria ER, Barros RC, Hruschka ER, de Leon Ferreira de Carvalho ACP, Gama J. Data stream clustering: A survey. ACM Computing
Surveys 2013, 46:13:1–13:31.
Sorzano COS, Vargas J, Montano AP. A survey of dimensionality reduction techniques. arXiv preprint arXiv:1403.2877; 2014.
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R. Monic: modeling and monitoring cluster transitions. In: Proceedings ACM International
Conference on Knowledge Discovery and Data Mining. Philadelphia: ACM Press; 2006, 706–711.
Veloso B, Gama J, Malheiro B. Self hyper-parameter tuning for data streams. In: Data Streams. Split: Springer; 2018.
Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: International Conference
on Machine Learning (ICML). Montreal: ACM; 2009, 1113–1120.
Zhou SM, Da Xu L. A new type of recurrent fuzzy neural network for modeling dynamic systems. Knowledge-Based Systems 2001, 14:
243–251.
How to cite this article: Bahri M, Bifet A, Gama J, Gomes HM, Maniu S. Data stream analysis: Foundations,
major tasks and tools. WIREs Data Mining Knowl Discov. 2021;e1405. https://doi.org/10.1002/widm.1405