OVERVIEW

Data stream analysis: Foundations, major tasks and tools

Maroua Bahri¹ | Albert Bifet¹,² | João Gama³ | Heitor Murilo Gomes² | Silviu Maniu⁴

¹ LTCI, Télécom Paris, IP-Paris, Palaiseau, France
² Department of Computer Science, University of Waikato, Hamilton, New Zealand
³ INESC TEC, University of Porto, Porto, Portugal
⁴ LRI, Université Paris-Saclay, Orsay, France

Correspondence to: Maroua Bahri, LTCI, Télécom Paris, IP-Paris, Palaiseau, France. maroua.bahri@telecom-paris.fr

Funding information: Huawei Technologies France SASU and Télécom Paris, Grant/Award Number: YBN2018125164

Edited by: Sushmita Mitra, Associate Editor, and Witold Pedrycz, Editor-in-Chief
Abstract
The significant growth of interconnected Internet-of-Things (IoT) devices, the
use of social networks, along with the evolution of technology in different
domains, lead to a rise in the volume of data generated continuously from
multiple systems. Valuable information can be derived from these evolving
data streams by applying machine learning. In practice, several critical issues
emerge when extracting useful knowledge from these potentially infinite data, mainly because of their evolving nature and high arrival rate, which make it impossible to store them entirely. In this work, we provide a comprehensive survey that discusses the research constraints and the current state of the art in this vibrant framework. Moreover, we present an updated overview of the latest contributions proposed for the different stream mining tasks, particularly classification, regression, clustering, and frequent pattern mining.
This article is categorized under:
Fundamental Concepts of Data and Knowledge > Key Design Issues in Data
Mining
Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining
1 | INTRODUCTION
In recent decades, the world has been invaded by the ubiquity of technology in several sectors of society, such as healthcare, transport, and banking. This digital revolution involves progressively more sensors and systems that continually generate massive amounts of data in an open-ended way, as big data streams. A good example is the large system of interrelated devices and sensors known as the Internet of Things (IoT) (Caiming & Yong, 2020; Da Xu, He, & Li, 2014). The latter has become a key element of life automation, present for instance in cars, cellphones, airplanes, and drones. These devices create huge amounts of data streams that are expected to grow in the near future. By 2019, 26 billion such devices were connected, and this number is expected to increase to almost 80 billion devices in use all over the world by 2025.¹ Therefore, systems and algorithms to handle these vast flows of data must be explored.
Stream² data are defined as “unbounded sequences of multidimensional, sporadic, and transient observations made available along time” (Bahri, 2020). To automatically extract useful information from data streams, we need to consider stream computing, which analyzes data generated at high velocity in real time, as required in big data stream analytics (Kim, 2017). Hence, stream mining tasks have become crucial in multiple real-world applications, for example, social networks and spam email filters, that demand real-time (or near real-time) analysis since the data that they generate are drawn from evolving distributions.

Received: 15 August 2020 | Revised: 13 January 2021 | Accepted: 1 February 2021
DOI: 10.1002/widm.1405
WIREs Data Mining Knowl Discov. 2021;e1405. wires.wiley.com/dmkd © 2021 Wiley Periodicals LLC.
https://doi.org/10.1002/widm.1405
Mining data streams has attracted many researchers due to the importance of its applications (Aggarwal, 2007; Bifet, Gavaldà, Holmes, & Pfahringer, 2018; Gaber, Zaslavsky, & Krishnaswamy, 2005). Amini, Wah, and Saboohi (2014), Kokate, Deshpande, Mahalle, and Patil (2018), and Carnein and Trautmann (2019) reviewed works on unsupervised learning (clustering), presenting models that are mainly used for density-based clustering. Current incremental clustering algorithms rely mainly on techniques such as density micro-clustering and density grids that require several parameters to be effective (Kokate et al., 2018). On the other hand, diverse works on supervised learning have been proposed, especially in classification, which is perhaps the most commonly researched and active machine learning task.
Our goal in this paper is to provide the artificial intelligence audience with a brief literature overview of the most
important foundations when dealing with big data streams by shedding light on how the research in the corresponding
framework can progress. While several books (Aggarwal, 2007; Gama & Gaber, 2007) and articles (Gaber et al., 2005;
Gama, 2012; Nguyen, Woon, & Ng, 2015) provide an overview of the state-of-the-art in the stream context, many new
and promising algorithms have emerged since then. Also, each of these reviews generally covers only one machine learning task: for instance, Losing, Hammer, and Wersing (2018) study advances in classification, while Gomes et al. (2019b) do not discuss the clustering, regression, and frequent pattern mining tasks. This is a gap that the current paper addresses.
We argue that these most recent advances in the data stream (Bahri et al., 2020; Bahri, Maniu, & Bifet, 2018;
Besedin, Blanchart, Crucianu, & Ferecatu, 2017; Gomes et al., 2017; Gomes et al., 2019; Losing, Hammer, &
Wersing, 2016; Montiel, Read, Bifet, & Abdessalem, 2018; Rodrigues, Araújo, Gama, & Lopes, 2018; Veloso, Gama, &
Malheiro, 2018) make this research area worth revisiting with a more ambitious scope. We first provide the basic concepts of the stream setting while elucidating the challenges and how they could be addressed. Then, we review the progress in the different stream mining tasks, with a particular focus on the most active task, classification, and report reputed and recent approaches and frameworks for data streams.
2 | FOUNDATIONS
The unbounded nature of evolving data streams raises technical and practical limitations that make traditional algorithms fail, because of the high use of resources (such as time and memory) needed to process dynamic data distributions. In this section, we present the fundamental research issues encountered in the streaming framework.
2.1 | Challenges
The following challenges are mostly common across the different data stream mining tasks that will be presented in the
next sections (Aggarwal & Philip, 2007; Gama & Gaber, 2007; Kolajo, Daramola, & Adebiyi, 2019).
•Evolving data streams. To cope with the ever-growing size of data, stream algorithms must address the evolving, high-speed nature and complexity of data, because a stream usually delivers instances quickly. Therefore, stream mining algorithms should be scalable and process recent instances from the stream in a dynamic fashion. Moreover, we need scalable frameworks (Section 8) to handle big data streams by adopting efficient resource management strategies and parallelization.
•Running time. An online algorithm must process incoming observations as rapidly as possible; otherwise, the algorithm will not be efficient enough for applications where rapid processing is required.
•Memory usage. Because massive data streams would require limitless memory to be processed and stored, it is difficult and even impossible to store the entire stream. So, any stream algorithm must be able to operate under restricted memory constraints by storing only small synopses of the processed data and the current model(s).
•High-dimensionality. In some scenarios, streaming data may be high-dimensional, for example, text documents, where distances between instances grow exponentially due to the curse of dimensionality. The latter can potentially impact any algorithm's performance, mainly in terms of time and memory.
•Concept drifts. Since data streams are evolving, the underlying distribution may change at any time, an eventuality known as concept drift. This phenomenon can impact the predictive performance of algorithms because the currently learned model will no longer be representative of the next incoming data. To deal with new trends, learning algorithms use drift detectors to identify changes as soon as they appear. We redirect readers to Gama, Žliobaite, Bifet, Pechenizkiy, and Bouchachia (2014) for a review on concept drift.
•Delayed labeling. Stream mining algorithms mostly suppose that labels are available before the next instance arrives (immediate labeling). However, labels may arrive with a delay, which may be fixed or vary across instances. Thus, several algorithms that rely on concept drift detection will underperform when faced with a non-negligible delay in receiving the labeled data. This was illustrated in Gomes et al. (2017), where the authors present the same experiments using both an immediate labeling setting and a fixed delayed setting.
•Imbalanced labels. When a certain class label appears far more often than the other(s), it is referred to as the majority class; this may impact learning algorithms' performance because they are designed to optimize for generalization, and consequently the minority class may be ignored (López, Fernández, García, Palade, & Herrera, 2013).
The aforementioned challenges are significant across the different data stream mining tasks. To cope with these requirements, approaches should integrate incremental strategies suited to this setting. More challenges that arise in the case of distributed systems, such as integration and heterogeneity, are discussed in Kolajo et al. (2019).
2.2 | Processing
The above-mentioned stream setting requirements can be addressed by using some well-established methods, presented
in the following (Aggarwal & Philip, 2007; Gama & Gaber, 2007):
•Single-pass. Unlike with static datasets, it is no longer possible to analyze the data using several passes during the course of computation, because of its unbounded nature. Taking this constraint into account, algorithms work by processing each instance from the stream only once (or a couple of times) and using it to update the model, or the statistical information about the data, incrementally (instance-incremental algorithms, see Section 3.1). In the case of batch-incremental algorithms, we process a batch (or chunk) of instances at once instead of a single instance.
•Window models. Storing a data stream and scanning it several times is not allowed. In order to capture significant information from these evolving data, different kinds of moving windows have been proposed to continuously store a part of the stream. In the following, we mention the standard ones (Ng & Dash, 2010).
•Sliding window model: The window size is fixed to w and each instance is time-stamped, that is, the most recent instances from the stream are kept inside the window. This moving window slides over the stream while maintaining the same size (Figure 1a).
•Landmark window model: This model starts by predefining an instance as a landmark from which the window grows. Whenever the landmark changes, all the instances are removed from the window and the new ones are maintained starting from the new landmark instance (Figure 1b). The problem with this window model arises when the landmark instance is fixed at the beginning of the flow; in that case, the window will store the whole stream.
•Damped window model: The damped model uses a fading function that periodically updates the weights of instances inside the window. “The key idea consists in assigning a weight to each instance from the stream, which is inversely proportional to its age, that is, assign more weights to the recently arrived data. When the weight of an instance exceeds a given threshold, it will be removed from the model” (Bahri, 2020) (see Figure 1c).

FIGURE 1 Window models: (a) sliding window model; (b) landmark window of size 13; (c) damped window model
Table 1 briefly describes these windows and shows some of their pros and cons. Besides, different approaches have been developed in this manner, where the choice of the window model depends on the application's needs (Ng & Dash, 2010).
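As a concrete illustration, the sliding and damped models above can be sketched in a few lines of Python. These are hypothetical helper classes, not taken from any surveyed system, and the decay rate `lam` is an illustrative choice:

```python
import math
from collections import deque

class SlidingWindow:
    """Fixed-size sliding window: only the w most recent instances survive."""
    def __init__(self, w):
        self.window = deque(maxlen=w)  # fixed capacity w

    def add(self, instance):
        # appending beyond capacity silently evicts the oldest instance
        self.window.append(instance)

    def contents(self):
        return list(self.window)

def fading_weight(age, lam=0.5):
    # damped model: weight decays exponentially with the instance's age,
    # so recently arrived data weigh more than old data
    return math.pow(2.0, -lam * age)

win = SlidingWindow(3)
for x in range(5):           # stream: 0, 1, 2, 3, 4
    win.add(x)
print(win.contents())        # [2, 3, 4], the 3 most recent instances
print(fading_weight(0) > fading_weight(4))   # True
```

In the damped model, an instance would be dropped once its weight falls past a chosen threshold; the function above only shows the decay itself.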
2.3 | Summarization
To cope with resource constraints, brief information and synopses can be constructed from instances to reduce their size for processing, rather than, or in combination with, the aforementioned window models. This can be realized through the selection of some incoming instances or through synopsis construction. In what follows, we present a brief description of some techniques.
•Sampling. Sampling is the simplest way to keep information about data. Since storing unbounded evolving data streams is impractical, it is crucial to sample from the stream to maintain some “representative” instances and store a synopsis of the stream in memory (Babcock, Babu, Datar, Motwani, & Widom, 2002). However, this simplicity comes at a cost: the selected instances may not be representative enough, which can impact the results of mining algorithms.
•Histograms. Histograms are commonly used in the offline fashion where multiple passes are allowed, and their extension to the online setting remains challenging. Garofalakis, Gehrke, and Rastogi (2002) proposed some incremental histogram techniques to handle data streams, which fail in some cases where the data distribution is not uniform.
•Sketches. A sketch is a probabilistic data structure that stores summaries and approximations of data (Manku & Motwani, 2002). Different sketch-based methods exist that construct synopses of data using a limited amount of memory. For instance, the count-min sketch (Cormode & Muthukrishnan, 2005) was presented as a generalization of Bloom filters to approximate the counts of objects with strong theoretical guarantees.
•Micro-clusters. Micro-clusters are a method in stream clustering (Aggarwal, Han, Wang, & Yu, 2003) used to store synopsis information about the instances in the stream and the clusters.
•Grids. Grid-based methods partition the data space into small cubes, called grids, and instances from the data stream
are mapped to them.
•Dimension reduction. Dimension reduction is a well-known preprocessing method to tackle high-dimensional data that may increase the cost of any mining algorithm (Bahri et al., 2020). It consists in mapping high-dimensional instances onto a lower-dimensional representation while preserving the distances between instances (Sorzano, Vargas, & Montano, 2014).
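To make the sketch idea concrete, here is a toy count-min sketch in Python. It follows the standard textbook construction rather than any specific implementation cited above; the width, depth, and per-row salting scheme are illustrative choices:

```python
import random

class CountMinSketch:
    """Toy count-min sketch: d hash rows of w counters; the frequency
    estimate is the minimum counter over the d rows, so it can only
    over-estimate the true count, never under-estimate it."""
    def __init__(self, w=1000, d=4, seed=42):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # one random salt per row to simulate independent hash functions
        self.salts = [rng.getrandbits(32) for _ in range(d)]

    def _index(self, item, row):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._index(item, r)] += count

    def estimate(self, item):
        return min(self.table[r][self._index(item, r)] for r in range(self.d))

cms = CountMinSketch()
for token in ["a", "b", "a", "c", "a"]:
    cms.update(token)
print(cms.estimate("a"))  # at least 3; exactly 3 unless rare collisions
```

A sketch-based Naive Bayes, as discussed in Section 3.1.1, would query such estimated frequencies in place of exact counts.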
3 | SUPERVISED LEARNING
In the following, we assume that the stream S contains an infinite number of instances X_1, …, X_N, …, where N is the number of instances seen so far in the stream and X_i is defined as a vector of a attributes (also called features).
TABLE 1 Window models comparison

Window model     | Definition                                 | Advantages                                                 | Disadvantages
-----------------|--------------------------------------------|------------------------------------------------------------|--------------------------------------
Sliding          | Processes the last received instances      | Suitable when the recent instances are of special interest | Ignores elements from the stream
Landmark         | Processes the entire history of the stream | Suitable for single-pass algorithms                        | Instances have the same significance
Damped (fading)  | Assigns weights to instances               | Suitable when the old instances may affect the results     | Unbounded time window
3.1 | Classification
Classification is a very popular supervised learning task that predicts the class label (or target category) y′ for an unlabeled observation X′ using a model built on labeled training instances (X, y), where y′ = f(X′) (Hand, Mannila, & Smyth, 2001). The goal of a classification algorithm is to accurately predict the class labels of instances.
Classification has commonly been used in the batch setting for static data, where a training set (labeled instances) is used to build a model, and then a test set (unlabeled instances) is used for prediction to evaluate the model. Thus, multiple accesses to the data are allowed in this traditional task. However, traditional (or batch) classifiers are unable to process evolving data streams due to the requirements of the stream framework (Bifet & Kirkby, 2009). Different from the learning and prediction phases of batch classifiers, stream algorithms update their models incrementally in a single pass; moreover, they should work within a limited amount of time to afford real-time processing without delay, and within a limited amount of memory to avoid storing massive quantities of data for the prediction task. Incremental approaches are commonly divided into two main categories (Read, Bifet, Pfahringer, & Holmes, 2012):
•Instance-incremental approaches (Bifet & Gavaldà, 2009; Domingos & Hulten, 2000; Klawonn & Angelov, 2006; Los-
ing et al., 2016) that update the learned models incrementally with each incoming observation once it arrives. More
approaches are provided in the following sections.
•Batch-incremental approaches that receive a batch of data (composed of multiple instances) and use it to update the
models (Bahri et al., 2020; Cortes & Vapnik, 1995; Holmes, Kirkby, & Bainbridge, 2004; Hosmer Jr, Lemeshow, &
Sturdivant, 2013). For instance, Holmes et al. proposed a batch-incremental ensemble method that divides the data
stream into small chunks and builds a global model that combines the models created by all the members (trees) inside the ensemble.
A relevant study by Read et al. compares the instance-incremental and batch-incremental methods on data streams commonly used in the literature, showing the difference in the results obtained by the two categories. They obtain similar results in terms of accuracy, but the instance-incremental approaches use fewer resources.
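The two update styles can be illustrated with a deliberately simple "model", a running mean, updated either per instance or per chunk. This is a hypothetical sketch for illustration, not taken from Read et al.:

```python
class RunningMean:
    """Toy incremental 'model': a running mean over a numeric stream."""
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def learn_one(self, x):
        # instance-incremental update: one instance, one model update
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def learn_batch(self, chunk):
        # batch-incremental update: a whole chunk updates the model at once
        for x in chunk:
            self.learn_one(x)

m = RunningMean()
for x in [1.0, 2.0, 3.0, 4.0]:   # instance-incremental pass over the stream
    m.learn_one(x)
print(m.mean)                    # 2.5
```

Both styles reach the same state here; for real learners, the batch variant trades update latency for the ability to apply chunk-level preprocessing.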
Several offline classifiers have been widely studied for the stream setting and turned out to be inefficient with evolv-
ing data streams. Thus, streaming versions of these classifiers have been implemented in this vibrant environment.
Generally, classification algorithms can be grouped into the five main categories shown in Figure 2: (i) frequency-based; (ii) neighborhood-based; (iii) tree-based; (iv) ensemble-based; and (v) neural network-based algorithms.
3.1.1 | Frequency-based classification
Naive Bayes (NB) (Friedman, Geiger, & Goldszmidt, 1997) is one of the simplest classifiers; it performs prediction (given a test instance) by computing the posterior probability using Bayes' theorem under the assumption that all attributes are independent given the class label. In practice, this naive independence assumption does not always hold, but the classifier achieves surprisingly good results on multiple datasets. NB is an incremental algorithm that needs no adaptation to handle data streams, thanks to its frequency-based scheme that only stores frequencies about instances.
FIGURE 2 Taxonomy of data stream classification with some well-known algorithms: frequency-based (Naive Bayes), neighborhood-based (NN), tree-based (Hoeffding tree), ensembles (Bagging), and neural-based (Perceptron)
In order to reduce the resource usage of the NB algorithm, a sketch-based NB algorithm (Bahri et al., 2018) has been proposed recently, which uses the count-min sketch (Cormode & Muthukrishnan, 2005) as a data structure to compactly store approximate frequencies of the data. To further improve the results of the sketch-based NB approach, Bahri et al. incorporated the hashing trick (Weinberger, Dasgupta, Langford, Smola, & Attenberg, 2009), a dimensionality reduction technique, to efficiently handle high-dimensional data streams incrementally.
The multinomial Naive Bayes (Mccallum & Nigam, 1998) is a specific instance of the NB classifier which uses a
multinomial distribution for each of the attributes. It is suitable for classification with discrete attributes (e.g., word
counts for text classification).
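The frequency-based scheme that makes NB naturally incremental can be sketched as follows, for discrete attributes with add-one smoothing. This is a minimal illustration, not MOA's or any cited implementation:

```python
import math
from collections import defaultdict

class StreamingNB:
    """Incremental Naive Bayes for discrete attributes: training only
    increments counters, so a single pass over the stream suffices."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.attr_counts = defaultdict(int)  # keyed by (class, attr index, value)
        self.total = 0

    def learn_one(self, x, y):
        self.total += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(y, i, v)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for y, cy in self.class_counts.items():
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(cy / self.total)
            for i, v in enumerate(x):
                score += math.log((self.attr_counts[(y, i, v)] + 1) / (cy + 2))
            if score > best_score:
                best, best_score = y, score
        return best

nb = StreamingNB()
nb.learn_one(("sunny", "hot"), "no")
nb.learn_one(("rainy", "cool"), "yes")
nb.learn_one(("sunny", "mild"), "no")
print(nb.predict_one(("sunny", "hot")))   # "no"
```

The sketch-based variant discussed above would replace the exact `attr_counts` dictionary with count-min sketch estimates to bound memory.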
3.1.2 | Neighborhood-based classification
k-Nearest Neighbors (kNN) is a lazy learning algorithm that does not build a model but uses the whole dataset to make predictions. Given a test instance, the kNN algorithm works by computing the distances, using a metric such as the Euclidean distance, between the unlabeled instance and all the others, and finishes by picking the k closest ones. Afterwards, by majority vote, kNN predicts for the test instance the most frequent class label among the k nearest neighbors. Since data streams are unbounded, it is not possible to store all the instances for the prediction phase. To cope with this issue, a stream version of the kNN algorithm has been proposed that uses a limited memory via a sliding window (Figure 1a) to store the most recent instances alongside the older ones already in the window. Nevertheless, the use of resources depends on the window size and the dimensionality of the data. Bifet, Pfahringer, Read, and Holmes (2013) showed that a small window will decrease the accuracy, while a large window will increase the use of resources: it is an accuracy-resources tradeoff.
To reduce the resource usage of kNN with high-dimensional data, Bahri et al. proposed a batch-incremental kNN that incrementally preprocesses the data with uniform manifold approximation and projection (UMAP) (McInnes, Healy, & Melville, 2018), a neighborhood and graph-based technique, to reduce the dimensionality without losing much accuracy. Besides the size of the sliding window, the batch-incremental kNN also has the batch size as an important hyperparameter to set, which poses an accuracy-resources tradeoff as well.
Self-adjusting memory kNN (samkNN) (Losing et al., 2016) is an incremental kNN algorithm that builds an ensemble with models targeting concept drifts in the stream and uses a dual memory: (i) a short-term memory for the current concept; and (ii) a long-term memory to keep track of past concepts.
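A minimal sliding-window kNN along the lines described above might look like this. It is a hypothetical sketch with illustrative values of k and window size, not the MOA implementation:

```python
import math
from collections import deque

class WindowedKNN:
    """kNN over a sliding window: only the w most recent labeled
    instances are kept, bounding memory at the cost of forgetting."""
    def __init__(self, k=3, w=1000):
        self.k = k
        self.window = deque(maxlen=w)   # (features, label) pairs

    def learn_one(self, x, y):
        # 'training' is just storing the instance; old ones fall out
        self.window.append((x, y))

    def predict_one(self, x):
        # k nearest stored instances by Euclidean distance
        nearest = sorted(
            (math.dist(x, xi), yi) for xi, yi in self.window
        )[: self.k]
        labels = [y for _, y in nearest]
        return max(set(labels), key=labels.count)   # majority vote

knn = WindowedKNN(k=3, w=5)
for x, y in [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b")]:
    knn.learn_one(x, y)
print(knn.predict_one((0.5, 0.5)))   # "a"
```

Shrinking `w` here makes the accuracy-resources tradeoff mentioned by Bifet et al. directly observable: less memory, but also less evidence per prediction.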
3.1.3 | Tree-based classification
Several tree-based algorithms have been proposed in the state of the art to handle evolving data streams (Domingos & Hulten, 2000; Gama, Rocha, & Medas, 2003; Jankowski, Jackowski, & Cyganek, 2016). For instance, the very fast decision tree (VFDT) (Domingos & Hulten, 2000), also called the Hoeffding tree (HT) algorithm, is an incremental decision tree learner that uses the Hoeffding bound to choose the best split at each node. However, HT does not use an explicit drift detection mechanism to address the changes that may occur in the data distribution.
To tackle this issue, an improved version of VFDT, the concept-adapting very fast decision tree (CVFDT), has been proposed (Hulten, Spencer, & Domingos, 2001) to adapt to changes by maintaining a sliding window with the most recent instances from the stream. CVFDT increments the counts corresponding to newly arrived instances and decrements the counts of the oldest instances that have fallen out of the moving window. Bifet and Gavaldà proposed an adaptive extension of the VFDT algorithm, called the Hoeffding adaptive tree (HAT), to handle changes in the data distribution using ADaptive WINdowing (ADWIN), a change detector and estimator that monitors the performance of branches of the tree. If a drift is detected, the tree will be updated by replacing old branches with new ones trained on the current concept (Bifet & Gavaldà, 2007). Tree-based algorithms demand more resources, particularly (i) memory as the tree grows, and (ii) time to choose the best split attributes.
Manapragada et al. introduced an incremental decision tree learning algorithm, the extremely fast decision tree (EFDT), that is similar to the Hoeffding tree (or VFDT) algorithm but uses the Hoeffding bound to select a useful split node and to replace it whenever a better alternative split is identified. The EFDT algorithm achieves better predictive performance than the HT algorithm on different data, but the latter is more computationally efficient (EFDT's running time exceeds that of HT). Besides, EFDT assumes a stationary distribution and is not designed as a learner to deal with drifts (Manapragada, Webb, & Salehi, 2018).
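The Hoeffding bound at the heart of these trees is simple to state: after n observations of a random variable with range R, the true mean lies within ε of the observed mean with probability 1 − δ, and a node splits once the gap between the two best candidate attributes exceeds ε. A sketch, with illustrative values of δ:

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that the true mean is within epsilon of the observed
    mean with probability 1 - delta, after n observations of range R."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# For information gain on a binary problem, R = log2(2) = 1.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
# epsilon shrinks as more instances arrive, so the tree can commit to
# a split with high confidence instead of waiting for all the data
```

This is what makes single-pass tree induction possible: the split decision is made on a statistically sufficient prefix of the stream rather than on the full dataset.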
3.1.4 | Ensemble-based classification
In its purest form, an ensemble is a set of individual models whose predictions are combined using majority vote or similar voting strategies. Ensemble learning is a relatively easy, off-the-shelf approach to improve predictive performance without building a single optimized model for the given learning problem. These gains come at the cost of the computational resources used for training the underlying models; while this cost is negligible for batch problems, it poses a serious concern in a streaming setting where algorithms must operate under constrained computational resources.
Several theoretical and empirical studies have demonstrated the benefit of combining various “weak” individual classifiers, which leads to better accuracy than a single classifier (Bifet & Gavaldà, 2009; Breiman, 1996; Dietterich, 2000; Kuncheva, 2014). Besides the predictive performance gains, ensemble-based methods are popular for data stream learning due to their flexibility to be coupled with concept drift detection strategies. Kuncheva (2014) mentioned three main reasons for adopting an ensemble-based method over a single learner: (i) Statistical. Assume that K samples D_k of a training data set D are used to train K classifiers, such that each of them obtains 100% accuracy on the subset D_k used for training. It is possible that the generalization capabilities of these classifiers will differ when they are applied to a test set disjoint from D, and therefore their accuracy will vary. From a statistical point of view, it is safer to use the mean of the individual predictions from these K classifiers instead of using only one of them, since the chance of selecting the classifier with the worst generalization capabilities is eliminated. There is a chance that the ensemble accuracy is worse than that of the best of its members, but at least we avoid selecting the worst classifier. (ii) Computational. Some classifiers may converge to a local maximum. Suppose that the local maxima of the K classifiers are close to the absolute maximum; then there may be a way of combining them into a model even closer to the absolute maximum (the optimal classifier) than any of them is capable of individually. (iii) Representational. The classifier used may not be able to represent the separation surface of the given problem. For example, a single linear classifier is unable to accurately represent nonlinear problems, although the combination of several linear classifiers can approximate a nonlinear separation surface.
In a streaming setting, two other reasons for using ensemble methods are worth noting: (iv) Scalability. In several ensemble methods, such as random forests (Breiman, 2001), the base models can be trained independently. These ensembles are embarrassingly parallel, which allows training them in a way that circumvents the computational resource constraints. (v) Concept drift adaptation. Ensemble methods adapt to drifts rapidly by updating or resetting the under-performing model(s) inside the ensemble (Gomes et al., 2017a). This does not require a complete reset of the whole model.
Many streaming ensemble-based methods (Beygelzimer, Kale, & Luo, 2015; Chen, Lin, & Lu, 2012; de Barros, de Carvalho Santos, & Júnior, 2016; Oza, 2005) are adapted versions of Boosting (Freund & Schapire, 1995) and Bagging³ (Breiman, 1996). Online Bagging (Oza, 2005) is a streaming version of Bagging that uses the Poisson distribution with λ = 1 to resample data, and generates models trained on diverse sets of samples. Unlike offline Bagging, stream Bagging assigns weights to samples using a Poisson distribution, which divides the samples as follows: 37% (sampled value 1) of instances are used for training only once, 26% (sampled value greater than 1) are used for training with repetition, and the remaining 37% (sampled value 0) are not used for training. In order to promote ensemble diversity and use training instances more often, Bifet et al. proposed leveraging bagging (LB), similar to online Bagging, which uses λ = 6 to resample instances. LB deals with concept drifts using ADWIN (Bifet & Gavaldà, 2007) by resetting the worst performing classifier whenever a change is detected.
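The Poisson(λ = 1) resampling behind Online Bagging can be checked empirically. In the full algorithm, each incoming instance would be fed k times to each base model (k = 0 means the model skips it), which approximates bootstrap sampling without storing the stream; the sampler below uses Knuth's algorithm, an illustrative choice:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler, adequate for small lambda."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(0)
counts = {0: 0, 1: 0, "2+": 0}
for _ in range(100000):
    k = poisson(1.0, rng)          # per-instance training weight
    counts[k if k <= 1 else "2+"] += 1
print({key: round(counts[key] / 100000, 2) for key in (0, 1, "2+")})
# roughly 0.37 / 0.37 / 0.26, matching the proportions quoted above
```

Leveraging bagging simply swaps λ = 1 for λ = 6, so almost every instance is used by every base model, usually several times.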
Streaming random forests (SRF) (Abdulsalam, Skillicorn, & Martin, 2007) is an incremental ensemble method that adapts the random forests (RF) algorithm (Breiman, 2001) to the stream framework. It grows binary Hoeffding trees and trains each tree on random samples without replacement by keeping a batch of instances. Adaptive random forests (ARF) (Gomes et al., 2017) is a recent version of the RF algorithm for evolving data streams that uses Hoeffding trees and randomly selects attributes for each node. ARF induces more diversity into the ensemble through the resampling technique used in LB (Bifet, Holmes, & Pfahringer, 2010). Besides, ARF includes one drift detection algorithm per ensemble member, which dictates when to start training a background tree (i.e., when a warning is signaled) and when to replace the current tree by the background tree (i.e., when a drift is signaled). Streaming random patches (SRP) (Gomes et al., 2019) is also a novel ensemble-based method that uses a drift detection mechanism similar to the one presented in ARF and integrates random subspaces (selected globally for each base model) with online Bagging.
3.1.5 | Neural network-based classification
Neural networks (NNs) are another category of models, inspired by the biological neurons that form the nervous system. In recent years, NNs have attracted the attention of the machine learning community and become one of its most active research directions. However, there is a growing demand for the incremental setting to analyze continuously evolving streams. The latter imposes several challenges that need to be handled because of the potentially infinite nature of the data and the real-time processing requirement, which raise memory and time issues (Besedin et al., 2017; Rutkowski, Jaworski, & Duda, 2020).
The Perceptron is the simplest NN, consisting of a single neuron. It is a linear classifier that typically requires several iterations over the training data, which is impossible in the stream learning setup. The Stream Perceptron maintains a set of weights that is updated for each new incoming instance from the stream using Stochastic Gradient Descent (SGD). The main difference with the batch setting is that, instead of doing multiple iterations to improve accuracy, in the stream setting we do only one pass over the data (Bifet et al., 2010c). Pratama et al. (2017) proposed a randomized neural network model that provides a scalable solution for adapting neural models to stream scenarios and is able to process the data one by one (instance-incremental) or in chunks (batch-incremental). In a different line of work, a new type of neural network called the recurrent fuzzy neural network has been proposed to model and capture the dynamical properties of fuzzy dynamical systems (Zhou & Da Xu, 2001). This network is suitable for describing dynamic systems because it can deal with time-varying input or output through its own natural temporal operation. Because of its dynamic nature, the recurrent neural network has been successfully applied to a wide variety of applications, such as speech processing and time series forecasting. However, training a recurrent neural network can be time consuming and thus inappropriate for evolving data streams (Chang, Chen, & Chang, 2012). Jain, Seera, Lim, and Balasubramaniam (2014) reviewed neural networks that work in the streaming environment by replacing epoch learning with one-by-one or mini-batch learning.
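The one-pass, per-instance SGD update described above can be sketched as follows; the `StreamPerceptron` class and the toy stream are illustrative, not taken from any particular library.

```python
import numpy as np

class StreamPerceptron:
    """Single-neuron linear classifier updated one instance at a time via SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b >= 0 else 0

    def partial_fit(self, x, y):
        # Single SGD step: the weights move only when the prediction is wrong.
        error = y - self.predict(x)
        self.w += self.lr * error * np.asarray(x, dtype=float)
        self.b += self.lr * error

# Simulate a stream: learn the (linearly separable) OR function one instance at a time.
model = StreamPerceptron(n_features=2)
stream = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)] * 20
for x, y in stream:
    model.partial_fit(x, y)
```

Each instance is seen exactly once, in arrival order, which is the key contrast with the batch Perceptron's repeated epochs.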
Deep learning. Recently, deep learning has attracted much attention for data stream processing. However, only limited progress on online deep learning has been made, because the networks should not be too deep to allow real-time processing. For instance, a generative adversarial network (GAN) has been proposed (Besedin et al., 2017) to derive a deep network on data streams without the need to store incoming data. This technique works by regenerating historical training data to compensate for the absence of a synopsis of the data. In Read, Perez-Cruz, and Bifet (2015), the authors explored two deep learning methods to classify semi-labeled data. These methods provide important advantages in the stream framework, such as learning incrementally with constant memory usage.
Deep learning is so far not commonly used with evolving data streams, due to the high computational cost of training and its sensitivity to hyper-parameter configurations (e.g., depth, number of neurons) (Marrón, Read, Bifet, & Navarro, 2017). Moreover, since deep learning methods are resource-intensive, they require powerful GPUs. Another challenging problem is that data streams may be non-stationary. To address this issue and speed up learning, it is more practical to train NNs on mini-batches, under the assumption that the non-stationary stream can be separated into chunks (data close in time are assumed to be stationary and to follow the same distribution) (Chen & Lin, 2014).
3.2 |Regression
Regression is a supervised learning task where the goal is to estimate the relationship between a dependent variable
and one or more independent variables. It differs from classification as the dependent variable is numeric. In the con-
text of data streams, regression is often associated with time series analysis and forecasting. However, in a data stream
setting, we assume that data points are independent and identically distributed (iid). Therefore, traditional univariate
and multivariate time series analysis is not applicable.
The most basic regression technique is simple linear regression, in which a line is fit to represent the relationship between a single independent variable and the dependent variable. Multiple linear regression is an extension to multiple independent variables. In a streaming setting, the process of fitting such a line to the data can be performed in an
8of17 BAHRI ET AL.
online fashion through SGD, as in a Perceptron. To fit more complex data, for which a linear relationship is not sufficient, one can rely on polynomial regression, where the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial. Alternatively, neural networks can be employed (as in classification), following the same scheme of training with SGD.
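Fitting a linear model online via SGD, as described above, can be sketched like this; the class and the synthetic, noiseless stream are illustrative, not from any specific stream library.

```python
import numpy as np

class OnlineLinearRegression:
    """Fits y ~ w.x + b with one stochastic-gradient step per arriving instance."""

    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return float(np.dot(self.w, x) + self.b)

    def partial_fit(self, x, y):
        # Gradient step on the squared error (y_hat - y)^2 w.r.t. w and b.
        x = np.asarray(x, dtype=float)
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

# Stream drawn from y = 2x + 1; the model is updated one point at a time.
rng = np.random.default_rng(42)
model = OnlineLinearRegression(n_features=1)
for _ in range(5000):
    x = rng.uniform(-1, 1)
    model.partial_fit([x], 2 * x + 1)
```

After one pass over the stream, the coefficients approach the generating slope and intercept without any instance ever being revisited.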
Incremental decision trees are often used in the streaming setting, most notably for classification with Hoeffding Trees (Domingos & Hulten, 2000), but also for regression. The Fast and Incremental Model Tree (FIMT-DD) (Ikonomovska, Gama, & Džeroski, 2011) builds incremental regression trees similarly to Hoeffding Trees: FIMT-DD starts with an empty tree whose leaves collect statistics from arriving data until a grace period is reached; the features are then ranked according to their variance w.r.t. the target variable to decide on splits, and if the two best-ranked features differ by at least the Hoeffding bound (Hoeffding, 1994), the leaf splits. Similarly to other incremental decision trees, FIMT-DD performs concept drift detection and adaptation by periodically resetting subbranches of the tree in which significant variance increases are observed.
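The split test above hinges on the Hoeffding bound: for n observations of a quantity with range R, the true mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the sample mean with probability 1 − δ. A minimal sketch of how such a split decision can be implemented (function names and default values are illustrative, not FIMT-DD's actual code):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean of n observations lies within epsilon
    of the sample mean with probability 1 - delta (R = range of the values)."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_merit, second_merit, value_range=1.0, delta=1e-7, n=200):
    # Split only when the merit gap between the two best-ranked features
    # exceeds the bound, i.e., the ranking is unlikely to be a sampling fluke.
    return (best_merit - second_merit) > hoeffding_bound(value_range, delta, n)
```

As n grows, ε shrinks towards zero, so even a small but genuine merit gap eventually triggers a split.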
In the search for predictive performance improvements of incremental regression trees, a common approach is to ensemble several trees, similarly to ensembles of classification models (Gomes et al., 2017a). Ikonomovska, Gama, and Džeroski (2015) proposed the online random forest (ORF) and online bagging (OBag) ensembles that use FIMT-DD as the base learner. Based on empirical experiments, the authors concluded that ORTO-A (online option trees with averaging) outperformed both OBag and ORF in terms of mean squared error (MSE). Gomes, Barddal, Ferreira, and Bifet (2018) proposed the Adaptive Random Forest regressor (ARF-Reg), an adaptation of the data stream classifier ARF (Gomes et al., 2017). Like ORF, ARF-Reg builds a forest of FIMT-DD trees; the main difference between the two algorithms is that ARF-Reg employs one instance of the ADWIN algorithm (Bifet & Gavalda, 2007) per tree to detect concept drifts. Even though there are some similarities between ensemble classification and regression models, there are also important differences, for example, w.r.t. how predictions are combined and how diversity is induced. These were recently empirically analyzed in Gomes, Montiel, Mastelini, Pfahringer, and Bifet (2020).
4|CLUSTERING
Unlike in supervised learning, the instances in clustering are not associated with a discrete class label, as in classification, or a continuous value, as in regression, because clustering methods aim to discover the possible classes from the content of the data (de Souza Viana, de Oliveira, da Silva, Falcão, & Gonçalves, 2018). The big data stream clustering task can be defined as the process that continuously maintains a consistent clustering of the data encountered thus far from the stream while using limited amounts of time and memory (Chen, Oliverio, Kim, & Shen, 2019; Silva et al., 2013). As mentioned before, the infinite nature of the data imposes several challenges and the need to process it in real time. Thus, incremental clustering algorithms are needed to maintain the evolving cluster structures. Moreover, due to the dynamics of the data stream, new clusters might appear, others might disappear, and some clusters can move in the instance space.
A recent study (Chen et al., 2019) presents the clustering categories and related methods while discussing the pros and cons of each. However, the authors reviewed methods applicable to big data in general and did not examine them in the streaming context.
The main approaches to clustering streaming data can be summarized as:
• Partitioning clustering organizes a set of instances into partitions, in such a way that each partition represents a cluster. These clusters are formed by minimizing some objective function, such as the sum of squared distances to the cluster centroids. Examples of well-known algorithms include k-means (Farnstrom, Lewis, & Elkan, 2000) and k-medoids (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003).
• Micro-cluster-based clustering divides the process into two principal phases: (i) the online phase summarizes the data stream in micro-clusters; and (ii) the offline phase builds a general cluster model using the local micro-clusters. The most representative streaming algorithms are CluStream (Aggarwal et al., 2003) and ClusTree (Kranen, Assent, Baldauf, & Seidl, 2011).
• Density-based clustering is based on the idea that a cluster should be built around instances with a significant number of points in their neighborhoods, that is, in dense areas of the instance space. The DBSCAN algorithm (Sander, Ester, Kriegel, & Xu, 1998) is the most representative density-based offline algorithm. Cao, Ester, Qian, and Zhou (2006) presented Den-Stream, a streaming density-based algorithm that extends the main concepts of DBSCAN to the streaming setting by using micro-clusters to compute summary statistics online. It uses a fading window model, where the weight of each data point decreases exponentially using the function f(t) = 2^(−λt), with decay rate λ > 0.
• Hierarchical clustering, also known as connectivity-based clustering, is mainly composed of agglomerative and divisive methods. Agglomerative clustering refers to the "bottom-up" methods, where each instance starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. On the other hand, divisive clustering is a "top-down" method that starts from the top, where all instances are grouped in one cluster, and splits into different clusters recursively as one moves down the hierarchy. However, in the stream setting, instances arrive one by one and are not all available at once. Therefore, hierarchical clustering for streams was introduced with the Online Divisive-Agglomerative Clustering (ODAC) system (Rodrigues, Gama, & Pedroso, 2008), which continuously maintains a hierarchical clustering structure while monitoring the diameter of the clusters. The hierarchy grows when more information is available, allowing a more detailed cluster structure. When a change in the correlation structure of the process that generates the data is detected, the hierarchy contracts by merging the clusters where the change was detected.
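The exponential fading used by Den-Stream above can be illustrated with a toy micro-cluster whose weight decays by f(t) = 2^(−λt) between updates; this is a simplified sketch, not the actual Den-Stream implementation.

```python
class FadingMicroCluster:
    """Toy micro-cluster whose weight decays as f(t) = 2 ** (-lam * t),
    so old points gradually lose influence (Den-Stream-style fading)."""

    def __init__(self, lam=0.25):
        self.lam = lam
        self.weight = 0.0
        self.last_update = 0.0

    def weight_at(self, now):
        # Weight faded from the last update time to `now`.
        return self.weight * 2.0 ** (-self.lam * (now - self.last_update))

    def insert(self, now):
        # Fade the accumulated weight, then count the newly absorbed point.
        self.weight = self.weight_at(now) + 1.0
        self.last_update = now

mc = FadingMicroCluster(lam=1.0)
mc.insert(0.0)   # one point at t = 0: weight is 1
```

With λ = 1, the weight of that point halves after one time unit, so micro-clusters that stop receiving points eventually become negligible and can be pruned.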
Distributed algorithms for clustering have been developed in the context of sensor networks. There are two main perspectives: (i) a cluster is a group of data points (Gama, Rodrigues, & Lopes, 2011); and (ii) a cluster is a group of sensors, as in Rodrigues et al. (2018).
There is no consensus on the evaluation of clustering algorithms. Kremer et al. presented the Cluster Mapping Measure, which accounts for multiple types of errors by considering the characteristics of evolving data. Spiliopoulou, Ntoutsi, Theodoridis, and Schult (2006) introduced the MONIC system, which aims to detect and track changes in clusters by assuming that a cluster represents an instance in some geometric space. MONIC works by encompassing changes that involve more than one cluster, enabling insights on cluster change in the entire clustering. The transition tracking mechanism depends on the degree of overlap between two clusters. The notion of overlapping
between any two clusters, C1 and C2, can be defined as the number of common instances weighted by the age of the records. Suppose that the clusters C1 and C2 are obtained at time instances t1 and t2, respectively. The degree of overlap between these two clusters is then computed as:

overlap(C1, C2) = ( Σ_{a ∈ C1 ∩ C2} age(a, t2) ) / ( Σ_{x ∈ C1} age(x, t2) ).
The latter permits the deduction of properties concerning the underlying structure of the data stream. A cluster transition at a given time instance is defined as a change in a cluster discovered at an earlier instance. The MONIC system considers internal and external transitions that reflect the dynamics and changes in the stream, such as: a cluster survives (it is not going to disappear); a cluster is absorbed (by another cluster); a cluster disappears (is totally removed); a cluster emerges (a new cluster is created). Tracking cluster evolution on panel and longitudinal data appears in Oliveira and Gama (2012).
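The overlap formula above can be computed directly; the clusters and age weights below are hypothetical, chosen only to exercise the definition.

```python
def overlap(c1, c2, age):
    """MONIC-style overlap: common instances of c1 and c2, weighted by age,
    normalized by the total age-weight of c1. `age` maps instance -> weight."""
    common = c1 & c2
    denom = sum(age[x] for x in c1)
    return sum(age[a] for a in common) / denom if denom else 0.0

# Hypothetical clusters found at t1 and t2, with newer records weighted higher.
c_old = {"a", "b", "c", "d"}
c_new = {"b", "c", "e"}
age_weight = {"a": 0.5, "b": 1.0, "c": 1.0, "d": 0.5, "e": 1.0}
```

Here overlap(c_old, c_new, age_weight) is (1 + 1) / (0.5 + 1 + 1 + 0.5) = 2/3, which MONIC would read as c_old largely surviving into c_new.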
5|FREQUENT PATTERNS
Frequent pattern mining is an important unsupervised learning task that can be employed to determine the structure of the data, to discover association rules, or to find discriminative attributes that can be exploited for classification or clustering tasks. Examples of pattern classes are itemsets, trees, graphs, and sequences (Bifet & Gavaldà, 2008).
The frequent pattern mining problem is stated as follows: given a batch dataset or a data stream that encompasses patterns, and a threshold σ, the task consists in finding all the patterns that appear as a subpattern in a fraction σ of the patterns in the data. For instance, if the input data is a stream of purchases in a supermarket and σ = 10%, we would call {cheese, wine} a frequent pattern if at least 10% of the purchases include, among other products, both cheese and wine. Another example involves graphs, where a triangle can be considered a graph pattern. Given a dataset of graphs, this pattern would be frequent if at least a fraction σ of the graphs contain at least one triangle.
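The support threshold can be illustrated with a naive, offline count over a toy purchase log; this is a direct reading of the definition, not a stream mining algorithm.

```python
from itertools import combinations

def frequent_itemsets(transactions, sigma, max_size=2):
    """Return every itemset (up to max_size items) appearing in at least a
    fraction sigma of the transactions -- a naive offline illustration."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    n = len(transactions)
    return {s for s, c in counts.items() if c / n >= sigma}

purchases = [("cheese", "wine", "bread"),
             ("cheese", "wine"),
             ("bread",),
             ("cheese", "wine", "milk"),
             ("milk",)]
frequent = frequent_itemsets(purchases, sigma=0.6)
```

{cheese, wine} appears in 3 of 5 purchases (support 0.6), so it is frequent at σ = 60%, whereas {milk} (support 0.4) is not. A stream algorithm must approximate these counts in a single pass.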
In the offline setting, Apriori, Eclat, and FP-growth are well-known algorithms for discovering frequent itemsets in datasets. Similar approaches for other data structures, for example sequences and graphs, can be found in the literature. Nevertheless, adapting these algorithms to the stream setting is not an easy task because they violate the single-pass requirement and maintain too much information.
To cope with the aforementioned issues, stream approaches for frequent pattern mining have been proposed that use a batch miner as a base, leading to approximate rather than exact results; thus, other purely online ideas still need to be explored. Examples of algorithms for frequent itemset mining in data streams are Moment (Chi, Wang, Yu, & Muntz, 2006) and IncMine (Cheng, Ke, & Ng, 2008).
6|TOWARDS AUTOML
Machine learning has benefited from tremendous research progress recently in many application areas, particularly in the stream setting. The growing number of machine learning algorithms and their respective hyperparameters gives rise to a large number of possible configurations, whose selection traditionally relies on qualified experts (i.e., human intervention and expertise).
There is no doubt that current algorithms (some of which are mentioned in the previous sections) are suitable for data streams, but they usually require the configuration to be set in advance (e.g., the ensemble size for ensemble-based methods, or the number of neighbors k for the kNN algorithm). Moreover, stream approaches are not fully automated; that is, the parameterization set at the beginning may not hold for all parts of the stream, since models may change over time because of concept drifts. How, then, can this issue be addressed?
Automated Machine Learning (AutoML) (Hutter, Kotthoff, & Vanschoren, 2019) is a new tool, receiving increased attention, that aims to tackle the parameter configuration issue using automatic monitoring models. Multiple systems^4 have been proposed in the offline setting that allow hyperparameter tuning by combining AutoML with well-known machine learning software, such as Weka and Scikit-learn (AutoWeka^5 and AutoSklearn,^6 respectively) (Feurer et al., 2015).
Yet, only a very limited number of contributions on AutoML for evolving data streams exist in the literature. For instance, Self Parameter Tuning (SPT) is an automated technique that controls the stream algorithm configuration by incrementally selecting the best hyperparameter(s), which may change over time (Veloso et al., 2018).
We consider that automatic algorithm configuration for data stream mining can be revolutionary. In fact, selecting
the best hyperparameter configuration for stream algorithms is a tedious task because it may change depending on the
characteristics (e.g., number of attributes) and contents of data. Hence, tuning incremental algorithms' configurations
automatically and continuously is a very promising direction in machine learning for data streams.
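A drastically simplified illustration of the idea behind such self-tuning (not the actual SPT algorithm): run one model per candidate hyperparameter value, score each one prequentially, and answer with the current best. All classes and values here are toy assumptions.

```python
class ThresholdModel:
    """Trivial 1-D classifier: predict 1 iff x >= threshold. It stands in
    for any stream learner exposing predict/partial_fit."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, x):
        return 1 if x >= self.threshold else 0
    def partial_fit(self, x, y):
        pass  # static toy model; a real stream learner would update here

class OnlineTuner:
    """Keeps one model per candidate configuration, scores each with a
    test-then-train hit count, and predicts with the best one so far."""
    def __init__(self, make_model, candidates):
        self.models = {c: make_model(c) for c in candidates}
        self.hits = {c: 0 for c in candidates}
    def learn_one(self, x, y):
        for c, m in self.models.items():
            if m.predict(x) == y:      # test first ...
                self.hits[c] += 1
            m.partial_fit(x, y)        # ... then train
    def best(self):
        return max(self.hits, key=self.hits.get)
    def predict(self, x):
        return self.models[self.best()].predict(x)

# Stream whose true decision boundary is x = 0.5.
tuner = OnlineTuner(ThresholdModel, candidates=[0.1, 0.5, 0.9])
for x in [0.0, 0.2, 0.4, 0.6, 0.8] * 10:
    tuner.learn_one(x, 1 if x >= 0.5 else 0)
```

Because the scores are accumulated over the stream, the selected configuration can change if a concept drift makes another candidate start winning.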
7|EVALUATION METRICS
In the stream setting, two main, strongly related evaluation axes are used to assess the efficiency of algorithms alongside their quality (e.g., accuracy for classification): (i) the execution time, which includes any preprocessing (such as dimension reduction), prediction, and learning steps; and (ii) the memory used by an algorithm, which comprises the storage needed to keep the current model(s) together with the statistical information required for incremental processing (e.g., the number of instances received so far from the stream).
In the context of supervised learning, it is important to evaluate the trained model and test its applicability in different scenarios on different data streams. The prequential evaluation (Dawid, 1984), also called interleaved test-then-train, is the most used evaluation method proposed specifically to assess the performance of data stream algorithms incrementally. This evaluation scheme consists in using each instance first to test (make a prediction with) the current model, and thereafter to update (train) the model. Another important evaluation approach is the holdout evaluation, which uses separate test and training datasets. Bifet et al. (2015) propose an evaluation methodology for big data streams that addresses different scenarios, including unbalanced data and data where change occurs on different time scales. Most notably, Bifet et al. (2015) introduce adaptations of cross-validation to the streaming setting.
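The test-then-train loop can be sketched as follows, using a trivial majority-class learner as a stand-in for any stream model; both pieces are illustrative.

```python
def prequential_accuracy(model, stream):
    """Interleaved test-then-train: each arriving instance is first used to
    test the current model, then immediately used to train it."""
    correct, n = 0, 0
    for x, y in stream:
        if model.predict(x) == y:   # 1) test on the incoming instance
            correct += 1
        model.partial_fit(x, y)     # 2) then train on the same instance
        n += 1
    return correct / n

class MajorityClass:
    """Baseline learner: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

acc = prequential_accuracy(MajorityClass(), [(0, "a"), (1, "a"), (2, "b"), (3, "a")])
```

Every instance contributes to both testing and training, so no held-out set is needed and the whole stream is used for evaluation.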
Several metrics exist to measure the performance of classification algorithms, and most of them are easily applied to data stream classification. Accuracy is an intuitive metric that measures the percentage of correctly classified instances with respect to all predictions made. If the distribution of examples across the class labels is imbalanced, accuracy can be misleading, as a model that always predicts the majority class will yield high accuracy. In such cases, metrics such as sensitivity, specificity, and the g-mean are better alternatives.
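A small illustration of why accuracy misleads under class imbalance, using the g-mean (the geometric mean of sensitivity and specificity); the data are synthetic.

```python
def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity; it stays low when the
    model neglects the minority class, unlike plain accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return (sensitivity * specificity) ** 0.5

# A majority-class predictor on a 90/10 imbalanced stream:
y_true = [0] * 9 + [1]
y_pred = [0] * 10
```

Here accuracy is 0.9, yet the g-mean is 0.0 because the single minority instance is never detected.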
The most common metrics used to evaluate regression predictions are (i) the root mean squared error (RMSE), the square root of the average squared difference between the target value and the value predicted by the model; and (ii) the mean absolute error (MAE), the average absolute difference between the target value and the value predicted by the model.
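Both metrics can be maintained incrementally from running sums, which suits the streaming constraint of not storing past predictions; this is a minimal sketch.

```python
import math

class RegressionMetrics:
    """Incrementally maintained MAE and RMSE, updated one prediction at a
    time as required in a streaming evaluation (no predictions are stored)."""

    def __init__(self):
        self.n = 0
        self.abs_sum = 0.0
        self.sq_sum = 0.0

    def update(self, y_true, y_pred):
        err = y_pred - y_true
        self.n += 1
        self.abs_sum += abs(err)
        self.sq_sum += err * err

    @property
    def mae(self):
        return self.abs_sum / self.n

    @property
    def rmse(self):
        return math.sqrt(self.sq_sum / self.n)

m = RegressionMetrics()
for yt, yp in [(1.0, 2.0), (3.0, 3.0), (5.0, 3.0)]:
    m.update(yt, yp)
```

Only three running scalars are kept, regardless of how many instances the stream has produced.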
Another topic of interest is the evaluation of data streams when there is a non-negligible delay between the arrival of an instance and that of its corresponding label. Grzenda, Gomes, and Bifet (2019) claim that besides “how” predictions affect the predictive performance, it is also essential to consider “when” labels are made available as part of the evaluation. This leads to the concept of continuous re-evaluation, introduced in Grzenda et al. (2019) and further explored in Grzenda, Gomes, and Bifet (2020). The goal of continuous re-evaluation is to observe if, and how fast, models can transform an initially incorrect prediction into a correct one before the true label arrives in a streaming setting.
Finally, a multitude of evaluation measures (also called validation measures) have been proposed to assess the quality of the resulting clusterings. We direct the reader to Kremer et al. (2011), which discusses and compares these measures in extensive experiments.
8|OPEN SOURCE SOFTWARE
Multiple frameworks for data stream mining have been proposed in the literature. The set of available open source software contains a multitude of state-of-the-art algorithms that can be extended to propose new approaches and/or to compare against. In the following, we cite some widely used software with active, growing communities, along with newer tools. Massive online analysis (MOA): MOA^7 (Bifet et al., 2010) is the most popular open source framework for machine learning on evolving data streams. It is written in Java, implemented on top of the Waikato Environment for Knowledge Analysis (WEKA),^8 and has a very active research community. MOA provides different data generators (e.g., the SEA and LED generators), stream mining algorithms (e.g., algorithms for classification, clustering, regression, and anomaly detection), evaluation methods (e.g., the prequential evaluation), and statistics to evaluate the performance of algorithms (e.g., memory, time, accuracy, kappa). The software can be used via a command line or a user interface. A recent book (Bifet et al., 2018) discusses MOA and how to use it, along with exercises and lab sessions. Generally, researchers working with MOA make their contributed code available as MOA extensions.^9
Scalable advanced massive online analysis (SAMOA): SAMOA^10 (Morales & Bifet, 2015) is presented as both a library and a framework, written in Java, that combines data stream analysis and distributed processing. SAMOA allows the creation of distributed stream machine learning algorithms and runs them on distributed stream processing engines in a fast and scalable manner. The framework provides a collection of distributed versions of some data stream algorithms (e.g., bagging, boosting).
StreamDM: StreamDM^11 is an open source framework for online machine learning that utilizes Spark Streaming to enable stream processing from a variety of sources. The main advantages of StreamDM are (i) the use of the Spark Streaming API, which enables scalable stream processing and can handle issues such as “out of order data” in data sources; and (ii) the fact that it allows combining batch processing algorithms with streaming algorithms (Bifet et al., 2015).
Scikit-multiflow: Scikit-multiflow^12 (Montiel et al., 2018) is a more recent open source software package designed for multi-label/multi-output and single-output stream learning algorithms, inspired by the popular frameworks scikit-learn (Pedregosa et al., 2011), MOA (Bifet et al., 2010), and MEKA,^13 to fill the void in Python for data stream mining. Similar to MOA, scikit-multiflow also contains stream generators and algorithms for data streams. More recently, scikit-multiflow has merged with the creme framework (https://maxhalford.github.io/) into a new Python project called River. Given the increasing popularity of the Python programming language, the advantage of the scikit-multiflow framework, which complements scikit-learn's focus on batch learning, is its similarity to the latter, which is widely used by researchers and practitioners. Moreover, it can be used within the popular Jupyter Notebook interface, often used by the data science community. On the other hand, a notable drawback is that this software may be slow in comparison with MOA, because Python code is expected to execute more slowly than Java code.
Comparative studies (Behera, Das, Jena, Rath, & Sahoo, 2017; Gomes et al., 2019b; Inoubli, Aridhi, Mezni, Maddouri, & Nguifo, 2018) of stream frameworks, evaluating their performance in terms of resource consumption, have been provided with a focus on distributed stream processing tools such as Apache Samza, Apache Spark, Apache Flink, and Apache Storm.
9|CONCLUSION
The aim of this survey paper is to present a holistic view of data stream mining by reviewing the stream mining challenges and foundations. We also conducted a comprehensive literature review of the baseline algorithms in data stream mining and discussed the most promising and recent ones. Moreover, this survey presents the principal metrics for algorithm evaluation and well-known, actively growing open-source software for the stream environment. We hope that this summary provides the AI community with a clear overview of the main challenges, basics, and recent advances, as well as some open directions in the stream setting.
ACKNOWLEDGEMENTS
This work has been carried out in the frame of a cooperation between Huawei Technologies France SASU and Télécom
Paris (Grant no. YBN2018125164).
CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
AUTHOR CONTRIBUTIONS
Albert Bifet: Conceptualization. Heitor Gomes: Conceptualization. Silviu Maniu: Conceptualization.
DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no data sets were generated or analyzed during the current study.
ENDNOTES
* Email: maroua.bahri@telecom-paris.fr
1 www.statista.com/statistics/976079/number-of-iot-connected-objects-worldwide-by-type/
2 In the sequel, we use the terms streaming, online, or incremental interchangeably.
3 The offline Bagging applies resampling, i.e., sampling with replacement, to train its ensemble members on different subsets of instances.
4 https://www.automl.org/automl/
5 https://www.automl.org/automl/autoweka/
6 https://www.automl.org/automl/auto-sklearn/
7 https://moa.cms.waikato.ac.nz/
8 https://www.cs.waikato.ac.nz/ml/weka/
9 https://moa.cms.waikato.ac.nz/moa-extensions/
10 http://samoa.incubator.apache.org
11 http://huawei-noah.github.io/streamDM/
12 https://scikit-multiflow.github.io
13 The MEKA project provides algorithms for multi-label learning and evaluation.
REFERENCES
Abdulsalam H, Skillicorn DB, Martin P. Streaming random forests. In: International Database Engineering and Applications Symposium
(IDEAS). Banff, Canada: IEEE; 2007, 225–232.
Aggarwal CC. Data streams: models and algorithms, vol. 31. New York: Springer Science & Business Media; 2007.
Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: International Conference on Very Large Data
Bases. VLDB Endowment; 2003, 81–92.
Aggarwal CC, Philip SY. A survey of synopsis construction in data streams. In: Data streams. Springer; 2007, 169–207.
Amini A, Wah TY, Saboohi H. On density-based data streams clustering algorithms: A survey. Journal of Computer Science and Technology
2014, 29:116–141.
Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: ACM SIGMOD. New York: ACM;
2002, 1–16.
Bahri, M. (2020). Improving IoT data stream analytics using summarization techniques (Ph.D. thesis). Institut Polytechnique de Paris.
Bahri M, Bifet A, Maniu S, Gomes HM. Survey on feature transformation techniques for data streams. In: International Joint Conference on
Artificial Intelligence (IJCAI). 2020. Yokohama.
Bahri M, Maniu S, Bifet A. Sketch-based naive bayes algorithms for evolving data streams. In: International Conference on Big Data. Seattle:
IEEE; 2018, 604–613.
Bahri M, Pfahringer B, Bifet A, Maniu S. Efficient batch-incremental classification for evolving data streams. In: Intelligent Data Analysis
(IDA). Konstanz: Springer; 2020.
Behera RK, Das S, Jena M, Rath SK, Sahoo B. A comparative study of distributed tools for analyzing streaming data. In: International Confer-
ence on Information Technology (ICIT). Toronto: IEEE; 2017, 79–84.
Besedin A, Blanchart P, Crucianu M, Ferecatu M. Evolutive deep models for online learning on data streams with no storage; 2017.
Beygelzimer A, Kale S, Luo H. Optimal and adaptive algorithms for online boosting. In: International Conference on Machine Learning
(ICML). 2015, 2323–2331. Lille.
Bifet A, de Francisci Morales G, Read J, Holmes G, Pfahringer B. Efficient online evaluation of big data stream classifiers. In: Proceedings of
the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 2015, 59–68. Sydney.
Bifet A, Gavalda R. Learning from time-changing data with adaptive windowing. In: International Conference on Data Mining (ICDM).
SIAM; 2007, 443–448.
Bifet A, Gavaldà R. Mining adaptively frequent closed unlabeled rooted trees in data streams. In: ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. 2008, 34–42. Las Vegas, Nevada, USA. https://doi.org/10.1145/1401890.1401900.
Bifet A, Gavaldà R. Adaptive learning from evolving data streams. In: Intelligent Data Analysis (IDA). Lyon: Springer; 2009, 249–260.
Bifet A, Gavaldà R, Holmes G, Pfahringer B. Machine learning for data streams: With practical examples in MOA. MIT Press; 2018.
Bifet A, Holmes G, Kirkby R, Pfahringer B. Moa: Massive online analysis. Journal of Machine Learning Research (JMLR) 2010, 11:1601–1604.
Bifet A, Holmes G, Pfahringer B. Leveraging bagging for evolving data streams. In: Joint European conference on machine learning and
knowledge discovery in databases. Barcelona: Springer; 2010, 135–150.
Bifet A, Holmes G, Pfahringer B, Frank E. Fast perceptron decision tree learning from evolving data streams. In: Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD). Springer; 2010, 299–310.
Bifet A, Kirkby R. Data stream mining: A practical approach; 2009.
Bifet A, Maniu S, Qian J, Tian G, He C, Fan W. Streamdm: Advanced data mining in spark streaming. In: International Conference on Data
Mining Workshop (ICDMW). Atlantic City: IEEE; 2015, 1608–1611.
Bifet A, Pfahringer B, Read J, Holmes G. Efficient data stream classification via probabilistic adaptive windows. In: Symposium On Applied
Computing (SIGAPP). Coimbra: ACM; 2013, 801–806.
Breiman L. Bagging predictors. Machine Learning 1996, 24:123–140.
Breiman L. Random forests. Machine Learning 2001, 45:5–32.
Caiming Z, Yong C. A review of research relevant to the emerging industry trends: Industry 4.0, iot, blockchain, and business analytics. Jour-
nal of Industrial Integration and Management 2020, 5:165–180.
Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In: Ghosh J, Lambert D, Skillicorn DB,
Srivastava J, eds. Sixth SIAM International Conference on Data Mining. Bethesda, MD, USA: SIAM; 2006, 328–339.
Carnein M, Trautmann H. Optimizing data stream representation: An extensive survey on stream clustering algorithms. Business & Informa-
tion Systems Engineering 2019, 61:277–297.
Chang L-C, Chen P-A, Chang F-J. Reinforced two-step-ahead weight adjustment technique for online training of recurrent neural networks.
IEEE Transactions on Neural Networks and Learning Systems 2012, 23:1269–1278.
Chen S-T, Lin H-T, Lu C-J. An online boosting algorithm with theoretical justifications. In: International Conference on Machine Learning
(ICML). 2012. Edinburgh.
Chen W, Oliverio J, Kim JH, Shen J. The modeling and simulation of data clustering algorithms in data mining with big data. Journal of
Industrial Integration and Management 2019, 4:1850017.
Chen X-W, Lin X. Big data deep learning: challenges and perspectives. IEEE Access 2014, 2:514–525.
Cheng J, Ke Y, Ng W. Maintaining frequent closed itemsets over a sliding window. Journal of Intelligent Information Systems 2008, 31:
191–215 https://doi.org/10.1007/s10844-007-0042-3.
Chi Y, Wang H, Yu PS, Muntz RR. Catch the moment: Maintaining closed frequent itemsets over a data stream sliding window. Knowledge
and Information Systems 2006, 10:265–294 https://doi.org/10.1007/s10115-006-0003-0.
Cormode G, Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 2005,
55:58–75.
Cortes C, Vapnik V. Support-vector networks. ML 1995, 20:273–297.
Da Xu L, He W, Li S. Internet of things in industries: A survey. IEEE Transactions on Industrial Informatics 2014, 10:2233–2243.
Dawid AP. Present position and potential developments: Some personal views statistical theory the prequential approach. Journal of the
Royal Statistical Society: Series A (General) 1984, 147:278–290.
de Barros RSM, de Carvalho Santos SGT, Júnior PMG. A boosting-like online learning ensemble. In: International Joint Conference on Neural
Networks (IJCNN). Vancouver: IEEE; 2016, 1871–1878.
de Souza Viana TS, de Oliveira M, da Silva TLC, Falcão MSR, Gonçalves EJT. A message classifier based on multinomial naive bayes for online social contexts. Journal of Management Analytics 2018, 5:213–229.
Dietterich TG. Ensemble methods in machine learning. In: International Workshop on Multiple Classifier Systems. Cagliari: Springer;
2000, 1–15.
Domingos P, Hulten G. Mining high-speed data streams. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Boston: ACM; 2000, 71–80.
Farnstrom F, Lewis J, Elkan C. Scalability for clustering algorithms revisited. ACM SIGKDD Explorations Newsletter 2000, 2:51–57.
Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F. Efficient and robust automated machine learning. Advances in Neu-
ral Information Processing Systems 2015, 28:2962–2970.
Freund Y, Schapire RE. A desicion-theoretic generalization of on-line learning and an application to boosting. In: European Conference on
Computational Learning Theory. Barcelona: Springer; 1995, 23–37.
Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997, 29:131–163.
Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: A review. ACM SIGMOD Record 2005, 34:18–26.
Gama J. A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence 2012, 1:45–55.
Gama J, Gaber MM. Learning from data streams: Processing techniques in sensor networks. Springer; 2007.
Gama J, Rocha R, Medas P. Accurate decision trees for mining high-speed data streams. In: ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. Washington DC: ACM; 2003, 523–528.
Gama J, Rodrigues PP, Lopes LMB. Clustering distributed sensor data streams using local processing and reduced communication. Intelligent
Data Analysis 2011, 15:3–28.
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A. A survey on concept drift adaptation. Computing Surveys (CSUR) 2014, 46:44.
Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look (a tutorial). In: ACM SIGMOD International
Conference on Management of Data. 2002, 635.
Gomes HM, Barddal JP, Enembreck F, Bifet A. A survey on ensemble learning for data stream classification. Computing Surveys (CSUR)
2017, 50:23.
Gomes HM, Barddal JP, Ferreira LEB, Bifet A. Adaptive random forests for data stream regression. In: European Symposium on Artificial
Neural Networks (ESANN). Bruges: Springer; 2018.
Gomes HM, Bifet A, Read J, Barddal JP, Enembreck F, Pfharinger B, Holmes G, Abdessalem T. Adaptive random forests for evolving data
stream classification. Machine Learning 2017, 106:1–27.
Gomes HM, Montiel J, Mastelini SM, Pfahringer B, Bifet A. On ensemble techniques for data stream regression. In: IEEE International Joint
Conference on Neural Networks. Glasgow: IEEE; 2020.
Gomes HM, Read J, Bifet A. Streaming random patches for evolving data stream classification. In: International Conference on Data Mining
(ICDM). Beijing: IEEE; 2019.
Gomes HM, Read J, Bifet A, Barddal JP, Gama J. Machine learning for streaming data: state of the art, challenges, and opportunities. ACM
SIGKDD Explorations Newsletter 2019, 21:6–22.
Grzenda M, Gomes HM, Bifet A. Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery 2019:1–30.
Grzenda M, Gomes HM, Bifet A. Performance measures for evolving predictions under delayed labelling classification. In: International Joint
Conference on Neural Networks (IJCNN). Glasgow: IEEE; 2020, 1–8.
Guha S, Meyerson A, Mishra N, Motwani R, O'Callaghan L. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge
and Data Engineering 2003, 15:515–528.
Hand DJ, Mannila H, Smyth P. Principles of data mining. London: MIT Press; 2001.
Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 1963,
58:13–30.
Holmes G, Kirkby RB, Bainbridge D. Batch-incremental learning for mining data streams. 2004.
Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression, vol. 398. Wiley; 2013.
Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. San Francisco: ACM; 2001, 97–106.
Hutter F, Kotthoff L, Vanschoren J. Automated machine learning. Springer; 2019.
Ikonomovska E, Gama J, Džeroski S. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery 2011, 23:
128–168.
Ikonomovska E, Gama J, Džeroski S. Online tree-based ensembles and option trees for regression on evolving data streams. Neurocomputing
2015, 150:458–470.
Inoubli W, Aridhi S, Mezni H, Maddouri M, Nguifo E. A comparative study on streaming frameworks for big data. In: Very Large Data Bases
(VLDB). Rio de Janeiro: Springer; 2018.
Jain LC, Seera M, Lim CP, Balasubramaniam P. A review of online learning in supervised neural networks. Neural Computing and Applica-
tions 2014, 25:491–509.
Jankowski D, Jackowski K, Cyganek B. Learning decision trees from data streams with concept drift. Procedia Computer Science 2016, 80:
1682–1691.
Kim JH. Integrating IoT with LQR-PID controller for online surveillance and control of flow and pressure in fluid transportation system.
Journal of Industrial Integration and Management 2017, 17:100–127.
Klawonn F, Angelov P. Evolving extended naive Bayes classifiers. In: International Conference on Data Mining Workshops. Hong Kong: IEEE;
2006, 643–647.
Kokate U, Deshpande A, Mahalle P, Patil P. Data stream clustering techniques, applications, and models: comparative analysis and discus-
sion. Big Data and Cognitive Computing 2018, 2:32.
Kolajo T, Daramola O, Adebiyi A. Big data stream analysis: a systematic literature review. Journal of Big Data 2019, 6:47.
Kranen P, Assent I, Baldauf C, Seidl T. The ClusTree: Indexing micro-clusters for anytime stream mining. Knowledge and Information Sys-
tems 2011, 29:249–272.
Kremer H, Kranen P, Jansen T, Seidl T, Bifet A, Holmes G, Pfahringer B. An effective evaluation measure for clustering on evolving data
streams. In: Apté C, Ghosh J, Smyth P, eds. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego:
ACM; 2011, 868–876.
Kuncheva LI. Combining pattern classifiers: methods and algorithms. Canada: John Wiley & Sons; 2014.
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current
trends on using data intrinsic characteristics. Information Sciences 2013, 250:113–141.
Losing V, Hammer B, Wersing H. KNN classifier with self-adjusting memory for heterogeneous concept drift. In: International Conference on
Data Mining (ICDM). Barcelona: IEEE; 2016, 291–300.
Losing V, Hammer B, Wersing H. Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing
2018, 275:1261–1274.
Manapragada C, Webb GI, Salehi M. Extremely fast decision tree. In: ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining. London: ACM; 2018, 1953–1962.
Manku GS, Motwani R. Approximate frequency counts over data streams. In: Very Large Data Bases (VLDB). Hong Kong: Elsevier; 2002,
346–357.
Marrón D, Read J, Bifet A, Navarro N. Data stream classification using random feature functions and novel method combinations. Journal of
Systems and Software 2017, 127:195–204.
McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In: AAAI Workshop on Learning for Text Categoriza-
tion. 1998, 41–48.
McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint
arXiv:1802.03426; 2018.
Montiel J, Read J, Bifet A, Abdessalem T. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research
(JMLR) 2018, 19:1–5.
Morales GDF, Bifet A. SAMOA: Scalable advanced massive online analysis. Journal of Machine Learning Research (JMLR) 2015, 16:149–153.
Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In: Transactions on Large-Scale Data- and Knowledge-Centered
Systems II. Berlin: Springer; 2010, 1–30.
Nguyen H-L, Woon Y-K, Ng W-K. A survey on data stream clustering and classification. Knowledge and Information Systems 2015, 45:
535–569.
Oliveira MDB, Gama J. A framework to monitor clusters evolution applied to economy and finance problems. Intelligent Data Analysis 2012,
16:93–111.
Oza NC. Online bagging and boosting. In: International Conference on Systems, Man and Cybernetics, vol. 3. Waikoloa, Hawaii: IEEE; 2005,
2340–2345.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn:
Machine learning in Python. Journal of Machine Learning Research (JMLR) 2011, 12:2825–2830.
Pratama M, Angelov PP, Lu J, Lughofer E, Seera M, Lim CP. A randomized neural network for data streams. In: International Joint Confer-
ence on Neural Networks (IJCNN). Anchorage: IEEE; 2017, 3423–3430.
Read J, Bifet A, Pfahringer B, Holmes G. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In: Intelli-
gent Data Analysis (IDA). Helsinki: Springer; 2012, 313–323.
Read J, Perez-Cruz F, Bifet A. Deep learning in partially-labeled data streams. In: Annual ACM Symposium on Applied Computing. Salamanca:
ACM; 2015, 954–959.
Rodrigues PP, Araújo J, Gama J, Lopes LMB. A local algorithm to approximate the global clustering of streams generated in ubiquitous sen-
sor networks. International Journal of Distributed Sensor Networks 2018, 14.
Rodrigues PP, Gama J, Pedroso JP. Hierarchical clustering of time-series data streams. IEEE Transactions on Knowledge and Data Engineer-
ing 2008, 20:615–627 https://doi.org/10.1109/TKDE.2007.190727.
Rutkowski L, Jaworski M, Duda P. Probabilistic neural networks for the streaming data classification. In: Stream Data Mining: Algorithms
and Their Probabilistic Properties. Springer; 2020, 245–277.
Sander J, Ester M, Kriegel H, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min-
ing and Knowledge Discovery 1998, 2:169–194 https://doi.org/10.1023/A:1009745219419.
Silva JA, Faria ER, Barros RC, Hruschka ER, de Leon Ferreira de Carvalho ACP, Gama J. Data stream clustering: A survey. ACM Computing
Surveys 2013, 46:13:1–13:31.
Sorzano COS, Vargas J, Montano AP. A survey of dimensionality reduction techniques. arXiv preprint arXiv:1403.2877; 2014.
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R. Monic: modeling and monitoring cluster transitions. In: Proceedings ACM International
Conference on Knowledge Discovery and Data Mining. Philadelphia: ACM Press; 2006, 706–711.
Veloso B, Gama J, Malheiro B. Self hyper-parameter tuning for data streams. In: Data Streams. Split: Springer; 2018.
Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: International Conference
on Machine Learning (ICML). Montreal: ACM; 2009, 1113–1120.
Zhou SM, Da Xu L. A new type of recurrent fuzzy neural network for modeling dynamic systems. Knowledge-Based Systems 2001, 14:
243–251.
How to cite this article: Bahri M, Bifet A, Gama J, Gomes HM, Maniu S. Data stream analysis: Foundations,
major tasks and tools. WIREs Data Mining Knowl Discov. 2021;e1405. https://doi.org/10.1002/widm.1405