OVERVIEW
Data stream analysis: Foundations, major tasks and tools
Maroua Bahri¹ | Albert Bifet¹,² | João Gama³ | Heitor Murilo Gomes² | Silviu Maniu⁴

¹LTCI, Télécom Paris, IP-Paris, Palaiseau, France
²Department of Computer Science, University of Waikato, Hamilton, New Zealand
³INESC TEC, University of Porto, Porto, Portugal
⁴LRI, Université Paris-Saclay, Orsay, France

Correspondence to: Maroua Bahri, LTCI, Télécom Paris, IP-Paris, Palaiseau, France. Email: maroua.bahri@telecom-paris.fr

Funding information: Huawei Technologies France SASU and Télécom Paris, Grant/Award Number: YBN2018125164

Edited by: Sushmita Mitra, Associate Editor, and Witold Pedrycz, Editor-in-Chief
Abstract
The significant growth of interconnected Internet-of-Things (IoT) devices, the
use of social networks, along with the evolution of technology in different
domains, lead to a rise in the volume of data generated continuously from
multiple systems. Valuable information can be derived from these evolving
data streams by applying machine learning. In practice, several critical issues
emerge when extracting useful knowledge from these potentially infinite data,
mainly because of their evolving nature and high arrival rate which implies an
inability to store them entirely. In this work, we provide a comprehensive sur-
vey that discusses the research constraints and the current state-of-the-art in
this vibrant framework. Moreover, we present an updated overview of the lat-
est contributions proposed in different stream mining tasks, particularly classification, regression, clustering, and frequent patterns.
This article is categorized under:
Fundamental Concepts of Data and Knowledge > Key Design Issues in Data
Mining
Fundamental Concepts of Data and Knowledge > Motivation and Emer-
gence of Data Mining
1|INTRODUCTION
In recent decades, the world has been invaded by the ubiquitousness of technology in several sectors of society, such as
healthcare, transport, and banking. This digital revolution involves progressively more and more sensors and systems
that continually generate massive amounts of data in an open-ended way as big data streams. A good example is the
large system of interrelated devices and sensors known as Internet of Things (IoT) (Caiming & Yong, 2020; Da Xu,
He, & Li, 2014). The latter has become a key element of life automation, for instance, cars, cellphones, airplanes, and
drones. These devices create huge amounts of data streams that are expected to grow in the near future. By 2019, 26 billion of such devices were connected, and this number is expected to increase to almost 80 billion devices that will be used all over the world by 2025.¹ Therefore, systems and algorithms to handle these vast flows of data must be explored.
Stream data² are defined as unbounded sequences of multidimensional, sporadic, and transient observations made available along time (Bahri, 2020). To automatically extract useful information from a data stream, we need to consider stream computing, which analyzes the data generated at high velocity in real time, as required in big data stream analytics (Kim, 2017). Hence, stream mining tasks have become crucial in multiple real-world applications, for
example, social networks, spam email filters, and more, which demand real-time (or near real-time) analysis since the
data that they generate are drawn from evolving distributions.
Mining data streams has attracted several researchers due to the importance of its applications (Aggarwal, 2007;
Bifet, Gavaldà, Holmes, & Pfahringer, 2018; Gaber, Zaslavsky, & Krishnaswamy, 2005). Amini, Wah, and
Saboohi (2014); Kokate, Deshpande, Mahalle, and Patil (2018); and Carnein and Trautmann (2019) reviewed works on unsupervised learning (clustering), presenting models that are mainly used for density-based clustering. Current
incremental clustering algorithms rely mainly on techniques such as density-microclustering and density-grid that
require several parameters to be effective (Kokate et al., 2018). On the other hand, diverse works on supervised learning
have been proposed, especially in classification, which is perhaps the most commonly researched and active machine
learning task.
Our goal in this paper is to provide the artificial intelligence audience with a brief literature overview of the most
important foundations when dealing with big data streams by shedding light on how the research in the corresponding
framework can progress. While several books (Aggarwal, 2007; Gama & Gaber, 2007) and articles (Gaber et al., 2005;
Gama, 2012; Nguyen, Woon, & Ng, 2015) provide an overview of the state-of-the-art in the stream context, many new
and promising algorithms have emerged since then. Also, each of these reviews generally studies only one machine
learning task; for instance, Losing, Hammer, and Wersing (2018) study the advances in classification, while Gomes et al. (2019b) does not discuss the clustering, regression, and frequent pattern mining tasks. This is a gap that the current paper addresses.
We argue that the most recent advances in data stream research (Bahri et al., 2020; Bahri, Maniu, & Bifet, 2018;
Besedin, Blanchart, Crucianu, & Ferecatu, 2017; Gomes et al., 2017; Gomes et al., 2019; Losing, Hammer, &
Wersing, 2016; Montiel, Read, Bifet, & Abdessalem, 2018; Rodrigues, Araújo, Gama, & Lopes, 2018; Veloso, Gama, &
Malheiro, 2018) make this research area worth revisiting with a more ambitious scope. We first provide the basic con-
cepts in the stream setting while elucidating the challenges and how they could be addressed. Then, we review the pro-
gress in the different stream mining tasks with a particular focus on the most active task, classification, and report reputed and recent approaches and frameworks for data streams.
2|FOUNDATIONS
The unbounded nature of evolving data streams raises some technical and practical limitations that make traditional
algorithms fail because of the high use of resources (such as time and memory) to process dynamic data distri-
butions. In this section, we present the fundamental research issues encountered in the streaming framework.
2.1 |Challenges
The following challenges are mostly common across the different data stream mining tasks that will be presented in the
next sections (Aggarwal & Philip, 2007; Gama & Gaber, 2007; Kolajo, Daramola, & Adebiyi, 2019).
Evolving data streams. To cope with the ever-growing size of data, stream algorithms must address the evolving
high-speed nature and complexity of data, because a stream usually delivers instances quickly. Therefore, stream
mining algorithms should be scalable and process recent instances from the stream in a dynamic fashion. Moreover,
we need scalable frameworks (Section 8) to handle big data streams by adopting efficient resource management strat-
egies and parallelization.
Running time. An online algorithm must process the incoming observations as rapidly as possible. Otherwise, the
algorithm will not be efficient enough for applications where a rapid processing is required.
Memory usage. Massive data streams would require limitless memory to be processed and stored in full, so it is difficult and even impossible to store the entire stream. Thus, any stream algorithm must be able to operate under restricted memory constraints by storing only small synopses of the processed data and the current model(s).
High-dimensionality. In some scenarios, streaming data may be high-dimensional, for example, text documents,
where distances between instances grow exponentially due to the curse of dimensionality. The latter can potentially
impact any algorithm's performance, mainly in terms of time and memory.
Concept drifts. Since data streams are evolving, the underlying distribution may change at any time, an eventuality known as concept drift. This phenomenon can impact the predictive performance of algorithms because the current learned model will no longer be representative of the next incoming data. To deal with new trends, learning algorithms use drift detectors to identify changes as soon as they appear. We refer the readers to Gama, Žliobaite, Bifet, Pechenizkiy, and Bouchachia (2014) for a review on concept drifts.
Delayed labeling. Stream mining algorithms mostly suppose that labels are available before the next instance arrives
(immediate labeling). However, labels may arrive with delay which may be fixed or vary for different instances. Thus,
several algorithms that rely on concept drift detection will underperform when faced with a non-negligible delay to
receive the labeled data. This was illustrated in Gomes et al. (2017), where the authors present the same experiments
using both an immediate labeling setting and a fixed delayed setting.
Imbalanced labels. The predominance of a certain class label over the other(s), referred to as the majority class, may impact learning algorithms' performance because they are designed to optimize for generalization; consequently, the minority class may be ignored (López, Fernández, García, Palade, & Herrera, 2013).
The aforementioned challenges are commonly significant across the different data stream mining tasks. To cope with these requirements, approaches should integrate incremental strategies that satisfy the constraints of this setting. Further challenges that arise in the case of distributed systems, such as integration and heterogeneity, are discussed in Kolajo et al. (2019).
2.2 |Processing
The above-mentioned stream setting requirements can be addressed by using some well-established methods, presented
in the following (Aggarwal & Philip, 2007; Gama & Gaber, 2007):
Single-pass. Unlike processing static datasets, it is no longer possible to analyze data using several passes during the course of computation because of their unbounded nature. Taking into account this constraint, algorithms work by processing each instance from the stream only once (or a couple of times) and use it to update the model (or the statistical information about the data) incrementally (instance-incremental algorithms, see Section 3.1). In the case of batch-incremental algorithms, we process a batch (or chunk) of instances at once instead of only one instance. A minimal sketch of such single-pass processing is given below.
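To make the single-pass constraint concrete, here is a minimal sketch (ours, not code from any cited work) of instance-incremental processing in Python; the running mean and variance stand in for whatever statistics a real stream learner would maintain:

```python
# Minimal sketch of single-pass, instance-incremental processing.
# The "model" here is just a running mean/variance (Welford's method);
# a real stream learner would update its model state in the same way.

def process_stream(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:             # each instance is seen exactly once
        n += 1
        delta = x - mean
        mean += delta / n        # O(1) update, O(1) memory
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

print(process_stream(iter([1.0, 4.0, 2.0, 8.0])))  # instances are not stored
```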
Window models. Storing a data stream and scanning it several times is not allowed. In order to capture significant
information from these evolving data, different kinds of moving windows have been proposed to store a part of the
stream continuously. In the following, we mention the standard ones (Ng & Dash, 2010).
Sliding window model: The window size is fixed to w and each instance is time-stamped, that is, only the most recent instances from the stream are kept inside the window. This moving window slides over the stream while maintaining the same size (Figure 1a).
Landmark window model: This model starts by predefining an instance as a landmark from which the window grows.
Whenever the landmark changes, all the instances will be removed from the window and the new ones will be
maintained starting from the new landmark instance (Figure 1b). The problem with this window model is when the
landmark instance is fixed at the beginning of the flow, consequently, the window will store the whole stream.
Damped window model: The damped model uses a fading function that periodically updates the weights of instances inside the window. The key idea consists in assigning a weight to each instance from the stream that is inversely proportional to its age, that is, more weight is assigned to recently arrived data. When the weight of an instance falls below a given threshold, the instance is removed from the model (see Figure 1c) (Bahri, 2020).

FIGURE 1 Window models: (a) sliding window model; (b) landmark window of size 13; (c) damped window model
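As an illustration (a sketch with made-up decay rate and threshold, not code from the survey), the sliding and damped window models can be emulated in a few lines of Python:

```python
from collections import deque

# Sliding window of fixed size w: appending past capacity evicts the oldest.
w = 3
window = deque(maxlen=w)
for x in [1, 2, 3, 4, 5]:
    window.append(x)
print(list(window))  # [3, 4, 5]: only the w most recent instances remain

# Damped window: weights decay with age; instances whose weight drops
# below a threshold are forgotten (lambda and threshold are illustrative).
lam, threshold, now = 0.5, 0.1, 8.0
arrivals = {"a": 0.0, "b": 4.0, "c": 7.0}          # instance -> arrival time
weights = {k: 2 ** (-lam * (now - t)) for k, t in arrivals.items()}
kept = {k: round(v, 3) for k, v in weights.items() if v >= threshold}
print(kept)  # "a" (age 8) has weight 2^-4 ~ 0.06 and is dropped
```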
Table 1 describes these windows briefly and shows some of their pros and cons. Besides, different approaches have been developed based on these models, where the choice of the window model depends on the application's needs (Ng & Dash, 2010).
2.3 |Summarization
To cope with the resource constraints, brief information and synopses can be constructed from instances to reduce their size for processing, rather than (or in collaboration with) the aforementioned models. This can be realized through the selection of some incoming instances or the construction of synopses of the data. In what follows, we present a brief description of some techniques.
Sampling. Sampling is the simplest way to keep information about data. Since storing unbounded evolving data streams is impractical, it is a crucial step to sample from the stream in order to maintain some representative instances and store a synopsis of the stream in memory (Babcock, Babu, Datar, Motwani, & Widom, 2002). However, this simplicity comes with a cost: the selected instances may not be representative, which can impact the results of mining algorithms.
Histograms. Histograms are commonly used in the offline fashion where multiple passes are allowed, and their extension to the online setting remains challenging. Garofalakis, Gehrke, and Rastogi (2002) proposed some incremental histogram techniques to handle data streams, which fail in some cases where the data distribution is not uniform.
Sketches. A sketch is a probabilistic data structure that stores summaries and approximations of data (Manku & Motwani, 2002). Different sketch-based methods exist that construct synopses of data using a limited amount of memory. For instance, the count-min sketch (Cormode & Muthukrishnan, 2005) was presented as a generalization of the Bloom filter to approximate the counts of objects with strong theoretical guarantees (a minimal implementation is sketched after this list).
Micro-clusters. Micro-clusters represent a method in the stream clustering (Aggarwal, Han, Wang, & Yu, 2003) that is
used to store synopsis information about the instances in the stream and the clusters.
Grids. Grid-based methods partition the data space into small cubes, called grids, and instances from the data stream
are mapped to them.
Dimension reduction. Dimension reduction is a well-known preprocessing method to tackle high-dimensional data
that may increase the cost of any mining algorithm (Bahri et al., 2020). It consists in mapping high-dimensional
instances onto a lower-dimensional representation while conserving the distances between instances (Sorzano, Var-
gas, & Montano, 2014).
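Returning to the sketches item above, the following is a minimal, self-contained count-min sketch in the spirit of Cormode and Muthukrishnan (2005); the width, depth, and salted-hash scheme are illustrative choices of ours, not the original authors' implementation:

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: depth hash rows of fixed width. Estimates
    never underestimate the true count; collisions can only inflate it."""

    def __init__(self, width=1000, depth=5, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def _index(self, item, row):
        # One (approximately independent) hash per row via a per-row salt.
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for token in ["spam", "ham", "spam", "spam"]:
    cms.update(token)
print(cms.estimate("spam"))  # 3 (an upper bound on the true count)
```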
3|SUPERVISED LEARNING
In the following, we assume that the stream S contains an infinite number of instances X_1, …, X_N, …, where N is the number of instances seen so far in the stream and X_i is defined as a vector of a attributes (also called features).
TABLE 1 Window models comparison

Window model    | Definition                                  | Advantages                                                 | Disadvantages
Sliding         | Processes the last received instances       | Suitable when the recent instances are of special interest | Ignores elements from the stream
Landmark        | Processes the entire history of the stream  | Suitable for single-pass algorithms                        | Instances have the same significance
Damped (fading) | Assigns weights to instances                | Suitable when the old instances may affect the results     | Unbounded time window
3.1 |Classification
Classification is a very popular supervised learning task that predicts the class label (or target category) y′ for an unlabeled observation X′ using a model built on labeled instances (X, y), where y′ = f(X′) (Hand, Mannila, & Smyth, 2001). The goal of a classification algorithm is to accurately predict the class labels of instances.
Classification has been commonly used in the batch setting for static data where a training set (labeled instances) is
used to build a model, then a test set (unlabeled instances) is used for prediction to evaluate the model. So, multiple
accesses to the data are allowed in this traditional task. However, traditional (or batch) classifiers are unable to process
evolving data streams due to the requirements of the stream framework (Bifet & Kirkby, 2009). Different from the
learning and prediction phases of batch classifiers, the stream algorithms update their models incrementally in a single-
pass, also, they should work within a limited amount of time to afford real-time processing without delay, and a limited
amount of memory to avoid storing massive quantities of data for the prediction task. Incremental approaches are com-
monly divided into two main categories (Read, Bifet, Pfahringer, & Holmes, 2012):
Instance-incremental approaches (Bifet & Gavaldà, 2009; Domingos & Hulten, 2000; Klawonn & Angelov, 2006; Los-
ing et al., 2016) that update the learned models incrementally with each incoming observation once it arrives. More
approaches are provided in the following sections.
Batch-incremental approaches that receive a batch of data (composed of multiple instances) and use it to update the
models (Bahri et al., 2020; Cortes & Vapnik, 1995; Holmes, Kirkby, & Bainbridge, 2004; Hosmer Jr, Lemeshow, &
Sturdivant, 2013). For instance, Holmes et al. proposed a batch-incremental ensemble method that divides the data
stream into small chunks and builds a global model which combines the models created by all the members (trees)
inside the ensemble.
A relevant study by Read et al. (2012) compares the instance-incremental and the batch-incremental methods on commonly used data streams in the literature, showing the difference in results obtained by both categories. They obtain similar results in terms of accuracy, but the instance-incremental approaches use fewer resources.
Several offline classifiers have been widely studied for the stream setting and turned out to be inefficient with evolv-
ing data streams. Thus, streaming versions of these classifiers have been implemented in this vibrant environment.
Generally, classification algorithms can be grouped into five main categories (Figure 2): (i) frequency-based;
(ii) neighborhood-based; (iii) tree-based; (iv) ensemble-based; and (v) neural network-based algorithms.
3.1.1 |Frequency-based classification
Naive Bayes (NB) (Friedman, Geiger, & Goldszmidt, 1997) is one of the simplest classifiers; it performs prediction (given a test instance) by computing the posterior probability using Bayes' theorem under the assumption that all attributes are independent given the class label. In practice, this naive independence assumption does not always hold, but the classifier achieves surprisingly good results on multiple datasets. NB is naturally an incremental algorithm that does not need adaptation to handle data streams, thanks to its frequency-based scheme that only stores frequencies about instances, as the sketch below illustrates.
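As a concrete illustration of this frequency-based scheme, here is a minimal streaming Naive Bayes for categorical attributes (a sketch of ours with Laplace smoothing, not the algorithm from any cited paper; learn_one and predict_one are hypothetical method names):

```python
from collections import defaultdict

class StreamingNB:
    """Frequency-based Naive Bayes for categorical attributes: learning is
    a matter of incrementing counts, so each instance is processed once."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.attr_counts = defaultdict(int)  # (class, attr_index, value) -> count
        self.n = 0

    def learn_one(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(y, i, v)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for y, cy in self.class_counts.items():
            score = cy / self.n  # prior P(y)
            for i, v in enumerate(x):
                # Laplace-smoothed likelihood P(x_i | y)
                score *= (self.attr_counts[(y, i, v)] + 1) / (cy + 2)
            if score > best_score:
                best, best_score = y, score
        return best

nb = StreamingNB()
nb.learn_one(("sunny", "hot"), "no")
nb.learn_one(("rainy", "cool"), "yes")
print(nb.predict_one(("rainy", "cool")))  # "yes"
```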
FIGURE 2 Taxonomy of data stream classification with some well-known algorithms: frequency-based (Naive Bayes), neighborhood-based (kNN), tree-based (Hoeffding tree), ensemble-based (Bagging), and neural-based (Perceptron)
In order to reduce the resource usage of the NB algorithm, a sketch-based NB algorithm (Bahri et al., 2018) has been proposed recently, which uses the count-min sketch (Cormode & Muthukrishnan, 2005) as a data structure to store approximate frequencies of the data compactly. To further improve the results of the sketch-based NB approach, Bahri et al. incorporated the hashing trick (Weinberger, Dasgupta, Langford, Smola, & Attenberg, 2009), a dimensionality reduction technique, to efficiently handle high-dimensional data streams incrementally.
The multinomial Naive Bayes (Mccallum & Nigam, 1998) is a specific instance of the NB classifier which uses a
multinomial distribution for each of the attributes. It is suitable for classification with discrete attributes (e.g., word
counts for text classification).
3.1.2 |Neighborhood-based classification
k-Nearest Neighbors (kNN) is a lazy learning algorithm that does not build a model but uses the whole dataset to make predictions. Given a test instance, the kNN algorithm computes the distances, using a metric such as the Euclidean distance, between the unlabeled instance and all the others, and picks the k closest ones. Afterwards, by majority vote, kNN predicts for the test instance the most frequent class label among the k nearest neighbors. Since data streams are unbounded, it is not possible to store all the instances for the prediction phase. To cope with this issue, a streaming version of the kNN algorithm has been proposed which uses limited memory via a sliding window (Figure 1a) to store the most recent instances alongside the older ones already in the window (see the sketch below). Nevertheless, the use of resources depends on the window size and the dimensionality of the data. Bifet, Pfahringer, Read, and Holmes (2013) showed that a small window will decrease the accuracy, while a large window will increase the use of resources: it is an accuracy-resources tradeoff.
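The sketch below (our illustration, not a reference implementation) shows kNN over a fixed-size sliding window; memory stays bounded because the deque evicts the oldest instance automatically:

```python
from collections import deque
import math

class SlidingWindowKNN:
    """kNN over a fixed-size sliding window: memory is bounded by w, and
    predictions use only the most recent labeled instances."""

    def __init__(self, k=3, w=1000):
        self.k = k
        self.window = deque(maxlen=w)  # oldest instance evicted automatically

    def learn_one(self, x, y):
        self.window.append((x, y))

    def predict_one(self, x):
        nearest = sorted(
            (math.dist(x, xi), yi) for xi, yi in self.window
        )[: self.k]
        labels = [y for _, y in nearest]
        return max(set(labels), key=labels.count)  # majority vote

knn = SlidingWindowKNN(k=1, w=5)
knn.learn_one((0.0, 0.0), "a")
knn.learn_one((5.0, 5.0), "b")
print(knn.predict_one((0.1, 0.2)))  # "a"
```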
To reduce the resource usage of kNN with high-dimensional data, Bahri et al. proposed a batch-incremental kNN
that incrementally preprocesses the data with uniform manifold approximation and projection (UMAP) (McInnes,
Healy, & Melville, 2018), a neighborhood and graph-based technique, to compress the dimensionality without losing
much in accuracy. In addition to the size of the sliding window, the batch-incremental kNN also has the batch size as an important hyperparameter to fix, which poses an accuracy-resources tradeoff as well.
Self-adjusting memory kNN (samkNN) (Losing et al., 2016) is an incremental kNN algorithm that builds an ensem-
ble with models targeting concept drifts in the stream and uses a dual-memory: (i) short-term memory for the current
concept; and (ii) long-term memory to keep track of past concepts.
3.1.3 |Tree-based classification
Several tree-based algorithms have been proposed in the state-of-the-art to handle evolving data streams (Domingos &
Hulten, 2000; Gama, Rocha, & Medas, 2003; Jankowski, Jackowski, & Cyganek, 2016). For instance, the very fast deci-
sion tree (VFDT) (Domingos & Hulten, 2000), also called Hoeffding tree algorithm (HT), is an incremental decision tree
learner that uses the Hoeffding bound to choose the best split nodes. However, HT does not use an explicit drift detection
mechanism to address the changes that may occur in the data distribution.
To tackle this issue, an improved version of VFDT, the concept-adapting very fast decision tree (CVFDT), has been
proposed (Hulten, Spencer, & Domingos, 2001) to adapt to changes by maintaining a sliding window with the most
recent instances from the stream. CVFDT increments the counts corresponding to the newly arrived instance and decre-
ments the counts of the oldest instance that has fallen out of the moving window. Bifet and Gavaldà proposed an adap-
tive extension of the VFDT algorithm, called Hoeffding adaptive tree (HAT), to handle changes in the data distribution
using the ADaptive WINdowing (ADWIN), a change detector and predictor, that controls the performance of branches
on the tree. If a drift is detected, the tree will be updated by replacing old branches with new ones trained on the cur-
rent concept (Bifet & Gavalda, 2007). Tree-based algorithms demand more resources, particularly (i) memory when
the tree grows, and (ii) time to choose the best split attributes.
Manapragada et al. introduced an incremental decision tree learning algorithm, extremely fast decision tree
(EFDT), that is similar to the Hoeffding tree (or VFDT) algorithm but uses the Hoeffding bound to select a useful split
node and to replace it whenever a better alternative split node is identified. The EFDT algorithm achieves better predic-
tive performance than the HT algorithm on different data but the latter is more efficient computationally (EFDT
running time exceeds that of HT). Besides, EFDT assumes a stationary distribution and is not designed as a learner to deal with drifts (Manapragada, Webb, & Salehi, 2018).
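For reference, the Hoeffding bound used by these trees is ε = sqrt(R² ln(1/δ) / (2n)); the small sketch below computes it and shows a VFDT-style split test (the merit values and δ are made up for illustration):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    random variable with range R is within epsilon of the mean observed
    over n instances, where epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Split decision as in VFDT-style learners: split when the gap between the
# merits (e.g., information gain) of the two best attributes exceeds epsilon.
g_best, g_second = 0.45, 0.30          # illustrative merit values
n, delta = 2000, 1e-7                  # instances seen at the leaf, confidence
eps = hoeffding_bound(value_range=1.0, delta=delta, n=n)
print(round(eps, 4), g_best - g_second > eps)  # ~0.0635, True -> split
```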
3.1.4 |Ensemble-based classification
In its purest form, an ensemble is a set of individual models, whose predictions are combined using majority vote or
similar voting strategies. Ensemble learning is a relatively easy off-the-shelf approach to improve predictive perfor-
mance without building a single optimized model for the given learning problem. These gains come at the cost of com-
putational resources used for training the underlying models; while this cost is often negligible for batch problems, it poses a serious concern in a streaming setting where algorithms must operate under constrained computational resources.
Several theoretical and empirical studies have demonstrated the benefit of combining various "weak" individual classifiers, which leads to better accuracy than a single classifier (Bifet & Gavaldà, 2009; Breiman, 1996; Dietterich, 2000;
Kuncheva, 2014). Besides the predictive performance gains, ensemble-based methods are popular for data stream learn-
ing due to their flexibility to be coupled with concept drift detection strategies. Kuncheva (2014) mentioned three main
reasons for adopting an ensemble-based method over a single learner: (i) Statistical. Assume that samples D_k of a training data set D are used to train K classifiers, such that each of them obtains 100% accuracy on the subset D_k used for training it. It is possible that the generalization capabilities of these classifiers will differ when they are applied to a test set disjoint from D, and therefore their accuracy will vary. From a statistical point of view it is safer to use the mean of the individual predictions from these K classifiers instead of using only one of them, since the chance of selecting the classifier with the worst generalization capabilities is eliminated. There is a chance that the ensemble accuracy is worse than that of the best of its members, but at least we avoid selecting the worst classifier.
(ii) Computational. Some classifiers may converge to a local maximum. Suppose that the local maxima of K classifiers are close to the absolute maximum; there may then be a way of combining them into a model even closer to the absolute maximum (the optimal classifier) than any of them is capable of individually. (iii) Representational. The classifier used may not be able to represent the separation surface of the given problem. For example, a single linear classifier is unable to accurately represent non-linear problems, whereas the combination of several linear classifiers can approximate a nonlinear separation surface.
In a streaming setting, two other reasons for using ensemble methods are worth noting: (iv) Scalability. In several ensemble methods, such as random forests (Breiman, 2001), the base models can be trained independently. These ensembles are embarrassingly parallel, which allows training them in a way that circumvents the computational resource constraints. (v) Concept drift adaptation. Ensemble methods adapt to drifts rapidly by updating or resetting the under-performing model(s) inside the ensemble (Gomes et al., 2017a), which does not require a complete reset of the whole model.
Many streaming ensemble-based methods (Beygelzimer, Kale, & Luo, 2015; Chen, Lin, & Lu, 2012; de Barros, de Carvalho Santos, & Júnior, 2016; Oza, 2005) are adapted versions of Boosting (Freund & Schapire, 1995) and Bagging³ (Breiman, 1996). Online Bagging (Oza, 2005) is a streaming version of Bagging (Breiman, 1996) that uses the Poisson distribution with λ = 1 to resample data, and generates models trained on diverse sets of samples. Unlike the offline Bagging, the stream Bagging assigns weights to samples using a Poisson distribution, which divides samples as follows: 37% (sampled value 1) of instances are used for training only once, 26% (sampled value greater than 1) are trained with repetition, and the remaining 37% (sampled value 0) are not used for training (a minimal sketch of this update is given below). In order to promote ensemble diversity and use training instances more often, Bifet et al. proposed leveraging bagging (LB), similar to the online Bagging, which uses λ = 6 for resampling. LB deals with concept drifts using ADWIN (Bifet & Gavalda, 2007) by resetting the worst performing classifier whenever a change is detected.
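The sketch below illustrates the Poisson-based update shared by Online Bagging (λ = 1) and leveraging bagging (λ = 6); the CountModel base learner and the learn_one/predict_one names are hypothetical stand-ins for any incremental classifier:

```python
import numpy as np

class CountModel:
    """Stand-in base learner that just counts labels (hypothetical)."""
    def __init__(self):
        self.counts = {}
    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def predict_one(self, x):
        return max(self.counts, key=self.counts.get)

def online_bagging_update(ensemble, x, y, lam, rng):
    # Each base model trains on the instance k ~ Poisson(lam) times;
    # k = 0 (about 37% of the time for lam = 1) means it skips the instance.
    for model in ensemble:
        for _ in range(rng.poisson(lam)):
            model.learn_one(x, y)

rng = np.random.default_rng(0)
ensemble = [CountModel() for _ in range(10)]
for x, y in [((1.0,), "a"), ((2.0,), "b"), ((1.1,), "a")]:
    online_bagging_update(ensemble, x, y, lam=1.0, rng=rng)  # lam=6 for LB
votes = [m.predict_one((1.0,)) for m in ensemble if m.counts]
print(max(set(votes), key=votes.count))  # majority vote over members
```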
Streaming random forests (SRF) (Abdulsalam, Skillicorn, & Martin, 2007) is an incremental ensemble method that
adapts the Random Forests (RF) algorithm (Breiman, 2001) to the stream framework. It grows binary Hoeffding trees
and trains each tree on random samples without replacement by keeping a batch of instances. Adaptive Random For-
ests (ARF) (Gomes et al., 2017) is a recent version of the RF algorithm to handle evolving data streams which uses
Hoeffding trees and randomly selects attributes for each node. ARF induces more diversity into the ensemble through the
resampling technique used in LB (Bifet, Holmes, & Pfahringer, 2010). Besides, ARF includes one drift detection algo-
rithm per ensemble member, which dictates when to start training a background tree (i.e., when a warning is signaled),
and when to replace the current tree by the background tree (i.e., when a drift is signaled). Streaming Random Patches
(SRP) (Gomes et al., 2019) is also a novel ensemble-based method that uses a drift detection mechanism similar to the one
presented in ARF and integrates random subspaces (selected globally for each base model) and the online Bagging.
3.1.5 |Neural network-based classification
Neural networks (NNs) are another category of models inspired by the biological neurons that form the nervous system.
In recent years, NNs have attracted the attention of the machine learning community and become one of its most active research directions. However, there is a growing demand for the incremental setting to analyze continuously
evolving streams. The latter impose several challenges that need to be handled because of the potential infinite nature
of data and the real-time processing which raises memory and time issues (Besedin et al., 2017; Rutkowski, Jaworski, &
Duda, 2020).
The Perceptron is the simplest NN, consisting of a single neuron. It is a linear classifier that typically requires sev-
eral iterations over the training data, which is impossible in the stream learning setup. Stream Perceptron has a set of
weights that is updated for each new incoming instance from the stream using Stochastic Gradient Descent (SGD). The
main difference with the batch setting is that instead of doing multiple iterations to improve the accuracy, we only do one pass over the data in the stream setting (Bifet et al., 2010c); a minimal sketch of this update appears below. Pratama et al. proposed a randomized neural network
model that provides a scalable solution, for the adaptation of neural models to the stream scenarios, which is able to
process the data one-by-one (instance-incremental) or in chunks (batch-incremental) (Pratama et al., 2017). In a differ-
ent research work, a new type of neural network called recurrent fuzzy neural network has been proposed to model
and capture the dynamical properties of the fuzzy dynamical systems (Zhou & Da Xu, 2001). This network is suitable
for describing dynamic systems, because it can deal with time-varying input or output through its own natural temporal
operation. Because of its dynamic nature, the recurrent neural network has been successfully applied to a wide variety
of applications such as speech processing and time series forecasting. However, the training of a recurrent neural net-
work could be time consuming and thus inappropriate with evolving data streams (Chang, Chen, & Chang, 2012). Jain, Seera, Lim, and Balasubramaniam (2014) and Rutkowski et al. (2020) reviewed some neural networks that work in the streaming environment by replacing epoch learning with one-by-one or mini-batch learning.
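A minimal sketch of the streaming perceptron update (one gradient-style step per instance, a single pass, labels in {−1, +1}; the learning rate and data are illustrative):

```python
# Minimal streaming perceptron: one update per arriving instance,
# a single pass over the stream, no storage of past data.

def perceptron_update(w, b, x, y, lr=0.1):
    """y is -1/+1; update weights only on a mistake (perceptron rule)."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * activation <= 0:                        # misclassified
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b

w, b = [0.0, 0.0], 0.0
stream = [((1.0, 1.0), 1), ((-1.0, -1.0), -1), ((2.0, 1.5), 1)]
for x, y in stream:                                # single pass
    w, b = perceptron_update(w, b, x, y)
print(w, b)
```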
Deep learning. Recently, deep learning has attracted much attention for applications to data stream processing.
However, only limited progress on online deep learning has been made because the networks should not be too deep to
allow real-time processing. For instance, a generative adversarial network (GAN) has been proposed (Besedin
et al., 2017) to derive a deep network on data streams without the necessity of storing incoming data. This technique
works by regenerating historical training data to compensate for the absence of synopsis of data. In Read, Perez-Cruz,
and Bifet (2015), authors explored two deep learning methods in order to classify semi-labeled data. These methods pro-
vide important advantages in the stream framework, such as learning incrementally in a constant memory usage.
Deep learning is not commonly used so far with evolving data streams due to the high computational cost of train-
ing and its sensitivity to hyper-parameter configurations (e.g., depth, number of neurons) (Marrón, Read, Bifet, &
Navarro, 2017). Moreover, since deep learning methods are resource-intensive, they require powerful GPUs. Another
challenging problem associated with data streams is that the latter may be non-stationary. In order to address this issue
and speed-up the learning, it is more practical to deal with NNs on mini-batches with the assumption that non-
stationary data are separated into chunks (data close in time are assumed to be stationary and follow the same distribu-
tion) (Chen & Lin, 2014).
3.2 |Regression
Regression is a supervised learning task where the goal is to estimate the relationship between a dependent variable
and one or more independent variables. It differs from classification as the dependent variable is numeric. In the con-
text of data streams, regression is often associated with time series analysis and forecasting. However, in a data stream
setting, we assume that data points are independent and identically distributed (iid). Therefore, traditional univariate
and multivariate time series analysis is not applicable.
The most basic regression technique is the simple linear regression, in which a line is fit to represent the relation-
ship between a single independent variable and the dependent variable. Multiple linear regression is an extension to multiple independent variables. In a streaming setting, the process of fitting such a line to the data can be performed in an
online fashion through SGD as in a Perceptron. To fit more complex data, in which a linear relationship is not suffi-
cient, one can rely on polynomial regression, where the relationship between the independent variable and the depen-
dent variable is modeled as an nth degree polynomial. Alternatively, neural networks can be employed (as in
classification) following the same scheme of training with SGD.
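As a sketch of this scheme (our illustration with made-up data and learning rate), simple linear regression can be fit online with one gradient step per arriving instance:

```python
# Online simple linear regression fit by SGD: each instance triggers one
# gradient step on the squared error, then is discarded.

def sgd_step(slope, intercept, x, y, lr=0.01):
    error = (slope * x + intercept) - y   # prediction residual
    slope -= lr * error * x               # gradient of 0.5 * error^2
    intercept -= lr * error
    return slope, intercept

slope, intercept = 0.0, 0.0
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)] * 200:  # simulated stream
    slope, intercept = sgd_step(slope, intercept, x, y)
print(round(slope, 2), round(intercept, 2))  # approaches y ~ 2x
```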
Incremental decision trees are often used in the streaming setting, most notably for classification with Hoeffding Trees (Domingos & Hulten, 2000), but also for regression. The Fast and Incremental Model Trees (FIMT-DD) algorithm (Ikonomovska, Gama, & Džeroski, 2011) builds incremental regression trees similarly to Hoeffding Trees: FIMT-DD starts with an empty tree that keeps statistics at the leaves from arriving data until a grace period is reached; the features are then ranked according to their variance w.r.t. the target variable to decide on splits, and if the two best-ranked features differ by at least the Hoeffding bound (Hoeffding, 1994), the leaf splits. Similarly to other incremental decision trees, FIMT-DD performs concept drift detection, and adaptation, by periodically resetting subbranches of the tree in which significant variance increases are observed.
In the search for predictive performance improvements of incremental regression trees, a common approach is to
ensemble several trees, similarly to ensembling classification models (Gomes et al., 2017a). Ikonomovska, Gama, and
Džeroski (2015) proposed the online random forest (ORF) and online bagging (OBag) ensembles that use the FIMT-DD
as the base learner. Based on empirical experiments, the authors concluded that the ORTO-A (online option trees with
averaging) outperformed both OBag and ORF in terms of mean squared error (MSE). Gomes, Barddal, Ferreira, and
Bifet (2018) proposed Adaptive Random Forest regressor (ARF-Reg), an adaptation of the data stream classifier ARF
(Gomes et al., 2017). ARF-Reg builds a forest of FIMT-DD trees as ORF does; the main difference between the two algorithms
is that ARF-Reg employs one instance of the ADWIN algorithm (Bifet & Gavalda, 2007) per tree to detect concept drifts.
Even though there are some similarities between ensemble classification and regression models, there are also impor-
tant differences, for example, w.r.t. how predictions are combined and how diversity is induced. These were recently
empirically analyzed in (Gomes, Montiel, Mastelini, Pfahringer, & Bifet, 2020).
4|CLUSTERING
Unlike in supervised learning, the instances in clustering are not associated with a discrete class label, as in classification, or a continuous value, as in regression, because clustering methods aim to discover the possible classes from the content of the data (de Souza Viana, de Oliveira, da Silva, Falcão, & Gonçalves, 2018). The big data stream clustering task can be defined as the process that continuously maintains a consistent clustering of the data encountered thus far from the stream while using limited amounts of time and memory (Chen, Oliverio, Kim, & Shen, 2019; Silva et al., 2013). As mentioned before, the infinite nature of the data imposes several challenges and the need to process them in real time. Thus, incremental clustering algorithms are needed to maintain the evolving cluster structures. Moreover, due to the dynamics of the data stream, new clusters might appear, others disappear, and some clusters can move in the instance space.
A recent study (Chen et al., 2019) presents the clustering categories and the related methods while discussing the pros and cons of each of them. However, the authors reviewed methods that are applicable to big data in general and did not examine them in the streaming context.
The main approaches to clustering streaming data can be summarized as follows:
Partitioning clustering organizes a set of instances into some partitions, in such a way that each partition repre-
sents a cluster. These clusters are formed by minimizing some objective function, such as the sum of squares distances
to the cluster centroids. Examples of well-known algorithms include k-means (Farnstrom, Lewis, & Elkan, 2000), and
k-medoids (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003).
Micro-cluster-based clustering divides the process into two principal phases: (i) the online phase summarizes the data stream in micro-clusters; and (ii) the second phase builds a general cluster model using the local micro-clusters. The most representative streaming algorithms are CluStream (Aggarwal et al., 2003) and ClusTree (Kranen, Assent, Baldauf, & Seidl, 2011).
Density-based clustering is based on the idea that a cluster should be built around instances with a significant number of points in their neighborhoods, that is, in dense areas of the instance space. The DBSCAN algorithm (Sander, Ester, Kriegel, & Xu, 1998) is the most representative density-based offline algorithm. Cao, Ester, Qian, and Zhou (2006) presented DenStream, a streaming density-based algorithm, which extends the main concepts of DBSCAN to the streaming setting by using micro-clusters to compute summary statistics online. It uses a fading window model, where the weight of each data point decreases exponentially using the function f(t) = 2^(−λt), where the decay rate is λ > 0 (see the sketch after this list).
Hierarchical clustering, also known as connectivity-based clustering, is mainly composed of agglomerative and divisive methods. Agglomerative clustering concerns "bottom-up" methods, where each instance starts in its own cluster and pairs of clusters are merged as one moves up to form the hierarchy. On the other hand, divisive clustering is a "top-down" method that starts from the top, where all instances are grouped in one cluster, and splits into different clusters recursively as one moves down the hierarchy. However, in the stream setting, instances arrive one by one and are not all available at once. Therefore, hierarchical clustering was presented in the Online Divisive-Agglomerative Clustering (ODAC) system (Rodrigues, Gama, & Pedroso, 2008), which continuously maintains a hierarchical clustering structure while monitoring the diameter of the clusters. The hierarchy grows when more information is available, allowing a more detailed cluster structure. When a change in the correlation structure of the process that generates the data is detected, the hierarchy contracts by merging the clusters where the change was detected.
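As noted in the density-based item above, the fading function drives forgetting; the sketch below (with illustrative λ and arrival times of our choosing) shows how a micro-cluster's weight decays as the sum of its points' faded weights:

```python
# Fading weight used by damped-window clustering such as DenStream:
# f(t) = 2^(-lambda * t), lambda > 0, where t is the time since arrival.

def fading_weight(age, lam=0.25):
    return 2.0 ** (-lam * age)

# A micro-cluster's weight is the sum of the faded weights of its points;
# micro-clusters whose weight decays below a threshold can be pruned.
arrival_times = [0.0, 2.0, 7.0, 9.0]
now = 10.0
weight = sum(fading_weight(now - t) for t in arrival_times)
print(round(weight, 3))  # older points contribute exponentially less
```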
Distributed algorithms for clustering have been developed in the context of sensor networks. There are two main
perspectives: (i) a cluster is a group of data points (Gama, Rodrigues, & Lopes, 2011); and (ii) a cluster is a group of sen-
sors, as in (Rodrigues et al., 2018).
There is no consensus on the topic of evaluating clustering algorithms. Kremer et al. presented the Cluster Mapping Measure, which accounts for multiple types of errors by considering the characteristics of evolving data. Spiliopoulou, Ntoutsi, Theodoridis, and Schult (2006) introduced the MONIC system, which aims to detect and track changes in clusters by assuming that a cluster represents an instance in some geometric space. MONIC works by encompassing changes that involve more than one cluster, enabling insights on cluster change in the entire clustering.
Actually, the transition tracking mechanism depends on the degree of overlap between two clusters. The overlap between any two clusters, C_1 and C_2, can be defined as the number of common instances weighted by the age of the records. Suppose that the clusters C_1 and C_2 are obtained at times t_1 and t_2, respectively. The degree of overlap between these two clusters is then computed as:

$$\mathrm{overlap}(C_1, C_2) = \frac{\sum_{a \in C_1 \cap C_2} \mathrm{age}(a, t_2)}{\sum_{x \in C_1} \mathrm{age}(x, t_2)}$$
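A small sketch of this overlap computation (the linear age function and the cluster contents are hypothetical choices of ours; MONIC's actual age function may differ):

```python
# Sketch of the MONIC-style overlap measure: common instances of C1 and C2,
# weighted by an age function, normalized by the total aged weight of C1.

def age(instance_time, t2, horizon=10.0):
    # Hypothetical linear decay: newer records weigh more.
    return max(0.0, 1.0 - (t2 - instance_time) / horizon)

def overlap(c1, c2, t2):
    """c1, c2: dicts mapping instance id -> arrival time."""
    common = set(c1) & set(c2)
    num = sum(age(c1[a], t2) for a in common)
    den = sum(age(c1[x], t2) for x in c1)
    return num / den if den > 0 else 0.0

c1 = {"p1": 1.0, "p2": 3.0, "p3": 5.0}
c2 = {"p2": 3.0, "p3": 5.0, "p4": 6.0}
print(round(overlap(c1, c2, t2=6.0), 3))  # share of C1's aged mass in C2
```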
This measure permits the deduction of properties that concern the underlying structure of the data stream. A cluster transition at a given time point is defined as a change in a cluster discovered at an earlier time point. The MONIC system considers internal and external transitions that reflect the dynamics and changes in the stream, such as: a cluster survives (it is not going to disappear); a cluster is absorbed (by another cluster); a cluster disappears (is totally removed); a cluster emerges (a new cluster is created). Tracking cluster evolution on panel and longitudinal data appears in Oliveira and Gama (2012).
5|FREQUENT PATTERNS
Frequent pattern mining is an important unsupervised learning task that can be employed to determine the structure of the data, to discover association rules, or to find discriminative attributes that can be exploited for
classification or clustering tasks. Examples of pattern classes can be itemsets, trees, graphs, and sequences (Bifet &
Gavaldà, 2008).
The frequent pattern mining problem is stated as follows: given a batch dataset or a data stream that contains patterns, and a threshold σ, the task consists in finding all the patterns that appear as a subpattern in a fraction σ of the patterns in the data. For instance, if the input data is a stream of purchases in a supermarket and σ = 10%, we would call {cheese, wine} a frequent pattern if at least 10% of the purchases include, among other products, both cheese and wine. Another example involves graphs, where a triangle is considered a graph pattern. Given a dataset of graphs, this pattern would be frequent if at least a fraction σ of the graphs contain at least one triangle.
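As a toy illustration of support counting for the supermarket example (exact counting over a tiny batch; real stream miners rely on approximate synopses instead, as discussed below):

```python
from collections import Counter
from itertools import combinations

# Counting the support of 2-itemsets over a stream of purchases; a pattern
# is frequent if its support reaches a fraction sigma of the transactions.

sigma = 0.10
transactions = [
    {"cheese", "wine", "bread"},
    {"cheese", "wine"},
    {"bread", "milk"},
    {"cheese", "wine", "milk"},
]
support = Counter()
for t in transactions:                        # one pass over the stream
    for pair in combinations(sorted(t), 2):
        support[pair] += 1

n = len(transactions)
frequent = {p: c / n for p, c in support.items() if c / n >= sigma}
print(frequent[("cheese", "wine")])           # 0.75 >= sigma -> frequent
```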
In the offline setting, Apriori, Eclat, and FP-growth are well-known algorithms for discovering frequent itemsets in
datasets. Similar approaches for other data structures, for example, sequences and graphs, can be found in the literature. Nev-
ertheless, the adaptation of these algorithms to the stream setting is not an easy task because they violate the single-pass
requirement and maintain too much information.
To cope with the aforementioned issues, stream approaches for frequent pattern mining have been proposed which use a batch miner as a base, leading to approximate rather than exact results; thus, other online ideas need to be explored. Examples of algorithms for frequent itemset mining in data streams are Moment (Chi, Wang, Yu, & Muntz, 2006) and IncMine (Cheng, Ke, & Ng, 2008).
6|TOWARDS AUTOML
Machine learning has benefited from tremendous research progress recently in many application areas, particularly in the stream setting. The growing number of machine learning algorithms and their respective hyperparameters gives rise to a number of configurations whose selection relies on qualified experts (i.e., human intervention and expertise).
There is no doubt that current algorithms (some of them mentioned in the previous sections) are suitable for data streams, but they usually require the configuration to be set in advance (e.g., the size of the ensemble for ensemble-based methods, the number of neighbors k for the kNN algorithm). Moreover, stream approaches are not totally automated; that is, the parameterization set at the beginning may not hold for all parts of the stream, since models may change over time because of concept drifts. Thus, how can this matter be dealt with?
Auto Machine Learning (autoML) (Hutter, Kotthoff, & Vanschoren, 2019) is a new tool that is receiving increased attention and aims to tackle the parameter configuration issue using automatic monitoring models. Multiple systems⁴ have been proposed in the offline setting that allow hyperparameter tuning by combining autoML with famous machine learning software such as Weka and Scikit-learn (AutoWeka⁵ and AutoSklearn,⁶ respectively) (Feurer et al., 2015).
Yet, a very limited number of contributions on AutoML for evolving data streams exist in the literature. For
instance, Self Parameter Tuning (SPT) is an automated technique that controls the stream algorithm configuration by
incrementally selecting the best hyperparameter(s) that may change over time (Veloso et al., 2018).
We consider that automatic algorithm configuration for data stream mining can be revolutionary. In fact, selecting
the best hyperparameter configuration for stream algorithms is a tedious task because it may change depending on the
characteristics (e.g., number of attributes) and contents of data. Hence, tuning incremental algorithms' configurations
automatically and continuously is a very promising direction in machine learning for data streams.
7|EVALUATION METRICS
In the stream setting, two main, strongly related evaluation axes are involved in assessing the efficiency of algorithms along with their quality (e.g., accuracy for classification): (i) the execution time, which includes any preprocessing (such as dimension reduction), prediction, and learning steps; and (ii) the memory used by an algorithm, which comprises the storage needed to keep the current model(s) together with the statistical information required to maintain the incremental processing (e.g., the number of instances received so far from the stream).
In the context of supervised learning, it is important to evaluate the trained model and test its applicability in different scenarios on different data streams. The prequential evaluation (Dawid, 1984), also called interleaved test-then-train, is the most used evaluation method proposed specifically to assess the performance of data stream algorithms incrementally. This scheme consists in using each instance first to test (obtain a prediction from) the current learned model, and thereafter to update (or train) the model; a minimal sketch is given below. Another important evaluation task consists in performing the holdout evaluation, which uses different test and training datasets. Bifet et al. (2015) propose an evaluation methodology for big data streams that addresses different scenarios, including unbalanced data and data where change occurs on different time scales. Most notably, Bifet et al. (2015) introduce adaptations of cross-validation to the streaming setting.
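A minimal prequential (test-then-train) loop is sketched below; the MajorityClass baseline and the learn_one/predict_one method names are hypothetical stand-ins for any incremental learner:

```python
class MajorityClass:
    """Trivial incremental baseline used only to make the loop runnable."""
    def __init__(self):
        self.counts = {}
    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def predict_one(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

def prequential_accuracy(model, stream):
    correct, total = 0, 0
    for x, y in stream:
        if total > 0:                    # test first (skip the very first instance)
            correct += int(model.predict_one(x) == y)
        model.learn_one(x, y)            # ...then train on the same instance
        total += 1
    return correct / max(1, total - 1)

stream = [((0,), "a"), ((1,), "a"), ((2,), "b"), ((3,), "a")]
print(prequential_accuracy(MajorityClass(), stream))  # 2 of 3 tested correctly
```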
Several metrics exist to measure the performance of classification algorithms. Most of them are easily applied to data
stream classification. Accuracy is an intuitive metric that measures the percentage of correctly classified instances with
respect to all predictions made. If the distribution of examples across the class labels is imbalanced, then accuracy can
be misleading as a model that always predicts the majority class will yield high accuracy. In these cases, metrics such as
sensitivity, specificity, and g-mean are better alternatives.
The most common metrics used to evaluate regression predictions are (i) the root mean squared error (RMSE), the square root of the average squared difference between the target value and the value predicted by the model; and (ii) the mean absolute error (MAE), the average absolute difference between the target value and the value predicted by the model. Both can be maintained incrementally, as sketched below.
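Both metrics fit the single-pass constraint, since they can be maintained from running sums; a minimal sketch of ours:

```python
import math

class RegressionMetrics:
    """Incremental RMSE and MAE: running sums only, O(1) memory,
    consistent with the single-pass constraint of the stream setting."""

    def __init__(self):
        self.n = self.abs_sum = self.sq_sum = 0

    def update(self, y_true, y_pred):
        err = y_true - y_pred
        self.n += 1
        self.abs_sum += abs(err)
        self.sq_sum += err * err

    @property
    def mae(self):
        return self.abs_sum / self.n

    @property
    def rmse(self):
        return math.sqrt(self.sq_sum / self.n)

m = RegressionMetrics()
for yt, yp in [(3.0, 2.5), (5.0, 5.5), (4.0, 4.0)]:
    m.update(yt, yp)
print(round(m.mae, 3), round(m.rmse, 3))  # 0.333, 0.408
```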
Another topic of interest is the evaluation of data streams when there is a non-negligible delay between the arrival of an instance and its corresponding label. Grzenda, Gomes, and Bifet (2019) claim that besides "how" predictions affect the predictive performance, it is also essential to consider "when" labels are made available as part of the evaluation. This leads to the concept of continuous re-evaluation introduced in Grzenda et al. (2019) and further explored in Grzenda, Gomes, and Bifet (2020). The goal of continuous re-evaluation is to observe if, and how fast, models can transform an initially incorrect prediction into a correct prediction before the true label arrives in a streaming setting.
Finally, a multitude of evaluation measures (also called validation measures) have been proposed to evaluate the
quality of resulting clusterings. We direct the readers to Kremer et al. (2011), which discusses and compares these measures in extensive experiments.
8|OPEN SOURCE SOFTWARE
Multiple frameworks for data stream mining have been proposed in the literature. The available open source software packages contain a multitude of state-of-the-art algorithms that can be extended to propose new approaches and/or to compare against. In the following, we cite some widely used software packages with actively growing communities, along with newer ones. Massive online analysis (MOA): MOA⁷
(Bifet et al., 2010) is the most popular open
source framework for machine learning on evolving data streams, written in Java and implemented on top of the Waikato Environment for Knowledge Analysis (WEKA),⁸ with a very active research community. MOA provides different data
generators (e.g., SEA and LED generators), stream mining algorithms (e.g., algorithms for classification, clustering,
regression, anomaly detection), evaluation methods (e.g., the prequential evaluation), and statistics to evaluate the per-
formance of algorithms (e.g., memory, time, accuracy, kappa). This software can be used via a command line or a user
interface. A recent book (Bifet et al., 2018) has been published that discusses MOA and how to use it along with exer-
cises and lab sessions. Generally, researchers working with MOA make their contribution code available as MOA extensions.⁹
Scalable advanced massive online analysis (SAMOA): SAMOA¹⁰ (Morales & Bifet, 2015) is presented as a library as
well as a framework, written in Java, that combines data stream analysis and distributed processing. SAMOA allows
the creation of distributed stream machine learning algorithms and runs them on distributed stream processing engines
in a fast and scalable manner. This framework provides a collection of distributed versions of some data stream algo-
rithms (e.g., bagging, boosting).
StreamDM: StreamDM¹¹ is an open source framework for online machine learning which utilizes Spark Streaming
to enable stream processing from a variety of sources. The main advantages of StreamDM are (i) the benefit from the
use of Spark streaming API that enables scalable stream processing in order to handle issues such as out of order data
in data sources; and (ii) the fact that it allows combining batch processing algorithms with streaming algorithms (Bifet et al., 2015).
Scikit-multiflow: Scikit-multiflow¹² (Montiel et al., 2018) is a new open source software package designed for multi-label/
multi-output and single-output stream learning algorithms and inspired by the popular frameworks, scikit-learn
(Pedregosa et al., 2011), MOA (Bifet et al., 2010), and MEKA,¹³ to fill the void in Python for data stream mining. Similar
to MOA, scikit-multiflow also contains stream generators and algorithms for data streams. More recently, scikit-multiflow has merged with the creme framework (https://maxhalford.github.io/) into a new Python project called River. Given the increasing popularity of the Python programming language, the advantage of using the scikit-multiflow framework, which complements scikit-learn (focused only on batch learning), is its similarity to the latter, which is widely used by researchers and practitioners. Moreover, it can be used within the popular Jupyter Notebook interface, often used by the data science community. On the other hand, the notable drawback is that this software may be slow in comparison with MOA, because Python code is expected to execute more slowly than Java code.
Comparative studies (Behera, Das, Jena, Rath, & Sahoo, 2017; Gomes et al., 2019b; Inoubli, Aridhi, Mezni, Mad-
douri, & Nguifo, 2018) on the stream frameworks, with an evaluation of the performance in terms of resource consump-
tion, have been provided with a focus on the distributed stream processing tools, such as Apache Samza, Apache Spark,
Apache Flink, and Apache Storm.
9|CONCLUSION
The aim of this survey paper is to present a holistic view of data stream mining by reviewing the stream mining chal-
lenges and foundations. We also conducted a comprehensive literature review of the baseline algorithms in data stream
mining and discussed the most promising and the recent ones. Moreover, this survey presents some principal metrics
for algorithm evaluation and well-known, actively growing open-source software for the stream environment. We hope
that this summary provides a clear overview of the main challenges, basics, and recent advances, as well as some open
directions in the stream setting to the AI community.
ACKNOWLEDGEMENTS
This work has been carried out in the frame of a cooperation between Huawei Technologies France SASU and Télécom
Paris (Grant no. YBN2018125164).
CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
AUTHOR CONTRIBUTIONS
Albert Bifet: Conceptualization. Heitor Gomes: Conceptualization. Silviu Maniu: Conceptualization.
DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no data sets were generated or analyzed during the current study.
ENDNOTES
* Email: maroua.bahri@telecom-paris.fr
1 www.statista.com/statistics/976079/number-of-iot-connected-objects-worldwide-by-type/
2 In the sequel, we use the terms streaming, online, and incremental interchangeably.
3 The offline Bagging applies resampling, that is, sampling with replacement, to train its ensemble members on different subsets of instances.
4 https://www.automl.org/automl/
5 https://www.automl.org/automl/autoweka/
6 https://www.automl.org/automl/auto-sklearn/
7 https://moa.cms.waikato.ac.nz/
8 https://www.cs.waikato.ac.nz/ml/weka/
9 https://moa.cms.waikato.ac.nz/moa-extensions/
10 http://samoa.incubator.apache.org
11 http://huawei-noah.github.io/streamDM/
12 https://scikit-multiflow.github.io
13 The MEKA project provides algorithms for multi-label learning and evaluation.
REFERENCES
Abdulsalam H, Skillicorn DB, Martin P. Streaming random forests. In: International Database Engineering and Applications Symposium (IDEAS). Banff, Canada: IEEE; 2007, 225–232.
Aggarwal CC. Data streams: Models and algorithms, vol. 31. New York: Springer Science & Business Media; 2007.
Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: International Conference on Very Large Data Bases. VLDB Endowment; 2003, 81–92.
Aggarwal CC, Philip SY. A survey of synopsis construction in data streams. In: Data streams. Springer; 2007, 169–207.
Amini A, Wah TY, Saboohi H. On density-based data streams clustering algorithms: A survey. Journal of Computer Science and Technology 2014, 29:116–141.
Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: ACM SIGMOD. New York: ACM; 2002, 1–16.
Bahri M. Improving IoT data stream analytics using summarization techniques (Ph.D. thesis). Institut Polytechnique de Paris; 2020.
Bahri M, Bifet A, Maniu S, Gomes HM. Survey on feature transformation techniques for data streams. In: International Joint Conference on Artificial Intelligence (IJCAI). Yokohama; 2020.
Bahri M, Maniu S, Bifet A. Sketch-based naive Bayes algorithms for evolving data streams. In: International Conference on Big Data. Seattle: IEEE; 2018, 604–613.
Bahri M, Pfahringer B, Bifet A, Maniu S. Efficient batch-incremental classification for evolving data streams. In: Intelligent Data Analysis (IDA). Konstanz: Springer; 2020.
Behera RK, Das S, Jena M, Rath SK, Sahoo B. A comparative study of distributed tools for analyzing streaming data. In: International Conference on Information Technology (ICIT). Toronto: IEEE; 2017, 79–84.
Besedin A, Blanchart P, Crucianu M, Ferecatu M. Evolutive deep models for online learning on data streams with no storage. 2017.
Beygelzimer A, Kale S, Luo H. Optimal and adaptive algorithms for online boosting. In: International Conference on Machine Learning (ICML). Lille; 2015, 2323–2331.
Bifet A, de Francisci Morales G, Read J, Holmes G, Pfahringer B. Efficient online evaluation of big data stream classifiers. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney; 2015, 59–68.
Bifet A, Gavaldà R. Learning from time-changing data with adaptive windowing. In: International Conference on Data Mining (ICDM). SIAM; 2007, 443–448.
Bifet A, Gavaldà R. Mining adaptively frequent closed unlabeled rooted trees in data streams. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, Nevada, USA; 2008, 34–42. https://doi.org/10.1145/1401890.1401900.
Bifet A, Gavaldà R. Adaptive learning from evolving data streams. In: Intelligent Data Analysis (IDA). Lyon: Springer; 2009, 249–260.
Bifet A, Gavaldà R, Holmes G, Pfahringer B. Machine learning for data streams: With practical examples in MOA. MIT Press; 2018.
Bifet A, Holmes G, Kirkby R, Pfahringer B. MOA: Massive online analysis. Journal of Machine Learning Research (JMLR) 2010, 11:1601–1604.
Bifet A, Holmes G, Pfahringer B. Leveraging bagging for evolving data streams. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Barcelona: Springer; 2010, 135–150.
Bifet A, Holmes G, Pfahringer B, Frank E. Fast perceptron decision tree learning from evolving data streams. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Springer; 2010, 299–310.
Bifet A, Kirkby R. Data stream mining: A practical approach. 2009.
Bifet A, Maniu S, Qian J, Tian G, He C, Fan W. StreamDM: Advanced data mining in Spark Streaming. In: International Conference on Data Mining Workshop (ICDMW). Atlantic City: IEEE; 2015, 1608–1611.
Bifet A, Pfahringer B, Read J, Holmes G. Efficient data stream classification via probabilistic adaptive windows. In: Symposium on Applied Computing (SIGAPP). Coimbra: ACM; 2013, 801–806.
Breiman L. Bagging predictors. Machine Learning 1996, 24:123–140.
Breiman L. Random forests. Machine Learning 2001, 45:5–32.
Caiming Z, Yong C. A review of research relevant to the emerging industry trends: Industry 4.0, IoT, blockchain, and business analytics. Journal of Industrial Integration and Management 2020, 5:165–180.
Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In: Ghosh J, Lambert D, Skillicorn DB, Srivastava J, eds. Sixth SIAM International Conference on Data Mining. Bethesda, MD, USA: SIAM; 2006, 328–339.
Carnein M, Trautmann H. Optimizing data stream representation: An extensive survey on stream clustering algorithms. Business & Information Systems Engineering 2019, 61:277–297.
Chang L-C, Chen P-A, Chang F-J. Reinforced two-step-ahead weight adjustment technique for online training of recurrent neural networks. IEEE Transactions on Neural Networks and Learning Systems 2012, 23:1269–1278.
Chen S-T, Lin H-T, Lu C-J. An online boosting algorithm with theoretical justifications. In: International Conference on Machine Learning (ICML). Edinburgh; 2012.
Chen W, Oliverio J, Kim JH, Shen J. The modeling and simulation of data clustering algorithms in data mining with big data. Journal of Industrial Integration and Management 2019, 4:1850017.
Chen X-W, Lin X. Big data deep learning: Challenges and perspectives. IEEE Access 2014, 2:514–525.
Cheng J, Ke Y, Ng W. Maintaining frequent closed itemsets over a sliding window. Journal of Intelligent Information Systems 2008, 31:191–215. https://doi.org/10.1007/s10844-007-0042-3.
Chi Y, Wang H, Yu PS, Muntz RR. Catch the moment: Maintaining closed frequent itemsets over a data stream sliding window. Knowledge and Information Systems 2006, 10:265–294. https://doi.org/10.1007/s10115-006-0003-0.
Cormode G, Muthukrishnan S. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms 2005, 55:58–75.
Cortes C, Vapnik V. Support-vector networks. Machine Learning 1995, 20:273–297.
Da Xu L, He W, Li S. Internet of things in industries: A survey. IEEE Transactions on Industrial Informatics 2014, 10:2233–2243.
Dawid AP. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General) 1984, 147:278–290.
de Barros RSM, de Carvalho Santos SGT, Júnior PMG. A boosting-like online learning ensemble. In: International Joint Conference on Neural Networks (IJCNN). Vancouver: IEEE; 2016, 1871–1878.
de Souza Viana TS, de Oliveira M, da Silva TLC, Falcão MSR, Gonçalves EJT. A message classifier based on multinomial naive Bayes for online social contexts. Journal of Management Analytics 2018, 5:213–229.
Dietterich TG. Ensemble methods in machine learning. In: International Workshop on Multiple Classifier Systems. Cagliari: Springer; 2000, 1–15.
Domingos P, Hulten G. Mining high-speed data streams. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Boston: ACM; 2000, 71–80.
Farnstrom F, Lewis J, Elkan C. Scalability for clustering algorithms revisited. ACM SIGKDD Explorations Newsletter 2000, 2:51–57.
Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F. Efficient and robust automated machine learning. Advances in Neural Information Processing Systems 2015, 28:2962–2970.
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory. Barcelona: Springer; 1995, 23–37.
Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997, 29:131–163.
Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: A review. ACM SIGMOD Record 2005, 34:18–26.
Gama J. A survey on learning from data streams: Current and future trends. Progress in Artificial Intelligence 2012, 1:45–55.
Gama J, Gaber MM. Learning from data streams: Processing techniques in sensor networks. Springer; 2007.
Gama J, Rocha R, Medas P. Accurate decision trees for mining high-speed data streams. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Washington DC: ACM; 2003, 523–528.
Gama J, Rodrigues PP, Lopes LMB. Clustering distributed sensor data streams using local processing and reduced communication. Intelligent Data Analysis 2011, 15:3–28.
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A. A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 2014, 46:44.
Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look (a tutorial). In: ACM SIGMOD International Conference on Management of Data. 2002, 635.
Gomes HM, Barddal JP, Enembreck F, Bifet A. A survey on ensemble learning for data stream classification. ACM Computing Surveys (CSUR) 2017, 50:23.
Gomes HM, Barddal JP, Ferreira LEB, Bifet A. Adaptive random forests for data stream regression. In: European Symposium on Artificial Neural Networks (ESANN). Bruges: Springer; 2018.
Gomes HM, Bifet A, Read J, Barddal JP, Enembreck F, Pfahringer B, Holmes G, Abdessalem T. Adaptive random forests for evolving data stream classification. Machine Learning 2017, 106:1–27.
Gomes HM, Montiel J, Mastelini SM, Pfahringer B, Bifet A. On ensemble techniques for data stream regression. In: IEEE International Joint Conference on Neural Networks (IJCNN). Glasgow: IEEE; 2020.
Gomes HM, Read J, Bifet A. Streaming random patches for evolving data stream classification. In: International Conference on Data Mining (ICDM). Beijing: IEEE; 2019.
Gomes HM, Read J, Bifet A, Barddal JP, Gama J. Machine learning for streaming data: State of the art, challenges, and opportunities. ACM SIGKDD Explorations Newsletter 2019, 21:6–22.
Grzenda M, Gomes HM, Bifet A. Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery 2019, 1–30. Springer.
Grzenda M, Gomes HM, Bifet A. Performance measures for evolving predictions under delayed labelling classification. In: International Joint Conference on Neural Networks (IJCNN). Glasgow: IEEE; 2020, 1–8.
Guha S, Meyerson A, Mishra N, Motwani R, O'Callaghan L. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering 2003, 15:515–528.
Hand DJ, Mannila H, Smyth P. Principles of data mining. London: MIT Press; 2001.
Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association; reprinted in The collected works of Wassily Hoeffding. New York: Springer; 1994, 409–426.
Holmes G, Kirkby RB, Bainbridge D. Batch-incremental learning for mining data streams. 2004.
Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression, vol. 398. Wiley; 2013.
Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. San Francisco: ACM; 2001, 97–106.
Hutter F, Kotthoff L, Vanschoren J. Automated machine learning. Springer; 2019.
Ikonomovska E, Gama J, Džeroski S. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery 2011, 23:128–168.
Ikonomovska E, Gama J, Džeroski S. Online tree-based ensembles and option trees for regression on evolving data streams. Neurocomputing 2015, 150:458–470.
Inoubli W, Aridhi S, Mezni H, Maddouri M, Nguifo E. A comparative study on streaming frameworks for big data. In: Very Large Data Bases (VLDB). Rio De Janeiro: Springer; 2018.
Jain LC, Seera M, Lim CP, Balasubramaniam P. A review of online learning in supervised neural networks. Neural Computing and Applications 2014, 25:491–509.
Jankowski D, Jackowski K, Cyganek B. Learning decision trees from data streams with concept drift. Procedia Computer Science 2016, 80:1682–1691.
Kim JH. Integrating IoT with LQR-PID controller for online surveillance and control of flow and pressure in fluid transportation system. Journal of Industrial Integration and Management 2017, 17:100–127.
Klawonn F, Angelov P. Evolving extended naive Bayes classifiers. In: International Conference on Data Mining Workshops. Hong Kong: IEEE; 2006, 643–647.
Kokate U, Deshpande A, Mahalle P, Patil P. Data stream clustering techniques, applications, and models: Comparative analysis and discussion. Big Data and Cognitive Computing 2018, 2:32.
Kolajo T, Daramola O, Adebiyi A. Big data stream analysis: A systematic literature review. Journal of Big Data 2019, 6:47.
Kranen P, Assent I, Baldauf C, Seidl T. The ClusTree: Indexing micro-clusters for anytime stream mining. Knowledge and Information Systems 2011, 29:249–272.
Kremer H, Kranen P, Jansen T, Seidl T, Bifet A, Holmes G, Pfahringer B. An effective evaluation measure for clustering on evolving data streams. In: Apté C, Ghosh J, Smyth P, eds. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego: ACM; 2011, 868–876.
Kuncheva LI. Combining pattern classifiers: Methods and algorithms. Canada: John Wiley & Sons; 2014.
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences 2013, 250:113–141.
Losing V, Hammer B, Wersing H. KNN classifier with self adjusting memory for heterogeneous concept drift. In: International Conference on Data Mining (ICDM). Barcelona: IEEE; 2016, 291–300.
Losing V, Hammer B, Wersing H. Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing 2018, 275:1261–1274.
Manapragada C, Webb GI, Salehi M. Extremely fast decision tree. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London: ACM; 2018, 1953–1962.
Manku GS, Motwani R. Approximate frequency counts over data streams. In: Very Large Data Bases (VLDB). Hong Kong: Elsevier; 2002, 346–357.
Marrón D, Read J, Bifet A, Navarro N. Data stream classification using random feature functions and novel method combinations. Journal of Systems and Software 2017, 127:195–204.
McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In: AAAI Workshop on Learning for Text Categorization. Citeseer; 1998, 752:41–48.
McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426; 2018.
Montiel J, Read J, Bifet A, Abdessalem T. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research (JMLR) 2018, 19:2915–2914.
Morales GDF, Bifet A. SAMOA: Scalable advanced massive online analysis. Journal of Machine Learning Research (JMLR) 2015, 16:149–153.
Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In: Transactions on large-scale data- and knowledge-centered systems II. Berlin: Springer; 2010, 1–30.
Nguyen H-L, Woon Y-K, Ng W-K. A survey on data stream clustering and classification. Knowledge and Information Systems 2015, 45:535–569.
Oliveira MDB, Gama J. A framework to monitor clusters evolution applied to economy and finance problems. Intelligent Data Analysis 2012, 16:93–111.
Oza NC. Online bagging and boosting. In: International Conference on Systems, Man and Cybernetics, vol. 3. Waikoloa, Hawaii: IEEE; 2005, 2340–2345.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR) 2011, 12:2825–2830.
Pratama M, Angelov PP, Lu J, Lughofer E, Seera M, Lim CP. A randomized neural network for data streams. In: International Joint Conference on Neural Networks (IJCNN). Anchorage: IEEE; 2017, 3423–3430.
Read J, Bifet A, Pfahringer B, Holmes G. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In: Intelligent Data Analysis (IDA). Helsinki: Springer; 2012, 313–323.
Read J, Perez-Cruz F, Bifet A. Deep learning in partially-labeled data streams. In: Annual ACM Symposium on Applied Computing. Salamanca: ACM; 2015, 954–959.
Rodrigues PP, Araújo J, Gama J, Lopes LMB. A local algorithm to approximate the global clustering of streams generated in ubiquitous sensor networks. International Journal of Distributed Sensor Networks 2018, 14.
Rodrigues PP, Gama J, Pedroso JP. Hierarchical clustering of time-series data streams. IEEE Transactions on Knowledge and Data Engineering 2008, 20:615–627. https://doi.org/10.1109/TKDE.2007.190727.
Rutkowski L, Jaworski M, Duda P. Probabilistic neural networks for the streaming data classification. In: Stream data mining: Algorithms and their probabilistic properties. Springer; 2020, 245–277.
Sander J, Ester M, Kriegel H, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 1998, 2:169–194. https://doi.org/10.1023/A:1009745219419.
Silva JA, Faria ER, Barros RC, Hruschka ER, de Leon Ferreira de Carvalho ACP, Gama J. Data stream clustering: A survey. ACM Computing Surveys 2013, 46:13:1–13:31.
Sorzano COS, Vargas J, Montano AP. A survey of dimensionality reduction techniques. arXiv preprint arXiv:1403.2877; 2014.
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R. MONIC: Modeling and monitoring cluster transitions. In: ACM International Conference on Knowledge Discovery and Data Mining. Philadelphia: ACM Press; 2006, 706–711.
Veloso B, Gama J, Malheiro B. Self hyper-parameter tuning for data streams. In: Data Streams. Split: Springer; 2018.
Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: International Conference on Machine Learning (ICML). Montreal: ACM; 2009, 1113–1120.
Zhou SM, Da Xu L. A new type of recurrent fuzzy neural network for modeling dynamic systems. Knowledge-Based Systems 2001, 14:243–251.
How to cite this article: Bahri M, Bifet A, Gama J, Gomes HM, Maniu S. Data stream analysis: Foundations,
major tasks and tools. WIREs Data Mining Knowl Discov. 2021;e1405. https://doi.org/10.1002/widm.1405