CLASSIFICATION OF CONCEPT DRIFT IN EVOLVING DATA STREAM

Abstract

The concept of the data stream has emerged from the evolution of technologies in domains such as banking, e-commerce, social media, and many others. A data stream is a sequence of data instances generated at very high speed, which makes it hard to store in memory. As a result, it has become hard to extract knowledge from a continuous data stream using traditional data mining. Data stream mining algorithms should fulfill requirements such as limited memory use, concept drift detection, and one-scan processing. Concept drift refers to changes in the data stream distribution caused by, for example, changes in the environment, individual preferences, or adversary activities, and it must be tracked to avoid poor performance and inaccurate results from predictive models. In this chapter, we present the data stream mining components. The problem of concept drift in classification algorithms and several state-of-the-art handling methods are highlighted. In addition, the most used datasets, tools, applications, and evaluation methods are presented.
CHAPTER 7
A REVIEW ON CLASSIFICATION OF
CONCEPT DRIFT IN EVOLVING DATA
STREAM
Mashail Althabiti, Manal Abdullah
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, SA.
Emerging Extended Reality Technologies for Industry 4.0.
Edited by Jolanda G. Tromp et al. Copyright © 2020 Scrivener Publishing
7.1 Introduction
Nowadays, millions of people around the world share data anywhere and at any time. Emerging technologies in telecommunications, the entertainment industry, social media sites, banking services, and other applications have led to massive growth in generated data streams. A data stream is a sequence of data examples that are produced at a very high rate and arrive continuously as a potentially infinite stream. According to the "10 Key Marketing Trends For 2017" report, 90% of the data in the world has been produced in the last two years alone, at 2.5 quintillion bytes each day [1]. As a result, this massive amount of data cannot be stored for further processing and mining. The concept of data stream mining has therefore emerged to extract knowledge from the data stream and provide real-time processing.
However, data stream mining must cope with constraints such as concept drift, limited memory, and a single scan of the data. Concept drift refers to changes in data concepts over time. In non-stationary environments, the underlying data distribution may change, so a model learned from earlier data is no longer consistent with the new data and produces wrong predictions and inaccurate results. An example is predicting a customer's shopping behavior after her or his preferences have changed: a model based on old data will produce wrong results. Concept drift must therefore be tracked and handled using detection methods that can cope with the data stream. Data stream mining with concept drift handling has been studied extensively over the last decade.
Many surveys have reviewed the data stream mining literature from different perspectives. The authors in [2] surveyed state-of-the-art adaptive learning algorithms with concept drift detection. They addressed concept drift through various applications and highlighted several evaluation techniques. In [3], the authors presented general criteria to help researchers design their concept drift handling methods, and they categorized the existing concept drift algorithms according to these criteria. In addition, 14 drift detectors have been evaluated on data that includes six artificial datasets and compared in terms of accuracy and detection. The authors used Naive Bayes (NB) and Hoeffding Tree classifiers to test the drift detectors.
7.2 Data Mining
Traditional data mining can be defined as the process of finding and extracting valuable knowledge and patterns from massive data sets [4]. It can be applied to different kinds of data such as database data, data warehouse data, text data, and multimedia data. The data mining process is done in an offline mode, where historical data are used to train the predictive model.
Data mining consists of the following steps as shown in Figure 7.1 [5]:
Figure 7.1 Traditional data mining process.
(1) Problem definition: this step is concerned with understanding the problem and its objectives and formulating the hypothesis.
(2) Data collection: the data can be generated under an expert's control, or collected without the expert's influence.
(3) Data preprocessing: this step is composed of several tasks such as detecting outliers or other abnormal values in the collected data, scaling variables, encoding, and selecting features.
(4) Modeling: this step is concerned with developing an appropriate model using a data mining technique.
(5) Model interpretation: in this step, the model is interpreted to draw conclusions and make decisions.
7.3 Data Stream Mining
Learning a model is considered an essential step in data mining and machine learning [6]. Previously, it was done in static environments where the whole dataset is available, stored, and can be accessed many times. In contrast, learning from massive datasets in non-stationary environments has become a challenging area. The huge volume of continuous data generated by everyday applications has given rise to the concept of the data stream.
A data stream is a sequence of potentially non-stop data instances that can be read and processed only once. As technologies evolve, it has become hard for traditional data mining to deal with the streams of data from the Internet of Things (IoT), web searches, banking transactions, and many more [7]. Thus, data stream mining has become an attractive research area.
Data stream mining refers to the process of finding knowledge and valuable patterns in continuous, potentially infinite, and high-volume data streams. It plays an essential role in predictive modeling and decision-making.
The main differences between traditional data mining and stream data mining are pre-
sented in Table 7.1 [8].
Table 7.1 Comparison of Data Mining and Data Stream Mining
7.3.1 Data Stream Challenges
Data stream mining has several challenges that must be overcome, including the following:
Resource constraints: the data stream is potentially infinite, huge, and arrives at high speed, so it is hard to store in memory. Also, the processing time must be as short as possible [5].
One scan: the data stream cannot be accessed randomly or multiple times [5].
Data preprocessing: since data arrives continuously, it is not feasible to use manual data preprocessing methods. Preprocessing should be fully automated and updated automatically as the data evolves [9].
Privacy and confidentiality: the data stream is infinite and arrives in portions, so the information available at any time is incomplete. In this case, it is hard to judge the privacy of a model that has a data stream as input [9].
Concept drift: the distribution of the data instances may change over time.
Parameter dependence: algorithms with many parameters are impractical for data stream applications, so it is desirable for data stream algorithms to have few user-adjustable parameters [9].
Distinguishing between true concept drift and noise: some algorithms interpret noise as concept drift, whereas data stream mining algorithms should be robust to noise and distinguish it from concept drift [10].
7.3.2 Data Stream Methods Features
Stream mining methods should include the following features [11]:
The stream mining model should be incremental, so that it updates itself with the continuously incoming data.
The stream mining model should be able to mine and process the incoming data stream with a single access.
The stream mining model should process data at high speed and occupy little memory.
The stream mining model should track and handle concept drifts.
The stream mining model should be able to generate results at any time, because the data stream is potentially infinite.
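These features can be illustrated with a small test-then-train loop. The Python sketch below is only an illustration under stated assumptions: the class MajorityClassLearner, the function test_then_train, and the toy stream are hypothetical names and data introduced here, not part of any tool discussed in this chapter. Each instance is read once, used first for testing and then for an incremental update, and a running accuracy is available at any time.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()  # constant-size state: one counter per class label

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1  # incremental update, no stored history


def test_then_train(stream, learner):
    """Single pass over the stream: test on each instance, then train on it."""
    correct = 0
    seen = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn_one(x, y)
        seen += 1
        yield correct / seen  # a result is available after every instance


# Toy stream of (features, label) pairs, used only for illustration.
stream = [((0,), "a"), ((1,), "a"), ((2,), "b"), ((3,), "a")]
accuracies = list(test_then_train(iter(stream), MajorityClassLearner()))
```

Any incremental classifier that offers a predict step and a per-instance update step could take the place of the toy learner in this loop.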
7.4 Data Stream Sources
Nowadays, massive data streams are generated by social media applications such as Twitter, Facebook, Pinterest, and others [12]. The streams of posts, likes, comments, and feeds play a significant role in social media analysis. The health industry is another source of data streams, where vast amounts of data are collected, including laboratory records, electronic medical records, doctors' notes, and billing. The financial industry can also form an infinite data stream through credit and debit cards, loans, insurance, and other services. The presence of online banking services has increased the amount of generated stream data.
The data stream can be considered one of the sources of big data [9]. Big data can be defined as enormous data that comes in various structures, including structured, semi-structured, and unstructured, and that is hard to process and analyze using traditional techniques. It has four properties: volume, variety, velocity, and value.
Every second, a massive amount of data is generated continuously from different sources including the global positioning system (GPS), IoT, social media, health technologies, and many more. It is generated in various formats such as text, audio, video, and others. The value of big data refers to the process of capturing the generated data and extracting the hidden knowledge. Big data and data streams play an essential role in predictive modeling and have therefore been studied extensively over the last decade [9]. However, the data stream may change over time for various reasons, and such changes affect the accuracy of the predictive model's results.
7.5 Data Stream Mining Components
Massive data streams are generated by social media applications such as Twitter, Facebook, and Pinterest [12], as well as by the health, economic, and financial industries, among many others. The data stream mining process involves several components, as shown in Figure 7.2: the streaming data as input, the estimator, the data mining algorithm, drift detection, and the extracted knowledge as output.
Figure 7.2 The Main Components of Data Stream Mining.
7.5.1 Input
The data stream can be generated from different sources such as web searches, social
media posts, real-time surveillance systems, banking activities, and other non-stationary
environments.
7.5.2 Estimators
Estimation is a critical step that prepares the data stream for the knowledge extraction process. Data stream examples must be processed in real time because of the stream's high speed, and they cannot all be stored in memory. Estimation methods are therefore needed to select a subset of the arriving data stream. They can be categorized into data-based and task-based techniques, as shown in Table 7.2 [13].
Table 7.2 Estimation method.
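One common way to select a bounded subset of an unbounded stream is reservoir sampling. It is shown here only as an illustrative example of a sampling-style estimator, on the assumption that sampling counts as a data-based technique; the function name reservoir_sample and its parameters are hypothetical and are not taken from this chapter.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Memory use is O(k) and each item is inspected exactly once.
    """
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # uniform integer in [0, i], both ends included
            if j < k:
                reservoir[j] = item  # replace a stored item with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(10_000), k=5)
short_sample = reservoir_sample(range(3), k=5)  # stream shorter than k: everything is kept
```

The reservoir never grows beyond k items, which matches the limited-memory and one-scan requirements described above.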
7.6 Data Stream Classification and Concept Drift
7.6.1 Data Stream Classification
Classification is a supervised machine learning task. It (1) uses past data (the training set) to build a model and then (2) uses the model to predict class labels (on the testing set) [4]. Classification algorithms in non-stationary environments must fulfill the data stream requirements regarding processing time, limited memory, and one-time scanning. In a dynamic environment, some instances in the streaming data may change over time because of the high velocity and limited memory; this is called concept drift. It is a change in the distribution of the output given the input, while the distribution of the input may stay unchanged [2]. An example is predicting a customer's shopping behavior after her or his preferences have changed: a model based on old data will produce wrong results.
7.6.2 Concept Drift
Concept drift occurs between two points in time $t_0$ and $t_1$ when the joint distribution of X (the independent variables) and y (the target variable) at time $t_0$ is not equal to the joint distribution of (X, y) at time $t_1$ [2]. It can be represented as:
$p_{t_0}(X, y) \neq p_{t_1}(X, y)$ (7.1)
Concept drift may occur if there is a change in:
(1) The prior probabilities of classes p(y),
(2) The class-conditional probability distributions p(X|y), or
(3) The posterior probabilities p(y|X).
Concept drift has been classified into two types, real concept drift and virtual concept drift, as shown in Figure 7.2. (1) Real concept drift occurs when p(y|X) changes; it is also called concept shift. (2) Virtual drift occurs if p(X) changes without affecting p(y|X).
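The link between these quantities can be made explicit with the product rule and Bayes' theorem. The identities below are standard probability results rather than formulas taken from this chapter, and the time subscript t is added here only for illustration.

```latex
p_t(X, y) = p_t(y \mid X)\, p_t(X),
\qquad
p_t(y \mid X) = \frac{p_t(y)\, p_t(X \mid y)}{p_t(X)}
```

Read this way, a change in the joint distribution between $t_0$ and $t_1$ is a real drift when the factor $p_t(y \mid X)$ changes, and a virtual drift when only $p_t(X)$ changes while $p_t(y \mid X)$ stays the same.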
Concept drift in the data stream may happen for different reasons, such as changes in the environment, individual preferences, or adversary activities [14]. It may take different forms [2], as shown in Figure 7.3:
Figure 7.3 Forms of Concept Drifts.
Data stream classification algorithms can be categorized into single-classifier and ensemble algorithms. Regarding concept drift adaptation, some algorithms update their classifiers continuously, whether or not drift occurs. Other algorithms trigger changes in the classifier only when a drift is detected [2].
7.6.3 Data Stream Classification Algorithms with Concept Drift
Data stream classification algorithms with concept drift can be classified into single clas-
sifier algorithms and ensemble algorithms.
7.6.4 Single Classifier
Some single-classifier algorithms detect drift in the data distribution by using statistical methods that keep track of the base classifier's performance [6]. When a drift is discovered, they signal the base classifier so that it can be updated or rebuilt. Examples include:
The authors in [15] designed the Drift Detection Method (DDM), which monitors the classifier's error rate. If the error rate reaches the warning level and then the drift level, the data distribution is considered to have changed.
DDM has been modified into an improved version named the Early Drift Detection Method (EDDM) in [16]. EDDM is used to detect gradual drifts that emerge slowly by considering the distances between classification errors.
The Reactive Drift Detection Method (RDDM) is another modified version of DDM [17]. It overcomes the performance loss of DDM by discarding older examples, and it periodically recalculates the DDM statistics that determine the warning and drift levels. It also signals a drift whenever the number of examples in the warning level reaches a threshold.
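As a rough illustration of how an error-rate monitor such as DDM can be organized, the following Python sketch tracks the running error rate p and its standard deviation s, remembers the point where p + s was lowest, and compares the current p + s with that minimum. The class name SimpleDDM, the 30-instance warm-up, and the multipliers 2 (warning) and 3 (drift) are assumptions chosen for this example; they are not taken from this chapter, and the sketch is not the reference implementation from [15].

```python
import math


class SimpleDDM:
    """Simplified error-rate drift monitor in the spirit of DDM (illustrative only)."""

    def __init__(self, min_instances=30, warning_level=2.0, drift_level=3.0):
        self.min_instances = min_instances  # assumed warm-up length
        self.warning_level = warning_level  # assumed multiplier for the warning level
        self.drift_level = drift_level      # assumed multiplier for the drift level
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error is 1 for a misclassification and 0 otherwise.

        Returns one of 'stable', 'warning', or 'drift'.
        """
        self.n += 1
        self.errors += error
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1 - p) / self.n)    # its standard deviation
        if self.n < self.min_instances:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s      # best operating point so far
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()                       # start over after a drift
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic error sequence: about 10% errors for 200 instances, then about 50%.
detector = SimpleDDM()
drift_positions = []
for i in range(1, 401):
    if i <= 200:
        error = 1 if i % 10 == 0 else 0
    else:
        error = 1 if i % 2 == 0 else 0
    if detector.update(error) == "drift":
        drift_positions.append(i)
```

In a full system, a "warning" would typically start collecting recent instances and a "drift" would trigger rebuilding the base classifier from them; both steps are left out of this sketch.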
Other algorithms detect concept drift using windowing techniques, by comparing the distributions of the windows, such as:
ADWIN [18] detects different types of changes using a sliding window that holds the most recent examples. A concept drift is reported if the difference between the means of two sub-windows is greater than a threshold. As the window grows, the processing time becomes longer. Thus, the authors proposed an improved version called ADWIN2 to satisfy the memory and time requirements.
Similarly, the SEED algorithm uses a sliding window to detect change within the data stream. If the difference between the means of two sub-windows exceeds a certain threshold, it signals a drift and drops the old sub-window. The authors evaluated the algorithm on a stream produced by a Bernoulli distribution generator. They found that SEED is faster and uses less memory than ADWIN2 [19].
The Concept-adapting Very Fast Decision Tree (CVFDT) is another algorithm that uses a single classifier [20]. It extends the VFDT algorithm with the ability to detect concept drift, and it employs a sliding window to keep the classifier updated with recent instances.
The authors in [21] proposed a window-based algorithm called ADDM, in which the size of the window is determined dynamically. It detects concept drift by keeping track of the entropy of the window and reports a drift when the entropy equals one. ADDM was evaluated using seven datasets containing different types of concept drift. It showed good drift detection performance compared to other methods, and it also obtained high accuracy.
The authors in [22] proposed a fuzzy windowing method for concept drift adaptation, named FW-DA. The algorithm reports a drift when there is a significant difference between the test statistics of the current window and the old window.
The authors in [23] proposed the STEPD algorithm, which considers the accuracy over two windows, recent and old. A drift is reported if a statistical test finds a significant difference between the two windows.
Similarly to STEPD, the authors in [24] proposed three detection methods named Fisher Proportions Drift Detector (FPDD), Fisher Square Drift Detector (FSDD), and Fisher Test Drift Detector (FTDD). These methods adopt Fisher's Exact statistical test and use the same two sliding windows (old and recent) as STEPD. FPDD is a variation of STEPD that is used when the number of correct predictions over the two windows is small, whereas FSDD combines Fisher's Exact test with the chi-square statistical test for homogeneity of proportions. FTDD is the simplest: it uses Fisher's Exact test to detect drifts. The authors ran several experiments to compare the proposed methods against well-known detectors, using two base learners, Naive Bayes (NB) and Hoeffding Tree, on synthetic and real-world datasets. The results showed that the accuracies of the proposed methods are better than those of the compared detectors.
The authors in [25] proposed the Fast Hoeffding Drift Detection Method (FHDDM). It monitors the probability of correct predictions over a sliding window. It compares the maximum probability observed so far with the most recent probability and signals a change if the difference between them equals or exceeds a threshold.
The McDiarmid Drift Detection Method (MDDM) employs McDiarmid's inequality to detect concept drift [26]. It uses a sliding window over the prediction results and a weighting scheme that gives higher weights to recent instances. As instances are processed, MDDM calculates the weighted mean over the sliding window and tracks the maximum weighted mean observed so far. It reports a concept drift if there is a significant difference between the two means, where significance is determined by McDiarmid's inequality. The authors tested the algorithm using MOA against different methods including DDM, EDDM, RDDM, ADWIN, and many others. MDDM outperformed the other algorithms, with shorter detection delays and high accuracy.
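The window-based idea behind FHDDM, comparing the best windowed accuracy seen so far with the current one, can be sketched as follows. This is a simplified illustration and not the authors' implementation from [25]: the class name SimpleFHDDM, the window size of 25, and delta = 1e-6 are assumptions for this example, and the threshold is assumed to come from the Hoeffding bound, epsilon = sqrt(ln(1/delta) / (2n)) for a window of n predictions.

```python
import math
from collections import deque


class SimpleFHDDM:
    """Simplified sliding-window detector in the spirit of FHDDM (illustrative only)."""

    def __init__(self, window_size=25, delta=1e-6):
        self.window = deque(maxlen=window_size)  # 1 = correct prediction, 0 = wrong
        # Threshold assumed to follow the Hoeffding bound for a window of this size.
        self.epsilon = math.sqrt(math.log(1 / delta) / (2 * window_size))
        self.p_max = 0.0  # highest windowed accuracy observed so far

    def update(self, correct):
        """Feed one prediction outcome; return True when a drift is signalled."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.window.maxlen:
            return False  # wait until the window is full
        p = sum(self.window) / len(self.window)  # current windowed accuracy
        self.p_max = max(self.p_max, p)
        if self.p_max - p >= self.epsilon:
            self.window.clear()  # forget the old concept
            self.p_max = 0.0
            return True
        return False


# 100 correct predictions followed by 100 wrong ones.
detector = SimpleFHDDM(window_size=25, delta=1e-6)
outcomes = [True] * 100 + [False] * 100
drift_positions = [i for i, ok in enumerate(outcomes, start=1) if detector.update(ok)]
```

A smaller window reacts faster but raises epsilon, so a larger drop in accuracy is needed before a drift is signalled.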
7.6.5 Ensemble Classifiers
In this approach, the algorithms use a set of classifiers in which each classifier is assigned a weight, and they adapt to changes by updating the ensemble's components and their associated weights [6]. Examples include:
DWM [27] maintains a set of experts, each of which is assigned a weight. When an instance arrives, it is passed to the experts, each of which returns a local prediction. DWM determines the global prediction using the local predictions and the expert weights.
The AUE2 algorithm [28] is another method that partitions the data stream into chunks, each containing a set of examples. For every arriving chunk, a classifier associated with a weight is created. The performance of each classifier is evaluated by calculating its error rate on the data chunk, in order to determine the worst-performing classifiers.
In addition, two classifiers can be combined to form a detection system that detects both sudden and gradual drift [29]. The system is composed of an online classifier and a block-based classifier. Whenever a data instance arrives, the online classifier updates itself, so any sudden change can be detected, while the block-based classifier works on blocks of data instances, which allows it to observe gradual changes. The classifiers' error rates are calculated to detect the changes. A drift is reported if the value of the error rate is the same for the next blocks of the data stream.
The Double-Window-based Classification Algorithm (DWCDS) is another window-based method used to detect changes in a data stream [30]. It detects concept drift by checking the data distributions periodically. The algorithm starts by generating decision trees using the data in the sliding window. If a concept drift is observed, the DWCDS model is updated.
Moreover, in [31] the authors proposed the paired learner (PL) algorithm, which combines two classifiers, a stable one and a reactive one. The stable classifier predicts based on all of its experience, while the reactive classifier predicts based on a recent window. PL observes distributional changes by comparing the performance of these two classifiers.
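The weighted combination of local predictions used by ensembles such as DWM can be sketched in a few lines. The function names weighted_vote and penalize, the labels, and the decay factor beta = 0.5 are assumptions made for this illustration; creating and removing experts, which a complete ensemble method would also handle, is deliberately left out.

```python
from collections import defaultdict


def weighted_vote(local_predictions, weights):
    """Global prediction: the label with the largest total expert weight."""
    totals = defaultdict(float)
    for label, weight in zip(local_predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def penalize(local_predictions, weights, true_label, beta=0.5):
    """Multiply the weight of every expert that predicted wrongly by beta (assumed value)."""
    return [
        weight * beta if prediction != true_label else weight
        for prediction, weight in zip(local_predictions, weights)
    ]


weights = [1.0, 1.0, 1.0]
local_predictions = ["spam", "ham", "ham"]  # one prediction per expert

first_vote = weighted_vote(local_predictions, weights)           # majority by weight
weights = penalize(local_predictions, weights, true_label="spam")
weights = penalize(local_predictions, weights, true_label="spam")
second_vote = weighted_vote(local_predictions, weights)          # after two penalties
```

After the two wrong experts have been penalized twice, the single expert that was right outweighs them, which is how the ensemble shifts toward components that fit the current concept.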
All the mentioned algorithms are summarized in Table 7.3.
Table 7.3 Classification algorithms with concept drift.
7.6.6 Output
This component represents the knowledge and valuable patterns extracted from the data stream.
7.7 Datasets
Several well-known datasets have been used to evaluate the effectiveness of classification algorithms in detecting concept drift. Datasets can be real or artificial, and they may contain one type of drift or several types. Table 7.4 shows the most used datasets in data stream mining (DSM) studies with the presence of concept drift.
Table 7.4 Datasets for DSM with concept drift.
7.8 Evaluation Measures
The following list presents well-known evaluation measures for data stream classification algorithms:
Accuracy score: it is calculated by dividing the number of correct predictions by the total number of the classifier's predictions [4].
Recall: it is also referred to as the true positive rate [4].
CPU time: it measures the total CPU runtime spent training and testing the classifier [37].
Memory: it measures the total memory consumed to run the classifier and store the running statistics [37].
Kappa statistic: it measures the homogeneity among the classifiers [37].
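The label-based measures in this list can be computed directly from true and predicted labels, as the sketch below shows. The function names and the toy label vectors are hypothetical. The kappa function implements Cohen's kappa, that is, agreement between predictions and true labels corrected for chance agreement; this is one common reading of the kappa statistic and is an assumption here, not a definition taken from [37].

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """True positive rate for the class given as `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    # Chance agreement: both label sequences pick the same class independently.
    p_chance = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0


# Toy labels used only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive=1)
kappa = cohen_kappa(y_true, y_pred)
```

In a streaming experiment these measures are usually computed over a sliding window or in a test-then-train fashion rather than on a fixed test set.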
Concept drift detection can be assessed through different measures such as:
The probability of true change detection: it measures the algorithm's ability to discover drifts when they occur [2].
Detection delay: it measures the time that passes before the change is detected [2].
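On synthetic streams, where the true drift positions are known, both measures can be estimated by matching detections to drifts. The sketch below is one possible way to do this: the function name detection_report, the max_delay tolerance, and the example positions are assumptions introduced for illustration, and detections are assumed to be sorted by position.

```python
def detection_report(true_drifts, detections, max_delay):
    """Match each true drift to the first unused detection within max_delay instances.

    Returns (detection_rate, mean_delay, false_alarms).
    """
    delays = []
    used = set()
    for drift in true_drifts:
        hit = next(
            (d for d in detections
             if drift <= d <= drift + max_delay and d not in used),
            None,
        )
        if hit is not None:
            used.add(hit)
            delays.append(hit - drift)  # instances elapsed before detection
    detected = len(delays)
    rate = detected / len(true_drifts) if true_drifts else 0.0
    mean_delay = sum(delays) / detected if detected else None
    false_alarms = len(detections) - len(used)  # detections matched to no drift
    return rate, mean_delay, false_alarms


# Hypothetical positions (instance indices) used only for illustration.
rate, mean_delay, false_alarms = detection_report(
    true_drifts=[1000, 2000, 3000],
    detections=[1040, 1500, 2100],
    max_delay=200,
)
```

The tolerance max_delay decides when a late detection stops counting as a true detection and becomes a false alarm, so it should be reported alongside the results.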
7.9 Data Stream Mining Tools
The most popular tools used in data stream environments are listed below:
Weka: it provides a set of data mining and machine learning algorithms. These algorithms can be applied directly to datasets through the Weka GUI or by importing the Weka Java library [38].
MOA: Massive Online Analysis (MOA) is a project developed at the University of Waikato, New Zealand [39]. It provides an environment to deal with data streams, run experiments, and implement data stream mining algorithms. In addition, users can create new data stream algorithms and simulate different types of concept drift, including abrupt, gradual, incremental, and mixed drifts, on synthetic and real data streams.
SAMOA: Scalable Advanced Massive Online Analysis is an open source tool that provides well-known data stream and machine learning algorithms [40].
Apache Storm: an open source platform for processing infinite streams of data. It is scalable and fast, which makes it suitable for producing immediate analytics and performing online machine learning [41].
RapidMiner: an open source environment written in Java that is used for traditional data mining, text mining, and data stream mining. It also provides machine learning and predictive analysis [11].
7.10 Data Stream Mining Applications
Classification with concept drift can be used in several areas such as:
Monitoring systems: a monitoring system is used to distinguish normal behavior from abnormal behavior [42]. It processes a large amount of data that must be analyzed in real time. Examples include intrusion detection systems that search for suspicious behavior in network traffic, and fraud detectors that track adversary behavior and prevent online banking fraud. Transportation travel time and traffic prediction are also monitored using data stream mining techniques.
Personal assistance: data stream mining methods have been employed in different personal assistance applications. For example, they are used to classify news feeds and categorize articles [42]. In addition, customer information and preferences are aggregated to segment customers based on their interests or to predict their needs [43]. Concept drift happens because individual interests and behavior change over time. Personal assistance applications also include spam filtering and recommendation systems.
Decision support: decision support and diagnostics applications include the evaluation of creditworthiness [42]. The results should be highly accurate because the cost of mistakes is large. For example, when predicting bankruptcy, the system makes its decision according to different bankruptcy prediction models under various economic conditions.
Artificial intelligence: artificial intelligence applications include navigation systems, smart homes, virtual reality, and vehicle monitoring [42]. Learning algorithms in this category should be adaptive and take concept drift into account.
7.11 Conclusion
Concept drift is one of the main challenges of data stream mining (DSM). It must be detected and handled to avoid inaccurate results from learning models. In this chapter, we have discussed the concept of the data stream. The DSM components, including the input and output, estimation methods, and classification algorithms with concept drift, have been presented. We have also highlighted the most used DSM tools, datasets, applications, and evaluation measures in data stream experiments.
REFERENCES
1. Ten Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations. [Online]. Available: http://comsense.consulting/wpcontent/uploads/2017/03/10 Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations.pdf. [Accessed: 23-Dec-2018].
2. J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, A Survey on Concept Drift Adaptation, ACM Comput. Surv., vol. 1, no. 1, 2013.
3. I. Khamassi, M. Sayed-Mouchaweh, M. Hammami, and K. Ghédira, Discussion and review on evolving data streams and concept drift adapting, Evol. Syst., vol. 9, no. 1, pp. 1–23, 2018.
4. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., 2012.
5. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed. John Wiley & Sons, Inc., 2011.
6. V. Mittal and I. Kashyap, Empirical Study of Impact of Various Concept Drifts in Data Stream Mining Methods, Int. J. Intell. Syst. Appl., vol. 8, no. 12, pp. 65–72, 2016.
7. M. Gaber, Advances in data stream mining, Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 2, no. 1, pp. 79–85, 2012.
8. M. S. B. PhridviRaj and C. V. GuruRao, Data Mining - Past, Present and Future - A Typical Survey on Data Streams, Procedia Technol., vol. 12, pp. 255–263, 2014.
9. M. Last et al., Open challenges for data stream mining research, ACM SIGKDD Explor. Newsl., vol. 16, no. 1, pp. 1–10, 2014.
10. A. Tsymbal, The problem of concept drift: definitions and related work, Comput. Sci. Dep. Trinity Coll. Dublin, vol. 4, no. C, pp. 200415, 2004.
11. A. Kumar and A. Singh, Stream mining a review: Tool and techniques, Proc. Int. Conf. Electron. Commun. Aerosp. Technol. ICECA 2017, vol. 2017-January, pp. 27–32, 2017.
12. J. J. Warren, A Big Data Primer, pp. 33–59, 2017.
13. M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining Data Streams: A Review, Encycl. Data Warehous. Mining, Second Ed., no. June 2005, pp. 1248–1256, 2011.
14. M. Pechenizkiy, Žliobaitė et al., An Overview of Concept Drift Applications, pp. 1–24, 2016.
15. J. Gama, P. Medas, G. Castillo, and P. Rodrigues, Learning with Drift Detection, in 17th Brazilian Symposium on Artificial Intelligence Proc., 2004, pp. 286–295.
16. M. Baena-García et al., Early drift detection method, Fourth Int. Work. Knowl. Discov. from Data Streams, vol. 6, no. August 2014, pp. 77–86, 2006.
17. R. S. M. Barros, D. R. L. Cabral, P. M. Gonçalves, and S. G. T. C. Santos, RDDM: Reactive drift detection method, Expert Syst. Appl., vol. 90, pp. 344–355, 2017.
18. A. Bifet and R. Gavaldà, Learning from Time-Changing Data with Adaptive Windowing, pp. 443–448, 2013.
19. D. T. J. Huang, Y. S. Koh, G. Dobbie, and R. Pears, Detecting Volatility Shift in Data Streams, Proc. - IEEE Int. Conf. Data Mining, ICDM, vol. 2015-January, no. January, pp. 863–868, 2015.
20. G. Hulten, L. Spencer, and P. Domingos, Mining time-changing data streams, in The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 97–106.
21. L. Du, Q. Song, and X. Jia, Detecting concept drift: An information entropy based method using an adaptive sliding window, Intell. Data Anal., vol. 18, no. 3, pp. 337–364, 2014.
22. A. Liu, G. Zhang, and J. Lu, Fuzzy time windowing for gradual concept drift adaptation, IEEE Int. Conf. Fuzzy Syst., 2017.
23. K. Nishida and K. Yamauchi, Detecting Concept Drift Using Statistical Testing, Discov. Sci., pp. 264–269, 2007.
24. D. R. de L. Cabral and R. S. M. de Barros, Concept drift detection based on Fisher's Exact test, Inf. Sci. (Ny)., vol. 442–443, pp. 220–234, 2018.
25. A. Pesaranghader and H. Viktor, Fast Hoeffding Drift Detection Method for Evolving Data Streams, in Machine Learning and Knowledge Discovery in Databases, 2016, pp. 96–111.
26. A. Pesaranghader, H. L. Viktor, and E. Paquet, McDiarmid Drift Detection Methods for Evolving Data Streams, in Proceedings of the International Joint Conference on Neural Networks, 2018.
27. J. Z. Kolter and M. A. Maloof, Dynamic weighted majority: a new ensemble method for tracking concept drift, pp. 123–130, 2004.
28. D. Brzezinski and J. Stefanowski, Reacting to Different Types of Concept Drift:, vol. 25, no. 1, pp. 81–94, 2014.
29. A. Jadhav and L. Deshpande, An efficient approach to detect concept drifts in data streams, Proc. - 7th IEEE Int. Adv. Comput. Conf. IACC 2017, pp. 28–32, 2017.
30. Q. Zhu, X. Hu, Y. Zhang, P. Li, and X. Wu, A double-window-based classification algorithm
for concept drifting data streams, Proc. - 2010 IEEE Int. Conf. Granul. Comput. GrC 2010, pp.
639644, 2010.
31. S. H. Bach and M. A. Maloof, Paired learners for concept drift, Proc. - IEEE Int. Conf. Data
Mining, ICDM, pp. 2332, 2008.
32. Census Income Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/census+income.
[Accessed: 21-Dec-2018].
33. Elena Ikonomovskas Web page. [Online]. Available: https://kt.ijs.si/elena ikonomovska/data.html.
[Accessed: 03-Nov-2018].
34. Covertype Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/covertype.
[Accessed: 28-Nov-2018].
35. W. N. Street and Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification,
vol. 4, pp. 377382, 2004.
36. R. S. M. Barros and S. G. T. C. Santos, A large-scale comparison of concept drift detectors,
Inf. Sci. (Ny)., vol. 451452, pp. 348370, 2018.
37. P. Dhaliwal and M. P. S. Bhatia, Effective Handling of Recurring Concept Drifts in Data
Streams, Indian J. Sci. Technol., vol. 10, no. 30, pp. 16, 2017.
38. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. [Online].
Available: https://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 03-Nov-2018].
39. A. Bifet, MOA: Massive Online Analysis Learning Examples, J. Mach. Learn. Res., vol. 11,
pp. 1601-1604, 2010.
40. G. De Francisci Morales, F. Morales, and A. Bifet, SAMOA: Scalable Advanced Massive
Online Analysis, J. Mach. Learn. Res., vol. 16, pp. 149-153, 2015.
41. Apache Storm. [Online]. Available: http://storm.apache.org/index.html. [Accessed: 03-Nov-
2018].
42. D. Brzezinski, Mining Data Streams With Concept Drift, 2010.
43. I. Žliobaitė, M. Pechenizkiy, and J. Gama, An Overview of Concept Drift Applications, in Big
Data Analysis: New Algorithms for a New Society, vol. 16, Springer International Publishing,
2016.