
CLASSIFICATION OF CONCEPT DRIFT IN EVOLVING DATA STREAM

May 2020


DOI:10.1002/9781119654674.ch11

In book: Emerging Extended Reality Technologies For Industry 4.0 (pp.189-205)

Authors:

Manal Abdullah

King Abdulaziz University

The concept of Data Stream has emerged as a result of the evolution of technologies in different domains such as banking, e‐commerce, social media, and many others. It is defined as a sequence of data instances generated at very high speed, which can be hard to store in memory. Thus, it became hard to extract knowledge from the continuous data stream using traditional data mining. Data stream mining algorithms should fulfill some requirements such as limited memory, concept drift detection, and one scan processing. Concept drift must be tracked to avoid poor performance and inaccurate results of predictive models. It refers to changing data stream distributions due to several reasons, including the changes in the environment, individual preferences, or adversary activities. In this chapter, we will present the data stream mining components. The problem of concept drift in classification algorithms and several existing state‐of‐the‐art handling methods are highlighted. Besides, the most used datasets, tools, applications, and evaluation methods will be presented.


CHAPTER 7

A REVIEW ON CLASSIFICATION OF CONCEPT DRIFT IN EVOLVING DATA STREAM

Mashail Althabiti, Manal Abdullah

Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, SA.


Emerging Extended Reality Technologies for Industry 4.0.

Edited by Jolanda G. Tromp et al. Copyright © 2020 Scrivener Publishing


7.1 Introduction

Nowadays, millions of people around the world share data anywhere and anytime. Emerging technologies in telecommunications, the entertainment industry, social media sites, banking services, and other applications have led to massive growth in generated data streams. A data stream is a sequence of data examples that are produced at a very high rate and arrive continuously in a potentially infinite stream. According to the "10 Key Marketing Trends For 2017" report, 90% of the data in the world has been produced in the last two years alone, at 2.5 quintillion bytes each day [1]. As a result, this massive amount of data cannot be stored for further processing and mining. Data stream mining has therefore emerged to extract knowledge from the data stream and provide real-time processing.

However, data stream mining has constraints that must be coped with, such as concept drift, limited memory, and one scan of the data. Concept drift refers to changes in data concepts over time. In non-stationary environments, learning a model from unstable data can result in wrong predictions and inaccurate results: the underlying data distribution may change, so the model is no longer consistent with the new data. For example, when predicting a customer's shopping behavior, the customer's preferences may have changed, so a model based on old data will produce wrong results. Concept drift must therefore be tracked and handled using detection methods that can cope with the data stream. Data stream mining with concept drift handling has been studied extensively over the last decade.

Many survey studies have reviewed data stream mining from different perspectives. The authors in [2] surveyed the state-of-the-art adaptive learning algorithms with concept drift detection. They addressed concept drift through various applications and highlighted several evaluation techniques. In [3], the authors presented general criteria to help researchers design their concept drift handling methods, and categorized the existing concept drift algorithms according to these criteria. Also, 14 drift detectors have been evaluated on datasets that include six artificial ones and compared in terms of accuracy and detection. The authors used Naive Bayes (NB) and Hoeffding Tree classifiers to test the drift detectors.

7.2 Data Mining

Traditional data mining can be defined as the process of finding and extracting valuable knowledge and patterns from massive data sets [4]. It can be applied to different kinds of data such as database data, data warehouse data, text data, and multimedia data. The data mining process is carried out in an offline mode, where historical data are used to train the predictive model.

Data mining consists of the following steps as shown in Figure 7.1 [5]:

Figure 7.1 Traditional data mining process.


(1) Problem definition: it is concerned with understanding the problem and its objectives and formulating the hypothesis.

(2) Data collection: the data can be generated under the expert's control, or collected without the expert's influence.

(3) Data preprocessing: it is composed of several tasks such as detecting outliers or other abnormal values in the collected data, variable scaling, encoding, and feature selection.

(4) Modeling: it is concerned with developing an appropriate model using a data mining technique.

(5) Model interpretation: in this step, the model is interpreted to draw conclusions and make decisions.

7.3 Data Stream Mining

Learning a model is considered an essential step in data mining and machine learning [6]. Previously, it was done in static environments where the whole dataset is available, stored, and can be accessed many times. In contrast, learning from massive datasets in non-stationary environments has become a challenging area. The huge volume of continuous data generated by everyday applications has given rise to the concept of the data stream.

A data stream is a sequence of potentially non-stop data instances that can be read and processed only once. As technologies evolve, it has become hard for traditional data mining to deal with the streams of data from the Internet of Things (IoT), web searches, banking transactions, and many more [7]. Thus, data stream mining has become an attractive research area.

Data stream mining refers to the process of finding knowledge and valuable patterns in continuous, potentially infinite, and high-volume data streams. It plays an essential role in predictive modeling and decision-making.

The main differences between traditional data mining and data stream mining are presented in Table 7.1 [8].

Table 7.1 Comparison of Data Mining and Data Stream Mining


7.3.1 Data Stream Challenges

Data stream mining has several challenges that must be overcome, including the following:

Resource constraints: the data stream is potentially infinite, huge, and arrives at high speed, so it is hard to store in memory. Also, the processing time must be as short as possible [5].

One scan: the data stream cannot be accessed randomly or multiple times [5].

Data preprocessing: since data is continuously arriving, it is not feasible to use manual data preprocessing methods. Preprocessing should be fully automated and automatically updated as the data evolve [9].

Privacy and confidentiality: the data stream is infinite and arrives in portions, so the information available at any time is incomplete. In this case, it is hard to judge the privacy of a model that has a data stream as input [9].

Concept drift: the distribution of the data instances may change over time.

Parameter dependence: algorithms with many parameters are impractical for data stream applications, so it is desirable for data stream algorithms to have few user-adjustable parameters [9].

Distinguishing between true concept drift and noise: some algorithms interpret noise as concept drift, whereas data stream mining algorithms should be robust to noise and distinguish it from concept drift [10].

7.3.2 Data Stream Methods Features

Stream mining methods should include the following features [11]:

The stream mining model should be incremental, so that it updates itself as new data continuously arrive.

The stream mining model should be able to mine and process the incoming data stream in a single pass.

The stream mining model should have high processing speed and occupy a small amount of memory.

The stream mining model should track and handle concept drifts.

The stream mining model should be able to generate results at any time, because the data stream is potentially infinite.
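The features listed above can be illustrated with a small test-then-train (prequential) loop. This is only a sketch under stated assumptions: the class `MajorityClassLearner`, its `predict`/`learn_one` methods, and the synthetic stream are hypothetical names chosen for illustration and do not come from the chapter or from any particular library.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental classifier (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        # Memory grows with the number of classes, not with the length of the stream.
        self.counts = Counter()

    def predict(self, x):
        # Anytime prediction: an answer is available after any number of examples.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        # Incremental update from a single example, which is then discarded.
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Single pass: each example is first used to test the model, then to train it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


# Synthetic stream (assumption): label "b" on every fourth example, "a" otherwise.
stream = (({"f": i}, "b" if i % 4 == 0 else "a") for i in range(100))
acc = prequential_accuracy(stream, MajorityClassLearner())
print(f"prequential accuracy: {acc:.2f}")
```

The learner never stores or revisits examples, so it satisfies the single-access and small-memory requirements; it does not handle concept drift, which is the subject of the detectors discussed later in the chapter.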

7.4 Data Stream Sources

Nowadays, massive data streams are generated by social media applications such as Twitter, Facebook, Pinterest, and others [12]. The streams of posts, likes, comments, and feeds play a significant role in social media analysis. The health industry is another source of data streams, where vast amounts of data can be collected, including laboratory records, electronic medical records, doctors' notes, and billing. The financial industry can also form an infinite data stream through credit and debit cards, loans, insurance, and other services. The presence of online banking services has increased the amount of stream data generated.

The data stream can be considered one of the sources of big data [9]. Big data can be defined as enormous data that come in various structures, including structured, semi-structured, and unstructured, and that are hard to process and analyze using traditional techniques. It has four properties: volume, variety, velocity, and value.

Every second, a massive amount of data is generated continuously from different sources including the Global Positioning System (GPS), IoT, social media, health technologies, and many more. It is generated in various formats such as text, audio, video, and others. The value of big data refers to the process of capturing the generated data and extracting the hidden knowledge. Big data and data streams have an essential role in predictive modeling; thus, they have been studied extensively over the last decade [9]. However, changes in the data stream may occur and be observed over time for various reasons, and they affect the accuracy of the predictive model's results.

7.5 Data Stream Mining Components

Massive data streams can be generated from social media applications such as Twitter, Facebook, and Pinterest [12], as well as from the health, economic, and financial industries and many others. The process of data stream mining involves several components, as shown in Figure 7.2, including the streaming data as input, the estimator, the data mining algorithm, drift detection, and the extracted knowledge as output.

Figure 7.2 The Main Components of Data Stream Mining.

7.5.1 Input

The data stream can be generated from different sources such as web searches, social

media posts, real-time surveillance systems, banking activities, and other non-stationary

environments.

7.5.2 Estimators

Estimation is a critical step that prepares the data stream for the knowledge extraction process. Data stream examples must be processed in real time because of their high arrival speed, and they cannot all be stored in memory. Estimation methods are therefore needed to select a subset of the arriving data stream. They can be categorized into data-based and task-based techniques, as shown in Table 7.2 [13].


Table 7.2 Estimation method.
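As an illustration of selecting a subset of an unbounded stream in one pass, the sketch below uses reservoir sampling. Treating sampling as a representative data-based technique is an assumption, since the contents of Table 7.2 are not reproduced here; the function name `reservoir_sample` and its parameters are likewise hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Each item is seen exactly once and only k items are held in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=50)
print(len(sample))
```

The memory footprint depends only on `k`, which matches the limited-memory and one-scan requirements of Section 7.3.1.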

7.6 Data Stream Classiﬁcation and Concept Drift

7.6.1 Data Stream Classiﬁcation

Classification is a supervised machine learning task. It (1) uses past data (the training set) to build a model and then (2) uses the model to predict class labels (for the testing set) [4]. Classification algorithms in non-stationary environments must fulfill the data stream requirements regarding processing time, limited memory, and one-time scan. In a dynamic environment, where high velocity and limited memory prevent storing and revisiting the data, the streaming data may change over time; this is called concept drift. It is a change in the distribution of the output given the input, while the distribution of the input may stay unchanged [2]. For example, when predicting a customer's shopping behavior, the customer's preferences may have changed, so a model based on old data will produce wrong results.

7.6.2 Concept Drift

Concept drift occurs between two points in time $t_0$ and $t_1$ when the joint distribution of $X$ (the independent variables) and $y$ (the target variable) at time $t_0$ is not equal to the joint distribution of $(X, y)$ at $t_1$ [2]. It can be represented as:

\[ p_{t_0}(X, y) \neq p_{t_1}(X, y) \tag{7.1} \]

Concept drifts may occur if there is a change in:

(1) The prior probabilities of classes p(y),

(2) The class-conditional probability distributions p(X|y), or

(3) The posterior probabilities p(y|X).

Concept drift has been classified into two types, real concept drift and virtual concept drift, as shown in Figure 7.2. (1) Real concept drift occurs when p(y|X) changes; it is also called concept shift. (2) Virtual drift occurs if p(X) changes without affecting p(y|X).
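The three sources of change and the two drift types can be related through the product rule. The following is a short worked restatement using a time-indexed notation $p_t$ that extends Equation (7.1); it is an illustration, not a formula taken from the chapter.

```latex
% Product rule applied to the joint distribution at time t
p_t(X, y) = p_t(y \mid X)\, p_t(X) = p_t(X \mid y)\, p_t(y)

% Real concept drift: the posterior changes
p_{t_0}(y \mid X) \neq p_{t_1}(y \mid X)

% Virtual drift: the input distribution changes while the posterior does not
p_{t_0}(X) \neq p_{t_1}(X) \quad \text{and} \quad p_{t_0}(y \mid X) = p_{t_1}(y \mid X)
```

In both cases the joint distribution differs between $t_0$ and $t_1$, so both satisfy Equation (7.1); only real drift changes the decision boundary that a classifier has to learn.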

Concept drift in the data stream may happen for different reasons, such as changes in the environment, individual preferences, or adversary activities [14]. It may happen in different forms [2], as shown in Figure 7.3:

Figure 7.3 Forms of Concept Drifts.

Data stream classification algorithms can be categorized into single classifiers and ensemble algorithms. Regarding concept drift adaptation, some algorithms update their classifiers continuously, whether or not drift occurs. Other algorithms trigger changes in the classifier whenever a drift is detected [2].

7.6.3 Data Stream Classiﬁcation Algorithms with Concept Drift

Data stream classification algorithms with concept drift can be classified into single classifier algorithms and ensemble algorithms.

7.6.4 Single classiﬁer

Some single classifier algorithms observe and detect drift in the data distribution by using statistical methods and by keeping track of the base classifier's performance [6]. When drift is discovered, they signal the base classifier to be updated or rebuilt. Examples include:

The authors in [15] designed the Drift Detection Method (DDM), which monitors the classifier's error rate. If the error rate reaches the warning level and then the drift level, a change in the data distribution is declared.

The DDM detector has been modified into an improved version named the Early Drift Detection Method (EDDM) in [16]. EDDM is used to detect gradual drifts that emerge slowly, by considering the distances between classification errors.

The Reactive Drift Detection Method (RDDM) is another modified version of DDM [17]. It overcomes DDM's performance loss by discarding older examples. It periodically recalculates the DDM statistics that determine the warning (alarm) and drift levels. Also, a drift is signaled whenever the number of examples at the warning level reaches a threshold.
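The error-rate monitoring behind DDM can be sketched as follows. The text above only states that the error rate is compared against a warning level and a drift level; the concrete thresholds used here (the running minimum of the error rate plus 2 and 3 standard deviations) follow the usual description of DDM in [15], and the class name `DDMSketch`, the `min_samples` warm-up, and the synthetic error stream are assumptions made for illustration.

```python
import math


class DDMSketch:
    """Simplified DDM-style detector over a stream of 0/1 classification errors."""

    def __init__(self, warning_k=2.0, drift_k=3.0, min_samples=30):
        self.warning_k = warning_k      # assumed warning level: p_min + 2 * s_min
        self.drift_k = drift_k          # assumed drift level:   p_min + 3 * s_min
        self.min_samples = min_samples  # assumed warm-up before any decision
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error is True when the base classifier misclassified the example."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n             # running error rate
        s = math.sqrt(p * (1 - p) / self.n)  # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            # Remember the best (lowest) error level observed so far.
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_k * self.s_min:
            self.reset()                     # start over with the new concept
            return "drift"
        if p + s > self.p_min + self.warning_k * self.s_min:
            return "warning"
        return "stable"


# Synthetic stream (assumption): 10% errors for 500 examples, then 50% errors.
detector = DDMSketch()
states = [detector.update(i % 10 == 0 if i < 500 else i % 2 == 0) for i in range(1000)]
print("first drift signalled at example", states.index("drift"))
```

A base classifier would typically start buffering examples at the warning level and be rebuilt from that buffer once the drift level is reached; that part is omitted here.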

Other algorithms detect concept drift using windowing techniques, by comparing the distributions of the windows, such as:

ADWIN [18] detects different types of change using a sliding window that holds the most recent examples. Concept drift is signaled if the difference between the means of two sub-windows is greater than a threshold. As the window grows, the processing time becomes longer. Thus, the authors proposed an improved version called ADWIN2 to satisfy the memory and time requirements.

Similarly, the SEED algorithm uses a sliding window to detect change within the data stream. If the difference between the means of two sub-windows exceeds a certain threshold, it signals a drift and drops the old sub-window. The authors evaluated the algorithm on a stream generated from a Bernoulli distribution. They found that SEED is faster and uses less memory than ADWIN2 [19].

The Concept-adapting Very Fast Decision Tree (CVFDT) is another algorithm that uses a single classifier [20]. It extends the VFDT algorithm with the ability to detect concept drift, and it employs a sliding window to keep the classifier updated with recent instances.

The authors in [21] proposed a window-based algorithm called ADDM, where the size of the window is dynamically determined. It detects concept drift by keeping track of the entropy of the window, and reports a concept drift when the entropy is equal to one. ADDM has been evaluated using seven datasets containing different types of concept drift. It showed good performance in detecting drift compared to other methods, and it obtained high accuracy.

The authors in [22] proposed a fuzzy windowing method to adapt to concept drift, named FW-DA. The algorithm reports a drift when there is a significant difference between the test statistics of the current window and the old window.

The authors in [23] proposed the STEPD algorithm, which considers the accuracy over two windows, recent and old. Drift is signaled if there is a significant difference between the two windows, as determined by a statistical test.

Similarly to STEPD, the authors in [24] proposed three detection methods named Fisher Proportions Drift Detector (FPDD), Fisher Square Drift Detector (FSDD), and Fisher Test Drift Detector (FTDD). The proposed methods adopt Fisher's Exact statistical test and use sliding windows (old and recent) in the same way as STEPD. FPDD is a variation of STEPD that is used when the number of correct predictions over the two windows is small. FSDD combines Fisher's Exact test with the chi-square statistical test for homogeneity of proportions. FTDD is the simplest, as it uses only Fisher's Exact test to detect drifts. The authors ran several experiments to compare the proposed methods against well-known detectors, using two base learners, Naive Bayes (NB) and Hoeffding Tree, with synthetic and real-world datasets. The results showed that the accuracies of the proposed methods are better than those of the compared detectors.

The authors in [25] proposed the Fast Hoeffding Drift Detection Method (FHDDM). It monitors the probability of correct predictions over a sliding window. It compares the maximum probability observed so far with the most recent one and signals a change if the difference between them equals or exceeds a threshold.

The McDiarmid Drift Detection Method (MDDM) employs McDiarmid's inequality to detect concept drift [26]. It uses a sliding window over the prediction results and a weighting scheme that gives higher weights to recent instances. As instances are processed, MDDM calculates the weighted mean over the sliding window and the maximum weighted mean observed so far. It signals a concept drift if there is a significant difference between the two means, where significance is determined by McDiarmid's inequality. The authors tested the algorithm in MOA against different methods including DDM, EDDM, RDDM, ADWIN, and many others. MDDM outperformed the other algorithms, with shorter detection delays and high accuracy.
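The window-based comparison described for FHDDM can be sketched as below. The text only says that the maximum and the most recent probabilities of correct prediction are compared against a threshold; deriving that threshold from Hoeffding's inequality as sqrt(ln(1/delta) / (2n)) is taken from the usual description of the method in [25], and the class name `FHDDMSketch`, the default `window_size` and `delta`, and the synthetic stream are assumptions.

```python
import math
from collections import deque


class FHDDMSketch:
    """Simplified FHDDM-style detector over a stream of correct/incorrect predictions."""

    def __init__(self, window_size=100, delta=1e-6):
        self.window = deque(maxlen=window_size)
        # Assumed threshold from Hoeffding's inequality for a window of n results.
        self.epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * window_size))
        self.p_max = 0.0

    def update(self, correct):
        """Returns True when a drift is signalled for the current example."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.window.maxlen:
            return False  # wait until the window is full
        # Recomputing the sum keeps the sketch short; a running sum would be faster.
        p = sum(self.window) / len(self.window)
        self.p_max = max(self.p_max, p)
        if self.p_max - p >= self.epsilon:
            # Significant drop from the best accuracy seen so far: reset and signal.
            self.window.clear()
            self.p_max = 0.0
            return True
        return False


# Synthetic stream (assumption): 90% correct for 500 examples, then 50% correct.
detector = FHDDMSketch()
signals = [detector.update(i % 10 != 0 if i < 500 else i % 2 == 1) for i in range(1000)]
print("first drift signalled at example", signals.index(True))
```

MDDM follows the same pattern but replaces the plain window mean with a weighted mean and derives the threshold from McDiarmid's inequality; that variant is not shown.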

7.6.5 Ensemble Classiﬁers

In this approach, the algorithms use a set of classifiers in which each classifier is assigned a weight, and they adapt to changes by updating the ensemble's components and their associated weights [6]. Examples include:

DWM [27] maintains a set of experts, each assigned a weight. When an instance arrives, it is passed to each expert, which returns a local prediction. DWM determines the global prediction using the local predictions and the expert weights.

The AUE2 algorithm [28] is another method; it partitions the data stream into chunks, each containing a set of examples. For every arriving chunk, a classifier associated with a weight is created. Each classifier's performance is evaluated by calculating its error rate on the data chunk to determine the worst-performing classifiers.

In addition, two classifiers can be combined to form a detection system that detects both sudden and gradual drift [29]. It is composed of two classifiers: an online classifier and a block-based classifier. Whenever a data instance arrives, the online classifier updates itself, so any sudden change can be detected, while the block-based classifier works on blocks of data instances, which allows it to observe gradual changes. The classifiers' error rates are calculated to detect the changes. The drift can be observed if the value of the error rate is the same for the next blocks of the data stream.

The Double-Window-based Classification Algorithm (DWCDS) is another window-based method used to detect changes in the data stream [30]. It detects concept drift by checking the data distributions periodically. The algorithm starts by generating decision trees from the data in the sliding window. If a concept drift is observed, the DWCDS model is updated.

Moreover, in [31] the authors proposed the paired learner (PL) algorithm, which combines two classifiers: stable and reactive. The stable classifier predicts based on all of its experience, while the reactive classifier predicts based on a recent window. PL observes distributional changes by comparing the performance of these two classifiers.
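The weighted combination of local predictions used by ensembles such as DWM can be sketched as follows. The multiplicative penalty `beta` and the normalization of weights so that the largest equals one follow the usual description of DWM in [27] and are assumptions here, as are the function names `weighted_vote` and `penalize`; creating and removing experts is omitted.

```python
from collections import defaultdict


def weighted_vote(local_predictions, weights):
    """Global prediction: the label with the largest total expert weight."""
    totals = defaultdict(float)
    for label, weight in zip(local_predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def penalize(local_predictions, weights, true_label, beta=0.5):
    """Multiply the weight of every expert that was wrong by beta, then rescale."""
    updated = [
        weight * beta if label != true_label else weight
        for label, weight in zip(local_predictions, weights)
    ]
    top = max(updated)
    return [weight / top for weight in updated]  # largest weight becomes 1.0


# Three hypothetical experts; the first is trusted most but is wrong this time.
predictions = ["spam", "ham", "ham"]
weights = [1.0, 0.4, 0.4]
print(weighted_vote(predictions, weights))   # the highest-weighted expert wins the vote
weights = penalize(predictions, weights, true_label="ham")
print(weighted_vote(predictions, weights))   # after the penalty, the majority wins
```

Because weights shrink only when an expert errs, experts trained on an outdated concept lose influence gradually after a drift, which is how the ensemble adapts without an explicit detector.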


All the mentioned algorithms are summarized in Table 7.3.

Table 7.3 Classification algorithms with concept drift.

7.6.6 Output

This component represents the knowledge and valuable patterns extracted from the data stream.

7.7 Datasets

Several well-known datasets have been used to evaluate the effectiveness of classification algorithms in detecting concept drift. Datasets can be real or artificial and may contain one type of drift or several. Table 7.4 shows the most used datasets in data stream mining studies with concept drift.

Table 7.4 Datasets for DSM with concept drift.

7.8 Evaluation Measures

The following list presents well-known evaluation measures for data stream classification algorithms:

Accuracy score: it is calculated by dividing the number of correct predictions by the total number of predictions made by the classifier [4].

Recall: it is also referred to as the true positive rate [4].

CPU time: it measures the total runtime of the CPU in training and testing the classifier [37].

Memory: it measures the total memory consumed to run the classifier and store the running statistics [37].

Kappa statistic: it measures the homogeneity among the classifiers [37].
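The first two measures and the Kappa statistic can be computed as in the sketch below. The Kappa function implements Cohen's kappa between predictions and true labels, that is, accuracy corrected for chance agreement; this is one common reading of the statistic in stream evaluation and is an assumption here, since the description above is phrased in terms of homogeneity among classifiers. The function names and the toy labels are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Correct predictions divided by all predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """True positive rate for the class given as `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


def kappa(y_true, y_pred):
    """Cohen's kappa: (p0 - pc) / (1 - pc), with pc the chance agreement."""
    n = len(y_true)
    p0 = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p0 - pc) / (1 - pc) if pc != 1 else 0.0


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred), recall(y_true, y_pred, positive=1), kappa(y_true, y_pred))
```

In a streaming setting these quantities are usually updated incrementally or over a sliding window rather than over stored label lists; the lists are used here only to keep the example short.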

The concept drift detection can be assessed through different measures such as:

The probability of true change detection: it measures the algorithm's ability to discover drifts when they occur [2].

Delay time of detection: it measures the time that passes before the change is detected [2].
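When the true drift positions are known, as in synthetic streams, both measures can be estimated from the detector's output. The sketch below is one possible way to do so: the function name `detection_report`, the `horizon` parameter (how long after a drift a detection still counts as a hit), and the example positions are all assumptions.

```python
def detection_report(true_drifts, detections, horizon):
    """Estimate detection rate, mean delay, and false alarms.

    A true drift at position s counts as detected if some detection d
    satisfies s <= d < s + horizon; its delay is d - s for the first such d.
    Every detection that is not the first hit for some drift is a false alarm.
    """
    detections = sorted(detections)
    delays, matched = [], set()
    for start in sorted(true_drifts):
        hit = next((d for d in detections if start <= d < start + horizon), None)
        if hit is not None:
            delays.append(hit - start)
            matched.add(hit)
    return {
        "detection_rate": len(delays) / len(true_drifts) if true_drifts else 0.0,
        "mean_delay": sum(delays) / len(delays) if delays else None,
        "false_alarms": sum(1 for d in detections if d not in matched),
    }


report = detection_report(
    true_drifts=[500, 1500, 2500],
    detections=[120, 535, 1580, 1590],
    horizon=200,
)
print(report)
```

The detection rate is an empirical estimate of the probability of true change detection, and the mean delay is measured in number of examples rather than in wall-clock time.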

7.9 Data Stream Mining Tools

The most popular tools used in data stream environment are listed below:

Weka: it provides a set of data mining and machine learning algorithms. These algorithms can be applied directly to datasets through the Weka GUI or by importing the Weka Java library [38].

MOA: Massive Online Analysis (MOA) is a project developed at the University of Waikato, New Zealand [39]. It provides an environment for working with data streams, running experiments, and implementing data stream mining algorithms. In addition, users can create new data stream algorithms and simulate different types of concept drift, including abrupt, gradual, incremental, and mixed drifts, on synthetic and real data streams.

SAMOA: Scalable Advanced Massive Online Analysis is an open source tool that provides well-known data stream and machine learning algorithms [40].

Apache Storm: an open source platform for processing infinite streams of data. It is scalable and fast, which makes it suitable for producing immediate analytics and performing online machine learning [41].

RapidMiner: an open source environment written in Java that is used for traditional data mining, text mining, and data stream mining. Machine learning and predictive analysis are also provided [11].

7.10 Data Stream Mining Applications

Classiﬁcation with concept drift can be used in several areas such as:

Monitoring systems: a monitoring system is used to distinguish normal behavior from abnormal behavior [42]. It processes a large amount of data that must be analyzed in real time. Examples include intrusion detection systems, which search for suspicious behavior in network traffic, and fraud detectors, which track adversary behavior and prevent online banking fraud. Transportation travel time and traffic prediction are also monitored using data stream mining techniques.

Personal assistance: data stream mining methods have been employed in different personal assistance applications. For example, they are used to classify news feeds and categorize articles [42]. In addition, customer information and preferences are aggregated to segment customers based on their interests or to predict their needs [43]. Concept drift happens because individual interests and behavior change over time. Personal assistance applications also include spam filtering and recommendation systems.

Decision support: decision support and diagnostics applications include the evaluation of creditworthiness [42]. The results should be highly accurate because mistakes are costly. For example, in bankruptcy prediction, the system makes its decision according to different bankruptcy prediction models under various economic conditions.

Artificial intelligence: artificial intelligence applications include navigation systems, smart homes, virtual reality, and vehicle monitoring [42]. Learning algorithms in this category should be adaptive and take concept drift into account.

7.11 Conclusion

Concept drift is one of the main challenges of data stream mining (DSM). It must be detected and handled to avoid inaccurate results from learning models. In this chapter, we have discussed the concept of the data stream. The DSM components, including the input and output, estimation methods, and classification algorithms with concept drift, have been presented. We have also highlighted the most used DSM tools, datasets, applications, and evaluation measures in data stream experiments.

REFERENCES

1. Ten Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations. [Online]. Available: http://comsense.consulting/wpcontent/uploads/2017/03/10 Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations.pdf. [Accessed: 23-Dec-2018].

2. J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, A Survey on Concept Drift Adaptation, ACM Comput. Surv., vol. 1, no. 1, 2013.

3. I. Khamassi, M. Sayed-Mouchaweh, M. Hammami, and K. Ghédira, Discussion and review on evolving data streams and concept drift adapting, Evol. Syst., vol. 9, no. 1, pp. 1–23, 2018.

4. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. 2012.

5. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed. John Wiley & Sons, Inc., 2011.

6. V. Mittal and I. Kashyap, Empirical Study of Impact of Various Concept Drifts in Data Stream Mining Methods, Int. J. Intell. Syst. Appl., vol. 8, no. 12, pp. 65–72, 2016.


7. M. Gaber, Advances in data stream mining, Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 2, no. 1, pp. 79–85, 2012.

8. M. S. B. PhridviRaj and C. V. GuruRao, Data Mining Past, Present and Future A Typical Survey on Data Streams, Procedia Technol., vol. 12, pp. 255–263, 2014.

9. M. Last et al., Open challenges for data stream mining research, ACM SIGKDD Explor. Newsl., vol. 16, no. 1, pp. 1–10, 2014.

10. A. Tsymbal, The problem of concept drift: definitions and related work, Comput. Sci. Dep. Trinity Coll. Dublin, vol. 4, no. C, pp. 200415, 2004.

11. A. Kumar and A. Singh, Stream mining a review: Tool and techniques, Proc. Int. Conf. Electron. Commun. Aerosp. Technol. ICECA 2017, vol. 2017-Janua, pp. 27–32, 2017.

12. J. J. Warren, A Big Data Primer, pp. 33–59, 2017.

13. M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining Data Streams: A Review, Encycl. Data Warehous. Mining, Second Ed., no. June 2005, pp. 1248–1256, 2011.

14. M. Pechenizkiy, I. Žliobaitė, et al., An Overview of Concept Drift Applications, pp. 1–24, 2016.

15. J. Gama, P. Medas, G. Castillo, and P. Rodrigues, Learning with Drift Detection, in Proc. 17th Brazilian Symposium on Artificial Intelligence, 2004, pp. 286–295.

16. M. Baena-García et al., Early drift detection method, Fourth Int. Work. Knowl. Discov. from Data Streams, vol. 6, no. August 2014, pp. 77–86, 2006.

17. R. S. M. Barros, D. R. L. Cabral, P. M. Gonçalves, and S. G. T. C. Santos, RDDM: Reactive drift detection method, Expert Syst. Appl., vol. 90, pp. 344–355, 2017.

18. A. Bifet and R. Gavaldà, Learning from Time-Changing Data with Adaptive Windowing, pp. 443–448, 2013.

19. D. T. J. Huang, Y. S. Koh, G. Dobbie, and R. Pears, Detecting Volatility Shift in Data Streams, Proc. - IEEE Int. Conf. Data Mining, ICDM, vol. 2015-Janua, no. January, pp. 863–868, 2015.

20. G. Hulten, L. Spencer, and P. Domingos, Mining time-changing data streams, in The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 97–106.

21. L. Du, Q. Song, and X. Jia, Detecting concept drift: An information entropy based method using an adaptive sliding window, Intell. Data Anal., vol. 18, no. 3, pp. 337–364, 2014.

22. A. Liu, G. Zhang, and J. Lu, Fuzzy time windowing for gradual concept drift adaptation, IEEE Int. Conf. Fuzzy Syst., 2017.

23. K. Nishida and K. Yamauchi, Detecting Concept Drift Using Statistical Testing, Discov. Sci., pp. 264–269, 2007.

24. D. R. de L. Cabral and R. S. M. de Barros, Concept drift detection based on Fisher's Exact test, Inf. Sci. (Ny)., vol. 442–443, pp. 220–234, 2018.

25. A. Pesaranghader and H. Viktor, Fast Hoeffding Drift Detection Method for Evolving Data Streams, in Machine Learning and Knowledge Discovery in Databases, 2016, pp. 96–111.

26. A. Pesaranghader, H. L. Viktor, and E. Paquet, McDiarmid Drift Detection Methods for Evolving Data Streams, in Proceedings of the International Joint Conference on Neural Networks, 2018.

27. J. Z. Kolter and M. A. Maloof, Dynamic weighted majority: a new ensemble method for tracking concept drift, pp. 123–130, 2004.

28. D. Brzezinski and J. Stefanowski, Reacting to Different Types of Concept Drift:, vol. 25, no. 1, pp. 81–94, 2014.

29. A. Jadhav and L. Deshpande, An efficient approach to detect concept drifts in data streams, Proc. - 7th IEEE Int. Adv. Comput. Conf. IACC 2017, pp. 28–32, 2017.


30. Q. Zhu, X. Hu, Y. Zhang, P. Li, and X. Wu, A double-window-based classiﬁcation algorithm

for concept drifting data streams, Proc. - 2010 IEEE Int. Conf. Granul. Comput. GrC 2010, pp.

639644, 2010.

31. S. H. Bach and M. A. Maloof, Paired learners for concept drift, Proc. - IEEE Int. Conf. Data

Mining, ICDM, pp. 2332, 2008.

32. Census Income Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/census+income.

[Accessed: 21-Dec-2018].

33. Elena Ikonomovska's Web page. [Online]. Available: https://kt.ijs.si/elena_ikonomovska/data.html. [Accessed: 03-Nov-2018].

34. Covertype Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/covertype.

[Accessed: 28-Nov-2018].

35. W. N. Street and Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, vol. 4, pp. 377-382, 2004.

36. R. S. M. Barros and S. G. T. C. Santos, A large-scale comparison of concept drift detectors, Inf. Sci. (Ny)., vol. 451-452, pp. 348-370, 2018.

37. P. Dhaliwal and M. P. S. Bhatia, Effective Handling of Recurring Concept Drifts in Data Streams, Indian J. Sci. Technol., vol. 10, no. 30, pp. 1-6, 2017.

38. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/. [Accessed: 03-Nov-2018].

39. A. Bifet, MOA: Massive Online Analysis Learning Examples, J. Mach. Learn. Res., vol. 11, pp. 1601-1604, 2010.

40. G. De Francisci Morales and A. Bifet, SAMOA: Scalable Advanced Massive Online Analysis, J. Mach. Learn. Res., vol. 16, pp. 149-153, 2015.

41. Apache Storm. [Online]. Available: http://storm.apache.org/index.html. [Accessed: 03-Nov-2018].

42. D. Brzezinski, Mining Data Streams With Concept Drift, 2010.

43. I. Žliobaitė, M. Pechenizkiy, and J. Gama, An Overview of Concept Drift Applications, in Big Data Analysis: New Algorithms for a New Society, vol. 16, Springer International Publishing, 2016.
