
A review on data stream classification approaches

February 2016
Journal of Advanced Computer Science & Technology 5(1):8

DOI:10.14419/jacst.v5i1.5225

Authors:

Sajad Homayoun

Technical University of Denmark

Marzieh Ahmadzadeh

University of Toronto

License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Journal of Advanced Computer Science & Technology, 5 (1) (2016) 8-13

Journal of Advanced Computer Science & Technology

Website: www.sciencepubco.com/index.php/JACST

doi: 10.14419/jacst.v5i1.5225

Review paper

A review on data stream classification approaches

Sajad Homayoun *, Marzieh Ahmadzadeh

Department of Computer Engineering and Information Technology, Shiraz University of Technology, Shiraz, Iran

*Corresponding author E-mail: s.homayoun@sutech.ac.ir

Abstract

Stream data is usually vast in volume, dynamically changing, possibly infinite, and multi-dimensional. Data stream mining is attracting growing attention because of its presence in a wide range of real-world applications, such as e-commerce, banking, sensor data and telecommunication records. As in conventional data mining, data stream mining includes techniques such as classification, clustering and frequent pattern mining; this paper focuses on classification methods designed to handle data streams. Early data stream classification methods needed all instances to be labeled in order to build classifier models, whereas more recent approaches (Semi-Supervised Learning and Active Learning) also exploit unlabeled data. This paper reviews state-of-the-art research, focusing on ensemble methods, semi-supervised learning and active learning.

Keywords: Data Stream; Data Stream Classification; Ensemble; Semi-Supervised Learning; Active Learning.

1. Introduction

Dramatic growth in information technology and the vast volume of generated data have created challenging new discovery tasks in data processing. A data stream is a continuous and changing sequence of data that continuously arrives at a system to be stored or processed [1]. Data streams share several characteristics: they are massive, temporally ordered, fast-changing and potentially infinite in length [2-4]. According to [5], the following properties distinguish data stream mining from traditional data mining:

• The size of a data stream is potentially unbounded.
• The elements of the stream arrive online.
• Because memory is limited, the system discards (or summarizes) an element after processing it.
• The system cannot control or determine how data elements arrive.

Emails, sensor data, website customer click streams, network traffic and weather forecasting data are examples of data streams. Data stream mining comprises three main techniques: clustering, classification and frequent pattern mining.

Classification is a supervised learning technique that aims to predict a dependent variable (the class label) from the attribute values of an instance [6]. Building a classification model has two main phases: 1) model creation and 2) model evaluation. In the first phase, a learning algorithm uses a dataset to create a model that is able to predict class labels. The second phase measures the accuracy of the created model.

Change in the data over time is one of the main issues in data stream classification. Data can evolve in two ways [7]: 1) concept drift and 2) concept evolution. Concept drift happens when the underlying concept changes over time, so that the class label associated with similar instances changes. Weather forecasting, spam categorization and monitoring systems are examples in which concept drift is a challenge. Concept evolution occurs when one or more new class labels emerge in the class label set [7]. As shown in Fig. 1(b), concept evolution occurs when new instances arrive with new class labels.

Fig. 1: (A) Fixed Number of Class Labels; (B) A Novel Class Has Emerged (Concept Evolution)

Classification of data streams poses several challenges that researchers attempt to solve. The three main challenges are as follows [8]:

• Accuracy: the most important factor in classification algorithms; concept drift directly affects it.
• Efficiency: building a classifier is computationally costly, and updating the model in response to drift is a further challenge.
• Ease of use: a classifier model should be easy to use in applications.

According to [9], single-model incremental classification and ensemble-based classification are the two major branches of data stream classification. The first maintains a single classifier and updates it incrementally to handle newly evolved class labels in the stream. It usually requires complex modifications to the internal structure of the classifier, and a single-classifier approach is often unable to produce strong and accurate classifiers. In contrast, an ensemble model combines different classifiers to improve the overall accuracy of predictions. If every base classifier performs better than random prediction (accuracy above 0.5 in a two-class problem) and the classifiers make largely independent errors, a majority-vote ensemble is generally more accurate than any single member.
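The effect of voting can be checked with a short calculation. The sketch below is an illustration rather than part of any reviewed method: it assumes a two-class problem, an odd number of base classifiers, and independent errors (an idealization that real ensembles only approximate), and it computes the probability that a strict majority votes correctly. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, votes for the correct class."""
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    majority = n_classifiers // 2 + 1
    # Sum the binomial probabilities of k correct votes for k >= majority.
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.6, 0.7):
        print(p, round(majority_vote_accuracy(3, p), 3))
```

Under these assumptions, three classifiers that are each correct 60% of the time reach about 64.8% as a majority vote, whereas three classifiers at 40% fall to about 35.2%, which is why the better-than-random condition matters.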

Because labeled instances are needed to build classifiers, classification is regarded as a supervised task. It is worth mentioning that the quality of a classifier depends heavily on the percentage of labeled instances available in the data stream. Many researchers have tried to use unlabeled instances alongside labeled ones, because manual labeling of instances (by experienced agents) is costly and time-consuming.

The remainder of the paper is organized as follows: Section 2 discusses ensemble classification methods; Section 3 reviews semi-supervised and active learning algorithms; Section 4 is dedicated to future work; and Section 5 concludes the paper.

2. Incremental learning and ensemble methods

After ensembles were introduced in the 1990s, many researchers tried to improve prediction accuracy by using them [10]. An ensemble method (Fig. 2) creates a set of base classifiers from the training data and classifies new instances by polling the base classifiers [11]. Ensembles are popular because they improve classification accuracy in static environments [12], but they need some changes to adapt to the dynamic nature of data streams. In incremental learning, the learning algorithm runs whenever new instances arrive and adjusts the existing model accordingly [13]. Some of the reviewed incremental methods are discussed in Section 3 because they belong to the category of semi-supervised or active learning algorithms. This section reviews incremental and ensemble methods.

Fig. 2: An Ensemble Model.

Jing et al. [14] identified four main challenges of data stream classification and claimed that their proposed model addresses all of them: 1) infinite length, 2) concept drift, 3) arrival of novel classes and 4) lack of labeled instances. ECM-BDF (Ensemble Classification Model Based on Decision Feedback) divides the data stream into sequential chunks of appropriate size, builds a classifier for each chunk and selects some of the resulting classifiers to form the ensemble. Classifiers built from newly labeled instances are then used to update the ensemble. To handle concept evolution and the lack of labeled instances, the model includes a novel class detection mechanism that assumes a decision boundary around the training data. Instances that fall outside the boundary are treated as outliers, and outliers with strong cohesion may be regarded as an emerging new class. The model proposed in [14] uses only labeled instances; in practice, however, unlabeled instances often far outnumber labeled ones, and models that consider only labeled instances usually have low accuracy.

Abdulsalam et al. [15] defined four scenarios for data streams and presented a three-phase model that addresses them. In Scenario 0, labeled records appear only at the beginning of the stream (in sufficient number to build a classifier) and all subsequent instances are unlabeled. Scenario 1 exhibits concept drift while labeled instances remain adequate for building a classifier. In Scenarios 2 and 3, labeled instances are insufficient. Scenario 3 is the more common one: labeled records arrive frequently and periodically. Fig. 3 shows the four scenarios.

In phase one, they introduced an approach for Scenario 0 in which streaming decision tree construction is merged with the Random Forest algorithm. Phase two uses a self-adjusting algorithm that employs an entropy-based change-detection technique to address Scenario 1 (concept drift). Phase three handles Scenarios 2 and 3; its key feature is the ability to determine when the current model is ready to deploy. In other words, it determines the deployment moment by defining a threshold on the minimum number of labeled records needed. The algorithm proposed in [15] considers only ordinal or numerical attributes and assumes that records are approximately uniformly distributed; these assumptions limit the proposed model.
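To make the idea of entropy-based change detection concrete, the following minimal sketch compares the Shannon entropy of the class labels in a reference window with that of the current window and flags a possible change when the two differ by more than a threshold. It is a generic illustration, not the test used in [15]; the function names, the choice of comparing label distributions only, and the threshold value 0.3 are all assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the label distribution in a window."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def drift_suspected(reference_window, current_window, threshold=0.3):
    """Flag a possible change when the entropies of two windows diverge.

    The threshold is a hypothetical tuning parameter.
    """
    return abs(entropy(current_window) - entropy(reference_window)) > threshold


if __name__ == "__main__":
    old = ["a", "a", "b", "b"]   # balanced labels
    new = ["a", "a", "a", "a"]   # a single label dominates
    print(drift_suspected(old, new))
```

A detector of this kind only notices changes that alter the label mix; a drift that keeps the class proportions unchanged would need a statistic computed on the attributes as well.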

AUE2 (Accuracy Updated Ensemble), proposed in [16], aims to handle different types of drift. It combines the accuracy-based weighting mechanisms of block-based ensembles with the incremental nature of Hoeffding Trees. The ensemble is updated by adding new classifiers and removing weak ones. In other words, the proposed model improves the ensemble's reaction to different types of drift while reducing the influence of data chunk size on prediction accuracy. AUE2 is an enhancement of AUE1 [17], with changes in the weighting and updating mechanisms that reduce computational cost and increase prediction accuracy.
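The general pattern of a block-based, accuracy-weighted ensemble can be sketched as follows. This is not AUE2 itself: the base learner (`train_majority`, which always predicts the majority label of its chunk), the use of plain accuracy on the newest chunk as the weight, and giving the newest member weight 1.0 are simplifying assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def train_majority(chunk):
    """Hypothetical base learner: always predicts the chunk's majority label."""
    majority = Counter(label for _, label in chunk).most_common(1)[0][0]
    return lambda x: majority


def accuracy(model, chunk):
    """Fraction of (instance, label) pairs in the chunk that the model gets right."""
    return sum(model(x) == y for x, y in chunk) / len(chunk)


def update_ensemble(ensemble, chunk, max_size=3, train=train_majority):
    """Re-weight members on the newest chunk, add a new member, drop the weakest."""
    reweighted = [(model, accuracy(model, chunk)) for model, _ in ensemble]
    reweighted.append((train(chunk), 1.0))  # assumption: newest member gets full weight
    reweighted.sort(key=lambda pair: pair[1], reverse=True)
    return reweighted[:max_size]


def predict(ensemble, x):
    """Weighted vote over the members' predictions."""
    votes = Counter()
    for model, weight in ensemble:
        votes[model(x)] += weight
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    chunk1 = [((0,), "a")] * 3 + [((1,), "b")]
    chunk2 = [((0,), "b")] * 4          # the concept has changed
    ensemble = update_ensemble([], chunk1)
    ensemble = update_ensemble(ensemble, chunk2)
    print(predict(ensemble, (0,)))
```

Because weights are recomputed on every chunk, a member trained on an outdated concept loses influence as soon as its accuracy on recent data drops, and it is eventually pushed out once the ensemble exceeds `max_size`.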

[Fig. 2 diagram: the Input Stream feeds Classifier 1, Classifier 2, ..., Classifier i, ..., Classifier n; their predictions pass through VOTE to produce the OUTPUT.]

Fig. 3: Scenarios Introduced in [15] for Data Streams.

Farid et al. [9] proposed an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. It uses traditional classifiers and automatically updates the ensemble to handle concept drift. Novel class detection is based on the distance among instances: instances of the same class should be close to each other, and instances of different classes should be far apart. If an instance is far enough from the existing clusters, it can be considered a member of a new class.
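The distance-based idea can be illustrated with a small sketch. It summarizes each known class by a centroid and a radius and treats an instance as a candidate for a new class when it lies well outside every known cluster. This is a simplified illustration, not the procedure of [9]; the single-cluster-per-class summary, the Euclidean distance and the `slack` factor are assumptions.

```python
from math import dist


def summarize(instances):
    """Return (centroid, radius) for a list of equal-length numeric tuples."""
    dims = len(instances[0])
    centroid = tuple(
        sum(x[d] for x in instances) / len(instances) for d in range(dims)
    )
    radius = max(dist(x, centroid) for x in instances)
    return centroid, radius


def is_novel_candidate(x, clusters, slack=1.5):
    """True if x lies farther than slack * radius from every known class cluster.

    `slack` is a hypothetical tolerance factor.
    """
    return all(dist(x, c) > slack * r for c, r in clusters.values())


if __name__ == "__main__":
    clusters = {
        "a": summarize([(0, 0), (1, 0), (0, 1), (1, 1)]),
        "b": summarize([(5, 5), (6, 5), (5, 6), (6, 6)]),
    }
    print(is_novel_candidate((0.6, 0.4), clusters))   # inside cluster "a"
    print(is_novel_candidate((10, 10), clusters))     # far from both clusters
```

In a streaming setting a single distant instance would normally be buffered rather than declared a new class immediately, since isolated outliers are more likely to be noise than an emerging class.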

Some articles address feature evolution, which occurs when the set of features changes over time [18] or whenever new features appear in the data [19].

DXMiner [19] introduced a model for handling feature evolution, but it had a high false positive rate (false novel class detection rate) and a high false negative rate (missed novel class detection rate) on some datasets. Moreover, it is unable to detect new classes if more than one new class arrives at the same time. Afterwards, Masud et al. [20] tried to solve the problem of simultaneously arriving new classes. The model introduced in [21] is claimed by its authors to outperform earlier models with respect to concept drift, concept evolution and feature evolution. To handle concept drift and concept evolution, they designed a framework in which each classifier is equipped with a novel class detector and is able to detect more than one novel class. To address feature evolution, they proposed a feature set homogenization technique.
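One simple way to picture feature set homogenization is to map an incoming instance and an existing model onto the union of their feature sets, filling the missing entries with zero so that no feature is discarded. The sketch below shows only this union-based idea under stated assumptions; the sparse dictionary representation, the zero fill value and the function name `homogenize` are illustrative and are not taken from [19] or [21].

```python
def homogenize(model_features, instance):
    """Project a sparse instance (dict: feature name -> value) onto the union
    of the model's features and the instance's own features.

    Features absent from the instance are filled with 0.0 (an assumption).
    """
    union = sorted(set(model_features) | set(instance))
    vector = [instance.get(name, 0.0) for name in union]
    return union, vector


if __name__ == "__main__":
    features, vector = homogenize(["a", "b"], {"b": 2.0, "c": 1.0})
    print(features, vector)
```

The same projection would have to be applied to the model's own training summaries so that both sides are compared in one feature space.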

Aggarwal et al. [22] proposed a model that is able to adapt to changes in evolving data streams. Their On-Demand Classification method dynamically determines the appropriate window of past training data. They introduced supervised micro-clusters, which are built only from training data; each micro-cluster is a set of related training instances that all share the same class label. In this way they adapted their earlier unsupervised clustering approach [23] to the classification of highly evolving data streams. Note that some parameters (such as the initial points and the size of the sliding window) must be set carefully to achieve appropriate accuracy, which is a drawback of the model. The model aims at handling concept drift, but it offers no mechanism for concept evolution. They used the KDD Cup 1999 data set [24] to evaluate the model.
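A supervised micro-cluster can be pictured as a small additive summary that also carries a class label. The sketch below keeps a count and a per-dimension linear sum, and only lets an instance join a micro-cluster with the same label. It is a minimal illustration of the concept, not the data structure of [22]; the field names, the `max_distance` rule for opening a new micro-cluster and the omission of time stamps and squared sums are assumptions.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class MicroCluster:
    label: str
    n: int = 0
    linear_sum: list = field(default_factory=list)

    def add(self, x):
        """Fold one instance into the additive summary."""
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(x)
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]
        self.n += 1

    @property
    def centroid(self):
        return tuple(s / self.n for s in self.linear_sum)


def absorb(clusters, x, label, max_distance=1.0):
    """Add x to the nearest micro-cluster of the same class, or open a new one.

    `max_distance` is a hypothetical boundary parameter.
    """
    same_class = [c for c in clusters if c.label == label]
    nearest = min(same_class, key=lambda c: dist(c.centroid, x), default=None)
    if nearest is None or dist(nearest.centroid, x) > max_distance:
        nearest = MicroCluster(label)
        clusters.append(nearest)
    nearest.add(x)
    return nearest


if __name__ == "__main__":
    clusters = []
    absorb(clusters, (0.0, 0.0), "normal")
    absorb(clusters, (0.2, 0.0), "normal")
    absorb(clusters, (0.1, 0.0), "attack")   # same region, different class
    print([(c.label, c.n, c.centroid) for c in clusters])
```

Keeping the summary additive means a micro-cluster can be updated in constant time per instance, which suits the one-pass constraint of data streams.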

3. Semi-supervised learning and active learning

Due to the nature of data streams (high volume, high speed, etc.), manually labeling every instance (by experienced agents) is not feasible, and researchers have proposed novel methods to handle this problem. According to [25], Semi-Supervised Learning (SSL) and Active Learning (AL) are two iterative approaches to employing unlabeled data when building a classification model.

To reduce the manual labeling workload, Semi-Supervised Learning lets the machine label samples itself, while Active Learning attempts to find the most informative samples to be labeled by experts. The primary characteristics of SSL and AL are as follows:

• SSL: at each iteration, it selects the sample with the highest confidence and adds it with the label predicted by the machine itself, without any external (expert) involvement.
• AL: it treats the instance with the lowest confidence as the most informative one; at each iteration it selects such an instance and asks the expert for its label. AL involves human experts and aims at selecting the most useful instances for training. It can greatly improve the model's performance and accelerate convergence.
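The two selection rules described above can be sketched side by side. The example assumes that a classifier has already produced class probabilities for a pool of unlabeled instances, and it uses the highest class probability as the confidence score; the function names and the dictionary layout are hypothetical.

```python
def confidence(probabilities):
    """Confidence of a prediction: the largest class probability."""
    return max(probabilities.values())


def ssl_pick(predictions):
    """Self-training step: the most confident instance and its predicted label."""
    idx = max(predictions, key=lambda i: confidence(predictions[i]))
    probs = predictions[idx]
    return idx, max(probs, key=probs.get)


def al_pick(predictions):
    """Active learning step: the least confident instance, to be sent to the expert."""
    return min(predictions, key=lambda i: confidence(predictions[i]))


if __name__ == "__main__":
    predictions = {
        "x1": {"spam": 0.95, "ham": 0.05},
        "x2": {"spam": 0.55, "ham": 0.45},
        "x3": {"spam": 0.20, "ham": 0.80},
    }
    print(ssl_pick(predictions))   # instance labeled by the machine
    print(al_pick(predictions))    # instance queried from the expert
```

Here SSL would add `x1` with the machine-assigned label `spam`, while AL would ask the expert about `x2`, the instance closest to the decision boundary.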

Masud et al. [7] employed unlabeled data as well as labeled data in building the classifier model. Their work starts from the fact that a high percentage of stream data is unlabeled, because experienced agents label instances more slowly than data arrives, whereas earlier classification models employed only labeled data. To make predictions more accurate, their model builds the classifier from both labeled and unlabeled instances. They introduced a semi-supervised clustering algorithm and built classifiers on evolving data using a label propagation approach. The model considers both concept drift and concept evolution. They compared their model with the On-Demand method proposed in [22]; the results show that it performs better (in memory usage, computation time and accuracy) even though it uses only 10 percent labeled training data, compared with On-Demand, which uses 100 percent labeled data to build the classifier.

Semi-Supervised Classification based on Class Membership (SSCCM), proposed in [26], uses label membership in semi-supervised learning. The authors formulated the problem for labeled and unlabeled data in a unified objective function and solved it with an iterative strategy that moves closer to the final solution in each iteration. SSCCM uses both a label membership function and a decision function for classification, and the predictions of the two functions are expected to be consistent. In other words, if the two predictions are inconsistent, the instance is assumed to lie near the decision boundary and the prediction is probably unreliable. In fact, one function is sufficient for prediction, and label membership is preferred. Note that the inconsistency between the two predictions can be used to identify instances that are difficult to classify, which can then be handled in other ways (such as manual labeling). In this way each function is verified by the other and the reliability of classification is improved.

SUN, a Semi-supervised classification algorithm for data streams with concept drifts and UNlabeled data [27], aims to handle concept drift in data streams that include unlabeled data. SUN uses a k-modes-based algorithm that incrementally places concept clusters in the leaves of the constructed decision tree. With traditional clustering algorithms, converting categorical data into numerical values does not necessarily produce meaningful results when the categories have no particular order. The results of k-means and k-median are therefore not appropriate, and a k-modes-based algorithm is introduced in [27].
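The reason k-modes suits categorical data can be seen from its two building blocks: a dissimilarity that counts mismatching attributes instead of subtracting numbers, and a cluster center made of the most frequent value per attribute instead of a mean. The sketch below shows these two pieces only; it is a generic illustration of k-modes, not the incremental algorithm of SUN, and the function names are hypothetical.

```python
from collections import Counter


def mismatch_distance(a, b):
    """Number of attributes on which two categorical instances differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(cluster):
    """Cluster center: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


def assign(instance, modes):
    """Index of the mode that is closest to the instance."""
    return min(range(len(modes)), key=lambda i: mismatch_distance(instance, modes[i]))


if __name__ == "__main__":
    cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
    print(mode_of(cluster))
    print(assign(("green", "large"), [("red", "small"), ("blue", "large")]))
```

No ordering of the categories is ever assumed, so encoding `red`, `blue` and `green` as 1, 2 and 3 (and thereby implying that `red` is closer to `blue` than to `green`) is avoided.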

According to the theory of Naïve Bayes, the online error of Naïve Bayes decreases when the distribution of the instances is fixed (unchanging), whereas it increases when the distribution changes. In [27], a change in the data distribution is taken to reflect a change in the attribute dimensions. Thus, to deal with concept drift, SUN compares the historical concept with the new concept and considers the distribution of class labels to track concept drift.

Zhang et al. [30] proposed an ensemble model that combines classification and clustering for mining data streams. They identified two challenges in combining the two methods: 1) generated clusters carry only a cluster number, with no class label information about their instances, and 2) because of concept drift, combining clusters and classifiers in one ensemble is a difficult task. They proposed a solution for each challenge: 1) a label propagation technique that extracts useful information (labels) for each cluster from its instances, and 2) a weighting approach that weighs each model according to its consistency with the model built from the most recent data chunk. The approach in [30] assumes that the class labels in unlabeled data chunks are the same as those in labeled chunks, which means it offers no solution for concept evolution.

A classifier-ensemble-based active learning framework is proposed in [28], which selectively labels instances to build an ensemble classifier. The authors show that a classifier ensemble's variance directly corresponds to its error rate, so reducing the ensemble's variance is equivalent to improving its prediction accuracy. Hence, the instances sent to the agent for labeling should be those that minimize the classifier ensemble variance, and a Minimum Variance (MV) principle is proposed. To determine the weight values of the ensemble members, an optimal weight calculation method is also proposed in [28]. Finally, MV and optimal weighting are combined into a single framework.
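A rough way to see how ensemble variance can guide labeling is to measure, for each unlabeled instance, how much the members' predicted class probabilities disagree, and to query the instance with the largest disagreement. The sketch below is a simplified proxy written for illustration; it is not the Minimum Variance procedure or the optimal weighting of [28], and the function names and the unweighted variance are assumptions.

```python
from statistics import pvariance


def ensemble_variance(member_probs):
    """Sum over classes of the variance of the members' predicted probabilities.

    member_probs: one dict (label -> probability) per ensemble member.
    """
    labels = member_probs[0].keys()
    return sum(pvariance([p[label] for p in member_probs]) for label in labels)


def pick_query(candidates):
    """Instance id whose predictions vary most across the ensemble members.

    candidates: dict instance_id -> list of member probability dicts.
    """
    return max(candidates, key=lambda i: ensemble_variance(candidates[i]))


if __name__ == "__main__":
    candidates = {
        "x1": [{"a": 0.9, "b": 0.1}] * 3,                      # members agree
        "x2": [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9},
               {"a": 0.5, "b": 0.5}],                          # members disagree
    }
    print(pick_query(candidates))
```

Labeling the instance on which the members disagree most is expected to remove the most variance once the ensemble is retrained, which is the intuition behind variance-driven selection.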

Hosseini et al. [29] made use of recurring concepts in learning data stream classifiers. They combined Active Learning (AL) with weighted classifiers. A pool of classifiers is maintained and continuously updated, and each classifier in the pool describes one of the existing concepts. When new data arrives, the model classifies the instances and, once their labels are determined, either updates an existing classifier in the pool or inserts a new classifier into it. Two methods, one Bayesian and one heuristic, are used for detecting recurring concepts and updating the pool.

4. Future works

AL can help when an expert is available to determine class labels and steer the model toward high accuracy. When a vast volume of data arrives at extremely high speed, however, AL may yield low accuracy, and semi-supervised learning is preferred. SSL tries to find useful information in unlabeled data automatically (which makes SSL fast), but when the initial model is very weak it may produce wrong labels and introduce errors into the training set. Furthermore, the instances with the highest confidence are not necessarily the most useful ones, so SSL generally performs worse than AL. Combining AL and SSL appears to be a research gap.

Concept evolution (especially the simultaneous arrival of several new classes) needs more effort, because little research is available in this area and the proposed methods are often too complicated.

An integrated ensemble model that addresses all challenges, such as concept drift, concept evolution and scarcity of labeled instances, is needed. Although there is some research on the topic, the proposed models have usually been tested only on well-known datasets; experiments on, and implementations in, real systems over longer periods of time are needed.

Table 1 gives a brief summary of the reviewed studies and can help readers find topics for future work.

Table 1: A Comparison of Investigated Research Papers

| Approach | Tech. | CE? | CD? | Case | Short Description | Year | Author(s) |
|---|---|---|---|---|---|---|---|
|  | Semi-supervised clustering + label propagation |  |  | SynD, SynDE, KDD 99', ASRS | Utilizing both labeled and unlabeled instances to train and update the classification model | 2011 | Masud et al. [7] |
|  | Supervised micro-clustering, cluster-based, sliding window |  |  | KDD 99' | Considering only labeled instances and building the classifier through an on-demand classification process that can dynamically select the appropriate window of past training data | 2006 | Aggarwal et al. [22] |
|  | Novel class label detection, feedback from unsupervised mechanisms |  |  | SynCN, KDD 99' | Data stream classification with an ensemble model based on decision feedback | 2014 | Liu Jing et al. [14] |
|  | Semi-supervised, Lossless Homogenizing Conversion for feature evolution |  |  | Twitter, ASRS, KDD 99', Forest | Considering a dynamic feature space in classification and addressing feature evolution | 2010 | Masud et al. [19] |
|  | Semi-supervised, k-modes-based clustering, statistical approach to detecting concept drift |  |  | SEA, STAGGER, KDD 99', Yahoo shopping data, LED | Handling both concept drift and unlabeled data streams | 2012 | Xindong Wu et al. [27] |
|  | Semi-supervised, label propagation in clusters and weighting in updating the ensemble framework |  |  | Malicious URLs Detection dataset, Intrusion Detection dataset | Accumulating labeled records and combining them to create a classifier according to a threshold | 2010 | Peng Zhang et al. [30] |
|  | Entropy-based concept drift detection, Random Forest |  |  | Synthetic dataset, Sloan Digital Sky Survey (SDSS) | Considering multiple target class labels | 2011 | Abdulsalam et al. [15] |
|  | Accuracy-based weighting, Hoeffding Trees |  |  | Some datasets from the UCI repository [31] | Reacting to different types of concept drift | 2014 | Dariusz Brzezinski et al. [16] |
|  | Minimum Variance (MV), optimal weighting |  |  | Data stream generated by a hyperplane-based synthetic data stream generator | Selecting the best instances to be labeled by an external agent in order to decrease the classifier ensemble variance | 2010 | Xingquan Zhu et al. [28] |
|  | Bayesian formulation, heuristic methods |  |  | Data stream generated by a hyperplane-based synthetic data stream generator, and an emailing list dataset | Constructing a pool of classifiers and updating the pool with new classifiers created from newly arrived data to improve the accuracy of the ensemble | 2011 | Hosseini et al. [29] |
|  | Decision tree learning, similarity-based clustering |  |  | Some datasets from the UCI repository [31] | Handling concept evolution by considering inter-class and intra-class distance | 2013 | Dewan Md. Farid et al. [9] |
|  | Semi-supervised, label membership function, decision function |  |  | Some datasets from the UCI repository [31] | Enhancing classification reliability by a consistency check between the predictions of two functions; each instance has a likelihood for each class label instead of belonging to only one class | 2012 | Yunyun Wang et al. [26] |

CD: Concept Drift, CE: Concept Evolution, E: Ensemble, I: Incremental, Y: Yes, N: No, Tech.: Technique.

5. Conclusion

Data stream mining includes techniques such as classification, clustering and frequent pattern mining. This paper has reviewed data stream classification with a focus on ensemble methods, semi-supervised learning and active learning. Traditional data stream classification methods employ only labeled data and often have lower accuracy when the system lacks enough labeled data. In the real world, scarcity of labeled instances is common because labeling is a time-consuming and costly process. Hence, researchers have recently focused on using unlabeled data as well as labeled data in building classification models. AL and SSL are two approaches to using unlabeled data. Handling concept drift is an important issue in data stream mining, and only a few of the reviewed works also address concept evolution.

References

[1] H. H. Mahnoosh Kholghi, Mohammad Reza Keyvanpour, "Classification and Evaluation of Data Mining Techniques for Data Stream Requirements," presented at the International Symposium on Computer, Communication, Control and Automation, Tainan, Taiwan, 2010, http://dx.doi.org/10.1109/3CA.2010.5533759.
[2] Data Streams Models and Algorithms: Springer, 2007.
[3] J. D. U. Anand Rajaraman, Mining of Massive Datasets: Cambridge, 2012.
[4] J. G. M. M. Gaber, Learning from Data Streams: Springer, 2007.
[5] J. Gama, Knowledge Discovery from Data Streams: Chapman & Hall/CRC, Taylor & Francis Group, 2010, http://dx.doi.org/10.1201/EBK1439826119.
[6] M. Kantardzic, Data Mining: Concepts, Models, Methods and Algorithms: Wiley-IEEE Press, 2011, http://dx.doi.org/10.1002/9781118029145.ch1.
[7] C. W. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen & Nikunj C. Oza, "Facing the reality of data stream classification: coping with scarcity of labeled data," Knowl Inf Syst, vol. 33, p. 32, 2011, http://dx.doi.org/10.1007/s10115-011-0447-8.
[8] O. M. L. Rokach, Data Mining and Knowledge Discovery Handbook, 2 ed.: Springer, 2010, http://dx.doi.org/10.1007/978-0-387-09823-4.
[9] Dewan Md. Farid, Li Zhang, Alamgir Hossain, Chowdhury Mofizur Rahman, Rebecca Strachan, Graham Sexton & Keshav Dahal, "An adaptive ensemble classifier for mining concept drifting data streams," Expert Systems with Applications, vol. 40, p. 12, 2013.
[10] R. Polikar, "Ensemble based systems in decision making," IEEE Circuits and Systems Magazine, vol. 6, p. 25, 2006, http://dx.doi.org/10.1109/MCAS.2006.1688199.
[11] M. S. V. K. Pang-Ning Tan, Introduction to Data Mining vol. 1: Pearson Education, 2006.
[12] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. USA: Wiley, 2004, http://dx.doi.org/10.1002/9781118914564.refs.
[13] P. Z. Wenyu Zang, Chuan Zhou & Li Guo, "Comparative study between incremental and ensemble learning on data streams: Case study," Journal of Big Data, vol. 1, p. 16, 2014, http://dx.doi.org/10.1186/2196-1115-1-5.

[14] X. G.-s. LIU Jing, ZHENG Shi-hui, XIAO Da & GU Li-ze, "Data streams classification with ensemble model based on decision feedback," The Journal of China Universities of Posts and Telecommunications, vol. 21, p. 7, 2014, http://dx.doi.org/10.1016/S1005-8885(14)60272-7.
[15] D. B. S. P. M. Hanady Abdulsalam, "Classification Using Streaming Random Forests," IEEE Transactions on Knowledge and Data Engineering, vol. 23, p. 15, 2011, http://dx.doi.org/10.1109/TKDE.2010.36.
[16] D. B. J. Stefanowski, "Reacting to Different Types of Concept Drift: The Accuracy Updated Ensemble Algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, p. 14, 2014, http://dx.doi.org/10.1109/TNNLS.2013.2251352.
[17] D. B. J. Stefanowski, "Accuracy Updated Ensemble for Data Streams with Concept Drift," presented at the International Conference on Hybrid Artificial Intelligent Systems, 2011, http://dx.doi.org/10.1007/978-3-642-21222-2_19.
[18] J. Gao. (2014-07-01). Data Stream Mining: Challenges and Techniques. Available: http://www.cse.buffalo.edu/~jing/talks.htm
[19] Q. C. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han & Bhavani Thuraisingham, "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space," presented at the European conference on Machine learning and knowledge discovery in databases, Berlin, 2010, http://dx.doi.org/10.1007/978-3-642-15883-4_22.
[20] Q. C. Mohammad M. Masud, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han & Bhavani Thuraisingham, "Addressing Concept-Evolution in Concept-Drifting Data Streams," presented at the IEEE International Conference on Data Mining, 2010, http://dx.doi.org/10.1109/ICDM.2010.160.
[21] C. W. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen & Nikunj C. Oza, "Classification and Adaptive Novel Class Detection of Feature-Evolving Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 25, p. 14, 2013, http://dx.doi.org/10.1109/TKDE.2012.109.
[22] J. H. Charu C. Aggarwal, Jianyong Wang & Philip S. Yu, "A Framework for On-Demand Classification of Evolving Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 18, p. 13, 2006, http://dx.doi.org/10.1109/TKDE.2006.69.
[23] J. H. P. S. Y. Charu C. Aggarwal, "A Framework for Clustering Evolving Data Streams," in International Conferences of Very Large Data Bases, Berlin, 2003, p. 11.
[24] (1999, 2014-06-06). KDD Cup 1999 Data. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

[25] X. X. G. Q. Yan Leng, "Combining Active Learning and Semi-supervised Learning to Construct SVM Classifier," Knowledge-Based Systems, vol. in press, p. 31, 2014, http://dx.doi.org/10.1016/j.knosys.2013.01.032.
[26] S. C. Z.-H. Z. Yunyun Wang, "New Semi-Supervised Classification Method Based on Modified Cluster Assumption," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, p. 14, 2012, http://dx.doi.org/10.1109/TNNLS.2012.2186825.
[27] P. L. X. H. Xindong Wu, "Learning from concept drifting data streams with unlabeled data," Neurocomputing, vol. 92, p. 11, 2012, http://dx.doi.org/10.1016/j.neucom.2011.08.041.
[28] P. Z. Xingquan Zhu, Xiaodong Lin & Yong Shi, "Active Learning From Stream Data Using Optimal Weight Classifier Ensemble," IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 40, p. 15, 2010, http://dx.doi.org/10.1109/TSMCB.2010.2042445.
[29] Z. A. H. B. Mohammad Javad Hosseini, "Pool and Accuracy Based Stream Classification: A new ensemble algorithm on data stream classification using recurring concept detection," presented at the 11th IEEE International Conference on Data Mining Workshops, 2011, http://dx.doi.org/10.1109/ICDMW.2011.137.
[30] X. Z. Peng Zhang, Jianlong Tan & Li Guo, "Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams," presented at the 2010 IEEE 10th International Conference on Data Mining (ICDM), Sydney, NSW, 2010, http://dx.doi.org/10.1109/ICDM.2010.125.
[31] (2014-07-08). UC Irvine Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/.


