A review on data stream classification approaches

Abstract

Stream data is usually vast in volume, dynamically changing, possibly infinite, and multi-dimensional. Attention to data stream mining is increasing because of its presence in a wide range of real-world applications, such as e-commerce, banking, sensor data and telecommunication records. Like data mining, data stream mining includes techniques such as classification, clustering and frequent pattern mining; this paper focuses on classification methods designed to handle data streams. Early methods of data stream classification needed all instances to be labeled in order to build classifier models, but some methods (Semi-Supervised Learning and Active Learning) employ unlabeled data as well as labeled data. This paper reviews state-of-the-art research with a focus on ensemble methods, semi-supervised learning and active learning.
Copyright © 2016 Sajad Homayoun, Marzieh Ahmadzadeh. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Journal of Advanced Computer Science & Technology, 5 (1) (2016) 8-13
Website: www.sciencepubco.com/index.php/JACST
doi: 10.14419/jacst.v5i1.5225
Review paper
Sajad Homayoun *, Marzieh Ahmadzadeh
Department of Computer Engineering and Information Technology, Shiraz University of Technology, Shiraz, Iran
*Corresponding author E-mail: s.homayoun@sutech.ac.ir
Keywords: Data Stream; Data Stream Classification; Ensemble; Semi-Supervised Learning; Active Learning.
1. Introduction
Dramatic growth in information technology and the vast volume of generated data have created challenging new discovery tasks in data processing. A data stream can be conceived as a continuous and changing sequence of data that continuously arrives at a system to be stored or processed [1]. Data streams share some characteristics: they are massive, temporally ordered, fast-changing and potentially infinite in length [2-4]. According to [5], several properties distinguish data stream mining from traditional data mining:
• The size of data streams is potentially boundless.
• The elements of the stream arrive on-line.
• Because of limitations in memory space, the system discards (or summarizes) an element after processing it.
• The system cannot control or determine how data elements arrive.
Emails, sensor data, website customer click streams, network traffic and weather forecasting data are some examples of data streams. Data stream mining comprises three main techniques: clustering, classification and frequent pattern mining.
Classification is a supervised learning technique that aims to predict a dependent variable (the class label) from the attribute values of an instance [6]. Building a classification model has two main phases: 1) model creation and 2) model evaluation. In the first phase, a learning algorithm uses a dataset to create a model that is able to predict class labels. The second phase measures the accuracy of the created model.
Change in data over time is one of the main issues in data stream classification. Data can evolve in two ways [7]: 1) concept drift and 2) concept evolution. Concept drift happens when the class labels associated with instances change over time. Weather forecasting, spam categorization and monitoring systems are some examples in which concept drift is a challenge. Concept evolution occurs when one or more new class labels emerge in the class label set [7]. As shown in Fig. 1 (B), new instances then arrive with class labels that were not seen before.
Fig. 1: (A) Fixed Number of Class Labels; (B) A Novel Class Has
Emerged (Concept Evolution)
Classification in data streams poses several challenges that researchers attempt to solve. The three main challenges of classification techniques are as follows [8]:
• Accuracy: it is the most important factor in classification algorithms, and concept drift directly influences it.
• Efficiency: building a classifier is costly in terms of processing, and updating the model is a further challenge because of drift.
• Ease of use: a classifier model should be usable in applications.
According to [9], single-model incremental classification and ensemble-based classification are the two major branches of data stream classification. The first works on a single classifier and updates it incrementally to cope with newly evolved class labels in the stream. It usually needs complex modifications to the internal structure of the classifier, and a single classifier is often unable to provide strong, accurate predictions. In contrast, an ensemble model combines different classifiers to improve the overall accuracy of predictions. If the base classifiers make independent errors and each performs better than random prediction (accuracy above 0.5 for two classes), a majority-vote ensemble can be more accurate than a single classifier; when the errors are correlated, the gain is not guaranteed.
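The 0.5 threshold mentioned above can be illustrated with a short calculation. The following is a minimal sketch, assuming a two-class problem, an odd number of base classifiers that all share the same accuracy, and fully independent errors (an idealization that rarely holds exactly in practice). The function name `majority_vote_accuracy` is introduced here for illustration only.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, base_accuracy: float) -> float:
    """Probability that a strict majority of independent classifiers is correct.

    Assumes two classes, identical accuracy for every base classifier and
    independent errors, so the number of correct votes is binomial.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * base_accuracy ** k
        * (1 - base_accuracy) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


if __name__ == "__main__":
    for p in (0.45, 0.50, 0.65):
        print(p, round(majority_vote_accuracy(25, p), 3))
```

Under these assumptions the vote amplifies whatever tendency the base classifiers have: above 0.5 the ensemble improves on its members, while below 0.5 it becomes worse than any of them.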
Because labeled instances are needed to build classifiers, researchers regard classification as a supervised task. The quality of a classifier depends heavily on the percentage of labeled instances available in the data stream. Many researchers have tried to use unlabeled instances as well as labeled ones because manual labeling of instances (by experienced agents) is costly and time consuming.
The remainder of the paper is organized as follows: Section 2 discusses some ensemble classification methods; Section 3 presents a review of semi-supervised and active learning algorithms; Section 4 is dedicated to future work; and finally Section 5 concludes the paper.
2. Incremental learning and ensemble methods
After ensembles were introduced in the 1990s, many researchers tried to improve prediction accuracy by using them [10]. An ensemble method (Fig. 2) creates a set of base classifiers from training data and classifies new instances by polling the base classifiers, that is, by taking a vote over their predictions [11]. Ensembles are popular because they improve classification accuracy in static environments [12], but they need some changes to adapt to the dynamic nature of data streams. In incremental learning, the learning algorithm runs whenever new instances emerge and adjusts the existing model accordingly [13]. Some of the reviewed incremental methods are discussed in Section 3 because they belong to the category of semi-supervised or active learning algorithms. This section reviews some incremental and ensemble methods.
Fig. 2: An Ensemble Model.
Liu Jing et al. [14] identified four main challenges of data stream classification and claimed that their proposed model addresses all of them: 1) infinite length, 2) concept drift, 3) arrival of novel classes and 4) lack of labeled instances. To handle concept evolution and the lack of labeled instances, a novel class detection mechanism is proposed. ECM-BDF (Ensemble Classification Model Based on Decision Feedback) divides the data stream into sequential chunks of appropriate size, builds a classifier for each chunk and keeps some of the created classifiers as the ensemble. In addition, classifiers built from newly labeled instances are used to update the ensemble. The novel class detection mechanism assumes a decision boundary around the training data: data that fall outside the boundary are considered outliers, and outliers with strong cohesion may be treated as an arriving new class. The model in [14] uses only labeled instances, but in many cases unlabeled instances far outnumber labeled ones, and models that consider only labeled instances usually have low accuracy.
Abdulsalam et al. [15] defined four scenarios for data streams and presented a three-phase model that addresses them. In Scenario 0, labeled records appear only at the beginning of the stream (in sufficient number to build a classifier) and the subsequent instances are unlabeled. Scenario 1 exhibits concept drift while labeled instances remain adequate for building a classifier. In Scenarios 2 and 3 there are not enough labeled instances; Scenario 3 is the more common one, in which labeled records arrive frequently and periodically. Fig. 3 shows the four scenarios.
In phase one, they introduced an approach to handle Scenario 0 in which stream decision tree construction is merged with the Random Forest algorithm. Phase two uses a self-adjusting algorithm that employs an entropy-based change-detection technique to address Scenario 1 (concept drift). Phase three aims to handle Scenarios 2 and 3; its key feature is the ability to determine when the current model is ready to deploy, which it does by defining a threshold on the minimum number of labeled records needed. The algorithm proposed in [15] considers only ordinal or numerical attributes and assumes that records are approximately uniformly distributed; these assumptions limit the applicability of the model.
AUE2 (Accuracy Updated Ensemble), proposed in [16], aims to handle different types of drift. It combines the accuracy-based weighting mechanisms of block-based ensembles with the incremental nature of Hoeffding Trees. The ensemble is updated by appending new classifiers and removing weak ones. In other words, the model improves the ensemble's reaction to different kinds of drift while reducing the influence of data chunk size on prediction accuracy. AUE2 is an enhancement of AUE1 [17], with changes in the weighting and updating mechanisms that reduce computation cost and increase prediction accuracy.
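The general pattern shared by the chunk-based ensembles above (train one classifier per chunk, weight members by accuracy on the newest chunk, drop the weakest) might be sketched as follows. This is a simplified illustration and not the AUE2 algorithm itself: the nearest-centroid base learner stands in for Hoeffding Trees, plain accuracy stands in for the published weighting formula, and the names `NearestCentroid`, `ChunkEnsemble` and `max_members` are assumptions made for this example.

```python
from collections import defaultdict


class NearestCentroid:
    """Tiny stand-in base learner (the reviewed methods use other learners)."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(x)
            sums[label] = [s + v for s, v in zip(sums[label], x)]
            counts[label] += 1
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, x):
        def squared_distance(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=squared_distance)


def accuracy(model, X, y):
    return sum(model.predict(x) == label for x, label in zip(X, y)) / len(y)


class ChunkEnsemble:
    def __init__(self, max_members=3):
        self.max_members = max_members
        self.members = []  # each entry is [weight, model]

    def update(self, X, y):
        # Re-weight existing members by their accuracy on the newest chunk.
        for member in self.members:
            member[0] = accuracy(member[1], X, y)
        # The classifier trained on the newest chunk gets the top weight.
        self.members.insert(0, [1.0, NearestCentroid().fit(X, y)])
        # Keep only the strongest members (stable sort keeps the newest first on ties).
        self.members.sort(key=lambda m: m[0], reverse=True)
        del self.members[self.max_members:]

    def predict(self, x):
        votes = defaultdict(float)
        for weight, model in self.members:
            votes[model.predict(x)] += weight
        return max(votes, key=votes.get)
```

After a sudden drift, members trained on the old concept score poorly on the new chunk, so their votes carry little weight and they are the first to be removed.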
Fig. 3: Scenarios Introduced in [15] for Data Streams.
Farid et al. [9] proposed an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. It uses traditional classifiers and updates the ensemble automatically to handle concept drift. Novel class detection is based on the distance among instances: instances of the same class should be close to each other, while instances of different classes should be far enough apart. If an instance is far enough from all existing clusters, it can be considered a member of a new class.
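The distance-based idea above, together with the outlier-cohesion test described earlier for ECM-BDF, can be sketched in a few lines. This is a minimal sketch, assuming each known class is summarized by clusters given as (centroid, radius) pairs; the function names and the `min_count` and `max_spread` thresholds are hypothetical and are not taken from [9] or [14].

```python
from math import dist


def is_outlier(x, clusters):
    """True if x lies outside every known cluster.

    clusters: list of (centroid, radius) pairs (an assumed representation).
    """
    return all(dist(x, centroid) > radius for centroid, radius in clusters)


def looks_like_novel_class(outliers, min_count=3, max_spread=1.0):
    """Treat outliers as a candidate novel class only if they are cohesive.

    Cohesion is approximated here by requiring enough outliers that all lie
    within max_spread of their common mean.
    """
    if len(outliers) < min_count:
        return False
    dim = len(outliers[0])
    center = [sum(p[i] for p in outliers) / len(outliers) for i in range(dim)]
    return max(dist(p, center) for p in outliers) <= max_spread


if __name__ == "__main__":
    known = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]
    stream = [(0.2, 0.1), (10.0, 10.0), (10.2, 10.0), (10.0, 10.3)]
    buffered = [x for x in stream if is_outlier(x, known)]
    print(looks_like_novel_class(buffered))
```

Buffering outliers before deciding keeps a single noisy point from being declared a new class.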
Some articles have addressed feature evolution, which occurs when the set of features changes over time [18] or whenever new features appear in the data [19].
DXMiner [19] introduced a model for handling feature evolution, but it had a high false positive rate (false novel class detection rate) and false negative rate (missed novel class detection rate) on some datasets. Moreover, it is unable to detect new classes if more than one new class arrives at a time. Afterwards, Masud et al. [20] tried to solve the problem of simultaneously arriving new classes. The model introduced in [21] is claimed by its authors to perform better than earlier models with respect to concept drift, concept evolution and feature evolution. To handle concept drift and concept evolution, they designed a framework in which each classifier is equipped with a novel class detector and is able to detect more than one novel class. To address feature evolution, they proposed a feature set homogenization technique.
Aggarwal et al. proposed a model that is able to adapt to changes in data streams (concept drift) [22]. Their On-Demand Classification dynamically determines the appropriate window size of past training data. They introduced supervised micro-clustering, in which micro-clusters are built only from training data and each micro-cluster is a set of related training instances that share the same class label. In doing so, they adapted an unsupervised clustering approach [23] to handle highly evolving data streams. Note that some parameters (such as the initial points and the size of the sliding window) must be set carefully to achieve appropriate accuracy, which is a drawback of the model. The model aims at handling concept drift but offers no mechanism for concept evolution. They used the KDD '99 data set [24] to evaluate the model.
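The defining property of supervised micro-clusters, namely that a cluster only ever absorbs training instances of its own class, can be illustrated as follows. This is a rough sketch under simplifying assumptions: each micro-cluster keeps only a count and a per-dimension linear sum, whereas published micro-cluster summaries typically hold richer statistics, and the `max_distance` threshold and all names here are hypothetical.

```python
from math import dist


class MicroCluster:
    """Class-pure summary of nearby training instances (simplified)."""

    def __init__(self, point, label):
        self.n = 1
        self.linear_sum = list(point)
        self.label = label

    @property
    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, point)]


def add_training_instance(clusters, point, label, max_distance=1.0):
    """Merge the instance into the nearest cluster of the same class, or start a new one."""
    same_class = [c for c in clusters if c.label == label]
    nearest = min(same_class, key=lambda c: dist(c.centroid, point), default=None)
    if nearest is not None and dist(nearest.centroid, point) <= max_distance:
        nearest.absorb(point)
    else:
        clusters.append(MicroCluster(point, label))
```

Because candidates are filtered by label before the distance test, a nearby instance of another class always opens its own micro-cluster instead of contaminating an existing one.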
3. Semi-supervised learning and active learning
Due to the nature of data streams (high volume, high speed, etc.), labeling every instance (by experienced agents) is not feasible, and researchers have proposed novel methods to handle this problem. According to [25], Semi-Supervised Learning (SSL) and Active Learning (AL) are two iterative approaches that employ unlabeled data in creating a classification model.
To reduce the manual labeling workload, Semi-Supervised Learning lets the machine label samples itself, while Active Learning attempts to find the most informative samples to be labeled by experts. The primary characteristics of SSL and AL are as follows:
• SSL: at each iteration it selects the sample that has the highest confidence and adds it with the label predicted by the machine itself, without any external (expert) involvement.
• AL: at each iteration it takes the instance that has the lowest confidence as the most informative one, selects it and asks the expert for its label. AL involves human experts and aims at selecting the most useful instances for training. It can greatly improve the model's performance and accelerate convergence.
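The two selection rules listed above differ only in which end of the confidence ranking they pick from. The following is a minimal sketch, assuming the current model exposes a class-probability dictionary for each unlabeled instance and that confidence is the largest of those probabilities; the function names are introduced for illustration only.

```python
def confidence(probabilities):
    """Confidence of a prediction, taken here as the largest class probability."""
    return max(probabilities.values())


def pick_for_self_labeling(pool):
    """SSL step: return the index of the most confident instance and its predicted label."""
    index = max(range(len(pool)), key=lambda i: confidence(pool[i]))
    predicted_label = max(pool[index], key=pool[index].get)
    return index, predicted_label


def pick_for_expert_query(pool):
    """AL step: return the index of the least confident instance, to be sent to the expert."""
    return min(range(len(pool)), key=lambda i: confidence(pool[i]))


if __name__ == "__main__":
    pool = [
        {"spam": 0.95, "ham": 0.05},
        {"spam": 0.55, "ham": 0.45},
        {"spam": 0.20, "ham": 0.80},
    ]
    print(pick_for_self_labeling(pool))
    print(pick_for_expert_query(pool))
```

In this toy pool the first instance would be self-labeled as spam, while the second, on which the model is nearly undecided, would be passed to the expert.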
Masud et al. [7] employed unlabeled data as well as labeled data in building the classifier model. Their starting point is that a high percentage of the data is unlabeled, because experienced agents label instances more slowly than data arrive, whereas earlier classification models employed only labeled data. To make predictions more accurate, their model builds the classifier from both labeled and unlabeled instances. They introduced a semi-supervised clustering algorithm and build classifiers on evolving data with a label propagation approach. The model considers both concept drift and concept evolution. They compared their model with the On-Demand method proposed in [22], and the results show that it performs better (in memory usage, computation time and accuracy) even though it uses only 10 percent labeled training data, compared with On-Demand, which uses 100 percent labeled data to build the classifier.
Semi-Supervised Classification based on Class Membership (SSCCM), proposed in [26], uses label membership in semi-supervised learning. The authors formulate the problem for labeled and unlabeled data in a unified objective function and solve it with an iterative strategy that moves closer to the final solution in each iteration. SSCCM produces both a label membership function and a decision function for classification, and the predictions of the two functions are expected to be consistent. If the two predictions for an instance are inconsistent, the instance is assumed to lie near the decision boundary and the prediction is probably unreliable. In fact, one function is sufficient for prediction, and the label membership function is preferred. Such inconsistencies can be used to identify instances that are difficult to classify and to route them to other means of classification (such as manual labeling). In this way, each function is verified by the other and the reliability of classification is improved.
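The consistency check described above can be expressed as a small routine. This is only a sketch of the checking step, not of how SSCCM learns its two functions: it assumes the label membership output is available as a dictionary from class label to membership degree and the decision function output as a single label, and the function name is hypothetical.

```python
def predict_with_consistency_check(membership, decision_label):
    """Return the membership-based label and whether both predictions agree.

    membership: dict mapping class label -> membership degree (assumed form).
    decision_label: label produced by the decision function.
    """
    membership_label = max(membership, key=membership.get)
    is_reliable = membership_label == decision_label
    return membership_label, is_reliable


if __name__ == "__main__":
    cases = [
        ({"pos": 0.9, "neg": 0.1}, "pos"),
        ({"pos": 0.52, "neg": 0.48}, "neg"),
    ]
    for membership, decision in cases:
        label, reliable = predict_with_consistency_check(membership, decision)
        print(label, "accept" if reliable else "send to manual labeling")
```

The membership-based label is returned in both cases, following the stated preference for that function, and the flag marks the instances that may deserve a second look.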
SUN, a Semi-supervised classification algorithm for data streams with concept drifts and UNlabeled data [27], aims to handle concept drift in data streams that include unlabeled data. SUN uses a k-modes based algorithm that incrementally places concept clusters in the leaves of the constructed decision tree. Converting categorical data into numerical values, as traditional clustering algorithms require, does not necessarily give meaningful results if the categories have no particular order. Therefore, the results of k-means and k-median are not appropriate for such data, and a k-modes based algorithm is introduced in [27].
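The reason k-modes suits unordered categorical data is visible in its two building blocks: a dissimilarity that counts mismatching attributes and a cluster representative made of the most frequent value per attribute. The sketch below shows these two generic k-modes ingredients only; it is not the incremental variant used inside SUN, and the function names are chosen here for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Representative of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


if __name__ == "__main__":
    records = [("red", "small"), ("red", "large"), ("blue", "small")]
    mode = cluster_mode(records)
    print(mode, [matching_dissimilarity(r, mode) for r in records])
```

No numeric encoding of "red" or "blue" is needed, so no artificial ordering is imposed on the categories.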
According to the theory of Naive Bayes, the online error of Naive Bayes decreases for a fixed (unchanging) distribution of instances, while it increases when the distribution changes. In [28], a change in data distribution is reflected by a change in attribute dimensions. Thus, to deal with concept drift, SUN compares the historical concept with the new concept and considers the distribution of class labels to track concept drift.
Zhang et al. proposed an ensemble model in [30] that uses a combination of classification and clustering for mining data streams. They identified two challenges in combining the two methods: 1) the generated clusters carry only a cluster number and no class information about their instances, and 2) because of concept drift, combining clusters and classifiers in one ensemble is a difficult task. Zhang et al. proposed a solution for each challenge: 1) a label propagation technique that extracts useful information (a label) for each cluster from its instances, and 2) a weighting approach that weighs each model according to its consistency with the model constructed from the most up-to-date data chunk. The approach in [30] assumes that the class labels present in unlabeled data chunks are the same as those in labeled chunks, which means it offers no solution for concept evolution.
A classifier ensemble-based active learning framework is proposed in [28], which selectively labels instances to build an ensemble classifier. The authors show that the variance of a classifier ensemble is directly related to its error rate, so reducing the ensemble variance is equivalent to improving prediction accuracy. Hence, the agent should label the instances that minimize the classifier ensemble variance, and a Minimum Variance (MV) principle is proposed. To determine the weight values of the ensemble members, an optimal weight calculation method is also proposed in [28]. Finally, MV and optimal weighting are combined into a single framework.
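One plausible reading of the minimum-variance idea is to query labels for the instances on which the ensemble members disagree most, since resolving those is expected to reduce the ensemble variance the most. The sketch below follows that reading and is not the exact criterion or weighting scheme of [28]: the variance measure (sum over classes of the variance of member probabilities), the unweighted members and the function names are all assumptions made for illustration.

```python
from statistics import pvariance


def ensemble_variance(member_probs):
    """Disagreement of ensemble members on one instance.

    member_probs: one dict per member, mapping class label -> probability.
    Returns the sum over classes of the variance across members.
    """
    classes = member_probs[0].keys()
    return sum(pvariance([p[c] for p in member_probs]) for c in classes)


def select_for_labeling(candidates, budget):
    """Indices of the `budget` instances with the largest ensemble variance."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: ensemble_variance(candidates[i]),
        reverse=True,
    )
    return ranked[:budget]


if __name__ == "__main__":
    candidates = [
        [{"a": 0.9, "b": 0.1}, {"a": 0.9, "b": 0.1}],  # members agree
        [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}],  # members disagree
    ]
    print(select_for_labeling(candidates, budget=1))
```

Instances on which all members agree contribute no variance and are never selected while more contested instances remain.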
Hosseini et al. [29] made use of recurring concepts in learning data stream classifiers. They used two approaches: Active Learning (AL) and weighted classifiers. A pool of classifiers is updated continuously, and each classifier in the pool describes one of the existing concepts. When new data arrive, the model classifies the instances; after the labels have been determined, either an existing classifier in the pool is updated or a new classifier is inserted into the pool. Two methods, Bayesian and heuristic, are used for detecting recurring concepts and updating the pool.
4. Future works
AL can help in cases in which an expert is available to determine class labels and steer the model toward high accuracy. When a vast volume of data arrives at extremely high speed, AL may yield low accuracy and semi-supervised learning is preferred. SSL tries to extract useful information from unlabeled data automatically (which makes it fast), but when the initial model is very weak it might produce wrong labels and introduce mistakes into the training set. Furthermore, the instances with the highest confidence are not necessarily the most useful ones, so SSL generally performs worse than AL. Combining AL and SSL appears to be a research gap.
Concept evolution (especially the simultaneous arrival of new classes) needs more effort because little research is available in this area and the proposed methods are often too complicated.
An integrated ensemble model that addresses all challenges, such as concept drift, concept evolution and scarcity of labeled instances, is needed. There is some research on the topic, but the proposed models have usually been tested only on well-known datasets; system-level experiments and implementations over longer periods of time are needed.
Table 1 gives a brief summary of the reviewed research and can help readers find topics for future work.
Table 1: A Comparison of Investigated Research Papers
| Approach | Tech. | CE? | CD? | Case | Short Description | Year | Author(s) |
|---|---|---|---|---|---|---|---|
| Semi-supervised clustering + Label propagation | E | Y | Y | SynD, SynDE, KDD '99, ASRS | Utilizing both labeled and unlabeled instances to train and update the classification model. | 2011 | Masud et al. [7] |
| Supervised micro-clustering, Cluster-based, Sliding window | I | N | Y | KDD '99 | Considering only labeled instances and building the classifier through an on-demand classification process which can dynamically select the appropriate window of past training data. | 2006 | Aggarwal CC et al. [22] |
| Novel class label detection, feedback from unsupervised mechanisms | E | Y | Y | SynCN, KDD '99 | Data stream classification with an ensemble model based on decision feedback. | 2014 | LIU Jing et al. [14] |
| Semi-supervised, Lossless Homogenizing Conversion for feature evolution | I | Y | Y | Twitter, ASRS, KDD '99, Forest | Considering a dynamic feature space in classification and addressing feature evolution. | 2010 | Masud et al. [19] |
| Semi-supervised, k-modes based clustering, statistical approach in detecting concept drifts | I | N | Y | SEA, STAGGER, KDD '99, Yahoo shopping data, LED | Handling both challenges of concept drift and unlabeled data streams. | 2012 | Xindong Wu et al. [27] |
| Semi-supervised, Label propagation in clusters and weighting in updating the ensemble framework | E | N | Y | The Malicious URLs Detection dataset, the Intrusion Detection dataset | Accumulating labeled records and combining them to create a classifier according to a threshold. | 2010 | Peng Zhang et al. [30] |
| Entropy-based concept drift detection, Random Forest | E | N | Y | Synthetic dataset, Sloan Digital Sky Survey (SDSS) | Considering multiple target class labels. | 2011 | Abdulsalam et al. [15] |
| Accuracy-based weighting, Hoeffding Trees | E | N | Y | Some datasets from the UCI repository [31] | Reacting to different types of concept drift. | 2014 | Dariusz Brzezinski et al. [16] |
| Minimum Variance (MV), optimal weighting | E | N | Y | Data stream generated by a hyperplane-based synthetic data stream generator | Selecting the best instances to be labeled by an external agent, with the purpose of decreasing classifier ensemble variance. | 2010 | Xingquan Zhu et al. [28] |
| Bayesian formulation, heuristic methods | E | N | Y | Data stream generated by a hyperplane-based synthetic data stream generator, and an emailing list dataset | Constructing a pool of classifiers and updating the pool according to new classifiers created from newly arrived data to improve the accuracy of the ensemble. | 2011 | Hosseini et al. [29] |
| Decision Tree Learning, Similarity Based Clustering | E | Y | Y | Some datasets from the UCI repository [31] | Handling concept evolution by considering inter-class distance and intra-class distance. | 2013 | Dewan Md. Farid et al. [9] |
| Semi-supervised, Label membership function, decision function | I | N | N | Some datasets from the UCI repository [31] | Enhancing classification reliability by a consistency check between the predictions of two functions. Each instance has a likelihood for each class label instead of belonging to only one class. | 2012 | Yunyun Wang et al. [26] |
CD: Concept Drift, CE: Concept Evolution, E: Ensemble, I: Incremental, Y: Yes, N: No, Tech.: Technique.
5. Conclusion
Data stream mining includes techniques such as classification, clustering and frequent pattern mining. This paper has given a review focusing on ensemble methods, semi-supervised learning and active learning. Traditional data stream classification methods employ only labeled data and often have lower accuracy when the system lacks enough labeled data. In the real world, scarcity of labeled instances is common because labeling is a time-consuming and costly process. Hence, researchers have recently focused on using unlabeled data as well as labeled data in creating classification models. AL and SSL are two approaches to using unlabeled data. Handling concept drift is an important issue in data stream mining, but only a few references have addressed concept evolution as well.
References
[1] H. H. Mahnoosh Kholghi, Mohammad Reza Keyvanpour,
"Classification and Evaluation of Data Mining Techniques for Data
Stream Requirements," presented at the International Symposium
on Computer, Communication, Control and Automation, Tainan,
Taiwan, 2010, http://dx.doi.org/10.1109/3CA.2010.5533759.
[2] Data Streams Models and Algorithms: Springer, 2007.
[3] J. D. U. Anand Rajaraman, Mining of Massive Datasets:
Cambridge, 2012.
[4] J. G. M. M. Gaber, Learning from Data Streams: Springer, 2007.
[5] J. Gama, Knowledge Discovery from Data Streams: Chapman &
Hall/CRC, Taylor & Francis Group, 2010,
http://dx.doi.org/10.1201/EBK1439826119.
[6] M. Kantardzic, Data mining : concepts, models, methods and
algorithms: Wiley-IEEE Press, 2011,
http://dx.doi.org/10.1002/9781118029145.ch1.
[7] C. W. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han,
Kevin W. Hamlen & Nikunj C. Oza, "Facing the reality of data
stream classification: coping with scarcity of labeled data," Knowl
Inf Syst, vol. 33, p. 32, 2011, http://dx.doi.org/10.1007/s10115-011-
0447-8.
[8] O. M. L. Rokach, Data Mining and Knowledge Discovery
Handbook, 2 ed.: Springer, 2010, http://dx.doi.org/10.1007/978-0-
387-09823-4.
[9] Dewan Md. Farid, Li Zhang, Alamgir Hossain, Chowdhury Mofizur
Rahman, Rebecca Strachan, Graham Sexton & Keshav Dahal, "An
adaptive ensemble classifier for mining concept drifting data
streams," Expert Systems with Applications, vol. 40, p. 12, 2013.
[10] R. Polikar, "Ensemble based systems in decision making," IEEE
Circuits and Systems Magazine, vol. 6, p. 25, 2006,
http://dx.doi.org/10.1109/MCAS.2006.1688199.
[11] M. S. V. K. Pang-Ning Tan, Introduction to Data Mining vol. 1:
Pearson Education, 2006.
[12] L. I. Kuncheva, Combining Pattern Classifiers: Methods and
Algorithms. USA: Wiley, 2004,
http://dx.doi.org/10.1002/9781118914564.refs.
[13] P. Z. Wenyu Zang, Chuan Zhou & Li Guo, "Comparative study
between incremental and ensemble learning on data streams: Case
study," Journal of Big Data, vol. 1, p. 16, 2014,
http://dx.doi.org/10.1186/2196-1115-1-5.
[14] X. G.-s. LIU Jing, ZHENG Shi-hui, XIAO Da & GU Li-ze, "Data
streams classification with ensemble model based on decision
feedback," The Journal of China Universities of Posts and
Telecommunications, vol. 21, p. 7, 2014,
http://dx.doi.org/10.1016/S1005-8885(14)60272-7.
[15] D. B. S. P. M. Hanady Abdulsalam, "Classification Using
Streaming Random Forests," IEEE TRANSACTIONS ON
KNOWLEDGE AND DATA ENGINEERING, vol. 23, p. 15, 2011,
http://dx.doi.org/10.1109/TKDE.2010.36.
[16] D. B. J. Stefanowski, "Reacting to Different Types of Concept
Drift: The Accuracy Updated Ensemble Algorithm," IEEE
TRANSACTIONS ON NEURAL NETWORKS AND LEARNING
SYSTEMS, vol. 25, p. 14, 2014,
http://dx.doi.org/10.1109/TNNLS.2013.2251352.
[17] D. B. J. Stefanowski, "Accuracy Updated Ensemble for Data
Streams with Concept Drift," presented at the International
Conference on Hybrid Artificial Intelligent Systems, 2011,
http://dx.doi.org/10.1007/978-3-642-21222-2_19.
[18] J. Gao. (2014-07-01). Data Stream Mining: Challenges and
Techniques. Available: http://www.cse.buffalo.edu/~jing/talks.htm
[19] Q. C. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han
& Bhavani Thuraisingham, "Classification and Novel Class
Detection of Data Streams in a Dynamic Feature Space," presented
at the European conference on Machine learning and knowledge
discovery in databases, Berlin, 2010, http://dx.doi.org/10.1007/978-
3-642-15883-4_22.
[20] Q. C. Mohammad M. Masud, Latifur Khan, Charu Aggarwal, Jing
Gao, Jiawei Han & Bhavani Thuraisingham, "Addressing Concept-
Evolution in Concept-Drifting Data Streams," presented at the
IEEE International Conference on Data Mining, 2010,
http://dx.doi.org/10.1109/ICDM.2010.160.
[21] C. W. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han,
Kevin W. Hamlen & Nikunj C. Oza, "Classification and Adaptive
Novel Class Detection of Feature-Evolving Data Streams," IEEE
TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
vol. 25, p. 14, 2013, http://dx.doi.org/10.1109/TKDE.2012.109.
[22] J. H. Charu C. Aggarwal, Jianyong Wang & Philip S. Yu, "A
Framework for On-Demand Classification of Evolving Data
Streams," IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, vol. 18, p. 13, 2006,
http://dx.doi.org/10.1109/TKDE.2006.69.
[23] J. H. P. S. Y. Charu C. Aggarwal, "A Framework for Clustering
Evolving Data Streams," in International Conferences of Very
Large Data Bases, Berlin, 2003, p. 11.
[24] (1999, 2014-06-06). KDD Cup 1999 Data. Available:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[25] X. X. G. Q. Yan Leng, "Combining Active Learning and Semi-
supervised Learning to Construct SVM Classifier," Knowledge-
Based Systems, vol. in press, p. 31, 2014,
http://dx.doi.org/10.1016/j.knosys.2013.01.032.
[26] S. C. Z.-H. Z. Yunyun Wang, "New Semi-Supervised Classification
Method Based on Modified Cluster Assumption," IEEE
TRANSACTIONS ON NEURAL NETWORKS AND LEARNING
SYSTEMS, vol. 23, p. 14, 2012,
http://dx.doi.org/10.1109/TNNLS.2012.2186825.
[27] P. L. X. H. Xindong Wu, "Learning from concept drifting data
streams with unlabeled data," Neurocomputing, vol. 92, p. 11,
2012, http://dx.doi.org/10.1016/j.neucom.2011.08.041.
[28] P. Z. Xingquan Zhu, Xiaodong Lin & Yong Shi, "Active Learning
From Stream Data Using Optimal Weight Classifier Ensemble,"
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND
CYBERNETICS-PART B: CYBERNETICS, vol. 40, p. 15, 2010,
http://dx.doi.org/10.1109/TSMCB.2010.2042445.
[29] Z. A. H. B. Mohammad Javad Hosseini, "Pool and Accuracy Based
Stream Classification: A new ensemble algorithm on data stream
classification using recurring concept detection," presented at the
11th IEEE International Conference on Data Mining Workshops,
2011, http://dx.doi.org/10.1109/ICDMW.2011.137.
[30] X. Z. Peng Zhang, Jianlong Tan & Li Guo, "Classifier and Cluster
Ensembles for Mining Concept Drifting Data Streams," presented
at the 2010 IEEE 10th International Conference on Data Mining
(ICDM), Sydney, NSW, 2010,
http://dx.doi.org/10.1109/ICDM.2010.125.
[31] (2014-07-08). UC Irvine Machine Learning Repository. Available:
http://archive.ics.uci.edu/ml/.
... such as being real-time, continuous, temporally-ordered, fast-changing and potentially infinite in length [21,22,40,75],which separate data stream classification algorithms from traditional data mining. ...
... With the advent of the golden age of data mining, a large number of data are presented in the forms of data streams, such as supermarket transaction records, network search requests, telecommunication call records, and sensor network data [2,22,40]. These data streams contain valuable knowledge that needs to be processed urgently. ...
Article
Classification is a hotspot in data stream mining and has gained increasing interest from various research fields. Compared with traditional data stream classification methods, Extreme Learning Machine (ELM) has attracted much attention because of its efficiency and simplicity, which inspired the development of many improved ELM algorithms that have been proposed in the past few years. This paper mainly reviews the current state of ELM used to classify data streams and its variants. First, we introduce the principles of ELM and the existing problems of data stream classification. Then we provide an overview of various improvements made to ELM, which further improves its stability, accuracy and generalization ability and present the practical applications of ELM used in data stream classification. Finally, the paper highlights the existing problems of ELM used for data stream mining and development prospects of ELM in the future.
... To achieve accurate predictions for data, traditional machine learning techniques often struggle to handle the challenges posed by big data, including its sheer volume, complexity, and high-dimensional nature [1]. On the other hand, data-driven methods utilizing deep learning have attracted interest due to their capacity to perform statistical analysis and information extraction automatically and successfully on large-scale, multi-source, and high-dimensional data, thereby overcoming the limitations of traditional prediction methods [2]. Recurrent neural networks (RNNs) are a kind of neural network that is particularly good at handling sequential data. ...
Article
In machine learning, datastream prediction is a challenging issue, particularly when dealing with enormous amounts of continuous data. The dynamic nature of data makes it difficult for traditional models to handle and sustain real-time prediction accuracy. This research uses a multi-processor long short-term memory (MPLSTM) architecture to present a unique framework for datastream regression. By employing several central processing units (CPUs) to divide the datastream into multiple parallel chunks, the MPLSTM framework illustrates the intrinsic parallelism of long short-term memory (LSTM) networks. The MPLSTM framework ensures accurate predictions by skillfully learning and adapting to changing data distributions. Extensive experimental assessments on real-world datasets have demonstrated the clear superiority of the MPLSTM architecture over previous methods. This study uses the transformer, the most recent deep learning breakthrough technology, to demonstrate how well it can handle challenging tasks and emphasizes its critical role as a cutting-edge approach to raising the bar for machine learning.
... 6) Data Stream Classification: Stream data is often large, dynamically changing, possibly infinite, and contains multidimensional features. Interest in data stream mining is increasing due to its presence in a wide variety of real-world applications such as e-commerce, banking, sensor data, and telecommunications records [9]. ...
... These data are possibly infinite, making storing them in a storage device impossible, requiring the data to be processed sequentially or streamed. A data streaming algorithm is also designed to be trained online compared to batch learning algorithms that require the whole dataset to be stored for the model to be trained [7][8][9]. Finally, volatility refers to the dynamic environment of data streams, where changes can occur spontaneously. In the data stream literature, this situation is known as concept drift [10,11]. ...
Article
Data stream mining deals with processing large amounts of data in nonstationary environments, where the relationship between the data and the labels often changes. Such dynamic relationships make it difficult to design a computationally efficient data stream processing algorithm that is also adaptable to the nonstationarity of the environment. To make the algorithm adaptable to the nonstationarity of the environment, concept drift detectors are attached to detect the changes in the environment by monitoring the error rates and adapting to the environment’s current state. Unfortunately, current approaches to adapt to environmental changes assume that the data stream is fully labeled. Assuming a fully labeled data stream is a flawed assumption as the labeling effort would be too impractical due to the rapid arrival and volume of the data. To address this issue, this study proposes to detect concept drift by anticipating a possible change in the true label in the high confidence prediction region. This study also proposes an ensemble-based concept drift adaptation approach that transfers reliable classifiers to the new concept. The significance of our proposed approach compared to the current baselines is that our approach does not use a performance measure as the drift signal or assume a change in data distribution when concept drift occurs. As a result, our proposed approach can detect concept drift when labeled data are scarce, even when the data distribution remains static. Based on the results, this proposed approach can detect concept drifts and fully supervised data stream mining approaches and performs well on mixed-severity concept drift datasets.
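The baseline that this abstract contrasts itself with, a detector that monitors the classifier's error rate, can be sketched as follows. This is a minimal illustration in the spirit of error-rate detectors such as DDM, not the method proposed in the abstract; the class name `ErrorRateDriftDetector` and the parameters `min_samples` and `drift_sigmas` are assumptions made for the example. Note that it needs a true label for every instance, which is exactly the limitation the abstract points out.

```python
import math


class ErrorRateDriftDetector:
    """Signals drift when the running error rate rises well above its best level.

    Hypothetical helper for illustration; it assumes every prediction can be
    checked against a true label (a fully labeled stream).
    """

    def __init__(self, min_samples=30, drift_sigmas=3.0):
        self.min_samples = min_samples    # warm-up length before any decision
        self.drift_sigmas = drift_sigmas  # tolerated deviation, in units of the best std
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.best_p = float("inf")  # error rate at the best point seen so far
        self.best_s = float("inf")  # standard deviation at that point

    def update(self, mistake):
        """Feed one outcome (True = misclassified); return True if drift is signalled."""
        self.n += 1
        self.errors += int(mistake)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)  # binomial standard deviation
        if self.n < self.min_samples:
            return False
        if p + s < self.best_p + self.best_s:
            self.best_p, self.best_s = p, s
        if p + s > self.best_p + self.drift_sigmas * self.best_s:
            self.reset()  # start over on the new concept
            return True
        return False
```

A typical use would call `update(prediction != label)` once per instance and retrain or replace the classifier whenever it returns `True`.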
... Another categorizing of data stream classification methods is incremental and ensemble methods [10]. In incremental learning, a single classifier is being used and updated incrementally to deal with new class labels. ...
Article
Data streams have attracted considerable attention from researchers for years. Mining this type of data is challenging because of its special nature. Classification is one of the major approaches in Data Stream Mining (DSM). Concept drift (changes in the pattern of data over time) is one of the major challenges that must be handled in data streams. Another challenge is high-dimensional data streams. This paper reviews classification techniques in adaptive data stream mining, focusing on both challenges (concept drift and dimensionality reduction) and dividing these techniques into incremental and ensemble methods. Incremental classifiers such as Very Fast Decision Trees (VFDT) and Concept-adapting Very Fast Decision Trees (CVFDT) were tested. Adaptive Random Forests (ARF) was taken as an example of adaptive ensemble classifiers. Furthermore, a practical comparison of VFDT, CVFDT and ARF was carried out in terms of accuracy, processing speed, and tree size. Accuracy did not vary much between the three algorithms. ARF had much better results in speed and the smallest number of tree nodes.
Chapter
Today, it is evident that the Internet of Things, real-time data processing, and artificial intelligence technologies are essential in industrial settings to enable early warning autonomous anomaly detection systems. Such systems can detect anomaly situations that could cause failures shortly after they occur, allowing necessary maintenance to be performed promptly. In this research, a software platform has been designed and developed to collect data from sensors placed on industrial equipment to monitor their condition using the required IoT infrastructure. The real-time data collected from this platform is analyzed using real-time data processing techniques. Here, a business process is introduced for instant anomaly detection using real-time clustering analysis methods. To validate the proposed business process architecture, a prototype software has been developed, and its ability to detect anomaly situations has been evaluated. The results show that the proposed business process architecture is effective in real-time anomaly detection and can successfully detect anomalies that can lead to industrial equipment failures.
Keywords: internet of things, anomaly detection, artificial intelligence, clustering, real-time unsupervised machine learning, real-time streaming-data prediction
Article
Many everyday applications generate massive amounts of data in the form of streams at ever higher speed, such as medical data, click streams, internet records and banking transactions. In contrast to traditional static data, data streams have some inherent properties, to name a few, infinite length, concept drift, multiple labels and concept evolution. Among all the data mining tasks, classification is one of the basic topics in data stream mining and has gained more and more attention among different research communities. Extreme Learning Machine (ELM) has drawn much interest in data classification due to its high efficiency, universal approximation capability, generalization ability, and simplicity, which have greatly inspired the development of many ELM-based algorithms and their applications during the past decades. In this paper, we mainly provide a comprehensive review of ELM theoretical research and its variants in data stream classification, and categorize these algorithms from different perspectives. Firstly, we briefly introduce the basic principles of ELM and its characteristics. Secondly, we give an overview of different ELM variants that address the particular issues of data stream classification. Thirdly, we present an overview of different strategies to optimize the ELM, which have further improved the stability, accuracy and generalization ability of ELM, and briefly introduce some practical applications of ELM in data stream classification. Finally, we conduct several groups of experiments to compare the performance of ELM-based models addressing the focused issues. Also, the open issues and prospects of ELM models used for stream classification are discussed, which are worth further study in the future.
Article
A novel method of mirror motion recognition by a rehabilitation robot with multi-channel sEMG signals is proposed, aiming to help stroke patients complete rehabilitation training movements. Firstly, bilateral mirror training is used and a model of muscle synergy with basic sEMG signals is established. Secondly, the constrained L1/2-NMF is used to extract the main sEMG signal information, which can also reduce the limb movement characteristics. Finally, the relationship between sEMG signal characteristics and upper limb movement is described by TSSVD-ELM, which is applied to improve the model stability. The validity and feasibility of the proposed strategy are verified by the experiments in this paper, and the rehabilitation robot can move with the mirror upper limb. By comparing the method proposed in this paper with PCA and full-action feature extraction, it is confirmed that its convergence speed is faster and its feature extraction accuracy is higher, so it can be used in rehabilitation robot systems.
Conference Paper
Microorganisms do co-exist with other living organisms and exhibit the greatest genetic and metabolic activity. They have evolved various mechanisms to survive pressure exerted by competitive environmental challenges [1]. Infection is the invasion of the host by microorganisms, which then multiply in close association with the host's tissues. Infections may differ in severity and may range from inapparent to fulminating [2]. There has been a continual battle between humans and the multitude of microorganisms that cause infection and diseases [3]. Antimicrobial agents are among the few drugs that cure by eliminating the infective microorganisms [4]. Development of antimicrobials for clinical use has been successful in targeting essential components of general areas of bacterial metabolism, namely: cell wall synthesis, protein synthesis, ribonucleic acid (RNA) synthesis, deoxyribonucleic acid (DNA) synthesis, and intermediary metabolism [5]. The successful use of antimicrobial agents to inhibit and eliminate the infectious organisms has been facing challenges and difficulties because microorganisms are developing various forms of resistance to the drugs, and as use of antimicrobial drugs increases, so do the level and complexity of the resistance [3]. Emergence of resistance to multiple antimicrobial agents in pathogens has become an emergency public health threat as there are fewer, or sometimes no, effective antimicrobial agents available for infections caused by these pathogens. Currently, most widely used antimicrobial agents are subject to resistance and even some newer agents are facing the same challenge. The resistance has generally been met through the discovery of novel antimicrobial agents and by use of derivatives prepared by semisynthetic methods, which are not affected by existing resistance mechanisms [6]. This chapter reviews the pharmacologic concepts and mechanisms of action of selected antimicrobial agents.
In particular, this chapter will describe the mechanisms of generation of microbial resistance variants in different organisms, the mechanism of action of antimicrobial agents, and ways to prevent the emergence and spread of antimicrobial resistance.
Article
With the unlimited growth of real-world data size and the increasing requirement for real-time processing, immediate processing of big stream data has become an urgent problem. In stream data, hidden patterns commonly evolve over time (i.e., concept drift), for which many dynamic learning strategies have been proposed, such as incremental learning and ensemble learning. To the best of our knowledge, no work has systematically compared these two methods. In this paper we conduct a comparative study between these two learning methods. We first introduce the concept of “concept drift”, and propose how to quantitatively measure it. Then, we recall the history of incremental learning and ensemble learning, introducing milestones of their developments. In experiments, we comprehensively compare and analyze their performances w.r.t. accuracy and time efficiency, under various concept drift scenarios. We conclude with several possible future research problems.
Article
Data stream mining has been receiving increased attention due to its presence in a wide range of applications, such as sensor networks, banking, and telecommunication. One of the most important challenges in learning from data streams is reacting to concept drift, i.e., unforeseen changes of the stream's underlying data distribution. Several classification algorithms that cope with concept drift have been put forward, however, most of them specialize in one type of change. In this paper, we propose a new data stream classifier, called the Accuracy Updated Ensemble (AUE2), which aims at reacting equally well to different types of drift. AUE2 combines accuracy-based weighting mechanisms known from block-based ensembles with the incremental nature of Hoeffding Trees. The proposed algorithm is experimentally compared with 11 state-of-the-art stream methods, including single classifiers, block-based and online ensembles, and hybrid approaches in different drift scenarios. Out of all the compared algorithms, AUE2 provided best average classification accuracy while proving to be less memory consuming than other ensemble approaches. Experimental results show that AUE2 can be considered suitable for scenarios, involving many types of drift as well as static environments.
Article
Data stream classification poses many challenges to the data mining community. In this paper, we address four such major challenges, namely, infinite length, concept-drift, concept-evolution, and feature-evolution. Since a data stream is theoretically infinite in length, it is impractical to store and use all the historical data for training. Concept-drift is a common phenomenon in data streams, which occurs as a result of changes in the underlying concepts. Concept-evolution occurs as a result of new classes evolving in the stream. Feature-evolution is a frequently occurring process in many streams, such as text streams, in which new features (i.e., words or phrases) appear as the stream progresses. Most existing data stream classification techniques address only the first two challenges, and ignore the latter two. In this paper, we propose an ensemble classification framework, where each classifier is equipped with a novel class detector, to address concept-drift and concept-evolution. To address feature-evolution, we propose a feature set homogenization technique. We also enhance the novel class detection module by making it more adaptive to the evolving stream, and enabling it to detect more than one novel class at a time. Comparison with state-of-the-art data stream classification techniques establishes the effectiveness of the proposed approach.
Article
Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they also are valid for different areas of machine learning and data mining.
Article
One key issue for most classification algorithms is that they need large amounts of labeled samples to train the classifier. Since manual labeling is time consuming, researchers have proposed active learning and semi-supervised learning techniques to reduce the manual labeling workload. There is a certain degree of complementarity between active learning and semi-supervised learning, and therefore some studies combine them to further reduce the manual labeling workload. However, research on combining active learning and semi-supervised learning for SVM classifiers is rare. Of the numerous SVM active learning algorithms, the most popular is the one that queries the sample closest to the current classification hyperplane in each iteration, which is denoted as SVM_AL in this paper. Realizing that SVM_AL is only interested in samples that are more likely to be on the class boundary, while ignoring the large number of remaining unlabeled samples, this paper designs a semi-supervised learning algorithm to make full use of the remaining non-queried samples, and further forms a new active semi-supervised SVM algorithm. The proposed active semi-supervised SVM algorithm uses active learning to select class boundary samples, and semi-supervised learning to select class central samples, since class central samples are believed to better describe the class distribution and to help SVM_AL find the boundary samples more precisely. In order not to introduce too many labeling errors when exploring class central samples, the label changing rate is used to ensure the reliability of the predicted labels. Experimental results show that the proposed active semi-supervised SVM algorithm performs much better than the pure SVM active learning algorithm, and thus can further reduce the manual labeling workload.
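The query rule mentioned in this abstract, asking for the label of the sample closest to the current classification hyperplane, can be illustrated with a short sketch. It assumes a linear decision function w·x + b whose parameters are already available from some trained SVM; the function names `distance_to_hyperplane` and `query_closest_to_boundary` are hypothetical, and the semi-supervised part of the abstract is not covered.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from point x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / norm


def query_closest_to_boundary(w, b, unlabeled):
    """Return the index of the unlabeled sample nearest to the hyperplane."""
    return min(
        range(len(unlabeled)),
        key=lambda i: distance_to_hyperplane(w, b, unlabeled[i]),
    )


# Illustrative values only: a 2-D hyperplane x1 - x2 = 0 and three candidates.
w, b = [1.0, -1.0], 0.0
pool = [(3.0, 0.0), (1.0, 0.8), (-2.0, 2.0)]
chosen = query_closest_to_boundary(w, b, pool)
```

In an active learning loop, the chosen sample would be sent to an annotator, added to the labeled set, and the classifier retrained before the next query.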
Article
The main challenges of data streams classification include infinite length, concept-drifting, arrival of novel classes and lack of labeled instances. Most existing techniques address only some of them and ignore others. So an ensemble classification model based on decision-feedback (ECM-BDF) is presented in this paper to address all these challenges. Firstly, a data stream is divided into sequential chunks and a classification model is trained from each labeled data chunk. To address the infinite length and concept-drifting problem, a fixed number of such models constitute an ensemble model E and subsequent labeled chunks are used to update E. To deal with the appearance of novel classes and limited labeled instances problem, the model incorporates a novel class detection mechanism to detect the arrival of a novel class without training E with labeled instances of that class. Meanwhile, unsupervised models are trained from unlabeled instances to provide useful constraints for E. An extended ensemble model Ex can be acquired with the constraints as feedback information, and then unlabeled instances can be classified more accurately by satisfying the maximum consensus of Ex. Experimental results demonstrate that the proposed ECM-BDF outperforms traditional techniques in classifying data streams with limited labeled data.
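The chunk-based part of this scheme (one model per labeled chunk, a fixed number of models in the ensemble E, and updates from later chunks) can be sketched as below. This is a simplified illustration, not ECM-BDF itself: the nearest-centroid base learner, the rule of keeping the models that score best on the newest chunk, and the names `NearestCentroid`, `ChunkEnsemble` and `max_models` are all assumptions, and the novel class detection and decision-feedback steps are omitted.

```python
from collections import Counter


class NearestCentroid:
    """Tiny stand-in base learner: predicts the class whose mean is nearest."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, value in enumerate(x):
                acc[j] += value
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def accuracy(model, X, y):
    return sum(model.predict(x) == t for x, t in zip(X, y)) / len(y)


class ChunkEnsemble:
    """Keeps at most max_models classifiers, one trained per labeled chunk."""

    def __init__(self, max_models=3):
        self.max_models = max_models
        self.models = []

    def update(self, X, y):
        # Train on the newest chunk, then keep the models that fit it best.
        # The new model is listed first so that it wins ties against older ones.
        candidates = [NearestCentroid().fit(X, y)] + self.models
        candidates.sort(key=lambda m: accuracy(m, X, y), reverse=True)
        self.models = candidates[: self.max_models]

    def predict(self, x):
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]  # simple majority vote
```

Because outdated models score poorly on recent chunks, they are pushed out after a few updates, which is how a bounded ensemble copes with both infinite length and concept drift.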
Article
It is challenging to use traditional data mining techniques to deal with real-time data stream classifications. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from UCI (University of California, Irvine) machine learning repository. The experimental results prove that our approach shows great flexibility and robustness in novel class detection in concept drifting and outperforms traditional classification models in challenging real-life data stream applications.
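The separation idea described here, that a point lying far from every existing cluster is treated as a novel class instance, can be sketched in a few lines. The summary of each cluster as a centroid plus a radius, the `slack` factor, and the function names `summarize` and `is_novel` are assumptions made for this illustration and are not taken from the paper.

```python
import math


def summarize(points):
    """Return (centroid, radius) of a cluster; radius is the largest member distance."""
    dim = len(points[0])
    centroid = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    radius = max(math.dist(p, centroid) for p in points)
    return centroid, radius


def is_novel(x, clusters, slack=1.5):
    """True if x lies outside slack * radius of every known cluster."""
    return all(math.dist(x, centroid) > slack * radius for centroid, radius in clusters)


# Illustrative clusters standing in for two known classes.
known = [
    summarize([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]),
    summarize([(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]),
]
```

In practice a single outlier is usually not enough evidence; a detector of this kind would typically wait until several such points arrive close to each other before declaring a new class.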