Learning Concept Drift Using Adaptive Training Set Formation Strategy

Nabil M. Hewahi, Computer Science Department, Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine
Sarah N. Kohail, Computer Science Department, Faculty of Information Technology, Islamic University of Gaza, Gaza, Palestine

International Journal of Technology Diffusion, 4(1), 33-55, January-March 2013
DOI: 10.4018/jtd.2013010103
ABSTRACT
We live in a dynamic world, where changes are a part of everyday life. When there is a shift in data, the classification or prediction models need to adapt to the changes. In data mining the phenomenon of change in data distribution over time is known as concept drift. In this research, the authors propose an adaptive supervised learning with delayed labeling methodology. As a part of this methodology, the authors introduce the Adaptive Training Set Formation for Delayed Labeling Algorithm (SFDL), which is based on selective training set formation. The proposed solution is considered the first systematic training set formation approach that takes the delayed labeling problem into account. It can be used with any base classifier without the need to change the implementation or settings of this classifier. The authors test their algorithm implementation using synthetic and real datasets from various domains which might have different drift types (sudden, gradual, incremental, reoccurring) with different speeds of change. The experimental results confirm an improvement in classification accuracy as compared to an ordinary classifier for all drift types. The authors' approach is able to increase the classification accuracy by 20% on average and 56% in the best cases of the experimentation, and it has not been worse than the ordinary classifiers in any case. Finally, a comparison with four other related methods that deal with changes in user interest over time and handle recurrence drift is performed. These methods are a simple incremental method, a time window approach with different window sizes, an instance weighting method, and the conceptual clustering and prediction framework (CCP). Results indicate the effectiveness of the proposed method over the other methods in terms of classification accuracy.
Keywords: Adaptive Learning, Concept Drift, Delayed Labeling, Machine Learning, Training Set Formation
1. INTRODUCTION
A key assumption in supervised learning is that
the training and the testing data (or operational
data) used to train the classifier come from the
same distribution. This means that training data
is representative and the classifier will perform
well on all future unseen data instances. However, if the statistical properties of the target variable, which the model is trying to predict, change over time while the same classifier is still applied, the predictions will no longer be
accurate. In machine learning this phenomenon
of change in data distribution over time is known
as concept drift (Tsymbal, 2004). The concept drift problem has been stated as one of the ten most challenging problems facing researchers in the data mining and machine learning fields (Yang & Wu, 2006).
To show the importance of this problem, assume a data mining application for spam filtering that is developed using the latest generated spam dataset. As this filter adapts to deal with today's types of spam emails, spammers will try to bypass it by disguising their emails to look more legitimate. New spam patterns will thus emerge that the current application can only approximately classify. Over time, this leads to lower accuracy, poorer performance, and outdated knowledge. This dynamic nature
of spam email raises a requirement for update
in any filter that is to be successful over time
in identifying spam (Delany, Cunningham,
Tsymbal, & Coyle, 2005).
The main difficulty in mining non-stationary data such as spam, intrusions, stock markets, weather, and customer preferences is to cope with the changing data concept.
The fundamental processes generating most
real-time data may change over years, months
and even seconds, at times drastically. Effective
learning in environments with hidden contexts
and concept drift requires a learning algorithm
that can detect context changes without being
explicitly informed about them, recover quickly
from a context change and adjust itself to the
new context, and can make use of previous
experience in situations where old contexts and
corresponding concepts reappear (Nishida &
Yamauchi, 2009).
In our research, we try to contribute to solving the problem of concept drift in supervised learning when true labels become known with a certain delay. The work presented in this paper is based on a training set formation strategy which reforms the training sets when concept drift is detected. Training set formation methods have an advantage over other adaptivity methods since they do not require complicated parameterization and they can be used for online learning by plugging in different types of base classifiers. We can summarize our contribution as:
• We introduce the Adaptive Training Set Formation for Delayed Labeling Algorithm (SFDL), which is based on selective training set formation. Our proposed solution is considered the first systematic training set formation approach that takes the delayed labeling problem into account. Our proposed algorithm can be used with any base classifier without the need to change the implementation or settings of the classifier;
• We test our algorithm implementation using synthetic and real datasets from various domains which might have different drift types (sudden, gradual, incremental, reoccurring) with different speeds of change. Experimental evaluation confirms an improvement in classification accuracy as compared to an ordinary classifier for all drift types.
The rest of the paper is organized as follows: Section 2 presents related work and gives an introductory background to the main topic of this research, namely the concept drift problem and the detectability of concept drift when labeling is delayed. Section 3 defines the training set formation strategy and summarizes the main contributions of our research. Section 4 describes our methodology and proposed algorithms. Experimental results are discussed in Section 5. Finally, Section 6 concludes the paper.
2. RELATED WORK
2.1. Learning under Concept Drift
In supervised learning, each example is a pair consisting of an input vector x and an output label y. The task is to infer a function F that is able to predict the output label y′ for an input vector x′ of the testing data. The causes of concept drift were first presented by Kelly et al. (1999). They claim that a change in outcome distribution (concept drift) may occur in three ways: firstly, and most simply, the prior probability of the class, p(y), may change over time; secondly, the distributions of the classes may change, that is, p(x|y) may alter over time; thirdly, the posterior distributions of class memberships, p(y|x), may alter. Here x is an instance in a q-dimensional feature space and y ∈ {c1, ..., cm}, the set of class labels.
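In probabilistic terms (a standard Bayesian decomposition added here for clarity; it is only implicit in the definitions above), these three sources of change are exactly the factors of the joint distribution:

$$p(x, y) = p(y)\,p(x \mid y), \qquad p(y \mid x) = \frac{p(y)\,p(x \mid y)}{p(x)}$$

Concept drift between times $t$ and $t + \Delta$ means $p_t(x, y) \neq p_{t+\Delta}(x, y)$, and any of the three factors above may be the one that changes.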
Brzezinski (2010) identifies four main types of drift which may occur in a single variable over time, assuming one-dimensional data. By drift types we mean the patterns the data sources take over time. The types of change in context/concept are defined based on those patterns.
The simplest pattern of change is sudden drift, illustrated in Figure 1a. Sudden drift involves abrupt changes that instantly and irreversibly change the variables' class assignment. Real-life examples of such changes include changes in an e-commerce environment and in stock prices.
The next two plots, Figure 1b and Figure 1c, illustrate changes that happen slowly over time, so the drift is noticed only when looking at a long time period. Incremental drift occurs when variables slowly change their values over time; we can see it as a sequence of small sudden drifts. Gradual drift occurs when the change involves the class distribution of variables. Some researchers do not distinguish these two types of drift and use the terms gradual and incremental interchangeably. A typical example of incremental drift is price growth due to inflation, whilst gradual changes are represented by slowly changing definitions such as spam or user-interesting news feeds (Brzezinski, 2010).
The fourth type of drift, illustrated in Figure 1d, is referred to as reoccurring concepts. It happens when several data generating sources are expected to switch over time at irregular time intervals, so previously active concepts reappear after some time. This drift is not necessarily periodic and it is not clear when the source might reappear; that is the main difference from the seasonality concept used in statistics. An example of reoccurring drift is the change in food sales.
Figure 1. Illustration of the four structural types of concept drift (Brzezinski, 2010)
2.2. Concept Drift under
Delayed Labeling
Most of the work to date on drift detection assumes that the true class of all instances in the data stream will be known shortly after classification (Delany, Cunningham, Tsymbal, & Coyle, 2005; Ludmila, Kuncheva, & Salvador, 2008; Wang, Fan, Yu, & Han, 2003). Under such an assumption, the incoming new data can be regularly used to periodically examine the model and compute the real error. In practice, this scenario is not realistic because decisions should be made in real time, and in many domains collecting new labeled training objects may be costly (e.g., requiring sensors and hardware systems) or time-consuming (e.g., requiring human experts to manually label the new data). While it is relatively easy to obtain unlabelled objects, it is still challenging to detect changes using these objects, especially when the prior probability of the class changes. Examples of tasks where delayed labeling exists are sales prediction, bankruptcy prediction, outcome of patient treatment, intrusion or fraud detection, and spam categorization.
Dealing with the delayed labeling problem allows the learner to benefit from unlabelled data (i.e., early change detection) until the true labels become available.
3. TRAINING SET FORMATION
ADAPTIVITY STRATEGY
The training set formation strategy can be achieved by using one or more of the following methods (a sketch contrasting the first two methods is given after this list):

1. Training set selection: Used to select the examples most relevant to the current concept. Relevancy here relates to how representative or important older examples are for predicting new instances of the possibly changed concept. For example, instead of taking all the training history, a number of instances that are strongly related to the current distribution are considered. Training set selection can be applied in two ways (Tsymbal, 2004; Žliobaitė, 2010): (a) sequential instance selection (training window strategies), which selects the nearest neighbors according to example arrival time, so the latest examples are trusted more than the oldest ones; (b) selective sampling (instance selection), which picks the instances closest to the target instance in feature space. Selective sampling in space is particularly beneficial when reoccurring or gradual concepts are expected;
2. Training set weighting: In this case instances can be weighted according to their age and their competence with regard to the current concept. Klinkenberg (2004) claimed that instance weighting techniques handle concept drift worse than analogous instance selection techniques, which is probably due to overfitting;
3. Training set manipulation: When drift happens, features or even combinations of attribute values that were relevant in the past may no longer be useful, some labels may disappear, and new labels may occur. Training set manipulation is used for feature reselection, adding new labels that appear over time, and deleting labels that disappear over time.
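To make the contrast between the first two methods concrete, here is a minimal sketch (ours, with hypothetical names; it is not code from the paper) of a sequential training window and age-based instance weighting:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of two training set formation methods. */
public class TrainingSetFormation {

    /** Sequential instance selection: keep only the latest w examples. */
    static <T> List<T> trainingWindow(List<T> history, int w) {
        int from = Math.max(0, history.size() - w);
        return new ArrayList<>(history.subList(from, history.size()));
    }

    /** Instance weighting: newer examples receive exponentially larger weights.
     *  Index 0 is the oldest example; decay must lie in (0, 1]. */
    static double[] ageWeights(int n, double decay) {
        double[] weights = new double[n];
        for (int i = 0; i < n; i++)
            weights[i] = Math.pow(decay, (n - 1) - i);
        return weights;
    }
}
```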
4. METHODOLOGY AND
PROPOSED APPROACH
We organize this section as follows: Section 4.1 provides a general idea of the methodology flow. Sections 4.2 and 4.3 explain two important algorithms which have been created and used in our main approach. Finally, Section 4.4 discusses our proposed Adaptive Training Set Formation for Delayed Labeling Algorithm (SFDL).
4.1. Overview of Our Solution
Figure 2 provides a global view of the concept drift learning scenario that we built. To make the flow clear and complete, we illustrate a scenario for the arrival of two consecutive new data batches in Figure 2a and Figure 2b.

Figure 2. Global view of the concept drift learning scenario using the proposed approach

Figure 2 summarizes our methodology in five general steps:
Step 1: Like most previously proposed drift learning methods, we use supervised learning as the initial training method. After training and testing, a classifier Lt is produced. Classifier Lt is considered the best and most accurate classifier at time t;
Step 2: When the system receives new instances (a batch from a drifting concept), the new instances are classified using the Lt classifier. This process continues until a set of raw instances of window size w has arrived [xt+1 to xt+N]. The window size value is fixed for a single system and depends on the system designer's knowledge of the context;
Step 3: Apply our proposed algorithm, named the Adaptive Training Set Formation for Delayed Labeling Algorithm (SFDL), to the old historical data which was used to build the Lt classifier and to the new incoming batch. The work of this algorithm is summarized as follows: select the instances most relevant to the current concept (instance selection); reclassify the newly arrived batch using the selected instances; reform the old set according to the changes detected;
Step 4: The output of the previous step is a newly formed training set which reflects the changes that occurred during the period [t+1 to t+N]. This set is used to retrain the model and produce the Lt+N classifier, as illustrated in Figure 2b;
Step 5: When another new batch is received, the process is repeated from Step 2, and so on. (A skeleton of this loop is sketched below.)
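The sketch below is our own illustration with hypothetical names; train, classify, and sfdl stand for the components described above, not for the authors' code:

```java
import java.util.List;

/** Skeleton of the five-step drift learning loop of Section 4.1 (illustrative only). */
public abstract class DriftLearningLoop<I> {

    // Placeholders for the components described in the text.
    protected abstract Object train(List<I> trainingSet);           // Steps 1 and 4
    protected abstract void classify(Object model, List<I> batch);  // Step 2
    protected abstract List<I> sfdl(List<I> history, List<I> batch,
                                    int k, int wRecent);            // Step 3

    public void run(List<I> initialData, Iterable<List<I>> batches,
                    int k, int wRecent) {
        List<I> trainingSet = initialData;
        Object model = train(trainingSet);                      // Step 1: classifier L_t
        for (List<I> batch : batches) {                         // Step 5: repeat per batch
            classify(model, batch);                             // Step 2: label batch with L_t
            trainingSet = sfdl(trainingSet, batch, k, wRecent); // Step 3: reform training set
            model = train(trainingSet);                         // Step 4: retrain -> L_{t+N}
        }
    }
}
```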
4.2. Modified k-Nearest
Neighbor Algorithm
The k-Nearest Neighbors (k-NN) algorithm is the most common instance-based method (Ludmila, Kuncheva, & Salvador, 2008; Nishida, 2008). It classifies objects based on the closest training examples in the feature space. The training phase consists of simply storing every training example with its label. To classify a new example, first compute its distance to every training example. For numeric attributes, the distance is usually defined in terms of the standard Euclidean distance. The Euclidean distance between two points xz and xl, where each point is a q-dimensional real feature vector, is computed as follows:
$$d(x_z, x_l) = \sqrt{\sum_{i=1}^{q} \left(x_z^{(i)} - x_l^{(i)}\right)^2} \qquad (1)$$

where $x_z^{(i)}$ is the i-th feature of the instance $x_z$ and q is the dimensionality.
For Boolean and discrete attributes, the distance is usually defined in terms of the number of attributes that two instances do not have in common. k-NN then keeps the k training examples closest in distance, where k ≥ 1 is a fixed integer. The new example is classified by a majority vote of its neighbors. Figure 3 shows the pseudocode of the k-NN algorithm.

Figure 3. k-NN algorithm
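For concreteness, here is a minimal sketch of the plain k-NN classification just described, combining the Euclidean distance of Equation 1 with majority voting (our illustration, not the authors' implementation):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Plain k-NN: Euclidean distance (Equation 1) plus majority voting. */
public class Knn {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Classify x by a majority vote among its k nearest training examples. */
    static int classify(double[][] X, int[] y, double[] x, int k) {
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < X.length; i++) idx[i] = i;
        // Order all training examples by their distance to x
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(X[i], x)));

        // Count the labels of the k closest examples
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(y[idx[i]], 1, Integer::sum);

        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
```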
In addition to the class label output by the k-NN classifier, we modified k-NN so it can output two additional class labels, y′′ and y′′′, for the same example. The basic idea of the algorithm does not change, but we add two more computations, one for y′′ and the other for y′′′. The purpose of these computations is to decide later which class label should be assigned to the given drifted example. The details of this process and how the values of y′′ and y′′′ are used will be explained in the following sections. The modified k-NN algorithm is illustrated in Figure 4:
• Computing y′′: After ordering the examples according to their distance from the new instance to be classified, x′ (line 6), we select the nearest k instances from each available class j. We denote this set by D(j), where j = 1, ..., η and η is the number of available classes. Then y′′ is assigned to the class which has the minimum summation Summj of its distances from x′;
• Computing y′′′: After selecting the k nearest instances (line 9), we add the distances of each group of instances that belong to one class and then divide the sum by the number of nearest neighbor instances belonging to that class label out of the total k. (A sketch of both computations follows.)
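The sketch below shows our reading of these two computations; in particular, the assumption that y′′′ goes to the class minimizing the per-class average distance is ours, since the text does not state the final selection rule explicitly:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of the two extra labels of the modified k-NN (Figure 4). */
public class ModifiedKnnExtras {

    /** y'': for each class j, take its k nearest instances to x' (the set D(j)),
     *  sum their distances (Summ_j), and return the class with the minimum sum. */
    static int ySecond(double[] dist, int[] labels, int k, int numClasses) {
        int best = -1;
        double bestSum = Double.POSITIVE_INFINITY;
        for (int j = 0; j < numClasses; j++) {
            List<Double> d = new ArrayList<>();
            for (int i = 0; i < dist.length; i++)
                if (labels[i] == j) d.add(dist[i]);
            if (d.isEmpty()) continue;
            Collections.sort(d);
            double sum = 0.0;
            for (int i = 0; i < Math.min(k, d.size()); i++) sum += d.get(i);
            if (sum < bestSum) { bestSum = sum; best = j; }
        }
        return best;
    }

    /** y''': among the overall k nearest instances (already sorted by distance),
     *  average each class's distances (sum / count); we assume the minimizing
     *  class wins. */
    static int yThird(double[] sortedDist, int[] sortedLabels, int k, int numClasses) {
        double[] sum = new double[numClasses];
        int[] count = new int[numClasses];
        for (int i = 0; i < k; i++) {
            sum[sortedLabels[i]] += sortedDist[i];
            count[sortedLabels[i]]++;
        }
        int best = -1;
        double bestAvg = Double.POSITIVE_INFINITY;
        for (int j = 0; j < numClasses; j++) {
            if (count[j] == 0) continue;
            double avg = sum[j] / count[j];
            if (avg < bestAvg) { bestAvg = avg; best = j; }
        }
        return best;
    }
}
```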
4.3. Closest Class Algorithm
We developed this algorithm as a heuristic to get the nearest class to each existing class. Many other methods calculate the distance between class centers directly to determine how far one class is from the others. These methods may not work well when the instance points belonging to a certain class label are scattered and not dense. This heuristic guides the algorithm and identifies the changes in class distributions. It also decides how to change the class label when there is a drift, especially when the drift is gradual.
Figure 4. Modified k-NN algorithm

Figure 5 illustrates the pseudocode for computing the closest class for each existing class. The input for this algorithm is the whole training set and the output is the closest class label for each class available in the training set. It is worth mentioning that if class X is the closest class to class O, it is not necessarily the case that class O is the closest class to X. To compute the closest class for a given class $c_i$ (i = 1, ..., η), the algorithm first computes the average distance between every two classes. The average is computed as follows:
1. The algorithm groups all the instances according to their class label;
2. For each two different classes i and j;
3. For each instance belonging to the first class i:
a. A random instance is picked from the second class j;
b. The Euclidean distance between the two instances is computed and added to a summation S;
4. The summation S is divided by the number of instances of the class which has the minimum number of instances (either i or j);
5. Now we have a single average for each pair of classes. The number of averages is equal to the binomial coefficient $\binom{\eta}{2}$, where order is not important. This means we have η classes and we want to pick two (a pair) of them each time with no repetition;
6. The closest class for a given class c is the class which has the minimum average of distances from class c (a sketch of this computation is given below).

Figure 5. Computing the closest class to each available class
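The following Java sketch (ours) makes the heuristic concrete; note the asymmetry, which matches the remark that the closest-class relation is not symmetric:

```java
import java.util.List;
import java.util.Random;

/** Sketch of the closest-class heuristic (Figure 5), following steps 1-6 above. */
public class ClosestClass {

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Average distance between classes i and j: each instance of c_i is paired
     *  with a random instance of c_j, and the summation S is divided by the
     *  smaller class size. pairAverage(ci, cj) != pairAverage(cj, ci) in general. */
    static double pairAverage(List<double[]> ci, List<double[]> cj, Random rnd) {
        double s = 0.0;
        for (double[] x : ci)
            s += euclidean(x, cj.get(rnd.nextInt(cj.size())));
        return s / Math.min(ci.size(), cj.size());
    }

    /** The closest class to class c is the one with the minimum average distance. */
    static int closestTo(int c, List<List<double[]>> classes, Random rnd) {
        int best = -1;
        double bestAvg = Double.POSITIVE_INFINITY;
        for (int j = 0; j < classes.size(); j++) {
            if (j == c) continue;
            double avg = pairAverage(classes.get(c), classes.get(j), rnd);
            if (avg < bestAvg) { bestAvg = avg; best = j; }
        }
        return best;
    }
}
```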
4.4. Adaptive Training Set
Formation for Delayed
Labeling Algorithm (SFDL)
The idea of the training set formation strategy is to continually update the training data and form it according to the changes detected in the newly arriving data. Before explaining our algorithm, we present some important equations.
Equation 2 explains how the threshold value alpha (α) is computed for each class label. The alpha (α) parameter acts as a distance threshold that determines which instances are considered closest to a given class. For the i-th class $c_i$ with center $\upsilon_i$, alpha (α) is computed by the following equation:

$$\alpha_i = \frac{\max_{z=1}^{|c_i|} d(x_z, \upsilon_i) + \min_{z=1}^{|c_i|} d(x_z, \upsilon_i)}{2} \qquad (2)$$
where $|c_i|$ is the number of instances belonging to $c_i$, and $i = 1, \ldots, m$, with m the number of classes; $\max_{z=1}^{|c_i|} d(x_z, \upsilon_i)$ is the maximum distance between $\upsilon_i$ and any instance belonging to $c_i$, and $\min_{z=1}^{|c_i|} d(x_z, \upsilon_i)$ is the minimum distance between $\upsilon_i$ and any instance belonging to $c_i$.
Note: The class center is computed as follows:

$$\upsilon_i = \frac{\sum_{z=1}^{|c_i|} x_z}{|c_i|} \qquad (3)$$
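Equations 2 and 3 translate directly into code; the following is our sketch:

```java
/** Computing a class center (Equation 3) and its alpha threshold (Equation 2). */
public class CenterAndAlpha {

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Equation 3: the center is the mean of the class's instances. */
    static double[] center(double[][] classInstances) {
        int q = classInstances[0].length;
        double[] v = new double[q];
        for (double[] x : classInstances)
            for (int i = 0; i < q; i++) v[i] += x[i];
        for (int i = 0; i < q; i++) v[i] /= classInstances.length;
        return v;
    }

    /** Equation 2: alpha is the midrange of the distances from the center. */
    static double alpha(double[][] classInstances, double[] v) {
        double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
        for (double[] x : classInstances) {
            double d = euclidean(x, v);
            max = Math.max(max, d);
            min = Math.min(min, d);
        }
        return (max + min) / 2.0;
    }
}
```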
The SFDL algorithm is illustrated in Figure 6. The algorithm consists of three sub-algorithms: (1) instance selection, (2) reclassifying the new incoming batch, and (3) set formation.

Figure 6. SFDL algorithm
4.4.1. Instance Selection Algorithm
The instance selection algorithm is presented in Figure 7. The algorithm is used to select the examples most relevant to the current concept. Relevancy here relates to how important older examples are for predicting new instances, in terms of time similarity and feature space similarity. The algorithm accepts five inputs:

• The historical data DH which was used to build the existing classifier Lt;
• The new batch DB which arrived during the period [t to t+N] and was labeled using classifier Lt;
• The computed centers υ and alpha (α) values for each class in the new batch (Equations 2 and 3). Instances in the newly arrived batch DB may not always be classified into all the existing classes, so the number of classes in the new batch could be less than the number of possible classes (m ≤ η);
• The integer wrecent, which represents how many of the most recent instances before time t will be selected. In some applications where the drift is sudden, the time factor is not important, and selecting instances according to their age is ineffective; in such cases the designer of the application can set wrecent to zero.
The algorithm output is a set of instances relevant to the current concept, called DKNN, and a set of far instances, DFAR (the complement of DKNN). To select instances according to distance similarity, for each existing class label in the new batch the algorithm goes through all instances (old and new) from x1 to xt+N and selects the instances for which the Euclidean distance between the center of this class and the instance is less than its computed α.

In terms of time similarity, relevant instances are selected according to the wrecent value. The value of wrecent depends on the domain at hand, as well as the expectations of the system designer regarding the drift type.

Figure 7. Instance selection algorithm
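A sketch of the selection step (ours; it assumes the instances are time ordered, with index t marking the start of the new batch):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of instance selection (Figure 7): keep every instance whose distance
 *  to some batch-class center is below that class's alpha, plus the wRecent
 *  most recent historical instances before time t. */
public class InstanceSelection {

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    static List<double[]> select(List<double[]> all,   // x_1 .. x_{t+N}, time ordered
                                 double[][] centers,   // centers of the batch classes
                                 double[] alphas,      // alpha per batch class
                                 int wRecent, int t) {
        List<double[]> dknn = new ArrayList<>();
        // Distance similarity: within alpha of any batch-class center
        for (double[] x : all)
            for (int c = 0; c < centers.length; c++)
                if (euclidean(centers[c], x) < alphas[c]) { dknn.add(x); break; }
        // Time similarity: also keep the wRecent instances arriving just before t
        for (int i = Math.max(0, t - wRecent); i < t; i++)
            if (!dknn.contains(all.get(i))) dknn.add(all.get(i));
        return dknn; // the complement of dknn within `all` forms DFAR
    }
}
```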
4.4.2. Reclassifying the New
Incoming Batch Algorithm
Based on the instances selected in DKNN, the algorithm for reclassifying the new incoming batch reclassifies the new instances, which were initially classified using the available classifier. The reclassification process is important because the current classifier is assumed to be outdated and useless for classifying the new instances. The algorithm is illustrated in Figure 8.
The main inputs to the reclassification algorithm are DH, DB, DKNN, and the size of the neighborhood k (the number of nearest neighbors). The following points summarize the algorithm's working flow:

• Apply the modified k-NN (Figure 4) with k as the size of the neighborhood and DKNN as the training set. The modified k-NN algorithm returns three different labels, as explained in Section 4.2: the original k-NN label y′ and the two additional labels y′′ and y′′′;
• The next step is to update the positions of the existing class centers. To do that we compute a new center $\upsilon_\varkappa$ using the formula shown in Box 1.

Box 1.

$$\upsilon_\varkappa = \frac{\upsilon_\varkappa^B + \upsilon_\varkappa^H + \upsilon_\varkappa^{KNN} + \dfrac{\max_{z=1}^{|c_\varkappa^B|} d(x_z, \upsilon_\varkappa) + \min_{z=1}^{|c_\varkappa^B|} d(x_z, \upsilon_\varkappa)}{2}}{4} \qquad (4)$$

Here $\upsilon_\varkappa^B$, $\upsilon_\varkappa^H$, and $\upsilon_\varkappa^{KNN}$ are the centers of class $\varkappa$ in the DB, DH, and DKNN datasets respectively, $\varkappa = 1, \ldots, \eta$, where $\eta$ is the number of available classes.

In some cases DB and DKNN do not include all the possible classes available in DH; in this case the associated centers ($\upsilon_\varkappa^B$ or $\upsilon_\varkappa^{KNN}$) for the missing classes will be unknown and are set to zero. Therefore, class centers which are not included in the new batch will not be affected by this formula.
Combining the new computed centers with
the previous ones is important to move the
centers smoothly and gradually forget the old
concept and switch to the new one.
After updating the classes' centers, the algorithm computes the Euclidean distance between the new centers and every instance in the new batch DB. Each instance will then have a fourth label, ycenter (in addition to y′′′, y′′, and y′), which represents the class with the closest center to this instance.

Another class label, yclosest, is computed using the algorithm illustrated in Figure 5. Unlike ycenter, which represents the closest class to a specific instance, yclosest represents the class distribution closest to another class distribution as a whole:
• Now each instance in DB has five different class labels (ycenter, yclosest, y′′′, y′′, y′). The five labels are used to decide whether an instance stays with its current distribution or must be assigned to another possible closest class. Reclassification of any instance in DB depends on a heuristic certainty rule. If the certainty rule is satisfied, the instance is reclassified to the most frequent class label among the five computed labels. Otherwise, it is reclassified to ycenter.

Figure 8. Reclassifying new batch instances algorithm
The certainty rule dictates the following:

1. The instance is not included in DFAR;
2. There is no uncertainty in the classification (i.e., among the five labels). This means that the classification majority must be clear. For example, if two of the five labels vote for class X, two for class O, and one for class Z (2:2:1), we say there is no certainty because the voting is very close. The same applies if three of the five labels vote for class X and two for class O (3:2). Cases like (3:1:1) and (4:1) reflect a good majority.

With the certainty rule we want to determine those instances that are not classified well by the existing classifier and have a fuzzy membership in their current classes. We choose ycenter as the label for those instances that do not satisfy the certainty rule. (A sketch of this rule follows.)
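The sketch below encodes one reading of this rule (ours): judging from the examples, a vote is certain when the winning label leads the runner-up by at least two votes, which accepts 3:1:1, 4:1, and 5:0 and rejects 2:2:1 and 3:2:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the certainty rule and the final label choice for one instance. */
public class CertaintyRule {

    static int finalLabel(int yCenter, int yClosest, int y3, int y2, int y1,
                          boolean inDFar) {
        int[] five = { yCenter, yClosest, y3, y2, y1 };
        Map<Integer, Integer> votes = new HashMap<>();
        for (int l : five) votes.merge(l, 1, Integer::sum);

        int top = 0, second = 0, majorityLabel = yCenter;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > top) {
                second = top;
                top = e.getValue();
                majorityLabel = e.getKey();
            } else if (e.getValue() > second) {
                second = e.getValue();
            }
        }
        // Certain only if the instance is not in DFAR and the majority is clear
        // (winner leads by >= 2 votes: accepts 3:1:1 and 4:1, rejects 2:2:1 and 3:2).
        boolean certain = !inDFar && (top - second >= 2);
        return certain ? majorityLabel : yCenter;
    }
}
```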
4.4.3. Set Formation Algorithm
Reclassifying the new data is not enough; we still need to benefit from the old historical data. Based on the reclassification step, this algorithm reforms the old set. The algorithm is shown in Figure 9.
The main functions of this algorithm are (sketched at the end of this subsection):

• Recomputing the centers and α values for each class in DB (after reclassification) using Equations 2 and 3;
• Reforming the old set: for each instance in DH, if the distance to any class center $\upsilon_i^B$ is less than $\alpha_i$, then this instance is reclassified to the class with the closest center.
Figure 9. Set formation algorithm

The output of the SFDL algorithm is a new training set consisting of the reclassified new batch (the output of the algorithm in Figure 8) and the reformed old set (the output of the algorithm in Figure 9).

The problem of the training set continuously growing can be solved by using a sampling method which keeps the number of instances from exceeding a predefined value.
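A sketch of the reform step for a single historical instance (ours; when several centers pass the alpha test we assume the nearest one wins):

```java
/** Sketch of set formation (Figure 9): reassign an old instance to the class
 *  with the closest batch center when it falls inside that class's alpha radius. */
public class SetFormation {

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Returns the new label for one instance of DH, or its old label. */
    static int reform(double[] x, int oldLabel, double[][] centersB, double[] alphasB) {
        int closest = -1;
        double best = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centersB.length; c++) {
            double d = euclidean(centersB[c], x);
            if (d < alphasB[c] && d < best) { best = d; closest = c; }
        }
        return closest >= 0 ? closest : oldLabel; // keep the old label if no center is close
    }
}
```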
5. EXPERIMENTAL RESULTS
AND EVALUATION
This section discusses the experimental evaluation of the proposed model. Three subsections are presented: Section 5.1 presents a brief description of the datasets used in our experimentation. Section 5.2 describes the experimental setup and procedure. Finally, Section 5.3 discusses the experimental results.
5.1. Datasets
For the purposes of research related to concept drift learning, there is no standard concept drift benchmark dataset. Instead, there are popular datasets that have been used by most of the existing research. Unfortunately, most of the existing real-world datasets are not suitable for evaluating drift learning because there is little concept drift in them. So researchers resort to introducing artificial drift into real datasets or creating synthetic (fabricated) datasets with artificial drift.

In our experiments we used six datasets, all of which are publicly available. The datasets were chosen from various domains that might have different drift types with different speeds of change. They include no missing values or noise. Table 1 illustrates the characteristics of each set: the number of instances, attributes, and classes, and the type of dataset. A short description of each dataset is given below.
For the STAGGER dataset, the instance space is defined by three attributes: size = {small, medium, large}, color = {red, green, blue}, and shape = {square, circle, triangle}. The target concepts change as follows: (1) size = small and color = red, (2) color = green or shape = circle, and (3) size = medium or large. 120 training instances were generated randomly and the concept was changed every 40 instances. There are two sources of drift in the STAGGER dataset: (1) the change in posterior probabilities and (2) the change in class balance (Narasimhamurthy & Kuncheva, 2007).
In the SEA data, each instance is described by three features, x = [x1, x2, x3]T, where the values of x are generated uniformly at random in [0, 10]. Only the first two features are relevant. An instance belongs to class 1 if x1 + x2 ≤ θ and belongs to class 2 otherwise, where θ is a threshold value changed to create a concept. There are four concepts: θ = 7, 8, 8.5, 9. We generated 200 instances for each concept (100 instances for each class label). No label noise was added and the two classes are perfectly separated (Street & Kim, 2001).
Table 1. Characteristics of the used datasets
The Electricity dataset includes 2973 instances collected over a period of 3 months, from May 11 to July 11, 1997, from the Australian New South Wales Electricity Market. The class label has two values, 'up' or 'down', indicating the change of the price. In our experimentation each month represents one concept (Harries, 1999).
The Chess dataset includes the gaming records of one player over a period from December 2007 to March 2010. A player develops his skills over time and can engage in different types of competitions (personal, tournament, or championship). The player's rating and the selected game type are crucial for the system to select an opponent. The task is to predict whether the player will win or lose based on the setting. There is a natural problem of delayed labeling: the winner is known only after the game is finished.
The Credit dataset classifies customers as having good or bad credit risks. Following Žliobaitė (2010), a gradual concept change was introduced artificially by sorting the data using the feature 'age' and then eliminating this feature from the dataset. Delayed labeling is relevant for this task, since the true label (whether a person fails to repay the credit) is known only after some time. However, decision makers need a real-time indication of changes in risk (UCI machine learning repository: Data sets, n.d.).
The Usenet dataset includes two sets, usenet1 and usenet2. The difference between the two sets is illustrated in Figure 10. We have five batches, each containing 300 instances, as indicated by the first row in Figure 10. The figure also shows the change in news interest among the batches and which newsgroup articles are considered interesting (+) or uninteresting (-) in each time period.1 This dataset has been used to build news recommender systems, document categorization, and spam filtering applications (Katakis, Tsoumakas, Banos, Bassiliades, & Vlahavas, 2009).
5.2. Experiments Setup
The experiments took place on a machine equipped with an Intel Pentium Core 2 Duo T8300 @ 2.40 GHz processor and 2.00 GB of RAM. To implement our algorithm we used the Java programming language. The goal of the experiments is to observe the system's performance as the target concepts change from time to time. Our approach is expected to enhance the classification accuracy, which might drop over time if we use an ordinary classifier (a classifier that does not consider concept drift in its approach). To achieve our goal we follow this experimental procedure:
1. We start by dividing the dataset into smaller subsets, each called a "batch". We benefit from previous research in the way the dataset is partitioned and ensure that every "batch" represents a change (Katakis, Tsoumakas, & Vlahavas, 2008; Žliobaitė, 2010);
Figure 10. Usenet1 and Usenet2 datasets
2. We use one batch as the training set for the initial learner. In datasets where instances are ordered according to time, we use the oldest batch as the initial training set. Otherwise we pick a random batch as the initial training set, because the drift type in such sets is mostly sudden. Table 2 shows the dataset partitioning details. The table presents the following information: the type of drift represented, the number of instances included in the initial dataset with the class balance distribution (in percent), the number of batches (#B), and whether the dataset is time ordered (Y) or not (N);
3. As in a traditional machine learning process, we build the initial learner using the initial training set. We use 10-fold cross-validation with stratified sampling in order to estimate the performance of the classifier and ensure we get the best model at the current time according to its classification accuracy. The accuracy of the model is calculated using the following equation:

$$\text{Accuracy} = \frac{\text{number of instances correctly classified}}{n} \times 100 \qquad (5)$$

where n is the number of instances;
4. Next, we pass the first batch, classify it using the current learner (the ordinary classifier), apply the SFDL training set formation algorithm, and retrain the model using the formed data. As mentioned before, the SFDL algorithm needs two parameters: (1) the neighborhood size k (space similarity) and (2) the number of most recent instances wrecent (time similarity). The choice of these two parameter values is directly related to the observed change types and future expectations, as well as the designer's knowledge of the domain. The parameter setting is fixed for one application run;
5. We measure the accuracy at two points after passing the batch: (1) after it has been classified by the Reclassifying the New Incoming Batch algorithm (Figure 8), and (2) after training set formation and model retraining;
6. We pass the next batch, classify it using the most recently trained classifier after set formation, and so on. The procedure was explained previously in Section 4.1. After set formation, we compare our results with the ordinary classifier results.
5.3. Experimental Results
and Discussion
This section discusses the results of numerous
experiments that have been conducted.
5.3.1. Sudden Drift Experiments
(STAGGER and SEA datasets)
Table 2. Dataset partitioning details

Table 3 illustrates the experimental results for both the STAGGER and SEA datasets. For the STAGGER dataset, the best model for classifying the first concept was Naïve Bayes with a training accuracy of 100%. We use the same model to predict batch1 and batch2. The classification accuracy dropped to 57.50% and 53.66% for batch1 and batch2 respectively. These results confirm the existence of drift, as the current Naïve Bayes model could not classify the other concepts correctly.
Two accuracy observations were recorded after passing batch1. The underlined observation with value = 65% represents the accuracy after reclassifying the batch using the algorithm of Figure 8. The bold observation with value = 90% represents the accuracy after reformation and model retraining. The SFDL algorithm was applied with k = 2 and wrecent = 0. The batches are very small (40 instances); for this reason we chose a small neighborhood size k. Including the most recent instances from the historical data (wrecent > 0) is meaningless because we are dealing with sudden drift, where the source of drift is not related to time ordering and the previous concept cannot be much trusted to classify the current batch. The SFDL algorithm enhances the accuracy by 32.5% (from 57.50% to 90.00%).
After the arrival of batch2, we classify its instances using the most recent Naïve Bayes model (after retraining). Note that the accuracy decreased to 46.34%. This happened because the last retraining had been done according to the drift in batch1, which is different from batch2. Additionally, the class imbalance in batch1 makes the updated model biased toward the dominant label "2" and decreases the accuracy to 43.90%. With proper parameter settings (of k and wrecent) and the role of the certainty rule, the accuracy increased to 65.85% after applying the SFDL algorithm.
For the SEA dataset, the most accurate classifier for classifying the first concept was a Decision Tree with accuracy = 100%. The same model was used to classify the three incoming concepts. The accuracy of classification was 59.00%, 50.00%, and 50.00% for batch1, batch2, and batch3 respectively. Classification accuracy on the incoming concepts decreased compared to the initial concept classification accuracy. Table 3 presents the accuracy after batch reclassification and model retraining after passing the three batches. The SFDL algorithm was applied with k = 10 and wrecent = 0 because we are dealing with sudden drift and a medium-sized dataset. After model retraining, the SFDL algorithm increases the classification accuracy by at least 27%. Unlike the first two batches, the reclassification accuracy of batch3 is higher than the classification accuracy of the initial model. The reason is that the concept sequencing (θ = 7, 8, 8.5, 9) allows the classifier to gain more knowledge after passing the two previous batches.

Table 3. Results of STAGGER and SEA datasets. Accuracy after batch reclassification is underlined. Accuracy after training set formation and model retraining is in bold.
Figure 11 presents the curves of accuracy over time using the SFDL algorithm and the ordinary classifier for the STAGGER and SEA datasets. It is notable that the SFDL algorithm achieves better performance than the ordinary classifiers for both sets.

Figure 11. Accuracy over time for SFDL algorithm and ordinary classifier: (a) STAGGER dataset, (b) SEA dataset
5.3.2. Gradual Drift Experiments
(Electricity and Credit Datasets)
The Electricity and Credit datasets are real-world datasets with gradual drift. For both datasets we use a Multilayer Perceptron Neural Network (MLP-NN) as the training model. Because we are dealing with time-related gradual drift, time similarity was taken into consideration (wrecent > 0). In other words, to predict electricity prices or credit card approval, the most recent examples are more reliable for classifying new incoming instances than old historical data.
Table 4 illustrates the experimental results for both the Electricity and Credit datasets. The average error of classifying the new batches by the ordinary classifier is mostly lower for the gradual drift experiments than for the sudden drift experiments. The reason is that the two datasets used here, like most real-world datasets, include little concept drift.
Figure 12 presents the curves of accuracy over time using the SFDL algorithm and the ordinary classifier for the Electricity and Credit datasets. The figure shows that our algorithm achieves higher classification accuracy in comparison to the ordinary classifier for both datasets.
5.3.3. Incremental Drift
Experiments (Chess Dataset)
Incremental drift is a sequence of small sudden drifts. For this reason it is very difficult to predict and learn. The main difference between incremental and sudden drift is that incremental drift is related to time, whereas sudden drift is not.

The results of the chess experiments are presented in Table 5. The chess dataset includes real incremental drift. The best model for predicting the first concept was a Rule-Based Classifier with an accuracy of 92.06%. We chose very small neighborhood k and wrecent values, which suits the nature of the data and its speed of change.

Table 5. Results of chess dataset. Accuracy after batch reclassification is underlined. Accuracy after training set formation and model retraining is in bold.
From the table it is clear that our approach has better predictive performance than the classical ordinary classifier.
With the arrival of the first and second batches, SFDL enhances the accuracy by at least 17%, but by no more than 3% for the last batch. We think the reason is the extensive sudden drifts during this period. Also, in this period the user turned to playing personal competitions (70% of the total instances in batch3). This may add other hidden causes of drift related to the player-opponent relationship.
Figure 13 presents the accuracy curves for the SFDL algorithm and the ordinary classifier on the chess dataset. SFDL shows superior accuracy over the ordinary classifier.

Figure 13. Accuracy over time for SFDL algorithm and ordinary classifier for the chess dataset
Table 4. Results of electricity and credit datasets. Accuracy after batch reclassification is underlined. Accuracy after training set formation and model retraining is in bold.

Figure 12. Accuracy over time for SFDL algorithm and ordinary classifier: (a) Electricity dataset, (b) Credit dataset
5.3.4. Reoccurring Concept
Experiments (Usenet Datasets)
Changes in user interests over time are the main cause of concept drift in the usenet dataset. It is obvious from Figure 10 that the usenet datasets represent the recurrence drift type. In fact, this dataset is much more complicated in reality due to unpredictable user interests.
We use the same partitioning method as Katakis, Tsoumakas, and Vlahavas (2008), as illustrated in Figure 10. The first user interest (medicine articles) was used to build the initial learner, and the other parts of interest represent the incoming batches. It should be mentioned that batches with the same interests are not identical.
Table 6 illustrates the experimental results for the usenet datasets. The best model for predicting the first concept for both usenet datasets was MLP-NN, with accuracy = 95.83% and 93.33% for usenet1 and usenet2 respectively. We benefited from the experiments of Katakis et al. (2008) to choose the best wrecent value, while extensive experiments were done to choose the neighborhood value k.

The results prove the SFDL algorithm's ability to switch between concepts as user interests change. Figure 14 also shows the advantages of the SFDL algorithm over the ordinary classifier.

Table 6. Results of Usenet datasets. Accuracy after batch reclassification is underlined. Accuracy after training set formation and model retraining is in bold.

Figure 14. Accuracy over time for SFDL algorithm and ordinary classifier: (a) Usenet1 dataset, (b) Usenet2 dataset
Table 7 presents a comparative study between the SFDL algorithm and four other stream classification methods for the usenet datasets: a simple incremental method, a time window method with different window sizes, weighted examples, and the conceptual clustering and prediction framework (CCP). These methods have considered the concept drift problem in their approaches.
It is notable that the average classification accuracy observations for the usenet2 dataset are higher than those for usenet1. Usenet1 includes more complicated drift, where the same batch includes another drift and the user switches between two different interests. It is clear that the SFDL approach outperforms all the other methods, and the Time Window approach (N=100) performs the worst.
6. CONCLUSION
In this paper, we addressed the problem of supervised learning over time when the data is changing and the labels of new instances are delayed. We introduced an adaptive training set formation algorithm called SFDL, which is based on selective training set formation. Our proposed algorithm is considered the first systematic training set formation approach that takes the delayed labeling problem into account.
The SFDL algorithm includes three sub-algorithms: an instance selection algorithm that is used to select the examples most relevant to the current concept in terms of time similarity and space similarity; a reclassification algorithm to reclassify the new instances which were initially classified using the available classifier; and a training set formation algorithm which reforms the old set according to the changes made in the reclassification step.
We tested our approach using synthetic and real datasets. The datasets were chosen from various domains which might have different drift types (sudden, gradual, incremental, reoccurring) and different speeds of change. Experimental evaluation confirms an improvement in classification accuracy as compared to an ordinary classifier for all drift types. Our approach is able to increase the classification accuracy by 20% on average and 56% in the best cases, and it has not been worse than the ordinary classifiers in any case.
Finally, we conducted a comparative study with four other methods that identify recurrence drift and predict changes in user interest over time. The results show the superiority of our solution over the other methods in handling recurrence drift.
Future research will be directed as follows. Input parameters such as the neighborhood size k and the number of most recent instances wrecent are currently determined by the application designer; it would be better to determine these parameters automatically to preserve self-adaptation. Another direction is to explore ideas to enhance the proposed strategy and improve the accuracy of the results: very high classification accuracy could be achieved by building a customized version to deal with each drift type individually. The algorithm should also be extended so it can add or remove classes; this is important for domains where some classes may disappear over time and must be removed, or vice versa. It would also be very useful to provide the algorithm with a dynamic feature space formation ability.

Finally, we can say that future research on adaptivity to concept drift has to specialize more in application groups.
Table 7. Average accuracy of the four methods in the usenet datasets
REFERENCES
Brzezinski, D. (2010). Mining data streams with
concept drift. Master thesis, Poznan University of
Technology.
Delany, S. J., Cunningham, P., Tsymbal, A., & Coyle,
L. (2005). A case-based technique for tracking
concept drift in spam filtering. Knowledge-Based
Systems, 18.
Harries, M. (1999). Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales. Retrieved October 8, 2011, from http://www.liaad.up.pt/~jgama/ales/ales_5.html
Katakis, I., Tsoumakas, G., Banos, E., Bassiliades,
N., & Vlahavas, I. (2009). An adaptive personalized
news dissemination system. Journal of Intelligent In-
formation Systems. doi:10.1007/s10844-008-0053-8.
Katakis, I., Tsoumakas, G., & Vlahavas, I. (2008).
An ensemble of classifiers for coping with recur-
ring contexts in data streams. In Proceeding of 18th
European Conference on Artificial Intelligence,
Patras, Greece.
Kelly, M., Hand, D., & Adams, N. (1999, August
15-18). The impact of changing populations on clas-
sifier performance. In Proceedings of the Fifth ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining (pp. 367-371).
Klinkenberg, R. (2004). Learning drifting concepts:
Example selection vs. example weighting. Intelligent
Data Analysis, 8(3), 281–300.
Ludmila, I., Kuncheva, J., & Salvador, S. (2008).
Nearest neighbour classifiers for streaming data
with delayed labeling. In Proceedings of the Eighth
IEEE International Conference on Data Mining
ICDM (pp.869-874).
Narasimhamurthy, A., & Kuncheva, L. (2007). A
framework for generating data to simulate changing
environments. In Proceeding of the 25th IASTED
int. Multi-Conference, Artificial Intelligence and
Applications (AIAP'07) (pp. 384-389). ACTA Press.
Nishida, K. (2008). Learning and detecting concept drift. PhD thesis, Hokkaido University, Japan.
Nishida, K., & Yamauchi, K. (2009). Learning,
detecting, understanding, and predicting concept
changes. International Joint Conference on Neural
Networks (IJCNN) (pp. 2280-2287).
Street, N., & Kim, Y. (2001). A streaming ensemble
algorithm (SEA) for large-scale classification. In
Proceedings of the Seventh ACM SIGKDD Inter-
national Conference on Knowledge Discovery and
Data Mining. Retrieved October 8, 2011, from http://
www.liaad.up.pt/~kdus/kdus_5.html
Tsymbal, A. (2004). The problem of concept drift:
Definitions and related work. Technical Report,
Department of Computer Science. Dublin, Ireland:
Trinity College.
UCI machine learning repository: Data sets. (n.d.). Retrieved October 8, 2011, from http://archive.ics.uci.edu/ml/datasets.html
Wang, H., Fan, W., Yu, P., & Han, J. (2003). Mining
concept-drifting data streams using ensemble clas-
sifiers. In Proceedings of the Ninth ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining (KDD ‘03) (pp.226-235). New
York, NY: ACM.
Widmer, G., & Kubat, M. (1996). Learning in the pres-
ence of concept drift and hidden contexts. Machine
Learning, 23(1), 69–101. doi:10.1007/BF00116900.
Yang, Q., & Wu, X. (2006). 10 challenging problems
in data mining research. International Journal of
Information Technology & Decision Making, 5(4),
597–604. doi:10.1142/S0219622006002258.
Žliobaitė,I.(2010).Adaptive training set formation.
PhD thesis, Vilnius University.
ENDNOTES

1. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html