Speeding Up Statistical Tests to Detect
Recurring Concept Drifts
Paulo Mauricio Gonçalves Júnior and Roberto Souto Maior de Barros
Abstract. RCD is a framework for dealing with recurring concept drifts. It reuses previously stored classifiers that were trained on examples similar to actual data, through the use of multivariate non-parametric statistical tests. The original proposal performed statistical tests sequentially. This paper improves RCD to perform the statistical tests in parallel by the use of a thread pool and presents how parallelism impacts performance. Results show that using parallel execution can considerably improve the evaluation time when compared to the corresponding sequential execution in environments where many concept drifts occur.
Keywords: Data streams, recurring concept drifts, multivariate non-parametric sta-
tistical tests, parallelism.
1 Introduction
Concept drift is a common situation when dealing with data streams. Several authors
have defined it in different terms. One of these definitions was stated by Wang et
al. [23]: “the term concept refers to the quantity that a learning model is trying to predict, i.e., the variable. Concept drift is the situation in which the statistical properties of the target concept change over time.” Kolter and Maloof offered a more
informal definition: “concept drift occurs when a set of examples has legitimate
class labels at one time and has different legitimate labels at another time” [17].
Paulo Mauricio Gonçalves Júnior
Instituto Federal de Educação, Ciência e Tecnologia de Pernambuco, Cidade Universitária,
50.740-540, Recife, Brasil
e-mail: paulogoncalves@recife.ifpe.edu.br
Roberto Souto Maior de Barros
Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária,
50.740-560, Recife, Brasil
e-mail: roberto@cin.ufpe.br
R. Lee (Ed.): Computer and Information Science, SCI 493, pp. 129–142.
DOI: 10.1007/978-3-319-00804-2_10
© Springer International Publishing Switzerland 2013
Concept drifts may occur in several different situations, in applications such as
spam filtering [6], credit card fraud detection [22], and intrusion detection [18].
In recent years, many proposals have been made to deal with concept drifts, like
the use of concept drift detectors and ensemble classifiers. One existing solution to deal with recurring concept drifts, named RCD, was previously proposed; it performs non-parametric multivariate statistical tests to identify whether a concept is recurring and, if so, reuses the classifier built on similar data.
In this paper, we present the results of executing the statistical tests in parallel:
how much faster it is when compared to sequential execution, in which situations it
reports better results, the influence of abrupt and gradual concept drifts on the test results, and how RCD performs in environments with different numbers of processing cores.
The rest of this paper is organized as follows: Sect. 2 presents some common
techniques used to deal with concept drifts; Sect. 3 summarizes the
RCD framework
and how the parallelism was implemented; Sect. 4 describes the data sets used and
their parameters, the evaluation methodology, the
RCD configuration, and other in-
formation about the experiments; Sect. 5 introduces the results of the experiments;
and, finally, Sect. 6 presents our conclusions.
2 Background
There are many approaches used to deal with concept drifts. One approach is to
create a single classifier that adapts its internal structure as new data arrive. A com-
monly used single classifier is based on a Hoeffding tree [7], also named
VFDT
(Very Fast Decision Tree). It is a decision tree that uses the Hoeffding bound to decide how many examples it needs to observe before selecting the split attribute of a decision node. Its accuracy is similar to that of a batch decision tree, but it uses much less memory. In its original form, it was not designed to handle concept drifts. Many
extensions have already been proposed to adapt Hoeffding trees to deal with concept
drifts.
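For reference, a minimal statement of the Hoeffding bound on which VFDT relies (the standard form of the bound, not quoted from this chapter): after $n$ independent observations of a random variable with range $R$, the true mean differs from the observed mean by more than

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

with probability at most $\delta$. VFDT splits a node once the observed difference in the split criterion (e.g., information gain) between the two best attributes exceeds $\epsilon$.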
One of these proposals is named
CVFDT (Concept-adapting Very Fast Decision
Tree) [16]. It states that
CVFDT “is an extension to VFDT which maintains VFDT's
speed and accuracy advantages but adds the ability to detect and respond to changes
in the example-generating process”. It uses a sliding window of examples to try to
keep its model up-to-date. For each new arriving instance, statistics are recomputed,
reducing the influence of older instances. When the concept begins to change, alter-
native attributes increase their information gain, making the Hoeffding test on the split fail. An alternative tree begins to grow with the new best attribute at its root. If this alternative subtree becomes more accurate than the old one on new data, it replaces the old subtree.
VFDTc [13], on the other hand, extends VFDT with the ability to deal with numeric
attributes and uses naive Bayes classifiers at tree leaves. Proposals with decision
rules were also made [9].
Another common approach to deal with a concept drift is to identify when it
occurs and create a new classifier. Therefore, only classifiers trained on a current
concept are maintained. Algorithms that follow this approach work in the following
way: each arriving training instance is first evaluated by the base classifier. Internal
statistics are updated with the results and two thresholds are computed: a warning
level and an error level. As the base classifier makes mistakes, the warning level
is reached and instances are stored. If the behavior continues, the error level will
be reached, indicating that a concept drift has occurred. At this moment, the base
classifier is destroyed and a new base classifier is created and initially trained on
the stored instances. On the other hand, if the classifier starts to correctly evaluate
instances, this situation is considered a false alarm and stored instances are flushed.
Algorithms that follow this approach can work with any type of classifier as they
only analyze how the classifier evaluates instances.
One example of this approach is
DDM (Drift Detection Method) [10]. It works by controlling the algorithm's error rate. For each point $i$ in the sequence of arriving instances, the error rate is computed as the probability of misclassification ($p_i$), with standard deviation given by $s_i = \sqrt{p_i (1 - p_i)/i}$. Statistical theory guarantees that, when the distribution changes, the error will increase. The values of $p_i$ and $s_i$ are stored when $p_i + s_i$ reaches its minimum value during the process (obtaining $p_{min}$ and $s_{min}$). The warning level is reached when $p_i + s_i \geq p_{min} + 2 \times s_{min}$ and the error level is set at $p_i + s_i \geq p_{min} + 3 \times s_{min}$.
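As a rough illustration of the rule above, the following is a minimal sketch of a DDM-style detector (not the original implementation; the class and method names are ours):

```java
// Minimal sketch of the DDM update rule described above.
// Each call reports whether the current error statistics cross
// the warning or drift (error) thresholds.
public class SimpleDDM {
    public enum Level { NORMAL, WARNING, DRIFT }

    private int n = 0;          // number of processed instances
    private int errors = 0;     // number of misclassifications
    private double pMin = Double.MAX_VALUE;
    private double sMin = Double.MAX_VALUE;

    public Level update(boolean misclassified) {
        n++;
        if (misclassified) errors++;
        double p = (double) errors / n;              // error rate p_i
        double s = Math.sqrt(p * (1 - p) / n);       // standard deviation s_i
        if (p + s < pMin + sMin) {                   // track the minimum of p_i + s_i
            pMin = p;
            sMin = s;
        }
        if (p + s >= pMin + 3 * sMin) return Level.DRIFT;
        if (p + s >= pMin + 2 * sMin) return Level.WARNING;
        return Level.NORMAL;
    }
}
```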
Another similar method is the Early Drift Detection Method (EDDM) [1]. It works similarly to DDM but, instead of controlling solely the amount of error of the classifier, it uses the distance between two consecutive errors to identify concept drifts. It computes the average distance between two errors ($p_i$) and the standard deviation of $p_i$ ($s_i$). These values are stored when $p_i + 2 \times s_i$ reaches its maximum value (obtaining $p_{max}$ and $s_{max}$). Thus, the value of $p_{max} + 2 \times s_{max}$ corresponds to the point where the distribution of distances between errors is maximum. EDDM was shown to be more adequate to detect gradual concept drifts, while DDM was better suited for abrupt concept drifts [1].
Exponentially weighted moving average (
EWMA) charts [19] were originally pro-
posed for detecting an increase in the mean of a sequence of random variables, con-
sidering that the mean and standard deviation of the stream are known. Yeh et al.
[25] proposed an
EWMA change detector for a sequence of random variables that
form a Bernoulli distribution.
ECDD (EWMA for Concept Drift Detection) [20] extends EWMA to monitor the misclassification rate of a streaming classifier, allowing
the rate of false positive detection to be controlled and kept constant over time.
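In general terms (the ECDD-specific control limits are derived in [20]), an EWMA chart over the 0/1 misclassification indicators $X_t$ maintains the statistic

$$Z_0 = \hat{p}_0, \qquad Z_t = (1-\lambda)\,Z_{t-1} + \lambda X_t,$$

and signals a change when $Z_t$ exceeds a control limit of the form $\hat{p}_t + L\,\sigma_{Z_t}$, where $\lambda$ controls how quickly old observations are forgotten and $L$ sets the false positive rate.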
Several proposals try to deal with concept drifts by the use of ensemble classi-
fiers. This approach maintains a collection of learners and combines their decisions
to make an overall decision. To deal with concept drifts, ensemble classifiers must
take into account the temporal nature of the data stream.
Learn++.NSE is a recent proposal of an ensemble classifier. The original algorithm [8] works as follows: a single classifier is created for each data set that
becomes available. The algorithm first evaluates the classification accuracy of the
current ensemble on the newly available data, obtained by the weighted majority
voting of all classifiers in the ensemble. Its error is computed as a simple ratio of
the correctly identified instances of the new data set and normalized in the interval
[0,1]. Then, the weights of the instances are updated: the weights of the instances
misclassified by the ensemble are reduced by a factor of the normalized error. The
weights are then normalized; a new classifier is created; and all the classifiers gener-
ated so far are evaluated on the current data set, by computing their weighted error.
If the error of the most recent classifier is greater than 0.5, it is discarded and a new
one is created. For each of the other classifiers, if its error is greater than 0.5, its
voting power is removed during the weighted majority voting.
Another proposal for ensemble classifier is
DWAA (Dynamic Weight Assignment
and Adjustment) [24]. It creates classifiers based on data chunks, using the next
chunk to evaluate the classifier previously built. If the ensemble is not full, the clas-
sifier is added; otherwise, the worst classifier in the last data chunk is replaced. To
set the weight, it uses a formula that considers how many of the ensemble classifiers
have actually made correct predictions. If more than half of the classifiers' predictions are correct, each correct classifier receives a normal reward. Otherwise, each one receives a higher reward, giving those classifiers more influence on the global decision of the ensemble, as they are better suited to represent the concept.
3 Parallel RCD
RCD [14, 15] is a framework developed to deal with recurring concept drifts. It
keeps a collection of pairs of classifiers and samples used to train these classifiers,
as presented in Fig. 1. In the training phase, a concept drift detector is used. If it
identifies a concept drift, a multivariate non-parametric statistical test is performed
to compare actual data to stored samples. If the statistical test informs that both data
come from the same distribution, the classifier associated with the stored sample is
reused, meaning that the classifier is adequate to deal with actual data.
On the other hand, if the test indicates that samples are not similar, the next stored
data sample is used for testing, and so on. If no stored classifier is apt for actual data,
a new classifier is created and stored in the set. If the set is full, the oldest classifier is replaced. In the testing phase, statistical tests are performed every t instances
(a user parameterized value) to select, from the stored classifiers, the best one for
actual data. Thus,
RCD dynamically adapts to the current data distribution even in
the testing phase.
Originally,
RCD performed the statistical tests sequentially. Thus, a statistical test would be performed comparing actual data to the data stored in the buffer of classifier 1 to verify if both represented the same data distribution. If positive, this classifier
was considered the new actual classifier; else, a statistical test would be performed on the data in the buffer of classifier 2, and so on.

Fig. 1 RCD classifiers set: a collection of pairs Classifier 1 / Buffer 1, Classifier 2 / Buffer 2, ..., Classifier n / Buffer n
The improvement being proposed is to perform several tests simultaneously by
the use of a thread pool of configurable fixed size, allowing the user to fine-tune
its value based on the hardware being used. Fig. 2 presents an example illustrating
how the thread pool works. It considers a thread pool with two active cores and a
classifiers set of size six.
When a concept drift occurs, it means the actual classifier does not correctly
represent the actual context. So, it is necessary to check whether any stored classifier
better represents the actual context. The remaining five classifiers stored in the set
must be tested, comparing a sample of actual data to the data stored in the buffer
associated with each classifier which represents the data the classifier was trained
on. Five threads are built to perform the statistical tests and they are sent to the thread
pool using a FIFO scheme to associate each test to a position in the thread pool, but
only the first two are active, i.e., are actually performing a statistical test. In Fig. 2 they are represented by bolder lines, and inactive threads by thinner lines. At this
point (t = 0), two statistical tests are active and the remaining three are waiting to
execute.
When the first statistical test finishes (let’s consider statistical test 1), if the result
indicates that actual data and sample data from classifier 1 do not represent the
same data distribution, the next inactive statistical test (in this case, statistical test
3) executes in the corresponding slot (t = 1). At t = 2, the same occurs. Classifier
2, represented by statistical test 2, also does not better represent actual data and the
next statistical test (number 4) occupies its place.
Now, let’s consider that statistical test 3 has finished and it identified that actual
data and data stored in the buffer of classifier 3 represent the same distribution. In
this situation, this classifier substitutes the actual classifier, all other active statistical tests are stopped from executing, and the inactive ones are canceled.

Fig. 2 Example of a thread pool execution with two active slots and five statistical tests: at t = 0, tests 1 and 2 occupy the active slots while tests 3–5 wait in inactive slots; at t = 1, test 3 takes the slot of the finished test 1; at t = 2, test 4 takes the slot of the finished test 2
This scheme is interesting because, if a test is negative, the next test to perform is already being executed, speeding up the overall execution of the algorithm. If a test
is positive, all other executing tests are stopped and tests yet to be executed do not
enter the active thread pool.
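A minimal sketch of this scheme in Java (the language of the MOA extension), using a fixed-size thread pool and a completion service; the class and interface names are illustrative and not the actual RCD code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the parallel test scheme described above: a fixed-size
// thread pool receives one task per stored buffer, tasks are consumed in FIFO
// order, and as soon as one test reports similarity the remaining tasks are cancelled.
public class ParallelTestRunner {

    /** One statistical test comparing actual data against a stored buffer. */
    interface StatisticalTest extends Callable<Boolean> { }

    /**
     * Returns the index of the first buffer whose test reports that both
     * samples come from the same distribution, or -1 if none does.
     */
    public static int findMatchingBuffer(List<StatisticalTest> tests, int poolSize)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        ExecutorCompletionService<Boolean> ecs = new ExecutorCompletionService<>(pool);
        List<Future<Boolean>> futures = new ArrayList<>();
        try {
            for (StatisticalTest t : tests) {          // FIFO submission
                futures.add(ecs.submit(t));
            }
            for (int done = 0; done < tests.size(); done++) {
                Future<Boolean> finished = ecs.take(); // next finished test
                try {
                    if (finished.get()) {              // same distribution found
                        return futures.indexOf(finished);
                    }
                } catch (java.util.concurrent.ExecutionException e) {
                    // a failed test is treated as "not similar"
                }
            }
            return -1;                                 // no stored classifier fits
        } finally {
            pool.shutdownNow();                        // cancel active and pending tests
        }
    }
}
```

Submitting the tests to a completion service lets the caller react to whichever test finishes first, which is what allows a positive result to short-circuit the remaining comparisons.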
Notice that this scheme is general and allows the execution of any statistical
test in parallel. Source code and instructions on how to use
RCD are available as a
MOA extension and can be obtained at http://sites.google.com/site/
moaextensions/.
4 Experiments Configuration
We used several data sets to perform the experiments: Hyperplane [16], LED [11],
SEA [21], Forest Covertype [4], Poker Hand [3], and Electricity [10, 12]. The first
three are artificial data sets: the first one presents gradual concept drifts while the
following two present abrupt concept drifts. The last three are real-world data sets.
These data sets and their configurations are the same as used by Bifet et al. [3].
Hyperplane was tested with ten million instances, while LED and SEA were tested with one million. All tests on the artificial data sets were repeated ten times and a 95% confidence interval was computed. The parameters of these streams are the following:
- HYP(x,v) represents a Hyperplane data stream with x attributes changing at speed v;
- LED(v) appends four concepts (1, 3, 5, 7), each one representing a different number of drifting attributes, with length of change v;
- SEA(v) uses the same four concepts, in the same order as defined in the original paper [21], with length of change v.
The
RCD configuration used in the experiments includes naive Bayes as base learner,
classifiers collection size set to 15,
KNN as the statistical test used (with k = 3), and
the minimum amount of similarity between data samples set to 0.05. Two buffer
sizes, two test frequencies (only in the testing phase), and three thread pool sizes
have been used.
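Since the chapter does not reproduce the KNN test itself, the following is a generic sketch of a nearest-neighbor two-sample statistic in that spirit (this is not the RCD implementation): the two samples are pooled and, for each point, we count how many of its k nearest neighbors come from the same sample; values close to the proportion expected under the null hypothesis suggest that both samples share the same distribution.

```java
import java.util.Arrays;

// Generic sketch of a nearest-neighbor (KNN) two-sample statistic.
// Assumes both samples are non-empty, have the same dimensionality,
// and that k is smaller than the pooled sample size.
public class KnnTwoSampleSketch {

    /** Fraction of same-sample neighbors among the k nearest, averaged over all points. */
    public static double sameSampleProportion(double[][] a, double[][] b, int k) {
        int n = a.length + b.length;
        double[][] pooled = new double[n][];
        boolean[] fromA = new boolean[n];
        for (int i = 0; i < a.length; i++) { pooled[i] = a[i]; fromA[i] = true; }
        for (int j = 0; j < b.length; j++) { pooled[a.length + j] = b[j]; }

        double total = 0;
        for (int i = 0; i < n; i++) {
            // rank all other pooled points by distance from point i
            Integer[] idx = new Integer[n];
            double[] dist = new double[n];
            for (int j = 0; j < n; j++) {
                idx[j] = j;
                dist[j] = (j == i) ? Double.MAX_VALUE : squaredDistance(pooled[i], pooled[j]);
            }
            Arrays.sort(idx, (x, y) -> Double.compare(dist[x], dist[y]));
            int same = 0;
            for (int m = 0; m < k; m++) {
                if (fromA[idx[m]] == fromA[i]) same++;
            }
            total += (double) same / k;
        }
        return total / n;   // compare against the proportion expected under H0
    }

    private static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += (x[d] - y[d]) * (x[d] - y[d]);
        return s;  // squared distance is enough for ranking neighbors
    }
}
```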
The evaluation methodology used was Interleaved Chunks, also known as data
block evaluation method [5], on ten runs. It initially reads a block of d instances.
When the block is formed, it uses the instances for testing the existing classifier and
then the classifier is trained on the instances. This methodology was used because it
is better suited to compute training and testing times. In the following experiments,
d was set to 100,000 instances in the Hyperplane, Covertype, and Poker Hand data
sets, and to 10,000 instances in the LED, SEA, and Electricity data sets.
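A minimal sketch of this test-then-train loop, assuming placeholder Instance and Classifier types rather than the actual MOA API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the Interleaved Chunks (data block) evaluation described
// above: each block of d instances is first used to test the current model
// and then to train it.
public class InterleavedChunks {

    interface Instance { int trueClass(); }

    interface Classifier {
        int classify(Instance x);
        void train(Instance x);
    }

    /** Returns the accuracy measured block by block (test first, then train). */
    public static List<Double> evaluate(Iterable<Instance> stream, Classifier model, int d) {
        List<Double> accuracies = new ArrayList<>();
        List<Instance> block = new ArrayList<>(d);
        for (Instance x : stream) {
            block.add(x);
            if (block.size() == d) {
                int correct = 0;
                for (Instance i : block) {            // testing phase
                    if (model.classify(i) == i.trueClass()) correct++;
                }
                accuracies.add((double) correct / d);
                for (Instance i : block) {            // training phase
                    model.train(i);
                }
                block.clear();
            }
        }
        return accuracies;
    }
}
```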
All the experiments were performed using the Massive Online Analysis (
MOA)
framework [2] on a Core i3 330M processor with 4GB of main memory running Windows 7 Professional. This processor exposes four cores, two physical and two virtual (Hyper-Threading) ones, where each core runs at 2.13GHz.
We used a modified version of the Interleaved Chunks evaluator available in the MOA framework because the original version only computes the CPU time that the thread executing the classifier uses on the processor. With that approach, it is possible to run other applications at the same time without affecting the results. However, because we use a thread pool to perform the statistical tests, their times would not be counted, since the original thread may not be active on the processor while they run. Here, we measure the real (wall-clock) time taken by RCD, which means it is not possible to run other applications at the same time during the experiments.
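To illustrate the distinction (a sketch, not the actual MOA evaluator code): ThreadMXBean reports per-thread CPU time, which misses work done by the pool's worker threads, whereas System.nanoTime() measures elapsed wall-clock time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch of the two timing approaches: per-thread CPU time only counts work
// done by the calling thread, while wall-clock time also covers work done
// by the statistical-test threads in the pool.
public class TimingExample {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        long cpuStart = bean.getCurrentThreadCpuTime();   // per-thread CPU time (ns)
        long wallStart = System.nanoTime();               // wall-clock time (ns)

        doWorkPossiblyInOtherThreads();

        long cpuNanos = bean.getCurrentThreadCpuTime() - cpuStart;
        long wallNanos = System.nanoTime() - wallStart;
        System.out.printf("CPU time: %d ms, wall-clock time: %d ms%n",
                cpuNanos / 1_000_000, wallNanos / 1_000_000);
    }

    private static void doWorkPossiblyInOtherThreads() {
        // placeholder for the evaluation being timed
    }
}
```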
5 Results
Table 1 presents the average number of detected concept drifts, classifiers set size,
number of reused classifiers, as well as the evaluation, train and test times (in sec-
onds) for
RCD considering the ten runs, using a buffer size with 100 instances and a
test frequency of 500, considering thread pools with one, two, and four active cores.
It is worth pointing out that the results were quite similar in the artificial data sets,
regardless of the thread pool size. This behavior did not occur in the real-world data
sets and is probably related to the number of detected concept drifts. In the artificial
data sets, the average number of detected concept drifts is considerably low, as can be seen in the first column (CD); a small number of concept drifts demands only a few statistical tests to be performed.
However, not only the number of detected concept drifts influences performance.
Reusing classifiers also matters. If the first tests identify similarity between distribu-
tions, several other tests will not be executed, reducing the benefits of using a thread
pool. On the other hand, if only the last tests or none at all identify similarity, more
tests need to be executed. This is the situation expected to benefit the most from the parallelization of the statistical tests.
For example, the average classifiers collection size (CS) in the artificial data sets
is below three, not even close to filling the set (15 classifiers). Having few stored
classifiers indicates that few statistical tests need to be performed. The difference
between the number of detected concept drifts and the number of stored classifiers
is due to the reuse of classifiers. Analyzing the column with the number of reused
classifiers (RC), we can see that the values are very close to the ones presented in
the first column. This means that in the majority of the concept drifts a classifier was
reused.
In this
RCD configuration, analyzing the artificial data sets, using one core had statistically better results than using two cores in both configurations of Hyperplane and in LED, but worse results in both versions of SEA. Using one core also performed statistically better than using four cores in the HYP(10, 0.0001) and LED data sets, but worse in SEA, similarly to the two-core case. In the HYP(10, 0.001) data set, both had statistically similar results. Using four cores had
better statistical results than using two cores in the Hyperplane and SEA data sets,
and similar ones in LED.
Table 1 Results for a buffer with 100 instances and test frequency of 500 instances (in seconds)
Data sets | CD | CS | RC | 1 core (eval, train, test) | 2 cores (eval, train, test) | 4 cores (eval, train, test)
HYP(10,0.001) 7.7 2.6 6.0 99.64 44.45 36.30 101.49 45.40 39.98 100.13 46.13 38.20
HYP(10,0.0001) 8.2 2.4 6.7 98.99 44.06 36.15 101.49 45.35 40.29 100.10 46.18 38.18
SEA(50) 0.1 1.1 0.0 4.70 2.11 1.09 4.32 1.53 1.31 4.30 1.56 1.27
SEA(50000) 0.4 1.1 0.3 4.63 1.97 1.25 4.26 1.54 1.31 4.23 1.56 1.27
LED(50000) 0.3 1.2 0.1 32.56 14.23 12.78 33.54 14.72 13.25 33.54 14.70 13.24
Covertype 2980.0 15.0 844.0 98.12 80.68 8.30 84.24 66.99 8.25 73.99 56.77 8.32
Poker Hand 1871.0 15.0 109.0 45.72 37.86 3.78 38.31 30.45 3.81 31.67 23.82 3.82
Electricity 212.0 15.0 30.0 2.89 2.48 0.09 2.40 1.98 0.09 2.00 1.58 0.09
Table 2 Thread pool management for a buffer with 100 instances and test frequency of 500 instances (in milliseconds)
Data sets | Creation time (1, 2, 4 cores) | Execution time (1, 2, 4 cores) | Destruction time (1, 2, 4 cores)
HYP(10,0.001) 2.49 1.03 1.03 1.87 2.22 2.63 0.00 0.00 0.00
HYP(10,0.0001) 2.49 1.16 0.40 1.73 2.90 3.68 0.00 0.00 0.00
SEA(50) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SEA(50000) 4.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
LED(50000) 0.00 5.33 0.00 5.00 0.00 10.33 0.00 0.00 0.00
Covertype 2.12 2.25 2.97 25.76 19.84 15.04 0.03 0.02 0.04
Poker Hand 2.01 2.21 2.96 14.81 10.89 6.10 0.01 0.01 0.06
Electricity 2.16 1.95 2.92 8.82 6.58 3.42 0.00 0.00 0.00
Table 3 Results for a buffer and test frequency of 100 instances (in seconds)
Data sets | CD | CS | RC | 1 core (eval, train, test) | 2 cores (eval, train, test) | 4 cores (eval, train, test)
HYP(10,0.001) 8.4 2.8 6.5 422.74 45.61 360.38 487.43 46.86 424.62 499.94 46.87 437.12
HYP(10,0.0001) 10.2 2.3 8.2 421.24 45.37 359.88 459.86 46.66 397.05 454.82 46.63 392.47
SEA(50) 0.1 1.1 0.0 6.18 1.51 3.15 6.25 1.55 3.19 6.22 1.54 3.16
SEA(50000) 0.2 1.1 0.1 6.27 1.53 3.23 6.35 1.56 3.29 6.34 1.58 3.25
LED(50000) 2.3 1.2 2.1 37.22 14.25 17.38 38.47 14.89 17.98 38.51 14.90 18.06
Covertype 3376.0 15.0 944.0 534.32 86.47 438.71 354.76 72.31 273.38 315.25 60.87 245.36
Poker Hand 1871.0 15.0 109.0 305.93 37.83 264.08 229.43 30.78 194.59 166.22 23.65 138.53
Electricity 212.0 15.0 30.0 12.40 2.43 9.62 9.50 2.07 7.08 6.80 1.50 4.96
Table 4 Thread pool management for a buffer and test frequency of 100 instances (in milliseconds)
Data sets | Creation time (1, 2, 4 cores) | Execution time (1, 2, 4 cores) | Destruction time (1, 2, 4 cores)
HYP(10,0.001) 0.39 0.49 0.51 2.85 3.36 3.47 0.01 0.01 0.01
HYP(10,0.0001) 0.32 0.37 0.36 2.91 3.21 3.17 0.01 0.01 0.01
SEA(50) 0.03 0.03 0.02 0.15 0.15 0.15 0.00 0.00 0.00
SEA(50000) 0.02 0.03 0.03 0.16 0.16 0.16 0.00 0.00 0.00
LED(50000) 0.03 0.03 0.04 0.40 0.41 0.41 0.00 0.00 0.00
Covertype 2.05 2.35 3.14 73.84 46.14 39.73 0.01 0.02 0.07
Poker Hand 2.14 2.39 3.08 30.97 21.95 14.00 0.01 0.02 0.08
Electricity 2.08 2.75 2.85 21.79 15.07 9.69 0.08 0.00 0.03
On the other hand, in the real-world data sets, more cores returned lower eval-
uation times: two cores were faster than one core and using four cores was faster
than the other two thread pool sizes. In the artificial data sets, using one core was,
on average, 1.90% faster than using two cores, but 14.84% slower in the real-world
data sets. Comparing one and four cores, similar results apply. One core was faster
by 0.74% in the artificial data sets, while four cores were faster by 26.63% in the real data sets. Using four cores had practically the same performance as using two cores in the artificial data sets, being 1.14% faster, but was 13.85% faster in the real-world data sets. The real-world data sets presented a huge number of concept drifts,
the classifiers set became full and the number of reused classifiers was also much
bigger than in the artificial data sets.
To better analyze how the thread pool influences performance, we computed the
average amount of time (in milliseconds) needed to create the thread pool and to
assign the statistical tests to their respective slots, to execute the thread pool, and to
finalize it. In the assignment stage, if a statistical test is assigned to an active slot,
it starts executing immediately, while other tests are still being assigned, so it is not
necessary to wait for all tests to be assigned a specific slot to start execution, saving
time. This information is presented in Table 2.
Observing the real-world data sets, it is possible to notice that, in general, the
greater the number of cores, the longer was the time spent in the creation of the
thread pool, but the differences are usually very small. In the execution time, using
more cores meant faster execution, with no exception. The destruction times were
usually negligible, taking less than 0.1 milliseconds.
The results in Table 3 are similar to those presented in Table 1, but the test frequency in the experiments was increased to one test every 100 instances. Again, parallelism outperforms the sequential solution in the real-world data sets, the ones with more detected concept drifts.
In these tests, we can also notice that the evaluation time is mostly spent in the
testing phase, differently from the results of Table 1. As the test frequency is higher,
the evaluation time and the time spent in the testing phase increased considerably.
Making the tests more frequent also increased the number of detected concept drifts in 50% of the data sets. In SEA(50000), it decreased from 0.4 to 0.2. In Poker Hand,
Electricity, and SEA(50), the number of detected concept drifts stayed the same.
Using one core was faster than using two cores by 11.72% on average in the artificial data sets, but was 30.37% slower in the real-world data sets. Comparing one core to four cores, similar results apply: one core was faster by 12.55% in the artifi-
cial data sets and four cores was faster by 42.74% in the real-world data sets. Using
two cores offered better average results compared to using four cores by 0.75% in
the artificial data sets and worse performance by 17.76% in the real-world ones.
Table 4 presents information similar to that in Table 2. Results are also similar: the creation time is usually slightly shorter when using fewer cores, the execution time is considerably smaller when using more cores, and destruction times are usually below 0.1 milliseconds.
Instead of increasing the test frequency, Table 5 presents information similar to Tables 1 and 3, but with the buffer size increased to 500 instances. Here, the tests took
Table 5 Results for a buffer and test frequency of 500 instances (in seconds)
Data sets | CD | CS | RC | 1 core (eval, train, test) | 2 cores (eval, train, test) | 4 cores (eval, train, test)
HYP(10,0.001) 11.9 2.6 9.9 1339.33 46.07 1275.16 1453.37 46.65 1388.60 1508.29 48.98 1441.27
HYP(10,0.0001) 9.5 2.4 7.5 1356.57 45.77 1293.32 1404.87 46.01 1341.41 1413.77 47.41 1348.88
SEA(50) 1.1 1.1 1.0 11.17 1.59 7.30 11.17 1.57 7.31 11.25 1.62 7.30
SEA(50000) 0.2 1.1 0.1 11.39 1.56 7.57 11.40 1.58 7.55 11.49 1.60 7.60
LED(50000) 0.2 1.2 0.0 57.99 14.28 37.33 53.98 14.25 33.33 55.14 14.92 33.85
Covertype 3063.0 15.0 798.0 1781.17 358.50 1413.39 1107.10 244.95 853.07 1167.37 268.35 889.92
Poker Hand 410.0 15.0 18.0 588.42 47.24 537.02 359.69 34.10 321.47 377.63 36.19 336.09
Electricity 183.0 15.0 20.0 33.38 8.85 24.15 20.87 6.02 14.46 22.92 7.19 15.26
Table 6 Thread pool management for a buffer and test frequency of 500 instances (in milliseconds)
Data sets | Creation time (1, 2, 4 cores) | Execution time (1, 2, 4 cores) | Destruction time (1, 2, 4 cores)
HYP(10,0.001) 0.54 0.60 0.60 61.90 67.54 70.14 0.02 0.02 0.02
HYP(10,0.0001) 0.50 0.55 0.51 62.94 65.31 65.66 0.02 0.01 0.02
SEA(50) 0.04 0.03 0.04 2.99 2.99 2.98 0.00 0.00 0.00
SEA(50000) 0.04 0.05 0.04 3.10 3.09 3.09 0.00 0.00 0.01
LED(50000) 0.06 0.04 0.05 12.32 10.27 10.26 0.00 0.00 0.00
Covertype 2.01 2.34 13.66 560.81 342.49 354.36 0.02 0.01 0.06
Poker Hand 2.27 2.65 6.36 314.13 186.57 187.92 0.01 0.06 0.03
Electricity 2.04 2.28 5.23 140.45 91.20 91.78 0.00 0.06 0.06
longer to complete: 62 milliseconds on average, compared to 3 milliseconds when using a buffer with 100 instances. Nevertheless, the results were similar to the ones presented in Table 3. Using parallelism was much faster in the real-world data sets
and slightly slower in the artificial ones. Using one core was 5.70% faster than using
two cores in the artificial data sets but 38.09% slower in the real-world ones.
However, it was interesting to observe that the evaluation time was lower using
two cores than with four. Comparing one and four cores, similar results apply: one core was 8.05% faster in the artificial data sets and 34.75% slower in the real-world ones. Using two cores was slightly better than using four: 2.22% in the artificial and 5.39% in the real-world data sets. This probably occurs because there are many more statistical tests to perform and they take longer to complete than the tests in
the other configurations, putting a higher load on the whole system and negatively
affecting the performance.
Comparing Tables 1, 3 and 5, it is possible to notice that the increase in the
buffer size had a higher influence on the evaluation time than the increase in the
test frequency. Increasing the test frequency by five times increased the evaluation
time between 4.26 and 4.50 times. Increasing the buffer size by five times increased
the evaluation time between 11.95 and 13.37 times. The training time practically
did not change in the three configurations performed; the increase in the evaluation
time was due to the testing time.
Table 6 presents the times taken for the thread pool management, as previously described in Tables 2 and 4. In the artificial data sets, the creation times are very close for the three numbers of active cores used. In the real-world data sets, one and two cores take very similar times, and using four cores takes more time than the other two. This probably occurs because, during creation, the statistical tests associated with active cores begin executing while other tests are still being assigned. Thus, the creation time tends to be higher when there are more active cores and the tests take longer to complete. We can see this by comparing the three tables concerning thread pool management. In Tables 2 and 4, the differences in the creation time between one, two, and four cores are almost negligible; in these cases, the average time taken to perform a statistical test is three milliseconds. In Table 6, using four cores takes more time than using one or two cores. Here, the average time taken to perform
the statistical test is 62 milliseconds. The execution times are very similar in the ar-
tificial data sets. In the real-world data sets, using two or four cores is considerably
faster than using one active core. Using two cores was faster than using four cores in
the Covertype data set, while in the other two real-world data sets, the performances
were quite similar.
6 Conclusion
This paper studied the influence of executing parallel statistical tests in the RCD
framework using six data sets (eight configurations), with and without concept
drifts, with abrupt and gradual concept drifts, and considering artificial and real-
world data sets. Tests were performed with sequential execution and with parallel execution using thread pools with two and four active threads.
Analysis of the experiment results led to the conclusion that the execution of
parallel statistical tests was most beneficial when there was a high number of de-
tected concept drifts leading to more statistical tests being performed. Tests were
also performed to analyze the performance results in the following conditions:
1. the buffer size was increased, making the statistical tests take longer to complete; and
2. the test frequency was increased, making more statistical tests be performed.
In data sets with a small number of detected concept drifts (the artificial data sets),
the performances were quite similar, but sequential execution had evaluation times lower by 0.74% to 12.55%. On the other hand, in the data sets with a high number of detected concept drifts (the real-world ones), using parallelism improved performance by values ranging from 13.85% to 42.74%.
Increasing the test frequency five-fold increased the evaluation time by more than four times, while increasing the buffer size five-fold increased the evaluation time by more than 11 times, indicating that the buffer size has a higher impact on performance than the test frequency.
An analysis of the thread pool creation, execution, and destruction times was also performed, showing that, as expected, the major improvement occurs in the execution phase. The creation times using different numbers of cores are close to
one another and the destruction times are commonly negligible, taking less than 0.1
milliseconds.
6.1 Future Work
Further experiments could be performed to better understand how the execution of parallel statistical tests can improve the performance of the RCD framework. One of them is testing the influence of the number of available processor cores on performance. Another possible experiment is to analyze the influence of other buffer sizes and test frequencies.
References
1. Baena-García, M., Del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., Morales-Bueno, R.: Early drift detection method. In: International Workshop on Knowledge Discovery from Data Streams, IWKDDS 2006, pp. 77–86 (2006),
http://eprints.pascal-network.org/archive/00002509/
2. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: Massive online analysis. J. of
Mach. Learn. Res. 11, 1601–1604 (2010),
http://portal.acm.org/citation.cfm?id=1859890.1859903
3. Bifet, A., Holmes, G., Pfahringer, B., Frank, E.: Fast perceptron decision tree learn-
ing from evolving data streams. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.)
PAKDD 2010, Part II. LNCS (LNAI), vol. 6119, pp. 299–310. Springer, Heidelberg
(2010), http://dx.doi.org/10.1007/978-3-642-13672-6_30
4. Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and dis-
criminant analysis in predicting forest cover types from cartographic variables. Comput.
and Electron. in Agric. 24(3), 131–151 (1999),
http://dx.doi.org/10.1016/S0168-1699(99)00046-0
5. Brzeziński, D., Stefanowski, J.: Accuracy updated ensemble for data streams with concept drift. In: Corchado, E., Kurzyński, M., Woźniak, M. (eds.) HAIS 2011, Part II. LNCS, vol. 6679, pp. 155–163. Springer, Heidelberg (2011),
http://dx.doi.org/10.1007/978-3-642-21222-2_19
6. Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L.: A case-based technique for tracking concept drift in spam filtering. Knowl.-Based Syst. 18(4-5), 187–195
(2005), http://dx.doi.org/10.1016/j.knosys.2004.10.002; AI-2004,
Cambridge, England, December 13-15 (2004)
7. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the Sixth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD 2000, New York, NY, USA, pp. 71–80 (2000),
http://dx.doi.org/10.1145/347090.347107
8. Elwell, R., Polikar, R.: Incremental learning of concept drift in nonstationary environ-
ments. IEEE Trans. on Neural Netw. 22(10), 1517–1531 (2011),
http://dx.doi.org/10.1109/TNN.2011.2160459
9. Ferrer-Troyano, F., Aguilar-Ruiz, J.S., Riquelme, J.C.: Discovering decision rules from
numerical data streams. In: Proceedings of the 2004 ACM Symposium on Applied Com-
puting, SAC 2004, New York, NY, USA, pp. 649–653 (2004),
http://dx.doi.org/10.1145/967900.968036
10. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Bazzan,
A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 286–295. Springer,
Heidelberg (2004),
http://dx.doi.org/10.1007/978-3-540-28645-5_29
11. Gama, J., Medas, P., Rocha, R.: Forest trees for on-line data. In: Proceedings of the 2004
ACM Symposium on Applied Computing, SAC 2004, New York, NY, USA, pp. 632–
636 (2004),
http://dx.doi.org/10.1145/967900.968033
12. Gama, J., Medas, P., Rodrigues, P.: Learning decision trees from dynamic data streams.
In: Proceedings of the 2005 ACM Symposium on Applied Computing, SAC 2005, New
York, NY, USA, pp. 573–577 (2005),
http://dx.doi.org/10.1145/1066677.1066809
13. Gama, J., Rocha, R., Medas, P.: Accurate decision trees for mining high-speed data
streams. In: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2003, New York, NY, USA, pp. 523–528
(2003), http://dx.doi.org/10.1145/956750.956813
14. Gonçalves Jr., P.M., Barros, R.S.M.: A comparison on how statistical tests deal with
concept drifts. In: Arabnia, H.R., et al. (eds.) Proceedings of the 2012 International Con-
ference on Artificial Intelligence, ICAI 2012, vol. 2, pp. 832–838. CSREA Press, Las
Vegas (2012)
15. Gonçalves Jr., P.M., Barros, R.S.M.: RCD: A recurring concept drift framework. Pattern
Recognit. Lett. (to appear, 2013),
http://dx.doi.org/10.1016/j.patrec.2013.02.005
16. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Pro-
ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, KDD 2001, New York, NY, USA, pp. 97–106 (2001),
http://dx.doi.org/10.1145/502512.502529
17. Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: An ensemble method for drift-
ing concepts. J. of Mach. Learn. Res. 8, 2755–2790 (2007),
http://dl.acm.org/citation.cfm?id=1314498.1390333
18. Lane, T., Brodley, C.E.: Approaches to online learning and concept drift for user identifi-
cation in computer security. In: Agrawal, R., Stolorz, P. (eds.) Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining, KDD 1998, pp.
259–263. AAAI Press, Menlo Park (1998),
http://www.aaai.org/Papers/KDD/1998/KDD98-045.pdf
19. Roberts, S.W.: Control chart tests based on geometric moving averages. Technomet-
rics 1(3), 239–250 (1959), http://www.jstor.org/stable/1266443
20. Ross, G.J., Adams, N.M., Tasoulis, D.K., Hand, D.J.: Exponentially weighted moving
average charts for detecting concept drift. Pattern Recognit. Lett. 33(2), 191–198 (2012),
http://dx.doi.org/10.1016/j.patrec.2011.08.019
21. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classifica-
tion. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD 2001, New York, NY, USA, pp. 377–382 (2001),
http://dx.doi.org/10.1145/502512.502568
22. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensem-
ble classifiers. In: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2003, New York, NY, USA, pp. 226–235
(2003), http://dx.doi.org/10.1145/956750.956778
23. Wang, S., Schlobach, S., Klein, M.: Concept drift and how to identify it. Web Semant.:
Sci., Serv. and Agents on the World Wide Web 9(3), 247–265 (2011),
http://dx.doi.org/10.1016/j.websem.2011.05.003
24. Wu, D., Wang, K., He, T., Ren, J.: A dynamic weighted ensemble to cope with concept
drifting classification. In: The 9th International Conference for Young Computer Scien-
tists, ICYCS 2008, pp. 1854–1859 (2008),
http://dx.doi.org/10.1109/ICYCS.2008.491
25. Yeh, A.B., Mcgrath, R.N., Sembower, M.A., Shen, Q.: Ewma control charts for monitor-
ing high-yield processes based on non-transformed observations. International Journal
of Production Research 46(20), 5679–5699 (2008),
http://dx.doi.org/10.1080/00207540601182252