Speeding Up Statistical Tests to Detect
Recurring Concept Drifts
Paulo Mauricio Gonçalves Júnior and Roberto Souto Maior de Barros
Abstract. RCD is a framework for dealing with recurring concept drifts. It reuses previously stored classifiers that were trained on examples similar to the current data, through the use of multivariate non-parametric statistical tests. The original proposal performed the statistical tests sequentially. This paper improves RCD to perform the statistical tests in parallel through the use of a thread pool and shows how parallelism impacts performance. Results show that parallel execution can considerably improve the evaluation time, compared to the corresponding sequential execution, in environments where many concept drifts occur.
Keywords: Data streams, recurring concept drifts, multivariate non-parametric statistical tests, parallelism.
1 Introduction
Concept drift is a common situation when dealing with data streams. Several authors
have defined it in different terms. One of these definitions was stated by Wang et
al. [23]: “the term concept refers to the quantity that a learning model is trying
to predict, i.e., the variable. Concept drift is the situation in which the statistical
properties of the target concept change over time.” Kolter and Maloof offered a more
informal definition: “concept drift occurs when a set of examples has legitimate
class labels at one time and has different legitimate labels at another time” [17].
Paulo Mauricio Gonçalves Júnior
Instituto Federal de Educação, Ciência e Tecnologia de Pernambuco, Cidade Universitária,
50.740-540, Recife, Brasil
e-mail: paulogoncalves@recife.ifpe.edu.br
Roberto Souto Maior de Barros
Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária,
50.740-560, Recife, Brasil
e-mail: roberto@cin.ufpe.br
R. Lee (Ed.): Computer and Information Science, SCI 493, pp. 129–142.
DOI: 10.1007/978-3-319-00804-2_10
© Springer International Publishing Switzerland 2013
Concept drifts may occur in several different situations, in applications such as
spam filtering [6], credit card fraud detection [22], and intrusion detection [18].
In recent years, many proposals have been made to deal with concept drifts, such as concept drift detectors and ensemble classifiers. One existing solution to deal with recurring concept drifts, named RCD, was previously proposed: it performs non-parametric multivariate statistical tests to identify whether a concept is recurring and, if so, reuses the classifier built on similar data.
In this paper, we present the results of executing the statistical tests in parallel: how much faster it is when compared to sequential execution, in which situations it reports better results, the influence of abrupt and gradual concept drifts on the test results, and how RCD performs in environments with different numbers of processing cores.
The rest of this paper is organized as follows: Sect. 2 presents some common techniques used to deal with concept drifts; Sect. 3 summarizes the RCD framework and how the parallelism was implemented; Sect. 4 describes the data sets used and their parameters, the evaluation methodology, the RCD configuration, and other information about the experiments; Sect. 5 presents the results of the experiments; and, finally, Sect. 6 presents our conclusions.
2 Background
There are many approaches used to deal with concept drifts. One approach is to create a single classifier that adapts its internal structure as new data arrive. A commonly used single classifier is based on a Hoeffding tree [7], also named VFDT (Very Fast Decision Tree). It is a decision tree that uses the Hoeffding bound to calculate how much data it needs to process in order to select the value of a decision node. Its accuracy is similar to that of a batch decision tree, but it uses much less memory. In its original form, it was not designed to handle concept drifts; many extensions have since been proposed to adapt Hoeffding trees to deal with them.
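To make the split criterion concrete, the Hoeffding bound can be sketched as follows (a minimal illustration, not MOA's implementation; the range R = 1, confidence δ, and instance counts are illustrative values):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability 1 - delta, the true mean of a random variable with
    range R differs from its sample mean over n observations by at most
    this epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# The tree splits on the current best attribute once its information-gain
# advantage over the runner-up exceeds epsilon, which shrinks as more
# instances are observed.
print(hoeffding_bound(1.0, 1e-7, 5000))   # ~ 0.04
```

Because epsilon decreases with n, the tree postpones a split exactly until enough data has been seen to make the choice statistically safe.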
One of these proposals is named CVFDT (Concept-adapting Very Fast Decision Tree) [16]. Its authors state that CVFDT "is an extension to VFDT which maintains VFDT's speed and accuracy advantages but adds the ability to detect and respond to changes in the example-generating process". It uses a sliding window of examples to try to keep its model up-to-date. For each new arriving instance, statistics are recomputed, reducing the influence of older instances. When the concept begins to change, alternative attributes increase their information gain, causing the Hoeffding test on the split to fail. An alternative tree begins to grow with the new best attribute at its root; if this subtree becomes more accurate than the old one on new data, it replaces it. VFDTc [13], on the other hand, extends VFDT with the ability to deal with numeric attributes and uses naive Bayes classifiers at the tree leaves. Proposals with decision rules were also made [9].
Another common approach to deal with a concept drift is to identify when it
occurs and create a new classifier. Therefore, only classifiers trained on a current
concept are maintained. Algorithms that follow this approach work in the following
way: each arriving training instance is first evaluated by the base classifier. Internal
statistics are updated with the results and two thresholds are computed: a warning
level and an error level. As the base classifier makes mistakes, the warning level
is reached and instances are stored. If the behavior continues, the error level will
be reached, indicating that a concept drift has occurred. At this moment, the base
classifier is destroyed and a new base classifier is created and initially trained on
the stored instances. On the other hand, if the classifier starts to correctly evaluate instances again, the situation is considered a false alarm and the stored instances are discarded.
Algorithms that follow this approach can work with any type of classifier as they
only analyze how the classifier evaluates instances.
One example of this approach is DDM (Drift Detection Method) [10]. It works by monitoring the algorithm's error rate. For each point i in the sequence of arriving instances, the error rate is the probability of misclassification p_i, with standard deviation given by s_i = sqrt(p_i (1 − p_i) / i). Statistical theory guarantees that, when the distribution changes, the error will increase. The values of p_i and s_i are stored when p_i + s_i reaches its minimum value during the process (yielding p_min and s_min). The warning level is reached when p_i + s_i ≥ p_min + 2 × s_min, and the error level is set at p_i + s_i ≥ p_min + 3 × s_min.
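A minimal sketch of this mechanism follows (illustrative Python, not the authors' implementation; the 30-instance warm-up before signalling is an assumption borrowed from typical DDM implementations):

```python
import math

class DDM:
    """Drift Detection Method sketch: track the error rate p and its
    standard deviation s = sqrt(p * (1 - p) / i), remember the minimum of
    p + s, and signal warning at p_min + 2*s_min and drift at
    p_min + 3*s_min."""

    def __init__(self):
        self.i = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, misclassified: bool) -> str:
        self.i += 1
        self.errors += int(misclassified)
        p = self.errors / self.i
        s = math.sqrt(p * (1.0 - p) / self.i)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if self.i < 30:
            return "in-control"   # assumed warm-up before any signal
        if p + s >= self.p_min + 3.0 * self.s_min:
            return "drift"
        if p + s >= self.p_min + 2.0 * self.s_min:
            return "warning"
        return "in-control"
```

On a stream whose error rate suddenly rises, the detector first crosses the warning level, starts buffering instances, and then crosses the drift level, at which point the base classifier would be rebuilt.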
Another similar method is the Early Drift Detection Method (EDDM) [1]. It works similarly to DDM but, instead of monitoring solely the amount of error of the classifier, it uses the distance between two consecutive errors to identify concept drifts. It computes the average distance between two errors, p_i, and the standard deviation of p_i, s_i. These values are stored when p_i + 2 × s_i reaches its maximum value (yielding p_max and s_max). Thus, the value of p_max + 2 × s_max corresponds to the point where the distribution of distances between errors is at its maximum. EDDM was shown to be more adequate for detecting gradual concept drifts, while DDM was better suited for abrupt concept drifts [1].
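The distance-based rule can be sketched as follows (an illustrative simplification: the drift threshold β = 0.9 follows the original paper, the 30-error minimum is an assumption, and the running variance uses Welford's algorithm):

```python
import math

class EDDM:
    """EDDM sketch: track the mean distance between consecutive errors and
    its standard deviation, remember the maximum of mean + 2*s, and signal
    drift when the current level falls below 90% of that maximum."""

    def __init__(self, beta=0.9, min_errors=30):
        self.beta = beta
        self.min_errors = min_errors
        self.n = 0                # instances seen
        self.num_errors = 0
        self.last_error_at = 0
        self.mean = 0.0           # running mean of error distances
        self.m2 = 0.0             # running sum of squared deviations
        self.max_level = 0.0      # maximum of mean + 2*s seen so far

    def update(self, misclassified: bool) -> bool:
        self.n += 1
        if not misclassified:
            return False
        distance = self.n - self.last_error_at
        self.last_error_at = self.n
        self.num_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.num_errors
        self.m2 += delta * (distance - self.mean)
        level = self.mean + 2.0 * math.sqrt(self.m2 / self.num_errors)
        self.max_level = max(self.max_level, level)
        return (self.num_errors >= self.min_errors
                and level < self.beta * self.max_level)
```

Shrinking distances between errors pull the current level down relative to its recorded maximum, which is why this detector responds well to gradual drifts.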
Exponentially weighted moving average (EWMA) charts [19] were originally proposed for detecting an increase in the mean of a sequence of random variables, assuming that the mean and standard deviation of the stream are known. Yeh et al. [25] proposed an EWMA change detector for a sequence of random variables that follow a Bernoulli distribution. ECDD (EWMA for Concept Drift Detection) [20] extends EWMA to monitor the misclassification rate of a streaming classifier, allowing the rate of false positive detections to be controlled and kept constant over time.
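The idea can be sketched for a Bernoulli error stream with known in-control rate p0 (illustrative constants: the smoothing factor lam and limit multiplier L are assumptions; ECDD itself uses dynamic, time-varying limits to keep the false positive rate constant):

```python
import math

class EwmaChart:
    """EWMA chart sketch: smooth the 0/1 error stream with
    Z_t = (1 - lam) * Z_{t-1} + lam * X_t and signal when Z_t exceeds a
    control limit placed L asymptotic standard deviations above p0."""

    def __init__(self, p0: float, lam: float = 0.2, L: float = 3.0):
        self.lam = lam
        self.z = p0
        sigma = math.sqrt(p0 * (1.0 - p0))
        # asymptotic standard deviation of the EWMA statistic
        self.limit = p0 + L * sigma * math.sqrt(lam / (2.0 - lam))

    def update(self, error: int) -> bool:
        self.z = (1.0 - self.lam) * self.z + self.lam * error
        return self.z > self.limit   # misclassification rate has increased
```

Because recent observations dominate the smoothed statistic, a sustained rise in the error rate is detected after only a few misclassifications.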
Several proposals try to deal with concept drifts through the use of ensemble classifiers. This approach maintains a collection of learners and combines their decisions to make an overall decision. To deal with concept drifts, ensemble classifiers must take into account the temporal nature of the data stream.
Learn++.NSE is a recent proposal of an ensemble classifier. The original algorithm [8] works as follows: a single classifier is created for each data set that becomes available. The algorithm first evaluates the classification accuracy of the current ensemble on the newly available data, obtained by the weighted majority voting of all classifiers in the ensemble. Its error is computed as the ratio of instances of the new data set misclassified by the ensemble, normalized to the interval [0,1]. Then, the weights of the instances are updated: the weights of the instances misclassified by the ensemble are reduced by a factor of the normalized error. The weights are then normalized, a new classifier is created, and all the classifiers generated so far are evaluated on the current data set by computing their weighted errors. If the error of the most recent classifier is greater than 0.5, it is discarded and a new one is created. For each of the other classifiers, if its error is greater than 0.5, its voting power is removed during the weighted majority voting.
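The voting-weight rule can be illustrated as follows (a simplified sketch: the full algorithm also time-averages each classifier's errors with a sigmoid weighting, which is omitted here):

```python
import math

def voting_weights(errors):
    """Learn++.NSE-style voting weights: a classifier with weighted error
    e <= 0.5 votes with weight log((1 - e) / e); one with e > 0.5 has its
    voting power removed (weight 0)."""
    weights = []
    for e in errors:
        if e > 0.5:
            weights.append(0.0)          # too inaccurate: silenced
        else:
            e = max(e, 1e-12)            # guard against log(0)
            weights.append(math.log((1.0 - e) / e))
    return weights
```

A classifier with 10% weighted error votes with weight log(9) ≈ 2.2, while one with 60% error contributes nothing to the ensemble decision.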
Another ensemble classifier proposal is DWAA (Dynamic Weight Assignment and Adjustment) [24]. It creates classifiers based on data chunks, using the next chunk to evaluate the classifier previously built. If the ensemble is not full, the classifier is added; otherwise, the classifier that performed worst on the last data chunk is replaced. To set the weights, it uses a formula that considers how many of the ensemble classifiers have actually made correct predictions. If more than half of the classifiers' predictions are correct, each one receives a normal reward; otherwise, each one receives a higher reward, giving them more influence on the global decision of the ensemble, as they are better suited to represent the concept.
3 Parallel RCD
RCD [14, 15] is a framework developed to deal with recurring concept drifts. It keeps a collection of pairs of classifiers and the samples used to train them, as presented in Fig. 1. In the training phase, a concept drift detector is used. If it identifies a concept drift, a multivariate non-parametric statistical test is performed to compare the current data to the stored samples. If the statistical test indicates that both come from the same distribution, the classifier associated with the stored sample is reused, meaning that this classifier is adequate to deal with the current data.
On the other hand, if the test indicates that the samples are not similar, the next stored data sample is tested, and so on. If no stored classifier is apt for the current data, a new classifier is created and stored in the set. If the set is full, the oldest classifier is replaced. In the testing phase, statistical tests are performed every t instances (a user-parameterized value) to select, from the stored classifiers, the best one for the current data. Thus, RCD dynamically adapts to the current data distribution even in the testing phase.
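The training-phase logic just described can be sketched as follows (illustrative names, not the framework's actual API; `similarity` stands in for the multivariate statistical test, returning a p-value compared against the 0.05 significance level):

```python
def on_drift(current_sample, stored, train_new, similarity,
             alpha=0.05, max_size=15):
    """On a detected drift, reuse the first stored classifier whose
    training buffer is statistically similar to the current sample;
    otherwise train a new classifier and store it, evicting the oldest
    pair when the set is full."""
    for classifier, buffer in stored:
        if similarity(current_sample, buffer) >= alpha:
            return classifier                  # recurring concept: reuse
    classifier = train_new(current_sample)
    if len(stored) >= max_size:
        stored.pop(0)                          # replace the oldest pair
    stored.append((classifier, list(current_sample)))
    return classifier
```

Storing the training sample alongside each classifier is what makes the later similarity tests possible: the buffer is a proxy for the concept the classifier represents.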
Originally, RCD performed the statistical tests sequentially. Thus, a statistical test would first be performed comparing the current data to the data stored in the buffer of classifier 1, to verify whether both represented the same data distribution. If positive, this classifier
Fig. 1 RCD classifiers set (a collection of classifier/buffer pairs: Classifier 1/Buffer 1 through Classifier n/Buffer n)
was considered the new current classifier; otherwise, a statistical test would be performed on the buffer data of classifier 2, and so on.
The improvement proposed here is to perform several tests simultaneously through a thread pool of configurable fixed size, allowing the user to fine-tune its value based on the hardware in use. Fig. 2 presents an example illustrating how the thread pool works, considering a thread pool with two active cores and a classifiers set of size six.
When a concept drift occurs, it means the current classifier does not correctly represent the current context, so it is necessary to check whether any stored classifier represents it better. The remaining five classifiers stored in the set must be tested, comparing a sample of the current data to the data stored in the buffer associated with each classifier, which represents the data that classifier was trained on. Five threads are built to perform the statistical tests and are sent to the thread pool using a FIFO scheme, which associates each test with a position in the pool; only the first two are active, i.e., actually performing a statistical test. In Fig. 2 they are represented by bolder lines, and inactive threads by thinner lines. At this point (t = 0), two statistical tests are active and the remaining three are waiting to execute.
When the first statistical test finishes (say, statistical test 1), if the result indicates that the current data and the sample data from classifier 1 do not represent the same data distribution, the next inactive statistical test (in this case, statistical test 3) executes in the corresponding slot (t = 1). At t = 2, the same occurs: classifier 2, represented by statistical test 2, also does not better represent the current data, and the next statistical test (number 4) takes its place.
Now, let us consider that statistical test 3 has finished and identified that the current data and the data stored in the buffer of classifier 3 represent the same distribution.
Fig. 2 Example of a thread pool execution (two active slots and an inactive queue; statistical tests 1 to 5 are assigned in FIFO order, with snapshots at t = 0, t = 1, and t = 2)
In this situation, this classifier substitutes the current classifier, all other active statistical tests are stopped, and the inactive ones are cancelled.
This scheme is interesting because, if a test is negative, the next test is already executing, speeding up the algorithm; and, if a test is positive, all other executing tests are stopped and the tests yet to be executed never enter the active thread pool.
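A minimal sketch of this early-exit scheme using Python's standard thread pool (illustrative only; the actual RCD implementation is a Java/MOA extension, and note that with this API queued tests can be cancelled but already-running ones simply run to completion):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def find_recurring(candidates, run_test, pool_size=2):
    """Submit one statistical test per candidate classifier to a
    fixed-size pool in FIFO order; as soon as a test reports a match,
    cancel the tests still queued and return the matching candidate
    (None if no test matches)."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = {pool.submit(run_test, c): c for c in candidates}
        for future in as_completed(futures):
            if future.result():            # positive test: stop early
                for other in futures:
                    other.cancel()         # queued tests never start
                return futures[future]
    return None
```

Because all tests are submitted up front, a negative result never leaves a slot idle: the next queued test starts immediately, which is exactly the benefit described above.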
Notice that this scheme is general and allows the execution of any statistical test in parallel. Source code and instructions on how to use RCD are available as a MOA extension at http://sites.google.com/site/moaextensions/.
4 Experiments Configuration
We used several data sets to perform the experiments: Hyperplane [16], LED [11],
SEA [21], Forest Covertype [4], Poker Hand [3], and Electricity [10, 12]. The first
three are artificial data sets: the first one presents gradual concept drifts while the
following two present abrupt concept drifts. The last three are real-world data sets.
These data sets and their configurations are the same as used by Bifet et al. [3]. Hyperplane was tested on ten million instances, while LED and SEA were tested on one million. All tests in the artificial data sets were repeated ten times and a 95% confidence interval was computed. The parameters of these streams are the following:
• HYP(x,v) represents a Hyperplane data stream with x attributes changing at speed v;
• LED(v) appends four concepts (1, 3, 5, 7), each one representing a different number of drifting attributes, with length of change v;
• SEA(v) uses the same four concepts, in the same order, as defined in the original paper [21], with length of change v.
The RCD configuration used in the experiments includes naive Bayes as the base learner, a classifiers collection size of 15, KNN as the statistical test (with k = 3), and the minimum amount of similarity between data samples set to 0.05. Two buffer sizes, two test frequencies (only in the testing phase), and three thread pool sizes have been used.
The evaluation methodology used was Interleaved Chunks, also known as the data block evaluation method [5], over ten runs. It initially reads a block of d instances. When the block is complete, the instances are first used to test the existing classifier, and then the classifier is trained on them. This methodology was used because it is better suited to computing training and testing times. In the following experiments, d was set to 100,000 instances in the Hyperplane, Covertype, and Poker Hand data sets, and to 10,000 instances in the LED, SEA, and Electricity data sets.
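The evaluation loop can be sketched as follows (a minimal illustration with a hypothetical model interface exposing `predict` and `train`; a trailing partial block is simply ignored):

```python
def interleaved_chunks(stream, model, block_size):
    """Data block evaluation: read d instances, test the current
    classifier on the block, then train the classifier on that same
    block. Returns the per-block accuracies."""
    accuracies = []
    block = []
    for x, y in stream:
        block.append((x, y))
        if len(block) == block_size:
            hits = sum(1 for xi, yi in block if model.predict(xi) == yi)
            accuracies.append(hits / block_size)   # test first ...
            for xi, yi in block:
                model.train(xi, yi)                # ... then train
            block.clear()
    return accuracies
```

Testing before training guarantees that every instance is evaluated by a classifier that has never seen it, which keeps the accuracy estimate honest while still using all data for learning.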
All the experiments were performed using the Massive Online Analysis (MOA) framework [2] on a Core i3 330M processor with 4 GB of main memory running Windows 7 Professional. This processor has four cores, two physical and two virtual, each running at 2.13 GHz.
We used a modified version of the Interleaved Chunks evaluator presented in the MOA framework, because the version available in the tool computes only the processor time used by the thread executing the classifier. With that solution, it is possible to execute other applications at the same time without affecting the results. However, because we use a thread pool to perform the statistical tests, their times would not be counted, as the original thread may not be active on the processor. Here, we measure the real (wall-clock) time taken by RCD, which makes it impossible to run other applications at the same time.
5 Results
Table 1 presents the average number of detected concept drifts, classifiers set size, and number of reused classifiers, as well as the evaluation, training, and testing times (in seconds) for RCD over the ten runs, using a buffer size of 100 instances and a test frequency of 500, considering thread pools with one, two, and four active cores.
It is worth pointing out that the results were quite similar in the artificial data sets, regardless of the thread pool size. This behavior did not occur in the real-world data sets and is probably related to the number of detected concept drifts. In the artificial data sets, the average number of detected concept drifts is considerably low, as can be seen in the first column (CD), and a small number of concept drifts requires few statistical tests to be performed.
However, the number of detected concept drifts is not the only influence on performance; reusing classifiers also matters. If the first tests identify similarity between distributions, several other tests will not be executed, reducing the benefits of using a thread pool. On the other hand, if only the last tests, or none at all, identify similarity, more tests need to be executed. This is the situation expected to benefit the most from the parallelization of the statistical tests.
For example, the average classifiers collection size (CS) in the artificial data sets is below three, not even close to filling the set (15 classifiers). Having few stored classifiers indicates that few statistical tests need to be performed. The difference between the number of detected concept drifts and the number of stored classifiers is due to the reuse of classifiers. Analyzing the column with the number of reused classifiers (RC), we can see that the values are very close to the ones presented in the first column. This means that in the majority of the concept drifts a classifier was reused.
In this RCD configuration, analyzing the artificial data sets, using one core had statistically better results than using two cores in both configurations of Hyperplane and in LED, but worse results in both versions of SEA. Using one core performed statistically better than using four cores in the HYP(10, 0.0001) and LED data sets, but worse in SEA, similarly to using two cores; in the HYP(10, 0.001) data set, both had statistically similar results. Using four cores had statistically better results than using two cores in the Hyperplane and SEA data sets, and similar ones in LED.
Table 1 Results for a buffer with 100 instances and test frequency of 500 instances (in seconds)
Data sets CD CS RC
1 core 2 cores 4 cores
eval train test eval train test eval train test
HYP(10,0.001) 7.7 2.6 6.0 99.64 44.45 36.30 101.49 45.40 39.98 100.13 46.13 38.20
HYP(10,0.0001) 8.2 2.4 6.7 98.99 44.06 36.15 101.49 45.35 40.29 100.10 46.18 38.18
SEA(50) 0.1 1.1 0.0 4.70 2.11 1.09 4.32 1.53 1.31 4.30 1.56 1.27
SEA(50000) 0.4 1.1 0.3 4.63 1.97 1.25 4.26 1.54 1.31 4.23 1.56 1.27
LED(50000) 0.3 1.2 0.1 32.56 14.23 12.78 33.54 14.72 13.25 33.54 14.70 13.24
Covertype 2980.0 15.0 844.0 98.12 80.68 8.30 84.24 66.99 8.25 73.99 56.77 8.32
Poker Hand 1871.0 15.0 109.0 45.72 37.86 3.78 38.31 30.45 3.81 31.67 23.82 3.82
Electricity 212.0 15.0 30.0 2.89 2.48 0.09 2.40 1.98 0.09 2.00 1.58 0.09
Table 2 Thread pool management for a buffer with 100 instances and test frequency of 500
instances (in milliseconds)
Creation time Execution time Destruction time
1 core 2 cores 4 cores 1 core 2 cores 4 cores 1 core 2 cores 4 cores
HYP(10,0.001) 2.49 1.03 1.03 1.87 2.22 2.63 0.00 0.00 0.00
HYP(10,0.0001) 2.49 1.16 0.40 1.73 2.90 3.68 0.00 0.00 0.00
SEA(50) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SEA(50000) 4.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
LED(50000) 0.00 5.33 0.00 5.00 0.00 10.33 0.00 0.00 0.00
Covertype 2.12 2.25 2.97 25.76 19.84 15.04 0.03 0.02 0.04
Poker Hand 2.01 2.21 2.96 14.81 10.89 6.10 0.01 0.01 0.06
Electricity 2.16 1.95 2.92 8.82 6.58 3.42 0.00 0.00 0.00
Table 3 Results for a buffer and test frequency of 100 instances (in seconds)
Data sets CD CS RC
1 core 2 cores 4 cores
eval train test eval train test eval train test
HYP(10,0.001) 8.4 2.8 6.5 422.74 45.61 360.38 487.43 46.86 424.62 499.94 46.87 437.12
HYP(10,0.0001) 10.2 2.3 8.2 421.24 45.37 359.88 459.86 46.66 397.05 454.82 46.63 392.47
SEA(50) 0.1 1.1 0.0 6.18 1.51 3.15 6.25 1.55 3.19 6.22 1.54 3.16
SEA(50000) 0.2 1.1 0.1 6.27 1.53 3.23 6.35 1.56 3.29 6.34 1.58 3.25
LED(50000) 2.3 1.2 2.1 37.22 14.25 17.38 38.47 14.89 17.98 38.51 14.90 18.06
Covertype 3376.0 15.0 944.0 534.32 86.47 438.71 354.76 72.31 273.38 315.25 60.87 245.36
Poker Hand 1871.0 15.0 109.0 305.93 37.83 264.08 229.43 30.78 194.59 166.22 23.65 138.53
Electricity 212.0 15.0 30.0 12.40 2.43 9.62 9.50 2.07 7.08 6.80 1.50 4.96
Table 4 Thread pool management for a buffer and test frequency of 100 instances (in
milliseconds)
Creation time Execution time Destruction time
1 core 2 cores 4 cores 1 core 2 cores 4 cores 1 core 2 cores 4 cores
HYP(10,0.001) 0.39 0.49 0.51 2.85 3.36 3.47 0.01 0.01 0.01
HYP(10,0.0001) 0.32 0.37 0.36 2.91 3.21 3.17 0.01 0.01 0.01
SEA(50) 0.03 0.03 0.02 0.15 0.15 0.15 0.00 0.00 0.00
SEA(50000) 0.02 0.03 0.03 0.16 0.16 0.16 0.00 0.00 0.00
LED(50000) 0.03 0.03 0.04 0.40 0.41 0.41 0.00 0.00 0.00
Covertype 2.05 2.35 3.14 73.84 46.14 39.73 0.01 0.02 0.07
Poker Hand 2.14 2.39 3.08 30.97 21.95 14.00 0.01 0.02 0.08
Electricity 2.08 2.75 2.85 21.79 15.07 9.69 0.08 0.00 0.03
On the other hand, in the real-world data sets, more cores resulted in lower evaluation times: two cores were faster than one, and four cores were faster than the other two thread pool sizes. In the artificial data sets, using one core was, on average, 1.90% faster than using two cores, but 14.84% slower in the real-world data sets. Comparing one and four cores, similar results apply: one core was faster by 0.74% in the artificial data sets, while four cores were faster by 26.63% in the real-world ones. Using four cores had practically the same performance as using two cores in the artificial data sets, being 1.14% faster, but was 13.85% faster in the real-world data sets. The real-world data sets presented a huge number of concept drifts, the classifiers set became full, and the number of reused classifiers was also much larger than in the artificial data sets.
To better analyze how the thread pool influences performance, we computed the average amount of time (in milliseconds) needed to create the thread pool and assign the statistical tests to their respective slots, to execute the thread pool, and to finalize it. In the assignment stage, if a statistical test is assigned to an active slot, it starts executing immediately, while the other tests are still being assigned; thus it is not necessary to wait for all tests to be assigned a slot before execution starts, saving time. This information is presented in Table 2.
Observing the real-world data sets, it is possible to notice that, in general, the greater the number of cores, the longer the time spent creating the thread pool, although the differences are usually very small. For the execution time, using more cores meant faster execution, with no exception. The destruction times were usually negligible, taking less than 0.1 milliseconds.
The results in Table 3 are similar to those presented in Table 1, but the test frequency in these experiments was increased to every 100 instances. Again, parallelism outperforms the sequential solution in the real-world data sets, the ones with more detected concept drifts.
In these tests, we can also notice that the evaluation time is mostly spent in the testing phase, differently from the results of Table 1. As the tests are more frequent, the evaluation time and the time spent in the testing phase increased considerably. Running the tests more frequently also changed the number of detected concept drifts in 50% of the data sets; in SEA(50000), it dropped from 0.4 to 0.2, while in Poker Hand, Electricity, and SEA(50), the number of detected concept drifts stayed the same.
Using one core was faster than using two cores by an average of 11.72% in the artificial data sets, but was 30.37% slower in the real-world data sets. Comparing one core to four cores, similar results apply: one core was faster by 12.55% in the artificial data sets, and four cores were faster by 42.74% in the real-world data sets. Using two cores offered better average results than using four cores by 0.75% in the artificial data sets, but worse performance by 17.76% in the real-world ones.
Table 4 presents information similar to Table 2, with similar results: the creation times are usually slightly shorter using fewer cores, the execution time is considerably smaller when using more cores, and the destruction times are usually less than 0.1 milliseconds.
Instead of increasing the test frequency, Table 5 presents information similar to Tables 1 and 3, but with the buffer size increased to 500 instances. Here, the tests took
Table 5 Results for a buffer and test frequency of 500 instances (in seconds)
Data sets CD CS RC
1 core 2 cores 4 cores
eval train test eval train test eval train test
HYP(10,0.001) 11.9 2.6 9.9 1339.33 46.07 1275.16 1453.37 46.65 1388.60 1508.29 48.98 1441.27
HYP(10,0.0001) 9.5 2.4 7.5 1356.57 45.77 1293.32 1404.87 46.01 1341.41 1413.77 47.41 1348.88
SEA(50) 1.1 1.1 1.0 11.17 1.59 7.30 11.17 1.57 7.31 11.25 1.62 7.30
SEA(50000) 0.2 1.1 0.1 11.39 1.56 7.57 11.40 1.58 7.55 11.49 1.60 7.60
LED(50000) 0.2 1.2 0.0 57.99 14.28 37.33 53.98 14.25 33.33 55.14 14.92 33.85
Covertype 3063.0 15.0 798.0 1781.17 358.50 1413.39 1107.10 244.95 853.07 1167.37 268.35 889.92
Poker Hand 410.0 15.0 18.0 588.42 47.24 537.02 359.69 34.10 321.47 377.63 36.19 336.09
Electricity 183.0 15.0 20.0 33.38 8.85 24.15 20.87 6.02 14.46 22.92 7.19 15.26
Table 6 Thread pool management for a buffer and test frequency of 500 instances (in mil-
liseconds)
Creation time Execution time Destruction time
1 core 2 cores 4 cores 1 core 2 cores 4 cores 1 core 2 cores 4 cores
HYP(10,0.001) 0.54 0.60 0.60 61.90 67.54 70.14 0.02 0.02 0.02
HYP(10,0.0001) 0.50 0.55 0.51 62.94 65.31 65.66 0.02 0.01 0.02
SEA(50) 0.04 0.03 0.04 2.99 2.99 2.98 0.00 0.00 0.00
SEA(50000) 0.04 0.05 0.04 3.10 3.09 3.09 0.00 0.00 0.01
LED(50000) 0.06 0.04 0.05 12.32 10.27 10.26 0.00 0.00 0.00
Covertype 2.01 2.34 13.66 560.81 342.49 354.36 0.02 0.01 0.06
Poker Hand 2.27 2.65 6.36 314.13 186.57 187.92 0.01 0.06 0.03
Electricity 2.04 2.28 5.23 140.45 91.20 91.78 0.00 0.06 0.06
longer to complete: on average, 62 milliseconds, compared to 3 milliseconds when using a buffer with 100 instances. Nevertheless, the results were similar to the ones presented in Table 3. Using parallelism was much faster in the real-world data sets and slightly slower in the artificial ones. Using one core was 5.70% faster than using two cores in the artificial data sets, but 38.09% slower in the real-world ones.
However, it was interesting to observe that the evaluation time was lower using two cores than using four. Comparing one and four cores, similar results apply: one core was 8.05% faster in the artificial data sets and 34.75% slower in the real-world ones. Using two cores was slightly better than using four: by 2.22% in the artificial and 5.39% in the real-world data sets. This probably occurs because there are many more statistical tests to perform and they take longer to complete than in the other configurations, putting a higher load on the whole system and negatively affecting performance.
Comparing Tables 1, 3, and 5, it is possible to notice that the increase in the buffer size had a higher influence on the evaluation time than the increase in the test frequency. Increasing the test frequency five-fold increased the evaluation time between 4.26 and 4.50 times, while increasing the buffer size five-fold increased the evaluation time between 11.95 and 13.37 times. The training time practically did not change across the three configurations; the increase in the evaluation time was due to the testing time.
Table 6 presents the times taken for the thread pool management, as previously described for Tables 2 and 4. In the artificial data sets, the creation times are very close for the three numbers of active cores used. In the real-world data sets, one and two cores take very similar times, while four cores take more time than the other two. This probably occurs because, during creation, the statistical tests associated with active cores begin executing while other tests are still being assigned; thus, the creation time tends to be longer when there are more active cores and the tests take longer to complete. We can see this by comparing the three tables concerning thread pool management: in Tables 2 and 4, the differences in creation time between one, two, and four cores are almost negligible, and the average time taken to perform a statistical test is three milliseconds; in Table 6, using four cores takes more time than using one or two cores, and the average time taken to perform a statistical test is 62 milliseconds. The execution times are very similar in the artificial data sets. In the real-world data sets, using two or four cores is considerably faster than using one active core. Using two cores was faster than using four in the Covertype data set, while in the other two real-world data sets the performances were quite similar.
6 Conclusion
This paper studied the influence of executing statistical tests in parallel in the RCD framework using six data sets (eight configurations): with and without concept drifts, with abrupt and gradual concept drifts, and considering artificial and real-world data sets. Tests were performed with sequential execution and with parallel execution of two and four simultaneous statistical tests.
Analysis of the experimental results led to the conclusion that executing statistical
tests in parallel was most beneficial when there was a high number of detected
concept drifts, leading to more statistical tests being performed. Tests were also
performed to analyze performance under the following conditions:
1. the buffer size was increased, making the statistical tests take longer to complete;
and
2. the test frequency was increased, making more statistical tests be performed.
In the data sets with a small number of detected concept drifts (the artificial ones),
the performances were quite similar, but sequential execution yielded evaluation
times that were lower by 0.74% to 12.55%. On the other hand, in the data sets with
a high number of detected concept drifts (the real-world ones), using parallelism
improved performance by 13.85% to 42.74%.
Multiplying the test frequency by five increased the evaluation time more than
four times, while multiplying the buffer size by five increased the evaluation time
more than 11 times, indicating that the buffer size has a higher impact on
performance than the test frequency.
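One way to interpret these numbers is through a simple cost model (an assumption of this note, not stated in the experiments): if the stream has $n$ examples, tests are performed every $f$ examples, and each test costs $c(b)$ on a buffer of $b$ examples, then

```latex
T_{\text{eval}} \approx \frac{n}{f}\, c(b)
```

Multiplying the test frequency by five divides $f$ by five and thus multiplies the number of tests $n/f$ by five while leaving $c(b)$ unchanged, so the observed 4.26 to 4.50 times growth is close to this linear prediction. Multiplying $b$ by five produced an 11.95 to 13.37 times growth, suggesting that $c(b)$ grows faster than linearly in $b$, as one would expect of multivariate statistical tests over larger samples.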
The analysis of the thread pool creation, execution, and destruction times was
also performed, showing that, as expected, the major improvement occurs in the
execution phase. The creation times using different numbers of cores are close to
one another, and the destruction times are commonly negligible, taking less than
0.1 milliseconds.
6.1 Future Work
Some other experiments could be performed to better understand how the execution
of parallel statistical tests can improve the performance of the RCD framework.
One such experiment is testing the influence of the number of cores available in
the processor on performance. Another possible experiment is to analyze the
influence of other buffer sizes and test frequencies.
References
1. Baena-García, M., Del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., Morales-
Bueno, R.: Early drift detection method. In: International Workshop on Knowledge Dis-
covery from Data Streams, IWKDDS 2006, pp. 77–86 (2006),
http://eprints.pascal-network.org/archive/00002509/
2. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: Massive online analysis. J. of
Mach. Learn. Res. 11, 1601–1604 (2010),
http://portal.acm.org/citation.cfm?id=1859890.1859903
3. Bifet, A., Holmes, G., Pfahringer, B., Frank, E.: Fast perceptron decision tree learn-
ing from evolving data streams. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.)
PAKDD 2010, Part II. LNCS (LNAI), vol. 6119, pp. 299–310. Springer, Heidelberg
(2010), http://dx.doi.org/10.1007/978-3-642-13672-6_30
4. Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and dis-
criminant analysis in predicting forest cover types from cartographic variables. Comput.
and Electron. in Agric. 24(3), 131–151 (1999),
http://dx.doi.org/10.1016/S0168-1699(99)00046-0
5. Brzeziński, D., Stefanowski, J.: Accuracy updated ensemble for data streams with con-
cept drift. In: Corchado, E., Kurzyński, M., Woźniak, M. (eds.) HAIS 2011, Part II.
LNCS, vol. 6679, pp. 155–163. Springer, Heidelberg (2011),
http://dx.doi.org/10.1007/978-3-642-21222-2_19
6. Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L.: A case-based technique
for tracking concept drift in spam filtering. Knowl.-Based Syst. 18(4-5), 187–195
(2005), http://dx.doi.org/10.1016/j.knosys.2004.10.002; AI-2004,
Cambridge, England, December 13-15 (2004)
7. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the Sixth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD 2000, New York, NY, USA, pp. 71–80 (2000),
http://dx.doi.org/10.1145/347090.347107
8. Elwell, R., Polikar, R.: Incremental learning of concept drift in nonstationary environ-
ments. IEEE Trans. on Neural Netw. 22(10), 1517–1531 (2011),
http://dx.doi.org/10.1109/TNN.2011.2160459
9. Ferrer-Troyano, F., Aguilar-Ruiz, J.S., Riquelme, J.C.: Discovering decision rules from
numerical data streams. In: Proceedings of the 2004 ACM Symposium on Applied Com-
puting, SAC 2004, New York, NY, USA, pp. 649–653 (2004),
http://dx.doi.org/10.1145/967900.968036
10. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Bazzan,
A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 286–295. Springer,
Heidelberg (2004),
http://dx.doi.org/10.1007/978-3-540-28645-5_29
11. Gama, J., Medas, P., Rocha, R.: Forest trees for on-line data. In: Proceedings of the 2004
ACM Symposium on Applied Computing, SAC 2004, New York, NY, USA, pp. 632–
636 (2004),
http://dx.doi.org/10.1145/967900.968033
12. Gama, J., Medas, P., Rodrigues, P.: Learning decision trees from dynamic data streams.
In: Proceedings of the 2005 ACM Symposium on Applied Computing, SAC 2005, New
York, NY, USA, pp. 573–577 (2005),
http://dx.doi.org/10.1145/1066677.1066809
13. Gama, J., Rocha, R., Medas, P.: Accurate decision trees for mining high-speed data
streams. In: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2003, New York, NY, USA, pp. 523–528
(2003), http://dx.doi.org/10.1145/956750.956813
14. Gonçalves Jr., P.M., Barros, R.S.M.: A comparison on how statistical tests deal with
concept drifts. In: Arabnia, H.R., et al. (eds.) Proceedings of the 2012 International Con-
ference on Artificial Intelligence, ICAI 2012, vol. 2, pp. 832–838. CSREA Press, Las
Vegas (2012)
15. Gonçalves Jr., P.M., Barros, R.S.M.: RCD: A recurring concept drift framework. Pattern
Recognit. Lett. (to appear, 2013),
http://dx.doi.org/10.1016/j.patrec.2013.02.005
16. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Pro-
ceedings of the Seventh ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, KDD 2001, New York, NY, USA, pp. 97–106 (2001),
http://dx.doi.org/10.1145/502512.502529
17. Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: An ensemble method for drift-
ing concepts. J. of Mach. Learn. Res. 8, 2755–2790 (2007),
http://dl.acm.org/citation.cfm?id=1314498.1390333
18. Lane, T., Brodley, C.E.: Approaches to online learning and concept drift for user identifi-
cation in computer security. In: Agrawal, R., Stolorz, P. (eds.) Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining, KDD 1998, pp.
259–263. AAAI Press, Menlo Park (1998),
http://www.aaai.org/Papers/KDD/1998/KDD98-045.pdf
19. Roberts, S.W.: Control chart tests based on geometric moving averages. Technomet-
rics 1(3), 239–250 (1959), http://www.jstor.org/stable/1266443
20. Ross, G.J., Adams, N.M., Tasoulis, D.K., Hand, D.J.: Exponentially weighted moving
average charts for detecting concept drift. Pattern Recognit. Lett. 33(2), 191–198 (2012),
http://dx.doi.org/10.1016/j.patrec.2011.08.019
21. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classifica-
tion. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD 2001, New York, NY, USA, pp. 377–382 (2001),
http://dx.doi.org/10.1145/502512.502568
22. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensem-
ble classifiers. In: Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2003, New York, NY, USA, pp. 226–235
(2003), http://dx.doi.org/10.1145/956750.956778
23. Wang, S., Schlobach, S., Klein, M.: Concept drift and how to identify it. Web Semant.:
Sci., Serv. and Agents on the World Wide Web 9(3), 247–265 (2011),
http://dx.doi.org/10.1016/j.websem.2011.05.003
24. Wu, D., Wang, K., He, T., Ren, J.: A dynamic weighted ensemble to cope with concept
drifting classification. In: The 9th International Conference for Young Computer Scien-
tists, ICYCS 2008, pp. 1854–1859 (2008),
http://dx.doi.org/10.1109/ICYCS.2008.491
25. Yeh, A.B., McGrath, R.N., Sembower, M.A., Shen, Q.: EWMA control charts for monitor-
ing high-yield processes based on non-transformed observations. International Journal
of Production Research 46(20), 5679–5699 (2008),
http://dx.doi.org/10.1080/00207540601182252