Concept Drifting Detection on Noisy Streaming
Data in Random Ensemble Decision Trees
Peipei Li1,2, Xuegang Hu1, Qianhui Liang2, and Yunjun Gao2,3
1School of Computer Science and Information Technology, Hefei University of
Technology, China, 230009
2School of Information Systems, Singapore Management University, Singapore,
178902
3College of Computer Science, Zhejiang University, China, 310027
Abstract. Although a large number of inductive learning algorithms have been
developed for concept-drifting data streams, especially those built on ensemble
classification models, few of them can detect the different types of concept
drift in noisy streaming data with low time and space overheads. Motivated by
this, a new classification algorithm for Concept drifting Detection based on an
ensemble model of Random Decision Trees (called CDRDT) is proposed in this
paper. Extensive experiments with synthetic and real streaming data demonstrate
that, in comparison with several representative classification algorithms for
concept-drifting data streams, CDRDT not only detects potential concept changes
in noisy data streams effectively and efficiently, but also requires much less
runtime and space while improving predictive accuracy. Our algorithm thus
provides a light-weight reference solution for classifying concept-drifting
data streams with noise.
Keywords: Data Streams, Ensemble Decision Trees, Concept Drift,
Noise.
P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 236–250, 2009.
© Springer-Verlag Berlin Heidelberg 2009
1 Introduction
As defined in [23], a data stream is an ordered sequence of tuples arriving at
certain time intervals. Compared with traditional data sources, it presents new
characteristics such as being open-ended, continuous and high-volume. It is
hence a challenge for most traditional inductive models or classification
algorithms [18,19,9] to learn from such streaming data, especially when faced
with concept drift and noise contamination in real applications such as web
search, online shopping and stock markets. To handle these problems, numerous
classification models and algorithms have been proposed. Representative ones
are based on ensemble learning, including SEA [1], an early ensemble algorithm
addressing concept drift in data streams, a general framework for mining
concept-drifting data streams using weighted ensemble classifiers [2], a
discriminative model based on the EM framework for fast mining of noisy data
streams [4], decision tree algorithms for concept drifting data streams with
noise [5,11], and a boosting-like method for adaptation to different kinds of
concept drifts [6]. However, these algorithms share two main limitations: on
one hand, little attention is paid to handling the various types of concept
drift in data streams affected by noise; on the other hand, they often demand
heavy space and runtime overheads without a marked improvement in predictive
accuracy.
Therefore, to address the aforementioned issues, we present CDRDT, a
light-weight ensemble classification algorithm for Concept Drifting data
streams with noise. It is based on random decision trees evolved from the
semi-random decision trees in [14]: it adopts a random-selection strategy,
instead of a heuristic method, to solve the split-test for nodes with numerical
attributes. In comparison with other ensemble models of random decision trees
for concept-drifting data streams, CDRDT makes four significant contributions:
i) the basic classifiers are constructed incrementally from small,
variable-sized chunks of streaming data; ii) the Hoeffding bound inequality [7]
is used to specify two thresholds for detecting concept drift under noise,
which helps to distinguish the different types of concept drift from noise;
iii) the sizes of the data chunks are adjusted dynamically within bounded
limits to adapt to concept drift, which avoids the disadvantages of overly
large or overly small data chunks when detecting changes in the data
distribution, especially when classifying with the majority-class method; iv)
the effectiveness and efficiency of CDRDT in detecting concept drift in noisy
data streams are evaluated against other algorithms, including the
state-of-the-art algorithm CVFDT [10] and the recent ensemble algorithm MSRT
(Multiple Semi-Random decision Trees) [11] based on semi-random decision trees.
The experimental results show that CDRDT imposes light time and space overheads
while achieving higher predictive accuracy.
The rest of the paper is organized as follows. Section 2 reviews related work
on ensemble classifiers of random decision trees for learning from
concept-drifting data streams. Our CDRDT algorithm for concept drifting
detection in noisy data streams is described in detail in Section 3. Section 4
provides the experimental evaluation and Section 5 concludes the paper.
2 Related Work
Since the model of Random Decision Forests [12] was first proposed by Ho in
1995, the random-selection strategy for split features has been widely applied
in decision tree models, and many new or improved random decision trees have
appeared, such as [24, 25, 17]. However, they are not suitable for handling
data streams directly. Subsequently, a random decision tree ensembling method
[3] for streaming data was proposed by Fan in 2004; it adopts cross-validation
estimation for higher classification accuracy. Hu et al. designed the
incremental algorithm Semi-Random Multiple decision Trees for Data Streams
(SRMTDS) [14] in 2007, which uses the Hoeffding bound inequality with a
heuristic method to implement the split-test. In the following year, the
extended algorithm MSRT [11] was introduced by the same authors to reduce the
impact of noise on concept-drifting detection. In the same year, H. Abdulsalam
et al. proposed the stream-classification algorithm Dynamic Streaming Random
Forests [13], which handles evolving data streams whose underlying class
boundaries drift, using an entropy-based drift-detection technique.
In contrast with the decision-tree ensembling algorithms mentioned above, our
CDRDT classification algorithm for concept-drifting data streams has four
prominent characteristics. Firstly, the ensemble of random decision trees,
developed from semi-random decision trees, is generated incrementally from
variable-sized chunks of streaming data. Secondly, to avoid over-sensitivity to
concept drift and to reduce noise contamination, two thresholds are specified
in the Hoeffding bound inequality to partition the drift bounds. Thirdly, the
check period is adjusted dynamically to adapt to concept drift. Lastly, it
achieves better performance in space, time and predictive accuracy.
3 Concept Drifting Detection Algorithm Based on
Random Ensemble Decision Trees
3.1 Algorithm Description
The CDRDT classification algorithm proposed in this section detects concept
drift in data streams with noise. It first generates multiple random decision
tree classifiers incrementally from variable-sized chunks of streaming data.
After all the streaming data in a chunk have been seen (i.e., the check period
is reached), concept drifting detection is performed on the ensemble model.
Using thresholds pre-defined via the Hoeffding bound inequality, the difference
of the average error rates at the leaves, classified by the Naïve Bayes or
majority-class method, is used to measure changes in the distribution of the
streaming data, and the different types of concept drift are then distinguished
from noise. Once a concept drift is detected, the check period is adjusted
accordingly to adapt to the drift. Finally, majority-class voting or Naïve
Bayes is used to classify the test instances. Generally, the process flow of
CDRDT can be partitioned into three major components: i) the incremental
generation of random decision trees in the function GenerateClassifier; ii) the
concept drifting detection method in ComputeClassDistribution; iii) the
adaptation strategies for concept drift and noise in CheckConceptChange. The
related details are illustrated below.
Ensemble Classifiers of Random Decision Trees
Input: Training set: DSTR; Test set: DSTE; Attribute set: A; Initial height of
tree: h0; Minimum number of split-examples: nmin; Split estimator function:
H(·); Number of trees: N; Set of classifiers: CT; Memory constraint: MC; Check
period: CP.
Output: The error rate of classification.
Procedure CDRDT{DSTR, DSTE, A, h0, nmin, H(·), N, CT, MC, CP}
1. For each chunk Sj ⊆ DSTR of the training data streams (|CP| = |Sj|, j ≥ 1)
2.   For each classifier CTk (1 ≤ k ≤ N)
3.     GenerateClassifier(CTk, Sj, MC, CP);
4.   If all streaming data in Sj have been observed
5.     averageError = ComputeClassDistribution();
6.     If the current chunk is the first one
7.       fError = averageError;
8.     Else
9.       sError = averageError;
10.      If (j ≥ 2)
11.        CheckConceptChange(fError, sError, CP, Sj);
12.      fError = sError;
13. For each test instance in DSTE
14.   For each classifier CTk
15.     Travel the tree CTk from its root to a leaf;
16.     Classify with the method of majority class or Naïve Bayes in CTk;
17. Return the error rate of voting classification.

Unlike the previous algorithms in [11, 14], CDRDT on the one hand utilizes
streaming data chunks of various sizes to generate the ensemble of random
decision tree classifiers. Here, random indicates that the split-test method
adopted in our algorithm randomly selects an index among the discretization
intervals formed by the ordered values of a numerical attribute, and sets the
mean value of that interval as the cut-point. On the other hand, a node with a
discrete attribute is not split until the count of collected instances meets a
specified threshold (the default value is two). The remaining details of tree
growing are similar to the descriptions in [11, 14].
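As a minimal sketch of the random split-test just described (our own
illustrative reading, not the authors' Visual C++ implementation; the function
name and structure are hypothetical):

```python
import random

def random_cut_point(values, rng=None):
    """Pick a cut-point for a numerical attribute by choosing one of the
    discretization intervals between consecutive ordered distinct values
    at random, and returning that interval's mean value (sketch)."""
    rng = rng or random.Random(42)
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None  # nothing to split on
    # choose an interval index at random, then take the interval's midpoint
    idx = rng.randrange(len(distinct) - 1)
    return (distinct[idx] + distinct[idx + 1]) / 2.0

cut = random_cut_point([3.1, 0.5, 2.2, 4.8, 1.7])
# any returned cut-point lies strictly between the minimum and maximum values
```

This random choice replaces the heuristic (e.g., information-gain) scan over
all candidate cut-points, which is what keeps the per-node split cost low.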
Concept Drifting Detection
In this subsection, we first introduce several basic concepts relevant to
concept drift.
Definition 1. A concept signifies either a stationary distribution of class
labels in a set of instances of the current data streams, or a similar
distribution rule over the attributes in the given instances.
According to the divergence of concept drifting patterns, the change modes of a
concept can be divided into three types: concept drift, concept shift and
sampling change, as described in [15].
Definition 2. Concept drift and concept shift are patterns with distinct speeds
of change in the attribute values or class labels of the database: the first
refers to gradual change, while the second indicates rapid change.
Definition 3. Sampling change is mostly attributed to a change of pattern in
the distribution of class labels. (In this paper, all such changes are referred
to as concept drifts.)
In CDRDT, detection of distribution changes in the streaming data is performed
after a data chunk has traversed all of the random decision trees. The various
types of concept drift are distinguished from noise by means of the relation
between the difference of the average classification error rates at the leaves
and the specified thresholds. Here, the thresholds are specified via the
Hoeffding bound inequality, described as follows. Consider a real-valued random
variable r whose range is R. Suppose we have made n independent observations of
this variable and computed their mean r̄. Then, with probability 1 − δ, the
true mean of the variable is at least r̄ − ε:

P(r ≥ r̄ − ε) = 1 − δ,  ε = √(R² ln(1/δ) / (2n))   (1)

where R is defined as log(M(classes)) and M(classes) denotes the number of
class labels in the current database; n refers to the size of the current
streaming data chunk; and the random variable r specifies the expected error
rate, classified by the Naïve Bayes or majority-class method at the leaves,
over all random decision tree classifiers in CDRDT. Suppose the target object
of r̄ is the historical classification result on the i-th chunk (denoted ēf)
and the current observation is the estimated classification result on the
(i+1)-th chunk (denoted ēs). The detailed definition of ēf (ēs) is formalized
below.
ēf (ēs) = (1/N) · Σ_{k=1}^{N} [ Σ_{i=1}^{M_leaf^k} (p_ki · n_ki) / Σ_{i=1}^{M_leaf^k} n_ki ]   (2)

In this formula, N signifies the total number of trees; M_leaf^k refers to the
number of leaves of the k-th classifier; n_ki is the count of instances at the
i-th leaf of classifier CTk; and p_ki is the error rate estimated with the 0–1
loss function at the i-th leaf of CTk. In terms of Formula (2), we utilize the
difference between ēs and ēf (i.e., Δe = ēs − ēf) to discover the distribution
changes of the class labels. More specifically, if the value of Δe is
nonnegative, a potential concept drift is taken into account; otherwise, the
case is regarded as free of any concept drift.
This follows from statistical theory, which guarantees that for a stationary
distribution of instances the online error of Naïve Bayes will decrease, while
when the distribution of the instances changes, the online error of Naïve Bayes
at the node will increase [16]. For classification with the majority-class
method, a similar rule can be concluded from the distribution changes of class
labels over small chunks of streaming data, provided the chunks contain
sufficient instances. (In this paper, the minimum size of a data chunk, denoted
nmin, is set to 0.2k, where 1k = 1000 instances; this follows the conclusion in
[22].) This is also verified by our experiments on the tracking of concept
drifts in Section 4. Hence, Eq. (1) can be transformed into Eq. (3):

P(ēs − ēf ≤ ε0) = 1 − δ0,  ε0 = √(R² ln(1/δ0) / (2n))   (3)
To distinguish the diverse concept drifts from noise, it is necessary to
specify different values of ε0 to partition their bounds, which refer to the
tolerable bounds of deviation between the current error rate and the reference
error rate. Evidently, the larger the value of ε0, the higher the drifting
likelihood: it is more probable that the previous model no longer fits the
current data streams owing to its deficient classification accuracy.
Correspondingly, the value of δ0 decreases while the confidence 1 − δ0
increases. Therefore, inspired by [8], two thresholds, Tmax and Tmin, are
defined in the Hoeffding bound inequality to control the deviation of the
classification error rates. Considering the demand on the predictive ability of
the current models, their values are specified as follows:

P(ēs − ēf ≤ Tmax) = 1 − δmin,  Tmax = 3ε0,  δmin = 1/exp(Tmax² · 2n / R²)   (4)

P(ēs − ēf ≤ Tmin) = 1 − δmax,  Tmin = ε0,  δmax = 1/exp(Tmin² · 2n / R²)   (5)
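To make Eqs. (3)–(5) concrete, the following small sketch computes ε0, Tmin,
Tmax and δmin from a chunk size and a class count (the parameter values in the
call are illustrative, not taken from the paper's experiments):

```python
import math

def hoeffding_thresholds(n, num_classes, delta_max=0.1):
    """Compute the drift thresholds of Eqs. (3)-(5) (sketch).
    n: size of the current data chunk; num_classes: M(classes);
    R = log(M(classes)); Tmin = eps0 and Tmax = 3*eps0 as in the paper."""
    R = math.log(num_classes)
    # Eq. (3)/(5): eps0 follows from delta_max = exp(-eps0^2 * 2n / R^2)
    eps0 = math.sqrt(R * R * math.log(1.0 / delta_max) / (2.0 * n))
    t_min, t_max = eps0, 3.0 * eps0
    delta_min = math.exp(-(t_max ** 2) * 2.0 * n / (R * R))  # Eq. (4)
    return t_min, t_max, delta_min

t_min, t_max, delta_min = hoeffding_thresholds(n=200, num_classes=2)
```

Note how a larger chunk (larger n) shrinks both thresholds, making the detector
more sensitive to small shifts in the average leaf error rate.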
Adaptation to Concept Drifts Contaminated by Noise
In accordance with the analysis above and the threshold definitions in Eqs. (4)
and (5), four concept drifting states are partitioned: non-concept drift,
potential concept drift, plausible concept drift and true concept drift.
Namely, if the value of Δe is negative, the state is a non-concept drift;
otherwise, one of the other three drift states applies. More precisely, if the
value of Δe is less than Tmin, a potential concept drift is considered
(potential indicates that a slower or much slower concept drift is probably
occurring). If it is greater than Tmax, a true concept drift is assumed,
resulting either from a potential concept drift or from an abrupt concept
drift. Otherwise, the state is attributed to a plausible concept drift, which
accounts for the effect of noise contamination and spans the transition
interval between a potential and a true concept drift. This fuzzy status helps
to reduce the impact of noise in the data streams and to avoid over-sensitivity
to concept drifts.
Correspondingly, different strategies are adopted to handle the various types
of concept drift. More specifically, for a non-concept drift, the size of the
current data chunk is maintained at its default value (e.g., nmin). For a
potential concept drift, the chunk size is increased by mmin instances (e.g.,
mmin = nmin = 0.2k). For a plausible concept drift, the streaming data chunk
size and the check period are each shrunk by one third, because the change in
the data streams must be observed further before a deterministic drift type can
be decided. For a true concept drift, the sizes are reduced to half of their
original values. To counter the disadvantages of streaming data chunks that are
too large or too small, a maximum bound (e.g., mmax = 10·nmin) and a minimum
bound (e.g., mmin) are specified to control the change magnitude of a data
chunk for better adaptation to the concept changes: once a bound is reached,
the check period remains invariant until a new concept drift occurs.
Furthermore, to improve the utility of each tree, those sub-branches whose
classification error rates are lower than the average level (e.g., 50%) are
pruned.
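The four drift states and the corresponding chunk-size adjustments described
above can be sketched as follows (a simplified illustration with our own
function names; the bounds m_min and m_max follow the examples in the text,
with 0.2k = 200 instances):

```python
def classify_drift(delta_e, t_min, t_max):
    """Map the error-rate difference to one of the four drift states."""
    if delta_e < 0:
        return "none"          # non-concept drift
    if delta_e < t_min:
        return "potential"     # slow drift probably occurring
    if delta_e > t_max:
        return "true"          # confirmed (or abrupt) drift
    return "plausible"         # between Tmin and Tmax: possibly noise

def adjust_chunk_size(size, state, m_min=200, m_max=2000):
    """Adapt the check period (chunk size) to the detected state,
    clipped to [m_min, m_max] as in the paper (sketch)."""
    if state == "none":
        new = size             # keep the current (default) size
    elif state == "potential":
        new = size + m_min     # grow to watch the slow drift
    elif state == "plausible":
        new = size - size // 3 # shrink by one third to observe further
    else:                      # "true" concept drift
        new = size // 2        # halve for a quick reaction
    return max(m_min, min(m_max, new))
```

For example, a 600-instance chunk becomes 800 under a potential drift, 400
under a plausible one and 300 under a true one, never leaving [m_min, m_max].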
3.2 Analysis
Generation Error Rate for Concept Drifting Data Streams
According to the generation-error theorem analyzed in [17], as the number of
trees increases, for almost surely all sequences Θ1, ..., the generation error
PE converges to

P_{X,Y}( P_Θ(h(X,Θ) = Y) − max_{j≠Y} P_Θ(h(X,Θ) = j) < 0 )   (6)

where X is the training set, Y denotes the class label, Θ specifies the random
feature vector generated from the attribute set, P_{X,Y} indicates the
probability over the X,Y space and h(X,Θ) refers to the classifier. Eq. (6)
rests on the assumption that the sequences Θ1, ... are independent identically
distributed random vectors. However, for concept drifting data streams, it is
probable that the streaming data distribution does not remain uniform as time
flows; as a result, it is improper to judge the convergence of the generation
error in this case. Therefore, in the analysis of our ensemble model, we give
an infimum bound of the generation error. Because each concept drift detection
is performed after a certain number of instances, the training data are divided
into small sequences θt (t ∈ {1, 2, ..., |B|}, where |B| is the maximum
sequence index). The generation error of our model on a chunk θt can be
expressed as:

PE_{θt}^{Tt} = P_{θt}( V(Tt, θt) = Y ) − max_{j≠Y} P_{θt}( V(Tt, θt) = j )   (7)

where Tt specifies the current decision tree ensemble, each tree being
generated or updated with the data chunks {θk, 1 ≤ k ≤ t}, and V(·) signifies
the voting function acting on the data chunk θt classified by the current
ensemble of random decision trees. Considering the worst case, the generation
error is defined in Formula (8):

PE = max_{(Tt,θt)} P_{(Tt,θt)}( PE_{θt}^{Tt} < 0 ) ≥ P_{(Tt,θt)}( PE_{θt}^{Tt} < 0 )   (8)

Since max_{j≠Y} P_{θt}(V(Tt,θt) = j) ≤ 1 − P_{θt}(V(Tt,θt) = Y), Eq. (8) can be
written as Eq. (9):

PE ≤ P_{(Tt,θt)}( P_{θt}(V(Tt,θt) = Y) ≤ 0.5 )   (9)

Based on the analysis of the probability of the optimal ensemble model in [14],
i.e., P(M(Attr), N, h0) = 1 − (1 − 1/M(Attr))^{N·(2^{h0} − 1)}, we take it as an
estimate of the classification accuracy. Therefore, the generation error
referring to the probability P(M(Attr), N, h0) ≤ 0.5 can be formalized as
Eq. (10):

PE ≤ P( P(M(Attr), N, h0) ≤ 0.5 )   (10)

It clearly shows that the higher the optimal probability of an ensemble model,
the lower the generation error rate. As a consequence, we can adjust the number
of trees or the heights of the trees to improve the predictive accuracy for
adaptation to concept drifts.
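As a quick numeric illustration of the quantity appearing in Eq. (10), the
optimal-ensemble probability from [14] can be evaluated directly (a sketch; the
exponent is our reading of the formula above, and the parameter values are
illustrative):

```python
def optimal_ensemble_probability(m_attr, n_trees, h0):
    """P(M(Attr), N, h0) = 1 - (1 - 1/M(Attr))^(N * (2^h0 - 1)),
    the probability of the optimal ensemble model from [14] (sketch)."""
    return 1.0 - (1.0 - 1.0 / m_attr) ** (n_trees * (2 ** h0 - 1))

# raising the number of trees (or their height) raises this probability,
# which by Eq. (10) tightens the bound on the generation error
p_small = optimal_ensemble_probability(m_attr=50, n_trees=10, h0=5)
p_more_trees = optimal_ensemble_probability(m_attr=50, n_trees=20, h0=5)
```

This monotonicity is exactly the lever the paper mentions: tuning N or h0
trades extra model size for a lower generation-error bound.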
4 Experiments
To verify the efficiency and effectiveness of CDRDT in detecting different
types of concept drift in noisy data streams, extensive experiments are
conducted on diverse benchmark concept-drifting databases and on real streaming
data obtained from the Yahoo! Shopping web service. The experimental study
shows that CDRDT not only detects concept changes timely and effectively with a
certain resilience to noise, but also outperforms CVFDT and MSRT in runtime,
space and predictive accuracy. This section is accordingly divided into two
parts: the first discusses the characteristics of all concept-drifting
databases used in our experiments, and the second analyzes the drift tracking
in CDRDT and the performance in runtime, space and predictive accuracy. (All
experiments reported here were performed on a P4 3.00 GHz PC with 1 GB main
memory running Windows XP Professional; all algorithms used in our experiments
are written in Visual C++.) Owing to limited space, only partial experimental
results are given below.
4.1 Data Source
Synthetic Data
HyperPlane. HyperPlane is a benchmark database of data streams with gradual
concept drift, which has been used in numerous references, including [10, 3,
11, 2]. A hyperplane in a d-dimensional space (d = 50) is denoted by the
equation Σ_{i=1}^{d} w_i·x_i = w_0. Each vector of variables (x1, x2, ..., xd)
in this database is a randomly generated instance, uniformly distributed in the
multidimensional space [0, 1]^d. If Σ_{i=1}^{d} w_i·x_i ≥ w_0, the class label
is 1; otherwise it is 0. The bound of each coefficient w_i is limited to [−10,
10]. The initial value of each weight w_i is generated at random; it then
increases or decreases continuously by Δw_i = 0.005 until it reaches the upper
or lower boundary, and then changes direction with probability p_w = 10%.
Meanwhile, in order to simulate concept drift, we select 5 dimensions whose
weights change, and introduce a noise rate of r = 10% into the database.
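A compact sketch of such a hyperplane stream generator (our own simplified
reading of the description above: reduced dimension for brevity, and the weight
direction always flips at the boundary rather than with probability p_w; the
threshold choice is illustrative):

```python
import random

def hyperplane_stream(n_instances, d=5, n_drift_dims=2, delta_w=0.005,
                      noise_rate=0.10, seed=1):
    """Yield (x, label) pairs from a gradually drifting hyperplane (sketch).
    Weights of the first n_drift_dims dimensions move by delta_w per
    instance, bouncing inside [-10, 10]; labels are flipped with
    probability noise_rate to simulate class noise."""
    rng = random.Random(seed)
    w = [rng.uniform(-10, 10) for _ in range(d)]
    w0 = sum(w) / 2.0                    # illustrative threshold choice
    direction = [1.0] * n_drift_dims
    for _ in range(n_instances):
        x = [rng.random() for _ in range(d)]           # uniform in [0,1]^d
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= w0 else 0
        if rng.random() < noise_rate:
            label = 1 - label                          # class noise
        yield x, label
        for i in range(n_drift_dims):                  # gradual drift
            w[i] += direction[i] * delta_w
            if abs(w[i]) >= 10:
                direction[i] = -direction[i]
```

Because the weights move a little after every instance, the decision boundary
drifts gradually rather than shifting abruptly between chunks.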
SEA. The artificial SEA data, first described in [1], is a well-known
concept-shift data set with numerical attributes only. It is composed of 60k
random points in a three-dimensional feature space with two classes. All three
features take values between 0 and 10, but only the first two features are
relevant. The points are divided into four chunks with different concepts. In
each chunk, a data point belongs to class 1 if f1 + f2 ≤ θ, where f1 and f2
represent the first two features and θ is a threshold value between the two
classes. In this database, four thresholds, 8, 9, 7 and 9.5, divide the data
chunks. Each chunk reserves 2.5k records as test sets containing 10% class
noise for the different concepts, and the remaining 50k points are treated as
training data, in which a concept shift appears every 12.5k instances.
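Following the description above, a SEA-style chunk generator can be sketched as
follows (a hypothetical helper, not the original SEA code; seed and sizes are
illustrative):

```python
import random

def sea_chunk(n_points, theta, noise_rate=0.0, seed=7):
    """Generate one SEA concept chunk: three features uniform in [0, 10],
    class 1 iff f1 + f2 <= theta; optional class-label noise (sketch)."""
    rng = random.Random(seed)
    chunk = []
    for _ in range(n_points):
        f = [rng.uniform(0.0, 10.0) for _ in range(3)]  # f[2] is irrelevant
        label = 1 if f[0] + f[1] <= theta else 0
        if rng.random() < noise_rate:
            label = 1 - label
        chunk.append((f, label))
    return chunk

# four concepts with the thresholds from the paper's setup
concepts = [sea_chunk(12500, th) for th in (8, 9, 7, 9.5)]
```

Switching θ between chunks produces the abrupt concept shifts that the
Drift-Track curves in Section 4 are expected to pick up.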
KDDCup99. The KDDCup99 database [20] is a database for network intrusion
detection, selected here because it has been simulated as streaming data with
sampling change in [15]. The database has 41 attribute dimensions, of which 34
are numerical, and 24 class labels in total. Owing to the skewed distribution
of class labels, the data of the minor classes (i.e., classes whose total count
is lower than nmin) are treated as noise. Hence, the noise-free data set
contains 490k instances with 12 class labels.
Real Data
Yahoo! Shopping Data. The web shopping data used in our experiments are
obtained via the interface of Yahoo! web services. They are sampled from the
Yahoo! shopping databases relevant to catalog listing, product search and
merchant search. The basic feature set of this data contains 17 attribute
dimensions, 10 of them numerical. It is composed of product information with
the attribute set (NumRatingofProduct, AverageRating, etc.) and related
merchant information with the attributes (NumRatingsofMerchant,
Price-SatisfactionRating, OverallRating, etc.). The correlation between a
product and a merchant is established by the catalog listing (see [21] for more
details). To mine the relation between the credibility of merchants and its
possible factors, the attribute OverallRating with its different scores is
defined as our class attribute, divided into five class labels. By the label
distribution, we randomly extract 84k instances from the obtained records as a
training set and the remaining 28k instances as a test set.
4.2 Experimental Evaluation on Synthetic Databases
Before presenting the experimental results on the synthetic databases, several
symbols used in our experiments are defined in the table below. On the setting
of parameters, those of CVFDT and MSRT follow their original definitions in
[10, 11] respectively, while for CDRDT the parameter values are specified as
follows: N = 10, h0 = M(attr)/2 as defined in [22], an initial streaming data
chunk size of |Sj| = 0.2k (i.e., |CP|), MC = 500k, and δmax = 0.1 (ε0 and δmin
are calculated by Eqs. (5) and (4) respectively). The experimental details are
described below.

Symbol — Description:
- Max/Bayes: the classification method of majority-class / Naïve Bayes.
- ***-Max/Bayes: "***" refers to the name of a database; the whole symbol
  denotes that the experimental results are classified by the method of
  Max/Bayes.
- Error rate: the error rate estimated on a test set or on a specified data
  chunk, unit: (%). For test sets, the value is averaged over 20 runs
  classified by 20·N trees.
- (T+C) time: the training + test time, unit: (s). Algorithms based on
  ensemble decision trees run in a simulated parallel environment of multiple
  PCs, so the time overhead is taken as the largest among the N trees; for
  CDRDT, however, the "T" time is the total generation time of the N trees.
- Memory: the total memory consumption of all trees, unit: (M). All results
  are averaged over 20 runs, as for Error rate.
- Database name: structure: name + training-data size + test-data size +
  database type (C: numerical, D: discrete, CD: hybrid) + number of attribute
  dimensions, e.g., SEA-50K-2.5K-C-3.
- Drift-Track: the process of detecting concept drifts in a certain
  time interval.
- Drift-Level: for KDDCup99, it refers to the index of the different class
  labels, marked as Class label; for HyperPlane, it stands for the difference
  rate of class labels compared with the original database after introducing
  noise, marked as Drift-rate, unit: (%). The unified name is Drift Level.
- Period-Change: short for the size of a data chunk in concept drifting
  detection, i.e., the count of instances for each detection, unit: (k).
Tracking Concept Drifts
The tracking curves of concept drifts are plotted in Figures 1–4, which present
various detection cases on databases with different drifting characteristics.
In these figures, the Drift-Track curves are drawn in solid lines against the
scale of the left y-axis and the Drift-Level curves in dotted lines against the
scale of the right y-axis. The changing values of Period-Change are represented
by plus signs ("+") in the corresponding figures; their scale is specified as
follows: the lowest value starts from 1k and the basic unit (denoted BU) is set
to 0.5k, based on the scale of the left y-axis. In Figure 1-a, due to the
different magnitudes of Drift-Level, the corresponding tracking curve
fluctuates variously. Especially at the beginning of detection, larger
discrepancies appear owing to the insufficiency of training data. Furthermore,
when the current drifting level transfers to another, a jump appears in the
tracking curve from a local minimum point to a local maximum one. However,
owing to the gradual concept drifts in HyperPlane, little deviation occurs
between adjacent detection results; as a result, most of the check periods
remain stable. Moreover, as the streaming data increase, the fluctuation trend
gradually converges.
Fig. 1. (left) Drift track on HyperPlane; (right) Classification results
Fig. 2. Drift track over sequential data chunks for SEA
Considering the concept shifting detection on SEA, the tracking results
fluctuate with three shifting occurrences, as shown in Figure 2. Comparing the
detection cases under the Max and Bayes methods, several common
characteristics emerge: i) the fluctuations are frequent at the beginning of the training
even without any concept change, similar to the case of HyperPlane; ii) the
upward and downward trends in the curves alternate with the shifting of the
concepts. This mainly results from the distribution changes of the only two
class labels in this database, which contains four concepts in total, varying
in ratio from 1.8:1 to 0.85:1. (In this figure, both the minimum period and the
unit of period refer to 0.2k instances, taking the left y-axis as the scale
axis.)
With respect to the detection on the sampling change, the tracking curves
drawn in Figures 34 describe the detection results on KDDCup99 in the classi-
fication methods of Max and Bayes respectively. In the observations, we can see
that the more the frequent changes in the distribution of class labels the fiercer
the fluctuations occur. It is demonstrated noticeably in the curve segment span-
ning the interval from the 1st data chunk to the 121th one (For KDDCup99,
the default value of check period is set to 1k and the unit of BU is 0.2k. In
Figure 3, the distance of two basic units in the left y-axis signifies one BU for
a conspicuous description while a unit refers to one BU in Figure 4.). However,
if the stable distribution of a class label is reached, the trend of tracking curves
will fall down, such as during the curve segment between the 121th chunk and
Concept Drifting Detection on Noisy Streaming Data 247
Fig. 3. Drift track over sequential data chunks classified with Max for KDDCup99
Fig. 4. Drift track over sequential data chunks classified with Bayes for KDDCup99
the 321st one. Hence, the check period remains unchanged, since this case is
regarded as a state of non-concept drift.
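The dynamic adjustment of the check period described here (shrinking it when a drift is suspected, holding or restoring it in a state of non-concept drift) can be sketched as follows. CDRDT's exact update rule is not given in this section, so the halving-and-doubling policy, the default period, and the floor value below are purely assumed illustrations:

```python
def adjust_check_period(period, drift_detected, default_period=1000, min_period=200):
    """Illustrative policy: shrink the check period when a drift is flagged,
    otherwise grow it back toward the default (state of non-concept drift)."""
    if drift_detected:
        return max(min_period, period // 2)
    return min(default_period, period * 2)

period = 1000                                   # default check period (1k)
period = adjust_check_period(period, True)      # drift flagged -> 500
period = adjust_check_period(period, False)     # stable again  -> 1000
```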
Predictive Accuracy and Overheads of Time and Space
Besides the drift detection on the training data, we also evaluate the classification
abilities of CDRDT on different test sets in comparison to CVFDT
and MSRT. Concretely, first, the classification results on HyperPlane are plotted
in Figure 1 (right), including the overheads of runtime and space and the mean error
rates of classification together with their variances. It is clear that CDRDT outperforms
the other two algorithms on all the abilities involved above. Second, the classification
results on SEA and KDDCup99 are summarized in Tables 1 and 2, respectively. As
shown in Table 1, although under the Max classification method CDRDT does not
perform as well as the other algorithms on predictive accuracy, the error rate
can be reduced by at least 6.15% when adopting the Bayes method. Furthermore,
as on HyperPlane, its overheads of space and runtime are the lowest, with the
largest deviation reaching dozens of times. Moreover, in
Table 2, the predictive accuracy of CDRDT is improved substantially, by 14.76% and
36.61% under Max as compared with CVFDT and MSRT, respectively. Meanwhile,
with the Bayes method, the superiority of CDRDT in predictive accuracy
is also prominent, and its performance on both runtime
and space consumption is much better as well.
Table 1. Classification results on SEA-50k-2.5k-C-3

Algorithm   Error rate (%)                       (T+C) time (s)    Memory (M)
            Max              Bayes               Max      Bayes    Max    Bayes
            mean   variance  mean   variance
CDRDT       44.32  3.762     13.53  1.354        0+0      0+0      <1     5
MSRT        25.98  2.770     19.66  2.344        0+0      0+0      7      5
CVFDT       25.24  /         /      /            8+0      /        41     /
Table 2. Classification results on KDDCup99-490k-310k-CD-41

Algorithm   Error rate (%)                       (T+C) time (s)    Memory (M)
            Max              Bayes               Max      Bayes    Max    Bayes
            mean   variance  mean   variance
CDRDT       8.87   0.169     9.06   0.488        32+19    41+123   6      7
MSRT        44.48  22.350    28.60  19.896       93+18    92+504   65     5
CVFDT       23.48  /         /      /            75+18    /        20     /
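The mean error rates and variances reported in Tables 1 and 2 are per-chunk statistics over the sequential test sets. A minimal sketch of how such figures are typically computed follows; the chunking scheme is a generic assumption, not a reproduction of CDRDT's evaluation code:

```python
def error_stats(predictions, truths, chunk_size):
    """Mean and variance of per-chunk error rates, in percent."""
    rates = []
    for i in range(0, len(truths), chunk_size):
        p = predictions[i:i + chunk_size]
        t = truths[i:i + chunk_size]
        errors = sum(1 for a, b in zip(p, t) if a != b)
        rates.append(100.0 * errors / len(t))
    mean = sum(rates) / len(rates)
    variance = sum((r - mean) ** 2 for r in rates) / len(rates)
    return mean, variance
```

Population variance is used here for simplicity; the paper does not state which variance estimator its tables report.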
4.3 Experimental Evaluation on Web-Shopping Data
For real-world data streams, it is hard to judge whether the current data
carry a potential concept drift or when a concept drift occurs; meanwhile,
the effect of noise is inevitable. Therefore, to verify the
feasibility and utility of our algorithm, we conduct several comparison experiments
with other algorithms on the real data streams of Yahoo! shopping data as
well. The experimental results listed in Table 3 show that CDRDT is superior
to CVFDT and MSRT in predictive accuracy and in the overheads of runtime
and space. For instance, the predictive accuracy of CDRDT is the highest,
improved by 9.32% on average even in the worst case. Its training
time is reduced by a factor of at least 1.5, with an approximately equal overhead
on test time, in contrast to the other two algorithms. In addition, regarding space
consumption, CDRDT's maximum is only half that of
CVFDT, while its minimum is only 1/25 that of MSRT.
Table 3. Classification on Yahoo!-shopping-data-84k-28k-CD-16

Algorithm   Error rate (%)                       (T+C) time (s)    Memory (M)
            Max              Bayes               Max      Bayes    Max    Bayes
            mean   variance  mean   variance
CDRDT       14.02  5.545     4.45   1.936        2+1      6+5      <1     6
MSRT        33.52  17.891    44.42  39.166       11+1     15+20    30     5
CVFDT       23.34  /         /      /            10+3     /        13     /
5 Conclusion
In this paper, we have proposed an ensemble classification algorithm named
CDRDT for Concept Drifting detection from noisy data streams, which is based
on Random Decision Trees. In contrast to previous efforts on ensemble classifiers
of decision trees or random decision trees, CDRDT adopts small data chunks of
variable sizes to generate the random decision tree classifiers incrementally.
To effectively distinguish different types of concept drifts from noise,
two thresholds are defined by virtue of the Hoeffding bound inequality. Furthermore,
for better adaptation to concept drifts, the check period is adjusted
dynamically and in a timely manner. Moreover, extensive experiments are conducted on three
types of synthetic concept drifting databases and a real-world database of Yahoo!
shopping data. The experimental results demonstrate that CDRDT
adapts to various concept drifts in noisy data streams efficiently. In addition,
compared with the state-of-the-art algorithm CVFDT and the ensemble
algorithm MSRT, it performs better in runtime, space, and
predictive accuracy. Hence, we conclude from this study that CDRDT
is a lightweight ensemble classification algorithm that provides an
efficient method for detecting a variety of concept drifts in data streams.
However, how to model noisy data so as to discern concept drifts from noise
accurately, and how to deal with skewed class-label distributions
in data streams, remain challenging and interesting issues for our future
work.
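For reference, the Hoeffding bound mentioned above states that the mean of n independent observations with value range R deviates from its expectation by more than epsilon = sqrt(R^2 ln(1/delta) / (2n)) with probability at most delta. A sketch of how such a bound yields a noise-versus-drift threshold follows; the values of R, n, and delta are hypothetical, not the thresholds CDRDT actually uses:

```python
import math

def hoeffding_bound(value_range, n, delta):
    """Deviation bound epsilon for the mean of n observations whose
    values span `value_range`, holding with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# With error rates in [0, 1], 200 instances per chunk, and delta = 1e-4,
# a rise in observed error rate beyond epsilon is unlikely to be mere noise.
epsilon = hoeffding_bound(1.0, 200, 1e-4)   # roughly 0.15
```

Note that the bound shrinks as n grows, so larger chunks allow smaller drifts to be distinguished from noise at the same confidence level.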
References
1. Street, W., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification.
In: 7th ACM SIGKDD international conference on Knowledge Discovery
and Data mining, KDD 2001, pp. 377–382. ACM Press, New York (2001)
2. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining Concept-Drifting Data Streams
Using Ensemble Classifiers. In: 9th ACM SIGKDD international conference on
Knowledge Discovery and Data mining, KDD 2003, pp. 226–235. ACM Press, New
York (2003)
3. Fan, W.: Streamminer: a classifier ensemble-based engine to mine concept-drifting
data streams. In: 30th international conference on Very Large Data Bases, VLDB
2004, pp. 1257–1260. VLDB Endowment (2004)
4. Chu, F., Wang, Y., Zaniolo, C.: An adaptive learning approach for noisy data
streams. In: 4th IEEE International Conference on Data Mining, pp. 351–354.
IEEE Computer Science, Los Alamitos (2004)
5. Gama, J., Fernandes, R., Rocha, R.: Decision trees for mining data streams. Intel-
ligent Data Analysis 10, 23–45 (2006)
6. Scholz, M., Klinkenberg, R.: Boosting Classifiers for Drifting Concepts. Intel-
ligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data
Streams 11(1), 3–28 (2007)
7. Hoeffding, W.: Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association 58(301), 13–30 (1963)
8. Castillo, G., Gama, J., Medas, P.: Adaptation to Drifting Concepts. In: Pires, F.M.,
Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 279–293. Springer,
Heidelberg (2003)
9. Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.Y.: BOAT: Optimistic decision tree
construction. In: 1999 ACM SIGMOD International Conference on Management
of Data, pp. 169–180. ACM Press, New York (1999)
10. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In:
7th ACM SIGKDD international conference on Knowledge Discovery and Data
mining, KDD 2001, pp. 97–106 (2001)
11. Li, P., Hu, X., Wu, X.: Mining concept-drifting data streams with multiple semi-
random decision trees. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X.
(eds.) ADMA 2008. LNCS, vol. 5139, pp. 733–740. Springer, Heidelberg (2008)
12. Ho, T.K.: Random decision forests. In: 3rd International Conference on Document
Analysis and Recognition, pp. 278–282. IEEE Computer Society, Los Alamitos
(1995)
13. Abdulsalam, H., Skillicorn, D.B., Martin, P.: Classifying Evolving Data Streams
Using Dynamic Streaming Random Forests. In: Bhowmick, S.S., Küng, J., Wagner,
R. (eds.) DEXA 2008. LNCS, vol. 5181, pp. 643–651. Springer, Heidelberg (2008)
14. Hu, X., Li, P., Wu, X., Wu, G.: A semi-random multiple decision-tree algorithm for
mining data streams. Journal of Computer Science and Technology 22(5), 711–724
(2007)
15. Yang, Y., Wu, X., Zhu, X.: Combining Proactive and Reactive Predictions for Data
Streams. In: 11th ACM SIGKDD international conference on Knowledge Discovery
in Data mining, KDD 2005, pp. 710–715. ACM Press, New York (2005)
16. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997)
17. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
18. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers
Inc., San Francisco (1993)
19. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data
mining. In: 22nd International Conference on Very Large Data Bases, VLDB 1996,
pp. 544–555. Morgan Kaufmann, San Francisco (1996)
20. KDDCUP 1999 DataSet,
http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html
21. Yahoo! Shopping Web Services, http://developer.yahoo.com/everything.html
22. Li, P., Liang, Q., Wu, X., Hu, X.: Parameter Estimation in Semi-Random Decision
Tree Ensembling on Streaming Data. In: Theeramunkong, T., Kijsirikul, B., Cer-
cone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 376–388.
Springer, Heidelberg (2009)
23. Wikipedia, http://en.wikipedia.org/wiki/Data_stream
24. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees.
Neural Computation 9, 1545–1588 (1997)
25. Ho, T.K.: The random subspace method for constructing decision forests. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)