Concept Drifting Detection on Noisy Streaming
Data in Random Ensemble Decision Trees
Peipei Li1,2, Xuegang Hu1, Qianhui Liang2, and Yunjun Gao2,3
1School of Computer Science and Information Technology, Hefei University of
Technology, China, 230009
2School of Information Systems, Singapore Management University, Singapore,
178902
3College of Computer Science, Zhejiang University, China, 310027
Abstract. Although a large number of inductive learning algorithms have been
developed for concept-drifting data streams, especially those built on ensemble
classification models, few of them can detect the different types of concept
drift in noisy streaming data with low time and space overheads. Motivated by
this, a new classification algorithm for Concept drifting Detection based on an
ensemble model of Random Decision Trees (called CDRDT) is proposed in this
paper. Extensive experiments with synthetic and real streaming data demonstrate
that, in comparison with several representative classification algorithms for
concept-drifting data streams, CDRDT not only detects potential concept changes
in noisy data streams effectively and efficiently, but also requires much less
runtime and space while improving predictive accuracy. Our algorithm thus
provides a light-weight reference solution for classifying concept-drifting
data streams with noise.
Keywords: Data Streams, Ensemble Decision Trees, Concept Drift,
Noise.
P. Perner (Ed.): MLDM 2009, LNAI 5632, pp. 236–250, 2009.
© Springer-Verlag Berlin Heidelberg 2009
1 Introduction
As defined in [23], a data stream is an ordered sequence of tuples arriving at
certain time intervals. Compared with traditional data sources, it presents new
characteristics such as being open-ended, continuous and high-volume. It is
hence a challenge for most traditional inductive models or classification
algorithms [18,19,9] to learn from such streaming data, especially when faced
with concept drift and noise contamination in real applications such as web
search, online shopping and stock markets. To handle these problems, numerous
classification models and algorithms have been proposed. Representative ones
are based on ensemble learning, including SEA [1], an early ensemble algorithm
addressing concept drift in data streams, a general framework for mining
concept-drifting data streams using weighted ensemble classifiers [2], a
discriminative model based on the EM framework for fast mining of noisy data
streams [4], decision tree algorithms for concept drifting data streams with
noise [5,11], and a boosting-like method for adaptation to different kinds of
concept drifts [6]. However, these algorithms share two main limitations: on
one hand, little attention is paid to handling the various types of concept
drift in data streams affected by noise; on the other hand, they often demand
heavy space and runtime overheads without a marked improvement in predictive
accuracy.
Therefore, to address the aforementioned issues, we present CDRDT, a
light-weight ensemble classification algorithm for Concept Drifting data
streams with noise. It is based on random decision trees evolved from the
semi-random decision trees in [14]: it adopts a random-selection strategy,
instead of a heuristic method, to solve the split-test for nodes with numerical
attributes. In comparison with other ensemble models of random decision trees
for concept-drifting data streams, CDRDT makes four significant contributions:
i) the basic classifiers are constructed incrementally from small,
variable-sized chunks of streaming data; ii) the Hoeffding bound inequality [7]
is used to specify two thresholds for detecting concept drift under noise,
which helps to distinguish the different types of concept drift from noise;
iii) the sizes of the data chunks are adjusted dynamically within bounded
limits to adapt to concept drift, which avoids the disadvantages of overly
large or overly small data chunks when detecting changes in the data
distribution, especially when classifying with the majority-class method; iv)
the effectiveness and efficiency of CDRDT in detecting concept drift in noisy
data streams are evaluated against other algorithms, including the
state-of-the-art algorithm CVFDT [10] and the recent ensemble algorithm MSRT
(Multiple Semi-Random decision Trees) [11] based on semi-random decision trees.
The experimental results show that CDRDT imposes light time and space overheads
while achieving higher predictive accuracy.
The rest of the paper is organized as follows. Section 2 reviews related work
on ensemble classifiers of random decision trees for learning from
concept-drifting data streams. Our CDRDT algorithm for concept drifting
detection in noisy data streams is described in detail in Section 3. Section 4
provides the experimental evaluation and Section 5 concludes the paper.
2 Related Work
Since the model of Random Decision Forests [12] was first proposed by Ho in
1995, the random-selection strategy for split features has been widely applied
in decision tree models, and many new or improved random decision trees have
appeared, such as [24, 25, 17]. However, they are not suitable for handling
data streams directly. Subsequently, a random decision tree ensembling method
[3] for streaming data was proposed by Fan in 2004; it adopts cross-validation
estimation for higher classification accuracy. Hu et al. designed the
incremental algorithm Semi-Random Multiple decision Trees for Data Streams
(SRMTDS) [14] in 2007, which uses the Hoeffding bound inequality with a
heuristic method to implement the split-test. In the following year, the
extended algorithm MSRT [11] was introduced by the same authors to reduce the
impact of noise on concept-drifting detection. In the same year, H. Abdulsalam
et al. proposed the stream-classification algorithm Dynamic Streaming Random
Forests [13], which handles evolving data streams whose underlying class
boundaries drift, using an entropy-based drift-detection technique.
In contrast with the decision-tree ensembling algorithms mentioned above, our
CDRDT classification algorithm for concept-drifting data streams has four
prominent characteristics. Firstly, the ensemble of random decision trees,
developed from semi-random decision trees, is generated incrementally from
variable-sized chunks of streaming data. Secondly, to avoid over-sensitivity to
concept drift and to reduce noise contamination, two thresholds are specified
in the Hoeffding bound inequality to partition the drift bounds. Thirdly, the
check period is adjusted dynamically to adapt to concept drift. Lastly, it
achieves better performance in space, time and predictive accuracy.
3 Concept Drifting Detection Algorithm Based on
Random Ensemble Decision Trees
3.1 Algorithm Description
The CDRDT classification algorithm proposed in this section detects concept
drift in data streams with noise. It first generates multiple random decision
tree classifiers incrementally from variable-sized chunks of streaming data.
After all the streaming data in a chunk have been seen (i.e., the check period
is reached), concept drifting detection is performed on the ensemble model.
Using thresholds pre-defined via the Hoeffding bound inequality, the difference
of the average error rates at the leaves, classified by the Naïve Bayes or
majority-class method, is used to measure changes in the distribution of the
streaming data, and the different types of concept drift are then distinguished
from noise. Once a concept drift is detected, the check period is adjusted
accordingly to adapt to the drift. Finally, majority-class voting or Naïve
Bayes is used to classify the test instances. Generally, the process flow of
CDRDT can be partitioned into three major components: i) the incremental
generation of random decision trees in the function GenerateClassifier; ii) the
concept drifting detection method in ComputeClassDistribution; iii) the
adaptation strategies for concept drift and noise in CheckConceptChange. The
related details are illustrated below.
Ensemble Classifiers of Random Decision Trees
Input: Training set: DSTR; Test set: DSTE; Attribute set: A; Initial height of
tree: h0; Minimum number of split-examples: nmin; Split estimator function:
H(·); Number of trees: N; Set of classifiers: CT; Memory constraint: MC; Check
period: CP.
Output: The error rate of classification.
Procedure CDRDT{DSTR, DSTE, A, h0, nmin, H(·), N, CT, MC, CP}
1. For each chunk Sj ⊆ DSTR of the training data streams (|CP| = |Sj|, j ≥ 1)
2.   For each classifier CTk (1 ≤ k ≤ N)
3.     GenerateClassifier(CTk, Sj, MC, CP);
4.   If all streaming data in Sj have been observed
5.     averageError = ComputeClassDistribution();
6.     If the current chunk is the first one
7.       fError = averageError;
8.     Else
9.       sError = averageError;
10.      If (j ≥ 2)
11.        CheckConceptChange(fError, sError, CP, Sj);
12.      fError = sError;
13. For each test instance in DSTE
14.   For each classifier CTk
15.     Travel the tree CTk from its root to a leaf;
16.     Classify with the method of majority class or Naïve Bayes in CTk;
17. Return the error rate of voting classification.

Unlike the previous algorithms in [11, 14], CDRDT on the one hand utilizes
streaming data chunks of various sizes to generate the ensemble of random
decision tree classifiers. Here, random indicates that the split-test method
adopted in our algorithm randomly selects an index among the discretization
intervals formed by the ordered values of a numerical attribute, and sets the
mean value of that interval as the cut-point. On the other hand, a node with a
discrete attribute is not split until the count of collected instances meets a
specified threshold (the default value is two). The remaining details of tree
growing are similar to the descriptions in [11, 14].
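As a minimal sketch of the random split-test just described (our own
illustrative reading, not the authors' Visual C++ implementation; the function
name and structure are hypothetical):

```python
import random

def random_cut_point(values, rng=None):
    """Pick a cut-point for a numerical attribute by choosing one of the
    discretization intervals between consecutive ordered distinct values
    at random, and returning that interval's mean value (sketch)."""
    rng = rng or random.Random(42)
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None  # nothing to split on
    # choose an interval index at random, then take the interval's midpoint
    idx = rng.randrange(len(distinct) - 1)
    return (distinct[idx] + distinct[idx + 1]) / 2.0

cut = random_cut_point([3.1, 0.5, 2.2, 4.8, 1.7])
# any returned cut-point lies strictly between the minimum and maximum values
```

This random choice replaces the heuristic (e.g., information-gain) scan over
all candidate cut-points, which is what keeps the per-node split cost low.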
Concept Drifting Detection
In this subsection, we first introduce several basic concepts relevant to
concept drift.
Definition 1. A concept signifies either a stationary distribution of class
labels in a set of instances of the current data streams, or a similar
distribution rule over the attributes in the given instances.
According to the divergence of concept drifting patterns, the change modes of a
concept can be divided into three types: concept drift, concept shift and
sampling change, as described in [15].
Definition 2. Concept drift and concept shift are patterns with distinct speeds
of change in the attribute values or class labels of the database: the first
refers to gradual change, while the second indicates rapid change.
Definition 3. Sampling change is mostly attributed to a change of pattern in
the distribution of class labels. (In this paper, all such changes are referred
to as concept drifts.)
In CDRDT, detection of distribution changes in the streaming data is performed
after a data chunk has traversed all of the random decision trees. The various
types of concept drift are distinguished from noise by means of the relation
between the difference of the average classification error rates at the leaves
and the specified thresholds. Here, the thresholds are specified via the
Hoeffding bound inequality, described as follows. Consider a real-valued random
variable r whose range is R. Suppose we have made n independent observations of
this variable and computed their mean r̄. Then, with probability 1 − δ, the
true mean of the variable is at least r̄ − ε:

P(r ≥ r̄ − ε) = 1 − δ,  ε = √(R² ln(1/δ) / (2n))   (1)

where R is defined as log(M(classes)) and M(classes) denotes the number of
class labels in the current database; n refers to the size of the current
streaming data chunk; and the random variable r specifies the expected error
rate, classified by the Naïve Bayes or majority-class method at the leaves,
over all random decision tree classifiers in CDRDT. Suppose the target object
of r̄ is the historical classification result on the i-th chunk (denoted ēf)
and the current observation is the estimated classification result on the
(i+1)-th chunk (denoted ēs). The detailed definition of ēf (ēs) is formalized
below.
ēf (ēs) = (1/N) · Σ_{k=1}^{N} [ Σ_{i=1}^{M_leaf^k} (p_ki · n_ki) / Σ_{i=1}^{M_leaf^k} n_ki ]   (2)

In this formula, N signifies the total number of trees; M_leaf^k refers to the
number of leaves of the k-th classifier; n_ki is the count of instances at the
i-th leaf of classifier CTk; and p_ki is the error rate estimated with the 0–1
loss function at the i-th leaf of CTk. In terms of Formula (2), we utilize the
difference between ēs and ēf (i.e., Δe = ēs − ēf) to discover the distribution
changes of the class labels. More specifically, if the value of Δe is
nonnegative, a potential concept drift is taken into account; otherwise, the
case is regarded as free of any concept drift.
This follows from statistical theory, which guarantees that for a stationary
distribution of instances the online error of Naïve Bayes will decrease, while
when the distribution of the instances changes, the online error of Naïve Bayes
at the node will increase [16]. For classification with the majority-class
method, a similar rule can be concluded from the distribution changes of class
labels over small chunks of streaming data, provided the chunks contain
sufficient instances. (In this paper, the minimum size of a data chunk, denoted
nmin, is set to 0.2k, where 1k = 1000 instances; this follows the conclusion in
[22].) This is also verified by our experiments on the tracking of concept
drifts in Section 4. Hence, Eq. (1) can be transformed into Eq. (3):

P(ēs − ēf ≤ ε0) = 1 − δ0,  ε0 = √(R² ln(1/δ0) / (2n))   (3)
To distinguish the diverse concept drifts from noise, it is necessary to
specify different values of ε0 to partition their bounds, which refer to the
tolerable bounds of deviation between the current error rate and the reference
error rate. Evidently, the larger the value of ε0, the higher the drifting
likelihood: it is more probable that the previous model no longer fits the
current data streams owing to its deficient classification accuracy.
Correspondingly, the value of δ0 decreases while the confidence 1 − δ0
increases. Therefore, inspired by [8], two thresholds, Tmax and Tmin, are
defined in the Hoeffding bound inequality to control the deviation of the
classification error rates. Considering the demand on the predictive ability of
the current models, their values are specified as follows:

P(ēs − ēf ≤ Tmax) = 1 − δmin,  Tmax = 3ε0,  δmin = 1/exp(Tmax² · 2n / R²)   (4)

P(ēs − ēf ≤ Tmin) = 1 − δmax,  Tmin = ε0,  δmax = 1/exp(Tmin² · 2n / R²)   (5)
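To make Eqs. (3)–(5) concrete, the following small sketch computes ε0, Tmin,
Tmax and δmin from a chunk size and a class count (the parameter values in the
call are illustrative, not taken from the paper's experiments):

```python
import math

def hoeffding_thresholds(n, num_classes, delta_max=0.1):
    """Compute the drift thresholds of Eqs. (3)-(5) (sketch).
    n: size of the current data chunk; num_classes: M(classes);
    R = log(M(classes)); Tmin = eps0 and Tmax = 3*eps0 as in the paper."""
    R = math.log(num_classes)
    # Eq. (3)/(5): eps0 follows from delta_max = exp(-eps0^2 * 2n / R^2)
    eps0 = math.sqrt(R * R * math.log(1.0 / delta_max) / (2.0 * n))
    t_min, t_max = eps0, 3.0 * eps0
    delta_min = math.exp(-(t_max ** 2) * 2.0 * n / (R * R))  # Eq. (4)
    return t_min, t_max, delta_min

t_min, t_max, delta_min = hoeffding_thresholds(n=200, num_classes=2)
```

Note how a larger chunk (larger n) shrinks both thresholds, making the detector
more sensitive to small shifts in the average leaf error rate.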
Adaptation to Concept Drifts Contaminated by Noise
In accordance with the analysis above and the threshold definitions in Eqs. (4)
and (5), four concept drifting states are partitioned: non-concept drift,
potential concept drift, plausible concept drift and true concept drift.
Namely, if the value of Δe is negative, the state is a non-concept drift;
otherwise, one of the other three drift states applies. More precisely, if the
value of Δe is less than Tmin, a potential concept drift is considered
(potential indicates that a slower or much slower concept drift is probably
occurring). If it is greater than Tmax, a true concept drift is assumed,
resulting either from a potential concept drift or from an abrupt concept
drift. Otherwise, the state is attributed to a plausible concept drift, which
accounts for the effect of noise contamination and spans the transition
interval between a potential and a true concept drift. This fuzzy status helps
to reduce the impact of noise in the data streams and to avoid over-sensitivity
to concept drifts.
Correspondingly, different strategies are adopted to handle the various types
of concept drift. More specifically, for a non-concept drift, the size of the
current data chunk is maintained at its default value (e.g., nmin). For a
potential concept drift, the chunk size is increased by mmin instances (e.g.,
mmin = nmin = 0.2k). For a plausible concept drift, the streaming data chunk
size and the check period are each shrunk by one third, because the change in
the data streams must be observed further before a deterministic drift type can
be decided. For a true concept drift, the sizes are reduced to half of their
original values. To counter the disadvantages of streaming data chunks that are
too large or too small, a maximum bound (e.g., mmax = 10·nmin) and a minimum
bound (e.g., mmin) are specified to control the change magnitude of a data
chunk for better adaptation to the concept changes: once a bound is reached,
the check period remains invariant until a new concept drift occurs.
Furthermore, to improve the utility of each tree, those sub-branches whose
classification error rates are lower than the average level (e.g., 50%) are
pruned.
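The four drift states and the corresponding chunk-size adjustments described
above can be sketched as follows (a simplified illustration with our own
function names; the bounds m_min and m_max follow the examples in the text,
with 0.2k = 200 instances):

```python
def classify_drift(delta_e, t_min, t_max):
    """Map the error-rate difference to one of the four drift states."""
    if delta_e < 0:
        return "none"          # non-concept drift
    if delta_e < t_min:
        return "potential"     # slow drift probably occurring
    if delta_e > t_max:
        return "true"          # confirmed (or abrupt) drift
    return "plausible"         # between Tmin and Tmax: possibly noise

def adjust_chunk_size(size, state, m_min=200, m_max=2000):
    """Adapt the check period (chunk size) to the detected state,
    clipped to [m_min, m_max] as in the paper (sketch)."""
    if state == "none":
        new = size             # keep the current (default) size
    elif state == "potential":
        new = size + m_min     # grow to watch the slow drift
    elif state == "plausible":
        new = size - size // 3 # shrink by one third to observe further
    else:                      # "true" concept drift
        new = size // 2        # halve for a quick reaction
    return max(m_min, min(m_max, new))
```

For example, a 600-instance chunk becomes 800 under a potential drift, 400
under a plausible one and 300 under a true one, never leaving [m_min, m_max].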
3.2 Analysis
Generation Error Rate for Concept Drifting Data Streams
According to the generation-error theorem analyzed in [17], as the number of
trees increases, for almost surely all sequences Θ1, ..., the generation error
PE converges to

P_{X,Y}( P_Θ(h(X,Θ) = Y) − max_{j≠Y} P_Θ(h(X,Θ) = j) < 0 )   (6)

where X is the training set, Y denotes the class label, Θ specifies the random
feature vector generated from the attribute set, P_{X,Y} indicates the
probability over the X,Y space and h(X,Θ) refers to the classifier. Eq. (6)
rests on the assumption that the sequences Θ1, ... are independent identically
distributed random vectors. However, for concept drifting data streams, it is
probable that the streaming data distribution does not remain uniform as time
flows; as a result, it is improper to judge the convergence of the generation
error in this case. Therefore, in the analysis of our ensemble model, we give
an infimum bound of the generation error. Because each concept drift detection
is performed after a certain number of instances, the training data are divided
into small sequences θt (t ∈ {1, 2, ..., |B|}, where |B| is the maximum
sequence index). The generation error of our model on a chunk θt can be
expressed as:

PE_{θt}^{Tt} = P_{θt}( V(Tt, θt) = Y ) − max_{j≠Y} P_{θt}( V(Tt, θt) = j )   (7)

where Tt specifies the current decision tree ensemble, each tree being
generated or updated with the data chunks {θk, 1 ≤ k ≤ t}, and V(·) signifies
the voting function acting on the data chunk θt classified by the current
ensemble of random decision trees. Considering the worst case, the generation
error is defined in Formula (8):

PE = max_{(Tt,θt)} P_{(Tt,θt)}( PE_{θt}^{Tt} < 0 ) ≥ P_{(Tt,θt)}( PE_{θt}^{Tt} < 0 )   (8)

Since max_{j≠Y} P_{θt}(V(Tt,θt) = j) ≤ 1 − P_{θt}(V(Tt,θt) = Y), Eq. (8) can be
written as Eq. (9):

PE ≤ P_{(Tt,θt)}( P_{θt}(V(Tt,θt) = Y) ≤ 0.5 )   (9)

Based on the analysis of the probability of the optimal ensemble model in [14],
i.e., P(M(Attr), N, h0) = 1 − (1 − 1/M(Attr))^{N·(2^{h0} − 1)}, we take it as an
estimate of the classification accuracy. Therefore, the generation error
referring to the probability P(M(Attr), N, h0) ≤ 0.5 can be formalized as
Eq. (10):

PE ≤ P( P(M(Attr), N, h0) ≤ 0.5 )   (10)

It clearly shows that the higher the optimal probability of an ensemble model,
the lower the generation error rate. As a consequence, we can adjust the number
of trees or the heights of the trees to improve the predictive accuracy for
adaptation to concept drifts.
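As a quick numeric illustration of the quantity appearing in Eq. (10), the
optimal-ensemble probability from [14] can be evaluated directly (a sketch; the
exponent is our reading of the formula above, and the parameter values are
illustrative):

```python
def optimal_ensemble_probability(m_attr, n_trees, h0):
    """P(M(Attr), N, h0) = 1 - (1 - 1/M(Attr))^(N * (2^h0 - 1)),
    the probability of the optimal ensemble model from [14] (sketch)."""
    return 1.0 - (1.0 - 1.0 / m_attr) ** (n_trees * (2 ** h0 - 1))

# raising the number of trees (or their height) raises this probability,
# which by Eq. (10) tightens the bound on the generation error
p_small = optimal_ensemble_probability(m_attr=50, n_trees=10, h0=5)
p_more_trees = optimal_ensemble_probability(m_attr=50, n_trees=20, h0=5)
```

This monotonicity is exactly the lever the paper mentions: tuning N or h0
trades extra model size for a lower generation-error bound.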
4 Experiments
To verify the efficiency and effectiveness of CDRDT in detecting different
types of concept drift in noisy data streams, extensive experiments are
conducted on diverse benchmark concept-drifting databases and on real streaming
data obtained from the Yahoo! Shopping web service. The experimental study
shows that CDRDT not only detects concept changes timely and effectively with a
certain resilience to noise, but also outperforms CVFDT and MSRT in runtime,
space and predictive accuracy. This section is accordingly divided into two
parts: the first discusses the characteristics of all concept-drifting
databases used in our experiments, and the second analyzes the drift tracking
in CDRDT and the performance in runtime, space and predictive accuracy. (All
experiments reported here were performed on a P4 3.00 GHz PC with 1 GB main
memory running Windows XP Professional; all algorithms used in our experiments
are written in Visual C++.) Owing to limited space, only partial experimental
results are given below.
4.1 Data Source
Synthetic Data
HyperPlane. HyperPlane is a benchmark database of data streams with gradual
concept drift, which has been used in numerous references, including [10, 3,
11, 2]. A hyperplane in a d-dimensional space (d = 50) is denoted by the
equation Σ_{i=1}^{d} w_i·x_i = w_0. Each vector of variables (x1, x2, ..., xd)
in this database is a randomly generated instance, uniformly distributed in the
multidimensional space [0, 1]^d. If Σ_{i=1}^{d} w_i·x_i ≥ w_0, the class label
is 1; otherwise it is 0. The bound of each coefficient w_i is limited to [−10,
10]. The initial value of each weight w_i is generated at random; it then
increases or decreases continuously by Δw_i = 0.005 until it reaches the upper
or lower boundary, and then changes direction with probability p_w = 10%.
Meanwhile, in order to simulate concept drift, we select 5 dimensions whose
weights change, and introduce a noise rate of r = 10% into the database.
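A compact sketch of such a hyperplane stream generator (our own simplified
reading of the description above: reduced dimension for brevity, and the weight
direction always flips at the boundary rather than with probability p_w; the
threshold choice is illustrative):

```python
import random

def hyperplane_stream(n_instances, d=5, n_drift_dims=2, delta_w=0.005,
                      noise_rate=0.10, seed=1):
    """Yield (x, label) pairs from a gradually drifting hyperplane (sketch).
    Weights of the first n_drift_dims dimensions move by delta_w per
    instance, bouncing inside [-10, 10]; labels are flipped with
    probability noise_rate to simulate class noise."""
    rng = random.Random(seed)
    w = [rng.uniform(-10, 10) for _ in range(d)]
    w0 = sum(w) / 2.0                    # illustrative threshold choice
    direction = [1.0] * n_drift_dims
    for _ in range(n_instances):
        x = [rng.random() for _ in range(d)]           # uniform in [0,1]^d
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= w0 else 0
        if rng.random() < noise_rate:
            label = 1 - label                          # class noise
        yield x, label
        for i in range(n_drift_dims):                  # gradual drift
            w[i] += direction[i] * delta_w
            if abs(w[i]) >= 10:
                direction[i] = -direction[i]
```

Because the weights move a little after every instance, the decision boundary
drifts gradually rather than shifting abruptly between chunks.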
SEA. The artificial SEA data, first described in [1], is a well-known
concept-shift data set with numerical attributes only. It is composed of 60k
random points in a three-dimensional feature space with two classes. All three
features take values between 0 and 10, but only the first two features are
relevant. The points are divided into four chunks with different concepts. In
each chunk, a data point belongs to class 1 if f1 + f2 ≤ θ, where f1 and f2
represent the first two features and θ is a threshold value between the two
classes. In this database, four thresholds, 8, 9, 7 and 9.5, divide the data
chunks. Each chunk reserves 2.5k records as test sets containing 10% class
noise for the different concepts, and the remaining 50k points are treated as
training data, in which a concept shift appears every 12.5k instances.
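Following the description above, a SEA-style chunk generator can be sketched as
follows (a hypothetical helper, not the original SEA code; seed and sizes are
illustrative):

```python
import random

def sea_chunk(n_points, theta, noise_rate=0.0, seed=7):
    """Generate one SEA concept chunk: three features uniform in [0, 10],
    class 1 iff f1 + f2 <= theta; optional class-label noise (sketch)."""
    rng = random.Random(seed)
    chunk = []
    for _ in range(n_points):
        f = [rng.uniform(0.0, 10.0) for _ in range(3)]  # f[2] is irrelevant
        label = 1 if f[0] + f[1] <= theta else 0
        if rng.random() < noise_rate:
            label = 1 - label
        chunk.append((f, label))
    return chunk

# four concepts with the thresholds from the paper's setup
concepts = [sea_chunk(12500, th) for th in (8, 9, 7, 9.5)]
```

Switching θ between chunks produces the abrupt concept shifts that the
Drift-Track curves in Section 4 are expected to pick up.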
KDDCup99. The KDDCup99 database [20] is a database for network intrusion
detection, selected here because it has been simulated as streaming data with
sampling change in [15]. The database has 41 attribute dimensions, of which 34
are numerical, and 24 class labels in total. Owing to the skewed distribution
of class labels, the data of the minor classes (i.e., classes whose total count
is lower than nmin) are treated as noise. Hence, the noise-free data set
contains 490k instances with 12 class labels.
Real Data
Yahoo! Shopping Data. The web shopping data used in our experiments are
obtained via the interface of Yahoo! web services. They are sampled from the
Yahoo! shopping databases relevant to catalog listing, product search and
merchant search. The basic feature set of this data contains 17 attribute
dimensions, 10 of them numerical. It is composed of product information with
the attribute set (NumRatingofProduct, AverageRating, etc.) and related
merchant information with the attributes (NumRatingsofMerchant,
Price-SatisfactionRating, OverallRating, etc.). The correlation between a
product and a merchant is established by the catalog listing (see [21] for more
details). To mine the relation between the credibility of merchants and its
possible factors, the attribute OverallRating with its different scores is
defined as our class attribute, divided into five class labels. By the label
distribution, we randomly extract 84k instances from the obtained records as a
training set and the remaining 28k instances as a test set.
4.2 Experimental Evaluation on Synthetic Databases
Before presenting the experimental results on the synthetic databases, several
symbols used in our experiments are defined in the table below. On the setting
of parameters, those of CVFDT and MSRT follow their original definitions in
[10, 11] respectively, while for CDRDT the parameter values are specified as
follows: N = 10, h0 = M(attr)/2 as defined in [22], an initial streaming data
chunk size of |Sj| = 0.2k (i.e., |CP|), MC = 500k, and δmax = 0.1 (ε0 and δmin
are calculated by Eqs. (5) and (4) respectively). The experimental details are
described below.

Symbol — Description:
- Max/Bayes: the classification method of majority-class / Naïve Bayes.
- ***-Max/Bayes: "***" refers to the name of a database; the whole symbol
  denotes that the experimental results are classified by the method of
  Max/Bayes.
- Error rate: the error rate estimated on a test set or on a specified data
  chunk, unit: (%). For test sets, the value is averaged over 20 runs
  classified by 20·N trees.
- (T+C) time: the training + test time, unit: (s). Algorithms based on
  ensemble decision trees run in a simulated parallel environment of multiple
  PCs, so the time overhead is taken as the largest among the N trees; for
  CDRDT, however, the "T" time is the total generation time of the N trees.
- Memory: the total memory consumption of all trees, unit: (M). All results
  are averaged over 20 runs, as for Error rate.
- Database name: structure: name + training-data size + test-data size +
  database type (C: numerical, D: discrete, CD: hybrid) + number of attribute
  dimensions, e.g., SEA-50K-2.5K-C-3.
- Drift-Track: the process of detecting concept drifts in a certain
  time interval.
- Drift-Level: for KDDCup99, it refers to the index of the different class
  labels, marked as Class label; for HyperPlane, it stands for the difference
  rate of class labels compared with the original database after introducing
  noise, marked as Drift-rate, unit: (%). The unified name is Drift Level.
- Period-Change: short for the size of a data chunk in concept drifting
  detection, i.e., the count of instances for each detection, unit: (k).
Tracking Concept Drifts
The tracking curves of concept drifts are plotted in Figures 1–4, which present
various detection cases on databases with different drifting characteristics.
In these figures, the Drift-Track curves are drawn in solid lines against the
scale of the left y-axis and the Drift-Level curves in dotted lines against the
scale of the right y-axis. The changing values of Period-Change are represented
by plus signs ("+") in the corresponding figures; their scale is specified as
follows: the lowest value starts from 1k and the basic unit (denoted BU) is set
to 0.5k, based on the scale of the left y-axis. In Figure 1-a, due to the
different magnitudes of Drift-Level, the corresponding tracking curve
fluctuates variously. Especially at the beginning of detection, larger
discrepancies appear owing to the insufficiency of training data. Furthermore,
when the current drifting level transfers to another, a jump appears in the
tracking curve from a local minimum point to a local maximum one. However,
owing to the gradual concept drifts in HyperPlane, little deviation occurs
between adjacent detection results; as a result, most of the check periods
remain stable. Moreover, as the streaming data increase, the fluctuation trend
gradually converges.
Fig. 1. (left) Drift track on HyperPlane; (right) Classification results
Fig. 2. Drift track over sequential data chunks for SEA
Considering the concept shifting detection on SEA, the tracking results
fluctuate with three shifting occurrences, as shown in Figure 2. Comparing the
detection cases under the Max and Bayes methods, several common
characteristics emerge: i) the fluctuations are frequent at the beginning of the training
even without any concept change, similar to the case of HyperPlane; ii) the
upward and downward trends in the curves alternate with the shifting of the
concepts. This mainly results from the distribution changes of the only two
class labels in this database, which contains four concepts in total, varying
in ratio from 1.8:1 to 0.85:1. (In this figure, both the minimum period and the
unit of period refer to 0.2k instances, taking the left y-axis as the scale
axis.)
With respect to the detection on the sampling change, the tracking curves
drawn in Figures 34 describe the detection results on KDDCup99 in the classi-
fication methods of Max and Bayes respectively. In the observations, we can see
that the more the frequent changes in the distribution of class labels the fiercer
the fluctuations occur. It is demonstrated noticeably in the curve segment span-
ning the interval from the 1st data chunk to the 121th one (For KDDCup99,
the default value of check period is set to 1k and the unit of BU is 0.2k. In
Figure 3, the distance of two basic units in the left y-axis signifies one BU for
a conspicuous description while a unit refers to one BU in Figure 4.). However,
if the stable distribution of a class label is reached, the trend of tracking curves
will fall down, such as during the curve segment between the 121th chunk and
Concept Drifting Detection on Noisy Streaming Data 247
Fig. 3. Drift track over sequential data chunks classified with Max for KDDCup99
Fig. 4. Drift track over sequential data chunks classified with Bayes for KDDCup99
the 321st one. Hence, the check period remains unchanged, since this case is
regarded as a state of non-concept drift.
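The dynamic adjustment of the check period described here (shrinking it when a drift is suspected, holding or restoring it in a state of non-concept drift) can be sketched as follows. CDRDT's exact update rule is not given in this section, so the halving-and-doubling policy, the default period, and the floor value below are purely assumed illustrations:

```python
def adjust_check_period(period, drift_detected, default_period=1000, min_period=200):
    """Illustrative policy: shrink the check period when a drift is flagged,
    otherwise grow it back toward the default (state of non-concept drift)."""
    if drift_detected:
        return max(min_period, period // 2)
    return min(default_period, period * 2)

period = 1000                                   # default check period (1k)
period = adjust_check_period(period, True)      # drift flagged -> 500
period = adjust_check_period(period, False)     # stable again  -> 1000
```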
Predictive Accuracy and Overheads of Time and Space
Besides the drift detection on the training data, we also evaluate the classification
abilities of CDRDT on different test sets in comparison to CVFDT
and MSRT. Concretely, first, the classification results on HyperPlane are plotted
in Figure 1 (right), including the overheads of runtime and space and the mean error
rates of classification together with their variances. It is clear that CDRDT outperforms
the other two algorithms on all the abilities involved above. Second, the classification
results on SEA and KDDCup99 are summarized in Tables 1 and 2, respectively. As
shown in Table 1, although under the Max classification method CDRDT does not
perform as well as the other algorithms on predictive accuracy, the error rate
can be reduced by at least 6.15% when adopting the Bayes method. Furthermore,
as on HyperPlane, its overheads of space and runtime are the lowest, with the
largest deviation reaching dozens of times. Moreover, in
Table 2, the predictive accuracy of CDRDT is improved substantially, by 14.76% and
36.61% under Max as compared with CVFDT and MSRT, respectively. Meanwhile,
with the Bayes method, the superiority of CDRDT in predictive accuracy
is also prominent, and its performance on both runtime
and space consumption is much better as well.
Table 1. Classification results on SEA-50k-2.5k-C-3

Algorithm   Error rate (%)                       (T+C) time (s)    Memory (M)
            Max              Bayes               Max      Bayes    Max    Bayes
            mean   variance  mean   variance
CDRDT       44.32  3.762     13.53  1.354        0+0      0+0      <1     5
MSRT        25.98  2.770     19.66  2.344        0+0      0+0      7      5
CVFDT       25.24  /         /      /            8+0      /        41     /
Table 2. Classification results on KDDCup99-490k-310k-CD-41

Algorithm   Error rate (%)                       (T+C) time (s)    Memory (M)
            Max              Bayes               Max      Bayes    Max    Bayes
            mean   variance  mean   variance
CDRDT       8.87   0.169     9.06   0.488        32+19    41+123   6      7
MSRT        44.48  22.350    28.60  19.896       93+18    92+504   65     5
CVFDT       23.48  /         /      /            75+18    /        20     /
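The mean error rates and variances reported in Tables 1 and 2 are per-chunk statistics over the sequential test sets. A minimal sketch of how such figures are typically computed follows; the chunking scheme is a generic assumption, not a reproduction of CDRDT's evaluation code:

```python
def error_stats(predictions, truths, chunk_size):
    """Mean and variance of per-chunk error rates, in percent."""
    rates = []
    for i in range(0, len(truths), chunk_size):
        p = predictions[i:i + chunk_size]
        t = truths[i:i + chunk_size]
        errors = sum(1 for a, b in zip(p, t) if a != b)
        rates.append(100.0 * errors / len(t))
    mean = sum(rates) / len(rates)
    variance = sum((r - mean) ** 2 for r in rates) / len(rates)
    return mean, variance
```

Population variance is used here for simplicity; the paper does not state which variance estimator its tables report.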
4.3 Experimental Evaluation on Web-Shopping Data
For real-world data streams, it is hard to judge whether the current data
carry a potential concept drift or when a concept drift occurs; meanwhile,
the effect of noise is inevitable. Therefore, to verify the
feasibility and utility of our algorithm, we conduct several comparison experiments
with other algorithms on the real data streams of Yahoo! shopping data as
well. The experimental results listed in Table 3 show that CDRDT is superior
to CVFDT and MSRT in predictive accuracy and in the overheads of runtime
and space. For instance, the predictive accuracy of CDRDT is the highest,
improved by 9.32% on average even in the worst case. Its training
time is reduced by a factor of at least 1.5, with an approximately equal overhead
on test time, in contrast to the other two algorithms. In addition, regarding space
consumption, CDRDT's maximum is only half that of
CVFDT, while its minimum is only 1/25 that of MSRT.
Table 3. Classification on Yahoo!-shopping-data-84k-28k-CD-16

Algorithm   Error rate (%)                       (T+C) time (s)    Memory (M)
            Max              Bayes               Max      Bayes    Max    Bayes
            mean   variance  mean   variance
CDRDT       14.02  5.545     4.45   1.936        2+1      6+5      <1     6
MSRT        33.52  17.891    44.42  39.166       11+1     15+20    30     5
CVFDT       23.34  /         /      /            10+3     /        13     /
5 Conclusion
In this paper, we have proposed an ensemble classification algorithm named
CDRDT for Concept Drifting detection from noisy data streams, which is based
on Random Decision Trees. In contrast to previous efforts on ensemble classifiers
of decision trees or random decision trees, CDRDT adopts small data chunks of
variable sizes to generate the random decision tree classifiers incrementally.
To effectively distinguish different types of concept drifts from noise,
two thresholds are defined by virtue of the Hoeffding bound inequality. Furthermore,
for better adaptation to concept drifts, the check period is adjusted
dynamically and in a timely manner. Moreover, extensive experiments are conducted on three
types of synthetic concept drifting databases and a real-world database of Yahoo!
shopping data. The experimental results demonstrate that CDRDT
adapts to various concept drifts in noisy data streams efficiently. In addition,
compared with the state-of-the-art algorithm CVFDT and the ensemble
algorithm MSRT, it performs better in runtime, space, and
predictive accuracy. Hence, we conclude from this study that CDRDT
is a lightweight ensemble classification algorithm that provides an
efficient method for detecting a variety of concept drifts in data streams.
However, how to model noisy data so as to discern concept drifts from noise
accurately, and how to deal with skewed class-label distributions
in data streams, remain challenging and interesting issues for our future
work.
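For reference, the Hoeffding bound mentioned above states that the mean of n independent observations with value range R deviates from its expectation by more than epsilon = sqrt(R^2 ln(1/delta) / (2n)) with probability at most delta. A sketch of how such a bound yields a noise-versus-drift threshold follows; the values of R, n, and delta are hypothetical, not the thresholds CDRDT actually uses:

```python
import math

def hoeffding_bound(value_range, n, delta):
    """Deviation bound epsilon for the mean of n observations whose
    values span `value_range`, holding with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# With error rates in [0, 1], 200 instances per chunk, and delta = 1e-4,
# a rise in observed error rate beyond epsilon is unlikely to be mere noise.
epsilon = hoeffding_bound(1.0, 200, 1e-4)   # roughly 0.15
```

Note that the bound shrinks as n grows, so larger chunks allow smaller drifts to be distinguished from noise at the same confidence level.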
References
1. Street, W., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification.
In: 7th ACM SIGKDD international conference on Knowledge Discovery
and Data mining, KDD 2001, pp. 377–382. ACM Press, New York (2001)
2. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining Concept-Drifting Data Streams
Using Ensemble Classifiers. In: 9th ACM SIGKDD international conference on
Knowledge Discovery and Data mining, KDD 2003, pp. 226–235. ACM Press, New
York (2003)
3. Fan, W.: Streamminer: a classifier ensemble-based engine to mine concept-drifting
data streams. In: 30th international conference on Very Large Data Bases, VLDB
2004, pp. 1257–1260. VLDB Endowment (2004)
4. Chu, F., Wang, Y., Zaniolo, C.: An adaptive learning approach for noisy data
streams. In: 4th IEEE International Conference on Data Mining, pp. 351–354.
IEEE Computer Science, Los Alamitos (2004)
5. Gama, J., Fernandes, R., Rocha, R.: Decision trees for mining data streams. Intel-
ligent Data Analysis 10, 23–45 (2006)
6. Scholz, M., Klinkenberg, R.: Boosting Classifiers for Drifting Concepts. Intel-
ligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data
Streams 11(1), 3–28 (2007)
7. Hoeffding, W.: Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association 58(301), 13–30 (1963)
8. Castillo, G., Gama, J., Medas, P.: Adaptation to Drifting Concepts. In: Pires, F.M.,
Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 279–293. Springer,
Heidelberg (2003)
9. Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.Y.: BOAT: Optimistic decision tree
construction. In: 1999 ACM SIGMOD International Conference on Management
of Data, pp. 169–180. ACM Press, New York (1999)
10. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In:
7th ACM SIGKDD international conference on Knowledge Discovery and Data
mining, KDD 2001, pp. 97–106 (2001)
11. Li, P., Hu, X., Wu, X.: Mining concept-drifting data streams with multiple semi-
random decision trees. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X.
(eds.) ADMA 2008. LNCS, vol. 5139, pp. 733–740. Springer, Heidelberg (2008)
12. Ho, T.K.: Random decision forests. In: 3rd International Conference on Document
Analysis and Recognition, pp. 278–282. IEEE Computer Society, Los Alamitos
(1995)
13. Abdulsalam, H., Skillicorn, D.B., Martin, P.: Classifying Evolving Data Streams
Using Dynamic Streaming Random Forests. In: Bhowmick, S.S., Küng, J., Wagner,
R. (eds.) DEXA 2008. LNCS, vol. 5181, pp. 643–651. Springer, Heidelberg (2008)
14. Hu, X., Li, P., Wu, X., Wu, G.: A semi-random multiple decision-tree algorithm for
mining data streams. Journal of Computer Science and Technology 22(5), 711–724
(2007)
15. Yang, Y., Wu, X., Zhu, X.: Combining Proactive and Reactive Predictions for Data
Streams. In: 11th ACM SIGKDD international conference on Knowledge Discovery
in Data mining, KDD 2005, pp. 710–715. ACM Press, New York (2005)
16. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997)
17. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
18. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers
Inc., San Francisco (1993)
19. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data
mining. In: 22nd International Conference on Very Large Data Bases, VLDB 1996,
pp. 544–555. Morgan Kaufmann, San Francisco (1996)
20. KDDCUP 1999 DataSet,
http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html
21. Yahoo! Shopping Web Services, http://developer.yahoo.com/everything.html
22. Li, P., Liang, Q., Wu, X., Hu, X.: Parameter Estimation in Semi-Random Decision
Tree Ensembling on Streaming Data. In: Theeramunkong, T., Kijsirikul, B., Cer-
cone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 376–388.
Springer, Heidelberg (2009)
23. Wikipedia, http://en.wikipedia.org/wiki/Data_stream
24. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees.
Neural Computation 9, 1545–1588 (1997)
25. Ho, T.K.: The random subspace method for constructing decision forests. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)