Unsupervised Virtual Drift Detection
Method in Streaming Environment
Supriya Agrahari and Anil Kumar Singh
Abstract Real-time applications generate enormous amounts of data whose distribution can potentially change. Such an underlying change in the data distribution over time causes concept drift. The learning model of the data stream encounters concept drift problems while predicting patterns, which leads to deterioration in the learning model's performance. High-dimensional data creates the additional challenge of memory and time requirements. The proposed work develops an unsupervised concept drift detection method to detect virtual drift in non-stationary data. The K-means clustering algorithm is applied to the relevant features to find the stream's virtual drift. The proposed work reduces complexity by detecting drifts using the k highest-score features, making it suitable for high-dimensional data. Here, we analyze the data stream's virtual drift by considering the changes in data distribution between the recent and current window data instances.
Keywords Data stream mining · Concept drift · Clustering · Learning model · Adaptive model
1 Introduction
The data stream is a continuous flow of data instances. Such sequences of data instances are produced by various applications, such as cyber security, industrial production, weather forecasting, and human daily activity data [1]. Data streams are characterized by vast volumes of data generated at high frequencies. The data stream mining learning model performs predictive analysis of data samples. Due to the dynamic behavior of the data stream, however, accurate prediction becomes difficult for a single learning model: the training samples are insufficient to capture the complexity of the problem space, which degrades the learning model's accuracy.
S. Agrahari (B)·A. K. Singh
Motilal Nehru National Institute of Technology Allahabad, Prayagraj, India
e-mail: supriyagrahari@gmail.com
A. K. Singh
e-mail: ak@mnnit.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. Tistarelli et al. (eds.), Computer Vision and Machine Intelligence,
Lecture Notes in Networks and Systems 586,
https://doi.org/10.1007/978-981-19-7867-8_25
In the streaming environment, changes in the data distribution are considered concept drift, which negatively impacts the learning model's performance. There is therefore a need to handle concept drift while performing prediction on the data stream. A drift detector is generally coupled with a prediction (or learning) model [2]. The detector raises a signal for the prediction model whenever changes in the environment are detected. The current prediction model is then discarded, and the model adapts to the knowledge present in the recent data instances to cope with the stream's drifts.
A data stream is a sequence of data examples (or instances) that arrive online at varying speed. There is a possibility of changes in feature or class definitions compared with past knowledge (i.e., concept drift) [3]. In practice, streaming data has limited, delayed, or even no labels. This happens because labels for incoming data instances cannot be obtained quickly, for reasons such as the high labeling cost in real-world applications. This limits many concept drift detectors, such as distribution-based, error-rate-based, and statistical-based methods [4].
Several drift detectors need labeled data to observe the model’s performance,
which helps to identify whether the model is obsolete. However, the labeled data
in a streaming environment is not available in several applications, and providing
a label is a cost-ineffective process. In such cases, supervised learning is no longer
efficient. Hence, the unsupervised method offers a more practical approach in the
streaming environment. On the other hand, the high-dimensional data creates high
computational complexity, so the relevant feature selection is a way to provide an
efficient process of model building.
We propose an unsupervised drift detection method that detects the drifts based on
significant differences between the recent and current window data distribution. Drift
detection is performed using the relevant features identified by the chi-square test.
Dimensionality reduction of the feature space directly results in a faster computation
time [5]. K-means clustering is also performed on the data stream because it is the
most popular and simplest clustering algorithm. The significant contributions of the
paper are as follows:
– We propose an unsupervised drift detection method in the streaming environment.
– The proposed work performs drift detection efficiently with no labeling requirement.
– The experiments are performed with the k highest-score features, minimizing the computational complexity of the learning model.
The paper is organized as follows: Section 2 discusses several existing drift detection methods. Section 3 gives a detailed description of the proposed work with its pseudo-code and workflow diagram. Section 4 contains the experimental evaluation, case study, and results. Section 5 describes the conclusion and future work.
2 Related Work
In this section, we briefly discuss previous research related to concept drift. There are two categories of concept drift detection methods: supervised and unsupervised. Supervised approaches assume that the true labels (or target values) of incoming data instances are available after prediction, so they generally use the error or prediction accuracy of the learning model as the main input to their detection technique, whereas unsupervised approaches do not require labels. In this section, we emphasize unsupervised concept drift detection approaches.
de Mello et al. [6] use Statistical Learning Theory to provide a theoretical framework for drift detection. They develop the Plover algorithm, which detects drift using statistical measures such as mean, variance, kurtosis, and skewness, and utilize power spectrum analysis to measure the data frequencies and amplitudes used for drift detection. SOINN+ [7] is a self-organizing incremental neural network with a forgetting mechanism. From the incoming data instances, it learns a topology-preserving mapping to a network structure and can represent clusters of arbitrary shapes in noisy data streams. Souza et al. [3] present an Image-Based Drift Detector (IBDD) for high-dimensional and high-speed data streams, which detects drifts based on pixel differences. Huang et al. [5] present a new unsupervised feature selection technique for handling high-dimensional data.
OCDD is an unsupervised One-Class Drift Detector [8]. It uses a sliding window and a one-class classifier to identify changes in concepts; drift is detected when the ratio of false predictions exceeds a threshold. Pinto et al. [9] present SAMM, an automatic model monitoring system. It is a time- and space-efficient unsupervised streaming method that generates alarm reports along with a summary of the events, and it provides information about important features to explain the concept drift detection scenario.
DDM [10], ADWIN [11], ECDD [12], SEED [13], SEQDRIFT2 [14], STEPD [15], and WSTD1W [16] are compared with the proposed approach in terms of classification accuracy.
3 Proposed Work
In the streaming environment, the learning model requires processing data instances
as fast as the data becomes available because there is limited memory to process it.
In this regard, an online algorithm can process data sequentially or form a window
for computation to work well with limited memory and time.
In the proposed work, the data stream is defined as $D_s = [x_i, x_{i+1}, \ldots, x_{i+n}, \ldots]$, where $D_s$ is a $d$-dimensional data matrix and each $x$ represents a feature vector at a different timestamp. A change in the data distribution ($P$) between different timestamps $t_l$ and $t_m$, i.e., $P_{t_l} \neq P_{t_m}$, is considered concept drift. In this paper, virtual drift detection is performed by identifying the change in the feature vector distribution over time. In such a case, the boundaries of the data distribution of the data instances remain the same.
3.1 Proposed Drift Detection Method
This section illustrates the working of the proposed drift detection method. The method is model-independent and unsupervised. The general workflow diagram of the proposed work is shown in Fig. 1. The pseudo-code of the proposed detector is described in Algorithms 1 and 2.
Fig. 1 General workflow diagram of the proposed work: the data stream is windowed; features with the k highest scores are selected; K-means clustering is applied to the window data instances; the inter-cluster distance d(Cr, Cc), the farthest point from each cluster (FP), and the cluster centroids (C) are computed; if these measures differ between the current and recent data windows, concept drift is detected, otherwise no drift is flagged
The proposed method detects drifts using two data windows, where $w_c$ is the current window. When new data instances arrive, the previous $w_c$ becomes the recent window $w_r$ so that the current window can accommodate the newly available data.
We utilize the chi-square test [17] as the statistical analysis to find the association or difference between the recent and current window data instances. The chi-square test is a popular feature selection method. It evaluates data features separately in terms of the classes. It is necessary to discretize the range of continuous-valued features into intervals. The chi-squared test compares the observed frequency of a class in each interval produced by the split to the expected frequency of the class. Let $N_{ij}$ be the number of class $C_i$ samples in the $j$th interval among the $N$ examples, and $M_{Ij}$ be the number of samples in the $j$th interval. $E_{ij} = M_{Ij}|C_i|/N$ is the expected frequency of $N_{ij}$. The chi-squared statistic of a particular data stream is thus defined as Eq. 1:

$$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(N_{ij} - E_{ij})^2}{E_{ij}} \quad (1)$$

The number of intervals is denoted by $I$. The higher the obtained value, the more relevant the feature is. In this way, the best features are extracted from the window data based on the k highest scores. This feature selection eliminates the less important information and reduces the method's complexity.
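The per-feature scoring of Eq. 1 can be sketched directly. The following is our own minimal implementation, not the authors' code: it assumes the continuous features have already been discretized into interval indices and that labels are available for scoring, and the helper names `chi2_scores` and `select_k_highest` are illustrative.

```python
import numpy as np

def chi2_scores(X_binned, y):
    """Chi-squared score of each discretized feature against the class labels,
    following Eq. 1: sum over classes i and intervals j of (N_ij - E_ij)^2 / E_ij."""
    n_samples, n_features = X_binned.shape
    classes = np.unique(y)
    scores = np.zeros(n_features)
    for f in range(n_features):
        for j in np.unique(X_binned[:, f]):
            in_j = X_binned[:, f] == j
            M_j = in_j.sum()  # number of samples in interval j
            for c in classes:
                N_ij = np.sum(in_j & (y == c))            # observed frequency
                E_ij = M_j * np.sum(y == c) / n_samples   # expected frequency
                if E_ij > 0:
                    scores[f] += (N_ij - E_ij) ** 2 / E_ij
    return scores

def select_k_highest(X_binned, y, k):
    """Indices of the k features with the highest chi-squared scores."""
    return np.argsort(chi2_scores(X_binned, y))[::-1][:k]
```

In practice the same selection could be done with scikit-learn's `SelectKBest` with the `chi2` score function; the loop above just makes the correspondence to Eq. 1 explicit.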
Algorithm 1: Windowing of data stream
Data: Data Stream: $D_s$, Current Window: $w_c$, Recent Window: $w_r$.
Result: Window of data instances.
1: Initialize current window size;
2: while stream has data instances do
3:   if $w_c \neq$ Full then
4:     Add data instances into the current window;
5:   else
6:     DriftDetector($w_c$)
The K-means clustering algorithm utilizes the highest-score features of the current window. K-means clustering is an unsupervised learning technique used to split unlabeled data into disjoint groups. Data points are randomly selected as initial cluster centers; the distances between the centroids and the data points are then calculated, and each data instance is assigned to its nearest centroid. We store these cluster centroids for further evaluation, as defined in Eqs. 2 and 3. For the new incoming data instances, a new set of cluster centers is selected.
$$C^r = \{C^r_i, C^r_{i+1}, \ldots, C^r_{i+n}\} \quad (2)$$

$$C^c = \{C^c_i, C^c_{i+1}, \ldots, C^c_{i+n}\} \quad (3)$$
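The clustering step that produces the centroid sets of Eqs. 2 and 3 can be sketched with a minimal Lloyd's algorithm (in the paper's setup, scikit-learn's `KMeans` would serve the same role); the function name and the fixed random seed below are our own assumptions.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns the k cluster centroids of window X,
    which the method stores as C^r (recent) or C^c (current) for comparison."""
    rng = np.random.default_rng(seed)
    # random selection of data points as initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute each centroid as the mean of its assigned instances
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```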
Algorithm 2: Drift detection method
Data: Data Stream: $D_s$; Current Window: $w_c$; Recent Window: $w_r$; Temporary Variable: $l$; Cluster Center: $C_i$; Arrays: $C^r$, $C^c$, $FP$, $d(C^r, C^c)$.
Result: Drift detection.
1: Function DriftDetector($w_c$):
2:   if $l == 0$ then
3:     Select features according to the k highest scores using the chi-square test;
4:     Apply K-means clustering on the selected features' data instances;
5:     Find the center of each cluster ($C^r_i$) and store it in $C^r$;
6:     Calculate the squared distance to each cluster center to identify the data point farthest from the centers and store the result in $FP_i$;
7:   else
8:     Select features according to the k highest scores using the chi-square test;
9:     Apply K-means clustering on the selected features' data instances;
10:    Find the center of each cluster ($C^c_i$) and store it in $C^c$;
11:    Calculate the squared distance to each cluster center to identify the data point farthest from the centers and store the result in $FP_{i+1}$;
12:    Calculate the Euclidean distance between the cluster centers and store the outcome in $d_i(C^r, C^c)$;
13:    if $C^r_i \neq C^c_i$ or $FP_i \neq FP_{i+1}$ or $d_i(C^r, C^c) \neq d_{i+1}(C^r, C^c)$ then
14:      Return True;
15:    else
16:      Return False;
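The comparison in lines 13–16 of Algorithm 2 could be sketched as follows. Algorithm 2 compares centroids, farthest-point distances, and inter-centroid distances for exact equality; the tolerance `tol` below is our own addition, since floating-point quantities computed from two windows will rarely be exactly equal, and the helper names are illustrative.

```python
import numpy as np

def farthest_point_distance(X, centroids):
    """FP of Algorithm 2: the largest squared distance from any instance in
    the window to its nearest cluster center."""
    d2 = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
    return d2.min(axis=1).max()

def drift_detected(C_r, C_c, FP_r, FP_c, d_prev, d_curr, tol=0.5):
    """Flag drift if the centroids, the farthest-point distances, or the
    inter-centroid distances differ between the recent and current windows.
    `tol` is an assumed dissimilarity threshold, not stated in the paper."""
    if np.any(np.linalg.norm(C_r - C_c, axis=1) > tol):
        return True  # cluster centers moved
    if abs(FP_r - FP_c) > tol:
        return True  # spread of a cluster changed
    if abs(d_prev - d_curr) > tol:
        return True  # inter-cluster distance changed
    return False
```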
The centroid distance $d(C^r, C^c)$ between the centroids of the recent window ($C^r$) and the centroids of the current window ($C^c$) is evaluated with the help of the Euclidean distance, as shown in Eq. 4:

$$d(C^r, C^c) = \sqrt{(C^r_i - C^c_i)^2 + (C^r_j - C^c_j)^2} \quad (4)$$
Figure 2 demonstrates the cluster centroid and the farthest point ($FP$) of a cluster, respectively. Virtual drift is detected when the boundary remains the same but the cluster data distribution varies over time. The proposed work signals virtual drift when one of the three conditions is satisfied, as shown below:

$$f(x) = \begin{cases} \text{True}, & \text{if } C^r_i \neq C^c_i \,\vee\, FP_i \neq FP_{i+1} \,\vee\, d_i(C^r, C^c) \neq d_{i+1}(C^r, C^c) \\ \text{False}, & \text{otherwise} \end{cases}$$
The above condition shows that concept drift is detected if there is a change in the centroids, the farthest points, or the inter-cluster distance; otherwise, no drift is flagged. When drift is signaled, the learning model is rebuilt with the new incoming data instances to overcome concept drift in the data stream.
Fig. 2 Demonstration of cluster centroid (+) and farthest point of each cluster centroid
4 Evaluation and Results
This section discusses the experimental setup, case study, experimental datasets, and experimental results and analyses. The experiment is implemented in Python using the Scikit-learn and Scikit-multiflow libraries. The proposed work can be applied to online methods that do not contain a built-in drift adaptation mechanism. The Interleaved Test-Then-Train [18] approach is utilized for evaluation purposes: the learning model first predicts on newly arriving data instances, and the model is then updated with them.
Here, the sliding window mechanism is used. The window shrinks whenever the
drift is flagged; otherwise, it expands. For evaluation purposes, the window size is
taken as 50. When the drift is detected, the size of the window is reduced by half of
its current size.
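The window-resizing policy described above might be sketched as follows; the growth step and the size bounds are our own assumptions, since the text only specifies the initial size of 50 and the halving rule on drift.

```python
def next_window_size(size, drift, grow=10, min_size=10, max_size=200):
    """Sliding-window policy from the text: halve the window when drift is
    flagged, otherwise let it expand. grow/min_size/max_size are assumed."""
    if drift:
        return max(min_size, size // 2)  # shrink to half on drift
    return min(max_size, size + grow)    # expand otherwise
```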
4.1 Case Study on Iris Dataset
The iris dataset includes three iris species with 150 instances and four attributes. There are no missing values in the dataset. The dataset is characterized as multivariate, and the attribute characteristics are real-valued.
The dataset describes properties of iris flowers. Two of the flower species are not linearly separable from each other, while one is linearly separable from the other two. Figure 3 demonstrates the data window clustering at different timestamps. There are three timestamps whose data windows have similar cluster centroids and distributions, so no drift is found in these data windows.
Figure 4 exhibits that the cluster with data points in red drifted from timestamp $t_p$ to $t_q$. It suggests that the data distribution of a cluster changes over time.
Fig. 3 Data window clustering at different timestamps (no drift scenario)
Fig. 4 Data window clustering at different timestamps (drift scenario)
4.2 Datasets
Synthetic datasets
LED dataset: The task is to predict the digit shown on a seven-segment LED display. This multivariate dataset contains 10% noise. There are 24 attributes in total, all of which are categorical. The attributes are represented as 0 or 1 on the LED display, indicating whether or not the corresponding light segment is turned on. The ten percent noise indicates that each attribute has a ten percent chance of being inverted. A drift is defined as a change in the value of an attribute.
SINE dataset: There are two contexts in the dataset: Sine1, where $y_i = \sin(x_i)$, and Sine2, where $y_i = 0.5 + 0.3 \times \sin(3\pi x_i)$. Reversing the aforementioned context conditions creates concept drift.
Agrawal dataset: The information in the dataset pertains to people who are interested in taking out a loan. They are divided into two groups: group A and group B. The data collection includes age, salary, education level, house value, zip code, and other variables. There are ten functions in all, but only five are used to construct the dataset. Attribute values can be both numeric and nominal. Concept drift occurs both abruptly and gradually in this dataset.
Real-time datasets
Airlines dataset: There are two target values in the dataset. It determines if a flight
is delayed. The analysis is based on factors such as flight, destination airport, time,
weekdays, and length.
Spam Assassin dataset: Based on e-mail messages, the data collection comprises 500 attributes. The values of all attributes are binary, indicating whether a word appears in the e-mail. The spam messages change gradually over time.
Forest cover dataset: The dataset describes 30 × 30 m cells in Region 2 of the US Forest Service (USFS). There are 54 attributes, 44 of which are binary and 10 of which are numerical. It captures characteristics such as elevation and vegetation appearances and disappearances. The dataset is normalized.
Usenets dataset: Usenets combines usenet1 and usenet2 into a new dataset, a compilation of twenty different newsgroups. The user labels the messages in order of interest. Both datasets have 99 attributes.
4.3 Experimental Results and Analyses
Synthetic and real-time datasets are used in the experiment. Synthetic datasets containing abrupt and gradual drift are denoted Abr and Grad, respectively. In addition, the number of data instances is mentioned with each particular dataset. The suggested method is compared with existing methods that use the Hoeffding Tree classifier. At the end of the data stream, the mean accuracy of each window is utilized to calculate the classification accuracy (or average mean accuracy). The mean accuracy of each window is the ratio of the number of correct predictions to the total number of predictions.
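The accuracy metric described above, sketched under the assumption that each window's prediction outcome is recorded as a hypothetical (correct, total) pair:

```python
def average_mean_accuracy(window_results):
    """Average mean accuracy as defined in the text: the mean accuracy of each
    window (correct / total predictions) averaged over all windows."""
    per_window = [correct / total for correct, total in window_results]
    return sum(per_window) / len(per_window)
```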
In terms of classification accuracy, the suggested technique using the Hoeffding base classifier behaves as follows (see Table 1). The Sine (Grad-20K), Airlines, and Usenets datasets exhibit a decrease in classification accuracy, and the Agrawal (Abr-20K) dataset shows a marginal decrease. In addition, LED (Grad-20K), Agrawal (Abr-50K), Agrawal (Abr-100K), Agrawal (Grad-20K), Agrawal (Grad-50K), Agrawal (Grad-100K), Forest Cover, and Spam Assassin manifest a significant increase in classification accuracy.
Fig. 5 Critical distance (CD) diagram based on classification accuracy of methods with HT classifier
We use the Friedman test with Nemenyi post-hoc analysis (Demšar [19]) to validate the statistical significance of the performance of the proposed method and the compared methods utilizing the NB and HT classifiers. The null hypothesis H0 states that equivalent methods have the same rank; the Friedman test is based on this assumption. We compare eight strategies using ten datasets in this test. Each method is ranked according to its performance in terms of classification accuracy (see Table 1).
As described by Demšar [19], a Nemenyi post-hoc analysis is performed and the Critical Difference (CD) is calculated. The proposed technique substantially outperforms ADWIN, ECDD, and SEED (see Fig. 5).
5 Conclusion
In the streaming environment, the learning model has the ability to obtain new information. It updates its knowledge by applying a forgetting mechanism and rebuilding the learning model using further information. Several drift detection algorithms assume that the labels of data instances are available after the learning model's prediction, but in real-time scenarios this is not feasible. The paper proposes an unsupervised drift detection method to detect virtual drift in non-stationary data. It minimizes the complexity of the data by selecting the k highest-score features of the data samples, so it works efficiently with high-dimensional streaming data.
The future direction of the proposed work is that it can be applied to various
application domains. Future work will add the outlier detection technique with the
proposed drift detection method. In addition, distinguishing between noise and con-
cept drift is an open challenge.
Table 1 Comparison of classification accuracy between proposed method and existing methods using HT classifier
Dataset ADWIN DDM ECDD SEED SEQDRIFT2 STEPD WSTD1W Proposed work
LED (Grad-20K) 63.09 70.73 67.59 55.60 60.97 65.64 69.06 82.40
Sine (Grad-20K) 85.54 86.82 85.22 85.49 86.69 86.01 86.79 83.04
Agrawal (Abr-20K) 64.50 65.21 64.27 64.64 64.63 65.27 65.76 65.29
Agrawal (Abr-50K) 65.88 70.01 65.40 65.39 67.29 66.47 69.89 88.24
Agrawal (Abr-100K) 66.63 73.09 66.96 65.71 68.70 67.06 70.52 88.29
Agrawal (Grad-20K) 63.62 65.27 63.26 63.84 63.37 64.26 64.98 88.24
Agrawal (Grad-50K) 65.48 69.20 65.69 65.04 66.83 66.02 68.58 88.49
Agrawal (Grad-100K) 66.35 73.48 66.43 65.47 68.49 66.89 70.22 88.40
Airlines 66.70 65.35 63.66 66.71 66.60 65.73 66.71 58.47
Forest cover 67.73 67.14 67.39 67.32 67.68 67.62 68.18 70.73
Spam assassin 91.87 89.34 88.39 90.90 89.70 91.42 91.80 91.89
Usenets 68.41 71.01 72.75 68.65 66.31 71.95 71.58 63.07
Average rank 5.6 3.2 6.6 6.8 4.9 4.5 2.6 1.8
Bold signifies the highest accuracy of a method with respect to the particular dataset
References
1. Agrahari, S., Singh, A.K.: Concept drift detection in data stream mining: A literature review. J. King Saud Univ. Comput. Inf. Sci. (2021). ISSN 1319-1578. https://doi.org/10.1016/j.jksuci.2021.11.006
2. Agrahari, S., Singh, A.K.: Disposition-based concept drift detection and adaptation in data stream. Arab. J. Sci. Eng. 47(8), 10605–10621 (2022). https://doi.org/10.1007/s13369-022-06653-4
3. Souza, V., Parmezan, A.R.S., Chowdhury, F. A., Mueen, A.: Efficient unsupervised drift detector
for fast and high-dimensional data streams. Knowl. Inf. Syst. 63(6), 1497–1527 (2021)
4. Xuan, J., Lu, J., Zhang, G.: Bayesian nonparametric unsupervised concept drift detection for data stream mining. ACM Trans. Intell. Syst. Technol. (TIST) 12(1), 1–22 (2020)
5. Huang, H., Yoo, S., Kasiviswanathan, S.P.: Unsupervised feature selection on data streams.
In: Proceedings of the 24th ACM International on Conference on Information and Knowledge
Management, pp. 1031–1040 (2015)
6. de Mello, R.F., Vaz, Y., Grossi, C.H., Bifet, A.: On learning guarantees to unsupervised concept
drift detection on data streams. Expert Syst. Appl. 117, 90–102 (2019)
7. Wiwatcharakoses, C., Berrar, D.: Soinn+, a self-organizing incremental neural network for
unsupervised learning from noisy data streams. Expert Syst. Appl. 143, 113069 (2020)
8. Gözüaçık, Ö., Can, F.: Concept learning using one-class classifiers for implicit drift detection in evolving data streams. Artif. Intell. Rev. 54(5), 3725–3747 (2021)
9. Pinto, F., Sampaio, M.O.P., Bizarro, P.: Automatic model monitoring for data streams. arXiv
preprint arXiv:1908.04240 (2019)
10. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Brazilian
Symposium on Artificial Intelligence, pp. 286–295. Springer, Berlin (2004)
11. Bifet, A.: Adaptive learning and mining for data streams and frequent patterns. ACM SIGKDD Explor. Newsl. 11(1), 55–56 (2009)
12. Ross, G.J., Adams, N.M., Tasoulis, D.K., Hand, D.J.: Exponentially weighted moving average
charts for detecting concept drift. Pattern Recogn. Lett. 33(2), 191–198 (2012)
13. Huang, D.T.J., Koh, Y.S., Dobbie, G., Pears, R.: Detecting volatility shift in data streams. In:
2014 IEEE International Conference on Data Mining, pp. 863–868 (2014). https://doi.org/10.
1109/ICDM.2014.50
14. Pears, R., Sakthithasan, S., Koh, Y.S.: Detecting concept change in dynamic data streams.
Mach. Learn. 97(3), 259–293 (2014)
15. Nishida, K., Yamauchi, K.: Detecting concept drift using statistical testing. In: Corruble, V.,
Takeda, M., Suzuki, E. (eds.) Discovery Science, pp. 264–269, Springer, Berlin (2007). ISBN
978-3-540-75488-6
16. de Barros, R.S.M., Hidalgo, J.I.G., de Lima Cabral, D.R.: Wilcoxon rank sum test drift detector.
Neurocomputing 275, 1954–1963 (2018)
17. Franke, T.M., Ho, T., Christie, C.A.: The chi-square test: Often used and more often misinter-
preted. Am. J. Eval. 33(3), 448–458 (2012)
18. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 1–37 (2014)
19. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
