Unsupervised Virtual Drift Detection
Method in Streaming Environment
Supriya Agrahari and Anil Kumar Singh
Abstract Real-time applications generate enormous amounts of data whose distribution can potentially change. Such an underlying change in the data distribution over time causes concept drift. The learning model of the data stream encounters concept drift problems while predicting patterns, which leads to deterioration in the learning model's performance. High-dimensional data creates the additional challenge of memory and time requirements. The proposed work develops an unsupervised concept drift detection method to detect virtual drift in non-stationary data. The K-means clustering algorithm is applied to the relevant features to find the stream's virtual drift. The proposed work reduces complexity by detecting drifts using the k highest-score features, making it suitable for high-dimensional data. Here, we analyze the data stream's virtual drift by considering the changes in data distribution between the recent and current window data instances.
Keywords Data stream mining · Concept drift · Clustering · Learning model · Adaptive model
1 Introduction
The data stream is a continuous flow of data instances. Such sequences of data instances are produced by various applications, such as cyber security, industrial production, weather forecasting, and human daily activity data [1]. Data streams are characterized by vast volumes of data generated at high frequencies. The data stream mining learning model performs predictive analysis of data samples. Due to the dynamic behavior of the data stream, however, accurate prediction becomes difficult for a single learning model: the training samples are insufficient to capture the complexity of the problem space, which degrades the learning model's accuracy.
S. Agrahari (B)·A. K. Singh
Motilal Nehru National Institute of Technology Allahabad, Prayagraj, India
e-mail: supriyagrahari@gmail.com
A. K. Singh
e-mail: ak@mnnit.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. Tistarelli et al. (eds.), Computer Vision and Machine Intelligence,
Lecture Notes in Networks and Systems 586,
https://doi.org/10.1007/978-981-19-7867-8_25
In the streaming environment, changes in the data distribution are considered concept drift, which negatively impacts the learning model's performance. There is therefore a need to handle concept drift while performing prediction on the data stream. A drift detector is generally coupled with a prediction (or learning) model [2]. The detector raises a signal for the prediction model whenever changes in the environment are detected. The current prediction model is then discarded, and the model adapts to the knowledge present in the recent data instances to cope with the stream's drifts.
A data stream is a sequence of data examples (or instances) that arrive online at varying speed. There is a possibility of changes in feature or class definitions compared with past knowledge (i.e., concept drift) [3]. In practice, streaming data has limited, delayed, or even no labels. This happens because labels for incoming data instances cannot be obtained quickly, for reasons such as the high labeling cost in real-world applications. This limits many concept drift detectors, such as distribution-based, error-rate-based, and statistical-based methods [4].
Several drift detectors need labeled data to observe the model’s performance,
which helps to identify whether the model is obsolete. However, the labeled data
in a streaming environment is not available in several applications, and providing
a label is a cost-ineffective process. In such cases, supervised learning is no longer
efficient. Hence, the unsupervised method offers a more practical approach in the
streaming environment. On the other hand, the high-dimensional data creates high
computational complexity, so the relevant feature selection is a way to provide an
efficient process of model building.
We propose an unsupervised drift detection method that detects the drifts based on
significant differences between the recent and current window data distribution. Drift
detection is performed using the relevant features identified by the chi-square test.
Dimensionality reduction of the feature space directly results in a faster computation
time [5]. K-means clustering is also performed on the data stream because it is the
most popular and simplest clustering algorithm. The significant contributions of the
paper are as follows:
– We propose an unsupervised drift detection method in the streaming environment.
– The proposed work performs drift detection efficiently with no labeling requirement.
– The experiments are performed with the k highest-score features, minimizing the computational complexity of the learning model.
The paper is organized as follows: Section 2 discusses several existing drift detection methods. Section 3 gives a detailed description of the proposed work with its pseudo-code and workflow diagram. Section 4 contains the experimental evaluation, case study, and results. Section 5 describes the conclusion and future work.
2 Related Work
In this section, we briefly discuss previous research related to concept drift. There are two categories of concept drift detection methods: supervised and unsupervised. Supervised approaches assume that the true labels (or target values) of incoming data instances are available after prediction, so they generally use the error or prediction accuracy of the learning model as the main input to their detection technique, whereas unsupervised approaches do not require labels. In this section, we emphasize unsupervised concept drift detection approaches.
de Mello et al. [6] use Statistical Learning Theory to provide a theoretical framework for drift detection. They develop the Plover algorithm, which detects drift using statistical measures such as mean, variance, kurtosis, and skewness, and utilize power spectrum analysis to measure the data frequencies and amplitudes used for drift detection. SOINN+ [7] is a self-organizing incremental neural network with a forgetting mechanism. From the incoming data instances, it learns a topology-preserving mapping to a network structure and can represent clusters of arbitrary shapes in noisy data streams. Souza et al. [3] present an Image-Based Drift Detector (IBDD) for high-dimensional and high-speed data streams, which detects drifts based on pixel differences. Huang et al. [5] present a new unsupervised feature selection technique for handling high-dimensional data.
OCDD is an unsupervised One-Class Drift Detector [8]. It uses a sliding window and a one-class classifier to identify changes in concepts; drift is detected when the ratio of false predictions exceeds a threshold. Pinto et al. [9] present SAMM, an automatic model monitoring system. It is a time- and space-efficient unsupervised streaming method that generates alarm reports along with a summary of the events, and it provides information about important features to explain the concept drift detection scenario.
DDM [10], ADWIN [11], ECDD [12], SEED [13], SEQDRIFT2 [14], STEPD [15], and WSTD1W [16] are compared with the proposed approach in terms of classification accuracy.
3 Proposed Work
In the streaming environment, the learning model requires processing data instances
as fast as the data becomes available because there is limited memory to process it.
In this regard, an online algorithm can process data sequentially or form a window
for computation to work well with limited memory and time.
In the proposed work, the data stream is defined as $D_s = [x_i, x_{i+1}, \ldots, x_{i+n}, \ldots]$, where $D_s$ is a $d$-dimensional data matrix and each $x$ represents a feature vector at a different timestamp. A change in the data distribution ($P$) between different timestamps $t_l$ and $t_m$, i.e., $P_{t_l} \neq P_{t_m}$, is considered concept drift. In this paper, virtual drift detection is performed by identifying the change in the feature vector distribution over time. In such a case, the boundaries of the data distribution of the data instances remain the same.
3.1 Proposed Drift Detection Method
This section illustrates the working of the proposed drift detection method. The method is model-independent and unsupervised. The general workflow diagram of the proposed work is shown in Fig. 1. The pseudo-code of the proposed detector is described in Algorithms 1 and 2.
Fig. 1 General workflow diagram of the proposed work: the data stream is windowed; features with the k highest scores are selected; K-means clustering is applied to the window data instances; the inter-cluster distance d(Cr, Cc), the farthest point from each cluster (FP), and the cluster centroids (C) are computed; if these measures differ between the current and recent data windows, concept drift is detected, otherwise no drift is flagged
The proposed method detects drifts using two data windows, where $w_c$ is the current window. When new data instances arrive, the previous $w_c$ becomes the recent window $w_r$ so that the current window can accommodate the newly available data.
We utilize the chi-square test [17] as the statistical analysis to find the association or difference between the recent and current window data instances. The chi-square test is a popular feature selection method. It evaluates data features separately in terms of the classes. It is necessary to discretize the range of continuous-valued features into intervals. The chi-squared test compares the observed frequency of a class in each interval produced by the split to the expected frequency of the class. Let $N_{ij}$ be the number of class $C_i$ samples in the $j$th interval among the $N$ examples, and $M_{Ij}$ be the number of samples in the $j$th interval. $E_{ij} = M_{Ij}|C_i|/N$ is the expected frequency of $N_{ij}$. The chi-squared statistic of a particular data stream is thus defined as Eq. 1:

$$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(N_{ij} - E_{ij})^2}{E_{ij}} \quad (1)$$

The number of intervals is denoted by $I$. The higher the obtained value, the more relevant the feature is. In this way, the best features are extracted from the window data based on the k highest scores. This feature selection eliminates the less important information and reduces the method's complexity.
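The per-feature scoring of Eq. 1 can be sketched directly. The following is our own minimal implementation, not the authors' code: it assumes the continuous features have already been discretized into interval indices and that labels are available for scoring, and the helper names `chi2_scores` and `select_k_highest` are illustrative.

```python
import numpy as np

def chi2_scores(X_binned, y):
    """Chi-squared score of each discretized feature against the class labels,
    following Eq. 1: sum over classes i and intervals j of (N_ij - E_ij)^2 / E_ij."""
    n_samples, n_features = X_binned.shape
    classes = np.unique(y)
    scores = np.zeros(n_features)
    for f in range(n_features):
        for j in np.unique(X_binned[:, f]):
            in_j = X_binned[:, f] == j
            M_j = in_j.sum()  # number of samples in interval j
            for c in classes:
                N_ij = np.sum(in_j & (y == c))            # observed frequency
                E_ij = M_j * np.sum(y == c) / n_samples   # expected frequency
                if E_ij > 0:
                    scores[f] += (N_ij - E_ij) ** 2 / E_ij
    return scores

def select_k_highest(X_binned, y, k):
    """Indices of the k features with the highest chi-squared scores."""
    return np.argsort(chi2_scores(X_binned, y))[::-1][:k]
```

In practice the same selection could be done with scikit-learn's `SelectKBest` with the `chi2` score function; the loop above just makes the correspondence to Eq. 1 explicit.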
Algorithm 1: Windowing of data stream
Data: Data Stream: $D_s$, Current Window: $w_c$, Recent Window: $w_r$.
Result: Window of data instances.
1: Initialize current window size;
2: while stream has data instances do
3:   if $w_c \neq$ Full then
4:     Add data instances into the current window;
5:   else
6:     DriftDetector($w_c$)
The K-means clustering algorithm utilizes the highest-score features of the current window. K-means clustering is an unsupervised learning technique used to split unlabeled data into disjoint groups. Data points are randomly selected as initial cluster centers; the distances between the centroids and the data points are then calculated, and each data instance is assigned to its nearest centroid. We store these cluster centroids for further evaluation, as defined in Eqs. 2 and 3. For the new incoming data instances, a new set of cluster centers is selected.
$$C^r = \{C^r_i, C^r_{i+1}, \ldots, C^r_{i+n}\} \quad (2)$$

$$C^c = \{C^c_i, C^c_{i+1}, \ldots, C^c_{i+n}\} \quad (3)$$
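The clustering step that produces the centroid sets of Eqs. 2 and 3 can be sketched with a minimal Lloyd's algorithm (in the paper's setup, scikit-learn's `KMeans` would serve the same role); the function name and the fixed random seed below are our own assumptions.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns the k cluster centroids of window X,
    which the method stores as C^r (recent) or C^c (current) for comparison."""
    rng = np.random.default_rng(seed)
    # random selection of data points as initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute each centroid as the mean of its assigned instances
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```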
Algorithm 2: Drift detection method
Data: Data Stream: $D_s$; Current Window: $w_c$; Recent Window: $w_r$; Temporary Variable: $l$; Cluster Center: $C_i$; Arrays: $C^r$, $C^c$, $FP$, $d(C^r, C^c)$.
Result: Drift detection.
1: Function DriftDetector($w_c$):
2:   if $l == 0$ then
3:     Select features according to the k highest scores using the chi-square test;
4:     Apply K-means clustering on the selected features' data instances;
5:     Find the center of each cluster ($C^r_i$) and store it in $C^r$;
6:     Calculate the squared distance to each cluster center to identify the data point farthest from the centers and store the result in $FP_i$;
7:   else
8:     Select features according to the k highest scores using the chi-square test;
9:     Apply K-means clustering on the selected features' data instances;
10:    Find the center of each cluster ($C^c_i$) and store it in $C^c$;
11:    Calculate the squared distance to each cluster center to identify the data point farthest from the centers and store the result in $FP_{i+1}$;
12:    Calculate the Euclidean distance between the cluster centers and store the outcome in $d_i(C^r, C^c)$;
13:    if $C^r_i \neq C^c_i$ or $FP_i \neq FP_{i+1}$ or $d_i(C^r, C^c) \neq d_{i+1}(C^r, C^c)$ then
14:      Return True;
15:    else
16:      Return False;
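The comparison in lines 13–16 of Algorithm 2 could be sketched as follows. Algorithm 2 compares centroids, farthest-point distances, and inter-centroid distances for exact equality; the tolerance `tol` below is our own addition, since floating-point quantities computed from two windows will rarely be exactly equal, and the helper names are illustrative.

```python
import numpy as np

def farthest_point_distance(X, centroids):
    """FP of Algorithm 2: the largest squared distance from any instance in
    the window to its nearest cluster center."""
    d2 = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
    return d2.min(axis=1).max()

def drift_detected(C_r, C_c, FP_r, FP_c, d_prev, d_curr, tol=0.5):
    """Flag drift if the centroids, the farthest-point distances, or the
    inter-centroid distances differ between the recent and current windows.
    `tol` is an assumed dissimilarity threshold, not stated in the paper."""
    if np.any(np.linalg.norm(C_r - C_c, axis=1) > tol):
        return True  # cluster centers moved
    if abs(FP_r - FP_c) > tol:
        return True  # spread of a cluster changed
    if abs(d_prev - d_curr) > tol:
        return True  # inter-cluster distance changed
    return False
```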
The centroid distance $d(C^r, C^c)$ between the centroids of the recent window ($C^r$) and the centroids of the current window ($C^c$) is evaluated with the help of the Euclidean distance, as shown in Eq. 4:

$$d(C^r, C^c) = \sqrt{(C^r_i - C^c_i)^2 + (C^r_j - C^c_j)^2} \quad (4)$$
Figure 2 demonstrates the cluster centroid and the farthest point ($FP$) of a cluster, respectively. Virtual drift is detected when the boundary remains the same but the cluster data distribution varies over time. The proposed work signals virtual drift when one of the three conditions is satisfied, as shown below:

$$f(x) = \begin{cases} \text{True}, & \text{if } C^r_i \neq C^c_i \,\vee\, FP_i \neq FP_{i+1} \,\vee\, d_i(C^r, C^c) \neq d_{i+1}(C^r, C^c) \\ \text{False}, & \text{otherwise} \end{cases}$$
The above condition shows that concept drift is detected if there is a change in the centroids, the farthest points, or the inter-cluster distance; otherwise, no drift is flagged. When drift is signaled, the learning model is rebuilt with the new incoming data instances to overcome concept drift in the data stream.
Fig. 2 Demonstration of cluster centroid (+) and farthest point of each cluster centroid
4 Evaluation and Results
This section discusses the experimental setup, case study, experimental datasets, and experimental results and analyses. The experiment is implemented in Python using the Scikit-learn and Scikit-multiflow libraries. The proposed work can be applied to online methods that do not contain a built-in drift adaptation mechanism. The Interleaved Test-Then-Train [18] approach is utilized for evaluation purposes: the learning model first predicts on newly arriving data instances, and the model is then updated with them.
Here, the sliding window mechanism is used. The window shrinks whenever the
drift is flagged; otherwise, it expands. For evaluation purposes, the window size is
taken as 50. When the drift is detected, the size of the window is reduced by half of
its current size.
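The window-resizing policy described above might be sketched as follows; the growth step and the size bounds are our own assumptions, since the text only specifies the initial size of 50 and the halving rule on drift.

```python
def next_window_size(size, drift, grow=10, min_size=10, max_size=200):
    """Sliding-window policy from the text: halve the window when drift is
    flagged, otherwise let it expand. grow/min_size/max_size are assumed."""
    if drift:
        return max(min_size, size // 2)  # shrink to half on drift
    return min(max_size, size + grow)    # expand otherwise
```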
4.1 Case Study on Iris Dataset
The iris dataset includes three iris species with 150 instances and four attributes. There are no missing values in the dataset. The dataset is characterized as multivariate, and the attribute characteristics are real-valued.
The dataset describes properties of iris flowers. Two of the flower species are not linearly separable from each other, while one is linearly separable from the other two. Figure 3 demonstrates the data window clustering at different timestamps. There are three timestamps whose data windows have similar cluster centroids and distributions, so no drift is found in these data windows.
Figure 4 exhibits that the cluster with data points in red drifted from timestamp $t_p$ to $t_q$. It suggests that the data distribution of a cluster changes over time.
Fig. 3 Data window clustering at different timestamps (no drift scenario)
Fig. 4 Data window clustering at different timestamps (drift scenario)
4.2 Datasets
Synthetic datasets
LED dataset: The task is to predict the digit shown on a seven-segment LED display. This multivariate dataset contains 10% noise. There are 24 attributes in total, all of which are categorical. The attributes are represented as 0 or 1 on the LED display, indicating whether or not the corresponding light segment is turned on. The ten percent noise indicates that each attribute has a ten percent chance of being inverted. A drift is defined as a change in the value of an attribute.
SINE dataset: There are two contexts in the dataset: Sine1, where $y_i = \sin(x_i)$, and Sine2, where $y_i = 0.5 + 0.3 \times \sin(3\pi x_i)$. Reversing the aforementioned context conditions creates concept drift.
Agrawal dataset: The information in the dataset pertains to people who are interested in taking out a loan. They are divided into two groups: group A and group B. The data collection includes age, salary, education level, house value, zip code, and other variables. There are ten functions in all, but only five are used to construct the dataset. Attribute values can be both numeric and nominal. Concept drift occurs both abruptly and gradually in this dataset.
Real-time datasets
Airlines dataset: There are two target values in the dataset. It determines if a flight
is delayed. The analysis is based on factors such as flight, destination airport, time,
weekdays, and length.
Spam Assassin dataset: Based on e-mail messages, the data collection comprises 500 attributes. The values of all attributes are binary, indicating whether a word appears in the e-mail. The spam messages change gradually over time.
Forest cover dataset: The dataset describes 30 × 30 m cells in Region 2 of the US Forest Service (USFS). There are 54 attributes, 44 of which are binary and 10 of which are numerical. It captures characteristics such as elevation and vegetation appearances and disappearances. The dataset is normalized.
Usenets dataset: Usenets combines usenet1 and usenet2 into a new dataset, a compilation of twenty different newsgroups. The user labels the messages in order of interest. Both datasets have 99 attributes.
4.3 Experimental Results and Analyses
Synthetic and real-time datasets are used in the experiment. Synthetic datasets containing abrupt and gradual drift are denoted Abr and Grad, respectively. In addition, the number of data instances is mentioned with each particular dataset. The suggested method is compared with existing methods that use the Hoeffding Tree classifier. At the end of the data stream, the mean accuracy of each window is utilized to calculate the classification accuracy (or average mean accuracy). The mean accuracy of each window is the ratio of the number of correct predictions to the total number of predictions.
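The accuracy metric described above, sketched under the assumption that each window's prediction outcome is recorded as a hypothetical (correct, total) pair:

```python
def average_mean_accuracy(window_results):
    """Average mean accuracy as defined in the text: the mean accuracy of each
    window (correct / total predictions) averaged over all windows."""
    per_window = [correct / total for correct, total in window_results]
    return sum(per_window) / len(per_window)
```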
In terms of classification accuracy, the suggested technique using the Hoeffding base classifier behaves as follows (see Table 1). The Sine (Grad-20K), Airlines, and Usenets datasets exhibit a decrease in classification accuracy, and the Agrawal (Abr-20K) dataset shows a marginal decrease. In addition, LED (Grad-20K), Agrawal (Abr-50K), Agrawal (Abr-100K), Agrawal (Grad-20K), Agrawal (Grad-50K), Agrawal (Grad-100K), Forest Cover, and Spam Assassin manifest a significant increase in classification accuracy.
Fig. 5 Critical distance (CD) diagram based on classification accuracy of methods with HT classifier
We use the Friedman test with Nemenyi post-hoc analysis (Demšar [19]) to validate the statistical significance of the performance of the proposed method and the compared methods utilizing the NB and HT classifiers. The null hypothesis H0 states that equivalent methods have the same rank; the Friedman test is based on this assumption. We compare eight strategies using ten datasets in this test. Each method is ranked according to its performance in terms of classification accuracy (see Table 1).
As described by Demšar [19], a Nemenyi post-hoc analysis is performed and the Critical Difference (CD) is calculated. The proposed technique substantially outperforms ADWIN, ECDD, and SEED (see Fig. 5).
5 Conclusion
In the streaming environment, the learning model has the ability to obtain new information. It updates its knowledge by applying a forgetting mechanism and rebuilding the learning model using further information. Several drift detection algorithms assume that the labels of data instances are available after the learning model's prediction, but in real-time scenarios this is not feasible. The paper proposes an unsupervised drift detection method to detect virtual drift in non-stationary data. It minimizes the complexity of the data by selecting the k highest-score features of the data samples, so it works efficiently with high-dimensional streaming data.
The future direction of the proposed work is that it can be applied to various
application domains. Future work will add the outlier detection technique with the
proposed drift detection method. In addition, distinguishing between noise and con-
cept drift is an open challenge.
Table 1 Comparison of classification accuracy between proposed method and existing methods using HT classifier
Dataset ADWIN DDM ECDD SEED SEQDRIFT2 STEPD WSTD1W Proposed work
LED (Grad-20K) 63.09 70.73 67.59 55.60 60.97 65.64 69.06 82.40
Sine (Grad-20K) 85.54 86.82 85.22 85.49 86.69 86.01 86.79 83.04
Agrawal (Abr-20K) 64.50 65.21 64.27 64.64 64.63 65.27 65.76 65.29
Agrawal (Abr-50K) 65.88 70.01 65.40 65.39 67.29 66.47 69.89 88.24
Agrawal (Abr-100K) 66.63 73.09 66.96 65.71 68.70 67.06 70.52 88.29
Agrawal (Grad-20K) 63.62 65.27 63.26 63.84 63.37 64.26 64.98 88.24
Agrawal (Grad-50K) 65.48 69.20 65.69 65.04 66.83 66.02 68.58 88.49
Agrawal (Grad-100K) 66.35 73.48 66.43 65.47 68.49 66.89 70.22 88.40
Airlines 66.70 65.35 63.66 66.71 66.60 65.73 66.71 58.47
Forest cover 67.73 67.14 67.39 67.32 67.68 67.62 68.18 70.73
Spam assassin 91.87 89.34 88.39 90.90 89.70 91.42 91.80 91.89
Usenets 68.41 71.01 72.75 68.65 66.31 71.95 71.58 63.07
Average rank 5.6 3.2 6.6 6.8 4.9 4.5 2.6 1.8
Bold signifies the highest accuracy of a method with respect to the particular dataset
References
1. Agrahari, S., Singh, A.K.: Concept drift detection in data stream mining: A literature review. J. King Saud Univ. Comput. Inf. Sci. (2021). ISSN 1319-1578. https://doi.org/10.1016/j.jksuci.2021.11.006
2. Agrahari, S., Singh, A.K.: Disposition-based concept drift detection and adaptation in data stream. Arab. J. Sci. Eng. 47(8), 10605–10621 (2022). https://doi.org/10.1007/s13369-022-06653-4
3. Souza, V., Parmezan, A.R.S., Chowdhury, F. A., Mueen, A.: Efficient unsupervised drift detector
for fast and high-dimensional data streams. Knowl. Inf. Syst. 63(6), 1497–1527 (2021)
4. Xuan, J., Lu, J., Zhang, G.: Bayesian nonparametric unsupervised concept drift detection for data stream mining. ACM Trans. Intell. Syst. Technol. (TIST) 12(1), 1–22 (2020)
5. Huang, H., Yoo, S., Kasiviswanathan, S.P.: Unsupervised feature selection on data streams.
In: Proceedings of the 24th ACM International on Conference on Information and Knowledge
Management, pp. 1031–1040 (2015)
6. de Mello, R.F., Vaz, Y., Grossi, C.H., Bifet, A.: On learning guarantees to unsupervised concept
drift detection on data streams. Expert Syst. Appl. 117, 90–102 (2019)
7. Wiwatcharakoses, C., Berrar, D.: Soinn+, a self-organizing incremental neural network for
unsupervised learning from noisy data streams. Expert Syst. Appl. 143, 113069 (2020)
8. Gözüaçık, Ö., Can, F.: Concept learning using one-class classifiers for implicit drift detection in evolving data streams. Artif. Intell. Rev. 54(5), 3725–3747 (2021)
9. Pinto, F., Sampaio, M.O.P., Bizarro, P.: Automatic model monitoring for data streams. arXiv
preprint arXiv:1908.04240 (2019)
10. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Brazilian
Symposium on Artificial Intelligence, pp. 286–295. Springer, Berlin (2004)
11. Bifet, A.: Adaptive learning and mining for data streams and frequent patterns. ACM SIGKDD Explor. Newsl. 11(1), 55–56 (2009)
12. Ross, G.J., Adams, N.M., Tasoulis, D.K., Hand, D.J.: Exponentially weighted moving average
charts for detecting concept drift. Pattern Recogn. Lett. 33(2), 191–198 (2012)
13. Huang, D.T.J., Koh, Y.S., Dobbie, G., Pears, R.: Detecting volatility shift in data streams. In:
2014 IEEE International Conference on Data Mining, pp. 863–868 (2014). https://doi.org/10.
1109/ICDM.2014.50
14. Pears, R., Sakthithasan, S., Koh, Y.S.: Detecting concept change in dynamic data streams.
Mach. Learn. 97(3), 259–293 (2014)
15. Nishida, K., Yamauchi, K.: Detecting concept drift using statistical testing. In: Corruble, V.,
Takeda, M., Suzuki, E. (eds.) Discovery Science, pp. 264–269, Springer, Berlin (2007). ISBN
978-3-540-75488-6
16. de Barros, R.S.M., Hidalgo, J.I.G., de Lima Cabral, D.R.: Wilcoxon rank sum test drift detector.
Neurocomputing 275, 1954–1963 (2018)
17. Franke, T.M., Ho, T., Christie, C.A.: The chi-square test: Often used and more often misinter-
preted. Am. J. Eval. 33(3), 448–458 (2012)
18. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 1–37 (2014)
19. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
