Classification of Concept-Drifting Data Streams using
Optimized Genetic Algorithm
E. Padmalatha
Asst. Prof.
CBIT
C.R.K. Reddy, PhD
Professor
CBIT
B. Padmaja Rani, PhD
Professor
JNTUH
ABSTRACT
Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. In these applications, the main goal is to predict the class or value of new instances in the data stream, given some knowledge about the class membership or values of previous instances. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications that operate in non-stationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e., the class or target value to be predicted may change. This problem is referred to as concept drift [8]. The Genetic Algorithm, an evolutionary computation technique, is a strong rule-based classification algorithm, but it has traditionally been used for mining small static data sets and is inefficient on large data streams. Evolutionary algorithms are population-based optimization techniques that evaluate a fitness measure and evolve individuals through reproduction, crossover, mutation and selection. If the Genetic Algorithm can be made scalable and adaptable by reducing its I/O intensity, it becomes an efficient and effective tool for mining large data sets such as data streams. In this paper, a scalable and adaptable online Genetic Algorithm is proposed to mine classification rules for data streams with concept drifts. The results of the proposed method are comparable with those of other standard methods used for mining data streams.
Keywords
Data stream, concept drift, Genetic Algorithm, optimization.
1. INTRODUCTION
The Genetic Algorithm (GA) [1, 4, 6] is a rule-based classifier whose performance is comparable to that of the Rule Based Classifier (RBC). The GA has its origins in Darwin's theory of "survival of the fittest". It also has some major advantages over RBC. To make the classifier-building process faster and easier, RBC stores a compressed form of the data stream in memory as a tree. Since the stream evolves abruptly, frequent and fast modification of these trees is also required; hence, when the domain becomes too complex, building and maintaining the trees becomes a difficult task.
Compared to RBC, the GA model is independent of domain knowledge and does not require any complex data structures to store the data. Its memory requirement is therefore low, and it avoids the complex computations that RBC requires. Due to its evolutionary nature, the GA can handle concept drifts in a natural way: the model can be made to evolve and adapt itself in accordance with the changes in the concepts of the data stream. On the other hand, a conventional GA scans the data set repeatedly to check the accuracy of the candidate rule set after each generation, which is not possible for data streams, as they cannot be accessed repeatedly. Hence, a scalable and adaptable GA is built here for large data sets such as data streams by reducing its I/O intensity.
2. RELATED WORK
Ensemble Classifiers (EC)
An EC [3, 5] builds and uses a group of classifiers to predict the class label of a new, unknown data sample. In this type of algorithm, the data stream is divided into weighted chunks and a classifier is built for each chunk separately, as shown in Fig. 1. The most recently built classifiers are used to predict new data samples. This collective decision making increases the prediction accuracy [7] compared to models that employ only a single classifier for prediction.
An analogy for an ensemble classifier is giving the same raw material to several designers to build a product and weighting their contributions according to their experience. However, ensemble algorithms usually perform poorly when predicting data samples of rare classes, particularly when the data distribution is highly skewed.
Figure 1. Ensemble Classifier
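To make the weighted-chunk idea concrete, the following sketch (not taken from the paper; the base learner, the weighting by accuracy on recent data, and the class interface are assumptions) shows one way per-chunk classifiers can be combined by a weighted vote.

# Minimal sketch of a chunk-based weighted ensemble (illustrative only).
from collections import defaultdict

class WeightedChunkEnsemble:
    def __init__(self, base_learner_factory, max_members=10):
        self.factory = base_learner_factory   # e.g., lambda: DecisionTreeClassifier()
        self.max_members = max_members
        self.members = []                     # list of (classifier, weight) pairs

    def add_chunk(self, X_chunk, y_chunk, X_recent, y_recent):
        clf = self.factory()
        clf.fit(X_chunk, y_chunk)
        # Weight each member by its accuracy on the most recent data.
        weight = clf.score(X_recent, y_recent)
        self.members.append((clf, weight))
        # Keep only the best-weighted max_members classifiers.
        self.members = sorted(self.members, key=lambda m: m[1], reverse=True)[:self.max_members]

    def predict_one(self, x):
        votes = defaultdict(float)
        for clf, weight in self.members:
            votes[clf.predict([x])[0]] += weight
        return max(votes, key=votes.get)

Calling add_chunk for each arriving chunk and then predict_one for a new sample mirrors the weighted collective decision described above.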
Rule Based Classifiers (RBC)
The third category of algorithms treats the classifier as a combination of small independent components, each built independently and incrementally. The most recent algorithm of this type is the low-granularity Rule Based Classifier (RBC) proposed by Wang et al. (2007). Building the classifier in this way considerably reduces the model updating cost, and the approach also maintains a high accuracy level compared to the first two approaches, because it updates only the components affected by a concept drift rather than making a global modification whenever drift is detected. This makes the updating process faster and easier while keeping the approach accurate.
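As a rough illustration of the low-granularity idea (not the exact algorithm of Wang et al.; the rule representation, error threshold and relearn callback are assumptions), the sketch below re-learns only the rules contradicted by recent data and leaves the rest of the model untouched.

# Conceptual sketch: update only the rule components affected by drift.
class Rule:
    def __init__(self, conditions, label):
        self.conditions = conditions      # e.g., {"safety": "high", "persons": "4"}
        self.label = label

    def matches(self, record):
        return all(record.get(attr) == value for attr, value in self.conditions.items())

def update_affected_rules(rules, recent_records, relearn, error_threshold=0.3):
    """Re-learn only the rules whose error on recent records exceeds the threshold."""
    updated = []
    for rule in rules:
        covered = [r for r in recent_records if rule.matches(r)]
        if covered:
            errors = sum(1 for r in covered if r["class"] != rule.label)
            if errors / len(covered) > error_threshold:
                # Only this component is rebuilt, from the records it covers.
                updated.append(relearn(covered))
                continue
        updated.append(rule)              # unaffected rules are kept unchanged
    return updated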
3. DESIGN OF OPTIMIZED GA
3.1 OGA Functionalities
The methodology of the Optimized GA (OGA) comprises four functionalities, shown in Fig. 2 (a minimal code sketch of this pipeline is given after Fig. 3 below):
1. Data stream distributor
Initially, a training data set and a test data set are prepared. The test data set is taken as input, and the training data set is uploaded for streaming. The data stream distributor is responsible for continuously streaming the uploaded training data set.
2. Population creator
During streaming, the population creator builds the initial population, creating individuals (including duplicate instances) over a series of windows. These population windows are then passed to the genetic engine.
3. Genetic engine
The genetic engine applies the OGA mechanisms to the created populations by calculating their fitness values. These mechanisms include selection, crossover, mutation and elitism-based selection of individuals.
4. Rule set evaluator
After the fitness values of all individuals are calculated, rules are generated as solutions, i.e., gene values are computed along with their classification run time. The rule set evaluator outputs the best rules with the best fitness values after all iterations.
Figure 2. Design Flow of OGA Process
Figure 3. OGA Process
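The following sketch (not the authors' implementation; the function names, window size, bit-tuple individuals and GA parameters are placeholder assumptions) shows one way the four functionalities could be wired together.

# Illustrative wiring of the four OGA functionalities; all names are assumptions.
import random

def stream_distributor(training_records):
    """Yield training records one at a time, simulating a continuous stream."""
    for record in training_records:
        yield record

def population_creator(stream, window_size=1000):
    """Collect stream records into fixed-size windows that act as populations."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == window_size:
            yield window
            window = []
    if window:
        yield window

def genetic_engine(population, fitness, generations=20, elite_fraction=0.1):
    """Evolve one window of bit-tuple individuals with selection, crossover, mutation and elitism."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:max(2, int(elite_fraction * len(population)))]
        children = []
        while len(survivors) + len(children) < len(population):
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, len(p1))          # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.05:                  # bit-flip mutation
                i = random.randrange(len(child))
                child = child[:i] + (1 - child[i],) + child[i + 1:]
            children.append(child)
        population = survivors + children
    return population

def rule_set_evaluator(populations, fitness):
    """Keep the best individual (rule) from each window as the evolving rule set."""
    return [max(pop, key=fitness) for pop in populations]

In use, windows produced by population_creator(stream_distributor(train)) are evolved by genetic_engine, and rule_set_evaluator keeps the best rule per window.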
3.2 OGA Process with Datasets
Consider the car data set, which contains 1728 records and 6 attributes; all attributes are categorical. The target class attribute has four values, namely 'unacc', 'acc', 'good' and 'vgood'. To generate larger data sets of size 10000, 20000 and 30000, the records are duplicated and randomly arranged such that the data distribution remains proportionally similar to that of the original data set.
Attribute      Values
Buying         vhigh, high, med, low
Maintenance    vhigh, high, med, low
Doors          2, 3, 4, 5more
Persons        2, 4, more
Lug boot       small, med, big
Safety         low, med, high
In the OGA process:
Creation of the population means duplicating records, for example taking a set of 1000 records from the training data set.
Individuals are the sets of records. In the car data set, an example of an individual is:
vhigh, vhigh, 3, 2, small, high, unacc
Chromosomes are combinations of a target class and an individual used for generating the solution. An example of a chromosome is:
target: acc and vhigh, med, 3, more, med, med
Genes are the solutions found after the GA process generates a solution with an assigned target class label value. An example of a gene is:
vhigh vhigh 3 2 small high unacc
The fitness value of an individual is the value of the fitness function for that individual. Here, the fitness value is initialized with a minimum threshold value based on the best elitism selection.
The OGA process shown in Fig. 3 for the car data set proceeds in the following steps: the target class attribute values unacc, acc, good and vgood are encoded as 1000, 0100, 0010 and 0001 respectively.
The other attribute values are encoded similarly:

Attribute      Values
Buying         vhigh-1000, high-0100, med-0010, low-0001
Maintenance    vhigh-1000, high-0100, med-0010, low-0001
Doors          2-1000, 3-0100, 4-0010, 5more-0001
Persons        2-1000, 4-0100, more-0010
Lug boot       small-1000, med-0100, big-0010
Safety         low-1000, med-0100, high-0010
Now, for example, the individual
vhigh vhigh 3 2 small high unacc
is encoded as
1000 1000 0100 1000 1000 0010 1000
Chromosomes are formed from the target class attribute encoding (unacc-1000) and the individual for generating the solution, so 7 attributes and 4 target class values form 28 chromosomes.
Similarly, the gene solution set is generated for all the rules. The same OGA process is applied to the other data sets as well (a small sketch of the encoding is given below).
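As a rough illustration of the encoding described above (a sketch, not the authors' code; the attribute order and dictionary layout are assumptions), a car record can be mapped to its bit-string form as follows.

# Sketch of the bit-string encoding used for the car data set (illustrative).
ENCODING = {
    "buying":      {"vhigh": "1000", "high": "0100", "med": "0010", "low": "0001"},
    "maintenance": {"vhigh": "1000", "high": "0100", "med": "0010", "low": "0001"},
    "doors":       {"2": "1000", "3": "0100", "4": "0010", "5more": "0001"},
    "persons":     {"2": "1000", "4": "0100", "more": "0010"},
    "lug_boot":    {"small": "1000", "med": "0100", "big": "0010"},
    "safety":      {"low": "1000", "med": "0100", "high": "0010"},
    "class":       {"unacc": "1000", "acc": "0100", "good": "0010", "vgood": "0001"},
}
ATTRIBUTE_ORDER = ["buying", "maintenance", "doors", "persons", "lug_boot", "safety", "class"]

def encode_record(record):
    """Encode one car record (a dict of attribute -> value) as a list of bit strings."""
    return [ENCODING[attr][record[attr]] for attr in ATTRIBUTE_ORDER]

# Example: the individual "vhigh vhigh 3 2 small high unacc"
record = {"buying": "vhigh", "maintenance": "vhigh", "doors": "3", "persons": "2",
          "lug_boot": "small", "safety": "high", "class": "unacc"}
print(" ".join(encode_record(record)))   # prints: 1000 1000 0100 1000 1000 0010 1000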
4. EXPERIMENTATION AND RESULTS
Experimental Process
1. Initially, the chosen data set is divided into two parts: a training set (80%) and a test set (20%).
2. The test data set is taken as the input.txt file for the OGA, and the training data set is uploaded.
3. The uploaded training data set is then streamed, and the OGA process starts on the streamed data.
4. The OGA process generates the gene solutions for the corresponding target class attribute.
5. After generating the solution, the OGA reports the correctly and incorrectly classified instances.
6. Finally, the classification run time (in nanoseconds) of the generated best genes is shown together with the data set record index size.
7. A graph of run time versus solution data set size is plotted, as shown in the following figures (a brief sketch of the split and timing steps appears after Fig. 5).
Fig 4. Classification Run Time (in Nanoseconds) of OGA after generating the classified solution value for the Yeast dataset
Fig 5. Run Time Graph of the Generated Classified Values of the Yeast dataset
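The data split and the nanosecond timing can be sketched as follows (illustrative only; the classify callable and variable names are assumptions).

# Sketch of the 80/20 split and nanosecond run-time measurement (illustrative).
import random
import time

def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle and split records into training (80%) and test (20%) sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def timed_classification(classify, test_records):
    """Return predictions and the classification run time in nanoseconds."""
    start = time.perf_counter_ns()
    predictions = [classify(record) for record in test_records]
    elapsed_ns = time.perf_counter_ns() - start
    return predictions, elapsed_ns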
5. PERFORMANCE EVALUATION USING RIVAL ALGORITHMS
Two measures are considered:
1. Error rate, i.e., the fraction of incorrectly classified instances (the percentage of incorrect classifications divided by 100), computed as sketched below.
2. Classification run time.
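For clarity, the error rate as used here can be computed as follows (a sketch; the variable names are assumptions).

# Error rate as the fraction of incorrectly classified test instances (illustrative).
def error_rate(predictions, true_labels):
    incorrect = sum(1 for p, t in zip(predictions, true_labels) if p != t)
    return incorrect / len(true_labels)   # equivalently, the percentage incorrect divided by 100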
Table 1. Classification Run Time (Seconds) tabulated using Different Classifications for 10 different datasets

Index  Dataset             EC (Random Forest)  CVFDT   Optimized GA
1      KDDCup              57.33               131.84  0.01811
2      Car                 0.27                1       0.00084
3      Chess               5.03                2       0.001315
4      Nursery             6.021               2       0.00087
5      Hyperplane          7.43                2       0.015452
6      Sea                 6.78                1       0.004276
7      Letter              26.7                1.5     0.01417
8      Image Segmentation  0.18                0.07    0.000724
9      Solar Flare         0.13                0.04    0.000892
10     Yeast Database      1.48                0.49    0.003023
Fig 4. Comparison of Classification Run Time (Seconds) using Different Classifications for 10 different datasets
Table 2. Error Rates tabulated using Different Classifications for 10 different datasets
Index  Dataset             EC (Random Forest)  RBC (PART)  CVFDT     Optimized GA
1      KDDCup              0.003               0.002375    0.15      0
2      Car                 0.251               0.29978     0.29978   0
3      Chess               0.133412            0.122407    0.916844  0.001
4      Nursery             0.125               0.666667    0.498302  0.0015
5      Hyperplane          0.235               0.5378      0.4531    0
6      Sea                 0.0895              0.4744      0.3741    0.2
7      Letter              0.1389              0.98        0.603     0.49269
8      Image Segmentation  0.0435              0.95        0.0823    0.001
9      Solar Flare         0.168               0.1456      0.1183    0.001
10     Yeast Database      0                   0.425202    0.38814   0
[Bar chart for Fig 4: Run Time (Seconds) of EC, RBC, CVFDT and Optimized GA across the 10 datasets]
Fig 6. Comparison of Error Rates using Different Classifications for 10 different datasets
6. RIVAL ALGORITHMS
To compare the algorithms' performance, the error rate and the run time are calculated for each data set. A win/lose/tie (w/l/t) record is then calculated for each pair of methods in the experiment. It represents the number of data sets on which an algorithm respectively wins, loses or ties when compared with the other algorithm with regard to error rate. The same record is calculated for all algorithms with respect to run time. From these records we can determine which algorithm performs best (a short sketch of the w/l/t computation is given below).
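The w/l/t tallies can be computed as in the following sketch (illustrative; the layout of the results dictionary is an assumption, and lower values count as wins since both metrics are costs).

# Sketch: win/lose/tie record of method_a versus method_b across data sets.
def win_lose_tie(results, method_a, method_b):
    """results: {dataset: {method: metric_value}}; returns (wins, losses, ties) for method_a."""
    wins = losses = ties = 0
    for scores in results.values():
        a, b = scores[method_a], scores[method_b]
        if a < b:
            wins += 1          # lower error rate or run time wins
        elif a > b:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties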
Table 3. Performance Evaluation Using Rival Algorithms' w/l/t records with regard to their run time across 10 datasets

Method        EC      RBC     CVFDT   Optimized GA
EC            0/0/10  2/8/0   2/8/0   0/10/0
RBC           8/2/0   0/0/10  8/2/0   0/10/0
CVFDT         8/2/0   2/8/0   0/0/10  0/10/0
OGA           10/0/0  10/0/0  10/0/0  0/0/10
Table 4. Performance Evaluation Using Rival Algorithms' w/l/t records with regard to their error rates across 10 datasets

Method        EC      RBC     CVFDT   Optimized GA
EC            0/0/10  7/3/0   9/1/0   2/7/1
RBC           3/7/0   0/0/10  2/7/1   0/10/0
CVFDT         1/9/0   7/2/1   0/0/10  0/10/0
OGA           7/2/1   10/0/0  10/0/0  0/0/10
Hence, the Optimized GA has the highest winning record for both classification error rate and run time, which demonstrates its superior efficiency.
7. CONCLUSION
Existing classification algorithms such as CVFDT, RBC, EC and the traditional GA cannot adequately classify data streams subject to concept drift, where the streams change because of underlying context changes. Moreover, none of the earlier classification techniques stores the data stream in full, owing to these drifting concepts.
The Optimized GA is a technique in which classification of concept-drifting data streams is performed using a streaming window and GA mechanisms such as selection, crossover, mutation and elitism, generating the solution with the best fitness value and hence the best classification rate.
Further, the OGA can be optimized by minimizing the build time of the model even for large streamed data sets, which would further enhance performance and time efficiency.
8. REFERENCES
[1] Periasamy Vivekanandan and Raju Nedunchezhian, "Mining data streams with concept drifts using genetic algorithm", Artificial Intelligence Review, Vol. 36, Issue 3, pp. 163-178, Springer, October 2011.
[2] Araujo D.L.A., Lopes H.S. and Freitas A.A., "Rule discovery with a parallel genetic algorithm", In Proceedings of the IEEE Systems, Man and Cybernetics Conference, Brazil, 1999.
[3] Wang H., "Mining Concept-Drifting Data Streams", IBM T.J. Watson Research Center, August 19, 2004.
[4] Basheer M. Al-Maqaleh and Hamid Shahbazkia, "A Genetic Algorithm for Discovering Classification Rules in Data Mining", International Journal of Computer Applications (0975-8887), Vol. 41, No. 18, March 2012.
[5] Wang H., Fan W., Yu P.S. and Han J., "Mining concept-drifting data streams using ensemble classifiers", In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226-235, 2003.
[6] Syed Shaheena and Shaik Habeeb, "Classification Rule Discovery Using Genetic Algorithm-Based Approach", IJCTT Journal, Vol. 4, Issue 8, pp. 2710-2715, August 2013.
[7] E. Padmalatha, C.R.K. Reddy and B. Padmaja Rani, "Ensemble Classification for Drifting Concept", International Journal of Computer Applications, Vol. 80, No. 11, pp. 33-36, October 2013.
[8] E. Padmalatha, C.R.K. Reddy and B. Padmaja Rani, "Classification of Concept Drift Data Streams", In Proceedings of the Fifth International Conference on Information Science and Applications (ICISA 2014), IEEE, pp. 291-295, 2014.