
A two level approach to discretize cosmetic data using Rough set theory

October 2015
INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 14(10):6147-6152


DOI:10.24297/ijct.v14i10.1826

License
CC BY 4.0

Authors:

P.M. Prasuna

Jawaharlal Nehru Technological University, Hyderabad

Ramadevi Yellasiri

Chaitanya Bharathi Institute of Technology

Discrete values play a prominent role in knowledge extraction. Most machine learning algorithms operate on discrete values. Rules discovered from discrete values are also observed to be shorter and more precise, and predictive accuracy is higher when discrete values are used. The cosmetic industry extracts features from face images of customers to analyze their facial skin problems. These feature values are continuous in nature. A predictive model with high accuracy is required to determine the cosmetic problems of customers and suggest suitable cosmetics. Existing traditional discretization techniques are not sufficient for deriving discretized data from continuous-valued cosmetic data, because a technique has to balance the loss of information intrinsic to the process against generating a reasonable number of cut points, that is, a reasonable search space. This paper proposes a two-level discretization method that combines the traditional k-means clustering technique with rough set theory to discretize the continuous features of cosmetic data.


ISSN 2277-3061

6147 | Page July 10, 2015


P.M. Prasuna, Dr. Y. Ramadevi, Dr. A. Vinay Babu

Research Scholar JNTU, Hyderabad

prasunamanikya@yahoo.com

Dr.Y. Ramadevi, Professor CBIT, Hyderabad

yrdcse.cbit@gmail.com

Dr. A.Vinay Babu, Professor JNTUHCE, Hyderabad

avb1222@jntuh.ac.in


Indexing terms/Keywords

Rough set Theory, Discretization, cut points, kmeans.

Academic Discipline And Sub-Disciplines

Data Mining and Retrieval

SUBJECT CLASSIFICATION

Discretization technique

TYPE (METHOD/APPROACH)

Rough set theory

Council for Innovative Research

Peer Review Research Publishing System

Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY

Vol. 14, No. 10

www.ijctonline.com , editorijctonline@gmail.com


INTRODUCTION

The cosmetic industry holds huge volumes of data, used not only to analyze the problems of customers but also to develop new products based on those problems. Data mining algorithms help to extract the information needed for decision making from this cosmetic data. However, many data mining and machine learning algorithms cannot be applied to it directly because the data are continuous in nature. Numeric attributes take a large number of values compared to discrete attributes, so the rules discovered from them look complex and give lower predictive accuracy. Because discrete attributes are represented by simple interval numbers, they are understandable and easier to use. Rules over discrete attributes are usually shorter and easier to understand, and hence tend to increase the accuracy of predictions. Therefore, it is essential to have good discretization techniques [1] to transform continuous-valued features into discrete-valued features. This not only speeds up the mining process but also helps in developing a better model. This paper presents a two-level discretization technique for cosmetic data which first uses the traditional k-means algorithm and then applies rough set theory to discretize the data at the attribute level.

K means algorithm

K-means is a simple unsupervised clustering technique [2]. It forms clusters in a few easy steps. First, the number of clusters k to be formed is determined. The algorithm then follows three steps: initialization, expectation and maximization. In the initialization step, k centres are created. In the expectation step, each data point is assigned to the centre closest to it. In the maximization step, each centre is recomputed from the data points assigned to it. The expectation and maximization steps are repeated until the centres no longer change. This clustering technique aims at minimizing an objective function, in this case a squared error function. The objective function used is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the data points from their respective cluster centres [3].

Kmeans Algorithm:

Make initial guesses for the centres c1, c2, ..., ck
Until there are no changes in any centre
    Use the estimated centres to classify the samples into clusters
    For i from 1 to k
        Replace ci with the mean of all of the samples for cluster i
    end_for
end_until
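The loop above can be sketched for a single numeric attribute as follows. This is a minimal illustration, not the authors' implementation: the function names `kmeans_1d` and `squared_error`, the `max_iter` safeguard, and the choice of evenly spaced sorted values as initial guesses are all assumptions made for the example.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd-style k-means on scalar values; returns (centres, labels)."""
    pts = sorted(values)
    # Assumption: initial guesses are evenly spaced picks from the sorted data.
    centres = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Classify each sample by its nearest centre (squared distance).
        labels = [min(range(k), key=lambda c: (v - centres[c]) ** 2)
                  for v in values]
        # Replace each centre with the mean of the samples assigned to it.
        new_centres = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centres.append(sum(members) / len(members) if members
                               else centres[c])
        if new_centres == centres:  # no change in any centre: stop
            break
        centres = new_centres
    return centres, labels


def squared_error(values, centres, labels):
    """Squared-error objective: sum of squared distances to assigned centres."""
    return sum((v - centres[lab]) ** 2 for v, lab in zip(values, labels))
```

Calling `kmeans_1d(values, 3)` on one attribute column returns the three centres and a cluster label per sample; `squared_error` evaluates the objective for that assignment.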

Application of k means algorithm to cosmetic data discretization

Initially, the k-means algorithm is applied to the sample cosmetic data to form the clusters; because it is unsupervised, no class labels are required [4]. This completes the basic discretization step, which discretizes the data into the specified number of intervals. The results are then passed to the second phase, which uses rough set theory [5].
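As a rough illustration of this first level, the clusters of one attribute can be turned into intervals by taking the smallest and largest value in each cluster. The helper names `clusters_to_intervals` and `discretize` are hypothetical, and the rule that a value falling between two intervals is assigned to the next interval is an assumption of this sketch, not something the paper specifies.

```python
from bisect import bisect_left


def clusters_to_intervals(values, labels):
    """Return sorted (low, high) bounds, one pair per cluster label."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return sorted((min(g), max(g)) for g in groups.values())


def discretize(value, intervals):
    """Index of the first interval whose upper bound covers the value.

    Assumption: values in a gap go to the next interval, and values above
    the last upper bound go to the last interval.
    """
    uppers = [hi for _, hi in intervals]
    return min(bisect_left(uppers, value), len(intervals) - 1)
```

The interval indices produced this way play the role of the discrete values, and the cluster labels are what the second level later treats as class labels.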

Rough Set Theory

Rough set theory was proposed by Professor Pawlak (Pawlak, 1982, 1991; Skowron, 1990) [6]. The main goal of rough set analysis is the induction (learning) of approximations of concepts. It offers mathematical tools to discover patterns hidden in data. The basic concepts of rough set theory are described below:

Approximation Space: An approximation space is a pair (U, B), where U is a nonempty finite set called the universe and B is an equivalence relation defined on U.

Information System: An information system is a pair S = (U, A), where U is a nonempty finite set called the universe and A is a nonempty finite set of attributes, i.e., $a: U \to V_a$ for $a \in A$, where $V_a$ is called the domain of a.

Decision Table (Data Table): A decision table is a special case of an information system, $S = (U, A = C \cup \{d\})$, where the attributes in C are called condition attributes and d is a designated attribute called the decision attribute.

Approximations of Sets: Let S = (U, B) be an approximation space and X be a subset of U.


The lower approximation of X in S is defined as

$$\underline{B}X = \{x \in U : [x]_B \subseteq X\}$$

The upper approximation of X in S is defined as

$$\overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\}$$

where $[x]_B$ denotes the equivalence class of x under B.

For a given set of condition attributes B, the B-positive region $POS_B(D)$ in the relation IND(D) is defined as

$$POS_B(D) = \bigcup \{\underline{B}X : X \in U/\mathrm{IND}(D)\}$$

where X ranges over the equivalence classes of IND(D). The positive region $POS_B(D)$ contains all the objects in U that can be classified without error into the distinct classes defined by IND(D), based only on the information in the relation IND(B). The greater the cardinality of $POS_B(D)$, the higher the significance of the attributes in the set B with respect to D.

The rough membership function quantifies the degree of relative overlap between X and the equivalence class to which x belongs. It is therefore also a measure of the significance of $B \subseteq A$ in describing X, and is defined by [7]

$$\mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|}$$
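The rough set concepts above can be tried out on a toy decision table. The sketch below is illustrative only: the table layout (a dict from object id to attribute values) and all function names are assumptions chosen for the example, and the rough membership value is computed as the fraction of the equivalence class of x that lies in X.

```python
from collections import defaultdict
from fractions import Fraction


def equivalence_classes(table, attrs):
    """Partition object ids by their values on attrs (the relation IND(attrs))."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())


def lower_approx(table, attrs, X):
    """Objects whose equivalence class lies entirely inside X."""
    return {x for blk in equivalence_classes(table, attrs) if blk <= X for x in blk}


def upper_approx(table, attrs, X):
    """Objects whose equivalence class overlaps X."""
    return {x for blk in equivalence_classes(table, attrs) if blk & X for x in blk}


def positive_region(table, cond, dec):
    """Union of the lower approximations of every decision class."""
    pos = set()
    for X in equivalence_classes(table, dec):
        pos |= lower_approx(table, cond, X)
    return pos


def rough_membership(table, attrs, X, x):
    """Share of the equivalence class of x that falls inside X."""
    blk = next(b for b in equivalence_classes(table, attrs) if x in b)
    return Fraction(len(blk & X), len(blk))


# Toy decision table: condition attribute "a", decision attribute "d".
table = {
    1: {"a": 0, "d": "yes"},
    2: {"a": 0, "d": "yes"},
    3: {"a": 1, "d": "yes"},
    4: {"a": 1, "d": "no"},
    5: {"a": 2, "d": "no"},
}
X = {1, 2, 3}  # objects with d = "yes"
```

Here objects 3 and 4 share the same value of "a" but differ on "d", so they fall in the upper approximation of X without belonging to the positive region.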

Application of rough set theory to refine the cut points generated by Kmeans algorithm

The traits of the clusters formed by the k-means algorithm vary, and discretization by clustering alone is not sufficient to generate cut points with minimum information loss. Hence the cut points are refined using rough set theory concepts [8]. The main aim of splitting a cluster is to refine the discretized interval, and the refinement is meant to enhance the significance of the attribute. In rough set theory the significance of an attribute $a_i$ is measured through the positive region $POS_{a_i}(D)$; hence maximizing $POS_{a_i}(D)$ maximizes the significance of the attribute. To maximize $POS_{a_i}(D)$, the clusters formed through k-means are refined further to generate new intervals or cut points. The refinement proceeds in such a way that the maximum number of objects is classified correctly by each interval of attribute $a_i$, just as they are classified by D [9]. This is done by a rough membership function applied to each interval of the attribute $a_i$ with respect to the clusters formed through k-means, which are treated as class labels.

Let the data set U contain objects of m clusters, say $\{c_1, c_2, c_3, \ldots, c_m\}$, and let the k distinct values of an attribute $a_i$ in ascending order be $\{v_{i1}, v_{i2}, v_{i3}, \ldots, v_{ik}\}$, i.e. the interval $[v_{i1}, v_{ik}]$. The rough membership function of any interval $I = [v_{i1}, v_{ij}]$ of the attribute $a_i$ for class $c_p$ is defined as

$$f(a_i, c_p, I) = \frac{|X_I^{c_p}|}{|X_I|}$$

where $X_I = \{x \mid a_i(x) \in I\}$ and $X_I^{c_p} = \{x \mid a_i(x) \in I,\ D(x) = c_p\}$.
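A small sketch may make the interval membership concrete: it counts the objects whose attribute value lies in the interval and returns the share of them that carry a given class label. The function name `interval_membership`, the parallel-list data layout, and the convention of returning 0.0 for an interval with no objects are assumptions of this example.

```python
def interval_membership(values, classes, interval, cp):
    """Fraction of objects with value in [lo, hi] whose class label is cp."""
    lo, hi = interval
    # Class labels of all objects whose attribute value falls in the interval.
    in_interval = [c for v, c in zip(values, classes) if lo <= v <= hi]
    if not in_interval:
        return 0.0  # assumption: an empty interval has zero membership
    return in_interval.count(cp) / len(in_interval)
```

For attribute values `[1, 2, 3, 4, 5, 6]` with labels `['a', 'a', 'b', 'a', 'b', 'b']`, the interval `(1, 4)` contains four objects, three of them labelled `'a'`, so the membership for `'a'` is 3/4.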

Maximizing $f(a_i, c_p, I)$ maximizes the fraction of objects in I that belong to class $c_p$, which further maximizes $POS_{a_i}(D)$ [10]. To achieve this, each cluster generated by k-means is examined and, if necessary, split into two or merged with a neighbouring cluster. The splitting process uses the rough membership function so that it maximizes $POS_{a_i}(D)$; in this way the intervals are refined. The refinement takes place as follows. Three predetermined parameters are used: Max_size is the maximum number of values that may fall in a cluster, Min_size is the minimum number of values needed to form a cluster, and Range is the maximum length of a cluster. These parameters decide whether a cluster is retained or refined further. Refinement takes place if the cluster is large or small. A cluster is large if its cardinality is greater than Max_size or its length is greater than Range, and small if its cardinality is less than Min_size. A large cluster is split into two; a small cluster is merged with other small clusters, thereby generating new cut points or intervals. This process is repeated until there is no change in the cut points or intervals.
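The large/small test described above can be written down directly. In this sketch the function name `cluster_status` and the string results are hypothetical, and the length of a cluster is assumed to be the difference between its largest and smallest value.

```python
def cluster_status(cluster_values, max_size, min_size, value_range):
    """Decide whether a cluster should be split, merged, or kept."""
    size = len(cluster_values)
    # Assumption: cluster length is the spread between its extreme values.
    length = max(cluster_values) - min(cluster_values)
    if size > max_size or length > value_range:
        return "split"   # large cluster
    if size < min_size:
        return "merge"   # small cluster
    return "keep"
```

With `max_size=5`, `min_size=2` and `value_range=10`, a cluster of six values is split, a single-value cluster is merged, and a cluster such as `[1, 2, 3]` is kept.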

Algorithm for the proposed method

Step 1: For each attribute in the data set, select the distinct values and sort them.

Step 2: Apply the k-means algorithm to form clusters.

Step 3: From the generated clusters, determine the class labels as well as the intervals.

Step 4: Refine the intervals and add the new intervals to the interval set.

Refine (I1, I2, …, Ir)
    While (the number of intervals changes) do
        For each interval Ij
            If SP-C (Ij, Max_size, Range) = True then
                Temp = Cut Point ({vj1, vj2, vj3, …, vjk})
                Replace the interval Ij with two intervals
                Ij1 = [vj1, Temp] and Ij2 = [Temp, vjk]
            Else if |Ij| < Min_size then
                If MR_C (Ij, Ik′, Max_size, Min_size) = True for either neighbour Ik′ of Ij then
                    Merge Ij into the interval Ik′
                End if
            End if
        End for
    End while

Cut Point ({vi1, vi2, vi3, …, vik})
    I = [vi1, vi(k/2)]
    MAXRMV = Max ({f (ai, cp, I)}) over all classes cp
    For each vij, j = k/2 down to 2
        I = [vi1, vij]
        Temp = Max ({f (ai, cp, I)}) over all classes cp
        If Temp > MAXRMV then
            MAXRMV = Temp
        Else
            break
        End if
    End for
    If (j < k/2) then
        return vij as cut point for the cluster
    Else
        For each vij, j = k/2 to k-1
            I = [vi1, vij]
            Temp = Max ({f (ai, cp, I)}) over all classes cp
            If Temp > MAXRMV then
                MAXRMV = Temp
            Else
                return vij as cut point for the cluster
            End if
        End for
    End if
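The cut point search can be sketched in Python as below. This is a simplified variant written for illustration, not the authors' exact procedure: it starts from the interval ending at the middle distinct value, first tries shrinking that interval while the best class membership strictly improves, otherwise tries growing it, and returns the last endpoint that gave an improvement (or the midpoint if none did). The names `max_membership` and `cut_point` and the parallel-list data layout are assumptions.

```python
def max_membership(values, classes, lo, hi):
    """Largest class share among objects whose value lies in [lo, hi]."""
    inside = [c for v, c in zip(values, classes) if lo <= v <= hi]
    return max(inside.count(c) for c in set(inside)) / len(inside)


def cut_point(values, classes):
    """Pick an endpoint v such that [min, v] is as class-pure as this search finds."""
    distinct = sorted(set(values))
    k = len(distinct)
    if k < 2:
        raise ValueError("need at least two distinct values to split")
    mid = (k - 1) // 2  # 0-based index of the middle distinct value

    def purity(j):
        return max_membership(values, classes, distinct[0], distinct[j])

    best_j, best = mid, purity(mid)
    # Phase 1: shrink the interval towards the second distinct value.
    for j in range(mid - 1, 0, -1):
        p = purity(j)
        if p > best:
            best_j, best = j, p
        else:
            break
    if best_j < mid:
        return distinct[best_j]
    # Phase 2: grow the interval towards the second-to-last distinct value.
    for j in range(mid + 1, k - 1):
        p = purity(j)
        if p > best:
            best_j, best = j, p
        else:
            break
    return distinct[best_j]
```

Because each phase stops at the first endpoint that fails to improve the membership value, the search visits at most about half of the distinct values in each direction.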

Results

Cosmetic data were collected from customers of different age groups. The facial images of the customers were captured in a sophisticated environment and the features were then extracted. The features are numeric in nature. Mining tools are applied to analyse the collected data and, as a preprocessing step to the mining process, the numeric values are discretized. To show the experimental results, a dataset of 33 samples consisting of 17 numeric features was taken. The results of applying the proposed algorithm are shown in Table 1:

Table 1

Complexity

Before the k-means algorithm is applied, the distinct values are identified and sorted. The complexity of this step is O(N log N), where N is the number of objects in the dataset. The cost of k-means is highest in the worst case for the above algorithm, i.e. when the attribute values of all objects are distinct. The complexity of the Refine function is bounded by k * N/2, where k is the number of intervals of an attribute, and the running time of the function Cut_Point is bounded by N/2. If n is the number of attributes, then the total complexity of the algorithm is bounded by

n * (N log N + N log N + k * N/2 + N)

≈ n * (N log N)

Sno  Attribute  Type     Distinct values  Intervals
1    Stype      Numeric
2    Saa        Numeric
3    S_Count    Numeric
4    A_Spots    Numeric
5    Pimples    Numeric
6    Pastules   Numeric
7    Papules    Numeric
8    Cysts      Numeric
9    B_Visi     Numeric
10   A_count    Numeric
11   p_count    Numeric
12   V_pores    Numeric
13   E_lines    Numeric
14   F_Lines    Numeric
15   D_lines    Numeric
16   E_Wrinkle  Numeric
17   h_skin     Numeric


The number of attributes n is normally small in comparison to N. Preprocessing the dataset to select relevant attributes further reduces n relative to N. Therefore, the running time of the proposed algorithm for labeled data is bounded by O(N log N).

Conclusion

The proposed method obtains natural intervals for the values of the continuous attributes, which maximize the mutual class-attribute interdependency. The method also generates what is possibly the minimum number of intervals.

Although the computational effort of the cut point search has been reduced to half of N, the size of the dataset, implementing a binary search for the cut point could further reduce the complexity of the search step.

REFERENCES

[1] Sotiris Kotsiantis, Dimitris Kanellopoulos, "Discretization Techniques: A recent survey", GESTS International Transactions on Computer Science and Engineering, Vol. 32(1), 2006, pp. 47-58.

[2] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.

[3] James G. Booth, George Casella and James P. Hobert, "Clustering using objective functions and stochastic search", J. R. Statist. Soc. B (2008) 70, Part 1, pp. 119-139.

[4] Daniela Joiţa, "Unsupervised Static Discretization Methods in Data Mining", Titu Maiorescu University, Bucharest, Romania.

[5] Xu Chenggang, "A Two-step Discretization Algorithm Based on Rough Set", 2012 International Conference on Computer Science and Electronics Engineering.

[6] Zbigniew Suraj, "An Introduction to Rough Set Theory and Its Applications", ICENCO'2004, December 27-30, 2004, Cairo, Egypt.

[7] Frida Coaquira and Edgar Acuña, "Applications of Rough Sets Theory in Data Preprocessing for Knowledge Discovery", Proceedings of the World Congress on Engineering and Computer Science 2007, San Francisco, USA.

[8] Guan Xin, Yi Xiao, He You, "Discretization of Continuous Interval-Valued Attributes in Rough Set Theory and Its Application", Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.

[9] Nandita Sengupta, "Evaluation of Rough Set Theory Based Network Traffic Data Classifier Using Different Discretization Method", International Journal of Information and Electronics Engineering, Vol. 2, No. 3, May 2012.

[10] Girish Kumar Singh, Sonajharia Minz, "Discretization Using Clustering and Rough Set Theory", Proceedings of the International Conference on Computing: Theory and Applications (ICCTA'07), 0-7695-2770-1/07, 2007.

[11] Y.T. Yu, M.F. Lau, "A comparison of MC/DC, MUMCUT and several other coverage criteria for logical decisions", Journal of Systems and Software, 2005, in press.

[12] Spector, A. Z. 1989. Achieving application requirements. In Distributed Systems, S. Mullender
