A two-level approach to discretize cosmetic data using rough set theory
P.M. Prasuna, Research Scholar, JNTU, Hyderabad
prasunamanikya@yahoo.com
Dr. Y. Ramadevi, Professor, CBIT, Hyderabad
yrdcse.cbit@gmail.com
Dr. A. Vinay Babu, Professor, JNTUHCE, Hyderabad
avb1222@jntuh.ac.in
ABSTRACT
Discrete values play a very prominent role in extracting knowledge. Most machine learning algorithms use discrete values, and it is observed that the rules discovered from discrete values are shorter and more precise. Predictive accuracy is also higher when discrete values are used. The cosmetic industry extracts features from the face images of its customers to analyze their facial skin problems. These feature values are continuous in nature. A predictive model with high accuracy is required to determine the cosmetic problems of customers and suggest suitable cosmetics. Existing traditional discretization techniques are not sufficient for deriving discretized data from continuous-valued cosmetic data, as they must balance the loss of information intrinsic to the process against generating a reasonable number of cut points, that is, a reasonable search space. This paper proposes a two-level discretization method, a combination of the traditional k-means clustering technique and rough set theory, to discretize the continuous features of cosmetic data.
Indexing terms/Keywords
Rough set theory, discretization, cut points, k-means
Academic Discipline And Sub-Disciplines
Data Mining and Retrieval
SUBJECT CLASSIFICATION
Discretization technique
TYPE (METHOD/APPROACH)
Rough set theory
Council for Innovative Research
Peer Review Research Publishing System
Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY
Vol. 14, No. 10
www.ijctonline.com, editorijctonline@gmail.com
INTRODUCTION
There are huge volumes of data in the cosmetic industry, used not only to analyze the problems of customers but also to develop new products based on those problems. Data mining algorithms help us extract the information needed for decision making from this cosmetic data. However, many mining or machine learning algorithms cannot be applied to it directly because the features are continuous in nature. Numeric data contain a large number of values compared to discrete values, so the rules discovered look complex and give less predictive accuracy. As discrete attributes are represented with simple interval numbers, they are understandable and easier to use. The rules over discrete attributes are usually shorter and easier to understand, and hence increase the accuracy of predictions. Therefore, it is essential to have good discretization techniques [1] to transform continuous-valued features into discrete-valued features. This not only speeds up the mining process but also helps in developing a better model. This paper presents a two-level discretization technique for cosmetic data which first uses the traditional k-means algorithm and then applies rough set theory to discretize the data at the attribute level.
K-means algorithm
K-means is a simple unsupervised clustering technique [2]. It follows simple, easy steps to form the clusters. First, the number of clusters to be formed, k, must be determined. The algorithm then iterates over three steps: initialization, expectation, and maximization. In the initialization step, k centres are created, where k is the predetermined number of clusters. In the expectation step, each data point is assigned to the centre closest to it, and the maximization step computes a new centre from the data points assigned to it. These steps are repeated until the centres no longer change. Finally, this clustering technique aims at minimizing an objective function, in this case a squared-error function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and is an indicator of the distance of the n data points from their respective cluster centres [3].
K-means algorithm:
Make initial guesses for the centres c1, c2, ..., ck
Until there are no changes in any centre
    o Use the current centres to assign each sample to the cluster whose centre is closest
    o For i from 1 to k
        Replace ci with the mean of all samples assigned to cluster i
    o end_for
end_until
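The following is a minimal NumPy sketch of the above loop (Lloyd's algorithm) as it could be applied to a single continuous attribute; the function and variable names are illustrative, not taken from the paper.

import numpy as np

def kmeans_1d(values, k, max_iter=100, seed=0):
    """Lloyd's k-means on a 1-D array of attribute values (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Initialization: pick k distinct values as the starting centres.
    centres = rng.choice(np.unique(values), size=k, replace=False)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(max_iter):
        # Expectation: assign every point to its closest centre.
        labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        # Maximization: move each centre to the mean of its assigned points.
        new_centres = np.array(
            [values[labels == j].mean() if np.any(labels == j) else centres[j]
             for j in range(k)])
        if np.allclose(new_centres, centres):
            break  # no centre changed: converged
        centres = new_centres
    return centres, labels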
Application of the k-means algorithm to cosmetic data discretization
Initially the k-means algorithm is applied to sample cosmetic data to form clusters, since it is unsupervised [4]. This completes the basic discretization step, which discretizes the data into the specified number of intervals. The results are then passed to the second phase, which uses rough set theory [5]; a sketch of how the clusters yield cut points follows.
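One plausible way to turn the resulting one-dimensional clusters into interval cut points is to cut midway between adjacent clusters; this midpoint convention is our assumption, not a detail stated in the paper.

import numpy as np

def clusters_to_cutpoints(values, labels, centres):
    """Derive cut points from a 1-D k-means result: one cut midway between
    the right edge of each cluster and the left edge of the next (sketch;
    assumes no cluster is empty)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(centres)  # cluster indices ordered left to right
    cuts = []
    for left, right in zip(order[:-1], order[1:]):
        lo = values[labels == left].max()   # largest value in the left cluster
        hi = values[labels == right].min()  # smallest value in the right cluster
        cuts.append((lo + hi) / 2.0)
    return cuts  # k clusters yield k-1 interior cut points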
Rough Set Theory
Rough set theory was proposed by Professor Pawlak (Pawlak, 1982, 1991; Skowron, 1990) [6]. The main goal of rough set analysis is the induction of (learning of) approximations of concepts. It offers mathematical tools to discover patterns hidden in data. The basic concepts of rough set theory are described below:
Approximation Space: An approximation space is a pair (U, B), where U is a nonempty finite set called the universe and B is an equivalence relation defined on U.
Information System: An information system is a pair S = (U, A), where U is a nonempty finite set called the universe and A is a nonempty finite set of attributes, i.e., each a ∈ A is a function a: U → Va, where Va is called the domain of a.
Decision Table (Data Table): A decision table is a special case of an information system, S = (U, A = C ∪ {d}), where the attributes in C are called condition attributes and d is a designated attribute called the decision attribute.
Approximations of Sets: Let S = (U, B) be an approximation space and X be a subset of U.
The lower approximation of X in S is defined as

$$\underline{B}X = \{ x \in U : [x]_B \subseteq X \}$$

The upper approximation of X in S is defined as

$$\overline{B}X = \{ x \in U : [x]_B \cap X \neq \varphi \}$$

For a given set of condition attributes B, the B-positive region of the relation IND(D) is defined as

$$POS_B(D) = \bigcup \{ \underline{B}X : X \in U/\mathrm{IND}(D) \}$$

The positive region POS_B(D) contains all the objects in U that can be classified without error into the distinct classes defined by IND(D), based only on the information in the relation IND(B). The greater the cardinality of POS_B(D), the higher the significance of the attributes in the set B with respect to D.
The rough membership function quantifies the degree of relative overlap between X and the equivalence class to which x belongs. It is thus also a measure of the significance of B ⊆ A for describing X, and is defined by [7]

$$\mu_X^B(x) = \frac{\left| [x]_B \cap X \right|}{\left| [x]_B \right|}$$
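These definitions translate directly into set operations. Below is a small illustrative sketch in Python (our own naming, not the paper's); equiv_class(x) is assumed to return the equivalence class [x]_B of x as a Python set.

def lower_approximation(universe, equiv_class, X):
    """B-lower approximation: objects whose entire class [x]_B lies inside X."""
    return {x for x in universe if equiv_class(x) <= X}

def upper_approximation(universe, equiv_class, X):
    """B-upper approximation: objects whose class [x]_B intersects X."""
    return {x for x in universe if equiv_class(x) & X}

def positive_region(universe, equiv_class, decision_classes):
    """POS_B(D): the union of the B-lower approximations of all decision classes."""
    pos = set()
    for X in decision_classes:
        pos |= lower_approximation(universe, equiv_class, X)
    return pos

def rough_membership(x, equiv_class, X):
    """mu_X^B(x) = |[x]_B intersect X| / |[x]_B|."""
    cls = equiv_class(x)
    return len(cls & X) / len(cls)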
Application of rough set theory to refine the cut points generated by the k-means algorithm
The traits of the clusters formed by the k-means algorithm vary, and discretization by clustering alone is not sufficient to generate cut points with minimum information loss. Hence the clusters are refined using rough set concepts [8]. The aim of splitting a cluster is to refine the discretized interval, and the refinement is meant to enhance the significance of the attribute. In rough set theory the significance of an attribute is measured through the positive region POS_ai(D); hence maximizing POS_ai(D) maximizes the significance of the attribute. To maximize POS_ai(D), the clusters formed through k-means are refined further to generate new intervals, or cut points. The refinement proceeds in such a way that the maximum number of objects is correctly classified by each interval of attribute ai, just as they are classified by D [9]. This is done with a rough membership function applied to each interval of the attribute ai with respect to the clusters formed through k-means, which are treated as class labels.
Let the data set U contain objects of m clusters, say {c1, c2, c3, ..., cm}, and let the k distinct values of an attribute ai in ascending order be {vi1, vi2, vi3, ..., vik}, i.e., the interval [vi1, vik]. The rough membership function of any interval I = [vi1, vij] of the attribute ai for class cp is defined as

$$f(a_i, c_p, I) = \frac{\left| \{ x : a_i(x) \in I,\; D(x) = c_p \} \right|}{\left| \{ x : a_i(x) \in I \} \right|}$$
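In code, f(ai, cp, I) is simply the fraction of the objects falling in the interval whose class label is cp. A sketch under our own naming, with closed intervals assumed:

def interval_membership(values, decisions, interval, c_p):
    """f(a_i, c_p, I): among objects whose attribute value falls in the closed
    interval I, the fraction whose decision (cluster label) is c_p."""
    lo, hi = interval
    in_interval = [d for v, d in zip(values, decisions) if lo <= v <= hi]
    if not in_interval:
        return 0.0  # empty interval: we define the membership as zero
    return sum(1 for d in in_interval if d == c_p) / len(in_interval)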
Maximizing f(ai, cp, I) maximizes the number of objects in the interval that are consistently assigned to a single class, which in turn maximizes POS_ai(D) [10]. To achieve this, each cluster generated by k-means is examined carefully and, if necessary, split into two or merged with a neighbouring cluster. The splitting process uses the rough membership function so as to maximize POS_ai(D); in this way the intervals are refined. The refinement takes place as follows. Three predetermined parameters are used: Max_size determines the maximum number of values that may fall in a cluster, Min_size decides the minimum number of values needed to form a cluster, and Range gives the maximum length of a cluster. These parameters decide whether a cluster is retained or refined further. Refinement takes place when a cluster is either large or small: a cluster is said to be large if its cardinality is greater than Max_size or its length is greater than Range, and small if its cardinality is less than Min_size. A large cluster is split into two; a small one is merged with another small cluster, thereby generating new cut points or intervals. This process is repeated until there is no change in the cut points or intervals.
Algorithm for the proposed method
Step 1: For each attribute in the data set, select the distinct values and sort them.
Step 2: Apply the k-means algorithm to form clusters.
Step 3: From the generated clusters, determine the class labels as well as the intervals.
Step 4: Refine the intervals and add the new intervals to the interval set.

Refine(I1, I2, ..., Ir)
    While there is a change in the number of intervals do
        For each interval Ij
            If SP-C(Ij, Max_size, Range) = True then
                Temp = Cut_Point({vj1, vj2, vj3, ..., vjk})
                Replace the interval Ij with the two intervals
                    Ij1 = [vj1, Temp] and Ij2 = [Temp, vjk]
            Else if |Ij| < Min_size then
                If MR-C(Ij, Ik', Max_size, Min_size) = True for either neighbour Ik' of Ij then
                    Merge Ij into Ik'
                End if
            End if
        End for
    End while
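The following Python sketch renders one possible reading of Refine. The split test (SP-C) and the merge rule follow the prose above, but details such as merging only into the left neighbour are our simplifications; it relies on the interval_membership sketch given earlier and the cut_point sketch given after the Cut_Point pseudocode below.

def refine(intervals, values, decisions, max_size, min_size, range_len):
    """Split large intervals and merge small ones until nothing changes.
    intervals is a list of (lo, hi) pairs covering the attribute's values."""
    changed = True
    while changed:
        changed = False
        out = []
        for lo, hi in intervals:
            vs = sorted({v for v in values if lo <= v <= hi})
            # SP-C: the cluster is "large" (too many values, or too long).
            if (len(vs) > max_size or (hi - lo) > range_len) and len(vs) > 2:
                cut = cut_point(vs, values, decisions)
                out.extend([(lo, cut), (cut, hi)])
                changed = True
            # MR-C (simplified): a "small" cluster merges into its left neighbour.
            elif len(vs) < min_size and out:
                prev_lo, _ = out.pop()
                out.append((prev_lo, hi))
                changed = True
            else:
                out.append((lo, hi))
        intervals = out
    return intervals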
Cut_Point({vi1, vi2, vi3, ..., vik})
    I = [vi1, vi,k/2]
    MAXRMV = max over cp of f(ai, cp, I)
    For each vij, j = k/2 down to 2
        I = [vi1, vij]
        Temp = max over cp of f(ai, cp, I)
        If Temp > MAXRMV then
            MAXRMV = Temp
        Else
            break
        End if
    End for
    If (j < k/2) then
        Return vij as the cut point for the cluster
    Else
        For each vij, j = k/2 to k-1
            I = [vij, vik]
            Temp = max over cp of f(ai, cp, I)
            If Temp > MAXRMV then
                MAXRMV = Temp
            Else
                Return vij as the cut point for the cluster
            End if
        End for
    End if
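Below is a hedged Python sketch of the Cut_Point search as we read the pseudocode above: start from the middle distinct value, scan leftwards while the best per-class membership keeps improving, and otherwise scan rightwards; the exact stopping rules are our reconstruction. It uses the interval_membership sketch given earlier.

def cut_point(vs, values, decisions):
    """Hill-climb outwards from the middle of the sorted distinct values vs,
    maximizing the best per-class interval membership (illustrative sketch)."""
    classes = set(decisions)

    def best_f(lo, hi):
        # max over classes c_p of f(a_i, c_p, [lo, hi])
        return max(interval_membership(values, decisions, (lo, hi), c)
                   for c in classes)

    k = len(vs)
    mid = k // 2
    maxrmv = best_f(vs[0], vs[mid])
    # Scan leftwards from the middle while the membership improves.
    for j in range(mid - 1, 0, -1):
        temp = best_f(vs[0], vs[j])
        if temp > maxrmv:
            maxrmv = temp
        else:
            if j < mid - 1:  # improvement stopped after at least one step
                return vs[j]
            break            # no improvement at all: try the right-hand side
    # Scan rightwards from the middle until the membership stops improving.
    for j in range(mid, k - 1):
        temp = best_f(vs[j], vs[-1])
        if temp > maxrmv:
            maxrmv = temp
        else:
            return vs[j]
    return vs[mid]           # fallback: cut at the middle value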
Results
Cosmetic data has been collected from customers of different age groups. The facial images of the customers are captured in a controlled environment and the features are then extracted. The features are numeric in nature. Mining tools are applied to analyse the collected data, and as a preprocessing step the numeric values are discretized. To show the experimental results, a dataset of 33 samples consisting of 17 numeric features was taken. After applying the proposed algorithm, the results are as shown in Table 1.

Table 1: Distinct values and resulting intervals per attribute

Sno | Attribute  | Type    | Distinct values | Intervals
 1  | Stype      | Numeric |  5 | 3
 2  | Saa        | Numeric | 30 | 7
 3  | S_Count    | Numeric | 29 | 6
 4  | A_Spots    | Numeric | 12 | 4
 5  | Pimples    | Numeric | 13 | 4
 6  | Pastules   | Numeric |  3 | 3
 7  | Papules    | Numeric |  5 | 4
 8  | Cysts      | Numeric |  2 | 2
 9  | B_Visi     | Numeric | 23 | 4
10  | A_count    | Numeric | 20 | 4
11  | p_count    | Numeric | 33 | 8
12  | V_pores    | Numeric | 33 | 4
13  | E_lines    | Numeric | 31 | 6
14  | F_Lines    | Numeric | 27 | 7
15  | D_lines    | Numeric | 20 | 7
16  | E_Wrinkles | Numeric |  1 | 1
17  | h_skin     | Numeric | 33 | 5
Complexity
Before the k-means algorithm is applied, the distinct values are first identified and then sorted; this process has complexity O(N log N), where N is the number of objects in the dataset. For the above algorithm the k-means step is likewise bounded by O(N log N) in the worst situation, i.e., when the attribute values of every object are distinct. The complexity of the Refine function is bounded by k * N/2, where k is the number of intervals of an attribute, and the running time of the function Cut_Point is bounded by N/2. If n is the number of attributes, then the total complexity of the algorithm is bounded by

$$n \cdot (N \log N + N \log N + k \cdot N/2 + N) \approx n \cdot N \log N$$
The number of attributes n is normally small in comparison to N, and preprocessing the dataset to select only the relevant attributes reduces n further. Therefore, the running time of the proposed algorithm for labeled data is bounded by N log N.
Conclusion
The proposed method obtains natural intervals of the values of the continuous attributes which maximize the mutual class-attribute interdependency, while generating close to the minimum possible number of intervals.
Although the computational effort of the cut-point search has been reduced to half of N, the size of the dataset, implementing a binary search for the cut point could reduce the complexity of the search step further.
REFERENCES
[1] Sotiris Kotsiantis and Dimitris Kanellopoulos, "Discretization Techniques: A Recent Survey", GESTS International Transactions on Computer Science and Engineering, Vol. 32(1), 2006, pp. 47-58.
[2] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
[3] James G. Booth, George Casella, and James P. Hobert, "Clustering Using Objective Functions and Stochastic Search", J. R. Statist. Soc. B (2008) 70, Part 1, pp. 119-139.
[4] Daniela Joiţa, "Unsupervised Static Discretization Methods in Data Mining", Titu Maiorescu University, Bucharest, Romania.
[5] Xu Chenggang, "A Two-step Discretization Algorithm Based on Rough Set", 2012 International Conference on Computer Science and Electronics Engineering.
[6] Zbigniew Suraj, "An Introduction to Rough Set Theory and Its Applications", ICENCO'2004, December 27-30, 2004, Cairo, Egypt.
[7] Frida Coaquira and Edgar Acuña, "Applications of Rough Sets Theory in Data Preprocessing for Knowledge Discovery", Proceedings of the World Congress on Engineering and Computer Science 2007, San Francisco, USA.
[8] Guan Xin, Yi Xiao, and He You, "Discretization of Continuous Interval-Valued Attributes in Rough Set Theory and Its Application", Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
[9] Nandita Sengupta, "Evaluation of Rough Set Theory Based Network Traffic Data Classifier Using Different Discretization Method", International Journal of Information and Electronics Engineering, Vol. 2, No. 3, May 2012.
[10] Girish Kumar Singh and Sonajharia Minz, "Discretization Using Clustering and Rough Set Theory", Proceedings of the International Conference on Computing: Theory and Applications (ICCTA'07), 0-7695-2770-1/07, 2007.
[11] Y.T. Yu and M.F. Lau, "A Comparison of MC/DC, MUMCUT and Several Other Coverage Criteria for Logical Decisions", Journal of Systems and Software, 2005, in press.
[12] Spector, A. Z. 1989. Achieving application requirements. In Distributed Systems, S. Mullender