Indian Journal of Science and Technology, Vol 9(3), DOI: 10.17485/ijst/2016/v9i3/75971, January 2016
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645
A Survey on Clustering Techniques for Big Data Mining

T. Sajana, C. M. Sheela Rani and K. V. Narayana
KL University, Vaddeswaram – 522502, Guntur Dist., Andhra Pradesh, India;
sajana.cse@kluniversity.in, sheelarani_cse@kluniversity.in, kvnarayana@kluniversity.in

Abstract
This paper focuses on a keen study of different clustering algorithms, highlighting the characteristics of big data. A brief overview is given of various clustering algorithms, grouped under partitioning, hierarchical, density-based, grid-based and model-based approaches.

Keywords: Characteristics of Big Data, Clustering Algorithms - Partitioning, Density, Grid Based, Model Based, Homogenous Data, Hierarchical
1. Introduction
Big Data refers to the large amounts of data processed in a Data Mining environment. In other words, it is a collection of large and complex data sets that are difficult to process using traditional data processing applications. Big Data is about turning unstructured, invaluable, imperfect, complex data into usable information[1]. However, it has become difficult to maintain the huge volume of information arriving day by day from many different resources and services which were not available to the human space just a few decades ago. Very large quantities of data are produced every day by and about people, things, and their interactions. Many different groups argue about the potential benefits and costs of analyzing the information which comes from Twitter, Google, Facebook, etc. Large volumes of data are available from different online resources and services, such as sensor networks and cloud computing, which were established to serve their customers. To overcome these problems, Big Data is clustered into a compact format that is still an informative version of the entire data. Clustering techniques are therefore very useful in data mining. There are many approaches to mining data, such as neural algorithms, support vector machines, association algorithms, genetic algorithms and clustering algorithms. Among these mining techniques, clustering produces good-quality clusters by grouping unlabeled data. Clustering is the process of grouping data based on their similar properties. The main goal of this paper is to survey the various clustering algorithms for Big Data.
This paper presents a survey of clustering techniques framed by the 4 V's of Big Data characteristics - Volume, Variety, Velocity and Value[2][3]. Volume is the basic characteristic of Big Data and concerns data size, the dimensionality of the data set and outlier detection. Variety concerns the types of attributes in a data set, such as numerical, categorical, continuous, ordinal and ratio. Velocity concerns the computational cost of an algorithm over the various attributes of the data. Finally, Value concerns the input parameters needed for processing. In the present paper, an introduction to Big Data is given in Section 1, the architecture of Big Data in Section 2, a description of clustering algorithms in Section 3, and a comparison of different clustering algorithms in Section 4; Sections 5 to 9 then describe the individual algorithms in each category.
This paper presents a clear survey of various clustering algorithms[4][5][6][7], which helps researchers and students decide which algorithm is best suited for clustering based on their requirements.
2. Big Data Architecture
Over the last decade, large volumes of data have been stored in every sector, and managing, storing, analyzing and predicting such large volumes of data is what "Big Data" refers to. A data warehouse architecture cannot maintain such large data sets because it uses a centralized 3-tier architecture, whereas the Big Data architecture deals with distributed processing of data[8]. The architecture of Big Data is shown in Figure 1.
Figure 1. Big Data architecture.
3. Clustering Algorithms
This paper presents various clustering algorithms by considering the properties of Big Data characteristics such as size, noise, dimensionality, computational cost, shape of clusters, etc.[10][11]. An overview of the clustering algorithms is depicted in Figure 2.
3.1 Partitioning based Clustering Algorithms
All objects are initially considered as a single cluster. The objects are then divided into a number of partitions by iteratively relocating points between the partitions. Partitioning algorithms include K-means, K-medoids (PAM, CLARA, CLARANS and FCM) and K-modes. Partition-based algorithms can find clusters of non-convex shape.
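As a concrete illustration of the partitioning approach, the following minimal sketch runs K-means with scikit-learn; the two-blob synthetic data and the parameter values are assumptions made purely for the example.

```python
# A minimal sketch of partitioning-based clustering with scikit-learn's KMeans.
# The synthetic data and parameter choices are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # one blob around the origin
               rng.normal(5, 1, (100, 2))])  # one blob around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per partition
print(km.labels_[:5])       # cluster index assigned to each point
```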
3.2 Hierarchical Clustering Algorithms
There are two approaches to hierarchical clustering: Agglomerative (bottom-up) and Divisive (top-down). In the Agglomerative approach, each object initially forms its own cluster, and neighboring objects are successively merged based on a distance criterion such as minimum, maximum or average. The process continues until the desired clusters are formed. The Divisive approach treats the whole set of objects as a single cluster and divides it into further clusters until the desired number of clusters is formed. BIRCH, CURE, ROCK, Chameleon, ECHIDNA, Wards, SNN, GRIDCLUST and CACTUS are some of the hierarchical clustering algorithms; they form clusters of non-convex, arbitrary and hyper-rectangular shapes.
Figure 2. An overview of clustering algorithms for Big Data mining.
3.3 Density based Clustering Algorithms
Data objects are categorized into core points, border points and noise points. All the core points are connected together based on their densities to form clusters. Arbitrarily shaped clusters are formed by algorithms such as DBSCAN, OPTICS, DBCLASD, GDBSCAN, DENCLUE and SUBCLU.
3.4 Grid based Clustering Algorithms
Grid-based algorithms partition the data space into a number of cells to form a grid structure, and clusters are formed based on this grid. To form clusters, grid algorithms use subspace and hierarchical clustering techniques. STING, CLIQUE, WaveCluster, BANG, OptiGrid, MAFIA, ENCLUS, PROCLUS, ORCLUS, FC and STIRR are grid-based algorithms. Compared to all other clustering algorithms, grid algorithms are very fast. Uniform grid algorithms are not sufficient to form the desired clusters, so adaptive grid algorithms such as MAFIA and AMR were developed; arbitrarily shaped clusters are formed from the grid cells.
3.5 Model based Clustering Algorithms
Sets of data points are connected together based on various strategies such as statistical methods, conceptual methods and robust clustering methods. There are two approaches to model-based algorithms: the neural network approach and the statistical approach. Algorithms such as EM, COBWEB, CLASSIT, SOM and SLINK are well-known model-based clustering algorithms.
4. Comparison of Clustering Algorithms
The various clustering methods discussed above mine data from Big Data, and every algorithm has its own strengths and weaknesses. This paper compares the clustering algorithms in terms of the 4 V's of Big Data characteristics.
4.1 Volume
Volume refers to the ability of an algorithm to deal with large amounts of data. With respect to the Volume property, the criteria for evaluating clustering algorithms are (a) size of the data set, (b) high dimensionality and (c) handling of outliers.
• Size of the data set: A data set is a collection of attributes, which may be categorical, nominal, ordinal, interval or ratio. Many clustering algorithms support numerical and categorical data.
• High dimensionality: As the size of a big data set increases, its number of dimensions also increases. This is the curse of dimensionality.
• Outliers: Many clustering algorithms are capable of handling outliers; noisy data points cannot be grouped with the other data points.
4.2 Variety
Variety refers to the ability of a clustering algorithm to handle different types of data, such as numerical, categorical, nominal and ordinal. The criteria for clustering algorithms are (a) type of data set and (b) cluster shape.
• Type of data set: Data sets may be small or big, but many clustering algorithms support large data sets for big data mining.
• Cluster shape: The shape of the cluster formed depends on the size and type of the data set.
4.3 Velocity
Velocity refers to the computational cost of a clustering algorithm, assessed by the criterion of (a) the running time complexity of the algorithm.
• Time complexity: If an algorithm performs fewer computations, it has a lower running time. The running time of the algorithms is expressed using Big O notation.
4.4 Value
For a clustering algorithm to process data accurately and to form clusters with little computation, the input parameters play a key role. This paper lists all the algorithms and their efficiency based on the input parameters needed to mine Big Data; the values for the various clustering algorithms are given in Table 1.
Table 1. Various clustering algorithms for Big Data. Data set size, high dimensionality and outlier handling fall under Volume; data set type and cluster shape under Variety; time complexity under Velocity; the number of input parameters under Value.

| Type | Algorithm | Author | Data set size | High dimensionality | Avoids outliers | Data set type | Cluster shape | Time complexity | Inputs |
|---|---|---|---|---|---|---|---|---|---|
| Partitional | K-means[12] | Hartigan et al. 1975; Hartigan & Wong 1979 | Large | No | No | Numerical | Non-convex | O(nkd) | 1 |
| Partitional | K-medoids[14] | Hartigan & Wong 1979 | Small | Yes | Yes | Categorical | Non-convex | O(n^2 dt) | 1 |
| Partitional | K-modes[13] | - | Large | Yes | No | Categorical | Non-convex | O(n) | 1 |
| Partitional | PAM[15] | Kaufman & Rousseeuw 1990 | Small | No | No | Numerical | Non-convex | O(k(n-k)^2) | 1 |
| Partitional | CLARA[16] | Kaufman & Rousseeuw 1990 | Large | No | No | Numerical | Non-convex | O(k(40+k)^2 + k(n-k)) | 1 |
| Partitional | CLARANS[17] | Ng & Han 1994 | Large | No | No | Numerical | Non-convex | O(kn^2) | 2 |
| Partitional | FCM[9] | - | Large | No | No | Numerical | - | O(n) | 1 |
| Hierarchical | BIRCH[19] | Zhang et al. 1996 | Large | No | No | Numerical | Non-convex | O(n) | 2 |
| Hierarchical | CURE[20] | Guha et al. 1998 | Large | Yes | Yes | Numerical & categorical | Arbitrary | O(n^2 log n) | 2 |
| Hierarchical | ROCK[21] | Guha et al. 1999 | Large | No | No | Numerical & categorical | Arbitrary | O(n^2 + n·m_m·m_a + n^2 log n) | 1 |
| Hierarchical | Chameleon[22] | Karypis et al. 1999a | Large | Yes | No | All types | Arbitrary | O(n^2) | 3 |
| Hierarchical | ECHIDNA[23] | Paoluzzi et al. 1999a | Large | No | No | Multivariate | Non-convex | O(N·B(1 + log_B m)) | 2 |
| Hierarchical | Wards[39] | Ward et al. 1963 | Small | No | No | Numerical | Arbitrary | - | - |
| Hierarchical | SNN[41] | Ertoz et al. 2002 | Small | No | No | Categorical | Arbitrary | O(n^2) | 1 |
| Hierarchical | CACTUS[45] | Ganti et al. 1999a | Small | No | No | Categorical | Hyper-rectangular | O(cN) | 2 |
| Hierarchical | GRIDCLUST[46] | Schikuta et al. 1996 | Small | No | No | Numerical | Arbitrary | O(n) | 2 |
| Density based | DBSCAN[24] | Ester et al. 1996 | Large | No | No | Numerical | Arbitrary | O(n log n) for spatial data | 2 |
| Density based | OPTICS[25] | Ankerst et al. 1999 | Large | No | Yes | Numerical | Arbitrary | O(n log n) | 2 |
| Density based | DBCLASD[26] | Xu et al. 1998 | Large | No | Yes | Numerical | Arbitrary | O(3n^2) | - |
| Density based | GDBSCAN[43] | Sander et al. 1998 | Large | No | No | Numerical | Arbitrary | - | 2 |
| Density based | DENCLUE[27] | Hinneburg & Keim 1998 | Large | Yes | Yes | Numerical | Arbitrary | O(log D) | 2 |
| Density based | SUBCLU[42] | Kailing & Kriegel | Large | Yes | Yes | Numerical | Arbitrary | - | 2 |
| Grid | STING[29] | Wang et al. 1997 | Large | No | Yes | Spatial | Arbitrary | O(k) | 1 |
| Grid | WaveCluster[28] | Sheikholeslami et al. 1998 | Large | No | Yes | Numerical | Arbitrary | O(n) | 3 |
| Grid | BANG[18] | Schikuta & Erhart 1997 | Large | Large | Yes | Numerical | Arbitrary | O(n) | 2 |
| Grid | CLIQUE[30] | Aggarwal et al. 1998 | Large | No | Yes | Numerical | Arbitrary | O(C^k + mk) | 2 |
| Grid | OptiGrid[31] | Hinneburg & Keim 1999 | Large | Yes | Yes | Spatial | Arbitrary | O(nd) to O(nd log n) | 3 |
| Grid | MAFIA[44] | Goil et al. 1999 | Large | No | Yes | Numerical | Arbitrary | O(c^p + pN) | 2 |
| Grid | ENCLUS[36] | Cheng et al. 1999 | Large | No | Yes | Numerical | Arbitrary | O(ND + m^D) | 2 |
| Grid | PROCLUS[37] | Aggarwal et al. 1999a | Large | Yes | Yes | Spatial | Arbitrary | O(n) | 2 |
| Grid | ORCLUS[38] | Aggarwal & Yu 1999a | Large | Yes | Yes | Spatial | Arbitrary | O(d^3) | 2 |
| Grid | FC[11] | Barbara & Chen 2000 | Large | Yes | Yes | Numerical | Arbitrary | O(n) | 2 |
| Grid | STIRR[11] | Gibson et al. 1998 | Large | No | No | Categorical | Arbitrary | O(n) | 2 |
| Model based | EM[32] | Mitchell et al. 1997 | Large | Yes | No | Spatial | Non-convex | O(knp) | 3 |
| Model based | COBWEB[33] | Fisher 1987 | Small | No | No | Numerical | Non-convex | O(n^2) | 1 |
| Model based | CLASSIT[34] | Fisher 1987 | Small | No | No | Numerical | Non-convex | O(n^2) | 1 |
| Model based | SOM[35] | Kohonen 1990 | Small | Yes | No | Multivariate | Non-convex | O(n^2 m) | 2 |
| Model based | SLINK[40] | Sibson 1973 | Large | No | No | Numerical | Arbitrary | O(n^2) | 2 |
5. Partitioning based Clustering Algorithms

5.1 FCM - Fuzzy C-Means algorithm
The algorithm[9] is based on the K-means concept of partitioning a data set into clusters.
The algorithm is as follows:
• Initialize the fuzzy membership matrix, and calculate the cluster centroids and the objective value.
• Compute the membership values stored in the matrix.
• If the change in the objective value between consecutive iterations is less than the stopping condition, then stop.
• This process continues until a partition matrix and clusters are formed.
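The following is a compact sketch of this fuzzy C-means loop using only NumPy; the number of clusters c, the fuzzifier m, the tolerance and the random initialization are illustrative assumptions, not values prescribed by the paper.

```python
# A minimal fuzzy c-means sketch following the steps above (NumPy only).
import numpy as np

def fcm(X, c=2, m=2.0, tol=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                      # initialize fuzzy membership matrix
    for _ in range(max_iter):
        Um = U ** m
        centers = Um @ X / Um.sum(axis=1, keepdims=True)   # weighted centroids
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-10
        U_new = 1.0 / (d ** (2 / (m - 1)))
        U_new /= U_new.sum(axis=0)          # memberships in [0, 1], sum to 1
        if np.linalg.norm(U_new - U) < tol: # stop when change is below tolerance
            U = U_new
            break
        U = U_new
    return centers, U
```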
6. Hierarchical based Clustering Algorithms

6.1 BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
It is an agglomerative hierarchical algorithm which uses a Clustering Feature tree (CF-tree) and incrementally adjusts the quality of sub-clusters.
The algorithm is as follows:
• Load the data into memory: the CF-tree is constructed with one scan of the data. Subsequent phases become fast, accurate and less order-sensitive.
• Condense the data: rebuild the CF-tree with a larger threshold T.
• Global clustering: apply an existing clustering algorithm to the CF leaves.
• Cluster refining: do additional passes over the data set and reassign data points to the centroids closest to those from the step above.
• The process continues until k clusters are formed.
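As a hedged usage sketch, scikit-learn's Birch estimator implements this CF-tree scheme; its threshold parameter plays the role of T, and n_clusters drives the global clustering phase. The data and parameter values are illustrative.

```python
# Hedged usage sketch of the CF-tree scheme with scikit-learn's Birch.
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(1).normal(size=(500, 2))
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)
print(model.subcluster_centers_.shape)  # CF-leaf subcluster centroids
print(np.bincount(model.labels_))       # points per final cluster
```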
6.2 CURE - Clustering Using REpresentatives
A divisive hierarchical approach is used: well-scattered points are selected from a cluster and then shrunk towards the center of the cluster by a specified factor. Adjacent clusters are merged successively until the number of clusters reduces to the desired number.
The algorithm is as follows:
• Initially all points are in separate clusters, each cluster being defined by the point in it.
• The representative points of a cluster are generated by first selecting well-scattered objects in the cluster and then shrinking or moving them towards the cluster center by a specified factor.
• At each step of the algorithm, the two clusters with the closest pair of representative points are chosen and merged together.
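A minimal sketch of CURE's representative-point step is given below, assuming NumPy; the number of representatives and the shrinking factor alpha are illustrative choices.

```python
# Pick well-scattered points from a cluster and shrink them toward its centroid.
import numpy as np

def representatives(cluster_pts, num_rep=4, alpha=0.3):
    centroid = cluster_pts.mean(axis=0)
    # start from the point farthest from the centroid
    reps = [cluster_pts[np.argmax(np.linalg.norm(cluster_pts - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(cluster_pts)):
        # farthest-point heuristic: next rep maximizes distance to chosen reps
        d = np.min([np.linalg.norm(cluster_pts - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_pts[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)   # shrink toward the centroid
```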
6.3 ROCK - RObust Clustering algorithm for Categorical attributes
It is a hierarchical clustering algorithm which uses a link strategy to form clusters; from bottom to top, links are merged together to form clusters.
The algorithm is as follows:
• Initially consider a set of points in which every point is a cluster, and compute the links between each pair of points.
• Build a heap and maintain a heap for each cluster.
• A goodness measure based on the criterion function is calculated between pairs of clusters.
• Merge the clusters which have the maximum value of the criterion function.
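Below is a small NumPy sketch of ROCK's link computation under the commonly used Jaccard neighbor criterion; the binary attribute matrix and the threshold theta are assumptions made for illustration.

```python
# link(p, q) = number of common neighbors of p and q, where two points are
# neighbors if their Jaccard similarity exceeds theta.
import numpy as np

def links(B, theta=0.5):
    # B: binary (n_points x n_attributes) matrix of categorical indicators
    inter = B @ B.T                                        # pairwise intersections
    union = B.sum(1)[:, None] + B.sum(1)[None, :] - inter  # pairwise unions
    A = (inter / np.maximum(union, 1) >= theta).astype(int)
    np.fill_diagonal(A, 0)   # a point is not its own neighbor
    return A @ A             # entry (p, q) counts common neighbors
```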
6.4 Chameleon
It is an agglomerative hierarchical clustering algorithm based on dynamic modeling, which uses a two-phase approach of partitioning and merging to form clusters.
The algorithm is as follows:
• During the partition phase, initially consider all data points as a single cluster.
• Using a graph partitioning algorithm (the hMETIS method), divide the cluster into a relatively large number of small clusters.
• The partitioning process terminates when the largest sub-cluster contains slightly more than a specified number of vertices.
• In the merge phase, using an agglomerative hierarchical approach, select pairs of clusters whose inter-connectivity and relative closeness reach the threshold values.
• Merge the clusters which have the highest inter-connectivity and closeness.
The algorithm is repeated until none of the adjacent clusters satisfy the two conditions.
6.5 ECHIDNA
It is an agglomerative hierarchical approach for clustering network traffic data.
The steps of the algorithm are given below:
• The input data extracted from the network traffic consists of a 6-tuple of numerical and categorical attributes.
• Each record is inserted iteratively to build a hierarchical tree of clusters called a CF-tree.
• Each record is inserted into the closest cluster of the CF-tree using a combined distance function over all attributes.
• The radius of a cluster determines whether a record should be absorbed into the cluster or whether the cluster should be split.
• Once the clusters are created, all the significant nodes are combined to form a cluster tree.
The cluster tree is further compressed to create a concise and meaningful report.
6.6 SNN - Shared Nearest Neighbors
A top-to-bottom hierarchical approach is used for grouping the objects.
The steps of the algorithm are given below:
• A proximity matrix is maintained for the distances between the set of points.
• Objects are clustered together based on their nearest neighbors, and objects at maximum distance can be discarded.
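A brief sketch of the shared-nearest-neighbor similarity underlying SNN follows, assuming scikit-learn for the neighbor search; the value of k, and the fact that each point counts among its own nearest neighbors, are illustrative simplifications.

```python
# SNN similarity: the overlap of two points' k-nearest-neighbor lists.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    idx = nn.kneighbors(X, return_distance=False)  # k nearest neighbors each
    memb = np.zeros((len(X), len(X)), dtype=int)
    np.put_along_axis(memb, idx, 1, axis=1)        # neighbor indicator matrix
    return memb @ memb.T                           # shared-neighbor counts
```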
6.7 CACTUS - Clustering CaTegorical data Using Summaries
It is a very fast and scalable algorithm for finding clusters. A hierarchical structure is used to generate maximal segments or clusters. A two-step procedure describes the algorithm as follows:
• Attributes are strongly connected if their data points co-occur with large frequency.
• Clusters are formed based on the co-occurrences of attribute-value pairs.
• A cluster is formed if a segment has a number of elements α times greater than the elements of the others.
6.8 GRIDCLUST - GRID based hierarchical CLUSTering algorithm
A hierarchical clustering algorithm based on a grid structure.
The algorithm is as follows:
• Initially partition the data space to form a grid structure, and maintain the topological distribution of the data.
• Once the data are assigned to blocks of cells, density values are calculated for the blocks and sorted according to their values.
• The largest dense block is considered a cluster center.
• Using a neighbor search algorithm, clusters are formed from the remaining blocks.
7. Density based Clustering Algorithms

7.1 DBSCAN - Density Based Spatial Clustering of Applications with Noise
It is a connectivity-based algorithm which distinguishes three kinds of points: core, border and noise.
The algorithm is as follows:
• Consider the set of points to form a graph.
• Create an edge from each core point c to every other point in the neighborhood of c.
• If the set of nodes N does not contain any core points, terminate.
• Select a node X that is reachable from c.
• Repeat the procedure until all core points form clusters.
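A hedged usage sketch with scikit-learn's DBSCAN follows; it labels points by density and marks noise with the label -1. The eps and min_samples values and the random data are illustrative.

```python
# Usage sketch: DBSCAN labels core/border points by density; -1 denotes noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(2).normal(size=(300, 2))
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))               # cluster ids; -1 denotes noise points
print(db.core_sample_indices_[:10])  # indices of detected core points
```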
7.2 OPTICS - Ordering Points To Identify the Clustering Structure
It is also a connectivity-based density algorithm. OPTICS is an extension of DBSCAN based on the same parameters; its run time is about 1.6 times that of DBSCAN.
The algorithm is as follows:
• Among the set of points, select a point as a core point if at least MinPts points are found within its core distance.
• For each point c, create an edge from c to every other point within the core distance of c.
• Select the set of nodes containing core points reachable from c as a cluster.
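A hedged usage sketch with scikit-learn's OPTICS follows; it orders points by reachability distance, from which DBSCAN-like clusterings at many eps values can be read off. The min_samples value and synthetic data are illustrative.

```python
# Usage sketch: OPTICS produces a reachability ordering of the points.
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(3).normal(size=(300, 2))
opt = OPTICS(min_samples=5).fit(X)
print(opt.reachability_[opt.ordering_][:10])  # reachability-plot values
print(set(opt.labels_))                       # extracted cluster labels
```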
7.3 DBCLASD - Distribution Based Clustering of LArge Spatial Databases
It is a connectivity-based and application-oriented clustering algorithm for mining large spatial databases.
The algorithm is as follows:
• Construct a set of candidates C based on a query.
• A point remains within the cluster if the distances within the set C follow the expected distribution.
• Otherwise the point is considered an unsuccessful candidate.
• The process continues until all points with the expected distribution form clusters.
7.4 GDBSCAN - Generalized Density Based Spatial Clustering of ApplicatioNs
A connectivity-based density algorithm which forms clusters with point objects as well as spatial attributes.
The algorithm is as follows:
• An object P is selected, and all objects that are density-reachable from P with respect to the neighborhood predicate (NPred) and minimum weighted cardinality (MinWeight) are retrieved.
• If P is a core object, this procedure yields a density-connected set Ci with respect to NPred and MinWeight.
• Otherwise P does not belong to any density-connected set Ci.
This procedure is applied iteratively to each object P which has not yet been classified.
7.5 DENCLUE - DENsity based CLUstEring
Among the density-based clustering algorithms, DENCLUE is the one based on density functions. Good-quality clusters of arbitrary shape can be formed from large data sets.
The algorithm is as follows:
• Place the data set in a grid structure and find the highly dense cells based on the highest mean value.
• If d(mean(c1), mean(c2)) < 4σ, then connect c1 and c2.
• Find the density attractors using a hill-climbing approach; they should be local maxima of the overall density function.
• Merge the attractors; the merged attractors can be identified as clusters.
7.6 SUBCLU - SUBspace CLUstering
It is an efficient approach to subspace clustering based on a formal clustering notion, and it can detect clusters of arbitrary shape.
The algorithm is as follows:
• Initially generate all 1-dimensional subspace clusters in which at least one cluster is found in the subspace.
• Generate the (k+1)-dimensional candidate subspaces, test the candidates and generate the (k+1)-dimensional clusters.
• All clusters in a higher-dimensional subspace will be subsets of the clusters detected in the first clustering.
The process continues until the (k+1)-dimensional clusters are formed from the k-dimensional clusters.
8. Grid based Clustering Algorithms

8.1 STING - STatistical INformation Grid based method
It is similar to the hierarchical algorithm BIRCH and forms clusters over spatial databases.
The algorithm is as follows:
• Initially the spatial data are stored into rectangular cells using a hierarchical grid structure.
• Partition each cell into 4 child cells at the next level, with each child corresponding to a quadrant of the parent cell.
• Calculate the probability of each cell being relevant; if a cell is relevant, apply the same calculation to each of its child cells one by one.
• Find the regions of relevant cells in order to form clusters.
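An illustrative grid step in the spirit of STING follows, using NumPy only: points are binned into cells and cells are kept as relevant when their count exceeds a simple threshold. The cell resolution and the relevance test are assumptions, not the statistical test STING itself prescribes.

```python
# Bin 2-D spatial points into grid cells and flag dense ("relevant") cells.
import numpy as np

X = np.random.default_rng(4).random((1000, 2))
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=8)
relevant = counts > counts.mean()  # simple per-cell relevance test
print(int(relevant.sum()), "relevant cells out of", counts.size)
```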
8.2 WaveCluster
Among all the clustering algorithms, this one is based on signal processing. The algorithm works with numerical attributes and has a multi-resolution property; outliers can be detected easily.
The algorithm is as follows:
• Fit all the data points into cells and apply the wavelet transform to filter the data points.
• Apply the discrete wavelet transform to accumulate the data points.
• High-amplitude signals correspond to cluster interiors, while high-frequency components are used to find cluster boundaries.
• The transformed signals are applied to the attribute space to form sharper clusters and eliminate outliers easily.
8.3 BANG
It is an extension of the GRIDCLUST algorithm which initially considers all data points as blocks, but uses a BANG structure to maintain the blocks.
The algorithm is as follows:
• Divide the feature space into rectangular blocks, each containing up to a maximum of Pmax data points.
• Build a binary tree to maintain the blocks; the density indices of all blocks are calculated and sorted in decreasing order.
• Starting with the highest density index, all neighboring blocks are determined and classified in decreasing order to form a cluster.
• The process is repeated for the remaining blocks.
8.4 CLIQUE - CLustering In QUEst
A subspace clustering algorithm for numerical attributes in which a bottom-up approach is used to form clusters.
The algorithm is as follows:
• Consider the set of data points; in one pass, apply an equal-width partition to the set of points to form grid cells.
• The rectangular cells of a subspace whose density exceeds τ are placed into equal grids.
• The process continues recursively, forming q-dimensional units from (q-1)-dimensional units.
• The dense subspace units are connected to each other to form clusters of equal width.
8.5 OptiGrid - Optimal Grid
The algorithm is designed to cluster large spatial databases.
The algorithm is as follows:
• Define the data set with the best cutting hyperplanes through a set of selected projections.
• Select the best local optima cutting planes.
• Insert all cutting planes with a score greater than or equal to the minimum cut score into BEST CUT.
• Select the q cutting planes of highest score from BEST CUT and construct a multi-dimensional grid G using the q cutting planes.
• Insert all data points in D into G, determine the highly populated grid cells in G, and form the set of clusters C.
8.6 MAFIA - Merging of Adaptive FInite IntervAls
It is a descendant of the CLIQUE algorithm; instead of using a fixed-size cell grid structure with an equal number of bins in each dimension, it constructs an adaptive grid to improve the quality of clustering.
The algorithm is as follows:
• In a single pass, an adaptive grid structure is constructed by considering the set of all points.
• Compute the histogram by reading blocks of data into memory, using bins.
• Bins are grouped together based on the dominance factor α.
• Select the bins that are α times denser than average as the p candidate dense units (CDUs).
• The process continues recursively, forming new p-CDUs and merging adjacent CDUs into clusters.
8.7 ENCLUS - ENtropy based CLUStering
The algorithm is an entropy-based algorithm for clustering large data sets; ENCLUS is an adaptation of the CLIQUE algorithm.
The algorithm is as follows:
• The objects whose subspaces are spanned by attributes A1, ..., Ap with an entropy criterion H(A1, ..., Ap) < ϖ (a threshold) are selected for clustering.
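A short NumPy sketch of this entropy criterion follows: discretize a subspace into grid cells and compute the entropy H of the cell occupancy; subspaces with H below the threshold are candidates for clustering. The bin count and data are illustrative assumptions.

```python
# Entropy of the grid-cell occupancy of a candidate subspace.
import numpy as np

def subspace_entropy(X_sub, bins=10):
    counts, _ = np.histogramdd(X_sub, bins=bins)
    p = counts.ravel() / counts.sum()   # occupancy probabilities per cell
    p = p[p > 0]
    return -(p * np.log2(p)).sum()      # low entropy suggests clustered data

X = np.random.default_rng(5).normal(size=(500, 3))
print(subspace_entropy(X[:, :2]))  # entropy of the subspace spanned by A1, A2
```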
8.8 PROCLUS - PROjected CLUStering algorithm
The algorithm uses medoids, with the same clustering criterion as K-medoids.
The algorithm is defined by a three-step procedure as follows:
• Initialization: consider the set of all points and select data points randomly.
• Iteration phase: select the medoids of the clusters from the data points and define a subspace for each medoid.
• Refinement phase: select the best medoid from the set of medoids over all dimensions, then select the medoid nearest to the best medoid.
All the data points within this distance form a cluster.
8.9 ORCLUS - ORiented projected CLUStering generation algorithm
It is similar to the PROCLUS clustering algorithm but focuses on non-axis-parallel subspaces.
The algorithm is defined by three strategies - assignment, subspace determination and merge - as follows:
• Assignment: during this phase the algorithm iteratively assigns all data points to the nearest cluster centers.
• Subspace determination: to determine the subspace, calculate the covariance matrix for each cluster and the eigenvectors with the least eigenvalues.
• Merge: clusters which are near to each other and have similar directions are merged.
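An illustrative NumPy rendering of the subspace-determination step follows: the eigenvectors of a cluster's covariance matrix with the smallest eigenvalues span the projected subspace in which the cluster is tightest. The data and the subspace dimensionality are assumptions for the example.

```python
# Subspace determination in the spirit of ORCLUS: keep the directions of
# least spread for one cluster's points.
import numpy as np

pts = np.random.default_rng(9).normal(size=(100, 5))  # points of one cluster
cov = np.cov(pts, rowvar=False)                       # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                # eigenvalues ascending
subspace = eigvecs[:, :2]   # eigenvectors with the least eigenvalues (2-D)
projected = pts @ subspace  # cluster points in the projected subspace
```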
8.10 FC - Fractal Clustering algorithm
The algorithm follows a hierarchical approach, works with several layers of grids for numeric attributes, and identifies clusters of irregular shape.
The algorithm is as follows:
• Start with a data sample and a threshold value for the given set of points.
• Initialize the threshold value and scan the full data incrementally.
• Using the Hausdorff Fractal Dimension (HFD) method, tentatively add each incoming point to each cluster.
• If the smallest increase in fractal dimension exceeds a threshold τ, the point is declared an outlier and the shape of the cluster is declared irregular.
• Otherwise the point is assigned to that cluster.
8.11 STIRR - Sieving Through Iterated Reinforcement
This algorithm performs spectral partitioning using a dynamical system, as follows:
• A set of attributes is considered, and weights W = Wv are assigned to each attribute value.
• Weights are propagated over the set of attributes using a combining operator ϕ defined as ϕ(W1, ..., Wn-1) = W1 + ... + Wn-1.
• The process is stopped at the point where the dynamical system stabilizes.
9. Model Based Clustering Algorithms

9.1 EM - Expectation Maximization
This algorithm alternates two steps - expectation (E) and maximization (M).
• E-step: the current model parameter values are used to evaluate the posterior distribution of the latent variables. The objects are then fractionally assigned to each cluster based on this posterior distribution:
Q(θ, θᵗ) = E[log p(xᵍ, xᵐ | θ) | xᵍ, θᵗ]
• M-step: the fractional assignments are used to re-estimate the model parameters with the maximum likelihood rule:
θᵗ⁺¹ = arg maxθ Q(θ, θᵗ)
The process is repeated until the convergence condition is satisfied.
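As a hedged usage sketch, scikit-learn's GaussianMixture runs exactly this E/M loop, alternating posterior (fractional) assignments with maximum-likelihood parameter updates until convergence; the data and n_components value are illustrative.

```python
# Usage sketch: Gaussian mixture fitting via the EM algorithm.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
gm = GaussianMixture(n_components=2, max_iter=100, tol=1e-4).fit(X)
print(gm.means_)                # estimated cluster centers
print(gm.predict_proba(X[:3]))  # fractional (posterior) assignments
```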
9.2 COBWEB
It is an incremental model-based clustering algorithm which builds a taxonomy of clusters without a predefined number of clusters. The clusters are represented probabilistically by the conditional probability P(A = v | C) with which attribute A has value v, given that the instance belongs to class C.
The algorithm is as follows:
• The algorithm starts with an empty root node.
• Instances are added one by one.
• For each instance, the following options (operators) are considered:
- classifying the instance into an existing class;
- creating a new class and placing the instance into it;
- combining two classes into a single class (merging) and placing the new instance into the resulting hierarchy;
- splitting a class into two classes (splitting) and placing the new instance into the resulting hierarchy.
• The algorithm searches the space of possible hierarchies by applying the above operators and an evaluation function based on category utility.
9.3 SOM - Self-Organizing Map algorithm
A model-based incremental clustering algorithm based on a grid structure.
The algorithm is defined by the following process:
• Place a grid of nodes along the plane where the data points are distributed.
• Sample a data point and subject the closest node and its neighboring nodes to its influence; then sample another point, and so on.
• The procedure is repeated until all data points have been sampled several times.
• Each cluster is defined with reference to a node, and comprises those data points for which that node is the closest.
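A minimal NumPy sketch of the SOM update just described follows; the 1-D node grid, learning rate and neighborhood width are illustrative assumptions.

```python
# Nodes on a 1-D grid are pulled toward sampled points; neighbors of the
# winning node are also updated, with decaying influence.
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((500, 2))
nodes = rng.random((10, 2))                  # 10 nodes along a line
for t in range(2000):
    x = X[rng.integers(len(X))]              # sample a data point
    win = np.argmin(np.linalg.norm(nodes - x, axis=1))  # closest node wins
    for j in range(len(nodes)):
        h = np.exp(-((j - win) ** 2) / 2.0)  # neighborhood influence
        nodes[j] += 0.1 * h * (x - nodes[j]) # move node toward the sample
# each cluster: the data points for which a given node is the closest
```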
9.4 SLINK - Single LINK clustering algorithm
A model-based clustering algorithm in which a hierarchical approach is used to form clusters.
• Start with a set of points, letting each point be a singleton cluster.
• Determine the distance between each pair of points using the Euclidean distance.
• Merge the links between points, shortest links first.
• Combine the single links to form clusters.
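A hedged usage sketch with SciPy's single-linkage clustering follows; it merges the shortest links first as described, and fcluster then cuts the hierarchy into k clusters. The choice of three clusters and the random data are illustrative.

```python
# Usage sketch: single-link hierarchy over Euclidean distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(8).normal(size=(50, 2))
Z = linkage(X, method='single')                   # shortest links merged first
labels = fcluster(Z, t=3, criterion='maxclust')   # cut into 3 clusters
print(np.bincount(labels))                        # cluster sizes
```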
10. Conclusion
This paper analyzed different clustering algorithms required for processing Big Data. The study revealed that to identify outliers in large data sets, the algorithms that should be used are BIRCH, CLIQUE and ORCLUS. To perform clustering, various algorithms can be used, but to get appropriate results the present study suggests the following: using the CURE and ROCK algorithms on categorical data, arbitrarily shaped clusters will be created; using the model-based COBWEB and CLASSIT algorithms on numerical data, non-convex clusters can be formed; and for spatial data, the STING, OptiGrid, PROCLUS and ORCLUS algorithms yield arbitrarily shaped clusters.
11. References
1. Yasodha P, Ananathanarayanan NR. Analyzing Big Data to build knowledge based system for early detection of ovarian cancer. Indian Journal of Science and Technology. 2015 Jul; 8(14):1–7.
2. Pandove D, Goel S. A comprehensive study on clustering approaches for Big Data mining. IEEE Conference on Electronics and Communication Systems; Coimbatore. 2015 Feb 26-27. p. 1333–8.
3. Park H, Park J, Kwon YB. Topic clustering from selected area papers. Indian Journal of Science and Technology. 2015 Oct; 8(26):1–7.
4. Abbasi A, Younis M. A survey on clustering algorithms for wireless sensor networks. Computer Communications. 2007 Dec; 30(14-15):2826–41.
5. Aggarwal C, Zhai C. A survey of text clustering algorithms. Mining Text Data. New York, NY, USA: Springer-Verlag; 2012. p. 77–128.
6. Brank J, Grobelnik M, Mladenic D. A survey of ontology evaluation techniques. Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD); 2005. p. 166–9.
7. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks. 2005 May; 16(3):645–78.
8. Yadav C, Wang S, Kumar M. Algorithms and approaches to handle large data sets - a survey. International Journal of Computer Science and Network. 2013; 2(3):1–5.
9. Bezdek JC, Ehrlich R, Full W. FCM: The Fuzzy C-Means clustering algorithm. Computers and Geosciences. 1984; 10(2-3):191–203.
10. Fahad A, Alshatri N, Tari Z, Alamri A. A survey of clustering algorithms for Big Data: taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing. 2014 Sep; 2(3):267–79.
11. Berkhin P. A survey of clustering data mining techniques. In: Grouping Multidimensional Data. Springer; 2006. p. 25–71.
12. MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA, USA. 1967. p. 281–97.
13. Huang Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery; 1997. p. 1–8.
14. Park HS, Jun CH. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications. 2009 Mar; 36(2.2):3336–41.
15. Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. Proceedings of the International Conference on Very Large Data Bases (VLDB); 1994. p. 144–55.
16. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. USA: John Wiley and Sons; 2008.
17. Ng RT, Han J. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE). 2002 Sep/Oct; 14(5):1003–16.
18. Schikuta E, Erhart M. The BANG-clustering system: grid-based data analysis. Lecture Notes in Computer Science. 1997; 1280:513–24.
19. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1996 Jun; 25(2):103–14.
20. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1998 Jun; 27(2):73–84.
21. Guha S, Rastogi R, Shim K. ROCK: a robust clustering algorithm for categorical attributes. 15th International Conference on Data Engineering; 1999. p. 512–21.
22. Karypis G, Han EH, Kumar V. Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer. 1999 Aug; 32(8):68–75.
23. Mahmood AN, Leckie C, Udaya P. An efficient clustering scheme to exploit hierarchical data in network traffic analysis. IEEE Transactions on Knowledge and Data Engineering. 2008 Jun; 20(6):752–67.
24. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD); 1996. p. 226–31.
25. Ankerst M, Breunig M, Kriegel HP, Sander J. OPTICS: ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1999 Jun; 28(2):49–60.
26. Xu X, Ester M, Kriegel HP, Sander J. A distribution-based clustering algorithm for mining in large spatial databases. Proceedings of the 14th IEEE International Conference on Data Engineering (ICDE); Orlando, FL. 1998 Feb 23-27. p. 324–31.
27. Hinneburg A, Keim DA. An efficient approach to clustering in large multimedia databases with noise. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD); 1998. p. 58–65.
28. Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. Proceedings of the International Conference on Very Large Data Bases (VLDB); 1998. p. 428–39.
29. Wang W, Yang J, Muntz R. STING: a statistical information grid approach to spatial data mining. Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB); 1997. p. 186–95.
30. Jain AK, Dubes RC. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall; 1988.
31. Hinneburg A, Keim DA. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB); 1999. p. 506–17.
32. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. 1977; 39(1):1–38.
33. Fisher DH. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987 Sep; 2(2):139–72.
34. Gennari JH, Langley P, Fisher D. Models of incremental concept formation. Artificial Intelligence. 1989 Sep; 40(1-3):11–61.
35. Kohonen T. The self-organizing map. Neurocomputing. 1998 Nov; 21(1-3):1–6.
36. Cheng CH, Fu AW, Zhang Y. Entropy-based subspace clustering for mining numerical data. Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 1999. p. 84–93.
37. Milenova BL, Campos M. Clustering large databases with numeric and nominal values using orthogonal projections. O-Cluster; 2006. p. 1–11.
38. Aggarwal CC, Yu PS. Finding generalized projected clusters in high dimensional spaces. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 2000 Jun; 29(2):70–81.
39. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks. 2005 May; 16(3):645–78.
40. Han J, Kamber M. Data Mining: Concepts and Techniques. 2nd edition. San Mateo, CA, USA: Morgan Kaufmann; 2006.
41. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985 Dec; 2(1):193–218.
42. Kailing K, Kriegel HP, Kroger P. Density-connected subspace clustering for high-dimensional data. Proceedings of the 2004 SIAM International Conference on Data Mining; 2004. p. 246–57.
43. Varghese BM, Unnikrishnan A, Paulose Jacob K. Spatial clustering algorithms - an overview. Asian Journal of Computer Science and Information Technology. 2014; 3(1):1–8.
44. Cheng W, Wang W, Batista S. Grid-based clustering. 2009. p. 12–24.
45. Ganti V, Gehrke J, Ramakrishnan R. CACTUS - clustering categorical data using summaries. Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 1999. p. 73–83.
46. Cao Q, Bouqata B, Mackenzie PD, Messiar D, Salvo J. A grid-based clustering method for mining frequent trips from large-scale, event-based telemetry datasets. The 2009 IEEE International Conference on Systems, Man and Cybernetics; San Antonio, TX, USA. 2009 Oct. p. 2996–3001.