Indian Journal of Science and Technology, Vol 9(3), DOI: 10.17485/ijst/2016/v9i3/75971, January 2016
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645
A Survey on Clustering Techniques for Big Data Mining

T. Sajana, C. M. Sheela Rani and K. V. Narayana
KL University, Vaddeswaram – 522502, Guntur Dist., Andhra Pradesh, India;
sajana.cse@kluniversity.in, sheelarani_cse@kluniversity.in, kvnarayana@kluniversity.in

Abstract
This paper focuses on a keen study of different clustering algorithms, highlighting the characteristics of big data. A brief overview is given of various clustering algorithms, grouped under partitioning, hierarchical, density-based, grid-based and model-based approaches.

Keywords: Characteristics of Big Data, Clustering Algorithms - Partitioning, Density, Grid Based, Model Based, Homogenous Data, Hierarchical
1. Introduction
Big Data refers to the large amounts of data processed in a Data Mining environment. In other words, it is a collection of large and complex data sets that are difficult to process using traditional data processing applications. Big Data is about turning unstructured, invaluable, imperfect, complex data into usable information[1]. However, it has become difficult to maintain the huge volume of information arriving day by day from many different resources and services which were not available to the human space just a few decades ago. Very large quantities of data are produced every day by and about people, things, and their interactions. Many different groups argue about the potential benefits and costs of analyzing the information which comes from Twitter, Google, Facebook, etc. Large volumes of data are available from different online resources and services, such as sensor networks and cloud computing, which were established to serve their customers. To overcome these problems, Big Data is clustered into a compact format that is still an informative version of the entire data. Clustering techniques are therefore very useful in data mining. There are many approaches to mining data, such as neural algorithms, support vector machines, association algorithms, genetic algorithms and clustering algorithms. Among these mining techniques, clustering produces good-quality clusters by grouping unlabeled data. Clustering is the process of grouping data based on their similar properties. The main goal of this paper is to survey the various clustering algorithms for Big Data.
This paper presents a survey of clustering techniques framed by the 4 V's of Big Data characteristics - Volume, Variety, Velocity and Value[2][3]. Volume is the basic characteristic of Big Data and concerns data size, the dimensionality of the data set and outlier detection. Variety concerns the types of attributes in a data set, such as numerical, categorical, continuous, ordinal and ratio. Velocity concerns the computational cost of an algorithm over the various attributes of the data. Finally, Value concerns the input parameters needed for processing. In the present paper, an introduction to Big Data is given in Section 1, the architecture of Big Data in Section 2, a description of clustering algorithms in Section 3, and a comparison of different clustering algorithms in Section 4; Sections 5 to 9 then describe the individual algorithms in each category.
This paper presents a clear survey of various clustering algorithms[4][5][6][7], which helps researchers and students decide which algorithm is best suited for clustering based on their requirements.
2. Big Data Architecture
Over the last decade, large volumes of data have been stored in every sector, and managing, storing, analyzing and predicting such large volumes of data is what "Big Data" refers to. A data warehouse architecture cannot maintain such large data sets because it uses a centralized 3-tier architecture, whereas the Big Data architecture deals with distributed processing of data[8]. The architecture of Big Data is shown in Figure 1.
Figure 1. Big Data architecture.
3. Clustering Algorithms
This paper presents various clustering algorithms by considering the properties of Big Data characteristics such as size, noise, dimensionality, computational cost, shape of clusters, etc.[10][11]. An overview of the clustering algorithms is depicted in Figure 2.
3.1 Partitioning based Clustering Algorithms
All objects are initially considered as a single cluster. The objects are then divided into a number of partitions by iteratively relocating points between the partitions. Partitioning algorithms include K-means, K-medoids (PAM, CLARA, CLARANS and FCM) and K-modes. Partition-based algorithms can find clusters of non-convex shape.
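As a concrete illustration of the partitioning approach, the following minimal sketch runs K-means with scikit-learn; the two-blob synthetic data and the parameter values are assumptions made purely for the example.

```python
# A minimal sketch of partitioning-based clustering with scikit-learn's KMeans.
# The synthetic data and parameter choices are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # one blob around the origin
               rng.normal(5, 1, (100, 2))])  # one blob around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per partition
print(km.labels_[:5])       # cluster index assigned to each point
```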
3.2 Hierarchical Clustering Algorithms
There are two approaches to hierarchical clustering: Agglomerative (bottom-up) and Divisive (top-down). In the Agglomerative approach, each object initially forms its own cluster, and neighboring objects are successively merged based on a distance criterion such as minimum, maximum or average. The process continues until the desired clusters are formed. The Divisive approach treats the whole set of objects as a single cluster and divides it into further clusters until the desired number of clusters is formed. BIRCH, CURE, ROCK, Chameleon, ECHIDNA, Wards, SNN, GRIDCLUST and CACTUS are some of the hierarchical clustering algorithms; they form clusters of non-convex, arbitrary and hyper-rectangular shapes.
Figure 2. An overview of clustering algorithms for Big Data mining.
3.3 Density based Clustering Algorithms
Data objects are categorized into core points, border points and noise points. All the core points are connected together based on their densities to form clusters. Arbitrarily shaped clusters are formed by algorithms such as DBSCAN, OPTICS, DBCLASD, GDBSCAN, DENCLUE and SUBCLU.
3.4 Grid based Clustering Algorithms
Grid-based algorithms partition the data space into a number of cells to form a grid structure, and clusters are formed based on this grid. To form clusters, grid algorithms use subspace and hierarchical clustering techniques. STING, CLIQUE, WaveCluster, BANG, OptiGrid, MAFIA, ENCLUS, PROCLUS, ORCLUS, FC and STIRR are grid-based algorithms. Compared to all other clustering algorithms, grid algorithms are very fast. Uniform grid algorithms are not sufficient to form the desired clusters, so adaptive grid algorithms such as MAFIA and AMR were developed; arbitrarily shaped clusters are formed from the grid cells.
3.5 Model based Clustering Algorithms
Sets of data points are connected together based on various strategies such as statistical methods, conceptual methods and robust clustering methods. There are two approaches to model-based algorithms: the neural network approach and the statistical approach. Algorithms such as EM, COBWEB, CLASSIT, SOM and SLINK are well-known model-based clustering algorithms.
4. Comparison of Clustering Algorithms
The various clustering methods discussed above mine data from Big Data, and every algorithm has its own strengths and weaknesses. This paper compares the clustering algorithms in terms of the 4 V's of Big Data characteristics.
4.1 Volume
Volume refers to the ability of an algorithm to deal with large amounts of data. With respect to the Volume property, the criteria for evaluating clustering algorithms are (a) size of the data set, (b) high dimensionality and (c) handling of outliers.
• Size of the data set: A data set is a collection of attributes, which may be categorical, nominal, ordinal, interval or ratio. Many clustering algorithms support numerical and categorical data.
• High dimensionality: As the size of a big data set increases, its number of dimensions also increases. This is the curse of dimensionality.
• Outliers: Many clustering algorithms are capable of handling outliers; noisy data points cannot be grouped with the other data points.
4.2 Variety
Variety refers to the ability of a clustering algorithm to handle different types of data, such as numerical, categorical, nominal and ordinal. The criteria for clustering algorithms are (a) type of data set and (b) cluster shape.
• Type of data set: Data sets may be small or big, but many clustering algorithms support large data sets for big data mining.
• Cluster shape: The shape of the cluster formed depends on the size and type of the data set.
4.3 Velocity
Velocity refers to the computational cost of a clustering algorithm, assessed by the criterion of (a) the running time complexity of the algorithm.
• Time complexity: If an algorithm performs fewer computations, it has a lower running time. The running time of the algorithms is expressed using Big O notation.
4.4 Value
For a clustering algorithm to process data accurately and to form clusters with little computation, the input parameters play a key role. This paper lists all the algorithms and their efficiency based on the input parameters needed to mine Big Data; the values for the various clustering algorithms are given in Table 1.
Table 1. Various clustering algorithms for Big Data. Data set size, high dimensionality and outlier handling fall under Volume; data set type and cluster shape under Variety; time complexity under Velocity; the number of input parameters under Value.

| Type | Algorithm | Author | Data set size | High dimensionality | Avoids outliers | Data set type | Cluster shape | Time complexity | Inputs |
|---|---|---|---|---|---|---|---|---|---|
| Partitional | K-means[12] | Hartigan et al. 1975; Hartigan & Wong 1979 | Large | No | No | Numerical | Non-convex | O(nkd) | 1 |
| Partitional | K-medoids[14] | Hartigan & Wong 1979 | Small | Yes | Yes | Categorical | Non-convex | O(n^2 dt) | 1 |
| Partitional | K-modes[13] | - | Large | Yes | No | Categorical | Non-convex | O(n) | 1 |
| Partitional | PAM[15] | Kaufman & Rousseeuw 1990 | Small | No | No | Numerical | Non-convex | O(k(n-k)^2) | 1 |
| Partitional | CLARA[16] | Kaufman & Rousseeuw 1990 | Large | No | No | Numerical | Non-convex | O(k(40+k)^2 + k(n-k)) | 1 |
| Partitional | CLARANS[17] | Ng & Han 1994 | Large | No | No | Numerical | Non-convex | O(kn^2) | 2 |
| Partitional | FCM[9] | - | Large | No | No | Numerical | - | O(n) | 1 |
| Hierarchical | BIRCH[19] | Zhang et al. 1996 | Large | No | No | Numerical | Non-convex | O(n) | 2 |
| Hierarchical | CURE[20] | Guha et al. 1998 | Large | Yes | Yes | Numerical & categorical | Arbitrary | O(n^2 log n) | 2 |
| Hierarchical | ROCK[21] | Guha et al. 1999 | Large | No | No | Numerical & categorical | Arbitrary | O(n^2 + n·m_m·m_a + n^2 log n) | 1 |
| Hierarchical | Chameleon[22] | Karypis et al. 1999a | Large | Yes | No | All types | Arbitrary | O(n^2) | 3 |
| Hierarchical | ECHIDNA[23] | Paoluzzi et al. 1999a | Large | No | No | Multivariate | Non-convex | O(N·B(1 + log_B m)) | 2 |
| Hierarchical | Wards[39] | Ward et al. 1963 | Small | No | No | Numerical | Arbitrary | - | - |
| Hierarchical | SNN[41] | Ertoz et al. 2002 | Small | No | No | Categorical | Arbitrary | O(n^2) | 1 |
| Hierarchical | CACTUS[45] | Ganti et al. 1999a | Small | No | No | Categorical | Hyper-rectangular | O(cN) | 2 |
| Hierarchical | GRIDCLUST[46] | Schikuta et al. 1996 | Small | No | No | Numerical | Arbitrary | O(n) | 2 |
| Density based | DBSCAN[24] | Ester et al. 1996 | Large | No | No | Numerical | Arbitrary | O(n log n) for spatial data | 2 |
| Density based | OPTICS[25] | Ankerst et al. 1999 | Large | No | Yes | Numerical | Arbitrary | O(n log n) | 2 |
| Density based | DBCLASD[26] | Xu et al. 1998 | Large | No | Yes | Numerical | Arbitrary | O(3n^2) | - |
| Density based | GDBSCAN[43] | Sander et al. 1998 | Large | No | No | Numerical | Arbitrary | - | 2 |
| Density based | DENCLUE[27] | Hinneburg & Keim 1998 | Large | Yes | Yes | Numerical | Arbitrary | O(log D) | 2 |
| Density based | SUBCLU[42] | Kailing & Kriegel | Large | Yes | Yes | Numerical | Arbitrary | - | 2 |
| Grid | STING[29] | Wang et al. 1997 | Large | No | Yes | Spatial | Arbitrary | O(k) | 1 |
| Grid | WaveCluster[28] | Sheikholeslami et al. 1998 | Large | No | Yes | Numerical | Arbitrary | O(n) | 3 |
| Grid | BANG[18] | Schikuta & Erhart 1997 | Large | Large | Yes | Numerical | Arbitrary | O(n) | 2 |
| Grid | CLIQUE[30] | Aggarwal et al. 1998 | Large | No | Yes | Numerical | Arbitrary | O(C^k + mk) | 2 |
| Grid | OptiGrid[31] | Hinneburg & Keim 1999 | Large | Yes | Yes | Spatial | Arbitrary | O(nd) to O(nd log n) | 3 |
| Grid | MAFIA[44] | Goil et al. 1999 | Large | No | Yes | Numerical | Arbitrary | O(c^p + pN) | 2 |
| Grid | ENCLUS[36] | Cheng et al. 1999 | Large | No | Yes | Numerical | Arbitrary | O(ND + m^D) | 2 |
| Grid | PROCLUS[37] | Aggarwal et al. 1999a | Large | Yes | Yes | Spatial | Arbitrary | O(n) | 2 |
| Grid | ORCLUS[38] | Aggarwal & Yu 1999a | Large | Yes | Yes | Spatial | Arbitrary | O(d^3) | 2 |
| Grid | FC[11] | Barbara & Chen 2000 | Large | Yes | Yes | Numerical | Arbitrary | O(n) | 2 |
| Grid | STIRR[11] | Gibson et al. 1998 | Large | No | No | Categorical | Arbitrary | O(n) | 2 |
| Model based | EM[32] | Mitchell et al. 1997 | Large | Yes | No | Spatial | Non-convex | O(knp) | 3 |
| Model based | COBWEB[33] | Fisher 1987 | Small | No | No | Numerical | Non-convex | O(n^2) | 1 |
| Model based | CLASSIT[34] | Fisher 1987 | Small | No | No | Numerical | Non-convex | O(n^2) | 1 |
| Model based | SOM[35] | Kohonen 1990 | Small | Yes | No | Multivariate | Non-convex | O(n^2 m) | 2 |
| Model based | SLINK[40] | Sibson 1973 | Large | No | No | Numerical | Arbitrary | O(n^2) | 2 |
5. Partitioning based Clustering Algorithms

5.1 FCM - Fuzzy C-Means algorithm
The algorithm[9] is based on the K-means concept of partitioning a data set into clusters.
The algorithm is as follows:
• Initialize the fuzzy membership matrix, and calculate the cluster centroids and the objective value.
• Compute the membership values stored in the matrix.
• If the change in the objective value between consecutive iterations is less than the stopping condition, then stop.
• This process continues until a partition matrix and clusters are formed.
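The following is a compact sketch of this fuzzy C-means loop using only NumPy; the number of clusters c, the fuzzifier m, the tolerance and the random initialization are illustrative assumptions, not values prescribed by the paper.

```python
# A minimal fuzzy c-means sketch following the steps above (NumPy only).
import numpy as np

def fcm(X, c=2, m=2.0, tol=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                      # initialize fuzzy membership matrix
    for _ in range(max_iter):
        Um = U ** m
        centers = Um @ X / Um.sum(axis=1, keepdims=True)   # weighted centroids
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-10
        U_new = 1.0 / (d ** (2 / (m - 1)))
        U_new /= U_new.sum(axis=0)          # memberships in [0, 1], sum to 1
        if np.linalg.norm(U_new - U) < tol: # stop when change is below tolerance
            U = U_new
            break
        U = U_new
    return centers, U
```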
6. Hierarchical based Clustering Algorithms

6.1 BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
It is an agglomerative hierarchical algorithm which uses a Clustering Feature tree (CF-tree) and incrementally adjusts the quality of sub-clusters.
The algorithm is as follows:
• Load the data into memory: the CF-tree is constructed with one scan of the data. Subsequent phases become fast, accurate and less order-sensitive.
• Condense the data: rebuild the CF-tree with a larger threshold T.
• Global clustering: apply an existing clustering algorithm to the CF leaves.
• Cluster refining: do additional passes over the data set and reassign data points to the centroids closest to those from the step above.
• The process continues until k clusters are formed.
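As a hedged usage sketch, scikit-learn's Birch estimator implements this CF-tree scheme; its threshold parameter plays the role of T, and n_clusters drives the global clustering phase. The data and parameter values are illustrative.

```python
# Hedged usage sketch of the CF-tree scheme with scikit-learn's Birch.
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(1).normal(size=(500, 2))
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)
print(model.subcluster_centers_.shape)  # CF-leaf subcluster centroids
print(np.bincount(model.labels_))       # points per final cluster
```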
6.2 CURE - Clustering Using REpresentatives
A divisive hierarchical approach is used: well-scattered points are selected from a cluster and then shrunk towards the center of the cluster by a specified factor. Adjacent clusters are merged successively until the number of clusters reduces to the desired number.
The algorithm is as follows:
• Initially all points are in separate clusters, each cluster being defined by the point in it.
• The representative points of a cluster are generated by first selecting well-scattered objects in the cluster and then shrinking or moving them towards the cluster center by a specified factor.
• At each step of the algorithm, the two clusters with the closest pair of representative points are chosen and merged together.
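A minimal sketch of CURE's representative-point step is given below, assuming NumPy; the number of representatives and the shrinking factor alpha are illustrative choices.

```python
# Pick well-scattered points from a cluster and shrink them toward its centroid.
import numpy as np

def representatives(cluster_pts, num_rep=4, alpha=0.3):
    centroid = cluster_pts.mean(axis=0)
    # start from the point farthest from the centroid
    reps = [cluster_pts[np.argmax(np.linalg.norm(cluster_pts - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(cluster_pts)):
        # farthest-point heuristic: next rep maximizes distance to chosen reps
        d = np.min([np.linalg.norm(cluster_pts - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_pts[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)   # shrink toward the centroid
```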
6.3 ROCK - RObust Clustering algorithm for Categorical attributes
It is a hierarchical clustering algorithm which uses a link strategy to form clusters; from bottom to top, links are merged together to form clusters.
The algorithm is as follows:
• Initially consider a set of points in which every point is a cluster, and compute the links between each pair of points.
• Build a heap and maintain a heap for each cluster.
• A goodness measure based on the criterion function is calculated between pairs of clusters.
• Merge the clusters which have the maximum value of the criterion function.
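Below is a small NumPy sketch of ROCK's link computation under the commonly used Jaccard neighbor criterion; the binary attribute matrix and the threshold theta are assumptions made for illustration.

```python
# link(p, q) = number of common neighbors of p and q, where two points are
# neighbors if their Jaccard similarity exceeds theta.
import numpy as np

def links(B, theta=0.5):
    # B: binary (n_points x n_attributes) matrix of categorical indicators
    inter = B @ B.T                                        # pairwise intersections
    union = B.sum(1)[:, None] + B.sum(1)[None, :] - inter  # pairwise unions
    A = (inter / np.maximum(union, 1) >= theta).astype(int)
    np.fill_diagonal(A, 0)   # a point is not its own neighbor
    return A @ A             # entry (p, q) counts common neighbors
```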
6.4 Chameleon
It is an agglomerative hierarchical clustering algorithm based on dynamic modeling, which uses a two-phase approach of partitioning and merging to form clusters.
The algorithm is as follows:
• During the partition phase, initially consider all data points as a single cluster.
• Using a graph partitioning algorithm (the hMETIS method), divide the cluster into a relatively large number of small clusters.
• The partitioning process terminates when the largest sub-cluster contains slightly more than a specified number of vertices.
• In the merge phase, using an agglomerative hierarchical approach, select pairs of clusters whose inter-connectivity and relative closeness reach the threshold values.
• Merge the clusters which have the highest inter-connectivity and closeness.
The algorithm is repeated until none of the adjacent clusters satisfy the two conditions.
6.5 ECHIDNA
It is an agglomerative hierarchical approach for clustering network traffic data.
The steps of the algorithm are given below:
• The input data extracted from the network traffic consists of a 6-tuple of numerical and categorical attributes.
• Each record is inserted iteratively to build a hierarchical tree of clusters called a CF-tree.
• Each record is inserted into the closest cluster of the CF-tree using a combined distance function over all attributes.
• The radius of a cluster determines whether a record should be absorbed into the cluster or whether the cluster should be split.
• Once the clusters are created, all the significant nodes are combined to form a cluster tree.
The cluster tree is further compressed to create a concise and meaningful report.
6.6 SNN - Shared Nearest Neighbors
A top-to-bottom hierarchical approach is used for grouping the objects.
The steps of the algorithm are given below:
• A proximity matrix is maintained for the distances between the set of points.
• Objects are clustered together based on their nearest neighbors, and objects at maximum distance can be discarded.
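A brief sketch of the shared-nearest-neighbor similarity underlying SNN follows, assuming scikit-learn for the neighbor search; the value of k, and the fact that each point counts among its own nearest neighbors, are illustrative simplifications.

```python
# SNN similarity: the overlap of two points' k-nearest-neighbor lists.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    idx = nn.kneighbors(X, return_distance=False)  # k nearest neighbors each
    memb = np.zeros((len(X), len(X)), dtype=int)
    np.put_along_axis(memb, idx, 1, axis=1)        # neighbor indicator matrix
    return memb @ memb.T                           # shared-neighbor counts
```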
6.7 CACTUS - Clustering CaTegorical data Using Summaries
It is a very fast and scalable algorithm for finding clusters. A hierarchical structure is used to generate maximal segments or clusters. A two-step procedure describes the algorithm as follows:
• Attributes are strongly connected if their data points co-occur with large frequency.
• Clusters are formed based on the co-occurrences of attribute-value pairs.
• A cluster is formed if a segment has a number of elements α times greater than the elements of the others.
6.8 GRIDCLUST - GRID based hierarchical CLUSTering algorithm
A hierarchical clustering algorithm based on a grid structure.
The algorithm is as follows:
• Initially partition the data space to form a grid structure, and maintain the topological distribution of the data.
• Once the data are assigned to blocks of cells, density values are calculated for the blocks and sorted according to their values.
• The largest dense block is considered a cluster center.
• Using a neighbor search algorithm, clusters are formed from the remaining blocks.
7. Density based Clustering Algorithms

7.1 DBSCAN - Density Based Spatial Clustering of Applications with Noise
It is a connectivity-based algorithm which distinguishes three kinds of points: core, border and noise.
The algorithm is as follows:
• Consider the set of points to form a graph.
• Create an edge from each core point c to every other point in the neighborhood of c.
• If the set of nodes N does not contain any core points, terminate.
• Select a node X that is reachable from c.
• Repeat the procedure until all core points form clusters.
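A hedged usage sketch with scikit-learn's DBSCAN follows; it labels points by density and marks noise with the label -1. The eps and min_samples values and the random data are illustrative.

```python
# Usage sketch: DBSCAN labels core/border points by density; -1 denotes noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(2).normal(size=(300, 2))
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))               # cluster ids; -1 denotes noise points
print(db.core_sample_indices_[:10])  # indices of detected core points
```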
7.2 OPTICS - Ordering Points To Identify the Clustering Structure
It is also a connectivity-based density algorithm. OPTICS is an extension of DBSCAN based on the same parameters; its run time is about 1.6 times that of DBSCAN.
The algorithm is as follows:
• Among the set of points, select a point as a core point if at least MinPts points are found within its core distance.
• For each point c, create an edge from c to every other point within the core distance of c.
• Select the set of nodes containing core points reachable from c as a cluster.
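A hedged usage sketch with scikit-learn's OPTICS follows; it orders points by reachability distance, from which DBSCAN-like clusterings at many eps values can be read off. The min_samples value and synthetic data are illustrative.

```python
# Usage sketch: OPTICS produces a reachability ordering of the points.
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(3).normal(size=(300, 2))
opt = OPTICS(min_samples=5).fit(X)
print(opt.reachability_[opt.ordering_][:10])  # reachability-plot values
print(set(opt.labels_))                       # extracted cluster labels
```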
7.3 DBCLASD - Distribution Based Clustering of LArge Spatial Databases
It is a connectivity-based and application-oriented clustering algorithm for mining large spatial databases.
The algorithm is as follows:
• Construct a set of candidates C based on a query.
• A point remains within the cluster if the distances within the set C follow the expected distribution.
• Otherwise the point is considered an unsuccessful candidate.
• The process continues until all points with the expected distribution form clusters.
7.4 GDBSCAN - Generalized Density Based Spatial Clustering of ApplicatioNs
A connectivity-based density algorithm which forms clusters with point objects as well as spatial attributes.
The algorithm is as follows:
• An object P is selected, and all objects that are density-reachable from P with respect to the neighborhood predicate (NPred) and minimum weighted cardinality (MinWeight) are retrieved.
• If P is a core object, this procedure yields a density-connected set Ci with respect to NPred and MinWeight.
• Otherwise P does not belong to any density-connected set Ci.
This procedure is applied iteratively to each object P which has not yet been classified.
7.5 DENCLUE - DENsity based CLUstEring
Among the density-based clustering algorithms, DENCLUE is the one based on density functions. Good-quality clusters of arbitrary shape can be formed from large data sets.
The algorithm is as follows:
• Place the data set in a grid structure and find the highly dense cells based on the highest mean value.
• If d(mean(c1), mean(c2)) < 4σ, then connect c1 and c2.
• Find the density attractors using a hill-climbing approach; they should be local maxima of the overall density function.
• Merge the attractors; the merged attractors can be identified as clusters.
7.6 SUBCLU - SUBspace CLUstering
It is an efficient approach to subspace clustering based on a formal clustering notion, and it can detect clusters of arbitrary shape.
The algorithm is as follows:
• Initially generate all 1-dimensional subspace clusters in which at least one cluster is found in the subspace.
• Generate the (k+1)-dimensional candidate subspaces, test the candidates and generate the (k+1)-dimensional clusters.
• All clusters in a higher-dimensional subspace will be subsets of the clusters detected in the first clustering.
The process continues until the (k+1)-dimensional clusters are formed from the k-dimensional clusters.
8. Grid based Clustering Algorithms

8.1 STING - STatistical INformation Grid based method
It is similar to the hierarchical algorithm BIRCH and forms clusters over spatial databases.
The algorithm is as follows:
• Initially the spatial data are stored into rectangular cells using a hierarchical grid structure.
• Partition each cell into 4 child cells at the next level, with each child corresponding to a quadrant of the parent cell.
• Calculate the probability of each cell being relevant; if a cell is relevant, apply the same calculation to each of its child cells one by one.
• Find the regions of relevant cells in order to form clusters.
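An illustrative grid step in the spirit of STING follows, using NumPy only: points are binned into cells and cells are kept as relevant when their count exceeds a simple threshold. The cell resolution and the relevance test are assumptions, not the statistical test STING itself prescribes.

```python
# Bin 2-D spatial points into grid cells and flag dense ("relevant") cells.
import numpy as np

X = np.random.default_rng(4).random((1000, 2))
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=8)
relevant = counts > counts.mean()  # simple per-cell relevance test
print(int(relevant.sum()), "relevant cells out of", counts.size)
```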
8.2 WaveCluster
Among all the clustering algorithms, this one is based on signal processing. The algorithm works with numerical attributes and has a multi-resolution property; outliers can be detected easily.
The algorithm is as follows:
• Fit all the data points into cells and apply the wavelet transform to filter the data points.
• Apply the discrete wavelet transform to accumulate the data points.
• High-amplitude signals correspond to cluster interiors, while high-frequency components are used to find cluster boundaries.
• The transformed signals are applied to the attribute space to form sharper clusters and eliminate outliers easily.
8.3 BANG
It is an extension of the GRIDCLUST algorithm which initially considers all data points as blocks, but uses a BANG structure to maintain the blocks.
The algorithm is as follows:
• Divide the feature space into rectangular blocks, each containing up to a maximum of Pmax data points.
• Build a binary tree to maintain the blocks; the density indices of all blocks are calculated and sorted in decreasing order.
• Starting with the highest density index, all neighboring blocks are determined and classified in decreasing order to form a cluster.
• The process is repeated for the remaining blocks.
8.4 CLIQUE - CLustering In QUEst
A subspace clustering algorithm for numerical attributes in which a bottom-up approach is used to form clusters.
The algorithm is as follows:
• Consider the set of data points; in one pass, apply an equal-width partition to the set of points to form grid cells.
• The rectangular cells of a subspace whose density exceeds τ are placed into equal grids.
• The process continues recursively, forming q-dimensional units from (q-1)-dimensional units.
• The dense subspace units are connected to each other to form clusters of equal width.
8.5 OptiGrid - Optimal Grid
The algorithm is designed to cluster large spatial databases.
The algorithm is as follows:
• Define the data set with the best cutting hyperplanes through a set of selected projections.
• Select the best local optima cutting planes.
• Insert all cutting planes with a score greater than or equal to the minimum cut score into BEST CUT.
• Select the q cutting planes of highest score from BEST CUT and construct a multi-dimensional grid G using the q cutting planes.
• Insert all data points in D into G, determine the highly populated grid cells in G, and form the set of clusters C.
8.6 MAFIA - Merging of Adaptive FInite IntervAls
It is a descendant of the CLIQUE algorithm; instead of using a fixed-size cell grid structure with an equal number of bins in each dimension, it constructs an adaptive grid to improve the quality of clustering.
The algorithm is as follows:
• In a single pass, an adaptive grid structure is constructed by considering the set of all points.
• Compute the histogram by reading blocks of data into memory, using bins.
• Bins are grouped together based on the dominance factor α.
• Select the bins that are α times denser than average as the p candidate dense units (CDUs).
• The process continues recursively, forming new p-CDUs and merging adjacent CDUs into clusters.
8.7 ENCLUS - ENtropy based CLUStering
The algorithm is an entropy-based algorithm for clustering large data sets; ENCLUS is an adaptation of the CLIQUE algorithm.
The algorithm is as follows:
• The objects whose subspaces are spanned by attributes A1, ..., Ap with an entropy criterion H(A1, ..., Ap) < ϖ (a threshold) are selected for clustering.
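A short NumPy sketch of this entropy criterion follows: discretize a subspace into grid cells and compute the entropy H of the cell occupancy; subspaces with H below the threshold are candidates for clustering. The bin count and data are illustrative assumptions.

```python
# Entropy of the grid-cell occupancy of a candidate subspace.
import numpy as np

def subspace_entropy(X_sub, bins=10):
    counts, _ = np.histogramdd(X_sub, bins=bins)
    p = counts.ravel() / counts.sum()   # occupancy probabilities per cell
    p = p[p > 0]
    return -(p * np.log2(p)).sum()      # low entropy suggests clustered data

X = np.random.default_rng(5).normal(size=(500, 3))
print(subspace_entropy(X[:, :2]))  # entropy of the subspace spanned by A1, A2
```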
8.8 PROCLUS - PROjected CLUStering algorithm
The algorithm uses medoids, with the same clustering criterion as K-medoids.
The algorithm is defined by a three-step procedure as follows:
• Initialization: consider the set of all points and select data points randomly.
• Iteration phase: select the medoids of the clusters from the data points and define a subspace for each medoid.
• Refinement phase: select the best medoid from the set of medoids over all dimensions, then select the medoid nearest to the best medoid.
All the data points within this distance form a cluster.
8.9 ORCLUS - ORiented projected CLUStering generation algorithm
It is similar to the PROCLUS clustering algorithm but focuses on non-axis-parallel subspaces.
The algorithm is defined by three strategies - assignment, subspace determination and merge - as follows:
• Assignment: during this phase the algorithm iteratively assigns all data points to the nearest cluster centers.
• Subspace determination: to determine the subspace, calculate the covariance matrix for each cluster and the eigenvectors with the least eigenvalues.
• Merge: clusters which are near to each other and have similar directions are merged.
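An illustrative NumPy rendering of the subspace-determination step follows: the eigenvectors of a cluster's covariance matrix with the smallest eigenvalues span the projected subspace in which the cluster is tightest. The data and the subspace dimensionality are assumptions for the example.

```python
# Subspace determination in the spirit of ORCLUS: keep the directions of
# least spread for one cluster's points.
import numpy as np

pts = np.random.default_rng(9).normal(size=(100, 5))  # points of one cluster
cov = np.cov(pts, rowvar=False)                       # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                # eigenvalues ascending
subspace = eigvecs[:, :2]   # eigenvectors with the least eigenvalues (2-D)
projected = pts @ subspace  # cluster points in the projected subspace
```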
8.10 FC - Fractal Clustering algorithm
The algorithm follows a hierarchical approach, works with several layers of grids for numeric attributes, and identifies clusters of irregular shape.
The algorithm is as follows:
• Start with a data sample and a threshold value for the given set of points.
• Initialize the threshold value and scan the full data incrementally.
• Using the Hausdorff Fractal Dimension (HFD) method, tentatively add each incoming point to each cluster.
• If the smallest increase in fractal dimension exceeds a threshold τ, the point is declared an outlier and the shape of the cluster is declared irregular.
• Otherwise the point is assigned to that cluster.
8.11 STIRR - Sieving Through Iterated Reinforcement
This algorithm performs spectral partitioning using a dynamical system, as follows:
• A set of attributes is considered, and weights W = Wv are assigned to each attribute value.
• Weights are propagated over the set of attributes using a combining operator ϕ defined as ϕ(W1, ..., Wn-1) = W1 + ... + Wn-1.
• The process is stopped at the point where the dynamical system stabilizes.
9. Model Based Clustering Algorithms

9.1 EM - Expectation Maximization
This algorithm alternates two steps - expectation (E) and maximization (M).
• E-step: the current model parameter values are used to evaluate the posterior distribution of the latent variables. The objects are then fractionally assigned to each cluster based on this posterior distribution:
Q(θ, θᵗ) = E[log p(xᵍ, xᵐ | θ) | xᵍ, θᵗ]
• M-step: the fractional assignments are used to re-estimate the model parameters with the maximum likelihood rule:
θᵗ⁺¹ = arg maxθ Q(θ, θᵗ)
The process is repeated until the convergence condition is satisfied.
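As a hedged usage sketch, scikit-learn's GaussianMixture runs exactly this E/M loop, alternating posterior (fractional) assignments with maximum-likelihood parameter updates until convergence; the data and n_components value are illustrative.

```python
# Usage sketch: Gaussian mixture fitting via the EM algorithm.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
gm = GaussianMixture(n_components=2, max_iter=100, tol=1e-4).fit(X)
print(gm.means_)                # estimated cluster centers
print(gm.predict_proba(X[:3]))  # fractional (posterior) assignments
```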
9.2 COBWEB
It is an incremental model-based clustering algorithm which builds a taxonomy of clusters without a predefined number of clusters. The clusters are represented probabilistically by the conditional probability P(A = v | C) with which attribute A has value v, given that the instance belongs to class C.
The algorithm is as follows:
• The algorithm starts with an empty root node.
• Instances are added one by one.
• For each instance, the following options (operators) are considered:
- classifying the instance into an existing class;
- creating a new class and placing the instance into it;
- combining two classes into a single class (merging) and placing the new instance into the resulting hierarchy;
- splitting a class into two classes (splitting) and placing the new instance into the resulting hierarchy.
• The algorithm searches the space of possible hierarchies by applying the above operators and an evaluation function based on category utility.
9.3 SOM - Self-Organizing Map algorithm
A model-based incremental clustering algorithm based on a grid structure.
The algorithm is defined by the following process:
• Place a grid of nodes along the plane where the data points are distributed.
• Sample a data point and subject the closest node and its neighboring nodes to its influence; then sample another point, and so on.
• The procedure is repeated until all data points have been sampled several times.
• Each cluster is defined with reference to a node, and comprises those data points for which that node is the closest.
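A minimal NumPy sketch of the SOM update just described follows; the 1-D node grid, learning rate and neighborhood width are illustrative assumptions.

```python
# Nodes on a 1-D grid are pulled toward sampled points; neighbors of the
# winning node are also updated, with decaying influence.
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((500, 2))
nodes = rng.random((10, 2))                  # 10 nodes along a line
for t in range(2000):
    x = X[rng.integers(len(X))]              # sample a data point
    win = np.argmin(np.linalg.norm(nodes - x, axis=1))  # closest node wins
    for j in range(len(nodes)):
        h = np.exp(-((j - win) ** 2) / 2.0)  # neighborhood influence
        nodes[j] += 0.1 * h * (x - nodes[j]) # move node toward the sample
# each cluster: the data points for which a given node is the closest
```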
9.4 SLINK - Single LINK clustering algorithm
A model-based clustering algorithm in which a hierarchical approach is used to form clusters.
• Start with a set of points, letting each point be a singleton cluster.
• Determine the distance between each pair of points using the Euclidean distance.
• Merge the links between points, shortest links first.
• Combine the single links to form clusters.
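A hedged usage sketch with SciPy's single-linkage clustering follows; it merges the shortest links first as described, and fcluster then cuts the hierarchy into k clusters. The choice of three clusters and the random data are illustrative.

```python
# Usage sketch: single-link hierarchy over Euclidean distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(8).normal(size=(50, 2))
Z = linkage(X, method='single')                   # shortest links merged first
labels = fcluster(Z, t=3, criterion='maxclust')   # cut into 3 clusters
print(np.bincount(labels))                        # cluster sizes
```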
10. Conclusion
This paper analyzed different clustering algorithms required for processing Big Data. The study revealed that to identify outliers in large data sets, the algorithms that should be used are BIRCH, CLIQUE and ORCLUS. To perform clustering, various algorithms can be used, but to get appropriate results the present study suggests the following: using the CURE and ROCK algorithms on categorical data, arbitrarily shaped clusters will be created; using the model-based COBWEB and CLASSIT algorithms on numerical data, non-convex clusters can be formed; and for spatial data, the STING, OptiGrid, PROCLUS and ORCLUS algorithms yield arbitrarily shaped clusters.
11. References
1. Yasodha P, Ananathanarayanan NR. Analyzing Big Data to build knowledge based system for early detection of ovarian cancer. Indian Journal of Science and Technology. 2015 Jul; 8(14):1–7.
2. Pandove D, Goel S. A comprehensive study on clustering approaches for Big Data mining. IEEE Conference on Electronics and Communication Systems; Coimbatore. 2015 Feb 26-27. p. 1333–8.
3. Park H, Park J, Kwon YB. Topic clustering from selected area papers. Indian Journal of Science and Technology. 2015 Oct; 8(26):1–7.
4. Abbasi A, Younis M. A survey on clustering algorithms for wireless sensor networks. Computer Communications. 2007 Dec; 30(14-15):2826–41.
5. Aggarwal C, Zhai C. A survey of text clustering algorithms. Mining Text Data. New York, NY, USA: Springer-Verlag; 2012. p. 77–128.
6. Brank J, Grobelnik M, Mladenic D. A survey of ontology evaluation techniques. Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD); 2005. p. 166–9.
7. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks. 2005 May; 16(3):645–78.
8. Yadav C, Wang S, Kumar M. Algorithms and approaches to handle large data sets - a survey. International Journal of Computer Science and Network. 2013; 2(3):1–5.
9. Bezdek JC, Ehrlich R, Full W. FCM: The Fuzzy C-Means clustering algorithm. Computers and Geosciences. 1984; 10(2-3):191–203.
10. Fahad A, Alshatri N, Tari Z, Alamri A. A survey of clustering algorithms for Big Data: taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing. 2014 Sep; 2(3):267–79.
11. Berkhin P. A survey of clustering data mining techniques. In: Grouping Multidimensional Data. Springer; 2006. p. 25–71.
12. MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA, USA. 1967. p. 281–97.
13. Huang Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery; 1997. p. 1–8.
14. Park HS, Jun CH. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications. 2009 Mar; 36(2.2):3336–41.
15. Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. Proceedings of the International Conference on Very Large Data Bases (VLDB); 1994. p. 144–55.
16. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. USA: John Wiley and Sons; 2008.
17. Ng RT, Han J. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE). 2002 Sep/Oct; 14(5):1003–16.
18. Schikuta E, Erhart M. The BANG-clustering system: grid-based data analysis. Lecture Notes in Computer Science. 1997; 1280:513–24.
19. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1996 Jun; 25(2):103–14.
20. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1998 Jun; 27(2):73–84.
21. Guha S, Rastogi R, Shim K. ROCK: a robust clustering algorithm for categorical attributes. 15th International Conference on Data Engineering; 1999. p. 512–21.
22. Karypis G, Han EH, Kumar V. Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer. 1999 Aug; 32(8):68–75.
23. Mahmood AN, Leckie C, Udaya P. An efficient clustering scheme to exploit hierarchical data in network traffic analysis. IEEE Transactions on Knowledge and Data Engineering. 2008 Jun; 20(6):752–67.
24. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD); 1996. p. 226–31.
25. Ankerst M, Breunig M, Kriegel HP, Sander J. OPTICS: ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1999 Jun; 28(2):49–60.
26. Xu X, Ester M, Kriegel HP, Sander J. A distribution-based clustering algorithm for mining in large spatial databases. Proceedings of the 14th IEEE International Conference on Data Engineering (ICDE); Orlando, FL. 1998 Feb 23-27. p. 324–31.
27. Hinneburg A, Keim DA. An efficient approach to clustering in large multimedia databases with noise. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD); 1998. p. 58–65.
28. Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. Proceedings of the International Conference on Very Large Data Bases (VLDB); 1998. p. 428–39.
29. Wang W, Yang J, Muntz R. STING: a statistical information grid approach to spatial data mining. Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB); 1997. p. 186–95.
30. Jain AK, Dubes RC. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall; 1988.
31. Hinneburg A, Keim DA. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB); 1999. p. 506–17.
32. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. 1977; 39(1):1–38.
33. Fisher DH. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987 Sep; 2(2):139–72.
34. Gennari JH, Langley P, Fisher D. Models of incremental concept formation. Artificial Intelligence. 1989 Sep; 40(1-3):11–61.
35. Kohonen T. The self-organizing map. Neurocomputing. 1998 Nov; 21(1-3):1–6.
36. Cheng CH, Fu AW, Zhang Y. Entropy-based subspace clustering for mining numerical data. Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 1999. p. 84–93.
37. Milenova BL, Campos M. Clustering large databases with numeric and nominal values using orthogonal projections. O-Cluster; 2006. p. 1–11.
38. Aggarwal CC, Yu PS. Finding generalized projected clusters in high dimensional spaces. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 2000 Jun; 29(2):70–81.
39. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks. 2005 May; 16(3):645–78.
40. Han J, Kamber M. Data Mining: Concepts and Techniques. 2nd edition. San Mateo, CA, USA: Morgan Kaufmann; 2006.
41. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985 Dec; 2(1):193–218.
42. Kailing K, Kriegel HP, Kroger P. Density-connected subspace clustering for high-dimensional data. Proceedings of the 2004 SIAM International Conference on Data Mining; 2004. p. 246–57.
43. Varghese BM, Unnikrishnan A, Paulose Jacob K. Spatial clustering algorithms - an overview. Asian Journal of Computer Science and Information Technology. 2014; 3(1):1–8.
44. Cheng W, Wang W, Batista S. Grid-based clustering. 2009. p. 12–24.
45. Ganti V, Gehrke J, Ramakrishnan R. CACTUS - clustering categorical data using summaries. Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 1999. p. 73–83.
46. Cao Q, Bouqata B, Mackenzie PD, Messiar D, Salvo J. A grid-based clustering method for mining frequent trips from large-scale, event-based telemetry datasets. The 2009 IEEE International Conference on Systems, Man and Cybernetics; San Antonio, TX, USA. 2009 Oct. p. 2996–3001.