Big Data Clustering: A Review
Ali Seyed Shirkhorshidi1, Saeed Aghabozorgi1, Teh Ying Wah1 and Tutut Herawan1,2
1Department of Information Systems
Faculty of Computer Science and Information Technology
University of Malaya
50603 Pantai Valley, Kuala Lumpur, Malaysia
2AMCS Research Center, Yogyakarta, Indonesia
shirkhorshidi_ali@siswa.um.edu.my, saeed@um.edu.my,
tehyw@um.edu.my, tutut@um.edu.my
Abstract. Clustering is an essential data mining tool for analyzing big data. There are difficulties in applying clustering techniques to big data due to the new challenges that are raised with it. As big data refers to terabytes and petabytes of data, and clustering algorithms come with high computational costs, the question is how to cope with this problem and how to deploy clustering techniques on big data to obtain results within a reasonable time. This study reviews the trend and progress of clustering algorithms in coping with big data challenges, from the very first proposed algorithms to today's novel solutions. The algorithms and the challenges targeted in producing improved clustering algorithms are introduced and analyzed, and afterward a possible future path toward more advanced algorithms is outlined based on today's available technologies and frameworks.
Keywords: Big Data; Clustering; MapReduce; Parallel Clustering.
1 Introduction
After an era of dealing with data collection challenges, the problem has now changed into the question of how to process these huge amounts of data. Scientists and researchers believe that today one of the most important topics in computing science is big data. Social networking websites such as Facebook and Twitter have billions of users producing hundreds of gigabytes of content per minute, retail stores continuously collect their customers' data, and YouTube has 1 billion unique users uploading 100 hours of video every hour, while its Content ID service scans over 400 years of video every day [1], [2]. To deal with this avalanche of data, it is necessary to use powerful tools for knowledge discovery. Data mining techniques are well-known knowledge discovery tools for this purpose [3]-[9]. Clustering is one of them; it is defined as a method in which data are divided into groups in such a way that objects in each group share more similarity with each other than with objects in other groups [1]. Data clustering is a well-known technique in various areas of computer science and related domains. Although data mining can be considered the main origin of clustering, it is vastly used in other fields of study such as bioinformatics, energy studies, machine learning, networking and pattern recognition, and therefore a great deal of research has been done in this area [10]-[13]. From the very beginning, researchers have dealt with clustering algorithms in order to handle their complexity and computational cost and consequently increase scalability and speed. The emergence of big data in recent years has added more challenges to this topic, which urges more research on improving clustering algorithms. Before focusing on clustering big data, the question that needs to be clarified is how big "big data" is. To address this question, Bezdek and Hathaway presented a categorization of data sizes, which is reproduced in Table 1 [14].
Table 1. Bezdek and Hathaway categorization for big data

"Size"    Medium    Large    Huge      Monster   Very large
Bytes     10^6      10^8     10^10     10^12     >10^12
The challenges of big data are rooted in its five important characteristics [15]:
Volume: The first is volume; an example is the unstructured data streaming in from social media, and it raises questions such as how to determine relevance within large data volumes and how to analyze the relevant data to produce valuable information.
Velocity: Data is flooding in at very high speed and has to be dealt with in reasonable time. Responding quickly to data velocity is one of the challenges of big data.
Variety: Another challenging issue is to manage, merge and govern data that comes from different sources with different specifications, such as email, audio, unstructured data, social data, video, etc.
Variability: Inconsistency in the data flow is another challenge. For example, in social media there can be daily or seasonal peak data loads, which makes the data harder to deal with and manage, especially when it is unstructured.
Complexity: Data comes from different sources and has different structures; consequently it is necessary to connect and correlate relationships and data linkages, or the data can quickly get out of control.
Traditional clustering techniques cannot cope with this huge amount of data because of their high complexity and computational cost. For instance, traditional k-means clustering is NP-hard, even when the number of clusters is k = 2. Consequently, scalability is the main challenge for clustering big data.
Fig. 1. Progress of developments in clustering algorithms to deal with big data
The main target is to scale up and speed up clustering algorithms with minimum sacrifice of clustering quality. Although the scalability and speed of clustering algorithms have always been a target for researchers in this domain, big data challenges underline these shortcomings and demand more attention and research on this topic. Reviewing the literature of clustering techniques shows that the advancement of these techniques can be classified into five stages, as shown in Figure 1.
In the rest of this study, the advantages and drawbacks of algorithms in each stage will be discussed in the order they appear in the figure. In the conclusion and future works, we present an additional stage which could be the next stage for big data clustering algorithms based on recent and novel methods.
Techniques used to empower clustering algorithms to work with bigger datasets through improving their scalability and speed can be classified into two main categories:
Single-machine clustering techniques
Multiple-machine clustering techniques
Single-machine clustering algorithms run on one machine and can use the resources of just that single machine, while multiple-machine clustering techniques can run on several machines and have access to more resources. In the following section, algorithms in each of these categories will be reviewed.
2 Big Data Clustering
In general, big data clustering techniques can be classified into two major categories: single-machine clustering techniques and multiple-machine clustering techniques. Recently, multiple-machine clustering techniques have attracted more attention because they are more flexible in scalability and offer faster response times to users. As demonstrated in Fig. 2, single-machine and multiple-machine clustering include different techniques:
Single-machine clustering
o Sample-based techniques
o Dimension reduction techniques
Multiple-machine clustering
o Parallel clustering
o MapReduce-based clustering
In this section, advancements of clustering algorithms for big data analysis in the categories mentioned above will be reviewed.
Fig. 2. Big data clustering techniques
2.1 Single-machine clustering techniques
2.1.1 Sampling Based Techniques
These algorithms were the very first attempts to improve the speed and scalability of clustering, and their target was to deal with the exponential search space. They are called sampling-based algorithms because, instead of performing clustering on the whole dataset, they perform clustering on a sample of the dataset and then generalize the result to the whole dataset. This speeds up the algorithm because computation takes place for a smaller number of instances, and consequently the complexity and memory space needed for the process decrease.
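As a minimal illustration of the general idea (not any specific algorithm from this survey), the following Python sketch clusters a random sample with plain k-means and then generalizes the result by assigning every remaining point to its nearest sample-derived centroid; the function and parameter names are ours:

```python
import numpy as np

def sample_based_kmeans(X, k, sample_size, n_iters=100, seed=0):
    """Cluster a random sample of X with k-means, then generalize the
    result by assigning every point to its nearest centroid."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=sample_size, replace=False)
    S = X[idx]

    # Plain Lloyd iterations, run on the sample only.
    centroids = S[rng.choice(len(S), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array(
            [S[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Generalization step: label the full dataset with the sample's centroids.
    full_d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return full_d.argmin(axis=1), centroids

# Usage sketch:
# labels, centers = sample_based_kmeans(X, k=5, sample_size=10_000)
```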
Clustering Large Applications based on Randomized Sampling (CLARANS).
Before introducing CLARANS [16], let us take a look at its predecessor, Clustering Large Applications (CLARA) [17]. Compared to Partitioning Around Medoids (PAM) [17], CLARA can deal with larger datasets. CLARA reduces the overall quadratic complexity and time requirements to linear in the total number of objects. PAM calculates the entire pairwise dissimilarity matrix between objects and stores it in main memory; consequently it consumes O(n^2) memory space and cannot be used for large n. To cope with this problem, CLARA does not calculate the entire dissimilarity matrix at once. PAM and CLARA can be regarded conceptually as graph-search problems in which each node is a possible clustering solution and two nodes are linked if they differ in exactly one of the k medoids.
PAM starts from a randomly chosen node and greedily moves to one of its neighbors until it cannot find a better neighbor. CLARA reduces the search space by searching only a sub-graph, which is prepared from O(k) sampled data points. CLARANS was proposed in order to improve efficiency in comparison to CLARA. Like PAM, CLARANS aims to find a locally optimal solution by searching the entire graph, but the difference is that in each iteration it checks only a sample of the neighbors of the current node in the graph. Clearly, CLARA and CLARANS both use sampling to reduce the search space, but they differ in the way the sampling is performed. Sampling in CLARA is done at the beginning and restricts the whole search process to a particular sub-graph, while in CLARANS the sampling is conducted dynamically for each iteration of the search procedure. Observations show that the dynamic sampling used in CLARANS is more efficient than the method used in CLARA [18].
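The following Python sketch illustrates the CLARANS-style search loop described above under simplifying assumptions: the whole dataset fits in memory, Euclidean distance is used, and max_neighbors/num_local stand in for the algorithm's sampling parameters. It is an illustrative approximation, not the published implementation.

```python
import random
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans_like(X, k, max_neighbors=50, num_local=3, seed=0):
    """CLARANS-style search: from a random set of k medoids, repeatedly
    examine a random sample of neighbors (solutions differing in exactly
    one medoid) and move whenever one improves the cost."""
    random.seed(seed)
    n = len(X)
    best_cost, best = np.inf, None
    for _ in range(num_local):                      # independent restarts
        current = random.sample(range(n), k)
        cost = total_cost(X, current)
        tries = 0
        while tries < max_neighbors:
            # Neighbor: swap one medoid for one randomly chosen non-medoid.
            out = random.randrange(k)
            cand = random.choice([i for i in range(n) if i not in current])
            neighbor = current.copy()
            neighbor[out] = cand
            neighbor_cost = total_cost(X, neighbor)
            if neighbor_cost < cost:                # greedy move, reset counter
                current, cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if cost < best_cost:
            best_cost, best = cost, current
    return best, best_cost
```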
BIRCH
If the data size is larger than the memory size, then the I/O cost dominates the computational time. BIRCH [19] offers a solution to this problem, which is not addressed by the other previously mentioned algorithms. BIRCH uses its own data structures, called the clustering feature (CF) and the CF-tree. A CF is a concise summary of each cluster. It takes into consideration the fact that not every data point is equally important for clustering and that all the data points cannot be accommodated in main memory.
A CF is a triple <N, LS, SS> which contains the number of data points in the cluster, the linear sum of the data points, and the square sum of the data points. The CF satisfies an additive property: if two existing clusters need to be merged, the CF of the merged cluster is simply the sum of the CFs of the two original clusters. This property is important because it allows merging two existing clusters without accessing the original dataset.
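A minimal sketch of a CF triple and its additive merge follows (illustrative only, not the BIRCH code itself; it assumes SS is stored as the scalar sum of squared norms):

```python
import numpy as np

class CF:
    """Clustering feature <N, LS, SS>: point count, linear sum, square sum."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_points(cls, pts):
        pts = np.asarray(pts, dtype=float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum())

    def merge(self, other):
        # Additive property: the CF of two merged clusters is the
        # component-wise sum of their CFs -- no raw data needed.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the members from the centroid, derivable from the CF alone.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0))
```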
There are two key phases in the BIRCH algorithm. First, it scans the data points and builds an in-memory CF-tree, and second, it applies a clustering algorithm to cluster the leaf nodes. Experiments conducted in [20] reveal that, in terms of time and space, BIRCH performs better than CLARANS, and it is also more robust in handling outliers. Fig. 3 presents a flowchart of the steps in the BIRCH algorithm.
Fig. 3. BIRCH algorithm flowchart
CURE
A single data point is used to represent a cluster in all the previously mentioned algorithms, which means that those algorithms work well only if clusters have spherical shapes, while in real applications clusters can have different, complex shapes. To deal with this challenge, Clustering Using REpresentatives (CURE) [21] uses a set of well-scattered data points to represent a cluster. CURE is a hierarchical algorithm. Basically, it considers each data point as a single cluster and continually merges two existing clusters until precisely k clusters remain. The process of selecting two clusters to merge at each stage is based on calculating the minimum distance between all possible pairs of representative points from the two clusters. Two main data structures make this search efficient. The first is a heap, used to track the distance from each existing cluster to its closest cluster, and the other is a k-d tree, used to store all the representative points of each cluster.
CURE also uses sampling to speed up the computation. It draws a sample of the input dataset and runs the above procedure on the sample; to determine the necessary sample size, a Chernoff bound is used in the original study. If the dataset is very large, even after sampling the data size may still be big, and the process is consequently time consuming. To solve this issue, CURE uses partitioning to accelerate the algorithm. Denoting the original dataset by n and the sampled data by n', CURE partitions n' into p partitions and, within each partition, runs a partial hierarchical clustering until either a predefined number of clusters is reached or the distance between the two clusters to be merged exceeds some threshold. Then another clustering pass runs over all the partial clusters from all p partitions. At the final stage, all non-sampled data points are assigned to the nearest clusters. The results in [21] show that, in comparison to BIRCH, the execution time of CURE is lower, while it maintains its robustness in handling outliers by shrinking the representative points toward the centroid of the cluster by a constant factor. Fig. 4 shows the flowchart of the CURE algorithm.
Fig. 4. CURE flowchart
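To make the representative-point idea concrete, the sketch below selects well-scattered representatives for a single cluster by greedy farthest-point selection and shrinks them toward the centroid by a constant factor. The names num_rep and alpha are illustrative, and this approximates the step described above rather than reproducing the original implementation:

```python
import numpy as np

def cure_representatives(points, num_rep=5, alpha=0.3):
    """Pick well-scattered representative points for one cluster and shrink
    them toward the centroid by a constant factor alpha, which dampens the
    influence of outliers."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)

    # Greedy farthest-point selection yields well-scattered representatives.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])

    # Shrink each representative toward the centroid.
    return np.array([r + alpha * (centroid - r) for r in reps])

def cluster_distance(reps_a, reps_b):
    """The merge step picks the pair of clusters whose representative sets are closest."""
    return min(np.linalg.norm(a - b) for a in reps_a for b in reps_b)
```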
2.1.2 Dimension Reduction Techniques
Although the complexity and speed of clustering algorithms are related to the number of instances in the dataset, the dimensionality of the dataset is another influential aspect. In fact, the more dimensions the data have, the higher the complexity, and that means longer execution times. Sampling techniques reduce the dataset size, but they do not offer a solution for high-dimensional datasets.
Locality-Preserving Projection (Random Projection)
In this method, after projecting the dataset from a d-dimensional space to a lower-dimensional space, it is desired that pairwise distances be roughly preserved. Many clustering algorithms are distance based, so it is expected that the clustering result in the projected space provides an acceptable approximation of the same clustering in the original space. Random projection can be accomplished by a linear transformation of the original data matrix A. If R is a d × t rotation matrix (t << d) whose elements R(i, j) are independent random variables, then A' = AR is the projection of matrix A into the t-dimensional space; each row of A' has t dimensions. The construction of the rotation matrix differs between random projection algorithms. Early methods propose normal random variables with mean 0 and variance 1 for R(i, j), although there are studies that present different methods for assigning values to R(i, j) [22]. After generating the projected matrix in the lower-dimensional space, clustering can be performed. Sample implementations are illustrated in [23], [24], and recently a method has been proposed to speed up the computation [25].
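A minimal sketch of a Gaussian random projection followed by clustering in the reduced space is shown below; the 1/sqrt(t) scaling is one common normalization choice rather than something mandated by the text:

```python
import numpy as np

def random_projection(A, t, seed=0):
    """Project the n x d data matrix A into t dimensions (t << d) with a
    Gaussian random matrix R, so that A' = A R roughly preserves pairwise
    distances (up to scaling)."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    R = rng.normal(loc=0.0, scale=1.0, size=(d, t)) / np.sqrt(t)  # keeps expected norms
    return A @ R

# Usage sketch: cluster in the low-dimensional space instead of the original one.
# A_proj = random_projection(A, t=50)
# labels = some_distance_based_clustering(A_proj, k)   # any distance-based method
```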
Global Projection
In global projection, the objective is that each projected data point be as close as possible to the original data point, whereas in locality-preserving projection the objective was to keep pairwise distances roughly the same in the projected space. In fact, if the original data matrix is A and A' is its approximation, the aim of global projection is to minimize ||A' - A||. Different approaches are available to create the approximation matrix, such as SVD (singular value decomposition) [26], CX/CUR [27], CMD [28] and Colibri [29].
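As an illustration of the SVD option, the sketch below computes the best rank-t approximation A' of A, which minimizes ||A - A'|| in the Frobenius sense, and returns t-dimensional coordinates that a distance-based clustering algorithm could consume:

```python
import numpy as np

def svd_low_rank(A, t):
    """Best rank-t approximation of A (minimizes ||A - A'|| over all rank-t
    matrices), plus the reduced n x t coordinates usable for clustering."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_approx = (U[:, :t] * s[:t]) @ Vt[:t, :]   # rank-t reconstruction A'
    reduced = U[:, :t] * s[:t]                  # low-dimensional representation
    return A_approx, reduced
```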
2.2 Multi-Machine clustering techniques
Although the sampling and dimension reduction methods used in the single-machine clustering algorithms presented in the previous section improve the scalability and speed of those algorithms, nowadays the growth of data size is much faster than advancements in memory and processors. Consequently, one machine with a single processor and memory cannot handle terabytes and petabytes of data, and this underlines the need for algorithms that can run on multiple machines. As shown in Fig. 5, this approach breaks the huge amount of data down into smaller pieces that can be loaded on different machines and then uses the processing power of these machines to solve the huge problem. Multi-machine clustering algorithms are divided into two main categories:
Un-automated distribution: parallel clustering
Automated distribution: MapReduce-based clustering
Fig. 5. General concept of multi-machine clustering techniques
In parallel clustering, developers are involved not just with the parallel clustering challenges themselves but also with the details of the data distribution process between the different machines available in the network, which makes development very complicated and time consuming. The difference between parallel algorithms and the MapReduce framework lies in the convenience that MapReduce provides for programmers: it relieves them from unnecessary networking problems and concepts such as load balancing, data distribution and fault tolerance by handling them automatically. This feature allows massive parallelism and easier and faster scaling of the parallel system. Parallel and distributed clustering algorithms follow a general cycle, as represented below:
Fig. 6. General cycle for multi-machine clustering algorithms
In the first stage, data is divided into partitions, and these are distributed over the machines. Afterward, each machine performs clustering individually on its assigned partition of the data. The two main challenges for parallel and distributed clustering are minimizing data traffic and the lower accuracy compared with the serial equivalent. Lower accuracy in distributed algorithms can be caused by two main factors: first, different clustering algorithms may be deployed on different machines, and second, even if the same clustering algorithm is used on all machines, in some cases dividing the data may change the final clustering result. In the rest of this study, parallel algorithms and MapReduce algorithms will be discussed in turn, and then more advanced algorithms proposed recently for big data will be covered.
2.2.1 Parallel clustering
Although parallel algorithms add the difficulty of distribution for programmers, this is worthwhile because of the major improvements in the scaling and speed of clustering algorithms. In the following parts, some of them are reviewed.
DBDC
DBDC [30], [31] is a distributed, density-based clustering algorithm. The discovery of clusters of arbitrary shape is the main objective of density-based clustering. The density of points within each cluster is much higher than outside of it, while the density of the regions of noise is lower than the density in any of the clusters. DBDC follows the general cycle mentioned earlier (Fig. 6). At the individual clustering stage, it uses a defined local algorithm for clustering, and then, for global clustering, a single-machine density-based algorithm called DBSCAN [32] is used to finalize the results. The results show that, although DBDC maintains the same clustering quality as its serial counterpart, it runs about 30 times faster.
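The sketch below illustrates the local/global cycle in a single process, using k-means as a stand-in for the local clustering stage and DBSCAN for the global stage; DBDC's actual local representation is more elaborate, so treat this purely as an illustration of the data flow:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def dbdc_like(partitions, local_k=20, eps=0.5, min_samples=2):
    """Local/global cycle: each partition (one per 'machine') is clustered
    locally, only the local representatives are shipped to a central site,
    and a single-machine DBSCAN merges them into global clusters."""
    local_models, all_reps = [], []
    for part in partitions:                       # stage 1: independent local clustering
        km = KMeans(n_clusters=min(local_k, len(part)), n_init=10).fit(part)
        local_models.append(km)
        all_reps.append(km.cluster_centers_)
    reps = np.vstack(all_reps)

    # Stage 2: global clustering of the (much smaller) set of representatives.
    rep_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reps)

    # Stage 3: relabel every point through its local representative
    # (representatives labeled -1 by DBSCAN propagate a noise label).
    global_labels, offset = [], 0
    for part, km in zip(partitions, local_models):
        global_labels.append(rep_labels[offset + km.labels_])
        offset += km.cluster_centers_.shape[0]
    return global_labels
```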
ParMETIS
ParMETIS [33] is the parallel version of METIS [34], a multilevel partitioning algorithm. Graph partitioning is a clustering problem whose goal is to find good clusters of vertices. METIS contains three main steps. The first step is the coarsening phase: a maximal matching on the original graph is computed, the matched vertices are collapsed to create a smaller graph, and this process is iterated until the number of vertices becomes small enough. The second stage is the partitioning stage, in which a k-way partitioning of the coarsened graph is performed using a multilevel recursive bisection algorithm. Finally, in the third, un-coarsening stage, a greedy refinement algorithm is used to project the partitioning from the second stage back onto the original graph.
ParMETIS is the distributed version of METIS. Because of its graph-based nature, ParMETIS differs from the other clustering approaches and does not follow the general cycle mentioned earlier for parallel and distributed clustering. Initially, an equal number of vertices is distributed to each machine, and a coloring of the graph is computed across the machines. Afterward, a global graph is computed by incrementally matching only vertices of the same color, one color at a time. In the partitioning stage, this graph is broadcast to the machines, and each machine performs recursive bisection by exploring only a single path of the recursive bisection tree. Finally, the un-coarsening stage consists of moving vertices across the edge-cut. Experiments show that ParMETIS was 14 to 35 times faster than the serial algorithm while maintaining quality close to that of the serial algorithm.
GPU based parallel clustering
A new direction that has recently opened in parallel computing is to use the processing power of the GPU instead of the CPU to speed up computation. G-DBSCAN [35] is a GPU-accelerated parallel version of the density-based clustering algorithm DBSCAN, and it is one of the recently proposed algorithms in this category. The authors distinguish their method by using graph-based data indexing, which adds flexibility and allows more parallelization opportunities. G-DBSCAN is a two-step algorithm, and both steps have been parallelized. The first step constructs a graph: each object represents a node, and an edge is created between two objects if their distance is lower than or equal to a predefined threshold. When this graph is ready, the second step identifies the clusters, using breadth-first search (BFS) to traverse the graph created in the first step. Results show that, in comparison to its serial implementation, G-DBSCAN is 112 times faster.
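A serial Python sketch of the same two steps, graph construction followed by BFS labeling, clarifies what G-DBSCAN parallelizes on the GPU; the quadratic distance computation here is exactly the part that benefits from thousands of GPU cores (eps and min_pts are illustrative parameter names):

```python
import numpy as np
from collections import deque

def graph_bfs_clusters(X, eps, min_pts=4):
    """Step 1: build a graph linking every pair of points within distance eps.
    Step 2: identify clusters by breadth-first search from core points."""
    n = len(X)
    # Step 1: adjacency lists (quadratic here; parallelized on the GPU).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = [np.where((dists[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]
    core = [len(adj[i]) + 1 >= min_pts for i in range(n)]

    # Step 2: BFS from each unvisited core point labels one cluster.
    labels = np.full(n, -1)          # -1 = noise / unassigned
    cluster_id = 0
    for start in range(n):
        if labels[start] != -1 or not core[start]:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if not core[u]:
                continue             # border points do not expand the cluster
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cluster_id
                    queue.append(v)
        cluster_id += 1
    return labels
```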
2.2.2 MapReduce
Although parallel clustering algorithms improved the scalability and speed of clustering, the complexity of dealing with memory and processor distribution remained quite an important challenge. MapReduce is a framework, illustrated in Fig. 7, that was initially introduced by Google; Hadoop is an open-source implementation of it [36]. In this section, algorithms implemented on top of this framework are reviewed and their improvements are discussed in terms of three measures (a small sketch of how these ratios can be computed from measured run times follows the list):
Speed-up: the ratio of running times when the dataset remains constant and the number of machines in the system is increased.
Scale-up: measures whether an x-times larger system can perform an x-times larger job with the same run time.
Size-up: keeping the number of machines unchanged, measures how the running time grows with the data size.
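One straightforward way to express these three measures as ratios of measured run times is sketched below (the conventions and variable names are ours):

```python
def speed_up(t_one_machine, t_m_machines):
    """Speed-up: run time on 1 machine over run time on m machines, same dataset."""
    return t_one_machine / t_m_machines

def size_up(t_base, t_scaled):
    """Size-up: how much longer an m-times-larger dataset takes on the same machines."""
    return t_scaled / t_base

def scale_up(t_small_on_small, t_large_on_large):
    """Scale-up: ratio of run times when both data and machines grow m times;
    a value close to 1 means the system scales well."""
    return t_small_on_small / t_large_on_large
```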
Fig. 7. MapReduce Framework
MapReduce based K-means (PK-means)
PKMeans [37] is a distributed version of the well-known clustering algorithm k-means [38], [39]. The aim of the k-means algorithm is to cluster the dataset into k clusters in such a way that instances in one cluster share more similarity with each other than with the instances of other clusters. K-means randomly chooses k instances of the dataset in the initial step and then performs two phases repeatedly: first, it assigns each instance to the nearest cluster, and after finishing the assignment for all of the instances, in the second phase it updates each cluster center with the mean of its instances.
PKMeans distributes the computation between multiple machines using the MapReduce framework to speed up and scale up the process. The individual clustering, which contains the first phase, happens in the mapper, and then the general clustering performs the second phase in the reducer.
PKMeans achieves an almost linear speed-up and a linear size-up. It also has a good scale-up: for 4 machines it showed a scale-up of 0.75. In addition, PKMeans is an exact algorithm, meaning that it offers the same clustering quality as its serial counterpart, k-means.
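A plain-Python sketch of how the two k-means phases map onto the map and reduce steps is shown below; it follows the general scheme described above, with combiners, Hadoop I/O and convergence checks omitted, and is not the PKMeans code itself:

```python
import numpy as np
from collections import defaultdict

def kmeans_map(points, centers):
    """Map phase: a mapper assigns its block of points to the nearest center
    and emits (center_id, (partial_sum, count)) pairs."""
    out = defaultdict(lambda: [np.zeros(centers.shape[1]), 0])
    for p in points:
        cid = int(np.argmin(np.linalg.norm(centers - p, axis=1)))
        out[cid][0] += p
        out[cid][1] += 1
    return out

def kmeans_reduce(mapper_outputs, centers):
    """Reduce phase: sums and counts from all mappers are combined per center,
    and each new center is the mean of its assigned points."""
    totals = defaultdict(lambda: [np.zeros(centers.shape[1]), 0])
    for out in mapper_outputs:
        for cid, (s, c) in out.items():
            totals[cid][0] += s
            totals[cid][1] += c
    new_centers = centers.copy()
    for cid, (s, c) in totals.items():
        new_centers[cid] = s / c
    return new_centers

# Driver sketch: one MapReduce job per k-means iteration.
# for _ in range(num_iterations):
#     centers = kmeans_reduce([kmeans_map(block, centers) for block in data_blocks], centers)
```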
MR-DBSCAN
A very recently proposed algorithm is MR-DBSCAN [40], a scalable MapReduce-based DBSCAN algorithm. Three major drawbacks exist in earlier parallel DBSCAN algorithms, which MR-DBSCAN addresses: first, they fail to balance the load between the parallel nodes; second, they are limited in scalability because not all critical sub-procedures are parallelized; and finally, their architecture and design make them less portable to emerging parallel processing paradigms.
MR-DBSCAN proposes a novel data partitioning method based on computation cost estimation, as well as a scalable DBSCAN algorithm in which all critical sub-procedures are fully parallelized. Experiments on large datasets confirm the scalability and efficiency of MR-DBSCAN.
MapReduce based on GPU
As discussed in the G-DBSCAN section, GPUs can be much more efficient than CPUs: while CPUs have several processing cores, GPUs consist of thousands of cores, which makes them much more powerful and faster for highly parallel workloads. Although MapReduce with CPUs is a very efficient framework for distributed computing, if GPUs are used instead, the framework can improve the speed-up and scale-up of distributed applications. GPMR is a MapReduce framework that uses multiple GPUs. Although clustering applications have not yet been implemented in this framework, the growth of data sizes urges researchers to devise faster and more scalable algorithms, and this framework could be an appropriate solution to fulfill those needs.
3 Conclusion and Future Works
Clustering is one of the essential tasks in data mining, and it needs improvement now more than ever to assist data analysts in extracting knowledge from terabytes and petabytes of data. In this study, the improvement trend of data clustering algorithms was discussed. To sum up, while traditional sampling and dimension reduction algorithms are still useful, they do not have enough power to deal with huge amounts of data, because even after sampling, a petabyte of data is still very big and cannot be handled by ordinary clustering algorithms; consequently, the future of clustering is tied to distributed computing. Although parallel clustering is potentially very useful, the complexity of implementing such algorithms is a challenge. On the other hand, the MapReduce framework provides a very satisfying base for implementing clustering algorithms. As the results show, MapReduce-based algorithms offer impressive scalability and speed in comparison to their serial counterparts while maintaining the same quality. Given that GPUs are much more powerful than CPUs, as future work it is worth deploying clustering algorithms on GPU-based MapReduce frameworks in order to achieve even better scalability and speed.
Acknowledgments. This work is supported by University of Malaya High Impact
Research Grant no vote UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Education
Malaysia.
References
1. T. C. Havens, J. C. Bezdek, and M. Palaniswami, "Scalable single linkage hierarchical clustering for big data," in Intelligent Sensors, Sensor Networks and Information Processing, 2013 IEEE Eighth International Conference on. IEEE, 2013, pp. 396-401.
2. "YouTube Statistics," 2014. [Online]. Available: http://www.youtube.com/yt/press/statistics.html.
3. P. Williams, C. Soares, and J. E. Gilbert, "A Clustering Rule Based Approach for Classification Problems," Int. J. Data Warehous. Min., vol. 8, no. 1, pp. 1-23, 2012.
4. R. V. Priya and A. Vadivel, "User Behaviour Pattern Mining from Weblog," Int. J. Data Warehous. Min., vol. 8, no. 2, pp. 1-22, 2012.
5. T. Kwok, K. A. Smith, S. Lozano, and D. Taniar, "Parallel Fuzzy c-Means Clustering for Large Data Sets," 2002, pp. 365-374.
6. H. Kalia, S. Dehuri, and A. Ghosh, "A Survey on Fuzzy Association Rule Mining," Int. J. Data Warehous. Min., vol. 9, no. 1, pp. 1-27, 2013.
7. O. Daly and D. Taniar, "Exception Rules Mining Based on Negative Association Rules," in Proceedings of the International Conference on Computational Science and Its Applications (ICCSA 2004), 2004, pp. 543-552.
8. M. Z. Ashrafi, D. Taniar, and K. A. Smith, "Redundant association rules reduction techniques," Int. J. Bus. Intell. Data Min., vol. 2, no. 1, pp. 29-63, 2007.
9. D. Taniar, W. Rahayu, V. C. S. Lee, and O. Daly, "Exception rules in association rule mining," Appl. Math. Comput., vol. 205, no. 2, pp. 735-750, 2008.
10. F. G. Meyer and J. Chinrungrueng, "Spatiotemporal clustering of fMRI time series in the spectral domain," Med. Image Anal., vol. 9, no. 1, pp. 51-68, 2004.
11. J. Ernst, G. J. Nau, and Z. Bar-Joseph, "Clustering short time series gene expression data," Bioinformatics, vol. 21, no. suppl 1, pp. i159-i168, Jun. 2005.
12. F. Iglesias and W. Kastner, "Analysis of Similarity Measures in Times Series Clustering for the Discovery of Building Energy Patterns," Energies, vol. 6, no. 2, pp. 579-597, Jan. 2013.
13. Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, no. 3, pp. 311-331, 2004.
14. R. Hathaway and J. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Comput. Stat. Data Anal., vol. 51, no. 1, pp. 215-234, 2006.
15. "Big Data, What it is and why it is important." [Online]. Available: http://www.sas.com/en_us/insights/big-data/what-is-big-data.html.
16. R. T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining," IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003-1016, 2002.
17. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
18. R. T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining," IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003-1016, 2002.
19. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in SIGMOD Conference, 1996, pp. 103-114.
20. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in SIGMOD Conference, 1996, pp. 103-114.
21. S. Guha and R. Rastogi, "CURE: An efficient clustering algorithm for large databases," Inf. Syst., vol. 26, no. 1, pp. 35-58, 2001.
22. D. Achlioptas and F. McSherry, "Fast computation of low rank matrix approximations," J. ACM, vol. 54, no. 2, p. 9, 2007.
23. X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: A cluster ensemble approach," in ICML, 2003, pp. 186-193.
24. S. Dasgupta, "Experiments with random projection," in UAI, 2000, pp. 143-151.
25. C. Boutsidis, C. Chekuri, T. Feder, and R. Motwani, "Random projections for k-means clustering," in NIPS, 2010, pp. 298-306.
26. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. The Johns Hopkins University Press, 1989.
27. P. Drineas, R. Kannan, and M. W. Mahoney, "Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition," SIAM J. Comput., vol. 36, no. 1, pp. 132-157, 2006.
28. J. Sun, Y. Xie, H. Zhang, and C. Faloutsos, "Less is More: Compact Matrix Decomposition for Large Sparse Graphs," in SDM, 2007.
29. H. Tong, S. Papadimitriou, J. Sun, P. S. Yu, and C. Faloutsos, "Colibri: Fast mining of large static and dynamic graphs," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 686-694.
30. E. Januzaj, H.-P. Kriegel, and M. Pfeifle, "DBDC: Density based distributed clustering," in Advances in Database Technology - EDBT 2004, 2004, pp. 88-105.
31. C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. 2013.
32. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, 1996, pp. 226-231.
33. G. Karypis and V. Kumar, "Parallel multilevel k-way partitioning for irregular graphs," SIAM Rev., vol. 41, no. 2, pp. 278-300, 1999.
34. G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," J. Parallel Distrib. Comput., vol. 48, no. 1, pp. 96-129, 1998.
35. G. Andrade, G. Ramos, D. Madeira, R. Sachetto, R. Ferreira, and L. Rocha, "G-DBSCAN: A GPU Accelerated Algorithm for Density-based Clustering," Procedia Comput. Sci., vol. 18, pp. 369-378, 2013.
36. P. P. Anchalia, A. K. Koundinya, and S. NK, "MapReduce Design of K-Means Clustering Algorithm," in Information Science and Applications (ICISA), 2013 International Conference on, 2013, pp. 1-5.
37. W. Zhao, H. Ma, and Q. He, "Parallel k-means clustering based on MapReduce," in Cloud Computing, 2009, pp. 674-679.
38. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
39. B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. CRC Press, 2012.
40. Y. He, H. Tan, W. Luo, S. Feng, and J. Fan, "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data," Front. Comput. Sci., vol. 8, no. 1, pp. 83-99, 2014.