Big Data Clustering: A Review
Ali Seyed Shirkhorshidi1, Saeed Aghabozorgi1, Teh Ying Wah1 and Tutut Herawan1,2
1Department of Information Systems
Faculty of Computer Science and Information Technology
University of Malaya
50603 Pantai Valley, Kuala Lumpur, Malaysia
2AMCS Research Center, Yogyakarta, Indonesia
shirkhorshidi_ali@siswa.um.edu.my, saeed@um.edu.my,
tehyw@um.edu.my, tutut@um.edu.my
Abstract. Clustering is an essential data mining tool for analyzing big data. There are difficulties in applying clustering techniques to big data due to the new challenges that are raised with it. As big data refers to terabytes and petabytes of data, and clustering algorithms come with high computational costs, the question is how to cope with this problem and how to deploy clustering techniques on big data to obtain results within a reasonable time. This study reviews the trend and progress of clustering algorithms in coping with big data challenges, from the very first proposed algorithms to today's novel solutions. The algorithms and the challenges targeted in producing improved clustering algorithms are introduced and analyzed, and afterward a possible future path toward more advanced algorithms is outlined based on today's available technologies and frameworks.
Keywords: Big Data; Clustering; MapReduce; Parallel Clustering.
1 Introduction
After an era of dealing with data collection challenges, the problem has now changed into the question of how to process these huge amounts of data. Scientists and researchers believe that today one of the most important topics in computing science is big data. Social networking websites such as Facebook and Twitter have billions of users producing hundreds of gigabytes of content per minute, retail stores continuously collect their customers' data, and YouTube has 1 billion unique users uploading 100 hours of video every hour, while its Content ID service scans over 400 years of video every day [1], [2]. To deal with this avalanche of data, it is necessary to use powerful tools for knowledge discovery. Data mining techniques are well-known knowledge discovery tools for this purpose [3]-[9]. Clustering is one of them; it is defined as a method in which data are divided into groups in such a way that objects in each group share more similarity with each other than with objects in other groups [1]. Data clustering is a well-known technique in various areas of computer science and related domains. Although data mining can be considered the main origin of clustering, it is vastly used in other fields of study such as bioinformatics, energy studies, machine learning, networking and pattern recognition, and therefore a great deal of research has been done in this area [10]-[13]. From the very beginning, researchers have dealt with clustering algorithms in order to handle their complexity and computational cost and consequently increase scalability and speed. The emergence of big data in recent years has added more challenges to this topic, which urges more research on improving clustering algorithms. Before focusing on clustering big data, the question that needs to be clarified is how big "big data" is. To address this question, Bezdek and Hathaway presented a categorization of data sizes, which is reproduced in Table 1 [14].
Table 1. Bezdek and Hathaway categorization for big data

"Size"    Medium    Large    Huge      Monster   Very large
Bytes     10^6      10^8     10^10     10^12     >10^12
The challenges of big data are rooted in its five important characteristics [15]:
Volume: The first is volume; an example is the unstructured data streaming in from social media, and it raises questions such as how to determine relevance within large data volumes and how to analyze the relevant data to produce valuable information.
Velocity: Data is flooding in at very high speed and has to be dealt with in reasonable time. Responding quickly to data velocity is one of the challenges of big data.
Variety: Another challenging issue is to manage, merge and govern data that comes from different sources with different specifications, such as email, audio, unstructured data, social data, video, etc.
Variability: Inconsistency in the data flow is another challenge. For example, in social media there can be daily or seasonal peak data loads, which makes the data harder to deal with and manage, especially when it is unstructured.
Complexity: Data comes from different sources and has different structures; consequently it is necessary to connect and correlate relationships and data linkages, or the data can quickly get out of control.
Traditional clustering techniques cannot cope with this huge amount of data because of their high complexity and computational cost. For instance, traditional k-means clustering is NP-hard, even when the number of clusters is k = 2. Consequently, scalability is the main challenge for clustering big data.
Fig. 1. Progress of developments in clustering algorithms to deal with big data
The main target is to scale up and speed up clustering algorithms with minimum sacrifice of clustering quality. Although the scalability and speed of clustering algorithms have always been a target for researchers in this domain, big data challenges underline these shortcomings and demand more attention and research on this topic. Reviewing the literature of clustering techniques shows that the advancement of these techniques can be classified into five stages, as shown in Figure 1.
In the rest of this study, the advantages and drawbacks of algorithms in each stage will be discussed in the order they appear in the figure. In the conclusion and future works, we present an additional stage which could be the next stage for big data clustering algorithms based on recent and novel methods.
Techniques used to empower clustering algorithms to work with bigger datasets through improving their scalability and speed can be classified into two main categories:
Single-machine clustering techniques
Multiple-machine clustering techniques
Single-machine clustering algorithms run on one machine and can use the resources of just that single machine, while multiple-machine clustering techniques can run on several machines and have access to more resources. In the following section, algorithms in each of these categories will be reviewed.
2 Big Data Clustering
In general, big data clustering techniques can be classified into two major categories: single-machine clustering techniques and multiple-machine clustering techniques. Recently, multiple-machine clustering techniques have attracted more attention because they are more flexible in scalability and offer faster response times to users. As demonstrated in Fig. 2, single-machine and multiple-machine clustering include different techniques:
Single-machine clustering
o Sample-based techniques
o Dimension reduction techniques
Multiple-machine clustering
o Parallel clustering
o MapReduce-based clustering
In this section, advancements of clustering algorithms for big data analysis in the categories mentioned above will be reviewed.
Fig. 2. Big data clustering techniques
2.1 Single-machine clustering techniques
2.1.1 Sampling Based Techniques
These algorithms were the very first attempts to improve the speed and scalability of clustering, and their target was to deal with the exponential search space. They are called sampling-based algorithms because, instead of performing clustering on the whole dataset, they perform clustering on a sample of the dataset and then generalize the result to the whole dataset. This speeds up the algorithm because computation takes place for a smaller number of instances, and consequently the complexity and memory space needed for the process decrease.
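As a minimal illustration of the general idea (not any specific algorithm from this survey), the following Python sketch clusters a random sample with plain k-means and then generalizes the result by assigning every remaining point to its nearest sample-derived centroid; the function and parameter names are ours:

```python
import numpy as np

def sample_based_kmeans(X, k, sample_size, n_iters=100, seed=0):
    """Cluster a random sample of X with k-means, then generalize the
    result by assigning every point to its nearest centroid."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=sample_size, replace=False)
    S = X[idx]

    # Plain Lloyd iterations, run on the sample only.
    centroids = S[rng.choice(len(S), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array(
            [S[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Generalization step: label the full dataset with the sample's centroids.
    full_d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return full_d.argmin(axis=1), centroids

# Usage sketch:
# labels, centers = sample_based_kmeans(X, k=5, sample_size=10_000)
```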
Clustering Large Applications based on Randomized Sampling (CLARANS).
Before introducing CLARANS [16], let us take a look at its predecessor, Clustering Large Applications (CLARA) [17]. Compared to Partitioning Around Medoids (PAM) [17], CLARA can deal with larger datasets. CLARA reduces the overall quadratic complexity and time requirements to linear in the total number of objects. PAM calculates the entire pairwise dissimilarity matrix between objects and stores it in main memory; consequently it consumes O(n^2) memory space and cannot be used for large n. To cope with this problem, CLARA does not calculate the entire dissimilarity matrix at once. PAM and CLARA can be regarded conceptually as graph-search problems in which each node is a possible clustering solution and two nodes are linked if they differ in exactly one of the k medoids.
PAM starts from a randomly chosen node and greedily moves to one of its neighbors until it cannot find a better neighbor. CLARA reduces the search space by searching only a sub-graph, which is prepared from O(k) sampled data points. CLARANS was proposed in order to improve efficiency in comparison to CLARA. Like PAM, CLARANS aims to find a locally optimal solution by searching the entire graph, but the difference is that in each iteration it checks only a sample of the neighbors of the current node in the graph. Clearly, CLARA and CLARANS both use sampling to reduce the search space, but they differ in the way the sampling is performed. Sampling in CLARA is done at the beginning and restricts the whole search process to a particular sub-graph, while in CLARANS the sampling is conducted dynamically for each iteration of the search procedure. Observations show that the dynamic sampling used in CLARANS is more efficient than the method used in CLARA [18].
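The following Python sketch illustrates the CLARANS-style search loop described above under simplifying assumptions: the whole dataset fits in memory, Euclidean distance is used, and max_neighbors/num_local stand in for the algorithm's sampling parameters. It is an illustrative approximation, not the published implementation.

```python
import random
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans_like(X, k, max_neighbors=50, num_local=3, seed=0):
    """CLARANS-style search: from a random set of k medoids, repeatedly
    examine a random sample of neighbors (solutions differing in exactly
    one medoid) and move whenever one improves the cost."""
    random.seed(seed)
    n = len(X)
    best_cost, best = np.inf, None
    for _ in range(num_local):                      # independent restarts
        current = random.sample(range(n), k)
        cost = total_cost(X, current)
        tries = 0
        while tries < max_neighbors:
            # Neighbor: swap one medoid for one randomly chosen non-medoid.
            out = random.randrange(k)
            cand = random.choice([i for i in range(n) if i not in current])
            neighbor = current.copy()
            neighbor[out] = cand
            neighbor_cost = total_cost(X, neighbor)
            if neighbor_cost < cost:                # greedy move, reset counter
                current, cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if cost < best_cost:
            best_cost, best = cost, current
    return best, best_cost
```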
BIRCH
If the data size is larger than the memory size, then the I/O cost dominates the computational time. BIRCH [19] offers a solution to this problem, which is not addressed by the other previously mentioned algorithms. BIRCH uses its own data structures, called the clustering feature (CF) and the CF-tree. A CF is a concise summary of each cluster. It takes into consideration the fact that not every data point is equally important for clustering and that all the data points cannot be accommodated in main memory.
A CF is a triple <N, LS, SS> which contains the number of data points in the cluster, the linear sum of the data points, and the square sum of the data points. The CF satisfies an additive property: if two existing clusters need to be merged, the CF of the merged cluster is simply the sum of the CFs of the two original clusters. This property is important because it allows merging two existing clusters without accessing the original dataset.
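A minimal sketch of a CF triple and its additive merge follows (illustrative only, not the BIRCH code itself; it assumes SS is stored as the scalar sum of squared norms):

```python
import numpy as np

class CF:
    """Clustering feature <N, LS, SS>: point count, linear sum, square sum."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_points(cls, pts):
        pts = np.asarray(pts, dtype=float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum())

    def merge(self, other):
        # Additive property: the CF of two merged clusters is the
        # component-wise sum of their CFs -- no raw data needed.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the members from the centroid, derivable from the CF alone.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0))
```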
There are two key phases in the BIRCH algorithm. First, it scans the data points and builds an in-memory CF-tree, and second, it applies a clustering algorithm to cluster the leaf nodes. Experiments conducted in [20] reveal that, in terms of time and space, BIRCH performs better than CLARANS, and it is also more robust in handling outliers. Fig. 3 presents a flowchart of the steps in the BIRCH algorithm.
Fig. 3. BIRCH algorithm flowchart
CURE
A single data point is used to represent a cluster in all the previously mentioned algorithms, which means that those algorithms work well only if clusters have spherical shapes, while in real applications clusters can have different, complex shapes. To deal with this challenge, Clustering Using REpresentatives (CURE) [21] uses a set of well-scattered data points to represent a cluster. CURE is a hierarchical algorithm. Basically, it considers each data point as a single cluster and continually merges two existing clusters until precisely k clusters remain. The process of selecting two clusters to merge at each stage is based on calculating the minimum distance between all possible pairs of representative points from the two clusters. Two main data structures make this search efficient. The first is a heap, used to track the distance from each existing cluster to its closest cluster, and the other is a k-d tree, used to store all the representative points of each cluster.
CURE also uses sampling to speed up the computation. It draws a sample of the input dataset and runs the above procedure on the sample; to determine the necessary sample size, a Chernoff bound is used in the original study. If the dataset is very large, even after sampling the data size may still be big, and the process is consequently time consuming. To solve this issue, CURE uses partitioning to accelerate the algorithm. Denoting the original dataset by n and the sampled data by n', CURE partitions n' into p partitions and, within each partition, runs a partial hierarchical clustering until either a predefined number of clusters is reached or the distance between the two clusters to be merged exceeds some threshold. Then another clustering pass runs over all the partial clusters from all p partitions. At the final stage, all non-sampled data points are assigned to the nearest clusters. The results in [21] show that, in comparison to BIRCH, the execution time of CURE is lower, while it maintains its robustness in handling outliers by shrinking the representative points toward the centroid of the cluster by a constant factor. Fig. 4 shows the flowchart of the CURE algorithm.
Fig. 4. CURE flowchart
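To make the representative-point idea concrete, the sketch below selects well-scattered representatives for a single cluster by greedy farthest-point selection and shrinks them toward the centroid by a constant factor. The names num_rep and alpha are illustrative, and this approximates the step described above rather than reproducing the original implementation:

```python
import numpy as np

def cure_representatives(points, num_rep=5, alpha=0.3):
    """Pick well-scattered representative points for one cluster and shrink
    them toward the centroid by a constant factor alpha, which dampens the
    influence of outliers."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)

    # Greedy farthest-point selection yields well-scattered representatives.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])

    # Shrink each representative toward the centroid.
    return np.array([r + alpha * (centroid - r) for r in reps])

def cluster_distance(reps_a, reps_b):
    """The merge step picks the pair of clusters whose representative sets are closest."""
    return min(np.linalg.norm(a - b) for a in reps_a for b in reps_b)
```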
2.1.2 Dimension Reduction Techniques
Although the complexity and speed of clustering algorithms are related to the number of instances in the dataset, the dimensionality of the dataset is another influential aspect. In fact, the more dimensions the data have, the higher the complexity, and that means longer execution times. Sampling techniques reduce the dataset size, but they do not offer a solution for high-dimensional datasets.
Locality-Preserving Projection (Random Projection)
In this method, after projecting the dataset from a d-dimensional space to a lower-dimensional space, it is desired that pairwise distances be roughly preserved. Many clustering algorithms are distance based, so it is expected that the clustering result in the projected space provides an acceptable approximation of the same clustering in the original space. Random projection can be accomplished by a linear transformation of the original data matrix A. If R is a d × t rotation matrix (t << d) whose elements R(i, j) are independent random variables, then A' = AR is the projection of matrix A into the t-dimensional space; each row of A' has t dimensions. The construction of the rotation matrix differs between random projection algorithms. Early methods propose normal random variables with mean 0 and variance 1 for R(i, j), although there are studies that present different methods for assigning values to R(i, j) [22]. After generating the projected matrix in the lower-dimensional space, clustering can be performed. Sample implementations are illustrated in [23], [24], and recently a method has been proposed to speed up the computation [25].
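A minimal sketch of a Gaussian random projection followed by clustering in the reduced space is shown below; the 1/sqrt(t) scaling is one common normalization choice rather than something mandated by the text:

```python
import numpy as np

def random_projection(A, t, seed=0):
    """Project the n x d data matrix A into t dimensions (t << d) with a
    Gaussian random matrix R, so that A' = A R roughly preserves pairwise
    distances (up to scaling)."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    R = rng.normal(loc=0.0, scale=1.0, size=(d, t)) / np.sqrt(t)  # keeps expected norms
    return A @ R

# Usage sketch: cluster in the low-dimensional space instead of the original one.
# A_proj = random_projection(A, t=50)
# labels = some_distance_based_clustering(A_proj, k)   # any distance-based method
```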
Global Projection
In global projection, the objective is that each projected data point be as close as possible to the original data point, whereas in locality-preserving projection the objective was to keep pairwise distances roughly the same in the projected space. In fact, if the original data matrix is A and A' is its approximation, the aim of global projection is to minimize ||A' - A||. Different approaches are available to create the approximation matrix, such as SVD (singular value decomposition) [26], CX/CUR [27], CMD [28] and Colibri [29].
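As an illustration of the SVD option, the sketch below computes the best rank-t approximation A' of A, which minimizes ||A - A'|| in the Frobenius sense, and returns t-dimensional coordinates that a distance-based clustering algorithm could consume:

```python
import numpy as np

def svd_low_rank(A, t):
    """Best rank-t approximation of A (minimizes ||A - A'|| over all rank-t
    matrices), plus the reduced n x t coordinates usable for clustering."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_approx = (U[:, :t] * s[:t]) @ Vt[:t, :]   # rank-t reconstruction A'
    reduced = U[:, :t] * s[:t]                  # low-dimensional representation
    return A_approx, reduced
```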
2.2 Multi-Machine clustering techniques
Although the sampling and dimension reduction methods used in the single-machine clustering algorithms presented in the previous section improve the scalability and speed of those algorithms, nowadays the growth of data size is much faster than advancements in memory and processors. Consequently, one machine with a single processor and memory cannot handle terabytes and petabytes of data, and this underlines the need for algorithms that can run on multiple machines. As shown in Fig. 5, this approach breaks the huge amount of data down into smaller pieces that can be loaded on different machines and then uses the processing power of these machines to solve the huge problem. Multi-machine clustering algorithms are divided into two main categories:
Un-automated distribution: parallel clustering
Automated distribution: MapReduce-based clustering
Fig. 5. General concept of multi-machine clustering techniques
In parallel clustering, developers are involved not just with the parallel clustering challenges themselves but also with the details of the data distribution process between the different machines available in the network, which makes development very complicated and time consuming. The difference between parallel algorithms and the MapReduce framework lies in the convenience that MapReduce provides for programmers: it relieves them from unnecessary networking problems and concepts such as load balancing, data distribution and fault tolerance by handling them automatically. This feature allows massive parallelism and easier and faster scaling of the parallel system. Parallel and distributed clustering algorithms follow a general cycle, as represented below:
Fig. 6. General cycle for multi-machine clustering algorithms
In the first stage, data is divided into partitions, and these are distributed over the machines. Afterward, each machine performs clustering individually on its assigned partition of the data. The two main challenges for parallel and distributed clustering are minimizing data traffic and the lower accuracy compared with the serial equivalent. Lower accuracy in distributed algorithms can be caused by two main factors: first, different clustering algorithms may be deployed on different machines, and second, even if the same clustering algorithm is used on all machines, in some cases dividing the data may change the final clustering result. In the rest of this study, parallel algorithms and MapReduce algorithms will be discussed in turn, and then more advanced algorithms proposed recently for big data will be covered.
2.2.1 Parallel clustering
Although parallel algorithms add the difficulty of distribution for programmers, this is worthwhile because of the major improvements in the scaling and speed of clustering algorithms. In the following parts, some of them are reviewed.
DBDC
DBDC [30], [31] is a distributed, density-based clustering algorithm. The discovery of clusters of arbitrary shape is the main objective of density-based clustering. The density of points within each cluster is much higher than outside of it, while the density of the regions of noise is lower than the density in any of the clusters. DBDC follows the general cycle mentioned earlier (Fig. 6). At the individual clustering stage, it uses a defined local algorithm for clustering, and then, for global clustering, a single-machine density-based algorithm called DBSCAN [32] is used to finalize the results. The results show that, although DBDC maintains the same clustering quality as its serial counterpart, it runs about 30 times faster.
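The sketch below illustrates the local/global cycle in a single process, using k-means as a stand-in for the local clustering stage and DBSCAN for the global stage; DBDC's actual local representation is more elaborate, so treat this purely as an illustration of the data flow:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def dbdc_like(partitions, local_k=20, eps=0.5, min_samples=2):
    """Local/global cycle: each partition (one per 'machine') is clustered
    locally, only the local representatives are shipped to a central site,
    and a single-machine DBSCAN merges them into global clusters."""
    local_models, all_reps = [], []
    for part in partitions:                       # stage 1: independent local clustering
        km = KMeans(n_clusters=min(local_k, len(part)), n_init=10).fit(part)
        local_models.append(km)
        all_reps.append(km.cluster_centers_)
    reps = np.vstack(all_reps)

    # Stage 2: global clustering of the (much smaller) set of representatives.
    rep_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reps)

    # Stage 3: relabel every point through its local representative
    # (representatives labeled -1 by DBSCAN propagate a noise label).
    global_labels, offset = [], 0
    for part, km in zip(partitions, local_models):
        global_labels.append(rep_labels[offset + km.labels_])
        offset += km.cluster_centers_.shape[0]
    return global_labels
```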
ParMETIS
ParMETIS [33] is the parallel version of METIS [34], a multilevel partitioning algorithm. Graph partitioning is a clustering problem whose goal is to find good clusters of vertices. METIS contains three main steps. The first step is the coarsening phase: a maximal matching on the original graph is computed, the matched vertices are collapsed to create a smaller graph, and this process is iterated until the number of vertices becomes small enough. The second stage is the partitioning stage, in which a k-way partitioning of the coarsened graph is performed using a multilevel recursive bisection algorithm. Finally, in the third, un-coarsening stage, a greedy refinement algorithm is used to project the partitioning from the second stage back onto the original graph.
ParMETIS is the distributed version of METIS. Because of its graph-based nature, ParMETIS differs from the other clustering approaches and does not follow the general cycle mentioned earlier for parallel and distributed clustering. Initially, an equal number of vertices is distributed to each machine, and a coloring of the graph is computed across the machines. Afterward, a global graph is computed by incrementally matching only vertices of the same color, one color at a time. In the partitioning stage, this graph is broadcast to the machines, and each machine performs recursive bisection by exploring only a single path of the recursive bisection tree. Finally, the un-coarsening stage consists of moving vertices across the edge-cut. Experiments show that ParMETIS was 14 to 35 times faster than the serial algorithm while maintaining quality close to that of the serial algorithm.
GPU based parallel clustering
A new direction that has recently opened in parallel computing is to use the processing power of the GPU instead of the CPU to speed up computation. G-DBSCAN [35] is a GPU-accelerated parallel version of the density-based clustering algorithm DBSCAN, and it is one of the recently proposed algorithms in this category. The authors distinguish their method by using graph-based data indexing, which adds flexibility and allows more parallelization opportunities. G-DBSCAN is a two-step algorithm, and both steps have been parallelized. The first step constructs a graph: each object represents a node, and an edge is created between two objects if their distance is lower than or equal to a predefined threshold. When this graph is ready, the second step identifies the clusters, using breadth-first search (BFS) to traverse the graph created in the first step. Results show that, in comparison to its serial implementation, G-DBSCAN is 112 times faster.
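A serial Python sketch of the same two steps, graph construction followed by BFS labeling, clarifies what G-DBSCAN parallelizes on the GPU; the quadratic distance computation here is exactly the part that benefits from thousands of GPU cores (eps and min_pts are illustrative parameter names):

```python
import numpy as np
from collections import deque

def graph_bfs_clusters(X, eps, min_pts=4):
    """Step 1: build a graph linking every pair of points within distance eps.
    Step 2: identify clusters by breadth-first search from core points."""
    n = len(X)
    # Step 1: adjacency lists (quadratic here; parallelized on the GPU).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = [np.where((dists[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]
    core = [len(adj[i]) + 1 >= min_pts for i in range(n)]

    # Step 2: BFS from each unvisited core point labels one cluster.
    labels = np.full(n, -1)          # -1 = noise / unassigned
    cluster_id = 0
    for start in range(n):
        if labels[start] != -1 or not core[start]:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if not core[u]:
                continue             # border points do not expand the cluster
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = cluster_id
                    queue.append(v)
        cluster_id += 1
    return labels
```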
2.2.2 MapReduce
Although parallel clustering algorithms improved the scalability and speed of clustering, the complexity of dealing with memory and processor distribution remained quite an important challenge. MapReduce is a framework, illustrated in Fig. 7, that was initially introduced by Google; Hadoop is an open-source implementation of it [36]. In this section, algorithms implemented on top of this framework are reviewed and their improvements are discussed in terms of three measures (a small sketch of how these ratios can be computed from measured run times follows the list):
Speed-up: the ratio of running times when the dataset remains constant and the number of machines in the system is increased.
Scale-up: measures whether an x-times larger system can perform an x-times larger job with the same run time.
Size-up: keeping the number of machines unchanged, measures how the running time grows with the data size.
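One straightforward way to express these three measures as ratios of measured run times is sketched below (the conventions and variable names are ours):

```python
def speed_up(t_one_machine, t_m_machines):
    """Speed-up: run time on 1 machine over run time on m machines, same dataset."""
    return t_one_machine / t_m_machines

def size_up(t_base, t_scaled):
    """Size-up: how much longer an m-times-larger dataset takes on the same machines."""
    return t_scaled / t_base

def scale_up(t_small_on_small, t_large_on_large):
    """Scale-up: ratio of run times when both data and machines grow m times;
    a value close to 1 means the system scales well."""
    return t_small_on_small / t_large_on_large
```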
Fig. 7. MapReduce Framework
MapReduce based K-means (PK-means)
PKMeans [37] is a distributed version of the well-known clustering algorithm k-means [38], [39]. The aim of the k-means algorithm is to cluster the dataset into k clusters in such a way that instances in one cluster share more similarity with each other than with the instances of other clusters. K-means randomly chooses k instances of the dataset in the initial step and then performs two phases repeatedly: first, it assigns each instance to the nearest cluster, and after finishing the assignment for all of the instances, in the second phase it updates each cluster center with the mean of its instances.
PKMeans distributes the computation between multiple machines using the MapReduce framework to speed up and scale up the process. The individual clustering, which contains the first phase, happens in the mapper, and then the general clustering performs the second phase in the reducer.
PKMeans achieves an almost linear speed-up and a linear size-up. It also has a good scale-up: for 4 machines it showed a scale-up of 0.75. In addition, PKMeans is an exact algorithm, meaning that it offers the same clustering quality as its serial counterpart, k-means.
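A plain-Python sketch of how the two k-means phases map onto the map and reduce steps is shown below; it follows the general scheme described above, with combiners, Hadoop I/O and convergence checks omitted, and is not the PKMeans code itself:

```python
import numpy as np
from collections import defaultdict

def kmeans_map(points, centers):
    """Map phase: a mapper assigns its block of points to the nearest center
    and emits (center_id, (partial_sum, count)) pairs."""
    out = defaultdict(lambda: [np.zeros(centers.shape[1]), 0])
    for p in points:
        cid = int(np.argmin(np.linalg.norm(centers - p, axis=1)))
        out[cid][0] += p
        out[cid][1] += 1
    return out

def kmeans_reduce(mapper_outputs, centers):
    """Reduce phase: sums and counts from all mappers are combined per center,
    and each new center is the mean of its assigned points."""
    totals = defaultdict(lambda: [np.zeros(centers.shape[1]), 0])
    for out in mapper_outputs:
        for cid, (s, c) in out.items():
            totals[cid][0] += s
            totals[cid][1] += c
    new_centers = centers.copy()
    for cid, (s, c) in totals.items():
        new_centers[cid] = s / c
    return new_centers

# Driver sketch: one MapReduce job per k-means iteration.
# for _ in range(num_iterations):
#     centers = kmeans_reduce([kmeans_map(block, centers) for block in data_blocks], centers)
```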
MR-DBSCAN
A very recently proposed algorithm is MR-DBSCAN [40], a scalable MapReduce-based DBSCAN algorithm. Three major drawbacks exist in earlier parallel DBSCAN algorithms, which MR-DBSCAN addresses: first, they fail to balance the load between the parallel nodes; second, they are limited in scalability because not all critical sub-procedures are parallelized; and finally, their architecture and design make them less portable to emerging parallel processing paradigms.
MR-DBSCAN proposes a novel data partitioning method based on computation cost estimation, as well as a scalable DBSCAN algorithm in which all critical sub-procedures are fully parallelized. Experiments on large datasets confirm the scalability and efficiency of MR-DBSCAN.
MapReduce based on GPU
As discussed in the G-DBSCAN section, GPUs can be much more efficient than CPUs: while CPUs have several processing cores, GPUs consist of thousands of cores, which makes them much more powerful and faster for highly parallel workloads. Although MapReduce with CPUs is a very efficient framework for distributed computing, if GPUs are used instead, the framework can improve the speed-up and scale-up of distributed applications. GPMR is a MapReduce framework that uses multiple GPUs. Although clustering applications have not yet been implemented in this framework, the growth of data sizes urges researchers to devise faster and more scalable algorithms, and this framework could be an appropriate solution to fulfill those needs.
3 Conclusion and Future Works
Clustering is one of the essential tasks in data mining, and it needs improvement now more than ever to assist data analysts in extracting knowledge from terabytes and petabytes of data. In this study, the improvement trend of data clustering algorithms was discussed. To sum up, while traditional sampling and dimension reduction algorithms are still useful, they do not have enough power to deal with huge amounts of data, because even after sampling, a petabyte of data is still very big and cannot be handled by ordinary clustering algorithms; consequently, the future of clustering is tied to distributed computing. Although parallel clustering is potentially very useful, the complexity of implementing such algorithms is a challenge. On the other hand, the MapReduce framework provides a very satisfying base for implementing clustering algorithms. As the results show, MapReduce-based algorithms offer impressive scalability and speed in comparison to their serial counterparts while maintaining the same quality. Given that GPUs are much more powerful than CPUs, as future work it is worth deploying clustering algorithms on GPU-based MapReduce frameworks in order to achieve even better scalability and speed.
Acknowledgments. This work is supported by University of Malaya High Impact
Research Grant no vote UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Education
Malaysia.
References
1. T. C. Havens, J. C. Bezdek, and M. Palaniswami, "Scalable single linkage hierarchical clustering for big data," in Intelligent Sensors, Sensor Networks and Information Processing, 2013 IEEE Eighth International Conference on. IEEE, 2013, pp. 396-401.
2. "YouTube Statistics," 2014. [Online]. Available: http://www.youtube.com/yt/press/statistics.html.
3. P. Williams, C. Soares, and J. E. Gilbert, "A Clustering Rule Based Approach for Classification Problems," Int. J. Data Warehous. Min., vol. 8, no. 1, pp. 1-23, 2012.
4. R. V. Priya and A. Vadivel, "User Behaviour Pattern Mining from Weblog," Int. J. Data Warehous. Min., vol. 8, no. 2, pp. 1-22, 2012.
5. T. Kwok, K. A. Smith, S. Lozano, and D. Taniar, "Parallel Fuzzy c-Means Clustering for Large Data Sets," 2002, pp. 365-374.
6. H. Kalia, S. Dehuri, and A. Ghosh, "A Survey on Fuzzy Association Rule Mining," Int. J. Data Warehous. Min., vol. 9, no. 1, pp. 1-27, 2013.
7. O. Daly and D. Taniar, "Exception Rules Mining Based on Negative Association Rules," in Proceedings of the International Conference on Computational Science and Its Applications (ICCSA 2004), 2004, pp. 543-552.
8. M. Z. Ashrafi, D. Taniar, and K. A. Smith, "Redundant association rules reduction techniques," Int. J. Bus. Intell. Data Min., vol. 2, no. 1, pp. 29-63, 2007.
9. D. Taniar, W. Rahayu, V. C. S. Lee, and O. Daly, "Exception rules in association rule mining," Appl. Math. Comput., vol. 205, no. 2, pp. 735-750, 2008.
10. F. G. Meyer and J. Chinrungrueng, "Spatiotemporal clustering of fMRI time series in the spectral domain," Med. Image Anal., vol. 9, no. 1, pp. 51-68, 2004.
11. J. Ernst, G. J. Nau, and Z. Bar-Joseph, "Clustering short time series gene expression data," Bioinformatics, vol. 21, no. suppl 1, pp. i159-i168, Jun. 2005.
12. F. Iglesias and W. Kastner, "Analysis of Similarity Measures in Times Series Clustering for the Discovery of Building Energy Patterns," Energies, vol. 6, no. 2, pp. 579-597, Jan. 2013.
13. Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, no. 3, pp. 311-331, 2004.
14. R. Hathaway and J. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Comput. Stat. Data Anal., vol. 51, no. 1, pp. 215-234, 2006.
15. "Big Data, What it is and why it is important." [Online]. Available: http://www.sas.com/en_us/insights/big-data/what-is-big-data.html.
16. R. T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining," IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003-1016, 2002.
17. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
18. R. T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining," IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003-1016, 2002.
19. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in SIGMOD Conference, 1996, pp. 103-114.
20. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in SIGMOD Conference, 1996, pp. 103-114.
21. S. Guha and R. Rastogi, "CURE: An efficient clustering algorithm for large databases," Inf. Syst., vol. 26, no. 1, pp. 35-58, 2001.
22. D. Achlioptas and F. McSherry, "Fast computation of low rank matrix approximations," J. ACM, vol. 54, no. 2, p. 9, 2007.
23. X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: A cluster ensemble approach," in ICML, 2003, pp. 186-193.
24. S. Dasgupta, "Experiments with random projection," in UAI, 2000, pp. 143-151.
25. C. Boutsidis, C. Chekuri, T. Feder, and R. Motwani, "Random projections for k-means clustering," in NIPS, 2010, pp. 298-306.
26. G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. The Johns Hopkins University Press, 1989.
27. P. Drineas, R. Kannan, and M. W. Mahoney, "Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition," SIAM J. Comput., vol. 36, no. 1, pp. 132-157, 2006.
28. J. Sun, Y. Xie, H. Zhang, and C. Faloutsos, "Less is More: Compact Matrix Decomposition for Large Sparse Graphs," in SDM, 2007.
29. H. Tong, S. Papadimitriou, J. Sun, P. S. Yu, and C. Faloutsos, "Colibri: Fast mining of large static and dynamic graphs," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 686-694.
30. E. Januzaj, H.-P. Kriegel, and M. Pfeifle, "DBDC: Density based distributed clustering," in Advances in Database Technology - EDBT 2004, 2004, pp. 88-105.
31. C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. 2013.
32. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, 1996, pp. 226-231.
33. G. Karypis and V. Kumar, "Parallel multilevel k-way partitioning for irregular graphs," SIAM Rev., vol. 41, no. 2, pp. 278-300, 1999.
34. G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," J. Parallel Distrib. Comput., vol. 48, no. 1, pp. 96-129, 1998.
35. G. Andrade, G. Ramos, D. Madeira, R. Sachetto, R. Ferreira, and L. Rocha, "G-DBSCAN: A GPU Accelerated Algorithm for Density-based Clustering," Procedia Comput. Sci., vol. 18, pp. 369-378, 2013.
36. P. P. Anchalia, A. K. Koundinya, and S. NK, "MapReduce Design of K-Means Clustering Algorithm," in Information Science and Applications (ICISA), 2013 International Conference on, 2013, pp. 1-5.
37. W. Zhao, H. Ma, and Q. He, "Parallel k-means clustering based on MapReduce," in Cloud Computing, 2009, pp. 674-679.
38. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
39. B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. CRC Press, 2012.
40. Y. He, H. Tan, W. Luo, S. Feng, and J. Fan, "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data," Front. Comput. Sci., vol. 8, no. 1, pp. 83-99, 2014.