JSS Journal of Statistical Software
October 2014, Volume 61, Issue 6. http://www.jstatsoft.org/
NbClust: An R Package for Determining the
Relevant Number of Clusters in a Data Set
Malika Charrad
Université de Gabes
Nadia Ghazzali
Université du Québec
à Trois-Rivières
Véronique Boiteau
Université Laval
Azam Niknafs
Université Laval
Abstract
Clustering is the partitioning of a set of objects into groups (clusters) so that objects
within a group are more similar to each other than to objects in different groups. Most
clustering algorithms depend on some assumptions in order to define the subgroups
present in a data set. As a consequence, the resulting clustering scheme requires some
sort of evaluation of its validity.
The evaluation procedure has to tackle difficult problems such as the quality of clusters,
the degree to which a clustering scheme fits a specific data set, and the optimal number
of clusters in a partitioning. In the literature, a wide variety of indices have been proposed
to find the optimal number of clusters in a partitioning of a data set during the clustering
process. However, for most of the indices proposed in the literature, programs are unavailable
to test these indices and compare them.
The R package NbClust has been developed for that purpose. It provides 30 indices
which determine the number of clusters in a data set, and it also offers the user the best
clustering scheme from the different results. In addition, it provides a function to
perform k-means and hierarchical clustering with different distance measures and
aggregation methods. Any combination of validation indices and clustering methods can be
requested in a single function call. This enables the user to simultaneously evaluate
several clustering schemes while varying the number of clusters, to help determine the most
appropriate number of clusters for the data set of interest.
Keywords: R package, cluster validity, number of clusters, clustering, indices, k-means,
hierarchical clustering.
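The idea summarized in the abstract, evaluating a validity index over a range of candidate cluster counts and keeping the count where the index is best, can be sketched as follows. This is only an illustrative sketch in plain Python, not the NbClust implementation or its R interface: the toy data, the helper names (`sq_dist`, `centroid`, `kmeans`, `ch_index`), the farthest-point initialization, and the choice of the Calinski-Harabasz index as the single criterion are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(data, k, max_iter=100):
    """Lloyd's k-means with a deterministic farthest-point initialization
    (an assumption of this sketch, chosen so the run is reproducible)."""
    centers = [data[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(data, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = centroid(members)
    return labels


def ch_index(data, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)); larger is better."""
    n = len(data)
    groups = {}
    for p, lab in zip(data, labels):
        groups.setdefault(lab, []).append(p)
    k = len(groups)
    grand = centroid(data)
    within = between = 0.0
    for g in groups.values():
        c = centroid(g)
        within += sum(sq_dist(p, c) for p in g)   # within-group sum of squares
        between += len(g) * sq_dist(c, grand)     # between-group sum of squares
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: three small, well-separated groups of five points each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
data = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

# Evaluate the index for each candidate number of clusters and keep the best.
scores = {k: ch_index(data, kmeans(data, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On this toy data the index peaks at k = 3, the number of groups used to generate the points. NbClust applies the same principle with many indices at once, while also letting the user vary the clustering method and the distance measure.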
1. Introduction and related work
Clustering is the task of assigning a set of objects into groups (clusters) so that the objects
in the same cluster are more similar to each other than objects in other clusters. There is a
Acknowledgments
The authors would like to thank the NSERC-Industrial Alliance Chair for Women in Science
and Engineering in Quebec for its support of this research.
Affiliation:
Malika Charrad
Université de Gabes
Institut Supérieur de l’Informatique
Route Djerba Km 3, Boite Postale N 283
4100 Medenine, Tunisie
and
Université Laval, Québec
E-mail: malika.charrad@riadi.rnu.tn
Nadia Ghazzali
Université du Québec à Trois-Rivières
E-mail: nadia.ghazzali@uqtr.ca
Véronique Boiteau, Azam Niknafs
Département de Mathématiques et de Statistique
Université Laval, Québec
E-mail: veronique.boiteau.1@ulaval.ca, azam.niknafs.1@ulaval.ca
Journal of Statistical Software, http://www.jstatsoft.org/
published by the American Statistical Association, http://www.amstat.org/
Volume 61, Issue 6, October 2014
Submitted: 2012-08-13
Accepted: 2013-04-08