
NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set

Abstract

Download at: http://www.jstatsoft.org/v61/i06/paper

Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than to objects in different groups. Most clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some evaluation of its validity.

The evaluation procedure has to tackle difficult problems such as the quality of clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, no programs are available to test and compare them.

The R package NbClust has been developed for that purpose. It provides 30 indices for determining the number of clusters in a data set and also offers the user the best clustering scheme obtained from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determine the most appropriate number of clusters for the data set of interest.
JSS Journal of Statistical Software
October 2014, Volume 61, Issue 6. http://www.jstatsoft.org/
NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set
Malika Charrad
Université de Gabes
Nadia Ghazzali
Université du Québec à Trois-Rivières
Véronique Boiteau
Université Laval
Azam Niknafs
Université Laval
Keywords: R package, cluster validity, number of clusters, clustering, indices, k-means, hierarchical clustering.
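The idea described in the abstract, evaluating a validity index while varying the number of clusters and keeping the best-scoring value, can be illustrated with a small sketch. This is not the NbClust implementation and does not use its R interface. It is a pure-Python illustration that assumes a plain k-means with a deterministic farthest-point initialisation and uses the Calinski-Harabasz index, one of the indices the package offers. The function names `kmeans` and `calinski_harabasz` and the toy data are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: NOT the NbClust implementation.
# All names and the toy data below are assumptions made for this example.

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, max_iter=100):
    """Plain Lloyd k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return labels, centers

def calinski_harabasz(points, labels, centers):
    """CH(k) = [B / (k - 1)] / [W / (n - k)], B and W being the between- and
    within-cluster sums of squares."""
    n, k = len(points), len(centers)
    overall = centroid(points)
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    between = sum(labels.count(j) * sq_dist(centers[j], overall) for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

# Toy data: three well-separated groups of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

# Evaluate the index for each candidate number of clusters and keep the best.
scores = {}
for k in range(2, 6):
    labels, centers = kmeans(data, k)
    scores[k] = calinski_harabasz(data, labels, centers)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this toy data
```

A single index can be misleading on less clearly separated data, which is why combining the recommendations of many indices, as the package does, is useful in practice.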
1. Introduction and related work
Clustering is the task of assigning a set of objects into groups (clusters) so that the objects
in the same cluster are more similar to each other than objects in other clusters. There is a
32 NbClust: Determining the Relevant Number of Clusters in R
Acknowledgments
The authors would like to thank the NSERC-Industrial Alliance Chair for Women in Science
and Engineering in Quebec for its support of this research.
Affiliation:
Malika Charrad
Université de Gabes
Institut Supérieur de l’Informatique
Route Djerba Km 3, Boite Postale N 283
4100 Medenine, Tunisie
and
Université Laval, Québec
E-mail: malika.charrad@riadi.rnu.tn
Nadia Ghazzali
Université du Québec à Trois-Rivières
E-mail: nadia.ghazzali@uqtr.ca
Véronique Boiteau, Azam Niknafs
Département de Mathématiques et de Statistique
Université Laval, Québec
E-mail: veronique.boiteau.1@ulaval.ca, azam.niknafs.1@ulaval.ca
Journal of Statistical Software, http://www.jstatsoft.org/
published by the American Statistical Association, http://www.amstat.org/
Volume 61, Issue 6, October 2014
Submitted: 2012-08-13
Accepted: 2013-04-08
