
NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set

October 2014
Journal of Statistical Software 61(6):1-36

DOI:10.18637/jss.v061.i06

License
CC BY 4.0

Authors:

Malika Charrad

Université Panthéon-Assas Paris 2

Azam Niknafs

Université Laval

Download at: http://www.jstatsoft.org/v61/i06/paper

Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than to objects in different groups. Most clustering algorithms depend on assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation of its validity.

The evaluation procedure has to tackle difficult problems such as the quality of the clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, no programs are available to test and compare them.

The R package NbClust has been developed for that purpose. It provides 30 indices for determining the number of clusters in a data set and also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determine the most appropriate number of clusters for the data set of interest.
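As an illustration of the single-call interface described above, the following is a minimal sketch, assuming the NbClust package is installed from CRAN. The data set (the four numeric columns of R's built-in iris data), the range of cluster counts, and the choice of complete linkage with the Calinski-Harabasz index ("ch") are illustrative assumptions, not settings taken from the text above.

```r
# Assumption: the NbClust package is installed, e.g. install.packages("NbClust").
library(NbClust)

# Illustrative data: the four numeric measurements of the built-in iris data.
x <- as.matrix(iris[, 1:4])

# One call selects the distance measure, the clustering method, the range of
# cluster counts to try, and the validity index to compute.
res <- NbClust(data = x, distance = "euclidean",
               min.nc = 2, max.nc = 6,
               method = "complete", index = "ch")

res$All.index       # index value for each candidate number of clusters
res$Best.nc         # number of clusters proposed by the index, with its value
res$Best.partition  # cluster membership of each observation under that proposal
```

Passing index = "all" instead requests most of the available indices in the same call, at a higher computational cost.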

JSS Journal of Statistical Software

October 2014, Volume 61, Issue 6. http://www.jstatsoft.org/

NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set

Malika Charrad
Université de Gabes

Nadia Ghazzali
Université du Québec à Trois-Rivières

Véronique Boiteau
Université Laval

Azam Niknafs
Université Laval

Keywords: R package, cluster validity, number of clusters, clustering, indices, k-means, hierarchical clustering.

1. Introduction and related work

Clustering is the task of assigning a set of objects into groups (clusters) so that the objects in the same cluster are more similar to each other than to objects in other clusters. There is a

Acknowledgments

The authors would like to thank the NSERC-Industrial Alliance Chair for Women in Science and Engineering in Quebec for its support of this research.

Affiliation:

Malika Charrad
Université de Gabes
Institut Supérieur de l’Informatique
Route Djerba Km 3, Boite Postale N 283
4100 Medenine, Tunisie
and
Université Laval, Québec
E-mail: malika.charrad@riadi.rnu.tn

Nadia Ghazzali
Université du Québec à Trois-Rivières
E-mail: nadia.ghazzali@uqtr.ca

Véronique Boiteau, Azam Niknafs
Département de Mathématiques et de Statistique
Université Laval, Québec
E-mail: veronique.boiteau.1@ulaval.ca, azam.niknafs.1@ulaval.ca

Journal of Statistical Software, http://www.jstatsoft.org/
published by the American Statistical Association, http://www.amstat.org/
Volume 61, Issue 6, October 2014
Submitted: 2012-08-13
Accepted: 2013-04-08
