Preprint

Revisiting Agglomerative Clustering

Authors:

Abstract

In data clustering, emphasis is often placed on finding groups of points. An equally important concern is the avoidance of false positives. As might be expected, these two goals oppose one another: emphasis on finding clusters tends to imply a higher probability of obtaining false positives. The present work addresses this problem for several traditional agglomerative methods, namely single, average, median, complete, centroid, and Ward's linkage, applied to unimodal and bimodal datasets following uniform, Gaussian, exponential, and power-law distributions. More importantly, we adopt a generic model of clusters involving a higher-density core surrounded by a transition zone, followed by a sparser set of outliers. Combined with a preliminary specification of the size of the expected clusters, this model paved the way for an objective means of identifying clusters from dendrograms. The adopted model also allowed the relevance of the detected clusters to be estimated in terms of the height of the subtrees corresponding to the identified clusters: the lower this height, the more compact and relevant the clusters tend to be. Several interesting results were obtained, including the tendency of several of the considered methods to detect two clusters in unimodal data; the single-linkage method proved the most resilient to this tendency. In addition, several methods tended to detect clusters that do not correspond directly to the cores and are therefore characterized by lower relevance. The possibility of identifying the type of distribution of the points from the adopted measurements was also investigated.
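
To make the described pipeline concrete, the sketch below uses SciPy's scipy.cluster.hierarchy implementations of the same six linkage methods; it is not the authors' own code, and the synthetic bimodal Gaussian dataset and the merge-height relevance proxy are illustrative assumptions standing in for the paper's cluster model and subtree-height criterion. It extracts a pre-specified number of clusters from each dendrogram, mirroring the preliminary specification of expected cluster size mentioned above.

# Minimal sketch (assumptions noted above): apply several agglomerative
# linkage methods to synthetic bimodal data, extract a pre-specified
# number of clusters from each dendrogram, and report the merge height
# of the resulting partition as a rough relevance proxy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Illustrative bimodal Gaussian dataset: two well-separated cores.
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(200, 2)),
])

n_clusters = 2  # expected number of clusters, specified in advance
for method in ("single", "average", "median", "complete", "centroid", "ward"):
    Z = linkage(data, method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Height of the merge that produces the n_clusters-cluster partition;
    # lower heights suggest more compact (more relevant) clusters.
    cut_height = Z[-n_clusters, 2]
    sizes = np.bincount(labels)[1:]  # fcluster labels are 1-based
    print(f"{method:8s} cluster sizes={sizes}, cut height={cut_height:.2f}")

Running such a sketch on unimodal data instead (a single Gaussian blob) is one way to reproduce, in spirit, the tendency of most linkage methods to report two clusters where only one exists.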
A preprint of this work is available at: https://arxiv.org/abs/2005.07995