Preprint

Revisiting Agglomerative Clustering

Authors:

Abstract

In data clustering, emphasis is often placed on finding groups of points. An equally important concern is the avoidance of false positives. As might be expected, these two goals oppose one another: emphasis on finding clusters tends to imply a higher probability of obtaining false positives. The present work addresses this problem for several traditional agglomerative methods, namely single, average, median, complete, centroid, and Ward's linkage, applied to unimodal and bimodal datasets following uniform, Gaussian, exponential, and power-law distributions. More importantly, we adopt a generic model of clusters involving a higher-density core surrounded by a transition zone, followed by a sparser set of outliers. Combined with a preliminary specification of the size of the expected clusters, this model paved the way for an objective means of identifying clusters from dendrograms. The adopted model also allowed the relevance of the detected clusters to be estimated in terms of the height of the subtrees corresponding to the identified clusters: the lower this height, the more compact and relevant the clusters tend to be. Several interesting results were obtained, including the tendency of several of the considered methods to detect two clusters in unimodal data; the single-linkage method proved the most resilient to this tendency. In addition, several methods tended to detect clusters that do not correspond directly to the cores and are therefore characterized by lower relevance. The possibility of identifying the type of distribution of the points from the adopted measurements was also investigated.
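
To make the described pipeline concrete, the sketch below uses SciPy's scipy.cluster.hierarchy implementations of the same six linkage methods; it is not the authors' own code, and the synthetic bimodal Gaussian dataset and the merge-height relevance proxy are illustrative assumptions standing in for the paper's cluster model and subtree-height criterion. It extracts a pre-specified number of clusters from each dendrogram, mirroring the preliminary specification of expected cluster size mentioned above.

# Minimal sketch (assumptions noted above): apply several agglomerative
# linkage methods to synthetic bimodal data, extract a pre-specified
# number of clusters from each dendrogram, and report the merge height
# of the resulting partition as a rough relevance proxy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Illustrative bimodal Gaussian dataset: two well-separated cores.
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(200, 2)),
])

n_clusters = 2  # expected number of clusters, specified in advance
for method in ("single", "average", "median", "complete", "centroid", "ward"):
    Z = linkage(data, method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Height of the merge that produces the n_clusters-cluster partition;
    # lower heights suggest more compact (more relevant) clusters.
    cut_height = Z[-n_clusters, 2]
    sizes = np.bincount(labels)[1:]  # fcluster labels are 1-based
    print(f"{method:8s} cluster sizes={sizes}, cut height={cut_height:.2f}")

Running such a sketch on unimodal data instead (a single Gaussian blob) is one way to reproduce, in spirit, the tendency of most linkage methods to report two clusters where only one exists.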
A preprint of this work is available at: https://arxiv.org/abs/2005.07995