Figure - available from: International Journal of Data Science and Analytics
Runtime versus memory usage of sGridSLINK with increasing τ for HHP2M5d dataset

Source publication
Article
Full-text available
The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to obtain the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be cluster...

Citations

... The SLINK algorithm (Sibson 1973) carries out single-link cluster analysis on an arbitrary dissimilarity coefficient and provides a representation of the resulting dendrogram that can readily be converted into the usual tree diagram. An alternative implementation (Goyal et al. 2020) also exists, which reduces the number of distance calculations required by the standard SLINK implementation and achieves a time complexity of O(m log m) for m patterns. A hierarchical clustering that omits the initial sorting and consecutive clustering (Schmidt et al. 2017), with linear time complexity, has also been presented as an alternative to single-linkage clustering. ...
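Single-link clustering merges, at each step, the two clusters whose closest members are nearest; equivalently, it can be derived from a minimum-spanning-tree construction. As a concrete illustration of the quadratic baseline the snippet above contrasts with the O(m log m) variant, here is a minimal, hypothetical sketch (the function name `single_link` is ours) of the naive union-find formulation; it is not Sibson's pointer-representation SLINK nor the Goyal et al. implementation:

```python
# Hypothetical sketch: single-link clustering via a Kruskal-style
# union-find over pairwise distances sorted in increasing order
# (equivalent to cutting the minimum spanning tree). Naive O(m^2) pairs.
from itertools import combinations

def single_link(points, n_clusters):
    """Merge closest clusters until n_clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, sorted by squared Euclidean distance.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: sum((a - b) ** 2
                          for a, b in zip(points[e[0]], points[e[1]])),
    )
    clusters = len(points)
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # merge the two nearest clusters
            parent[ri] = rj
            clusters -= 1
            if clusters == n_clusters:
                break
    return [find(i) for i in range(len(points))]
```

Cutting the merge sequence early corresponds to cutting the dendrogram at a chosen level, as the source abstract describes.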
Article
Full-text available
Our paper presents a novel approach to pattern classification. A general disadvantage of traditional classifiers is the discrepancy between their behaviour and optimal parameter settings during training on a given pattern set and during the subsequent cross-validation. We introduce the term critical sensitivity, meaning the lowest sensitivity reached for any individual class. This approach ensures uniform classification quality across individual classes and therefore prevents outlier classes with very poor results. We focus on the evaluation of critical sensitivity as a quality criterion. Our proposed classifier eliminates this disadvantage in many cases. Our aim is to show that easily formed hidden classes can contribute significantly to improving the quality of a classifier. We therefore gave the proposed classifier a relatively simple structure, consisting of three layers. The first layer is linear and is used for dimensionality reduction. The second layer performs clustering and forms the hidden classes. The third is the output layer for optimal cluster union. To verify the results of the proposed system, we use standard datasets. Cross-validation performed on these datasets showed that our critical-sensitivity-based classifier provides sensitivity comparable to reference classifiers.
... Parallelizing different categories of clustering algorithms for distributed memory clusters using MPI and OpenMP is an active research area [17,18,32,37]. General-purpose distributed computing frameworks such as MapReduce [11], its variants [12], and Spark [43] provide a high level of programming abstraction while implicitly supporting data parallel execution and fault tolerance. ...
... Iterative refinement pattern (Code 1, lines 11-28; Code 2, lines 13-36): it can be implemented using either a while loop or a repeat-until loop. ...
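The iterative-refinement pattern mentioned in this snippet can be sketched generically: repeat a step until the model stops changing or an iteration cap is hit. The helper `iterative_refinement` below is a hypothetical illustration of the while-loop form, not code from the cited DSL:

```python
# Hypothetical sketch of the iterative-refinement domain pattern:
# apply `step` until a fixed point (model no longer changes) or the
# iteration cap is reached. K-means and EM both follow this shape.
def iterative_refinement(model, step, max_iters=100):
    """Refine `model` with `step` until convergence or max_iters."""
    for _ in range(max_iters):
        new_model = step(model)
        if new_model == model:   # convergence test
            return new_model
        model = new_model
    return model
```

A repeat-until variant would simply run `step` once before testing convergence; the fixed-point structure is the same.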
... Table 2 shows various hash maps that will be used during code generation of the K-means and EM codes. Corresponding to each such hash map on a slave, code to gather the hash maps from all slave processes, to merge them based on keys, and to assign the result to the original entity is added to the master code (lines 11-20). ...
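As an illustration of the key-wise merge the generated master code performs, here is a hypothetical Python sketch (the real system emits MPI C++; the name `merge_partial_maps` and the (sum, count) value layout are our assumptions, loosely modeled on the partial statistics a K-means slave would contribute per cluster):

```python
# Hypothetical sketch of the gather-and-merge step: each slave
# contributes a partial hash map (e.g., per-cluster coordinate sums and
# point counts); the master merges the partials key by key.
def merge_partial_maps(partials):
    """Key-wise merge of per-slave maps of (sum, count) pairs."""
    merged = {}
    for part in partials:
        for key, (s, c) in part.items():
            acc_s, acc_c = merged.get(key, (0.0, 0))
            merged[key] = (acc_s + s, acc_c + c)  # combine partial stats
    return merged
```

In the generated MPI code this merge would follow a gather of the serialized maps onto the master process.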
Article
Full-text available
Ease of programming and optimal parallel performance have historically been on opposite sides of a trade-off, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the trade-off. We observed that several clustering algorithms often share common traits; in particular, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observations on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. The compiler generates either MPI C++ code for distributed memory parallel processing or MPI-OpenMP C++ code for hybrid memory parallel processing, depending upon the target architecture. Our experiments on different state-of-the-art parallelization frameworks show that our system can achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community. Results are presented for both distributed and hybrid memory systems.
... Multi-dimensional indexing structures are extensively used for various data mining tasks such as clustering [1-4] and classification [5,6]. Various such algorithms include density-based clustering algorithms (DBSCAN [7] and OPTICS [8,9]), hierarchical clustering algorithms (S-LINK [1], C-LINK [1], Average LINK [1]), and k-nearest neighbor (k-NN) classifiers [5,6]. ...
... Grid-R-tree has been used to reengineer clustering algorithms like DBSCAN and SLINK from their basics [2, 35-37]. These reengineered versions work at the grid level, with operations over cells as well as over points rather than only over points as in the native algorithms, and give exactly the same clustering results as their native versions. ...
Article
Full-text available
The use of multi-dimensional indexing structures has gained a lot of attention in data mining. The most commonly used data structures for indexing data are the R-tree and its variants, the quad-tree, the k-d-tree, etc. These data structures support region queries (point, window and neighborhood queries) and nearest neighbor queries, which are extensively used in data mining algorithms. Although these data structures facilitate execution of the above queries in logarithmic time, the constraints associated with them become a bottleneck in query execution when they are used for large, high-dimensional datasets. Moreover, these indexing structures do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple yet effective adaptation of the R-tree using the concept of a grid. We also introduce a new query over the Grid-R-tree, called the cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms and enables us to redesign them for improved efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree in terms of neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as the k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH).
Additionally, an adaptive grid optimization has been applied to dense cells that index more data points than a threshold \(\tau \), to keep the load evenly distributed across cells; this resulted in more efficient query performance for datasets with a skewed distribution of data points.
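The adaptive-grid idea above, splitting any cell that indexes more than τ points, can be sketched as a recursive quadtree-style subdivision. This 2-D Python sketch is hypothetical (`adaptive_cells` is our name, and the paper's actual grid refinement may differ); it assumes distinct points so the recursion terminates:

```python
# Hypothetical sketch of adaptive grid refinement: any 2-D cell holding
# more than tau points is split into four quadrants, recursively, so a
# skewed point distribution still yields balanced cells.
def adaptive_cells(points, lo, hi, tau):
    """Return a list of (lo, hi, points) cells, each with <= tau points."""
    if len(points) <= tau:
        return [(lo, hi, points)]
    mx, my = (lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2
    quads = {0: [], 1: [], 2: [], 3: []}
    for p in points:
        quads[(p[0] >= mx) + 2 * (p[1] >= my)].append(p)
    bounds = {
        0: (lo, (mx, my)),              # lower-left quadrant
        1: ((mx, lo[1]), (hi[0], my)),  # lower-right
        2: ((lo[0], my), (mx, hi[1])),  # upper-left
        3: ((mx, my), hi),              # upper-right
    }
    cells = []
    for q, pts in quads.items():
        cells.extend(adaptive_cells(pts, bounds[q][0], bounds[q][1], tau))
    return cells
```

Every leaf cell then holds at most τ points, which is the load-balance property the optimization targets.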
Article
The surge in data sizes in fluid processing applications necessitates partitioning the data into clusters and studying their representatives instead of studying each voxel data point. In addition, the dynamic nature of these data poses further challenges. Under such circumstances, it becomes essential to develop an approach that can handle the delta data with minimal updates to the underlying data structure, without processing the complete data from scratch on every update. However, this poses synchronization challenges in parallelization. In this article, we propose SLCoDD (single-linkage clustering of dynamic data), a geometric distance based dynamic clustering and its multi-core parallelization using OpenMP. To improve efficiency, SLCoDD exploits geometric properties of the bounding squares. We illustrate trade-offs in various ways of performing point additions to clusters, point deletions, and their batched versions. Using a suite of large inputs, we demonstrate the effectiveness of SLCoDD. SLCoDD's fully dynamic version achieves a substantial geomean speedup of 8.49× over the static parallel version and of 5× over the dynamic sequential version.
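One geometric property that bounding-square approaches of this kind can exploit is that the gap between two axis-aligned boxes lower-bounds the single-link distance between the clusters they enclose, so distant cluster pairs can be pruned without comparing their points. A hypothetical 2-D sketch (`box_gap` is our name; SLCoDD's actual pruning tests may differ):

```python
# Hypothetical sketch: minimum distance between two axis-aligned
# bounding boxes. If this gap already exceeds a merge threshold, no
# point pair across the two clusters needs to be examined.
def box_gap(a_lo, a_hi, b_lo, b_hi):
    """Gap between boxes [a_lo, a_hi] and [b_lo, b_hi]; 0 if they overlap."""
    dx = max(b_lo[0] - a_hi[0], a_lo[0] - b_hi[0], 0.0)
    dy = max(b_lo[1] - a_hi[1], a_lo[1] - b_hi[1], 0.0)
    return (dx * dx + dy * dy) ** 0.5
```

Because the gap is a lower bound on every inter-cluster point distance, pruning with it never changes the single-linkage result.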