Figure - available from: International Journal of Data Science and Analytics
Runtime versus memory usage of sGridSLINK with increasing τ for HHP2M5d dataset

Source publication
Article
Full-text available
The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to obtain the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be cluster...

Citations

... The SLINK algorithm (Sibson 1973) carries out single-link cluster analysis on an arbitrary dissimilarity coefficient and provides a representation of the resulting dendrogram that can readily be converted into the usual tree diagram. An alternative implementation (Goyal et al. 2020) also exists, which reduces the number of distance calculations required by the standard SLINK implementation and achieves a time complexity of O(m log m) for m patterns. A hierarchical clustering that omits the initial sorting and consecutive clustering (Schmidt et al. 2017), with linear time complexity, has also been presented as an alternative to single-linkage clustering. ...
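Single-link clustering merges, at each step, the two clusters whose closest members are nearest; equivalently, it can be derived from a minimum-spanning-tree construction. As a concrete illustration of the quadratic baseline the snippet above contrasts with the O(m log m) variant, here is a minimal, hypothetical sketch (the function name `single_link` is ours) of the naive union-find formulation; it is not Sibson's pointer-representation SLINK nor the Goyal et al. implementation:

```python
# Hypothetical sketch: single-link clustering via a Kruskal-style
# union-find over pairwise distances sorted in increasing order
# (equivalent to cutting the minimum spanning tree). Naive O(m^2) pairs.
from itertools import combinations

def single_link(points, n_clusters):
    """Merge closest clusters until n_clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, sorted by squared Euclidean distance.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: sum((a - b) ** 2
                          for a, b in zip(points[e[0]], points[e[1]])),
    )
    clusters = len(points)
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # merge the two nearest clusters
            parent[ri] = rj
            clusters -= 1
            if clusters == n_clusters:
                break
    return [find(i) for i in range(len(points))]
```

Cutting the merge sequence early corresponds to cutting the dendrogram at a chosen level, as the source abstract describes.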
Article
Full-text available
Our paper presents a novel approach to pattern classification. A general disadvantage of traditional classifiers is the discrepancy between their behaviour and optimal parameter settings during training on a given pattern set and during the subsequent cross-validation. We introduce the term critical sensitivity, meaning the lowest sensitivity reached for any individual class. This approach ensures uniform classification quality across individual classes and therefore prevents outlier classes with very poor results. We focus on the evaluation of critical sensitivity as a quality criterion. Our proposed classifier eliminates this disadvantage in many cases. Our aim is to show that easily formed hidden classes can contribute significantly to improving the quality of a classifier. We therefore gave the proposed classifier a relatively simple structure, consisting of three layers. The first layer is linear and is used for dimensionality reduction. The second layer performs clustering and forms the hidden classes. The third is the output layer for optimal cluster union. To verify the results of the proposed system, we use standard datasets. Cross-validation performed on these datasets showed that our critical-sensitivity-based classifier provides sensitivity comparable to reference classifiers.
... Parallelizing different categories of clustering algorithms for distributed memory clusters using MPI and OpenMP is an active research area [17,18,32,37]. General-purpose distributed computing frameworks such as MapReduce [11], its variants [12], and Spark [43] provide a high level of programming abstraction while implicitly supporting data parallel execution and fault tolerance. ...
... Iterative refinement pattern (Code 1, lines 11-28; Code 2, lines 13-36): it can be implemented using either a while loop or a repeat-until loop. ...
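The iterative-refinement pattern mentioned in this snippet can be sketched generically: repeat a step until the model stops changing or an iteration cap is hit. The helper `iterative_refinement` below is a hypothetical illustration of the while-loop form, not code from the cited DSL:

```python
# Hypothetical sketch of the iterative-refinement domain pattern:
# apply `step` until a fixed point (model no longer changes) or the
# iteration cap is reached. K-means and EM both follow this shape.
def iterative_refinement(model, step, max_iters=100):
    """Refine `model` with `step` until convergence or max_iters."""
    for _ in range(max_iters):
        new_model = step(model)
        if new_model == model:   # convergence test
            return new_model
        model = new_model
    return model
```

A repeat-until variant would simply run `step` once before testing convergence; the fixed-point structure is the same.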
... Table 2 shows various hash maps that will be used during code generation of the K-means and EM codes. Corresponding to each such hash map on a slave, code to gather the hash maps from all slave processes, to merge them based on keys, and to assign the result to the original entity is added to the master code (lines 11-20). ...
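As an illustration of the key-wise merge the generated master code performs, here is a hypothetical Python sketch (the real system emits MPI C++; the name `merge_partial_maps` and the (sum, count) value layout are our assumptions, loosely modeled on the partial statistics a K-means slave would contribute per cluster):

```python
# Hypothetical sketch of the gather-and-merge step: each slave
# contributes a partial hash map (e.g., per-cluster coordinate sums and
# point counts); the master merges the partials key by key.
def merge_partial_maps(partials):
    """Key-wise merge of per-slave maps of (sum, count) pairs."""
    merged = {}
    for part in partials:
        for key, (s, c) in part.items():
            acc_s, acc_c = merged.get(key, (0.0, 0))
            merged[key] = (acc_s + s, acc_c + c)  # combine partial stats
    return merged
```

In the generated MPI code this merge would follow a gather of the serialized maps onto the master process.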
Article
Full-text available
Ease of programming and optimal parallel performance have historically been on opposite sides of a trade-off, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the trade-off. We observed that several clustering algorithms often share common traits; in particular, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observations on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. The compiler generates either MPI C++ code for distributed memory parallel processing or MPI-OpenMP C++ code for hybrid memory parallel processing, depending upon the target architecture. Our experiments on different state-of-the-art parallelization frameworks show that our system can achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community. Results are presented for both distributed and hybrid memory systems.
... Multi-dimensional indexing structures are extensively used for various data mining tasks such as clustering [1-4] and classification [5,6]. Various such algorithms include density-based clustering algorithms (DBSCAN [7] and OPTICS [8,9]), hierarchical clustering algorithms (S-LINK [1], C-LINK [1], Average LINK [1]), and k-nearest neighbor (k-NN) classifiers [5,6]. ...
... Grid-R-tree has been used to reengineer clustering algorithms like DBSCAN and SLINK from their basics [2, 35-37]. These reengineered versions work at the grid level, with operations over cells as well as over points rather than only over points as in the native algorithms, and give exactly the same clustering results as their native versions. ...
Article
Full-text available
The use of multi-dimensional indexing structures has gained a lot of attention in data mining. The most commonly used data structures for indexing data are the R-tree and its variants, the quad-tree, the k-d-tree, etc. These data structures support region queries (point, window and neighborhood queries) and nearest neighbor queries, which are extensively used in data mining algorithms. Although these data structures facilitate execution of the above queries in logarithmic time, the constraints associated with them become a bottleneck in query execution when they are used for large, high-dimensional datasets. Moreover, these indexing structures do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple yet effective adaptation of the R-tree using the concept of a grid. We also introduce a new query over the Grid-R-tree, called the cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms and enables us to redesign them for improved efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree in terms of neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as the k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH).
Additionally, an adaptive grid optimization has been applied to dense cells that index more data points than a threshold \(\tau \), to keep the load evenly distributed across cells; this resulted in more efficient query performance for datasets with a skewed distribution of data points.
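The adaptive-grid idea above, splitting any cell that indexes more than τ points, can be sketched as a recursive quadtree-style subdivision. This 2-D Python sketch is hypothetical (`adaptive_cells` is our name, and the paper's actual grid refinement may differ); it assumes distinct points so the recursion terminates:

```python
# Hypothetical sketch of adaptive grid refinement: any 2-D cell holding
# more than tau points is split into four quadrants, recursively, so a
# skewed point distribution still yields balanced cells.
def adaptive_cells(points, lo, hi, tau):
    """Return a list of (lo, hi, points) cells, each with <= tau points."""
    if len(points) <= tau:
        return [(lo, hi, points)]
    mx, my = (lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2
    quads = {0: [], 1: [], 2: [], 3: []}
    for p in points:
        quads[(p[0] >= mx) + 2 * (p[1] >= my)].append(p)
    bounds = {
        0: (lo, (mx, my)),              # lower-left quadrant
        1: ((mx, lo[1]), (hi[0], my)),  # lower-right
        2: ((lo[0], my), (mx, hi[1])),  # upper-left
        3: ((mx, my), hi),              # upper-right
    }
    cells = []
    for q, pts in quads.items():
        cells.extend(adaptive_cells(pts, bounds[q][0], bounds[q][1], tau))
    return cells
```

Every leaf cell then holds at most τ points, which is the load-balance property the optimization targets.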
Article
The surge in data sizes in fluid processing applications necessitates partitioning the data into clusters and studying their representatives instead of studying each voxel data point. In addition, the dynamic nature of these data poses further challenges. Under such circumstances, it becomes essential to develop an approach that can handle the delta data with minimal updates to the underlying data structure, without processing the complete data from scratch on every update. However, this poses synchronization challenges in parallelization. In this article, we propose SLCoDD (single-linkage clustering of dynamic data), a geometric distance based dynamic clustering and its multi-core parallelization using OpenMP. To improve efficiency, SLCoDD exploits geometric properties of the bounding squares. We illustrate trade-offs in various ways of performing point additions to clusters, point deletions, and their batched versions. Using a suite of large inputs, we demonstrate the effectiveness of SLCoDD. SLCoDD's fully dynamic version achieves a substantial geomean speedup of 8.49× over the static parallel version and of 5× over the dynamic sequential version.
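One geometric property that bounding-square approaches of this kind can exploit is that the gap between two axis-aligned boxes lower-bounds the single-link distance between the clusters they enclose, so distant cluster pairs can be pruned without comparing their points. A hypothetical 2-D sketch (`box_gap` is our name; SLCoDD's actual pruning tests may differ):

```python
# Hypothetical sketch: minimum distance between two axis-aligned
# bounding boxes. If this gap already exceeds a merge threshold, no
# point pair across the two clusters needs to be examined.
def box_gap(a_lo, a_hi, b_lo, b_hi):
    """Gap between boxes [a_lo, a_hi] and [b_lo, b_hi]; 0 if they overlap."""
    dx = max(b_lo[0] - a_hi[0], a_lo[0] - b_hi[0], 0.0)
    dy = max(b_lo[1] - a_hi[1], a_lo[1] - b_hi[1], 0.0)
    return (dx * dx + dy * dy) ** 0.5
```

Because the gap is a lower bound on every inter-cluster point distance, pruning with it never changes the single-linkage result.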