An example of Definition 4.9

Source publication
Preprint
Full-text available
Constructing small-sized coresets for various clustering problems has attracted significant attention recently. We provide efficient coreset construction algorithms for $(k, z)$-Clustering with improved coreset sizes in several metric spaces. In particular, we provide an $\tilde{O}_z(k^{(2z+2)/(z+2)}\varepsilon^{-2})$-sized coreset for $(k, z)$-Clu...

Context in source publication

Context 1
... the above definition, for a main group $G \in \mathcal{G}^{(m)}(j)$ and a given $k$-center set $C \in \mathcal{X}^k$, $H(G, C)$ and all the $R(G, C, \beta)$ (for $0 \le \beta \le 2 + \log \varepsilon^{-1}$) are disjoint, and their union is exactly $G$; see Figure 2 for an example. Intuitively, for all points $p \in R(G, C, \beta)$, the ratios $d(p, C)/d(p, A)$ are "close", which is an important property for our coreset construction. ...
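The ring decomposition described in this snippet can be sketched as follows. This is a hypothetical illustration only: the function and variable names (`partition_group`, `ratio`, `beta_max`) and the exact bucketing thresholds are assumptions, not the paper's definition; the point is that each point lands in exactly one of $H$ or some $R(G, C, \beta)$ according to its ratio $d(p, C)/d(p, A)$.

```python
import math

def partition_group(points, ratio, eps, beta_max=None):
    """Partition a main group G into H and rings R_beta by the ratio
    d(p, C) / d(p, A).  A sketch assuming rings double in ratio and
    beta ranges over 0 .. 2 + log(1/eps); exact thresholds may differ."""
    if beta_max is None:
        beta_max = math.ceil(2 + math.log2(1 / eps))
    H = []                                   # points with ratio below eps
    rings = {b: [] for b in range(beta_max + 1)}
    for p in points:
        r = ratio(p)                         # ratio d(p, C) / d(p, A)
        if r < eps:
            H.append(p)
        else:
            # bucket index grows logarithmically with the ratio
            beta = min(int(math.log2(r / eps)), beta_max)
            rings[beta].append(p)
    return H, rings
```

Because every point is placed in exactly one bucket, $H$ and the rings are disjoint and their union is the whole group, matching the property stated above.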

Similar publications

Article
Full-text available
Unlabelled: Embedding knowledge graphs into low-dimensional spaces is a popular method for applying approaches, such as link prediction or node classification, to these databases. This embedding process is very costly in terms of both computational time and space. Part of the reason for this is the optimisation of hyperparameters, which involves r...

Citations

Article
We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points -- while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time -- within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available at https://github.com/Andrew-Draganov/Fast-Coreset-Generation and has scripts to recreate the experiments.
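The sensitivity-sampling idea this abstract refers to can be illustrated with a generic sketch (not the authors' algorithm; the function name `sensitivity_sample` and the particular sensitivity upper bound are assumptions): each point is sampled with probability proportional to a bound on its sensitivity, and each sample is reweighted by the inverse of its sampling probability so that the weighted clustering cost is unbiased in expectation.

```python
import random

def sensitivity_sample(points, centers, m, seed=0):
    """Generic sensitivity-sampling sketch for k-means coresets.
    `centers` plays the role of a rough (bicriteria) solution used
    to bound sensitivities; m points are drawn with replacement."""
    rng = random.Random(seed)

    def d2(p, c):
        # squared Euclidean distance between point p and center c
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    cost = [min(d2(p, c) for c in centers) for p in points]
    total = sum(cost) or 1.0
    n = len(points)
    # crude sensitivity upper bound: cost share plus a uniform term
    s = [cost[i] / total + 1.0 / n for i in range(n)]
    S = sum(s)
    prob = [si / S for si in s]
    idx = rng.choices(range(n), weights=prob, k=m)
    # weight each sample by 1/(m * prob) for an unbiased cost estimate
    return [(points[i], 1.0 / (m * prob[i])) for i in idx]
```

Real coreset constructions pair such a sampler with a carefully chosen sample size m and a sharper sensitivity bound to get the (1 ± ε) guarantee; the sketch only shows the sampling-and-reweighting mechanics.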
Conference Paper
Coresets are among the most successful compression paradigms. For clustering, a coreset B of a point set A preserves the clustering cost for any candidate solution C. In general, we are interested in finding a B that is as small as possible. In this overview, we will survey techniques for constructing coresets for clustering problems, their applications, and potential future directions.
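The cost-preservation property described here -- a weighted set B approximates the clustering cost of A for every candidate solution C -- can be checked empirically on a finite list of candidate solutions. A minimal sketch (the names `kmeans_cost` and `is_coreset` are assumptions; the formal definition quantifies over all solutions, not a finite list):

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum over p of w(p) * min_c ||p - c||^2."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )

def is_coreset(A, B, B_weights, candidate_solutions, eps):
    """Check (1 +/- eps) cost preservation of B against A for each
    candidate center set C in a finite list (a sketch of the property)."""
    for C in candidate_solutions:
        full = kmeans_cost(A, C)
        core = kmeans_cost(B, C, B_weights)
        if abs(core - full) > eps * full:
            return False
    return True
```

For example, collapsing duplicate points into one weighted representative preserves the cost exactly, whereas dropping the weights breaks the guarantee for some candidate solutions.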