An example of Definition 4.9

Source publication
Preprint
Full-text available
Constructing small-sized coresets for various clustering problems has attracted significant attention recently. We provide efficient coreset construction algorithms for $(k, z)$-Clustering with improved coreset sizes in several metric spaces. In particular, we provide an $\tilde{O}_z(k^{(2z+2)/(z+2)}\varepsilon^{-2})$-sized coreset for $(k, z)$-Clu...

Context in source publication

Context 1
... the above definition, for a main group $G \in \mathcal{G}^{(m)}(j)$ and a given $k$-center set $C \in \mathcal{X}^k$, $H(G, C)$ and all the $R(G, C, \beta)$ (for $0 \le \beta \le 2 + \log \varepsilon^{-1}$) are disjoint, and their union is exactly $G$; see Figure 2 for an example. Intuitively, for all points $p \in R(G, C, \beta)$, the ratios $d(p, C)/d(p, A)$ are "close", which is an important property for our coreset construction. ...
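The ring decomposition described in this snippet can be sketched as follows. This is a hypothetical illustration only: the function and variable names (`partition_group`, `ratio`, `beta_max`) and the exact bucketing thresholds are assumptions, not the paper's definition; the point is that each point lands in exactly one of $H$ or some $R(G, C, \beta)$ according to its ratio $d(p, C)/d(p, A)$.

```python
import math

def partition_group(points, ratio, eps, beta_max=None):
    """Partition a main group G into H and rings R_beta by the ratio
    d(p, C) / d(p, A).  A sketch assuming rings double in ratio and
    beta ranges over 0 .. 2 + log(1/eps); exact thresholds may differ."""
    if beta_max is None:
        beta_max = math.ceil(2 + math.log2(1 / eps))
    H = []                                   # points with ratio below eps
    rings = {b: [] for b in range(beta_max + 1)}
    for p in points:
        r = ratio(p)                         # ratio d(p, C) / d(p, A)
        if r < eps:
            H.append(p)
        else:
            # bucket index grows logarithmically with the ratio
            beta = min(int(math.log2(r / eps)), beta_max)
            rings[beta].append(p)
    return H, rings
```

Because every point is placed in exactly one bucket, $H$ and the rings are disjoint and their union is the whole group, matching the property stated above.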

Similar publications

Article
Full-text available
Unlabelled: Embedding knowledge graphs into low-dimensional spaces is a popular method for applying approaches, such as link prediction or node classification, to these databases. This embedding process is very costly in terms of both computational time and space. Part of the reason for this is the optimisation of hyperparameters, which involves r...

Citations

Article
We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points -- while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time -- within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available at https://github.com/Andrew-Draganov/Fast-Coreset-Generation and has scripts to recreate the experiments.
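The sensitivity-sampling idea this abstract refers to can be illustrated with a generic sketch (not the authors' algorithm; the function name `sensitivity_sample` and the particular sensitivity upper bound are assumptions): each point is sampled with probability proportional to a bound on its sensitivity, and each sample is reweighted by the inverse of its sampling probability so that the weighted clustering cost is unbiased in expectation.

```python
import random

def sensitivity_sample(points, centers, m, seed=0):
    """Generic sensitivity-sampling sketch for k-means coresets.
    `centers` plays the role of a rough (bicriteria) solution used
    to bound sensitivities; m points are drawn with replacement."""
    rng = random.Random(seed)

    def d2(p, c):
        # squared Euclidean distance between point p and center c
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    cost = [min(d2(p, c) for c in centers) for p in points]
    total = sum(cost) or 1.0
    n = len(points)
    # crude sensitivity upper bound: cost share plus a uniform term
    s = [cost[i] / total + 1.0 / n for i in range(n)]
    S = sum(s)
    prob = [si / S for si in s]
    idx = rng.choices(range(n), weights=prob, k=m)
    # weight each sample by 1/(m * prob) for an unbiased cost estimate
    return [(points[i], 1.0 / (m * prob[i])) for i in idx]
```

Real coreset constructions pair such a sampler with a carefully chosen sample size m and a sharper sensitivity bound to get the (1 ± ε) guarantee; the sketch only shows the sampling-and-reweighting mechanics.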
Conference Paper
Coresets are among the most successful compression paradigms. For clustering, a coreset B of a point set A preserves the clustering cost for any candidate solution C. In general, we are interested in finding a B that is as small as possible. In this overview, we will survey techniques for constructing coresets for clustering problems, their applications, and potential future directions.
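The cost-preservation property described here -- a weighted set B approximates the clustering cost of A for every candidate solution C -- can be checked empirically on a finite list of candidate solutions. A minimal sketch (the names `kmeans_cost` and `is_coreset` are assumptions; the formal definition quantifies over all solutions, not a finite list):

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum over p of w(p) * min_c ||p - c||^2."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )

def is_coreset(A, B, B_weights, candidate_solutions, eps):
    """Check (1 +/- eps) cost preservation of B against A for each
    candidate center set C in a finite list (a sketch of the property)."""
    for C in candidate_solutions:
        full = kmeans_cost(A, C)
        core = kmeans_cost(B, C, B_weights)
        if abs(core - full) > eps * full:
            return False
    return True
```

For example, collapsing duplicate points into one weighted representative preserves the cost exactly, whereas dropping the weights breaks the guarantee for some candidate solutions.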