Too Much Information Kills Information:
A Clustering Perspective
Yicheng Xu¹, Vincent Chau¹, Chenchen Wu², Yong Zhang¹, Vassilis Zissimopoulos³, Yifei Zou⁴
¹ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, P.R. China. {yc.xu, vincentchau, zhangyong}@siat.ac.cn
² Tianjin University of Technology, P.R. China. wu chenchen tjut@163.com
³ National and Kapodistrian University of Athens, Greece. vassilis@di.uoa.gr
⁴ The University of Hong Kong, P.R. China. yfzou@cs.hku.hk
Abstract—Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks, which include the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in this subset only. Under certain assumptions, the resulting clustering provably estimates the optimum of the variance-based objective well with high probability. Extensive experiments on synthetic and real-world datasets show that, to obtain results competitive with the k-means method (Lloyd 1982) and the k-means++ method (Arthur and Vassilvitskii 2007), we need only 7% of the information in the dataset. If we have 15% of the information or more, then our algorithm outperforms both the k-means method and the k-means++ method in at least 80% of the clustering tasks, in terms of clustering quality. An extended algorithm based on the same idea also guarantees a balanced k-clustering result.
INTRODUCTION
Cluster analysis is a subarea of machine learning that studies methods for the unsupervised discovery of homogeneous subsets of data instances in heterogeneous datasets. Methods of cluster analysis have been successfully applied in a wide spectrum of areas such as image processing, information retrieval, text mining and cybersecurity. Cluster analysis has a rich history in disciplines such as biology, psychology, archaeology, psychiatry, geology and geography, and there is an increasing interest in the use of clustering methods in fast-growing fields such as natural language processing, recommender systems, and image and video processing. The importance and interdisciplinary nature of clustering is evident from its vast literature.
The goal of variance-based k-clustering is to find a partition of a given dataset into k clusters so as to minimize the sum of the within-cluster variances. The well-known k-means is a variance-based clustering that defines the within-cluster variance as the sum of squared distances from each data point to the mean of the cluster it belongs to. The folklore k-means method [15], also known as Lloyd's algorithm, is still one of the top ten popular data mining algorithms and is implemented as a standard clustering method in most machine learning libraries, according to [16]. To overcome its high sensitivity to proper initialization, [2] propose the k-means++ method, which augments the k-means method with a careful randomized seeding preprocessing. The k-means++ method is proved to be O(log k)-competitive with the optimal clustering, and the analysis is tight. Even though it is easy to implement, k-means++ has to make a full pass through the dataset for every single pick of the seeds, which leads to a high complexity. [5] drastically reduce the number of passes needed to obtain, in parallel, a good initialization. The proposed k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and in practice a constant number of passes suffices. Following this path, there are several speed-ups and hybrid methods. For example, [4] replace the seeding method in k-means++ with a substantially faster approximation based on Markov chain Monte Carlo sampling. The proposed method retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. A simple combination of k-means++ with a local search strategy achieves a constant approximation guarantee in expectation and is more competitive in practice [11]. Furthermore, the number of local search steps can be dramatically reduced from O(k log log k) to k while maintaining the constant performance guarantee [6].
A balanced clustering result is often required in a variety of applications. However, many existing clustering algorithms achieve good clustering performance yet fail to produce balanced clusters. Balanced clustering, which imposes size constraints on the resulting clusters, is at least APX-hard in general under the assumption P ≠ NP [3]. It has attracted research interest from both the approximation and the heuristic perspectives. Heuristically, [14] apply the method of augmented Lagrange multipliers to minimize the least-squares linear regression in order to regularize the clustering model. The proposed approach not only produces good clustering performance but also guarantees a balanced clustering result. To achieve more accurate clustering for large-scale datasets, exclusive lasso on k-means and min-cut are leveraged to regulate the balance degree of the clustering results. By optimizing objective functions built atop the exclusive lasso, one can make the clustering result as balanced as possible [12]. Recently, [13] introduce a balance regularization term into the k-means objective and, by replacing the assignment step of the k-means method with a simplex algorithm, give a fast algorithm for soft-balanced clustering; the hard-balanced requirement can be satisfied by enlarging the multiplier in the regularization term. There are also algorithmic results for balanced k-clustering tasks with provable performance guarantees. The first constant approximation algorithm for variance-based hard-balanced clustering is a (69 + ε)-approximation in FPT time [17]. The approximation ratio was subsequently improved to 7 + ε [8] and then to 1 + ε [7] with the same asymptotic running time.
Our contributions. In this paper, we propose a simple but novel algorithm based on random sampling that computes provably good k-clustering results for variance-based clustering tasks. An extended version based on the same idea is valid for balanced k-clustering tasks with hard size constraints. We cross-compare the proposed Random Sampling method with the k-means method and the k-means++ method on both synthetic and real-world datasets. The numerical results show that our method is competitive with the k-means method and the k-means++ method with a sampling size of only 7% of the dataset. When the sampling size reaches 15% or higher, the Random Sampling method outperforms both the k-means method and the k-means++ method in at least 80% of the rounds of the clustering tasks.
The remainder of the paper is organized as follows. In
the Warm-up section, we mainly provide some preliminaries
toward a better understanding of the proposed algorithm. In the
Random Sampling section, we present the main algorithm
and the analysis. After that, we provide the performance of
the proposed algorithm on different datasets in the Numerical
Results section. Then we extend the proposed algorithm to deal
with the balanced clustering tasks in the Extension section.
In the last section, we discuss the advantages as well as
disadvantages of the proposed algorithm, and some promising
areas where our algorithm has the potential to outperform
existing clustering methods.
WARM-UP
Variance-Based k-Clustering
Roughly speaking, clustering tasks seek an organization of a collection of patterns into clusters based on similarity, such that patterns within a cluster are very similar while patterns from different clusters are highly dissimilar. One way to measure the similarity is the so-called variance-based objective function, which leverages the squared distances between patterns and the centroid of the cluster they belong to.
A well-known variance-based clustering task is the k-means clustering, a method of vector quantization that originally arose in signal processing and aims to partition n real vectors (e.g., quantized colors) into k clusters so as to minimize the within-cluster variances. What makes the k-means clustering different from other variance-based k-clusterings is the way it measures similarity: k-means defines the similarity between vectors as the squared Euclidean distance between them. For simplicity, we mainly take k-means as the running example in the later discussion, but most of the results carry over to general variance-based k-clustering tasks.
The k-means clustering can be formally described as follows. Given are a dataset $X=\{x_1, x_2, \ldots, x_n\}$ and an integer $k$, where each data point in $X$ is a $d$-dimensional real vector. The objective is to partition $X$ into $k$ ($\le n$) disjoint subsets so as to minimize the total within-cluster sum of squared distances (or variances). For a fixed finite data set $A \subseteq \mathbb{R}^d$, the centroid (also known as the mean) of $A$ is denoted by $c(A) := \sum_{x\in A} x/|A|$. Therefore, the objective of the k-means clustering is to find a partition $\{X_1, X_2, \ldots, X_k\}$ of $X$ such that the following is minimized:
$$\sum_{i=1}^{k}\sum_{x\in X_i}\|x-c(X_i)\|^2,$$
where $\|a-b\|$ denotes the Euclidean distance between vectors $a$ and $b$.
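For illustration, this objective can be evaluated for a given partition with a few lines of NumPy (the toy data and labels below are arbitrary):

```python
import numpy as np

def variance_objective(X, labels, k):
    """Sum of within-cluster squared distances to the cluster centroids."""
    total = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster) > 0:
            total += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return total

# toy example: six points in the plane, two clusters
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(variance_objective(X, labels, k=2))  # 4/3 + 4/3 = 2.666...
```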
Also, we will extend our result to the more general scenario of balanced clustering, where capacity constraints must be satisfied. For balanced k-clustering, the only difference is an additional global constraint on the size of the clusters. Both lower-bound and upper-bound constraints are considered in this paper. Based on the above, the balanced k-means can be described as finding a partition $\{X_i\}_{1\le i\le k}$ of $X$ so as to minimize the aforementioned k-means objective subject to
$$l \le |X_i| \le u, \quad \text{for all } 1 \le i \le k.$$
Obviously, by taking appropriate values for $l$ and $u$, we recover the k-means clustering. Thus, it is at least as difficult to obtain an optimal balanced k-means clustering.
Voronoi Diagram and Centroid Lemma
Solving the optimal k-means clustering for an arbitrary dataset is NP-hard. However, Lloyd proposes a fast local-search-based heuristic for k-means clustering, also known as the k-means method. A survey of data mining techniques states that it is by far the most popular clustering algorithm used in scientific and industrial applications. The k-means method is carried out through iterative Voronoi Diagram construction, combined with centroid adjustment according to the Centroid Lemma.
A Voronoi Diagram is a partition of a space into regions, each close to one of a given set of centers. Formally, given centers $C=\{c_1, c_2, \ldots, c_k\}$ in $\mathbb{R}^d$, the Voronoi Diagram w.r.t. (with respect to) $C$ consists of the Voronoi cells defined for $i=1,2,\ldots,k$ as
$$\mathrm{Cell}(i) = \{x \in \mathbb{R}^d : d(x, c_i) \le d(x, c_j) \text{ for all } j \neq i\}.$$
See Figure 1 for examples of Voronoi Diagrams in the plane. Obviously, any Voronoi Diagram $\Pi$ of $\mathbb{R}^d$ gives a feasible partition of any set $X \subseteq \mathbb{R}^d$ (ties broken arbitrarily), which is called the Voronoi Partition of $X$ w.r.t. $\Pi$. More precisely, the Voronoi Partition of $X$ is given by $\{X_i\}_{1\le i\le k}$, where $X_i = X \cap \mathrm{Cell}(i)$.
Fig. 1. Examples of Voronoi diagram in the plane
On the other hand, given $X \subseteq \mathbb{R}^d$, it holds for any $v \in \mathbb{R}^d$ that
$$\sum_{x\in X}\|x-v\|^2 = \sum_{x\in X}\|x-c(X)\|^2 + |X|\cdot\|c(X)-v\|^2,$$
which is the so-called Centroid Lemma. For an example application of the Centroid Lemma, see [10]. Note that the Centroid Lemma implies that the centroid (mean) of a cluster is the minimizer of the within-cluster variance.
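The identity can be verified numerically in a few lines (a quick check on randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))   # a finite point set in R^3
v = rng.normal(size=3)          # an arbitrary reference vector
c = X.mean(axis=0)              # the centroid c(X)

lhs = ((X - v) ** 2).sum()
rhs = ((X - c) ** 2).sum() + len(X) * ((c - v) ** 2).sum()
print(np.isclose(lhs, rhs))     # True: the Centroid Lemma holds exactly
```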
RANDOM SAMPLING
Given a dataset $X$, we say $S \subseteq X$ is a random sampling of $X$ if $S$ is obtained by several independent draws from $X$ uniformly at random. We show that the objective value of the variance-based k-clustering of $X$ can be estimated reasonably well using $S$. Before that, we introduce two basic facts on expectation and variance from probability theory. Given independent random variables $V_1$ and $V_2$, we have the following.
Fact 1: $E(aV_1+bV_2) = aE(V_1) + bE(V_2)$.
Fact 2: $\mathrm{var}(aV_1+bV_2) = a^2\,\mathrm{var}(V_1) + b^2\,\mathrm{var}(V_2)$.
Suppose $S$ is a random sampling of $X$ consisting of $m$ draws. Then $c(S)$ is an unbiased estimator of $c(X)$, and the squared Euclidean distance between them can be estimated by the following lemma.
Lemma 1: $E(c(S)) = c(X)$ and $E(\|c(S)-c(X)\|^2) = \frac{1}{m}\mathrm{var}(X)$, where $\mathrm{var}(X) := \frac{1}{|X|}\sum_{x\in X}\|x-c(X)\|^2$ is the variance of a single uniform draw from $X$.
Proof: Assume w.l.o.g. that $S=\{V_1, V_2, \ldots, V_m\}$ and recall that the $V_i$ are independent random variables. Based on Fact 1, it holds that
$$E(c(S)) = E\Big(\frac{1}{m}\sum_{i=1}^{m}V_i\Big) = \frac{1}{m}\sum_{i=1}^{m}E(V_i) = \frac{1}{m}\sum_{i=1}^{m}c(X) = c(X).$$
Then
$$E(\|c(S)-c(X)\|^2) = E(\|c(S)-E(c(S))\|^2) = \mathrm{var}(c(S)) = \mathrm{var}\Big(\frac{1}{m}\sum_{i=1}^{m}V_i\Big) = \frac{1}{m^2}\sum_{i=1}^{m}\mathrm{var}(V_i) = \frac{1}{m}\mathrm{var}(X),$$
where the second-to-last equality is derived from Fact 2.
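A quick Monte Carlo check of Lemma 1 on synthetic data (the sampling size m and the number of trials below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
c_X = X.mean(axis=0)
var_X = ((X - c_X) ** 2).sum() / len(X)     # variance of one uniform draw

m, trials = 20, 20000
errs = np.empty(trials)
for t in range(trials):
    S = X[rng.integers(0, len(X), size=m)]  # m independent uniform draws
    errs[t] = ((S.mean(axis=0) - c_X) ** 2).sum()

# the empirical mean of ||c(S) - c(X)||^2 should be close to var(X)/m
print(errs.mean(), var_X / m)
```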
Based on the above, we conclude that $c(S)$ is indeed a good estimate of $c(X)$. A natural idea follows: $\sum_{x\in X}\|x-c(S)\|^2$ is probably a good estimate of $\sum_{x\in X}\|x-c(X)\|^2$, as shown in the following lemma.
Lemma 2: With probability at least $1-\delta$,
$$\sum_{x\in X}\|x-c(S)\|^2 \le \Big(1 + \frac{1}{\delta m}\Big)\sum_{x\in X}\|x-c(X)\|^2.$$
Proof: From Lemma 1 and the Markov inequality we know that, with probability at least $1-\delta$,
$$\|c(S)-c(X)\|^2 \le \frac{1}{\delta m |X|}\sum_{x\in X}\|x-c(X)\|^2.$$
Recalling the Centroid Lemma, we immediately have, with probability at least $1-\delta$, that
$$\sum_{x\in X}\|x-c(S)\|^2 = \sum_{x\in X}\|x-c(X)\|^2 + |X|\cdot\|c(S)-c(X)\|^2 \le \Big(1 + \frac{1}{\delta m}\Big)\sum_{x\in X}\|x-c(X)\|^2,$$
completing the proof.
Consider the following randomized algorithm for the k-clustering task based on the random sampling idea, which we simply call Random Sampling. Given the sampling set $S$, we construct every k-clustering of $S$ by a brute-force search. Note that there are $O(m^{dk})$ many possibilities to consider due to [9], but we are allowed to do this because $S$ is much smaller than $X$. For each k-clustering of $S$, we divide the space $\mathbb{R}^d$ into $k$ Voronoi cells according to the centroids of the $k$ clusters of $S$. Subsequently, we obtain a feasible k-clustering of $X$ simply by grouping the data points in the same Voronoi cell together. Then we choose the best one among these possible results. The Random Sampling algorithm is provided as Algorithm 1.
Algorithm 1: Random Sampling for k-clustering tasks
Input: Dataset $X$, integer $k$;
Output: k-clustering of $X$.
1: Sample a subset $S$ by $m$ ($\ge k$) independent draws from $X$ uniformly at random;
2: for every k-clustering $\{S_i\}_{1\le i\le k}$ of $S$ do
3:   Compute the centroid set $C=\{c(S_i)\}_{1\le i\le k}$;
4:   Obtain $\{X_i\}_{1\le i\le k}$, the Voronoi Partition of $X$ w.r.t. the Voronoi Diagram generated by $C$;
5:   Compute the value $\sum_{i=1}^{k}\sum_{x\in X_i}\|x-c(X_i)\|^2$;
6: return $\{X_i\}_{1\le i\le k}$ with the minimum value.

Next, we estimate the value for each of the $k$ clusters of $X$. Let $\{X'_i\}_{1\le i\le k}$ be the output of the Random Sampling algorithm, from which we obtain the corresponding k-clustering $\{S'_i\}_{1\le i\le k}$ of the random sampling subset $S'$: since the centroid of each cluster in $\{X'_i\}_{1\le i\le k}$ defines a Voronoi cell of the space, we partition $S'$ into $k$ clusters according to these cells. Assume w.l.o.g. that $|S'_i| \le |S'_{i+1}|$ for $i = 1, 2, \ldots, k-1$. Suppose $\{X^*_i\}_{1\le i\le k}$ is the optimal solution, ordered such that $|X^*_i| \le |X^*_{i+1}|$ for $i = 1, 2, \ldots, k-1$.
Since $S'$ is obtained by $m$ independent draws from $X$, the size of each cluster in $\{S'_i\}_{1\le i\le k}$ is determined by independent Bernoulli trials and depends on the distribution of $|X^*_i|$ over all $i$. Thus it must be that $E(|S'_i|) = \frac{m}{n}E(|X^*_i|)$. We denote the distribution function of $|X^*_i|$ by $p(i) := \frac{|X^*_i|}{n}$ over all $i \in \{1, \ldots, k\}$. We call $X$ a $\mu$-balanced instance ($0 \le \mu \le 1$) if there exists an optimal k-clustering for $X$ such that all clusters have size at least $\mu|X|$. For example, if $p(1) \ge \mu$, then we call $X$ a $\mu$-balanced instance. Recall that $X^*_1$ is the smallest cluster in $\{X^*_i\}_{1\le i\le k}$. We obtain the following lemma.
Lemma 3: If $X$ is a $(\ln m/m)$-balanced instance, then for any small positive constant $\eta$, it holds with probability at least $1-m^{-\eta^2/2}$ that
$$|S'_i| \ge (1-\eta)\,m\,p(i)$$
for all $i = 1, \ldots, k$.
Proof: It is obvious that
$$E(|S'_i|) = \frac{m}{n}E(|X^*_i|) = m\,p(i).$$
We start the proof with $S'_1$, the smallest cluster in expectation. Consider $m$ rounds of the following Bernoulli trial: the outcome is 1 with probability $p(1)$ and 0 with probability $1-p(1)$. Let $B_1, B_2, \ldots, B_m$ be the independent random variables of the $m$ trials and let $B=\sum_{i=1}^{m}B_i$. Obviously $E(B) = m\,p(1)$, and from the Chernoff bound we have
$$\Pr[B < (1-\eta)\,m\,p(1)] < e^{-\frac{m p(1)\eta^2}{2}} \le e^{-\frac{\eta^2 \ln m}{2}} = m^{-\eta^2/2}.$$
Thus, with probability at least $1-m^{-\eta^2/2}$, it follows that
$$|S'_1| \ge (1-\eta)\,m\,p(1).$$
The same argument applies for $i = 2, \ldots, k$, since $p(i) \ge \ln m/m$ holds for all $i$. This completes the proof.
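The bound can be checked empirically for the smallest admissible value p(1) = ln m/m (an illustrative simulation; the values of m and η are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, eta = 200, 0.3
p1 = np.log(m) / m                       # smallest admissible p(1)
trials = rng.random((50000, m)) < p1     # m Bernoulli(p(1)) trials per round
B = trials.sum(axis=1)

# empirical frequency of the bad event vs. the Chernoff-type bound m^(-eta^2/2)
print((B < (1 - eta) * m * p1).mean(), m ** (-eta ** 2 / 2))
```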
By combining Lemmas 2 and 3, we conclude the following estimate for the Random Sampling algorithm.
Theorem 1: For any $(\ln m/m)$-balanced instance of a k-clustering task, Algorithm 1 returns a feasible solution that is, with probability at least $1-\delta-m^{-\eta^2/2}$, within a factor of
$$1 + \frac{1}{(1-\eta)\,\delta \ln m}$$
of the optimum.
Proof: Considering the objective value of the output of Algorithm 1, and using the Centroid Lemma, we have
$$\sum_{i=1}^{k}\sum_{x\in X'_i}\|x-c(X'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X'_i}\|x-c(S'_i)\|^2.$$
From line 4 of Algorithm 1, we know that the partition $\{X'_i\}_{1\le i\le k}$ is obtained from the Voronoi Diagram generated by $\{c(S'_i)\}_{1\le i\le k}$. That is to say, for any $x\in X'_i$ and an arbitrary $j\neq i$, it must be the case that
$$\|x-c(S'_i)\| \le \|x-c(S'_j)\|.$$
Summing over all $x$, we obtain
$$\sum_{i=1}^{k}\sum_{x\in X'_i}\|x-c(S'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(S'_i)\|^2.$$
The right-hand side corresponds to an assignment in which each $x$ is assigned to $c(S'_i)$ whenever $x\in X^*_i$ for some $i$: compared with the left-hand side, the cost of every $x\in X^*_i\cap X'_i$ is unchanged, while the cost of every $x\in X^*_i\cap X'_j$ with $j\neq i$ can only increase.
Applying Lemma 2 to every cluster in $\{X^*_i\}_{1\le i\le k}$, with probability at least $1-\delta$ it holds that
$$\sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(S'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X^*_i}\Big(1+\frac{1}{\delta|S'_i|}\Big)\|x-c(X^*_i)\|^2.$$
Combining this with Lemma 3, we obtain, with probability at least $(1-\delta)(1-m^{-\eta^2/2}) \ge 1-\delta-m^{-\eta^2/2}$, that
$$\sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(S'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X^*_i}\Big(1+\frac{1}{(1-\eta)\,\delta\, m\, p(i)}\Big)\|x-c(X^*_i)\|^2 \le \Big(1+\frac{1}{(1-\eta)\,\delta\ln m}\Big)\sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(X^*_i)\|^2,$$
where the last inequality follows from the assumption that $X$ is a $(\ln m/m)$-balanced instance. This completes the proof.
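For concreteness, the following is a minimal Python sketch of Algorithm 1 (an illustration only: the exhaustive enumeration over all k-labelings of the sample is exponential in m and is therefore restricted to very small sampling sets here).

```python
import numpy as np
from itertools import product

def random_sampling_clustering(X, k, m, rng):
    """Sketch of Algorithm 1: brute force over all k-clusterings of a sample S."""
    n = len(X)
    S = X[rng.integers(0, n, size=m)]               # m independent uniform draws
    best_labels, best_value = None, np.inf
    for assignment in product(range(k), repeat=m):  # every k-labeling of S
        assignment = np.asarray(assignment)
        if len(np.unique(assignment)) < k:          # skip labelings with empty parts
            continue
        C = np.array([S[assignment == i].mean(axis=0) for i in range(k)])
        # Voronoi partition of X w.r.t. the centroid set C
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # variance-based objective of the induced k-clustering of X
        value = sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
                    for i in range(k) if np.any(labels == i))
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) + 4 * rng.integers(0, 3, size=(300, 1))
labels, value = random_sampling_clustering(X, k=3, m=9, rng=rng)
print(value)
```

In practice one would enumerate only the Voronoi-induced k-clusterings of S, of which there are O(m^{dk}) [9], rather than all k^m labelings.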
NUMERICAL RESULTS
In this section, we evaluate the performance of the proposed RS (abbreviation for the Random Sampling algorithm), mainly through cross comparisons with the widely known KM (abbreviation for the k-means method) and KM++ (abbreviation for the k-means++ method) on the same datasets. The environment for the experiments is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 64GB memory. We conduct extensive numerical experiments to analyze the impact of different factors and parameter settings on the proposed algorithm. Since all algorithms are randomized, we run RS, KM and KM++ on 100 instances per setting and report the number of instances in which each algorithm hits the minimum objective value. We design the following experiments for different purposes.
1) Effect of n: We generate 100 instances for each $n \in \{100, 200, \ldots, 1000\}$ from a standard normal distribution, after which we run RS, KM and KM++ simultaneously on the same instance and record which of the three algorithms hits the minimum objective value. We fix $m=n/10$ and $k=3$ throughout these experiments; Figure 2 shows the numerical results.
Fig. 2. Effect of the size of the dataset n (x-axis: n; y-axis: number of instances hitting the minimum; series: KM, KM++, RS).
RS does not perform as well as KM or KM++ at the beginning because the sampling set is too small to represent the entire dataset. Taking $n=100$ as an example, a sampling set of size 10 is probably not a good estimate of the original dataset of 100 points. However, when $n$ increases to 700, a sampling set of size 70 seems good enough for RS to be competitive with KM and KM++. As $n$ grows further, RS performs increasingly well and tends to outperform both KM and KM++. Note that, for $n=100$ for example, the total number of instances in which any of the three algorithms hits the minimum exceeds 100. This is because, for smaller instances, it is more likely that more than one algorithm hits the minimum, in which case each of them is counted once in Figure 2.
2) Effect of k: We generate 100 instances from a standard normal distribution, after which we run RS, KM and KM++ simultaneously on the same instance for different k-clustering tasks with each $k \in \{2, 3, \ldots, 8\}$, and record which of the three algorithms hits the minimum objective value. We fix $n=100$ and $m=50$ throughout these experiments; Figure 3 shows the numerical results.
Fig. 3. Effect of the number of clusters k (x-axis: k; y-axis: number of instances hitting the minimum; series: KM, KM++, RS).
As shown, RS reaches its best performance for 2-clustering and its worst performance for 5-clustering. Overall, it is competitive with KM and KM++ under these settings.
3) Effect of m: We evaluate the performance of our algorithm on a real-world dataset. The Cloud dataset consists of 1024 points and represents the first cloud cover database available from the UC Irvine Machine Learning Repository. We run KM and KM++ on the Cloud dataset, along with RS for each sampling size $m \in \{25, 50, 75, \ldots, 200\}$. Since there is only one instance here, we run 100 rounds of each algorithm per setting and record which algorithm hits the minimum objective value in each round. Note that $n=1024$, and we fix $k=3$ throughout these experiments; Figure 4 shows the numerical results.
Fig. 4. Effect of the size of the sampling set m (x-axis: m; y-axis: number of rounds hitting the minimum; series: KM, KM++, RS).
As predicted, RS performs increasingly well as the sampling size gets larger. It is quite surprising that already at $m=75$ (only about 7% of the Cloud dataset), RS performs as well as KM++. When $m$ exceeds 100 (about 10% of the Cloud dataset), RS outperforms both KM and KM++. If $m$ reaches 150 (about 15% of the Cloud dataset) or higher, RS wins in at least 80% of the rounds of the clustering tasks.
EXTENSION TO BALANCED k-CLUSTERING
An additional important feature of the proposed Random Sampling algorithm is that it extends to handle balanced variance-based k-clustering tasks, which the k-means method and the k-means++ method cannot deal with. Both upper-bound and lower-bound constraints are considered, meaning that a feasible balanced k-clustering has a global lower bound $l$ and upper bound $u$ on the cluster sizes. We assume w.l.o.g. that $l$ and $u$ are positive integers. The main idea is a minimum-cost flow subroutine embedded into the Random Sampling algorithm.
To start, we introduce the well-known minimum-cost flow problem. Given a directed graph $G=(V,E)$, every edge $e\in E$ has a weight $c(e)$ representing the cost of sending a unit of flow along it. Also, every $e\in E$ is equipped with a bandwidth constraint: only flows of value at most $\mathrm{upper}(e)$ and at least $\mathrm{lower}(e)$ can pass through edge $e$, where $\mathrm{upper}(e)$ and $\mathrm{lower}(e)$ denote the upper and lower bounds on the bandwidth of $e$, respectively. Every node $v\in V$ has a demand $d(v)$, defined as the total outflow minus the total inflow. Thus a negative demand represents a need for flow and a positive one represents a supply.
A flow in $G$ is defined as a function from $E$ to $\mathbb{R}_+$. A feasible flow carrying $f$ units of flow in the graph requires a source $s$ and a sink $t$ with $d(s)=f$ and $d(t)=-f$. Every node $v\in V\setminus\{s,t\}$ must have $d(v)=0$, which means it is either an intermediate node or an idle node. The cost of flow $f$ is defined as $c(f)=\sum_{e\in E}f(e)\cdot c(e)$, where $f(\cdot): E\to\mathbb{R}_+$ is the function corresponding to flow $f$. The minimum-cost flow problem asks for a cheapest way (i.e., with the minimum cost) of sending a certain amount of flow through the graph $G$.
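As a toy illustration, the following NetworkX snippet solves a four-node instance; note that NetworkX's demand attribute uses the opposite sign convention to the text (negative for a supply node, positive for a node that needs flow):

```python
import networkx as nx

# Send 4 units from s to t at minimum cost.
G = nx.DiGraph()
G.add_node("s", demand=-4)   # supplies 4 units (NetworkX convention)
G.add_node("t", demand=4)    # requires 4 units
G.add_edge("s", "a", capacity=3, weight=2)
G.add_edge("s", "b", capacity=3, weight=3)
G.add_edge("a", "t", capacity=3, weight=0)
G.add_edge("b", "t", capacity=3, weight=0)

flow = nx.min_cost_flow(G)       # e.g. {'s': {'a': 3, 'b': 1}, ...}
print(nx.cost_of_flow(G, flow))  # 3*2 + 1*3 = 9
```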
To deal with the capacity constraints, we herein propose a Random Sampling based randomized algorithm that embeds a minimum-cost flow subroutine. Obviously, the Voronoi Diagram generated by the centroids of a k-clustering of the sampling set $S$ does not guarantee a feasible Voronoi Partition of $X$ satisfying the capacity constraints. Assume that we are given a k-clustering of $S$ and we look for a feasible balanced k-clustering of $X$.
Consider the following instance of the minimum-cost flow problem. Let $V$ be $X\cup C\cup\{s,t\}$, where $C$ consists of the centroids $\{c(S_i)\}_{1\le i\le k}$ obtained from the given k-clustering of $S$, and $s$ and $t$ are the dummy source and sink nodes, respectively. Let $E$ be $E_1\cup E_2\cup E_3$, where $E_1$ consists of the directed edges $(s,i)$ from $s$ to each $i\in X$, $E_2$ consists of the edges $(i,j)$ from each $i\in X$ to each $j\in C$, and $E_3$ consists of the edges $(j,t)$ from each $j\in C$ to $t$. Every edge in $E_1\cup E_2$ has bandwidth interval $[0,1]$, while every edge in $E_3$ has bandwidth interval $[l,u]$. Edges in $E_1\cup E_3$ are unweighted, and each edge $(i,j)\in E_2$ has weight $\|i-j\|^2$ for $i\in X$ and $j\in C$. See Figure 5 for an illustration.
Fig. 5. A minimum-cost flow instance.
As shown in the figure, the bandwidth intervals and the weights/costs are labeled on the edges. All the edges are oriented from the source to the sink, and we simply omit the direction labels. Inside the shadowed box is a complete bipartite graph, also known as a biclique, consisting of the vertices $X\cup C$ and the edges $E_2$. Consider a flow $f$ carrying $n$ ($n=|X|$) units of flow from the source to the sink in $G$ and suppose that the function $f: E\to\mathbb{R}_+$ reflects such a flow. Recall that $d(v) = \sum_{e\in\delta^+(v)}f(e) - \sum_{e\in\delta^-(v)}f(e)$, where $\delta^+(v)$ denotes the edges leading away from node $v$ and $\delta^-(v)$ denotes the edges leading into $v$. Then the following must hold.
Flow conservation:
$$d(v) = \begin{cases} n, & v = s;\\ -n, & v = t;\\ 0, & v \neq s,t.\end{cases}$$
Bandwidth constraints:
$$0 \le f(e) \le 1 \ \ \forall e\in E_1; \qquad 0 \le f(e) \le 1 \ \ \forall e\in E_2; \qquad l \le f(e) \le u \ \ \forall e\in E_3.$$
The minimum-cost flow problem then aims to find a function $f: E\to\mathbb{R}_+$ satisfying both the flow conservation and the bandwidth constraints so as to minimize its cost, i.e., $\sum_{e\in E}f(e)\cdot c(e)$. An important property of the minimum-cost flow problem is that basic feasible solutions are integer-valued whenever the capacity constraints and the quantity of flow produced at each node are integer-valued, as captured by the following lemma.
Lemma 4 ([1]): If the objective value of the minimum-cost flow problem is bounded from below on the feasible region and the problem has a feasible solution, then it has an optimal solution; moreover, if the capacity constraints and the quantities of flow are all integral, then the problem has at least one integral optimal solution.
The integral solution can be computed efficiently by Cycle Canceling algorithms, Successive Shortest Path algorithms, Out-of-Kilter algorithms and Linear Programming based algorithms. These algorithms can be found in many textbooks; see for example [1]. We take any one of these algorithms as the MCF (Minimum-Cost Flow) subroutine in our algorithm. We show the following theorem.
Theorem 2: The integral optimal solution to the above minimum-cost flow instance provides an optimal assignment from $X$ to $C$ for the balanced clustering task.
Proof: We only need to prove that any feasible assignment from the dataset $X$ to the given centroid set $C$ can be represented by a feasible integral flow in the aforementioned minimum-cost flow instance, and vice versa.
Let $\sigma: X\to C$ be a feasible assignment from $X$ to $C$. Consider the following flow $f: E\to\mathbb{R}_+$:
$$f(e) = \begin{cases} 1, & e\in E_1,\\ 1, & e\in E_2 \text{ and } \sigma(e_o)=e_d,\\ 0, & e\in E_2 \text{ and } \sigma(e_o)\neq e_d,\\ \sum_{e':\,e'_d=e_o} f(e'), & e\in E_3,\end{cases}$$
where we denote the origin and destination of edge $e$ by $e_o$ and $e_d$, respectively. Note that the quantity of $f$ is $n$. Obviously, $f$ satisfies the flow conservation, and every edge in $E_1$ and $E_2$ obeys the bandwidth constraints. For $e\in E_3$, from the construction we have $f(e) = \sum_{e':\,e'_d=e_o} f(e') = |\sigma^{-1}(e_o)|$. Since $\sigma$ is feasible, it must hold for every $j\in C$ that $l\le|\sigma^{-1}(j)|\le u$, which implies the feasibility of the bandwidth constraints for $E_3$.
On the other hand, given an integral feasible flow $f$, the corresponding assignment must be feasible, i.e., it satisfies the size constraints. Note that a feasible flow of quantity $n$ in the above instance must have $f(e)=1$ for every $e\in E_1$. Consider the following assignment $\sigma$: for any $i\in X$ and $j\in C$, $\sigma(i)=j$ if and only if the edge $e$ with $e_o=i$ and $e_d=j$ satisfies $f(e)=1$. The defined assignment must be feasible because $|\sigma^{-1}(j)| = \sum_{e\in\delta^-(j)}f(e) = \sum_{e\in\delta^+(j)}f(e)$ holds for any $j\in C$, and from the feasibility of the flow $f$ we know that $l \le \sum_{e\in\delta^+(j)}f(e) \le u$.
It is obvious that the cost of a feasible assignment and the cost of its corresponding flow are exactly the same, because
$$\sum_{e\in E}c(e)f(e) = \sum_{e\in E_2}c(e)f(e) = \sum_{e\in E_2:\,f(e)=1}\|e_o-e_d\|^2 = \sum_{i\in X}\sum_{j=\sigma(i)}\|i-j\|^2 = \sum_{x\in X}\|x-\sigma(x)\|^2 = \sum_{i=1}^{k}\sum_{x\in X_i}\|x-\sigma(x)\|^2,$$
where the first equality is derived from the construction and the last equality holds for any feasible partition of $X$, which we assume without loss of generality to be $\{X_i\}_{1\le i\le k}$. This completes the proof.
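As an illustrative sketch (one possible realization, not necessarily the implementation used by the authors), the assignment step can be solved via the transportation LP that is equivalent to the above flow instance; by the integrality property of Lemma 4, a vertex optimum is 0/1-valued, so the assignment can be read off directly:

```python
import numpy as np
from scipy.optimize import linprog

def balanced_assignment(X, C, l, u):
    """Assign each point of X to one of the k centers in C, with every center
    receiving between l and u points, minimizing total squared distance."""
    n, k = len(X), len(C)
    # cost of variable f_{ij}: squared distance from x_i to c_j (row-major order)
    cost = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).ravel()

    # each point is assigned to exactly one center: sum_j f_{ij} = 1
    A_eq = np.zeros((n, n * k))
    for i in range(n):
        A_eq[i, i * k:(i + 1) * k] = 1.0
    b_eq = np.ones(n)

    # cluster-size constraints: l <= sum_i f_{ij} <= u for every center j
    A_ub = np.zeros((2 * k, n * k))
    for j in range(k):
        A_ub[j, j::k] = 1.0       #  sum_i f_{ij} <=  u
        A_ub[k + j, j::k] = -1.0  # -sum_i f_{ij} <= -l
    b_ub = np.concatenate([np.full(k, u), np.full(k, -l)])

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n, k).argmax(axis=1)   # vertex optimum is 0/1

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 2))
C = np.array([[-1.0, 0.0], [1.0, 0.0]])
labels = balanced_assignment(X, C, l=8, u=12)
print(np.bincount(labels, minlength=2))          # both sizes lie in [8, 12]
```

A routine of this kind can stand in for the MCF subroutine invoked in line 4 of Algorithm 2 below.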
Based on the above, we conclude that an MCF subroutine embedded in the Random Sampling algorithm guarantees a valid solution for the balanced k-clustering problem. The pseudocode is provided as Algorithm 2.
Algorithm 2: Random Sampling for balanced k-clustering tasks
Input: Dataset $X$, integer $k$;
Output: k-clustering of $X$.
1: Sample a subset $S$ by $m$ ($\ge k$) independent draws from $X$ uniformly at random;
2: for every k-clustering $\{S_i\}_{1\le i\le k}$ of $S$ do
3:   Compute the centroid set $C=\{c(S_i)\}_{1\le i\le k}$;
4:   Obtain $\{X_i\}_{1\le i\le k}$ by the MCF subroutine;
5:   Compute the value $\sum_{i=1}^{k}\sum_{x\in X_i}\|x-c(X_i)\|^2$;
6: return $\{X_i\}_{1\le i\le k}$ with the minimum value.

DISCUSSION
We are incredibly well informed yet we know incredibly little, and this is what is happening in clustering tasks.
Our work implies that we do not need that much information about the dataset when doing clustering. Roughly speaking, the experiments show that, to obtain a clustering result competitive with the k-means method and the k-means++ method, we only need about 7% of the information in the dataset. For the remaining 93% of the data, we immediately make decisions with only O(k) additional computations. Note that the resources consumed by the algorithm are dominated by the brute-force search over the k-clusterings of the sampling set. If we have 15% of the information of the dataset or more, then with high probability our algorithm outperforms both the k-means method and the k-means++ method in terms of clustering quality. The above statements hold only when 1) the dataset is independent and identically distributed; 2) the sampling set is picked uniformly at random from the original dataset; and 3) most importantly, the dataset is large enough (experimentally, 500 data points or more suffice). As a cost, the proposed algorithm has a high complexity with respect to k, but fortunately it is not sensitive to the size of the dataset or the size of the sampling set.
We believe that the Random Sampling idea, as well as the framework of the analysis, has the potential to deal with incomplete datasets and online clustering tasks.
REFERENCES
[1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network
flows - theory, algorithms and applications. Prentice Hall, 1993.
[2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages
of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 1027–1035, 2007.
[3] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and
Ali Kemal Sinop. The hardness of approximation of euclidean k-means.
In International Symposium on Computational Geometry (SoCG), pages
754–767, 2015.
[4] Olivier Bachem, Mario Lucic, S Hamed Hassani, and Andreas Krause.
Approximate k-means++ in sublinear time. In AAAI Conference on
Artificial Intelligence (AAAI), pages 1459–1467, 2016.
[5] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and
Sergei Vassilvitskii. Scalable k-means++. In Very Large Data Bases
(VLDB), pages 622–633, 2012.
[6] Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. arXiv preprint arXiv:2002.07784, 2020.
[7] Vincent Cohen-Addad. Approximation schemes for capacitated clus-
tering in doubling metrics. In ACM-SIAM Symposium on Discrete
Algorithms (SODA), pages 2241–2259, 2020.
[8] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability
of capacitated clustering. In International Colloquium on Automata,
Languages, and Programming (ICALP), pages 1–14, 2019.
[9] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted
voronoi diagrams and randomization to variance-based k-clustering. In
International Symposium on Computational Geometry (SoCG), pages
332–339, 1994.
[10] Kamal Jain and Vijay V Vazirani. Approximation algorithms for metric
facility location and k-median problems using the primal-dual schema
and lagrangian relaxation. Journal of the ACM, 48(2):274–296, 2001.
[11] Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via
local search. In International Conference on Machine Learning (ICML),
pages 3662–3671, 2019.
[12] Zhihui Li, Feiping Nie, Xiaojun Chang, Zhigang Ma, and Yi Yang.
Balanced clustering via exclusive lasso: A pragmatic approach. In AAAI
Conference on Artificial Intelligence (AAAI), pages 3596–3603, 2018.
[13] Weibo Lin, Zhu He, and Mingyu Xiao. Balanced clustering: A uniform
model and fast algorithm. In International Joint Conference on Artificial
Intelligence (IJCAI), pages 2987–2993, 2019.
[14] Hanyang Liu, Junwei Han, Feiping Nie, and Xuelong Li. Balanced
clustering with least square regression. In AAAI Conference on Artificial
Intelligence (AAAI), pages 2231–2237, 2017.
[15] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[16] Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang
Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu,
S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and
information systems, 14(1):1–37, 2008.
[17] Yicheng Xu, Rolf H. Möhring, Dachuan Xu, Yong Zhang, and Yifei Zou. A constant FPT approximation algorithm for hard-capacitated k-means. Optimization and Engineering, pages 1–14, 2020.