Too Much Information Kills Information:
A Clustering Perspective
Yicheng Xu¹, Vincent Chau¹, Chenchen Wu², Yong Zhang¹, Vassilis Zissimopoulos³, Yifei Zou⁴
¹ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, P.R. China. {yc.xu, vincentchau, zhangyong}@siat.ac.cn
² Tianjin University of Technology, P.R. China. wu chenchen tjut@163.com
³ National and Kapodistrian University of Athens, Greece. vassilis@di.uoa.gr
⁴ The University of Hong Kong, P.R. China. yfzou@cs.hku.hk
Abstract—Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks, which include the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in this subset only. Under certain assumptions, the resulting clustering provably estimates the optimum of the variance-based objective well with high probability. Extensive experiments on synthetic and real-world datasets show that, to obtain results competitive with the k-means method (Lloyd 1982) and the k-means++ method (Arthur and Vassilvitskii 2007), we need only 7% of the information in the dataset. If we have 15% of the information or more, then our algorithm outperforms both the k-means method and the k-means++ method in at least 80% of the clustering tasks, in terms of clustering quality. An extended algorithm based on the same idea also guarantees a balanced k-clustering result.
INTRODUCTION
Cluster analysis is a subarea of machine learning that studies methods for the unsupervised discovery of homogeneous subsets of data instances in heterogeneous datasets. Methods of cluster analysis have been successfully applied in a wide spectrum of areas such as image processing, information retrieval, text mining and cybersecurity. Cluster analysis has a rich history in disciplines such as biology, psychology, archaeology, psychiatry, geology and geography, and there is an increasing interest in the use of clustering methods in fast-growing fields such as natural language processing, recommender systems, and image and video processing. The importance and interdisciplinary nature of clustering is evident from its vast literature.
The goal of variance-based k-clustering is to find a partition of a given dataset into k clusters so as to minimize the sum of the within-cluster variances. The well-known k-means is a variance-based clustering that defines the within-cluster variance as the sum of squared distances from each data point to the mean of the cluster it belongs to. The folklore k-means method [15], also known as Lloyd's algorithm, is still one of the top ten popular data mining algorithms and is implemented as a standard clustering method in most machine learning libraries, according to [16]. To overcome its high sensitivity to proper initialization, [2] propose the k-means++ method, which augments the k-means method with a careful randomized seeding preprocessing. The k-means++ method is proved to be O(log k)-competitive with the optimal clustering, and the analysis is tight. Even though it is easy to implement, k-means++ has to make a full pass through the dataset for every single pick of the seeds, which leads to a high complexity. [5] drastically reduce the number of passes needed to obtain, in parallel, a good initialization. The proposed k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and in practice a constant number of passes suffices. Following this path, there are several speed-ups and hybrid methods. For example, [4] replace the seeding method in k-means++ with a substantially faster approximation based on Markov chain Monte Carlo sampling. The proposed method retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. A simple combination of k-means++ with a local search strategy achieves a constant approximation guarantee in expectation and is more competitive in practice [11]. Furthermore, the number of local search steps can be dramatically reduced from O(k log log k) to k while maintaining the constant performance guarantee [6].
A balanced clustering result is often required in a variety of applications. However, many existing clustering algorithms achieve good clustering performance yet fail to produce balanced clusters. Balanced clustering, which imposes size constraints on the resulting clusters, is at least APX-hard in general under the assumption P ≠ NP [3]. It has attracted research interest from both the approximation and the heuristic perspectives. Heuristically, [14] apply the method of augmented Lagrange multipliers to minimize the least-squares linear regression in order to regularize the clustering model. The proposed approach not only produces good clustering performance but also guarantees a balanced clustering result. To achieve more accurate clustering for large-scale datasets, exclusive lasso on k-means and min-cut are leveraged to regulate the balance degree of the clustering results. By optimizing objective functions built atop the exclusive lasso, one can make the clustering result as balanced as possible [12]. Recently, [13] introduce a balance regularization term into the k-means objective and, by replacing the assignment step of the k-means method with a simplex algorithm, give a fast algorithm for soft-balanced clustering; the hard-balanced requirement can be satisfied by enlarging the multiplier in the regularization term. There are also algorithmic results for balanced k-clustering tasks with provable performance guarantees. The first constant approximation algorithm for variance-based hard-balanced clustering is a (69 + ε)-approximation in FPT time [17]. The approximation ratio was subsequently improved to 7 + ε [8] and then to 1 + ε [7] with the same asymptotic running time.
Our contributions. In this paper, we propose a simple but novel algorithm based on random sampling that computes provably good k-clustering results for variance-based clustering tasks. An extended version based on the same idea is valid for balanced k-clustering tasks with hard size constraints. We cross-compare the proposed Random Sampling method with the k-means method and the k-means++ method on both synthetic and real-world datasets. The numerical results show that our method is competitive with the k-means method and the k-means++ method with a sampling size of only 7% of the dataset. When the sampling size reaches 15% or higher, the Random Sampling method outperforms both the k-means method and the k-means++ method in at least 80% of the rounds of the clustering tasks.
The remainder of the paper is organized as follows. In
the Warm-up section, we mainly provide some preliminaries
toward a better understanding of the proposed algorithm. In the
Random Sampling section, we present the main algorithm
and the analysis. After that, we provide the performance of
the proposed algorithm on different datasets in the Numerical
Results section. Then we extend the proposed algorithm to deal
with the balanced clustering tasks in the Extension section.
In the last section, we discuss the advantages as well as
disadvantages of the proposed algorithm, and some promising
areas where our algorithm has the potential to outperform
existing clustering methods.
WARM-UP
Variance-Based k-Clustering
Roughly speaking, clustering tasks seek an organization of a collection of patterns into clusters based on similarity, such that patterns within a cluster are very similar while patterns from different clusters are highly dissimilar. One way to measure the similarity is the so-called variance-based objective function, which leverages the squared distances between patterns and the centroid of the cluster they belong to.
A well-known variance-based clustering task is the k-means clustering, a method of vector quantization that originally arose in signal processing and aims to partition n real vectors (e.g., quantized colors) into k clusters so as to minimize the within-cluster variances. What makes the k-means clustering different from other variance-based k-clusterings is the way it measures similarity: k-means defines the similarity between vectors as the squared Euclidean distance between them. For simplicity, we mainly take k-means as the running example in the later discussion, but most of the results carry over to general variance-based k-clustering tasks.
The k-means clustering can be formally described as follows. Given are a dataset $X=\{x_1, x_2, \ldots, x_n\}$ and an integer $k$, where each data point in $X$ is a $d$-dimensional real vector. The objective is to partition $X$ into $k$ ($\le n$) disjoint subsets so as to minimize the total within-cluster sum of squared distances (or variances). For a fixed finite data set $A \subseteq \mathbb{R}^d$, the centroid (also known as the mean) of $A$ is denoted by $c(A) := \sum_{x\in A} x/|A|$. Therefore, the objective of the k-means clustering is to find a partition $\{X_1, X_2, \ldots, X_k\}$ of $X$ such that the following is minimized:
$$\sum_{i=1}^{k}\sum_{x\in X_i}\|x-c(X_i)\|^2,$$
where $\|a-b\|$ denotes the Euclidean distance between vectors $a$ and $b$.
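For illustration, this objective can be evaluated for a given partition with a few lines of NumPy (the toy data and labels below are arbitrary):

```python
import numpy as np

def variance_objective(X, labels, k):
    """Sum of within-cluster squared distances to the cluster centroids."""
    total = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster) > 0:
            total += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return total

# toy example: six points in the plane, two clusters
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(variance_objective(X, labels, k=2))  # 4/3 + 4/3 = 2.666...
```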
Also, we will extend our result to the more general scenario of balanced clustering, where capacity constraints must be satisfied. For balanced k-clustering, the only difference is an additional global constraint on the size of the clusters. Both lower-bound and upper-bound constraints are considered in this paper. Based on the above, the balanced k-means can be described as finding a partition $\{X_i\}_{1\le i\le k}$ of $X$ so as to minimize the aforementioned k-means objective subject to
$$l \le |X_i| \le u, \quad \text{for all } 1 \le i \le k.$$
Obviously, by taking appropriate values for $l$ and $u$, we recover the k-means clustering. Thus, it is at least as difficult to obtain an optimal balanced k-means clustering.
Voronoi Diagram and Centroid Lemma
Solving the optimal k-means clustering for an arbitrary dataset is NP-hard. However, Lloyd proposes a fast local-search-based heuristic for k-means clustering, also known as the k-means method. A survey of data mining techniques states that it is by far the most popular clustering algorithm used in scientific and industrial applications. The k-means method is carried out through iterative Voronoi Diagram construction, combined with centroid adjustment according to the Centroid Lemma.
A Voronoi Diagram is a partition of a space into regions, each close to one of a given set of centers. Formally, given centers $C=\{c_1, c_2, \ldots, c_k\}$ in $\mathbb{R}^d$, the Voronoi Diagram w.r.t. (with respect to) $C$ consists of the Voronoi cells defined for $i=1,2,\ldots,k$ as
$$\mathrm{Cell}(i) = \{x \in \mathbb{R}^d : d(x, c_i) \le d(x, c_j) \text{ for all } j \neq i\}.$$
See Figure 1 for examples of Voronoi Diagrams in the plane. Obviously, any Voronoi Diagram $\Pi$ of $\mathbb{R}^d$ gives a feasible partition of any set $X \subseteq \mathbb{R}^d$ (ties broken arbitrarily), which is called the Voronoi Partition of $X$ w.r.t. $\Pi$. More precisely, the Voronoi Partition of $X$ is given by $\{X_i\}_{1\le i\le k}$, where $X_i = X \cap \mathrm{Cell}(i)$.
Fig. 1. Examples of Voronoi diagram in the plane
On the other hand, given $X \subseteq \mathbb{R}^d$, it holds for any $v \in \mathbb{R}^d$ that
$$\sum_{x\in X}\|x-v\|^2 = \sum_{x\in X}\|x-c(X)\|^2 + |X|\cdot\|c(X)-v\|^2,$$
which is the so-called Centroid Lemma. For an example application of the Centroid Lemma, see [10]. Note that the Centroid Lemma implies that the centroid (mean) of a cluster is the minimizer of the within-cluster variance.
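The identity can be verified numerically in a few lines (a quick check on randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))   # a finite point set in R^3
v = rng.normal(size=3)          # an arbitrary reference vector
c = X.mean(axis=0)              # the centroid c(X)

lhs = ((X - v) ** 2).sum()
rhs = ((X - c) ** 2).sum() + len(X) * ((c - v) ** 2).sum()
print(np.isclose(lhs, rhs))     # True: the Centroid Lemma holds exactly
```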
RANDOM SAMPLING
Given a dataset $X$, we say $S \subseteq X$ is a random sampling of $X$ if $S$ is obtained by several independent draws from $X$ uniformly at random. We show that the objective value of the variance-based k-clustering of $X$ can be estimated reasonably well using $S$. Before that, we introduce two basic facts on expectation and variance from probability theory. Given independent random variables $V_1$ and $V_2$, we have the following.
Fact 1: $E(aV_1+bV_2) = aE(V_1) + bE(V_2)$.
Fact 2: $\mathrm{var}(aV_1+bV_2) = a^2\,\mathrm{var}(V_1) + b^2\,\mathrm{var}(V_2)$.
Suppose $S$ is a random sampling of $X$ consisting of $m$ draws. Then $c(S)$ is an unbiased estimator of $c(X)$, and the squared Euclidean distance between them can be estimated by the following lemma.
Lemma 1: $E(c(S)) = c(X)$ and $E(\|c(S)-c(X)\|^2) = \frac{1}{m}\mathrm{var}(X)$, where $\mathrm{var}(X) := \frac{1}{|X|}\sum_{x\in X}\|x-c(X)\|^2$ is the variance of a single uniform draw from $X$.
Proof: Assume w.l.o.g. that $S=\{V_1, V_2, \ldots, V_m\}$ and recall that the $V_i$ are independent random variables. Based on Fact 1, it holds that
$$E(c(S)) = E\Big(\frac{1}{m}\sum_{i=1}^{m}V_i\Big) = \frac{1}{m}\sum_{i=1}^{m}E(V_i) = \frac{1}{m}\sum_{i=1}^{m}c(X) = c(X).$$
Then
$$E(\|c(S)-c(X)\|^2) = E(\|c(S)-E(c(S))\|^2) = \mathrm{var}(c(S)) = \mathrm{var}\Big(\frac{1}{m}\sum_{i=1}^{m}V_i\Big) = \frac{1}{m^2}\sum_{i=1}^{m}\mathrm{var}(V_i) = \frac{1}{m}\mathrm{var}(X),$$
where the second-to-last equality is derived from Fact 2.
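A quick Monte Carlo check of Lemma 1 on synthetic data (the sampling size m and the number of trials below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
c_X = X.mean(axis=0)
var_X = ((X - c_X) ** 2).sum() / len(X)     # variance of one uniform draw

m, trials = 20, 20000
errs = np.empty(trials)
for t in range(trials):
    S = X[rng.integers(0, len(X), size=m)]  # m independent uniform draws
    errs[t] = ((S.mean(axis=0) - c_X) ** 2).sum()

# the empirical mean of ||c(S) - c(X)||^2 should be close to var(X)/m
print(errs.mean(), var_X / m)
```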
Based on the above, we conclude that $c(S)$ is indeed a good estimate of $c(X)$. A natural idea follows: $\sum_{x\in X}\|x-c(S)\|^2$ is probably a good estimate of $\sum_{x\in X}\|x-c(X)\|^2$, as shown in the following lemma.
Lemma 2: With probability at least $1-\delta$,
$$\sum_{x\in X}\|x-c(S)\|^2 \le \Big(1 + \frac{1}{\delta m}\Big)\sum_{x\in X}\|x-c(X)\|^2.$$
Proof: From Lemma 1 and the Markov inequality we know that, with probability at least $1-\delta$,
$$\|c(S)-c(X)\|^2 \le \frac{1}{\delta m |X|}\sum_{x\in X}\|x-c(X)\|^2.$$
Recalling the Centroid Lemma, we immediately have, with probability at least $1-\delta$, that
$$\sum_{x\in X}\|x-c(S)\|^2 = \sum_{x\in X}\|x-c(X)\|^2 + |X|\cdot\|c(S)-c(X)\|^2 \le \Big(1 + \frac{1}{\delta m}\Big)\sum_{x\in X}\|x-c(X)\|^2,$$
completing the proof.
Consider the following randomized algorithm for the k-clustering task based on the random sampling idea, which we simply call Random Sampling. Given the sampling set $S$, we construct every k-clustering of $S$ by a brute-force search. Note that there are $O(m^{dk})$ many possibilities to consider due to [9], but we are allowed to do this because $S$ is much smaller than $X$. For each k-clustering of $S$, we divide the space $\mathbb{R}^d$ into $k$ Voronoi cells according to the centroids of the $k$ clusters of $S$. Subsequently, we obtain a feasible k-clustering of $X$ simply by grouping the data points in the same Voronoi cell together. Then we choose the best one among these possible results. The Random Sampling algorithm is provided as Algorithm 1.
Algorithm 1: Random Sampling for k-clustering tasks
Input: Dataset $X$, integer $k$;
Output: k-clustering of $X$.
1: Sample a subset $S$ by $m$ ($\ge k$) independent draws from $X$ uniformly at random;
2: for every k-clustering $\{S_i\}_{1\le i\le k}$ of $S$ do
3:   Compute the centroid set $C=\{c(S_i)\}_{1\le i\le k}$;
4:   Obtain $\{X_i\}_{1\le i\le k}$, the Voronoi Partition of $X$ w.r.t. the Voronoi Diagram generated by $C$;
5:   Compute the value $\sum_{i=1}^{k}\sum_{x\in X_i}\|x-c(X_i)\|^2$;
6: return $\{X_i\}_{1\le i\le k}$ with the minimum value.

Next, we estimate the value for each of the $k$ clusters of $X$. Let $\{X'_i\}_{1\le i\le k}$ be the output of the Random Sampling algorithm, from which we obtain the corresponding k-clustering $\{S'_i\}_{1\le i\le k}$ of the random sampling subset $S'$: since the centroid of each cluster in $\{X'_i\}_{1\le i\le k}$ defines a Voronoi cell of the space, we partition $S'$ into $k$ clusters according to these cells. Assume w.l.o.g. that $|S'_i| \le |S'_{i+1}|$ for $i = 1, 2, \ldots, k-1$. Suppose $\{X^*_i\}_{1\le i\le k}$ is the optimal solution, ordered such that $|X^*_i| \le |X^*_{i+1}|$ for $i = 1, 2, \ldots, k-1$.
Since $S'$ is obtained by $m$ independent draws from $X$, the size of each cluster in $\{S'_i\}_{1\le i\le k}$ is determined by independent Bernoulli trials and depends on the distribution of $|X^*_i|$ over all $i$. Thus it must be that $E(|S'_i|) = \frac{m}{n}E(|X^*_i|)$. We denote the distribution function of $|X^*_i|$ by $p(i) := \frac{|X^*_i|}{n}$ over all $i \in \{1, \ldots, k\}$. We call $X$ a $\mu$-balanced instance ($0 \le \mu \le 1$) if there exists an optimal k-clustering for $X$ such that all clusters have size at least $\mu|X|$. For example, if $p(1) \ge \mu$, then we call $X$ a $\mu$-balanced instance. Recall that $X^*_1$ is the smallest cluster in $\{X^*_i\}_{1\le i\le k}$. We obtain the following lemma.
Lemma 3: If $X$ is a $(\ln m/m)$-balanced instance, then for any small positive constant $\eta$, it holds with probability at least $1-m^{-\eta^2/2}$ that
$$|S'_i| \ge (1-\eta)\,m\,p(i)$$
for all $i = 1, \ldots, k$.
Proof: It is obvious that
$$E(|S'_i|) = \frac{m}{n}E(|X^*_i|) = m\,p(i).$$
We start the proof with $S'_1$, the smallest cluster in expectation. Consider $m$ rounds of the following Bernoulli trial: the outcome is 1 with probability $p(1)$ and 0 with probability $1-p(1)$. Let $B_1, B_2, \ldots, B_m$ be the independent random variables of the $m$ trials and let $B=\sum_{i=1}^{m}B_i$. Obviously $E(B) = m\,p(1)$, and from the Chernoff bound we have
$$\Pr[B < (1-\eta)\,m\,p(1)] < e^{-\frac{m p(1)\eta^2}{2}} \le e^{-\frac{\eta^2 \ln m}{2}} = m^{-\eta^2/2}.$$
Thus, with probability at least $1-m^{-\eta^2/2}$, it follows that
$$|S'_1| \ge (1-\eta)\,m\,p(1).$$
The same argument applies for $i = 2, \ldots, k$, since $p(i) \ge \ln m/m$ holds for all $i$. This completes the proof.
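The bound can be checked empirically for the smallest admissible value p(1) = ln m/m (an illustrative simulation; the values of m and η are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, eta = 200, 0.3
p1 = np.log(m) / m                       # smallest admissible p(1)
trials = rng.random((50000, m)) < p1     # m Bernoulli(p(1)) trials per round
B = trials.sum(axis=1)

# empirical frequency of the bad event vs. the Chernoff-type bound m^(-eta^2/2)
print((B < (1 - eta) * m * p1).mean(), m ** (-eta ** 2 / 2))
```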
By combining Lemmas 2 and 3, we conclude the following estimate for the Random Sampling algorithm.
Theorem 1: For any $(\ln m/m)$-balanced instance of a k-clustering task, Algorithm 1 returns a feasible solution that is, with probability at least $1-\delta-m^{-\eta^2/2}$, within a factor of
$$1 + \frac{1}{(1-\eta)\,\delta \ln m}$$
of the optimum.
Proof: Considering the objective value of the output of Algorithm 1, and using the Centroid Lemma, we have
$$\sum_{i=1}^{k}\sum_{x\in X'_i}\|x-c(X'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X'_i}\|x-c(S'_i)\|^2.$$
From line 4 of Algorithm 1, we know that the partition $\{X'_i\}_{1\le i\le k}$ is obtained from the Voronoi Diagram generated by $\{c(S'_i)\}_{1\le i\le k}$. That is to say, for any $x\in X'_i$ and an arbitrary $j\neq i$, it must be the case that
$$\|x-c(S'_i)\| \le \|x-c(S'_j)\|.$$
Summing over all $x$, we obtain
$$\sum_{i=1}^{k}\sum_{x\in X'_i}\|x-c(S'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(S'_i)\|^2.$$
The right-hand side corresponds to an assignment in which each $x$ is assigned to $c(S'_i)$ whenever $x\in X^*_i$ for some $i$: compared with the left-hand side, the cost of every $x\in X^*_i\cap X'_i$ is unchanged, while the cost of every $x\in X^*_i\cap X'_j$ with $j\neq i$ can only increase.
Applying Lemma 2 to every cluster in $\{X^*_i\}_{1\le i\le k}$, with probability at least $1-\delta$ it holds that
$$\sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(S'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X^*_i}\Big(1+\frac{1}{\delta|S'_i|}\Big)\|x-c(X^*_i)\|^2.$$
Combining this with Lemma 3, we obtain, with probability at least $(1-\delta)(1-m^{-\eta^2/2}) \ge 1-\delta-m^{-\eta^2/2}$, that
$$\sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(S'_i)\|^2 \le \sum_{i=1}^{k}\sum_{x\in X^*_i}\Big(1+\frac{1}{(1-\eta)\,\delta\, m\, p(i)}\Big)\|x-c(X^*_i)\|^2 \le \Big(1+\frac{1}{(1-\eta)\,\delta\ln m}\Big)\sum_{i=1}^{k}\sum_{x\in X^*_i}\|x-c(X^*_i)\|^2,$$
where the last inequality follows from the assumption that $X$ is a $(\ln m/m)$-balanced instance. This completes the proof.
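For concreteness, the following is a minimal Python sketch of Algorithm 1 (an illustration only: the exhaustive enumeration over all k-labelings of the sample is exponential in m and is therefore restricted to very small sampling sets here).

```python
import numpy as np
from itertools import product

def random_sampling_clustering(X, k, m, rng):
    """Sketch of Algorithm 1: brute force over all k-clusterings of a sample S."""
    n = len(X)
    S = X[rng.integers(0, n, size=m)]               # m independent uniform draws
    best_labels, best_value = None, np.inf
    for assignment in product(range(k), repeat=m):  # every k-labeling of S
        assignment = np.asarray(assignment)
        if len(np.unique(assignment)) < k:          # skip labelings with empty parts
            continue
        C = np.array([S[assignment == i].mean(axis=0) for i in range(k)])
        # Voronoi partition of X w.r.t. the centroid set C
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # variance-based objective of the induced k-clustering of X
        value = sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
                    for i in range(k) if np.any(labels == i))
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) + 4 * rng.integers(0, 3, size=(300, 1))
labels, value = random_sampling_clustering(X, k=3, m=9, rng=rng)
print(value)
```

In practice one would enumerate only the Voronoi-induced k-clusterings of S, of which there are O(m^{dk}) [9], rather than all k^m labelings.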
NUMERICAL RESULTS
In this section, we evaluate the performance of the proposed RS (abbreviation for the Random Sampling algorithm), mainly through cross comparisons with the widely known KM (abbreviation for the k-means method) and KM++ (abbreviation for the k-means++ method) on the same datasets. The environment for the experiments is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 64GB memory. We conduct extensive numerical experiments to analyze the impact of different factors and parameter settings on the proposed algorithm. Since all algorithms are randomized, we run RS, KM and KM++ on 100 instances per setting and report the number of instances in which each algorithm hits the minimum objective value. We design the following experiments for different purposes.
1) Effect of n: We generate 100 instances for each $n \in \{100, 200, \ldots, 1000\}$ from a standard normal distribution, after which we run RS, KM and KM++ simultaneously on the same instance and record which of the three algorithms hits the minimum objective value. We fix $m=n/10$ and $k=3$ throughout these experiments; Figure 2 shows the numerical results.
Fig. 2. Effect of the size of the dataset n (x-axis: n; y-axis: number of instances hitting the minimum; series: KM, KM++, RS).
RS does not perform as well as KM or KM++ at the beginning because the sampling set is too small to represent the entire dataset. Taking $n=100$ as an example, a sampling set of size 10 is probably not a good estimate of the original dataset of 100 points. However, when $n$ increases to 700, a sampling set of size 70 seems good enough for RS to be competitive with KM and KM++. As $n$ grows further, RS performs increasingly well and tends to outperform both KM and KM++. Note that, for $n=100$ for example, the total number of instances in which any of the three algorithms hits the minimum exceeds 100. This is because, for smaller instances, it is more likely that more than one algorithm hits the minimum, in which case each of them is counted once in Figure 2.
2) Effect of k: We generate 100 instances from a standard normal distribution, after which we run RS, KM and KM++ simultaneously on the same instance for different k-clustering tasks with each $k \in \{2, 3, \ldots, 8\}$, and record which of the three algorithms hits the minimum objective value. We fix $n=100$ and $m=50$ throughout these experiments; Figure 3 shows the numerical results.
Fig. 3. Effect of the number of clusters k (x-axis: k; y-axis: number of instances hitting the minimum; series: KM, KM++, RS).
As shown, RS reaches its best performance for 2-clustering and its worst performance for 5-clustering. Overall, it is competitive with KM and KM++ under these settings.
3) Effect of m: We evaluate the performance of our algorithm on a real-world dataset. The Cloud dataset consists of 1024 points and represents the first cloud cover database available from the UC Irvine Machine Learning Repository. We run KM and KM++ on the Cloud dataset, along with RS for each sampling size $m \in \{25, 50, 75, \ldots, 200\}$. Since there is only one instance here, we run 100 rounds of each algorithm per setting and record which algorithm hits the minimum objective value in each round. Note that $n=1024$, and we fix $k=3$ throughout these experiments; Figure 4 shows the numerical results.
Fig. 4. Effect of the size of the sampling set m (x-axis: m; y-axis: number of rounds hitting the minimum; series: KM, KM++, RS).
As predicted, RS performs increasingly well as the sampling size gets larger. It is quite surprising that already at $m=75$ (only about 7% of the Cloud dataset), RS performs as well as KM++. When $m$ exceeds 100 (about 10% of the Cloud dataset), RS outperforms both KM and KM++. If $m$ reaches 150 (about 15% of the Cloud dataset) or higher, RS wins in at least 80% of the rounds of the clustering tasks.
EXTENSION TO BALANCED k-CLUSTERING
An additional important feature of the proposed Random Sampling algorithm is that it extends to handle balanced variance-based k-clustering tasks, which the k-means method and the k-means++ method cannot deal with. Both upper-bound and lower-bound constraints are considered, meaning that a feasible balanced k-clustering has a global lower bound $l$ and upper bound $u$ on the cluster sizes. We assume w.l.o.g. that $l$ and $u$ are positive integers. The main idea is a minimum-cost flow subroutine embedded into the Random Sampling algorithm.
To start, we introduce the well-known minimum-cost flow problem. Given a directed graph $G=(V,E)$, every edge $e\in E$ has a weight $c(e)$ representing the cost of sending a unit of flow along it. Also, every $e\in E$ is equipped with a bandwidth constraint: only flows of value at most $\mathrm{upper}(e)$ and at least $\mathrm{lower}(e)$ can pass through edge $e$, where $\mathrm{upper}(e)$ and $\mathrm{lower}(e)$ denote the upper and lower bounds on the bandwidth of $e$, respectively. Every node $v\in V$ has a demand $d(v)$, defined as the total outflow minus the total inflow. Thus a negative demand represents a need for flow and a positive one represents a supply.
A flow in $G$ is defined as a function from $E$ to $\mathbb{R}_+$. A feasible flow carrying $f$ units of flow in the graph requires a source $s$ and a sink $t$ with $d(s)=f$ and $d(t)=-f$. Every node $v\in V\setminus\{s,t\}$ must have $d(v)=0$, which means it is either an intermediate node or an idle node. The cost of flow $f$ is defined as $c(f)=\sum_{e\in E}f(e)\cdot c(e)$, where $f(\cdot): E\to\mathbb{R}_+$ is the function corresponding to flow $f$. The minimum-cost flow problem asks for a cheapest way (i.e., with the minimum cost) of sending a certain amount of flow through the graph $G$.
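As a toy illustration, the following NetworkX snippet solves a four-node instance; note that NetworkX's demand attribute uses the opposite sign convention to the text (negative for a supply node, positive for a node that needs flow):

```python
import networkx as nx

# Send 4 units from s to t at minimum cost.
G = nx.DiGraph()
G.add_node("s", demand=-4)   # supplies 4 units (NetworkX convention)
G.add_node("t", demand=4)    # requires 4 units
G.add_edge("s", "a", capacity=3, weight=2)
G.add_edge("s", "b", capacity=3, weight=3)
G.add_edge("a", "t", capacity=3, weight=0)
G.add_edge("b", "t", capacity=3, weight=0)

flow = nx.min_cost_flow(G)       # e.g. {'s': {'a': 3, 'b': 1}, ...}
print(nx.cost_of_flow(G, flow))  # 3*2 + 1*3 = 9
```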
To deal with the capacity constraints, we herein propose a Random Sampling based randomized algorithm that embeds a minimum-cost flow subroutine. Obviously, the Voronoi Diagram generated by the centroids of a k-clustering of the sampling set $S$ does not guarantee a feasible Voronoi Partition of $X$ satisfying the capacity constraints. Assume that we are given a k-clustering of $S$ and we look for a feasible balanced k-clustering of $X$.
Consider the following instance of the minimum-cost flow problem. Let $V$ be $X\cup C\cup\{s,t\}$, where $C$ consists of the centroids $\{c(S_i)\}_{1\le i\le k}$ obtained from the given k-clustering of $S$, and $s$ and $t$ are the dummy source and sink nodes, respectively. Let $E$ be $E_1\cup E_2\cup E_3$, where $E_1$ consists of the directed edges $(s,i)$ from $s$ to each $i\in X$, $E_2$ consists of the edges $(i,j)$ from each $i\in X$ to each $j\in C$, and $E_3$ consists of the edges $(j,t)$ from each $j\in C$ to $t$. Every edge in $E_1\cup E_2$ has bandwidth interval $[0,1]$, while every edge in $E_3$ has bandwidth interval $[l,u]$. Edges in $E_1\cup E_3$ are unweighted, and each edge $(i,j)\in E_2$ has weight $\|i-j\|^2$ for $i\in X$ and $j\in C$. See Figure 5 for an illustration.
Fig. 5. A minimum-cost flow instance.
As shown in the figure, the bandwidth intervals and the weights/costs are labeled on the edges. All the edges are oriented from the source to the sink, and we simply omit the direction labels. Inside the shadowed box is a complete bipartite graph, also known as a biclique, consisting of the vertices $X\cup C$ and the edges $E_2$. Consider a flow $f$ carrying $n$ ($n=|X|$) units of flow from the source to the sink in $G$ and suppose that the function $f: E\to\mathbb{R}_+$ reflects such a flow. Recall that $d(v) = \sum_{e\in\delta^+(v)}f(e) - \sum_{e\in\delta^-(v)}f(e)$, where $\delta^+(v)$ denotes the edges leading away from node $v$ and $\delta^-(v)$ denotes the edges leading into $v$. Then the following must hold.
Flow conservation:
$$d(v) = \begin{cases} n, & v = s;\\ -n, & v = t;\\ 0, & v \neq s,t.\end{cases}$$
Bandwidth constraints:
$$0 \le f(e) \le 1 \ \ \forall e\in E_1; \qquad 0 \le f(e) \le 1 \ \ \forall e\in E_2; \qquad l \le f(e) \le u \ \ \forall e\in E_3.$$
The minimum-cost flow problem then aims to find a function $f: E\to\mathbb{R}_+$ satisfying both the flow conservation and the bandwidth constraints so as to minimize its cost, i.e., $\sum_{e\in E}f(e)\cdot c(e)$. An important property of the minimum-cost flow problem is that basic feasible solutions are integer-valued whenever the capacity constraints and the quantity of flow produced at each node are integer-valued, as captured by the following lemma.
Lemma 4 ([1]): If the objective value of the minimum-cost flow problem is bounded from below on the feasible region and the problem has a feasible solution, then it has an optimal solution; moreover, if the capacity constraints and the quantities of flow are all integral, then the problem has at least one integral optimal solution.
The integral solution can be computed efficiently by Cycle Canceling algorithms, Successive Shortest Path algorithms, Out-of-Kilter algorithms and Linear Programming based algorithms. These algorithms can be found in many textbooks; see for example [1]. We take any one of these algorithms as the MCF (Minimum-Cost Flow) subroutine in our algorithm. We show the following theorem.
Theorem 2: The integral optimal solution to the above minimum-cost flow instance provides an optimal assignment from $X$ to $C$ for the balanced clustering task.
Proof: We only need to prove that any feasible assignment from the dataset $X$ to the given centroid set $C$ can be represented by a feasible integral flow in the aforementioned minimum-cost flow instance, and vice versa.
Let $\sigma: X\to C$ be a feasible assignment from $X$ to $C$. Consider the following flow $f: E\to\mathbb{R}_+$:
$$f(e) = \begin{cases} 1, & e\in E_1,\\ 1, & e\in E_2 \text{ and } \sigma(e_o)=e_d,\\ 0, & e\in E_2 \text{ and } \sigma(e_o)\neq e_d,\\ \sum_{e':\,e'_d=e_o} f(e'), & e\in E_3,\end{cases}$$
where we denote the origin and destination of edge $e$ by $e_o$ and $e_d$, respectively. Note that the quantity of $f$ is $n$. Obviously, $f$ satisfies the flow conservation, and every edge in $E_1$ and $E_2$ obeys the bandwidth constraints. For $e\in E_3$, from the construction we have $f(e) = \sum_{e':\,e'_d=e_o} f(e') = |\sigma^{-1}(e_o)|$. Since $\sigma$ is feasible, it must hold for every $j\in C$ that $l\le|\sigma^{-1}(j)|\le u$, which implies the feasibility of the bandwidth constraints for $E_3$.
On the other hand, given an integral feasible flow $f$, the corresponding assignment must be feasible, i.e., it satisfies the size constraints. Note that a feasible flow of quantity $n$ in the above instance must have $f(e)=1$ for every $e\in E_1$. Consider the following assignment $\sigma$: for any $i\in X$ and $j\in C$, $\sigma(i)=j$ if and only if the edge $e$ with $e_o=i$ and $e_d=j$ satisfies $f(e)=1$. The defined assignment must be feasible because $|\sigma^{-1}(j)| = \sum_{e\in\delta^-(j)}f(e) = \sum_{e\in\delta^+(j)}f(e)$ holds for any $j\in C$, and from the feasibility of the flow $f$ we know that $l \le \sum_{e\in\delta^+(j)}f(e) \le u$.
It is obvious that the cost of a feasible assignment and the cost of its corresponding flow are exactly the same, because
$$\sum_{e\in E}c(e)f(e) = \sum_{e\in E_2}c(e)f(e) = \sum_{e\in E_2:\,f(e)=1}\|e_o-e_d\|^2 = \sum_{i\in X}\sum_{j=\sigma(i)}\|i-j\|^2 = \sum_{x\in X}\|x-\sigma(x)\|^2 = \sum_{i=1}^{k}\sum_{x\in X_i}\|x-\sigma(x)\|^2,$$
where the first equality is derived from the construction and the last equality holds for any feasible partition of $X$, which we assume without loss of generality to be $\{X_i\}_{1\le i\le k}$. This completes the proof.
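As an illustrative sketch (one possible realization, not necessarily the implementation used by the authors), the assignment step can be solved via the transportation LP that is equivalent to the above flow instance; by the integrality property of Lemma 4, a vertex optimum is 0/1-valued, so the assignment can be read off directly:

```python
import numpy as np
from scipy.optimize import linprog

def balanced_assignment(X, C, l, u):
    """Assign each point of X to one of the k centers in C, with every center
    receiving between l and u points, minimizing total squared distance."""
    n, k = len(X), len(C)
    # cost of variable f_{ij}: squared distance from x_i to c_j (row-major order)
    cost = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).ravel()

    # each point is assigned to exactly one center: sum_j f_{ij} = 1
    A_eq = np.zeros((n, n * k))
    for i in range(n):
        A_eq[i, i * k:(i + 1) * k] = 1.0
    b_eq = np.ones(n)

    # cluster-size constraints: l <= sum_i f_{ij} <= u for every center j
    A_ub = np.zeros((2 * k, n * k))
    for j in range(k):
        A_ub[j, j::k] = 1.0       #  sum_i f_{ij} <=  u
        A_ub[k + j, j::k] = -1.0  # -sum_i f_{ij} <= -l
    b_ub = np.concatenate([np.full(k, u), np.full(k, -l)])

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n, k).argmax(axis=1)   # vertex optimum is 0/1

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 2))
C = np.array([[-1.0, 0.0], [1.0, 0.0]])
labels = balanced_assignment(X, C, l=8, u=12)
print(np.bincount(labels, minlength=2))          # both sizes lie in [8, 12]
```

A routine of this kind can stand in for the MCF subroutine invoked in line 4 of Algorithm 2 below.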
Based on the above, we conclude that an MCF subroutine embedded in the Random Sampling algorithm guarantees a valid solution for the balanced k-clustering problem. The pseudocode is provided as Algorithm 2.
Algorithm 2: Random Sampling for balanced k-clustering tasks
Input: Dataset $X$, integer $k$;
Output: k-clustering of $X$.
1: Sample a subset $S$ by $m$ ($\ge k$) independent draws from $X$ uniformly at random;
2: for every k-clustering $\{S_i\}_{1\le i\le k}$ of $S$ do
3:   Compute the centroid set $C=\{c(S_i)\}_{1\le i\le k}$;
4:   Obtain $\{X_i\}_{1\le i\le k}$ by the MCF subroutine;
5:   Compute the value $\sum_{i=1}^{k}\sum_{x\in X_i}\|x-c(X_i)\|^2$;
6: return $\{X_i\}_{1\le i\le k}$ with the minimum value.

DISCUSSION
We are incredibly well informed yet we know incredibly little, and this is what is happening in clustering tasks.
Our work implies that we do not need that much information about the dataset when doing clustering. Roughly speaking, the experiments show that, to obtain a clustering result competitive with the k-means method and the k-means++ method, we only need about 7% of the information in the dataset. For the remaining 93% of the data, we immediately make decisions with only O(k) additional computations. Note that the resources consumed by the algorithm are dominated by the brute-force search over the k-clusterings of the sampling set. If we have 15% of the information of the dataset or more, then with high probability our algorithm outperforms both the k-means method and the k-means++ method in terms of clustering quality. The above statements hold only when 1) the dataset is independent and identically distributed; 2) the sampling set is picked uniformly at random from the original dataset; and 3) most importantly, the dataset is large enough (experimentally, 500 data points or more suffice). As a cost, the proposed algorithm has a high complexity with respect to k, but fortunately it is not sensitive to the size of the dataset or the size of the sampling set.
We believe that the Random Sampling idea, as well as the framework of the analysis, has the potential to deal with incomplete datasets and online clustering tasks.
REFERENCES
[1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network
flows - theory, algorithms and applications. Prentice Hall, 1993.
[2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages
of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 1027–1035, 2007.
[3] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and
Ali Kemal Sinop. The hardness of approximation of euclidean k-means.
In International Symposium on Computational Geometry (SoCG), pages
754–767, 2015.
[4] Olivier Bachem, Mario Lucic, S Hamed Hassani, and Andreas Krause.
Approximate k-means++ in sublinear time. In AAAI Conference on
Artificial Intelligence (AAAI), pages 1459–1467, 2016.
[5] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and
Sergei Vassilvitskii. Scalable k-means++. In Very Large Data Bases
(VLDB), pages 622–633, 2012.
[6] Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. arXiv preprint arXiv:2002.07784, 2020.
[7] Vincent Cohen-Addad. Approximation schemes for capacitated clus-
tering in doubling metrics. In ACM-SIAM Symposium on Discrete
Algorithms (SODA), pages 2241–2259, 2020.
[8] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability
of capacitated clustering. In International Colloquium on Automata,
Languages, and Programming (ICALP), pages 1–14, 2019.
[9] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted
voronoi diagrams and randomization to variance-based k-clustering. In
International Symposium on Computational Geometry (SoCG), pages
332–339, 1994.
[10] Kamal Jain and Vijay V Vazirani. Approximation algorithms for metric
facility location and k-median problems using the primal-dual schema
and lagrangian relaxation. Journal of the ACM, 48(2):274–296, 2001.
[11] Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via
local search. In International Conference on Machine Learning (ICML),
pages 3662–3671, 2019.
[12] Zhihui Li, Feiping Nie, Xiaojun Chang, Zhigang Ma, and Yi Yang.
Balanced clustering via exclusive lasso: A pragmatic approach. In AAAI
Conference on Artificial Intelligence (AAAI), pages 3596–3603, 2018.
[13] Weibo Lin, Zhu He, and Mingyu Xiao. Balanced clustering: A uniform
model and fast algorithm. In International Joint Conference on Artificial
Intelligence (IJCAI), pages 2987–2993, 2019.
[14] Hanyang Liu, Junwei Han, Feiping Nie, and Xuelong Li. Balanced
clustering with least square regression. In AAAI Conference on Artificial
Intelligence (AAAI), pages 2231–2237, 2017.
[15] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[16] Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang
Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu,
S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and
information systems, 14(1):1–37, 2008.
[17] Yicheng Xu, Rolf H. Möhring, Dachuan Xu, Yong Zhang, and Yifei Zou. A constant FPT approximation algorithm for hard-capacitated k-means. Optimization and Engineering, pages 1–14, 2020.