Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage
Falk Hüffner, Christian Komusiewicz, Adrian Liebtrau, and Rolf Niedermeier

F. Hüffner, C. Komusiewicz, and R. Niedermeier are with the Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Germany. E-mail: {falk.hueffner, christian.komusiewicz, rolf.niedermeier}@tu-berlin.de. A. Liebtrau was with the Institut für Informatik, Friedrich-Schiller-Universität Jena, Germany. Supported by DFG projects PABI (NI 369/7-2) and ALEPH (HU 2139/1-1) and by a post-doc fellowship of the region Pays de la Loire.

A preliminary version appeared in the proceedings of the 9th International Symposium on Bioinformatics Research and Applications (ISBRA 2013) held at Charlotte, NC, USA (volume 7875 of Lecture Notes in Computer Science, pages 99–111, Springer, 2013). The full version contains all missing proofs as well as experiments based on an extended data set. We furthermore propose and evaluate a new heuristic for HIGHLY CONNECTED DELETION.
Abstract—A popular clustering algorithm for biological networks which was proposed by Hartuv and Shamir [IPL 2000] identifies nonoverlapping highly connected components. We extend the approach taken by this algorithm by introducing the combinatorial optimization problem HIGHLY CONNECTED DELETION, which asks for removing as few edges as possible from a graph such that the resulting graph consists of highly connected components. We show that HIGHLY CONNECTED DELETION is NP-hard and provide a fixed-parameter algorithm and a kernelization. We propose exact and heuristic solution strategies, based on polynomial-time data reduction rules and integer linear programming with column generation. The data reduction typically identifies 75% of the edges that are deleted for an optimal solution; the column generation method can then optimally solve protein interaction networks with up to 6,000 vertices and 13,500 edges within five hours. Additionally, we present a new heuristic that finds more clusters than the method by Hartuv and Shamir.

Index Terms—Cluster analysis, PPI networks, fixed-parameter tractability, data reduction, integer linear programming, heuristics
1 INTRODUCTION
Network clustering is a computational tool to analyze biological systems by identifying functional subgroups within large biological networks generated, for example, from protein interaction data. Herein, a key idea is to identify densely connected subgraphs (clusters) that have many interactions within themselves and few with the rest of the graph [1–4]. Hartuv and Shamir [5] proposed a clustering algorithm producing so-called highly connected clusters. Their method has been successfully used to cluster cDNA fingerprints [6], to find complexes in protein–protein interaction (PPI) data [7, 8], to group protein sequences hierarchically into superfamily and family clusters [9], and to find families of regulatory RNA structures [10]. Hartuv and Shamir [5] formalized the connectivity demand for a cluster as follows: the edge connectivity λ(G) of a graph G is the minimum number of edges whose deletion results in a disconnected graph, and a graph G with n vertices is called highly connected if λ(G) > n/2. An equivalent characterization is that a graph is highly connected if each vertex has degree greater than n/2 [11]. Thus, highly connected graphs are very similar to 0.5-quasi-complete graphs [12, 13], that is, graphs where every vertex has degree at least (n − 1)/2. Further, being highly connected also ensures that the diameter of a cluster is at most two [5].
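The degree characterization yields a simple linear-time membership test. The following C++ sketch (our illustration; names and representation are ours, not taken from the implementation discussed later) checks whether a graph given by adjacency lists is highly connected; in line with the conventions below, singletons and isolated edges are rejected.

    #include <cstddef>
    #include <vector>

    // Sketch: test whether a graph, given as adjacency lists over the
    // vertices 0..n-1, is highly connected. By the degree characterization
    // [11], a graph on n >= 3 vertices is highly connected if and only if
    // every vertex has degree greater than n/2 (which forces connectivity).
    bool isHighlyConnected(const std::vector<std::vector<int>>& adj) {
        const std::size_t n = adj.size();
        for (const auto& neighbors : adj)
            if (2 * neighbors.size() <= n)   // deg(v) <= n/2 violates the bound
                return false;
        return n >= 3;   // triangles are the smallest highly connected graphs
    }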
The algorithm by Hartuv and Shamir [5] partitions the vertex set of the given graph such that each partition set is highly connected, thus guaranteeing good intra-cluster density (including maximum cluster diameter two and the presence of more than half of all possible edges). Moreover, the algorithm needs no prespecified parameters (such as the number of clusters), and it naturally extends to hierarchical clustering. Essentially, Hartuv and Shamir's algorithm iteratively deletes the edges of a minimum cut in a connected component that is not yet highly connected.¹

1. The CLICK algorithm [14] and the SIDES algorithm [15] follow the same scheme, but use edge weights and different stopping criteria, based on probabilistic models.
While Hartuv and Shamir's algorithm guarantees to output a partitioning into highly connected subgraphs, it iteratively uses a greedy step to find small edge sets to delete. Thus, it does not guarantee to maximize the overall number of edges within clusters or, equivalently, to minimize inter-cluster connectivity. In other words, it is not ensured that the partitioning comes along with a minimum number of edge deletions making the resulting graph consist of highly connected components. This is why we propose a formally defined combinatorial optimization problem that specifies the goal, addressed only implicitly by Hartuv and Shamir's algorithm, to minimize the number of edge deletions.

HIGHLY CONNECTED DELETION
Instance: An undirected graph G = (V, E).
Task: Find a minimum subset of edges E′ ⊆ E such that in G′ = (V, E \ E′) all connected components are highly connected.

Note that, by definition, isolated edges are not highly connected. Hence, the smallest clusters are triangles; we consider all singletons as unclustered.
The problem formulation resembles the CLUSTER DELETION problem [16], which asks for a minimum number of edge deletions to make each connected component a clique; thus, CLUSTER DELETION has a much stronger demand on intra-cluster connectivity. Also related is the 2-CLUB DELETION problem [17], which asks for a minimum number of edge deletions to make each connected component have a diameter of at most two. Since highly connected clusters also have diameter at most two [5], 2-CLUB DELETION poses a weaker demand on intra-cluster connectivity and density. Further related problems are those which allow a different set of modifications. If we allow insertion of edges instead of deletion, the problem reduces to increasing the connectivity of a graph to more than n/2 by edge insertions; this problem can be solved in polynomial time [18]. The vertex deletion version of HIGHLY CONNECTED DELETION is NP-hard by a simple adaption of the NP-hardness proof of 2-CLUB CLUSTER VERTEX DELETION due to Liu et al. [17]: given a VERTEX COVER instance, create an equivalent HIGHLY CONNECTED VERTEX DELETION instance by attaching to each vertex a large clique. It is then easy to see that the graph has a vertex cover of size at most k if and only if at most k vertices can be deleted to leave a graph where every component is highly connected.

It could be expected that the algorithm by Hartuv and Shamir [5] yields a good approximation for the optimization goal of HIGHLY CONNECTED DELETION. However, we can observe that in the worst case, its result has size Ω(k²), where k := |E′| is the size of an optimal solution. For this, consider two cliques with vertex sets u_1, ..., u_n and v_1, ..., v_n, respectively, and the additional edges {u_i, v_i} for 2 ≤ i ≤ n. Then these additional edges form a solution set of size n − 1; however, Hartuv and Shamir's algorithm will (with an unlucky choice of minimum cuts) transform one of the two cliques into an independent set by repeatedly cutting off one vertex, thereby deleting n(n + 1)/2 − 1 edges. This also illustrates the tendency of the algorithm to cut off one-vertex clusters, which Hartuv and Shamir counteract with postprocessing [5]. This tendency might introduce systematic bias [15]. Hence, exact algorithms for solving HIGHLY CONNECTED DELETION are desirable.
1.1 Our contributions

We analyze the (parameterized) computational complexity of HIGHLY CONNECTED DELETION and propose several exact solution methods.² In particular, we show that HIGHLY CONNECTED DELETION is NP-hard even on 4-regular graphs and, provided the Exponential Time Hypothesis (ETH) [20] is true, cannot be solved in subexponential time. While biological networks are unlikely to be 4-regular, this result directly implies hardness for graphs with bounded degeneracy.³ Biological networks often have low degeneracy, but the NP-hardness means that this fact cannot be directly exploited to obtain efficient algorithms for HIGHLY CONNECTED DELETION. In addition, we provide polynomial-time executable data reduction rules (on the theoretical side also yielding a so-called problem kernel of polynomial size) and a fixed-parameter algorithm based on dynamic programming; both of these results exploit the parameter "number of edge deletions".

Geared more towards practical methods, we design an additional polynomial-time executable data reduction rule for HIGHLY CONNECTED DELETION that preserves the possibility to solve the problem exactly. Further, we develop an ILP formulation (using column generation) and show how, combined with the presented data reduction rules, real-world instances with, e.g., 6,000 vertices and 13,500 edges can be solved within five hours. From an algorithmic standpoint we thus make progress towards exact algorithms for NP-hard clustering problems on biological networks. The related application area of comparative network analysis has seen systematic algorithmic advances in recent years that have made it possible to find exact solutions for hard problems [21]. We believe that such advances should also be made for clustering problems and present a first step for one concrete problem.⁴ Finally, we present a simple heuristic that can be employed when the instances are too large for the exact approach.

On the biological and experimental side, we focus on the analysis of the three species A. thaliana, C. elegans, and M. musculus with moderate-size protein interaction networks. We compare our approach with Hartuv and Shamir's, to which we refer as the min-cut method, and with Markov clustering (MCL). While our exact approach is clearly the slowest of the three, we demonstrate that it is a viable alternative in terms of the number and quality of the reported clusters.

2. Exact (in contrast to heuristic and approximate) algorithms may be beneficial for several reasons: the availability of optimal solutions can be used to evaluate the quality and performance of heuristics and helps to separate possible model inadequacies from errors introduced by heuristic solutions, and, along with the clear combinatorial model, optimal solutions are easier to interpret; refer to Aloise et al. [19] for a deeper elaboration.

3. A graph has degeneracy d if every subgraph has at least one vertex of degree at most d.

4. A related NP-hard clustering problem for which significant progress has been made towards efficient exact algorithms is CLUSTER EDITING [22].

More specifically, first we observe that our data reduction rules typically identify more than 75% of the edges to be deleted for an optimal solution of HIGHLY CONNECTED DELETION. Combining our data reduction rules with the min-cut method significantly improves this method's running time and the quality of its reported clusters. Still, with our data reduction and ILP approach we typically find more biologically relevant clusters than the min-cut method extended with our data reduction algorithm does. Further, we compared our results with a state-of-the-art algorithm based on Markov clustering. Again, while this algorithm shows a much faster running time and better coverage (a higher number of vertices, that is, proteins, assigned to clusters), our algorithm is superior in terms of cluster quality. Our newly proposed heuristic can also beat the min-cut method in the number of clusters found, which we illustrate in particular for a dense network.
1.2 Preliminaries

We consider only undirected and simple graphs G = (V, E). We use n and m to denote the number of vertices and edges in the input graph, respectively, and k for the minimum size of an edge set whose deletion makes all components highly connected. The order of a graph G is the number of vertices in G. We use G[S] to denote the subgraph induced by S ⊆ V. Let N(v) := {u | {u, v} ∈ E} denote the (open) neighborhood of v, and let N[v] := N(v) ∪ {v}. A minimum cut of a graph G is a smallest edge set E′ such that deleting E′ increases the number of connected components of G.

We view HIGHLY CONNECTED DELETION as a parameterized problem [23–25], where the parameter is the number k of edge deletions. A parameterized problem with input size s is called fixed-parameter tractable (FPT) with respect to a parameter k if it can be solved in f(k) · s^{O(1)} time, where f is a computable function depending only on k. A problem kernel for a parameterized problem is a many-one polynomial-time self-reduction such that the produced instances have size upper-bounded by some function of the parameter [26]. Usually, a problem kernel is achieved by applying polynomial-time executable data reduction rules. Referring to decision problems, we call a data reduction rule R correct if the new problem instance I′ that results from applying R to the original instance I is a yes-instance if and only if I is a yes-instance.

The Exponential Time Hypothesis (ETH) [20] states that 3-SAT cannot be solved in subexponential time. Several results on (relative) lower bounds for exponential-time algorithms have been shown by exploiting the ETH; see Lokshtanov et al. [27] for a recent survey.
2 COMPUTATIONAL COMPLEXITY

We present a proof that HIGHLY CONNECTED DELETION is NP-hard even on 4-regular graphs and that it cannot be solved in subexponential time unless the Exponential Time Hypothesis (ETH) [20] is false. Our reduction is from the PARTITION INTO TRIANGLES problem.

Fig. 1. The two different neighborhoods (Type I and Type II) in a 4-regular neighborhood-restricted graph. None of these graphs contains a clique of order four.

As shown by van Rooij et al. [28], the PARTITION INTO TRIANGLES problem is NP-hard even when the input graph G = (V, E) is 4-regular. Moreover, NP-hardness persists even when for each vertex v ∈ V the graph G[N[v]] is isomorphic to one of the two graphs shown in Fig. 1 [28]; we refer to such graphs as 4-regular neighborhood-restricted graphs. The variant of PARTITION INTO TRIANGLES that we use in the reduction thus is:

RESTRICTED PARTITION INTO TRIANGLES (RPIT)
Instance: An undirected 4-regular neighborhood-restricted graph G = (V, E).
Question: Can V be partitioned into |V|/3 sets such that each set of the partition induces a triangle, that is, a complete graph on three vertices, in G?

Our reduction is similar to a simple reduction that was used to show NP-hardness of CLUSTER DELETION on graphs of maximum degree four [29]. In that reduction, the graph remains basically the same, and one just has to find the appropriate k. The main difference is that for HIGHLY CONNECTED DELETION, we have to show that there can be no clusters larger than triangles in G (in the case of CLUSTER DELETION this is easier since the cluster size is at most five in 4-regular graphs). In the following, we present two observations and a data reduction rule for RPIT and then use them to obtain a reduction from RPIT to HIGHLY CONNECTED DELETION. The overall aim is to show that we can assume for our reduction that all clusters are triangles.
Lemma 1: Let G = (V, E) be a 4-regular neighborhood-restricted graph. Then G does not contain any highly connected subgraph of order four, five, or at least eight.

Proof: Obviously, G does not contain highly connected subgraphs of order at least eight, since G is 4-regular and any highly connected graph of order at least eight has minimum degree five. Furthermore, the only highly connected graphs of order four are cliques on four vertices. Since G has only the neighborhoods shown in Fig. 1, it does not contain cliques of order four.

It remains to show that G does not contain highly connected subgraphs of order five. The main observation we use is that, in such a graph, every vertex has degree at least three. Let v be a vertex in G. Since G[N[v]] contains at least two vertices of degree two (see Fig. 1), G[N[v]] is not highly connected. Hence, if v is contained in a highly connected subgraph G′ of order five, then G′ contains at least one vertex from V \ N[v], and exactly one vertex from N[v] is not in G′. If G[N[v]] is of Type I (see Fig. 1), then v is not contained in a highly connected subgraph of order five: deleting one vertex from G[N[v]] produces one vertex with degree one, and this vertex cannot obtain degree at least three by adding one vertex. Hence, assume that G[N[v]] is of Type II. Clearly, every highly connected subgraph of order five containing v also has to contain u, since otherwise one again creates a degree-one vertex. Note that N[u] = N[v]. Since G is 4-regular, there is no vertex in V \ N[v] that is adjacent to u or v. Consequently, every vertex of V \ N[v] has degree at most two in a subgraph of G that has order five and contains u and v. Hence, there is no highly connected subgraph of order five that contains v.

We now present a data reduction rule that removes small connected components from the RPIT instance.

Rule 1: Let G = (V, E) be an instance of RPIT. If G contains a connected component C of order at most seven, then check whether C can be partitioned into triangles. If this is the case, then remove C from G; otherwise, answer "no".

The correctness of the reduction rule is obvious. Furthermore, it results in an instance with the following property.

Lemma 2: Let G = (V, E) be an instance of RPIT that is reduced with respect to Rule 1. Then G does not contain any highly connected subgraphs of order six or seven.

Proof: Assume that G contains a highly connected subgraph G′ = (V′, E′) on six vertices. Then each vertex in G′ has at least four neighbors in G′. Consequently, no vertex of G′ has in G any neighbors in V \ V′. Hence, G′ is a connected component of G. This contradicts the fact that G is reduced with respect to Rule 1. A similar argument applies for highly connected subgraphs of order seven.
Theorem 1: HIGHLY CONNECTED DELETION on 4-regular graphs is NP-hard and cannot be solved in 2^{o(k)} · n^{O(1)}, 2^{o(n)} · n^{O(1)}, or 2^{o(m)} · n^{O(1)} time unless the Exponential Time Hypothesis (ETH) is false.

Proof: We reduce from RPIT, which is NP-hard and cannot be solved in 2^{o(n)} · n^{O(1)} time unless the ETH is false [28]. Given an instance of RPIT, first apply Rule 1. Let G = (V, E) be the resulting instance. We obtain an instance of HIGHLY CONNECTED DELETION by setting k := |V|.

The equivalence of the instances can be seen as follows. If G has a partition into triangles, then each of these triangles is a highly connected subgraph. The number of triangles is |V|/3 and the overall number of edges contained in these triangles is |V|. Since G is 4-regular, |E| = 2|V|. Hence, G can be transformed by at most k edge deletions into a highly connected cluster graph.

Conversely, if G can be transformed into a highly connected cluster graph G′ by at most |V| edge deletions, then G′ has at least |V| edges. By Lemmas 1 and 2, no cluster in G′ has order at least four. Hence, all clusters are triangles or singletons. Since G′ has |V| edges, all clusters are triangles. Therefore, G can be partitioned into triangles.

Clearly, the reduction implies NP-hardness of HIGHLY CONNECTED DELETION on 4-regular graphs. The ETH-based lower bounds follow from the fact that |V| = k = |E|/2.
3 PARAMETERIZED COMPLEXITY

In this section, we provide polynomial-time data reduction rules that reduce an instance of HIGHLY CONNECTED DELETION to an equivalent one with at most 10 · k^{1.5} vertices. Thus, after reduction, the instance size depends solely on k. The resulting instance is called a problem kernel with respect to the parameter k, and the data reduction algorithm is called a kernelization [26]. The size of the resulting instance is used to measure the effectiveness of the data reduction rules. Further, we present a fixed-parameter algorithm for HIGHLY CONNECTED DELETION with running time O(3^{4k} · k² + n²mk · log n). These results imply the fixed-parameter tractability of HIGHLY CONNECTED DELETION with respect to k and give hope for finding optimal solutions for instances where k is not too large.

3.1 Problem Kernel

The first data reduction rule is obvious.

Rule 2: Remove all connected components from G that are highly connected.
The following lemma can be proved by a simple counting argument.

Lemma 3: Let G be a highly connected graph and let u, v be two vertices in G. If u and v are adjacent, then they have at least one common neighbor; otherwise, they have at least three common neighbors.

Proof: Let n_uv be the number of common neighbors of u and v, and let n_u and n_v be the number of neighbors specific to u and v, respectively (excluding u and v). Let c be 1 if {u, v} ∈ E and 0 otherwise. We have n_uv + n_u + c > n/2 and n_uv + n_v + c > n/2, thus 2n_uv + n_u + n_v + 2c ≥ n + 1. Since n ≥ n_uv + n_u + n_v + 2, we get n_uv + 2c − 2 ≥ 1, thus n_uv ≥ 3 − 2c.

A simple data reduction rule follows directly from Lemma 3.

Rule 3: If there are two vertices u and v with {u, v} ∈ E that have no common neighbors, then delete {u, v} and decrease k by one.
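Because an edge that lies in a triangle is never deleted by Rule 3 (an observation also used in the proof of Theorem 2 below), a single pass over all edges applies the rule exhaustively. A C++ sketch of this pass (ours; adjacency sets assumed, and the caller decreases k by the returned count):

    #include <set>
    #include <vector>

    // Sketch of Rule 3: delete every edge whose endpoints have no common
    // neighbor. Deleted edges are never triangle edges, so the deletions
    // create no new applicable pairs and one pass suffices.
    int applyRule3(std::vector<std::set<int>>& adj) {
        int deletions = 0;
        for (int u = 0; u < (int)adj.size(); ++u) {
            const std::vector<int> nbrs(adj[u].begin(), adj[u].end());
            for (int v : nbrs) {
                if (v <= u) continue;            // handle each edge {u, v} once
                bool hasCommonNeighbor = false;
                for (int w : adj[u])
                    if (w != v && adj[v].count(w)) { hasCommonNeighbor = true; break; }
                if (!hasCommonNeighbor) {
                    adj[u].erase(v);
                    adj[v].erase(u);
                    ++deletions;                 // Rule 3: k := k - 1
                }
            }
        }
        return deletions;
    }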
Interestingly, Rules 2 and 3 yield a linear-time algorithm for HIGHLY CONNECTED DELETION on graphs of maximum degree three, which together with Theorem 1 shows a complexity dichotomy with respect to the maximum degree.

Theorem 2: HIGHLY CONNECTED DELETION can be solved in linear time when the input graph has degree at most three.

Proof: We first apply Rule 3. This reduction rule can be applied in one pass since an edge that is in a triangle is never deleted by this rule. Consequently, the application of this rule does not produce new vertices u and v to which this rule applies. Hence, Rule 3 can be exhaustively applied in O(n + m) time: for each edge in G we examine the neighborhoods of its endpoints; since G has maximum degree three, this neighborhood has constant size. Next, we apply Rule 2, which can also be performed in linear time. After the application of this rule, G is reduced with respect to both rules.

Consider a connected component in G. We show that this component contains only four vertices. Let {u, v} be some edge in this connected component. Since G is reduced with respect to Rule 3, there is a vertex w that is a common neighbor of u and v. Since G is reduced with respect to Rule 2, one of these three vertices, say v, has a further neighbor x. Now, x has a common neighbor with v, say u. The connected component does not contain any further vertices: First, u and v can have no further neighbors since G has maximum degree three. Second, w and x cannot be adjacent, since then the connected component is a clique of order four, which contradicts that G is reduced with respect to Rule 2. Finally, neither x nor w has a further neighbor, since this neighbor would have to be adjacent to either u or v, which already have degree three. Hence, each remaining connected component can be solved in constant time.
The next two data reduction rules are concerned with finding vertex sets that have a small edge cut. For S ⊆ V, we use D(S) := {{u, v} ∈ E | u ∈ S and v ∈ V \ S} to denote the set of edges outgoing from S, that is, the edge cut of S.

The idea behind the next reduction rule is to find vertex sets that cannot be separated by at most k edge deletions. We call two vertices u and v inseparable if the minimum edge cut between u and v is larger than k. Analogously, a vertex set S is inseparable if all vertices in S are pairwise inseparable.

Rule 4: If G contains a maximal inseparable vertex set S of size at least 2k, then do the following. If G[S] is not highly connected, then there is no solution of size at most k. Otherwise, remove S from G and set k := k − |D(S)|.
Lemma 4: Rule 4 preserves optimal solvability and can be exhaustively applied in O(n² · mk log n) time.

Proof: We show that the rule produces equivalent instances. First, assume that (G, k) is a yes-instance. Clearly, an inseparable vertex set S has to be a subset of a cluster C in the solution graph. Now, since |S| ≥ 2k, there can be no vertex in C \ S: Assume that C contains such a vertex. Then, by the maximality of S, the graph G[C] has an edge cut of size at most k. Since |C| > 2k, this means that G[C] is not highly connected. Hence, S is a cluster in the solution and thus G[S] is highly connected. The rule performs precisely the edge deletions needed to cut S from V \ S and reduces the parameter accordingly. Hence, it produces a yes-instance.

Now, if the instance is a no-instance, then the rule either returns "no" or performs some edge deletions and reduces the parameter accordingly. This cannot transform a no-instance into a yes-instance.

We now describe how to achieve an exhaustive application of the rule in the described running time. First, build a so-called Gomory–Hu tree in O(n² · m log n) time [30, 31]. This tree has n vertices, and the set of weighted edges represents all pairwise min-cuts. A maximal inseparable vertex set can be found by deleting all edges that have weight at most k. Once this set has been identified, the application of the reduction rule can be performed in O(m) time. Since the reduction rule can be applied at most k times (it either answers "no" or reduces k), the overall running time follows.
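For a single vertex pair, inseparability can also be tested directly with a unit-capacity max-flow computation that stops as soon as the flow exceeds k. The following self-contained C++ sketch (ours; the procedure described in the proof uses a Gomory–Hu tree instead, which covers all pairs at once) illustrates the test with Edmonds–Karp augmentation:

    #include <queue>
    #include <utility>
    #include <vector>

    struct Arc { int to, cap; };

    // Sketch: are s and t inseparable, i.e., is every edge cut between
    // them larger than k? Each undirected edge becomes two antiparallel
    // unit-capacity arcs; arc i and arc i^1 serve as each other's residual.
    bool inseparable(int n, const std::vector<std::pair<int,int>>& edges,
                     int s, int t, int k) {
        std::vector<Arc> arcs;
        std::vector<std::vector<int>> out(n);
        for (auto [u, v] : edges) {
            out[u].push_back((int)arcs.size()); arcs.push_back({v, 1});
            out[v].push_back((int)arcs.size()); arcs.push_back({u, 1});
        }
        int flow = 0;
        while (flow <= k) {
            std::vector<int> pre(n, -1);       // arc used to reach each vertex
            pre[s] = -2;
            std::queue<int> q; q.push(s);
            while (!q.empty() && pre[t] == -1) {
                int u = q.front(); q.pop();
                for (int i : out[u])
                    if (arcs[i].cap > 0 && pre[arcs[i].to] == -1) {
                        pre[arcs[i].to] = i;
                        q.push(arcs[i].to);
                    }
            }
            if (pre[t] == -1) break;           // no augmenting path: flow is maximum
            for (int v = t; v != s; v = arcs[pre[v] ^ 1].to) {
                arcs[pre[v]].cap -= 1;         // push one unit along the path
                arcs[pre[v] ^ 1].cap += 1;
            }
            ++flow;
        }
        return flow > k;                       // min cut between s and t exceeds k
    }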
Note that a highly connected graph of order at least 2k forms an inseparable vertex set. Hence, after exhaustive application of Rule 4, every cluster has bounded size. While Rule 4 identifies clusters that are large with respect to k, Rule 5 identifies clusters that are large compared to their neighborhood.

Rule 5: If G contains a vertex set S such that |S| ≥ 4, G[S] is highly connected, and |D(S)| ≤ 0.3 · √|S|, then remove S from G and set k := k − |D(S)|.

Lemma 5: Rule 5 preserves optimal solvability and can be exhaustively applied in O(n² · mk log n) time.

Proof: We show that there is an optimal solution in which S is a cluster. To this end, suppose that there is an optimal solution which produces some clusters C_1, ..., C_q that contain vertices from S and vertices from V \ S. We show how to transform this solution into one that has S as a cluster and needs at most as many edge deletions.

First, we bound the overall size of the C_i's. Note that deleting all edges between S and ∪_{1≤i≤q} (C_i \ S) cuts each C_i. By the condition of the rule, such a cut has at most 0.3√|S| edges. Since each G[C_i] is highly connected, this implies that Σ_{1≤i≤q} |C_i| < 0.6√|S|.

Now, transform the solution at hand into another solution as follows. Make S a cluster, that is, undo all edge deletions within S and delete all edges in D(S), and for each C_i, delete all edges in G[C_i \ S]. This is indeed a valid solution since G[S] is highly connected, and all other vertices that are in "new" clusters are now in singleton clusters.

We now compare the number of edge modifications for both edge deletion sets and show that the new solution needs fewer edge modifications. To this end, we consider each vertex u ∈ S that is contained in some C_i. On the one hand, since G[S] is highly connected, and since there is at least some v ∈ S that is not contained in any C_i, we undo at least |S|/2 edge deletions between vertices of S. On the other hand, an additional number of up to 0.3√|S| + C(⌊0.6√|S|⌋, 2) edge deletions may be necessary to cut all the C_i's from S and to delete all edges in each G[C_i \ S], where C(a, 2) = a(a − 1)/2 counts vertex pairs. By the preconditions of the rule we have √|S| ≤ |S|/2, and thus the overall number of saved edge modifications for u is at least

  |S|/2 − 0.3√|S| − C(⌊0.6√|S|⌋, 2) > |S|/2 − 0.6|S|/2 − 0.36|S|/2 > 0.   (1)

Hence, the number of undone edge modifications is larger than the number of new edge modifications. Consequently, S is a cluster in every optimal solution.

The running time can be bounded analogously to the running time of Rule 4. The only difference is that after constructing the Gomory–Hu tree, one can find a vertex set that fulfills the conditions of the rule by trying all O(m) possibilities for "guessing" |S|/2. Assuming the correct guess, deleting all edges with weight at most |S|/2 in the tree produces one connected component that is exactly S.
Theorem 3: HIGHLY CONNECTED DELETION can be reduced in O(n² · mk log n) time to an equivalent instance, called problem kernel, with at most 10 · k^{1.5} vertices.

Proof: Let I = (G, k) be an instance that is reduced with respect to Rules 2, 4, and 5. We show that every yes-instance has at most 10 · k^{1.5} vertices. Hence, we can answer "no" for all larger instances.

Assume that I is a yes-instance and let C_1, ..., C_q denote the clusters of a solution. Since I is reduced with respect to Rule 4, we have |C_i| ≤ 2k for each C_i. Furthermore, for every C_i we have |D(C_i)| ≥ 0.3√|C_i| since I is reduced with respect to Rules 2 and 5. In other words, every cluster C_i "needs" at least 0.3√|C_i| edge deletions. Hence, the overall instance size is at most

  max { Σ_{i=1}^{q} c_i : (c_1, ..., c_q) ∈ ℕ^q, ∀i ∈ {1, ..., q}: c_i ≤ 2k, Σ_{1≤i≤q} 0.3 · √c_i ≤ 2k }.

A simple calculation shows that there is an assignment to the c_i's maximizing the sum such that at most one c_i is smaller than 2k. Hence, the sum is maximized when a maximum number of c_i's have value 2k. Each of the corresponding clusters is incident with at least 0.3√(2k) edge deletions. Hence, there are at most 2k/(0.3√(2k)) = 10√(2k)/3 such clusters, which together contain at most 2k · 10√(2k)/3 = (20√2/3) · k^{1.5} < 10 · k^{1.5} vertices. The overall instance size follows.
3.2 Fixed-Parameter Algorithm

We present a fixed-parameter algorithm solving HIGHLY CONNECTED DELETION in 3^{4k} · n^{O(1)} time. Fixed-parameter algorithms have also been given for related clustering problems; the best known fixed-parameter algorithm for CLUSTER DELETION (after a long line of improvements) runs in 1.415^k · n^{O(1)} time [32], and the best known fixed-parameter algorithm for 2-CLUB DELETION runs in 2.74^k · n^{O(1)} time [17].

The main idea of our algorithm is to branch until each connected component has diameter at most two and then solve these instances by a dynamic programming algorithm. The details are as follows.

Since each highly connected graph has diameter at most two [5], we can perform the following branching rule.

Branching Rule 1: If a connected component in G has diameter three or more, then find two vertices u and v at distance three and pick an arbitrary shortest path P = (u, x, y, v) between u and v. Branch into the three possibilities of destroying P by deleting either {u, x}, {x, y}, or {y, v}. In each recursive branch, set k := k − 1.
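Finding such a pair, and hence the three candidate edges, takes one BFS per vertex; the following C++ sketch (ours) returns the edges of an arbitrary shortest path between two vertices at distance exactly three, or nothing if every component already has diameter at most two.

    #include <queue>
    #include <utility>
    #include <vector>

    // Sketch: find a shortest path P = (u, x, y, v) with dist(u, v) = 3
    // and return its three edges, as required by Branching Rule 1.
    std::vector<std::pair<int,int>> branchingEdges(
            const std::vector<std::vector<int>>& adj) {
        const int n = (int)adj.size();
        for (int u = 0; u < n; ++u) {
            std::vector<int> dist(n, -1), parent(n, -1);
            std::queue<int> q;
            dist[u] = 0; q.push(u);
            while (!q.empty()) {
                int a = q.front(); q.pop();
                if (dist[a] >= 3) break;     // distances beyond three are irrelevant
                for (int b : adj[a])
                    if (dist[b] == -1) {
                        dist[b] = dist[a] + 1;
                        parent[b] = a;
                        q.push(b);
                    }
            }
            for (int v = 0; v < n; ++v)
                if (dist[v] == 3) {
                    const int y = parent[v], x = parent[y];
                    return { {u, x}, {x, y}, {y, v} };   // the three branching edges
                }
        }
        return {};   // every connected component has diameter at most two
    }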
The rule is obviously correct in the sense that at least one of the three edges has to be deleted. Now assume that the branching rule no longer applies, that is, each connected component has diameter two. If the graph is also highly connected, then we are done. Otherwise, we can apply Rule 4 to obtain an instance that is small compared to k, as shown by the following lemma.

Lemma 6: Let I = (G, k) be an instance of HIGHLY CONNECTED DELETION such that G has diameter two and I is reduced with respect to Rule 4. Then G has at most 4k vertices.

Proof: Consider a solution for I. Since G has diameter two, there is at most one cluster in this solution that has vertices that are not incident with an edge deleted by the solution (call these vertices unaffected): if there are two clusters C_i and C_j with unaffected vertices u and v, then these vertices are within distance at least three (u is not adjacent to a neighbor of v). Let C_1 be the, possibly empty, cluster that has unaffected vertices. Then, since I is reduced with respect to Rule 4, C_1 has at most 2k vertices. Since all other vertices are affected, there are at most 2k further vertices. The overall size of the instance follows.
In the following lemma, we describe an algorithm that solves HIGHLY CONNECTED DELETION for arbitrary (not necessarily diameter-two) instances. The main trick in the fixed-parameter algorithm is that, with the above lemma, this running time becomes single-exponential in the parameter k.

Lemma 7: HIGHLY CONNECTED DELETION can be solved in O(3^n · m) time.

Proof: We describe a dynamic programming algorithm. The idea of the algorithm is that if V can be two-partitioned into V_1 and V_2 such that all clusters are subsets of either V_1 or V_2, then we can obtain the overall solution by combining best solutions for the induced subgraphs G[V_1] and G[V_2]. The details are as follows.

We build a dynamic programming table T with entries of the type T[V′], V′ ⊆ V, which store an optimal solution of HIGHLY CONNECTED DELETION for G[V′]. The table is initialized by setting T[V′] = 0 for each V′ ⊆ V such that G[V′] is highly connected. The remaining entries can be computed by the recurrence

  T[V′] = min_{V_1 ∪̇ V_2 = V′} ( T[V_1] + T[V_2] + |{ {u, v} ∈ E : u ∈ V_1, v ∈ V_2 }| ).   (2)

The third summand is exactly the number of edges needed to cut V_1 from V_2 in G. After all entries have been computed, T[V] stores the number of edge deletions needed for obtaining a highly connected cluster graph; an actual clustering can be obtained by a traceback. The correctness of the recurrence follows from the discussion above.

The running time can be bounded as follows. For each table entry, the initialization can be performed in O(m) time, leading to an overall time of O(2^n · m) for this part of the algorithm. In the second part of the algorithm, an overall number of O(3^n) recurrences have to be evaluated: each partition of V′ into V_1 and V_2 uniquely defines a three-partition of V into V \ V′, V_1, and V_2. Since the number of edges needed for the third summand can be counted in O(m) time, the overall running time follows.
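For the small instances left after branching and Rule 4 (at most 4k vertices by Lemma 6), the table T can be realized directly over bitmask-encoded vertex subsets; enumerating all submask splits of all subsets yields the 3^n behavior. A self-contained C++ sketch (ours; practical only up to roughly n = 20, and treating singletons as unclustered at cost zero, following the convention of Section 1):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    static int popcnt(std::uint32_t x) { return __builtin_popcount(x); } // GCC/Clang

    // Base case of the table: G[S] may form a cluster at cost 0 if it is
    // highly connected; a singleton may stay unclustered at cost 0 as well.
    static bool zeroCostSet(const std::vector<std::uint32_t>& adjMask,
                            std::uint32_t S) {
        const int size = popcnt(S);
        if (size == 1) return true;
        if (size == 2) return false;            // isolated edges are forbidden
        for (std::size_t v = 0; v < adjMask.size(); ++v)
            if (((S >> v) & 1) && 2 * popcnt(adjMask[v] & S) <= size)
                return false;                   // some degree is at most |S|/2
        return true;
    }

    // Dynamic program of Lemma 7; adjMask[v] is the neighbor bitmask of v.
    int solveHighlyConnectedDeletion(const std::vector<std::uint32_t>& adjMask) {
        const int n = (int)adjMask.size();
        const std::uint32_t full = (std::uint32_t(1) << n) - 1;
        std::vector<int> T(full + 1, 1 << 29);
        T[0] = 0;
        for (std::uint32_t S = 1; S <= full; ++S) {
            if (zeroCostSet(adjMask, S)) { T[S] = 0; continue; }
            // recurrence (2): try all two-partitions S = A "dot-union" B
            for (std::uint32_t A = (S - 1) & S; A > 0; A = (A - 1) & S) {
                const std::uint32_t B = S ^ A;
                int cut = 0;                    // edges between A and B
                for (int v = 0; v < n; ++v)
                    if ((A >> v) & 1) cut += popcnt(adjMask[v] & B);
                T[S] = std::min(T[S], T[A] + T[B] + cut);
            }
        }
        return T[full];    // minimum number of edge deletions for the graph
    }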
Combining the above two lemmas with the branching rule, we obtain the main result of this section.

Theorem 4: HIGHLY CONNECTED DELETION can be solved in O(3^{4k} · k² + n²mk · log n) time.

Proof: The algorithm performs Rules 2 and 4 and the branching rule as long as possible. By Lemma 6, the remaining instances have at most 4k′ vertices for some k′ ≤ k. Using Lemma 7, these instances can be solved in O(3^{4k′} · k²) time. The overall running time follows from a simple worst-case analysis; we omit the details.

The running time given by Theorem 4 is impractical. Moreover, the algorithm relies partially on dynamic programming, which has the disadvantage that, compared to branching algorithms, its average-case running time often comes close to its worst-case running time. We conjecture that a running time improvement using only branching algorithms is possible.
4 PRACTICAL ALGORITHMS

The fixed-parameter tractability results for HIGHLY CONNECTED DELETION (Section 3.2) are of purely theoretical nature. Hence, we now follow an algorithmic approach that consists of two main steps: First, apply a set of data reduction rules that exploit the structure of biological networks and yield a new instance that is significantly smaller than the original one. Second, formulate the problem as an integer linear program (ILP) and apply it to the new, smaller instance.⁵ The resulting ILP can be solved using sophisticated ILP solvers, which are very efficient.

5. ILP formulations have proved useful for attacking other hard graph problems involving dense subgraphs, such as the computation of edge-weighted quasi-bicliques [33] or CLUSTER EDITING [22].

4.1 Further Data Reduction

As we demonstrate in the computational experiments presented in Section 5, Rule 3 tremendously simplifies many real-world input instances. In particular, as shown by Theorem 2, it is useful for reducing vertices of small degree, as found in protein interaction networks. However, Rules 4 and 5, which produce a kernel, have the downside of requiring relatively large substructures that are unlikely to exist in our inputs. In contrast, we noted that Rule 3 leaves behind many degree-2 and degree-3 vertices. Thus, we additionally devised the following rule.

We try to identify triangles uvw that form a highly connected cluster in at least one optimal solution. For a triangle edge {x, y}, let N_xy := (N(x) ∩ N(y)) \ {u, v, w} be the common neighbors of the edge's endpoints outside the triangle. Let the value of an edge e be 3 if N_e ≠ ∅ and 0 otherwise. Let the value of a vertex x be the size of the largest connected component in G[N(x) \ {u, v, w}], or 0 if this size is 1.
Rule 6: Assume that for a triangle uvw the following conditions hold:
• there is no vertex connected to all three of u, v, and w;
• the set N_uv ∪ N_uw ∪ N_vw does not contain an edge;
• for any {x, y, z} = {u, v, w}, the value of {x, y} plus the value of z is at most three;
• the sum of the values of u, v, and w is at most three.
Then isolate the triangle by deleting all edges incident with u, v, and w except the triangle edges.

Proof (preservation of optimality): By case distinction: if the triangle is not a solution cluster, then it must be part of a larger cluster, or its vertices are divided into two or three clusters. The conditions ensure that none of these situations yields a better solution than isolating the triangle.
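For concreteness, the two value computations can be implemented as follows (our C++ sketch; adjacency sets assumed, with {u, v, w} the triangle under consideration):

    #include <algorithm>
    #include <set>
    #include <vector>

    // Value of a triangle edge {x, y}: 3 if x and y share a neighbor
    // outside the triangle, and 0 otherwise.
    int edgeValue(const std::vector<std::set<int>>& adj,
                  int x, int y, int u, int v, int w) {
        for (int z : adj[x])
            if (z != u && z != v && z != w && adj[y].count(z))
                return 3;
        return 0;
    }

    // Value of a triangle vertex x: size of the largest connected
    // component of G[N(x) \ {u, v, w}], or 0 if this size is 1.
    int vertexValue(const std::vector<std::set<int>>& adj,
                    int x, int u, int v, int w) {
        std::set<int> unvisited;
        for (int z : adj[x])
            if (z != u && z != v && z != w) unvisited.insert(z);
        int best = 0;
        while (!unvisited.empty()) {
            std::vector<int> stack{ *unvisited.begin() };
            unvisited.erase(unvisited.begin());
            int size = 0;
            while (!stack.empty()) {            // DFS within N(x) \ {u, v, w}
                int a = stack.back(); stack.pop_back(); ++size;
                for (int b : adj[a])
                    if (unvisited.count(b)) {
                        unvisited.erase(b);
                        stack.push_back(b);
                    }
            }
            best = std::max(best, size);
        }
        return best == 1 ? 0 : best;
    }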
4.2 Integer Linear Programming: Column Generation

We now consider integer linear programming (ILP) based approaches. With these, we can utilize the decades of engineering that went into commercial solvers like CPLEX or Gurobi to tackle large instances. Our main approach is somewhat involved due to the use of column generation. We additionally tried a more straightforward approach based on a CLIQUE PARTITIONING formulation and row generation [22, 34]. However, it performed poorly compared to the column generation, and we omit a detailed comparison.

We describe an ILP formulation of HIGHLY CONNECTED DELETION, which in its basic scheme is similar to those of Mehrotra and Trick [35] and Ji and Mitchell [36] for constrained CLIQUE PARTITIONING and that of Aloise et al. [19] for modularity maximization; however, we need a new approach for solving the column generation subproblem. Let 𝒯 be the set of all vertex sets that induce a highly connected subgraph. We use binary variables z_T to indicate that the cluster T ∈ 𝒯 is part of the solution. Then the model is

  maximize   Σ_{T ∈ 𝒯} c_T z_T,                            (3)
  subject to Σ_{T ∈ 𝒯: u ∈ T} z_T = 1   for all u ∈ V,     (4)
             z_T ∈ {0, 1}   for all T ∈ 𝒯,                 (5)

where c_T is the number of edges in the subgraph induced by T. The objective (3) maximizes the number of edges within clusters, which is equivalent to minimizing the number of inter-cluster edges (deletions). The constraints of type (4) ensure that each vertex is contained in exactly one cluster.
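As a minimal illustration of the master problem (not the paper's CPLEX integration), the following C++ sketch writes model (3)–(5) for a given pool of candidate clusters in the solver-independent LP file format; function and variable names are ours.

    #include <cstdio>
    #include <vector>

    // Emit the master problem for clusters[t] (vertex lists) with
    // objective coefficients cost[t] = c_T (edges induced by cluster t).
    void writeMasterLP(std::FILE* f, int n,
                       const std::vector<std::vector<int>>& clusters,
                       const std::vector<int>& cost) {
        std::fprintf(f, "Maximize\n obj:");
        for (std::size_t t = 0; t < clusters.size(); ++t)
            std::fprintf(f, " %+d z%zu", cost[t], t);        // objective (3)
        std::fprintf(f, "\nSubject To\n");
        for (int u = 0; u < n; ++u) {                        // constraints (4)
            std::fprintf(f, " v%d:", u);
            for (std::size_t t = 0; t < clusters.size(); ++t)
                for (int x : clusters[t])
                    if (x == u) std::fprintf(f, " + z%zu", t);
            std::fprintf(f, " = 1\n");
        }
        std::fprintf(f, "Binaries\n");                       // constraints (5)
        for (std::size_t t = 0; t < clusters.size(); ++t)
            std::fprintf(f, " z%zu", t);
        std::fprintf(f, "\nEnd\n");
    }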
Due to the large number of variables, this model cannot be solved directly except for tiny instances. Thus, the idea is to consider only "relevant" variables. More precisely, we start with an initial set of z_T variables that yields a feasible solution (e.g., all singleton clusters). Then we successively add variables ("columns") that improve the objective, until this is no longer possible. Due to the structure of real-world instances, typically only a small subset of the possible variables needs to be added.

More precisely, we try to add variables that improve the relaxed model, where constraint (5) is replaced by z_T ≥ 0. We thus first compute an optimal solution to the relaxed model. If we can find an improving column, then we add this column and compute a new solution for the relaxed model. If we cannot find an improving column, then there is no cluster that can improve the current fractional solution. If this solution is "by chance" integral, then we have found an optimal solution also for the model with integrality constraints. Otherwise, as suggested by Mehrotra and Trick [35], we use Ryan–Foster branching: we find two vertices such that there is both a fractional cluster containing both vertices and a fractional cluster containing exactly one of them, and branch into the two cases that the two vertices are together in a cluster or that they are in different clusters.

The improvement of adding a column for cluster T in the relaxed model is c_T minus the contribution of the vertices in T to the objective function. This contribution for some vertex u can be calculated as the value of the dual variable λ_u for the corresponding constraint of type (4) in the relaxation (see, e.g., Aloise et al. [19] for details). The values of the dual variables can be easily calculated by a linear programming solver. Thus, we need to find a cluster T that maximizes c_T − Σ_{u ∈ T} λ_u. In other words, we need to find a highly connected subgraph that maximizes the number of edges minus the vertex weights. For this, we again use an ILP formulation, using binary edge variables e_uv and binary vertex variables v_u to describe the selected cluster, and a positive integral variable d to describe the cluster size:
  maximize   Σ_{{u,v} ∈ E} e_uv − Σ_{u ∈ V} λ_u v_u,                    (6)
  subject to d = Σ_{u ∈ V} v_u,                                          (7)
             e_uv ≤ v_u and e_uv ≤ v_v   for all {u, v} ∈ E,             (8)
             if v_u = 1 then Σ_{v ∈ N(u)} e_uv > d/2   for all u ∈ V,    (9)

where constraint (9) can be linearized using the big-M method (that is, by adding M(1 − v_u) on the left-hand side for a sufficiently large constant M); in our implementation, we instead use indicator constraints as supported by CPLEX.
To get a speedup, we can exploit the fact that it is not necessary to find a maximally improving column. Therefore, we can solve the column generation problem heuristically, and only solve it optimally using the ILP when no improving solution was found. As heuristic, we use a simple greedy method that, starting from each vertex, repeatedly adds the vertex that maximizes the value of the cluster, and records the best cluster that was highly connected. If this fails, we try local search: removing and adding up to two vertices, respectively, to a known cluster. Further, we abort solving the column generation ILP as soon as an improving solution is found. Also, to find a good set of disjoint clusters more quickly, we initially scale the dual variables by a factor of 5 until no improving clusters can be found with this factor.
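One possible rendering of the greedy part in C++ (ours; unoptimized, with lambda holding the dual values, so that the value of a cluster T is c_T − Σ_{u ∈ T} λ_u and any highly connected cluster of positive value is an improving column):

    #include <set>
    #include <vector>

    static bool highlyConnectedSet(const std::vector<std::set<int>>& adj,
                                   const std::set<int>& T) {
        if (T.size() < 3) return false;
        for (int v : T) {
            int deg = 0;
            for (int u : T) deg += adj[v].count(u);
            if (2 * deg <= (int)T.size()) return false;
        }
        return true;
    }

    struct Candidate { std::vector<int> vertices; double value; };

    // Greedy pricing sketch: grow a cluster from every seed vertex,
    // always adding the vertex with maximum gain (edges into the cluster
    // minus its dual value), and remember the best highly connected
    // cluster seen. An empty result means: fall back to the pricing ILP.
    Candidate greedyPricing(const std::vector<std::set<int>>& adj,
                            const std::vector<double>& lambda) {
        const int n = (int)adj.size();
        Candidate best{{}, 0.0};               // only positive values improve
        for (int s = 0; s < n; ++s) {
            std::set<int> T{ s };
            double value = -lambda[s];
            while ((int)T.size() < n) {
                int bestV = -1; double bestGain = -1e18;
                for (int v = 0; v < n; ++v) {
                    if (T.count(v)) continue;
                    int links = 0;
                    for (int u : T) links += adj[v].count(u);
                    const double gain = links - lambda[v];
                    if (gain > bestGain) { bestGain = gain; bestV = v; }
                }
                T.insert(bestV);
                value += bestGain;
                if (value > best.value && highlyConnectedSet(adj, T))
                    best = { std::vector<int>(T.begin(), T.end()), value };
            }
        }
        return best;
    }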
4.3 Neighborhood Heuristic

One drawback of the method by Hartuv and Shamir [5] is that it uses a minimum cut routine, which has a fairly high worst-case time complexity and is difficult to implement. We suggest a simpler alternative heuristic. Recall that Rule 3 from Section 3.1 says that an edge whose two endpoints have no common neighbor can be deleted. The idea is then to greedily delete, in a connected component that is not yet highly connected, the edge whose endpoints have the fewest common neighbors. We additionally weigh this by the vertex degree, that is, we delete the edge {u, v} with x := |N(u) ∩ N(v)| that minimizes

  min{ x / deg(u), x / deg(v) }.   (10)
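A C++ sketch of the selection step (ours); the surrounding loop deletes the returned edge and continues on the arising components until each of them is highly connected:

    #include <algorithm>
    #include <set>
    #include <utility>
    #include <vector>

    // Select the edge {u, v} minimizing min{x/deg(u), x/deg(v)}, where
    // x is the number of common neighbors of u and v (equation (10)).
    std::pair<int, int> edgeToDelete(const std::vector<std::set<int>>& adj) {
        std::pair<int, int> best(-1, -1);
        double bestScore = 1e18;
        for (int u = 0; u < (int)adj.size(); ++u)
            for (int v : adj[u]) {
                if (v <= u) continue;               // consider each edge once
                int x = 0;                          // common neighbors of u, v
                for (int w : adj[u])
                    if (w != v && adj[v].count(w)) ++x;
                const double score = std::min(x / (double)adj[u].size(),
                                              x / (double)adj[v].size());
                if (score < bestScore) { bestScore = score; best = {u, v}; }
            }
        return best;    // (-1, -1) if the graph has no edges
    }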
5 EXPERIMENTAL EVALUATION

We implemented the data reduction in OCaml and the ILPs in C++ using the CPLEX 12.5.1 ILP solver. For the minimum cut subroutine of the min-cut method, a highly optimized implementation in C was used [37]. Our source code and sample instances are available at http://fpt.akt.tu-berlin.de/hcd/. The test machine is a 4-core 3.6 GHz Intel Xeon E5-1620 (Sandy Bridge-E) with 10 MB L3 cache and 64 GB main memory, running under Debian GNU/Linux 7.0. CPLEX is allowed to use up to 8 threads; we report wall-clock times.

We used protein interaction networks available at the BioGRID repository [38] (release 3.2.101 from May 25th, 2013). The species for which we illustrate our results are S. pombe, C. elegans, M. musculus, and A. thaliana. For each species, we extracted one network with physical interactions only, and one with all interactions. Table 1 shows some basic properties of these networks.

TABLE 1
Instance properties and data reduction results. Here, K is the number of connected components, n′ and m′ are the number of vertices and edges in the largest connected component, respectively, k̂ is the number of edges deleted during data reduction, k̂[%] := k̂/k is the relative number of edges deleted during data reduction, K′ is the number of connected components after data reduction, and n″ and m″ are the number of vertices and edges in the largest connected component after data reduction, respectively.

              n      m     K    n′     m′      k̂  k̂[%]    K′    n″     m″
SP phys.   1963   4772    33  1875   4709   2398   62.9  1249   611   2173
SP all     3735  51620     9  3716  51608   6446   13.0   959  2765  45156
CE phys.   3176   5465   101  2926   5314   4593   88.6  2844   268    747
CE all     3866   7707    73  3686   7599   5665   77.7  3292   542   1991
MM phys.   7354  14509   142  6983  14274  10693   79.6  6107  1115   3604
MM all     7414  14687   146  7037  14449  10757   79.1  6133  1159   3733
AT phys.   5999  13571   118  5727  13407   9388   79.0  4730  1009   3658
AT all     6038  13680   124  5739  13490   9396   78.5  4744  1042   3771
5.1 Data Reduction and Running Time

Table 1 shows the effect of the data reduction. Knowing the optimal k (see Table 2) allows us to state that typically 75% of the edges that need to be deleted are identified. Since connected components can be treated separately, the most important factor for the running time is the size of the largest connected component (m″). Here, the number of edges is reduced to typically 30%. This demonstrates the effectiveness of the data reduction, which preserves exact solvability, and suggests it should be applied regardless of the actual solution method that follows.

Table 2 summarizes the clustering results and running times. Applying data reduction before running the min-cut method actually improves the running time, since it reduces the number of costly min-cut calls. The neighborhood method finds many more (mostly smaller) clusters than the min-cut method, and overall deletes fewer edges. However, the largest clusters it produces tend to be smaller. The column generation method is able to solve all but one test instance, although the hardest one takes almost 5 hours. It is not able to solve the network of all interactions of S. pombe within 32 hours; this is probably because this is a denser network, making data reduction less effective. Ryan–Foster branching is necessary for all instances, but the search tree is typically small, the largest having 37 nodes.

TABLE 2
Results for the instances of Table 1. Here, k is the number of edges deleted, u is the number of unclustered vertices, K is the number of clusters, n and m are the number of vertices and edges in the largest cluster, respectively, and t is the running time in seconds. The column generation could not solve SP-a.

        min-cut without DR          min-cut with DR             neighborhood with DR        column generation with DR
        k     u    K   n  m    t    k     u    K   n  m    t    k     u    K   n  m    t    k     u    K   n  m     t
SP-p    4324  1839 17  17 96   16   4165  1699 62  17 96   2    3961  1548 101 15 71   2    3811  1495 108 17 96    102
SP-a    50343 3663 4   63 1268 526  50331 3651 8   63 1268 214  49514 3263 95  60 1175 3491 —     —    —   —  —     —
CE-p    5437  3159 4   7  16   56   5268  3040 37  9  30   1    5215  2984 56  9  30   1    5184  2960 62  9  30    34
CE-a    7613  3849 1   17 94   93   7491  3758 27  17 94   3    7382  3664 55  15 78   4    7295  3625 63  19 113   149
MM-p    14413 7316 9   13 69   1198 14078 7072 80  13 50   15   13636 6718 185 13 69   15   13428 6600 209 13 67    2190
MM-a    14591 7376 9   13 69   1253 14265 7141 77  13 50   15   13791 6762 189 13 69   16   13591 6655 209 13 67    2458
AT-p    13009 5843 25  23 190  602  12497 5478 131 23 190  10   12119 5249 191 22 178  10   11885 5142 209 23 190   16721
AT-a    13121 5885 24  23 190  616  12613 5523 129 23 190  10   12222 5291 189 22 178  10   11972 5170 211 23 190   10536
In Fig. 2, we examine the trade-off between the solution size k and the running time for the min-cut method, the neighborhood method, and the column generation with a per-connected-component time limit. We see that data reduction speeds up the first two methods a lot, and also improves the solution quality, in particular for the min-cut heuristic. The column generation gets an almost optimal result very quickly, and finds the optimal solution after about 8 minutes (however, proving optimality takes several hours). Data reduction for column generation consistently improves running time, for example for CE-p by a factor of 24.

Fig. 2. Running time trade-off for the A. thaliana network: relative error for k (%) as a function of running time (min) for column generation, min-cut, min-cut + DR, neighborhood, and neighborhood + DR.
5.2 Biological evaluation

For the biological evaluation, we studied the A. thaliana network with all interactions in more detail. For the computation of the enrichment of annotation terms, we used the GOstats package [39] of Bioconductor with A. thaliana annotation data from the TAIR database [40]. The computed p-values are corrected for multiple hypothesis testing. We used a significance threshold of p ≤ 0.01. Our findings are summarized in Fig. 3. Solving HIGHLY CONNECTED DELETION exactly produces more clusters than using the min-cut algorithm with data reduction, which in turn produces more clusters than the min-cut algorithm without data reduction. This behavior can be observed for small and for large clusters.

Fig. 3. Clusters in the A. thaliana network with all interactions, by cluster size, for min-cut, min-cut + DR, neighborhood + DR, and column generation. The darker part of each bar shows the fraction of clusters with significant enrichment of biological process annotation terms.

To assess the biological relevance of these clusters, we determined for each cluster whether the corresponding protein set has a statistically significant enrichment of annotations describing processes in which the proteins take part. As shown in Fig. 3, for all methods a large portion of the clusters shows such an enrichment. The min-cut algorithm with data reduction clearly outperforms the min-cut algorithm without data reduction: it produces more clusters without producing a larger fraction of nonenriched clusters. For the neighborhood heuristic and the exact algorithm the results are less clear: they produce even more clusters, but a larger fraction is nonenriched. This behavior is particularly pronounced for small clusters of size at most three, but also occurs for some larger cluster sizes. A possible explanation could be as follows. By minimizing the number of deletions, some of the small reported clusters, for instance triangles,
contain some nodes of very low degree and one high-degree node that is "by chance" not included in any large cluster. The resulting cluster is unlikely to have similar GO annotations across all three proteins. We discuss a possibility to counter this behavior in the conclusion.
Comparison with Markov Clustering

Next, we compare our clustering algorithm with a popular clustering algorithm for protein interaction networks. As comparison, we choose the so-called Markov Clustering Algorithm (MCL), which was shown to outperform several other clustering algorithms on protein interaction networks [41]. For details concerning MCL, refer to [42]; in the experiments, we used the MCL implementation available at http://micans.org/mcl/ (version 12-135). One parameter that can be set when using MCL is the "inflation" I. We performed experiments with the default value of I = 2.0 and with I = 3.0, which produces a more fine-grained clustering (as does our algorithm). Unless stated otherwise, we use MCL to refer to the algorithm with the default setting.

When comparing the two algorithms, our exact approach (in the following referred to as HCD) and the MCL algorithm, there are some clear advantages of the MCL algorithm: MCL finishes in less than a second, MCL assigns almost all proteins to clusters, and MCL produces more clusters than HCD. MCL also produces larger clusters than HCD. For instance, it finds 38 clusters of size more than 20, and the largest cluster has size 280. As shown in Fig. 4, the number of produced clusters is higher across all cluster sizes. The fraction of clusters whose proteins share a significantly enriched GO annotation term, however, is for small and medium-size clusters much lower in the clustering produced by MCL than in the clustering produced by HCD. Hence, MCL places many more vertices in enriched clusters than the other methods, but the average cluster quality is much lower (see Table 3).

In order to assess how informative the clusters provided by the algorithms are, we examined the nature of the relevant enriched terms more closely. For instance, the largest cluster produced by MCL has 280 proteins, and the annotation term with the lowest p-value was 'transport', shared by 30% of the proteins, followed by 'localization'. In contrast, the largest cluster produced by HCD only has size 23. The term with the lowest p-value is 'negative regulation of cyclin-dependent protein kinase activity', shared by 21 of 23 proteins.
Fig. 4. Clusters produced by the MCL algorithm (I = 2.0 and I = 3.0) and by column generation for A. thaliana, grouped by cluster size. The darker part of each bar shows the fraction of clusters with significant enrichment of biological process annotation terms.
TABLE 3
Number and percentage of clusters with gene annotation enrichment in the A. thaliana network.

                  #clust.   clust. enriched [%]   vert. enriched [%]
Min-Cut              24            66.7                  2.0
Min-Cut + DR        129            56.6                  5.5
Neighbor-Heur.      189            49.7                  7.4
Column gen.         211            49.8                  8.5
MCL I = 2          1045            29.9                 48.1
MCL I = 3          1514            22.4                 31.6
Semantic similarity

To provide a more systematic analysis of the similarity of annotation terms for the clusters, we computed for each protein pair in the same cluster a semantic similarity score for the GO annotations, using the score definition of Wang et al. [43]. The computed scores lie in [0, 1];
a higher score indicates higher similarity between the two considered proteins. The average semantic similarity score for a protein pair in the same cluster is 0.538 for HCD, 0.258 for MCL with I = 2.0, and 0.261 for MCL with I = 3.0. This purely numeric score, however, could be skewed in favor of HCD: since MCL produces larger clusters, it would be acceptable that the pairs show less similarity because the obtained clustering could simply be coarser. We therefore further examined the effect of the cluster size on the average semantic score for protein pairs in the same cluster. Our results are shown in Fig. 5. For small clusters, our clusters show better similarity values for all cluster sizes except for size-five and size-seven clusters, where the difference is less pronounced. Note that the similarity value increases for size-six clusters and then decreases again for size-seven clusters. We believe that this behavior is due to the following fact: Size-five clusters are the smallest clusters that can have a missing edge; they can contain vertices with 3 = 5 − 2 neighbors in the cluster. For size-six clusters, every protein can again have at most one missing neighbor and thus has at least 4 = 6 − 2 neighbors in the cluster. Now, size-seven clusters can contain proteins with two missing neighbors. This might indicate that this "jump" in semantic similarity is due to the rounding effect in the definition of highly connected graphs.

Fig. 5. The average pairwise semantic similarity for protein pairs in the same cluster, grouped by cluster size; error bars show the 95% confidence interval. (a) Small clusters for A. thaliana. (b) Cluster categories for A. thaliana.
Next, we grouped the reported clusters into four
categories: small (3–6 proteins), medium (7–11 proteins),
large (12–30 proteins), and very large (>30 proteins), and
computed the average similarity scores for clusters in
these categories. While HCD did not find any very large
clusters, the average similarity score is, for the three
other categories, significantly higher in HCD than in
MCL. Our results also confirm that the very large clusters
show lower pairwise similarity than the small clusters.
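As an illustration of this grouping step, the sketch below averages pairwise similarity scores per size category; the function similarity(p, q) is a stand-in for a precomputed Wang et al. [43] score and, like the other names here, is an assumption of this sketch.

```python
# Sketch: average pairwise semantic similarity per cluster-size category.
from itertools import combinations
from statistics import mean

def category(size):
    if size <= 6:
        return "small"       # 3-6 proteins
    if size <= 11:
        return "medium"      # 7-11 proteins
    if size <= 30:
        return "large"       # 12-30 proteins
    return "very large"      # >30 proteins

def category_averages(clusters, similarity):
    """clusters: iterable of protein sets; similarity: (p, q) -> [0, 1]."""
    scores = {}  # category -> list of pairwise scores
    for cluster in clusters:
        cat = category(len(cluster))
        for p, q in combinations(sorted(cluster), 2):
            scores.setdefault(cat, []).append(similarity(p, q))
    return {cat: mean(vals) for cat, vals in scores.items()}
```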
Summarizing, our results for the A. thaliana network
indicate that HCD outperforms MCL in terms of quality
of the reported clusters while MCL shows better coverage
and a better running time.
Biological interpretation of an example cluster
We further examined a cluster of size 15 that was
found by the column generation algorithm and the
neighborhood heuristic in the A. thaliana network with all
interactions. The corresponding proteins are AT2G04660,
AT2G18290, AT2G20000, AT2G39090, AT2G42260,
AT3G48150, AT3G57860, AT4G11920, AT4G21530,
AT4G22910, AT4G33270, AT5G05560, AT5G13840,
AT1G06590, and AT1G78770. The annotation term with
the lowest p-value for this cluster is ‘regulation of DNA
endoreduplication’, shared by 12 out of the 15 proteins.
The three proteins that are not annotated with this term
are AT2G42260, AT4G21530, and AT4G33270. Protein
AT2G42260 is annotated with ‘DNA endoreduplication’
and has unknown function, but its mutants undergo further
rounds of endoreduplication, so a role in the regulation
of DNA endoreduplication is not unlikely. Protein
AT4G21530 is part of the anaphase promoting complex
(APC) which plays a crucial role in cell cycle regulation.
Protein AT4G33270 is a signal transducing protein
interacting with subunits of the APC. So far, it is only
annotated by ‘signal transduction’; a more specific role
in regulation of DNA endoreduplication is suggested
by its many interactions with the other proteins of the
cluster.
Interestingly, this cluster is completely destroyed by the
min-cut method (with or without data reduction), that is,
all proteins are unclustered when using these approaches.
The MCL algorithm finds other clusters that overlap with
this cluster. The cluster with the largest overlap has an
enrichment of annotation terms; the term with the lowest
p-value is again ‘regulation of DNA endoreduplication’,
but here only 9 of 19 proteins have this annotation.
[Figure: number of clusters per cluster size (sizes 3–20) for Min-Cut + DR, Neighbor + DR, and column generation with a 12 h time limit.]
Fig. 6. Heuristic methods for the S. pombe network
5.3 Heuristics
We examine the heuristic methods for HIGHLY CON-
NECTED DELETION for the S. pombe network with all
interactions, which our exact approach cannot solve (see
Fig. 6). The min-cut method finds a dense core with
63 vertices and 1268 edges. The neighborhood heuristic
finds almost the same cluster, omitting 5 vertices and
adding 2 others. However, the min-cut method additionally
finds only 7 triangles, whereas the neighborhood heuristic
finds an additional 94 clusters, of which 31 are not triangles,
the largest being two clusters of size 15 each. The column
generation method with a time limit finds even more clusters,
but only after several hours (note that the heuristic column
generation is not well-optimized for such large graphs, and
this running time could easily be improved). In summary, the
neighborhood heuristic and column generation with a time
limit find many potentially interesting clusters that the
min-cut algorithm misses.
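Every cluster reported by these heuristics is, by construction, highly connected, that is, its edge connectivity exceeds half its vertex count. A minimal validity check for reported clusters, sketched here with networkx as an assumption of this sketch rather than the implementation used in our experiments:

```python
# Check that a reported cluster induces a highly connected subgraph,
# i.e. its edge connectivity exceeds half its number of vertices.
import networkx as nx

def is_highly_connected(graph, cluster):
    sub = graph.subgraph(cluster)
    n = sub.number_of_nodes()
    if n <= 1:
        return True  # singletons are trivially highly connected
    # For n = 2 this correctly returns False: a single edge has
    # connectivity 1, which does not exceed n/2 = 1; the smallest
    # nontrivial highly connected clusters are triangles.
    return nx.edge_connectivity(sub) > n / 2
```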
6 CONCLUSION
Proposing the NP-hard problem HIGHLY CONNECTED
DELETION, we introduced a new view on partitioning
into highly connected components with a clearly defined
and natural optimization goal. We developed effective
and efficient data reduction rules which can be combined
with different approaches, including heuristic and ILP-
based ones. Hence, we suggest these data reduction rules
for wider use. Furthermore, we charted the border
of practical feasibility for finding optimal solutions to
HIGHLY CONNECTED DELETION, showing that medium-
size instances can be solved within a few hours. Finally, we
demonstrated the practical relevance of our approach for
clustering protein interaction networks, which compares
favorably with established clustering approaches that lack
a clear optimization goal. We believe that our simple formal
model of partitioning into highly connected clusters
provides a viable alternative to existing approaches.
Compared to the min-cut method with data reduction, we
find more clusters of slightly inferior quality; compared
with Markov clustering, we find fewer clusters, but these
have higher quality.
For future work, it seems interesting to perform com-
parisons with further clustering algorithms, for example
the RN algorithm [44]. One drawback of the HIGHLY
CONNECTED DELETION clustering definition is that many
vertices remain unclustered. This could be counteracted
with postprocessing as suggested by Hartuv and Shamir
[5]. Another drawback is that the biological quality of the
small clusters for the exact solutions is worse than for
the min-cut method. Thus, two possible extensions for
improving the overall cluster quality could be as follows:
First, one could demand that small clusters contain only
vertices of low degree. Second, one could demand, for
instance, that size-five clusters be cliques, thus putting
a stronger connectivity requirement on these small
clusters. Finally, it seems useful to consider edge-
weighted HIGHLY CONNECTED DELETION, that is, to
maximize the sum of edge weights in the clustering. This
could be useful to model different degrees of reliability
in the data [33]. Our ILP can be adapted to solve this
problem as well.
ACKNOWLEDGMENT
We are indebted to Nadja Betzler and Johannes Uhlmann
for their early contributions in the theoretical part of
this research. We also thank Andrea Kappes (Karlsruhe
Institute of Technology) for pointing out a flaw in a
previous version of the column generation algorithm.
REFERENCES
[1] R. Albert, “Scale-free networks in cell biology,” Journal of Cell Science, vol. 118, no. 21, pp. 4947–4957, 2007.
[2] R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular Systems Biology, vol. 3, p. 88, 2007.
[3] V. Spirin and L. A. Mirny, “Protein complexes and functional modules in molecular networks,” PNAS, vol. 100, no. 21, pp. 12123–12128, 2003.
[4] M. E. J. Newman, Networks: An Introduction. Oxford University Press, 2010.
[5] E. Hartuv and R. Shamir, “A clustering algorithm based on graph connectivity,” Information Processing Letters, vol. 76, no. 4–6, pp. 175–181, 2000.
[6] E. Hartuv, A. O. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir, “An algorithm for clustering cDNA fingerprints,” Genomics, vol. 66, no. 3, pp. 249–256, 2000.
[7] N. Pržulj, D. A. Wigle, and I. Jurisica, “Functional topology in a network of protein interactions,” Bioinformatics, vol. 20, no. 3, pp. 340–348, 2004.
[8] W. Hayes, K. Sun, and N. Pržulj, “Graphlet-based measures are suitable for biological network comparison,” Bioinformatics, vol. 29, no. 4, pp. 483–491, 2013.
[9] A. Krause, J. Stoye, and M. Vingron, “Large scale hierarchical clustering of protein sequences,” BMC Bioinformatics, vol. 6, p. 15, 2005.
[10] B. J. Parker, I. Moltke, A. Roth, S. Washietl, J. Wen, M. Kellis, R. Breaker, and J. S. Pedersen, “New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes,” Genome Research, vol. 21, no. 11, pp. 1929–1943, 2011.
[11] G. Chartrand, “A graph-theoretic approach to a communications problem,” SIAM Journal on Applied Mathematics, vol. 14, no. 4, pp. 778–781, 1966.
[12] H. Matsuda, T. Ishihara, and A. Hashimoto, “Classifying molecular sequences using a linkage graph with their pairwise similarities,” Theoretical Computer Science, vol. 210, no. 2, pp. 305–325, 1999.
[13] D. Jiang and J. Pei, “Mining frequent cross-graph quasi-cliques,” ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 4, pp. 16:1–16:42, 2009.
[14] I. Gat-Viks, R. Sharan, and R. Shamir, “Scoring clustering solutions by their biological relevance,” Bioinformatics, vol. 19, no. 18, pp. 2381–2389, 2003.
[15] M. Koyutürk, W. Szpankowski, and A. Grama, “Assessing significance of connectivity and conservation in protein interaction networks,” Journal of Computational Biology, vol. 14, no. 6, pp. 747–764, 2007.
[16] R. Shamir, R. Sharan, and D. Tsur, “Cluster graph modification problems,” Discrete Applied Mathematics, vol. 144, no. 1–2, pp. 173–182, 2004.
[17] H. Liu, P. Zhang, and D. Zhu, “On editing graphs into 2-club clusters,” in Proc. FAW-AAIM ’12, ser. LNCS, vol. 7285. Springer, 2012, pp. 235–246.
[18] G.-R. Cai and Y.-G. Sun, “The minimum augmentation of any graph to a k-edge-connected graph,” Networks, vol. 19, no. 1, pp. 151–172, 1989.
[19] D. Aloise, S. Cafieri, G. Caporossi, P. Hansen, S. Perron, and L. Liberti, “Column generation algorithms for exact modularity maximization in networks,” Physical Review E, vol. 82, p. 046112, 2010.
[20] R. Impagliazzo, R. Paturi, and F. Zane, “Which problems have strongly exponential complexity?” Journal of Computer and System Sciences, vol. 63, no. 4, pp. 512–530, 2001.
[21] N. Atias and R. Sharan, “Comparative analysis of protein networks: hard problems, practical solutions,” Communications of the ACM, vol. 55, no. 5, pp. 88–97, 2012.
[22] S. Böcker, S. Briesemeister, and G. W. Klau, “Exact algorithms for cluster editing: Evaluation and experiments,” Algorithmica, vol. 60, no. 2, pp. 316–334, 2011.
[23] R. G. Downey and M. R. Fellows, Parameterized Complexity. Springer, 1999.
[24] J. Flum and M. Grohe, Parameterized Complexity Theory. Springer, 2006.
[25] R. Niedermeier, Invitation to Fixed-Parameter Algorithms. Oxford University Press, 2006.
[26] J. Guo and R. Niedermeier, “Invitation to data reduction and problem kernelization,” ACM SIGACT News, vol. 38, no. 1, pp. 31–45, 2007.
[27] D. Lokshtanov, D. Marx, and S. Saurabh, “Lower bounds based on the Exponential Time Hypothesis,” Bulletin of the EATCS, vol. 105, pp. 41–71, 2011.
[28] J. M. M. van Rooij, M. E. van Kooten Niekerk, and H. L. Bodlaender, “Partition into triangles on bounded degree graphs,” Theory of Computing Systems, vol. 52, no. 4, pp. 687–718, 2013.
[29] C. Komusiewicz and J. Uhlmann, “Cluster editing with locally bounded modifications,” Discrete Applied Mathematics, vol. 160, no. 15, pp. 2259–2270, 2012.
[30] R. E. Gomory and T. C. Hu, “Multi-terminal network flows,” Journal of the Society for Industrial and Applied Mathematics, vol. 9, no. 4, pp. 551–570, 1961.
[31] V. King, S. Rao, and R. E. Tarjan, “A faster deterministic maximum flow algorithm,” Journal of Algorithms, vol. 17, no. 3, pp. 447–474, 1994.
[32] S. Böcker and P. Damaschke, “Even faster parameterized cluster deletion and cluster editing,” Information Processing Letters, vol. 111, no. 14, pp. 717–721, 2011.
[33] W.-C. Chang, S. Vakati, R. Krause, and O. Eulenstein, “Exploring biological interaction networks with tailored weighted quasi-bicliques,” BMC Bioinformatics, vol. 13, no. S-10, p. S16, 2012.
[34] M. Grötschel and Y. Wakabayashi, “A cutting plane algorithm for a clustering problem,” Mathematical Programming, vol. 45, no. 1–3, pp. 59–96, 1989.
[35] A. Mehrotra and M. A. Trick, “Cliques and clustering: A combinatorial approach,” Operations Research Letters, vol. 22, no. 1, pp. 1–12, 1998.
[36] X. Ji and J. E. Mitchell, “Branch-and-price-and-cut on the clique partitioning problem with minimum clique size requirement,” Discrete Optimization, vol. 4, no. 1, pp. 87–102, 2007.
[37] C. Chekuri, A. V. Goldberg, D. R. Karger, M. S. Levine, and C. Stein, “Experimental study of minimum cut algorithms,” in Proc. 8th SODA. ACM/SIAM, 1997, pp. 324–333.
[38] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, “BioGRID: a general repository for interaction datasets,” Nucleic Acids Research, vol. 34, no. suppl. 1, pp. D535–D539, 2006.
[39] S. Falcon and R. Gentleman, “Using GOstats to test gene lists for GO term association,” Bioinformatics, vol. 23, no. 2, pp. 257–258, 2007.
[40] T. Z. Berardini, S. Mundodi, R. Reiser, E. Huala, M. Garcia-Hernandez et al., “Functional annotation of the Arabidopsis genome using controlled vocabularies,” Plant Physiology, vol. 135, no. 2, pp. 1–11, 2004.
[41] S. Brohée and J. van Helden, “Evaluation of clustering algorithms for protein-protein interaction networks,” BMC Bioinformatics, vol. 7, no. 1, p. 488, 2006.
[42] S. van Dongen, “Graph clustering by flow simulation,” Ph.D. dissertation, University of Utrecht, 2000.
[43] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen, “A new method to measure the semantic similarity of GO terms,” Bioinformatics, vol. 23, no. 10, pp. 1274–1281, 2007.
[44] P. Ronhovde and Z. Nussinov, “Local resolution-limit-free Potts model for community detection,” Physical Review E, vol. 81, no. 4, p. 046114, 2010.
Falk Hüffner studied computer science at the Eberhard-Karls-Universität Tübingen and received his PhD at Friedrich-Schiller-Universität Jena. After a postdoctoral stay at Tel Aviv University, he now has a position at TU Berlin. He is interested in the design, analysis, and experimental evaluation of algorithms for hard problems, in particular graph problems and discrete optimization problems, from various areas such as computational biology, VLSI design, and operations research.

Christian Komusiewicz studied bioinformatics at Friedrich-Schiller-Universität Jena and received his PhD from TU Berlin. Currently, he is on a postdoctoral research stay at Université de Nantes. His main research interest lies in algorithmics for hard problems in bioinformatics.

Adrian Liebtrau studied computer science and mathematics at Friedrich-Schiller-Universität Jena. He now works as an SAP consultant for IBM.

Rolf Niedermeier studied computer science at TU München and received his PhD and habilitation from Eberhard-Karls-Universität Tübingen. After chairing the Theoretical Computer Science/Computational Complexity group at Friedrich-Schiller-Universität Jena for six years, he has chaired the Algorithmics and Complexity Theory group at TU Berlin, Faculty of Electrical Engineering and Computer Science, since 2010. His research interests include algorithms for NP-hard problems, finding applications in fields such as computational molecular biology, computational social choice, and graph-based data clustering.