3-HOP: A High-Compression Indexing Scheme for
Reachability Query
Ruoming Jin, Yang Xiang, Ning Ruan, and David Fuhry
Department of Computer Science, Kent State University
Kent, OH 44242, USA
{jin,yxiang,nruan,dfuhry}@cs.kent.edu
ABSTRACT
Reachability queries on large directed graphs have attracted much
attention recently. The existing work either uses spanning struc-
tures, such as chains or trees, to compress the complete transitive
closure, or utilizes the 2-hop strategy to describe the reachability.
Almost all of these approaches work well for very sparse graphs.
However, the challenging problem is that as the ratio of the number
of edges to the number of vertices increases, the size of the com-
pressed transitive closure grows very large. In this paper, we pro-
pose a new 3-hop indexing scheme for directed graphs with higher
density. The basic idea of 3-hop indexing is to use chain structures
in combination with hops to minimize the number of structures that
must be indexed. Technically, our goal is to find a 3-hop scheme
over dense DAGs (directed acyclic graphs) with minimum index
size. We develop an efficient algorithm to discover a transitive clo-
sure contour, which yields near optimal index size. Empirical stud-
ies show that our 3-hop scheme has much smaller index size than
state-of-the-art reachability query schemes such as 2-hop and path-
tree when DAGs are not very sparse, while our query time is close
to path-tree, which is considered to be one of the best reachability
query schemes.
Categories and Subject Descriptors
H.2.8 [Database management]: Database Applications—graph
indexing and querying
General Terms
Performance
Keywords
Graph indexing, Reachability queries, Transitive closure, 3-Hop,
2-Hop, Path-tree
1. INTRODUCTION
The rapid accumulation of very large graphs from a diversity
of disciplines, such as biological networks, social networks, on-
tologies, XML, and RDF databases, among others, calls for the
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.
Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.
graph database system. Important research issues, ranging from theoretical foundations, including algebra and query languages [2], to indices for various graph queries [20, 12] and, more recently, graph OLAP/summarization [17], have attracted much recent attention. Among them, graph reachability query processing has evolved into a core problem: given two vertices u and v in a directed graph, is there a path from u to v (u ⇝ v)?
Graph reachability is one of the fundamental research questions
across several disciplines in computer science, such as software en-
gineering and distributed computing. In the database research com-
munity, the initial interest in reachability queries has been driven by
the need to handle recursive queries, with focus on efficient and ef-
fective transitive closure compression. Recently, this problem has
captured the attention of database researchers again, due to the in-
creasing importance of XML data management, and fast growing
graph data, such as large scale social networks, WWW, and bio-
logical networks. For instance, in XML databases, the reachability
query is the basic building block for the typical path query format //P1//P2//···//Pm, where “//” is the ancestor-descendant search and Pi is a tag. Reachability queries also have an important role for managing/querying RDF and domain ontologies. In
bioinformatics, reachability queries can be used to answer basic
gene regulation questions in the regulatory network.
1.1 Prior Work
In order to tell whether a vertex u can reach another vertex v
in a directed graph, many approaches have been developed over
the years. For a reachability query, we can effectively transform a
directed graph into a directed acyclic graph (DAG) by coalescing
strongly connected components into vertices and utilizing the DAG
to answer the reachability queries. Thus, throughout the paper, we
will only focus on DAGs. Let G = (V, E) be the DAG for a reachability query. In Table 1, we summarize these approaches in terms of their index size, construction time, and query processing time based on worst-case analysis. Here, n is the number of vertices (n = |V|) and m is the number of edges (m = |E|). Parameter k is the width of the chain decomposition of DAG G [11], t is the number of (non-tree) edges left after removing all the edges of a spanning tree of G [19], and k′ is the width of the path decomposition [12]. These three parameters, k, t and k′, are method-specific and will be explained in more detail when we discuss their corresponding methods.
DFS/BFS and Transitive Closure Computation: We first discuss
two classical approaches for reachability query, representing two
extremes with regard to index size and query time. DFS/BFS needs
to traverse the graph online and can take up to O(n+m) time to
answer a reachability query. This is too slow for large graphs. The
second approach precomputes the transitive closure of G, i.e., it
records the reachability between every pair of vertices in advance.
                          Index Size    Construction Time   Query Time
DFS/BFS                   O(n+m)        -                   O(n+m)
Transitive Closure [16]   O(n²)         O(nm)               O(1)
Opt. Chain Cover [11]     O(nk)         O(n³)               O(log k)
Opt. Chain Cover [5]      O(nk)         O(n² + kn√k)        O(log k)
Opt. Tree Cover [1]       O(n²)         O(nm)               O(log n)
Dual Labeling [19]        O(n + t²)     O(n + m + t³)       O(1)
Labeling+SSPI [4]         O(n + m)      O(n + m)            O(m − n)
GRIPP [18]                O(m + n)      O(n + m)            O(m − n)
Path-Tree [12]            O(nk′)        O(mk′) / O(mn)      O(log² k′)
2-Hop [9]                 Õ(n√m)        O(n³ · |Tc|)        Õ(√m)

Table 1: Worst-Case Complexity
While this approach can answer reachability queries in constant time, its storage cost of O(n²) is prohibitive for large graphs.
Indeed, tackling the storage cost by effectively compressing the
transitive closure has been the major theme of index construction
for graph reachability processing. Typically, however, improved
compression comes at the cost of slower query answering time. To
find the right balance between transitive closure compression and
reasonable query answering time is the driving force of ongoing
research into graph reachability indexing.
The existing research largely falls into two categories: the first
category attempts to apply simple graph structures, such as chains
and trees, to compress the transitive closure of a DAG. The optimal
chain cover, tree cover and the recent path-tree cover all belong to
this category. The second category, referred to as 2-hop indexing,
tries to encode the reachability using a subset of vertices which
serve as intermediaries, i.e., each vertex records a list of interme-
diate vertices it can reach and a list of intermediate vertices which
can reach it. Then, 2-hop reachability means the starting vertex can
reach an intermediate vertex (the first hop) and this intermediate
vertex can reach the end vertex (the second hop). In the following,
we go through these approaches in more detail.
Optimal Chain Cover: The basic idea of optimal chain cover is
to decompose a DAG into a minimal number of pair-wise disjoint
chains, and then assign each vertex in the graph a chain ID and
its sequence number in its chain. Given this, if a vertex can reach another chain, it records only the smallest vertex it reaches in that chain. In other words, each vertex in the compressed transitive closure covers the remaining vertices (all the vertices with a higher sequence number) in its respective chain. To determine if vertex u reaches vertex v, we only need to check if u reaches any vertex (say, v′) in v's chain, and if yes, we check whether v′ has a sequence number no larger than v's. This strategy can compress the transitive closure since we need to record at most one vertex in each chain for a given vertex. If the minimal number of chains for a DAG (also referred to as the width of the DAG) is k, then this approach has O(nk) index size and O(log k) query time.
Jagadish [11] pioneered the application of chain decomposition in the database research community to compress the transitive closure. He demonstrated that the problem of finding the minimal number of chains for G can be transformed into a network flow problem, which can be solved in O(n³) time. He also proposed several heuristic algorithms for chain decomposition in order to reduce the computational cost and actual index size. Recently, Chen [5] proposed an O(n² + kn√k)-time algorithm to decompose a DAG into a minimal number of chains.
The worst-case complexity of the chain cover approach is clearly decided by the width of the DAG. If the width is high, we tend to have a lot of chains with only a small number of vertices, resulting in a high index cost. Another way to look at the compression rate is by observing that each vertex in the compressed transitive closure covers a partial chain (from the vertex itself to the last vertex in the chain). Let R(u) be the transitive closure of u. Let RC(u) be the set of vertices u records under the chain decomposition. Then, the compression ratio of the chain decomposition is defined as

    Σ_{u∈V} |R(u)| / Σ_{u∈V} |RC(u)|

Thus, we can see that the compression ratio is exactly the average size of the partial chains each vertex in the compressed transitive closure covers.
Optimal Tree Cover and Its Variants: The optimal tree cover uti-
lizes a (spanning) tree to compress the transitive closure [1]. Each
vertex in the tree is labeled by a pair of numbers, corresponding to
an interval: if a vertex is an ancestor of another vertex in the tree,
the interval labeling guarantees that the interval of the first vertex
contains the interval of the second vertex. Note that if a vertex
reaches the root of a subtree in the original DAG, it will reach all
the vertices in the subtree. Thus, for each vertex in the DAG, we
can organize all the vertices in its transitive closure, i.e., all the ver-
tices it can reach, into pair-wise disjoint subtrees. To compress the
transitive closure, for each subtree, we only need to record its root
vertex. To answer the reachability query from vertex u to vertex v, we check if the interval of v is contained in any interval associated with the subtree roots we have recorded for u.
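As an illustration, this containment test can be sketched as follows, assuming interval[x] = (start, end) from the tree labeling and roots[u] = the intervals of the subtree roots recorded for u (with u's own interval included, to cover plain tree ancestry); these structure names are ours, not the paper's:

    # Illustrative tree-cover reachability test (structure names are assumptions).
    def reaches(u, v, interval, roots):
        lo, hi = interval[v]
        # u reaches v iff v's interval lies inside some recorded root interval
        return any(s <= lo and hi <= e for (s, e) in roots[u])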
Agrawal et al. [1] formally introduced the tree cover and found
an optimized algorithm to discover a tree cover which can maxi-
mally compress the transitive closure. They also showed that the
tree cover approach can provide a better compression rate than the
optimal chain cover approach. The advantage of the tree cover ap-
proach over the chain cover approach comes from the fact that each
tree-cover vertex covers an entire subtree, while each chain-cover
vertex covers only a partial chain.
Several recent studies focus on the tree cover approach and try to improve its query processing time and/or provide a smaller index size. Wang et al. [19] develop the Dual-Labeling approach, which tries to improve the query time and index size for very sparse graphs, where the number of non-tree edges t is much smaller than the number of vertices n (t ≪ n). Their approach can reduce the index size to O(n + t²) and achieve constant query answering time. Unfortunately, many real-world graphs do not satisfy the condition required by this approach, and when t > n, this approach will not help compress the index size.

Label+SSPI [4] and GRIPP [18] aim to minimize the index construction time and index size. They achieve O(m+n) index construction time and O(m+n) index size. However, this comes at the sacrifice of query time, which can cost O(m − n). Both algorithms start by extracting a tree cover and then deploy an online search algorithm utilizing the tree structure to speed up the DFS process.
Path-Tree Cover: The latest work to use a simple graph structure
to compress transitive closure is the path-tree cover approach, pro-
posed by Jin et al. [12], which generalizes the tree cover approach.
They observe that the covering capability of each vertex in the com-
pressed transitive closure is determined by the number of parents
and children each vertex has in the simple graph structure. For in-
stance, a chain vertex has one parent and one child while a tree
vertex has one parent and multiple children. The path-tree allows
two parents and multiple children. In a path-tree cover, all vertices in the original DAG are partitioned into pair-wise disjoint paths (k′ is the number of paths in the path decomposition for a DAG G), and then those paths serve as vertices in a tree structure. In other words, the path-tree utilizes a tree-like structure, where each vertex represents a path in the original DAG. Each vertex in the path-tree needs only three numbers, two numbers for the interval label of the tree structure and one sequence number from a DFS traversal procedure, to answer the reachability query between any two vertices in the path-tree in constant time. In [12], the authors proposed two path-tree schemes, PTree-1 and PTree-2. PTree-1 utilizes the optimal tree cover and thus has O(mn) construction time, while PTree-2 has O(mk′) construction time.

Given this, to compress the transitive closure, a vertex u only needs to record vertex v such that 1) u ⇝ v and 2) there is no vertex v′ such that u ⇝ v′ and v′ can reach v in the path-tree. Theoretically, they prove that the path-tree cover always compresses the transitive closure at least as well as the optimal tree cover and chain cover approaches.
that the enhanced power of the path-tree cover is a consequence of
the increased parent/child connectivity of path-tree vertices vs. tree
cover or chain cover vertices.
2-HOP Indexing: The 2-hop labeling method proposed by Cohen
et al. [9] is quite different from the aforementioned simple graph
covering approaches. It compresses the transitive closure using a
subset of intermediate vertices. Each vertex records a list of in-
termediate vertices it can reach and a list of intermediate vertices
which can reach it. The index size is the total number of interme-
diate vertices each vertex records. They propose an approximate (greedy) algorithm based on set-covering which can produce a 2-hop cover larger than the minimum possible 2-hop indexing by at most a logarithmic factor. The minimum 2-hop index size is conjectured to be Õ(n√m).
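For later contrast with 3-hop, the 2-hop query itself is a single set intersection; a minimal sketch, assuming Lout[u] holds the intermediate vertices u reaches and Lin[v] those reaching v (each vertex listed in its own labels):

    # Illustrative 2-hop test: u ⇝ v iff some intermediate w has u ⇝ w and w ⇝ v.
    def reaches(u, v, Lout, Lin):
        return not Lout[u].isdisjoint(Lin[v])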
The major problem of the 2-hop indexing approach is its high construction cost. The greedy set-covering algorithm needs to iteratively find a subset of vertices which utilizes a candidate vertex as the intermediate hop. The subset of vertices is selected to minimize the price measure, i.e., the cost of recording such an intermediate hop for these vertices with respect to the number of uncovered reachable vertex pairs in this subset. Finding the subset of vertices with minimal price can be transformed into the problem of finding a densest subgraph in a bipartite graph. The approximation algorithm for this subproblem runs in time linear in the number of edges of the bipartite graph. Moreover, each vertex in the DAG can serve as the intermediate hop, and each corresponds to a bipartite graph. Thus, each iteration takes O(n³) to find such a desired subset of vertices. Considering that the iterations need to cover the entire transitive closure Tc, we can see its construction time is O(n³ · |Tc|).
Several approaches have been proposed to reduce its construction time. Schenkel et al. propose the HOPI algorithm, which applies a divide-and-conquer strategy to compute 2-hop labeling [15]. Recently, Cheng et al. propose several methods, such as a geometric-based algorithm [6] and a graph partition technique [7], to produce a 2-hop labeling. Though their algorithms significantly speed up the 2-hop construction time, they do not preserve the approximation bound on the labeling size that Cohen et al.'s approach provides.
1.2 Our Contribution
Almost all these approaches work reasonably well for very sparse
graphs (where the number of edges is very close to the number of
vertices). However, as the ratio of the number of edges to the num-
ber of vertices increases, the size of the compressed transitive clo-
sure of the simple graph covering approaches can grow very large.
In many real world graphs, such as citation networks, the semantic
web, and biological networks, the number of edges can be several
times the number of vertices. In general, the simple graph covering
approach works well only for those DAGs which have a structure
similar to the building-block chain, tree, or path-tree structures.
However, in many real world graphs, since edge density is much
higher than in simple graph structures, many edges will be left un-
covered. Vertices of uncovered edges likely need to be recorded as
ancillary data in the compressed transitive closure of the DAG, in-
creasing the index size. Thus, the size of the compressed transitive
closure can become very large as the density grows.
The original 2-hop [9] builds on top of the set-covering frame-
work and is theoretically appealing as it achieves a guaranteed ap-
proximation bound. However, to our knowledge, there is little the-
oretical comparison between the 2-hop approach and the simple
graph covering approaches in existing research. Most studies do
not even empirically compare the 2-hop approach and the simple
graph covering approaches. This may be due in part to the 2-hop
approach not scaling well to large graphs, even graphs with only
thousands of vertices. Specifically, since the original 2-hop needs
to compute the complete transitive closure, it becomes very expen-
sive as the edge density of the graph becomes larger. Though sev-
eral heuristic techniques [15, 6, 7] have been proposed to construct
2-hop faster, they do not guarantee any approximation bound as the
original 2-hop does. None of these methods have compared their
compression ratio directly with the optimal 2-hop approaches, even
on relatively small graphs.
To summarize, the major research challenge for existing graph
reachability indexing is how to significantly compress the transitive
closure when the ratio between the number of edges and the number
of vertices increases. Driven by this need, we propose a new 3-hop
indexing scheme for directed graphs with higher density. The basic
idea in 3-hop indexing is to utilize a simple graph structure, rather
than a sole vertex, as an intermediate hop to describe the reachabil-
ity between source vertices and destination vertices. In this paper,
we focus on the chain structure. The new indexing scheme does
not need to compute the entire transitive closure. Instead, it only
needs to compute and record a number of so-called “contour” ver-
tex pairs, which can be orders of magnitude smaller than the size
of the transitive closure. Indeed, it is even much smaller than the
compressed transitive closure of the chain cover. The connectivity
of any pair of vertices in the DAG can be answered by those con-
tour vertex pairs. Further, we “factorize” these contour vertex pairs
by recording a list of “entry points” and “exit points” on some in-
termediate chains. We derive an efficient algorithm to generate an
index which approximates the minimal 3-hop indexing by a loga-
rithmic factor. Theoretically, we show that 3-hop labeling always
has a better minimal compression ratio than 2-hop labeling, and its
construction time is much faster than that of 2-hop.
We perform a detailed experimental evaluation on both real and
synthetic datasets by comparing 3-hop labeling, 2-hop labeling and
the state-of-the-art path-tree covering approach. Empirical studies
show that our 3-hop scheme has a much smaller index size than prior state-of-the-art reachability query schemes for dense DAGs, i.e., when the number of edges is not close to the number of vertices (|E| ≉ |V|). The query processing time of 3-hop is close to path-
tree’s, which is considered to be one of the best reachability query
schemes.
2. BASIC IDEAS OF 3-HOP INDEXING
2.1 Basic 3-Hop
The 3-hop reachability indexing is analogous to the highway sys-
tem of the transportation network. To reach a destination from a
starting point, you simply need to get on an appropriate highway
and get off at the right exit to get to the destination. The high-
way system in the 3-hop labeling is simple graph structures, such
as chains or trees, as they can encode the reachability information
using a constant labeling size. In this paper, we focus on utilizing
chains, i.e., each chain serves as a different highway. Since each chain has a direction, each vertex u records a list of “entry points” (the smallest vertices) it can reach on some chains. It also records a list of “exit points” (the largest vertices) which can reach it on some chains. Here, the order of vertices in a chain is their topological order in that chain, i.e., a vertex with a smaller number can reach a vertex with a larger number.
Given this, the three hops are 1) the first hop from the starting
vertex to the entry point of some chain, 2) the second hop from the
entry point in the chain to the exit point of the chain, and finally
3) the third hop from the exit point of the chain to the destination
vertex. The goal of 3-hop indexing is to assign entry and exit points to the vertices with a minimal total number of points, so as to maximally compress the transitive closure.
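In the single-chain setting this test is one comparison; a minimal sketch, assuming hypothetical maps entry (smallest chain position a vertex reaches) and exit_ (largest chain position that reaches a vertex):

    # u reaches v through the chain iff u's entry point precedes v's exit point.
    def reaches_via_chain(u, v, entry, exit_):
        return u in entry and v in exit_ and entry[u] <= exit_[v]

In Figure 1(a) below, for example, entry[2] = 6 and exit_[9] = 7, and 6 ≤ 7 confirms 2 ⇝ 9.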
Figure 1: A simple example for 3-hop and 2-hop
Figure 1(a) shows an example using the chain 5 → 6 → 7 → 8 as the intermediate hop (or highway). Thus, each vertex not on the chain only needs to record its entry point and exit point in that chain, listing them in the set o and set i associated with each vertex, respectively. To tell if vertex 2 can reach 9, we compare 2's entry point with 9's exit point. We conclude that 2 can reach 9 because 2's entry point 6 precedes exit point 7, which then reaches vertex 9. In total, this simple 3-hop scheme records 8 vertices to encode the transitive closure using a single chain.

Figure 1(b) shows the optimal 2-hop labeling, where each vertex records a list of intermediate vertices it reaches and a list of vertices which reach it. Here, 2-hop needs to record a total of 16 vertices to encode the transitive closure. However, readers should be advised that this is a very simple and incomplete example giving the basic idea of 3-hop. Detailed definitions, algorithms and complete running examples of 3-hop will be given from now on.
2.2 Chain Decomposition for 3-Hop
A simple technique which can significantly boost the 3-hop com-
pression ratio is to apply a chain decomposition for the entire DAG
first. From the 3-hop perspective, such a decomposition would as-
sociate each vertex itself with a highway since each vertex is par-
titioned to a chain. This suggests that many vertices in the same
chain may share the same entry points and exit points of some other
chains. Thus, we do not need to explicitly record those points for
each of these vertices in the same chain, and therefore can further
compress the transitive closure. To better understand the intuition
of boosting 3-hop with a chain decomposition, let us see the run-
ning example in Figure 2.
Figure 2 is a DAG with 4 chains as a result of chain decompo-
sition. In 3-hop, each chain serves as a highway and each vertex
also belongs to a highway. In Figure 3, we show the vertices using
Figure 2: A simple DAG with a chain decomposition. (The dotted arrow 13 → 14 is not an edge in the original DAG, but an inferred one using reachability.)
chains C2 and C3 as intermediate hops (highways) to encode their
transitive closure. At the left of each chain, we draw those ver-
tices which record an entry point into that chain, and at the right of
each chain, we draw those vertices which record an exit point out
of the corresponding chain. To be more efficient, we organize into
an “outgoing” segment those consecutive vertices (on one chain)
which share the same entry point, and correspondingly we orga-
nize into an “incoming” segment those consecutive vertices (on one
chain) which share the same exit point.
Specifically, we organize all the vertices on the left which share the same entry point into an “outgoing” segment, and all the vertices on the right which share the same exit point into an “incoming” segment. Each segment corresponds to a list of consecutive vertices in a chain. For instance, the vertices in the outgoing segment from 1 to 3 all record vertex 6 in chain C2 as their entry point, and they are the first three vertices in chain C1. The vertices in the incoming segment from 17 to 20 all record vertex 11 in chain C3 as their exit point, and they are the last four vertices in chain C4.
Figure 3: Two examples of reachability between segments (through chains C2 and C3)
Intuitively, we can apply 3-hop with chain decomposition to answer a reachability query. For example, to answer whether vertex 6 can reach vertex 19, we find that vertex 6 is in segment (6,7), which can reach vertex 12 in C3, and vertex 19 is in segment (19,20), which can be reached by vertex 14 in C3. Then we say 6 can reach 19 because 6 can reach 12, 19 can be reached by 14, and 12 reaches 14 in chain C3.
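Generalizing this example, a hedged sketch of the whole test tries every chain as the highway; the structures cid/oid/entry/exit_ here are illustrative stand-ins for the index actually developed in Sections 3–5:

    def reaches(u, v, k, cid, oid, entry, exit_):
        """entry[(x, w)] / exit_[(x, w)]: positions of the entry/exit point
        recorded for x on chain w, when one exists (assumed structures)."""
        if cid[u] == cid[v]:                     # same chain: compare positions
            return oid[u] <= oid[v]
        for w in range(1, k + 1):                # try each chain as the highway
            e_in, e_out = entry.get((u, w)), exit_.get((v, w))
            if e_in is not None and e_out is not None and e_in <= e_out:
                return True
        return False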
2.3 3-Hop Indexing and Our Approach
The major research problem we will study in this paper is as follows: given a chain decomposition {C1, C2, ..., Ck} of a DAG G, how can we utilize the 3-hop strategy to maximally compress the transitive closure and answer reachability queries efficiently? Our approach addresses this problem in three steps:

1. (Section 3) Given a chain decomposition, we first derive a concise representation of the transitive closure, called the contour of the transitive closure. This representation allows us to quickly identify those vertices which share the same entry point and those which share the same exit point.

2. (Section 4) We show that a 3-hop strategy which maximally compresses the contour corresponds to a generalized “factorization” of the contour. We develop an efficient greedy algorithm to approximate the optimal result within a logarithmic factor.

3. (Section 5) We provide a query processing procedure utilizing the index based on the 3-hop compression of the transitive closure contour. We also derive a theoretically faster query processing scheme by transforming the 3-hop contour into a 3-hop segment indexing.
3. TRANSITIVE CLOSURE CONTOUR
In this section, we will study a concise representation of the tran-
sitive closure matrix based on the chain decomposition of the DAG.
This representation will form the basis for efficient construction of
the 3-HOP index. We will derive a fast algorithm to directly gener-
ate this concise representation.
3.1 Notation and Chain-Decomposition
Let G = (V, E) be a directed acyclic graph (DAG), where V = {1, 2, ···, n} is the vertex set, and E ⊆ V × V is the edge set. We use (v, w) to denote the edge from vertex v to vertex w, and we use (v0, v1, ···, vp) to denote a path from vertex v0 to vertex vp, where (vi, vi+1) is an edge (0 ≤ i ≤ p−1). In a DAG, all paths are simple paths, meaning each vertex in a path is distinct. We say vertex v is reachable from vertex u (denoted u ⇝ v) if there is a path starting from u and ending at v.

A chain is a generalization of a path: it is also a sequence of vertices (v0, v1, ···, vp), where vi+1 is reachable from vi (vi ⇝ vi+1, 0 ≤ i ≤ p−1). Clearly, any path in G is also a chain. However, the reverse is not necessarily true (see chain C3 in Figure 2). Let C1 and C2 be two chains of G. We use C1 ∩ C2 to denote the set of vertices appearing in both chains and C1 ∪ C2 to denote the set of vertices appearing in either of the chains.

DEFINITION 1. (Chain Decomposition) A chain decomposition of DAG G = (V, E) is a collection of pair-wise distinct chains, C1, C2, ···, Ck, such that C1 ∪ C2 ∪ ··· ∪ Ck = V and Ci ∩ Cj = ∅ for any i ≠ j. The integer k is called the width of the decomposition.

Given the chain decomposition, we assign to each vertex v a pair of IDs, (cid, oid), where cid is the ID of the chain vertex v belongs to, and oid is v's relative order on the chain. For any two vertices u and v in the same chain, we have u ⇝ v iff u.oid ≤ v.oid. If u.oid < v.oid, we also say u is smaller than v, and vice versa. Several algorithms have been developed to partition a DAG into a minimal number of chains to facilitate transitive closure computation [11, 5]. Our approach can utilize any of these approaches.
3.2 Transitive Closure between Two Chains
In this work, we will derive a more concise representation for
the transitive closure using the chain decomposition. We base this
representation on a key observation on how the transitive closure is
recorded in binary matrix format. Note that our approach does not
need to materialize this binary matrix representation of the transi-
tive closure.
Let M be the binary matrix representation of the transitive closure of G. Then M[vi, vj] = 1 iff vi ⇝ vj, and M[vi, vj] = 0 iff vi cannot reach vj. We define an index (i, j) of M to be a cell. If M(i, j) = 0, we say (i, j) is a 0-cell; else (i, j) is a 1-cell. Also, we order the vertices based on their chain ID, and within each chain, we sort the vertices according to their order ID (oid). Thus, the vertices in the same chain are contiguous in linearly increasing order. We also introduce the submatrix M_{Ci,Cj} for any two chains Ci and Cj, which has the rows of Ci and the columns of Cj. Clearly, the complete transitive closure M can be written as the union of the k × k submatrices:

        | M_{C1,C1}  M_{C1,C2}  ···  M_{C1,Ck} |
    M = | M_{C2,C1}  M_{C2,C2}  ···  M_{C2,Ck} |        (1)
        |    ···        ···     ···     ···    |
        | M_{Ck,C1}  M_{Ck,C2}  ···  M_{Ck,Ck} |
Figure 4: Pseudo-diagonal and pseudo-upper triangular submatrix. All blank cells are 0-cells.
It is easy to see that any M_{Ci,Ci} is a special upper triangular matrix, i.e., for any va ≤ vb, where va and vb are vertices of chain Ci, M[va, vb] = 1, and for any va > vb, M[va, vb] = 0. We refer to it as an upper uni-triangular matrix. Note that the geometry of this submatrix describes, and is equivalent to, the intra-chain reachability property. Therefore, there is no need to materialize the M_{Ci,Ci} upper uni-triangular matrices. Next, what does a submatrix M_{Ci,Cj} look like when i ≠ j?

To describe the shape of the submatrices between any two chains, we introduce the following notation. Given submatrix M_{Ci,Cj} with |Ci| rows and |Cj| columns, and two cells (x, y) and (x′, y′), where x and x′ are vertices of chain Ci and y and y′ are vertices of chain Cj, we say cell (x, y) dominates cell (x′, y′) in the matrix M_{Ci,Cj} if x′ ≤ x and y ≤ y′. In other words, a cell dominates all the cells located in its upper-right quadrant. As a simple observation, in any submatrix M_{Ci,Cj}, the collection of all the cells dominated by a cell (x, y) forms a rectangle which has (x, y) as its lower-left corner and the upper-right cell of M_{Ci,Cj} as its upper-right corner.
DEFINITION 2. (Pseudo-Diagonal and Pseudo-Upper Triangular Matrix) The pseudo-diagonal of a binary matrix (submatrix) Ms is a set of 1-cells, {(x1, y1), (x2, y2), ···, (xl, yl)}, such that 1) all the 1-cells in Ms are dominated by at least one pseudo-diagonal cell, 2) none of the 0-cells in Ms is dominated by any pseudo-diagonal cell, and 3) no pseudo-diagonal cell dominates another pseudo-diagonal cell. If a binary matrix (submatrix) has a pseudo-diagonal, we refer to it as a pseudo-upper triangular matrix (submatrix).
Clearly, not every binary matrix is a pseudo-upper triangular matrix containing a pseudo-diagonal. We next provide the following theorem to reveal the shape of a submatrix between two chains.

THEOREM 1. Let M_{Ci,Cj} be the binary submatrix of the transitive closure between two different chains Ci and Cj. M_{Ci,Cj} is a pseudo-upper triangular matrix.

Proof Sketch: Our proof is constructive. We will first construct the pseudo-diagonal explicitly. Then, we will show that the matrix is indeed pseudo-upper triangular. Let the chain Ci be (v1, v2, ···, vp). Let f(vi) be the first vertex in Cj that vi can reach. If vi does not reach any vertex in Cj, let f(vi) = +∞.

Then we construct the sequence (f(v1), f(v2), ···, f(vp)). We can show f(vi) ≤ f(vi+1) as follows: because vi reaches vi+1, vi will reach f(vi+1). Thus, f(vi) should be no larger than f(vi+1). This also suggests that f(vi) = +∞, if it exists, can only appear at the end of the sequence.

Given this, we observe the following property of the pseudo-diagonal: a 1-cell (vi, f(vi)) (1 ≤ i ≤ p−1) is in the pseudo-diagonal if and only if f(vi+1) > f(vi) and f(vi) ≠ +∞. Besides, (vp, f(vp)) is in the pseudo-diagonal if and only if it is a 1-cell. Thus, we can scan the sequence (f(v1), f(v2), ···, f(vp)) once to create the pseudo-diagonal.

Now, we only need to show that any cell which is dominated by one of the cells in the pseudo-diagonal is a 1-cell, and otherwise a 0-cell. Let (a, b) be a cell in the matrix and assume it is dominated by one of the cells in the pseudo-diagonal, (vi, f(vi)). Then, by definition, a ≤ vi and b ≥ f(vi). In other words, a ⇝ vi in Ci and f(vi) ⇝ b in Cj. We also know vi ⇝ f(vi). Thus, we have a ⇝ b, so (a, b) is a 1-cell.

Let (c, d) be a cell in the matrix and assume it is not dominated by any of the cells in the pseudo-diagonal. Basically, we have d < f(c). Since f(c) is the smallest vertex in chain Cj that c can reach, c cannot reach d, meaning (c, d) is a 0-cell. □

In Figure 4, we can see that each M_{Ci,Cj}, i ≠ j, is a pseudo-upper triangular matrix. We highlight their pseudo-diagonal cells with circles.
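The single scan in the proof translates directly into code; a minimal sketch, assuming f is the non-decreasing list [f(v1), ..., f(vp)] with math.inf for vertices that reach nothing on the other chain:

    import math

    def pseudo_diagonal(f):
        cells = []
        for i in range(len(f)):
            if f[i] == math.inf:
                break                          # +inf entries only occur at the tail
            # (v_i, f(v_i)) is on the pseudo-diagonal iff the next f value is larger
            if i == len(f) - 1 or f[i + 1] > f[i]:
                cells.append((i, f[i]))
        return cells

For instance, pseudo_diagonal([6, 6, 8, math.inf]) returns [(1, 6), (2, 8)] (0-based vertex positions): the cell for the first vertex is dominated by the cell for the second, so only the latter appears.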
COROLLARY 1. The transitive closure from any chain Ci to another chain Cj,

    | M_{Ci,Ci}  M_{Ci,Cj} |
    |            M_{Cj,Cj} |

can be described as a directed graph with vertex set V′ = V(Ci) ∪ V(Cj) and edge set E′ = E(Ci) ∪ E(Cj) ∪ {(vi, f(vi)) | (vi, f(vi)) is a pseudo-diagonal cell}, and no two edges cross, i.e., for any two pseudo-diagonal cells (vi, f(vi)) and (vj, f(vj)), we have either

    (vi.oid > vj.oid) ∧ (f(vi).oid > f(vj).oid), or
    (vi.oid < vj.oid) ∧ (f(vi).oid < f(vj).oid).

Essentially, the edge links from Ci to Cj do not cross each other. Figure 5 shows two examples: edge links between C1 and C3, and edge links between C3 and C4. Moreover, we can see that for chains Ci and Cj, the starting vertices of the pseudo-diagonal cells naturally divide chain Ci into several “outgoing” segments such that all the vertices in a segment share the same “entry point” to chain Cj. Similarly, the end vertices of these pseudo-diagonal cells divide chain Cj into several “incoming” segments where all the vertices in a segment share the same “exit point” from chain Ci. For instance, in Figure 5, for chains C3 and C4, the pseudo-diagonal cells {(11,17), (13,18), (14,19)} divide chain C3 into four outgoing segments, including (10,11), (12,13) and (14,14), and divide chain C4 into three incoming segments, (17,17), (18,18) and (19,20).
Now, we formally introduce the transitive closure contour.
Figure 5: Edge links between chains C1 and C3, and between chains C3 and C4. Dotted arrows are virtual edges (paths).
DEFINITION 3. (Transitive Closure Contour) Given DAG G and its chain decomposition C1 ∪ C2 ∪ ··· ∪ Ck, the transitive closure contour Con(G) is the set of all pseudo-diagonal cells of the pseudo-upper triangular matrices M_{Ci,Cj}, where i ≠ j.

Given a chain decomposition, we can see that the transitive closure contour precisely describes the complete transitive closure. We will utilize this concise representation of the transitive closure to build our 3-HOP indexing.
3.3 Computing Transitive Closure Contour
We now present an efficient computation which directly computes the transitive closure contour, without materializing the binary matrix, given a chain decomposition. The sketch of TransitiveClosureContour is in Algorithm 1. We use a matrix S to record the entire transitive closure contour of DAG G, Con(G). Each element S_{i,j} records the pseudo-diagonal of M_{Ci,Cj} for chains Ci and Cj.

The computation follows the reverse topological order (Loop 3–21), which broadcasts the reachability information from bottom to top. S_{i,j} is an ordered set of pseudo-diagonal cells (p, q) between chain i and chain j (in ascending order of q.oid), and S_{i,j}.head() returns the first (smallest q.oid) pseudo-diagonal cell (p, q) in S_{i,j}. For each vertex u, we use minoid[i] to record the smallest vertex it can reach in chain Ci. At the beginning, we fill minoid[i] with the smallest vertex that its own chain C_{u.cid} can reach in chain Ci. This is done in Line 4, and we can retrieve this cell by
Algorithm 1 TransitiveClosureContour(G, C1 ∪ C2 ∪ ··· ∪ Ck)
Parameter: C1 ∪ C2 ∪ ··· ∪ Ck: the chain decomposition
1: Perform the topological sort of G
2: For each i, j, 1 ≤ i, j ≤ k: S_{i,j} ← ∅
3: for u = |V(G)| downto 1 {following the reverse topological order} do
4:   For each i, 1 ≤ i ≤ k: minoid[i] ← y, where y = q.oid and (p, q) = S_{u.cid,i}.head() {y = ∞ if S_{u.cid,i} = ∅}
5:   for each v: an immediate successor of u {in topological order} do
6:     if v.oid < minoid[v.cid] ∧ v.cid ≠ u.cid then
7:       minoid[v.cid] ← v.oid
8:       for each i = 1 to k do
9:         Let y = q.oid, where (p, q) ← argmin_{(p,q) ∈ S_{v.cid,i}, p.oid ≥ v.oid} p.oid
10:        if u.cid ≠ i ∧ minoid[i] > y then
11:          minoid[i] ← y
12:        end if
13:      end for
14:    end if
15:  end for
16:  for each i = 1 to k do
17:    if i ≠ u.cid ∧ minoid[i] < y {y = q.oid and (p, q) = S_{u.cid,i}.head()} then
18:      S_{u.cid,i} ← S_{u.cid,i} ∪ {(u, minoid[i])}
19:    end if
20:  end for
21: end for
S_{u.cid,i}.head(). If S_{u.cid,i} is empty, we fill minoid[i] with ∞.

After that, we visit each of vertex u's immediate successors, v (Line 5). Our visit follows their topological order, i.e., the smallest vertex is visited first. Note that by following this order, when u has more than one immediate successor in the same chain, we only need to visit the smallest vertex among them (Line 6). Given this, the major operation is to update, using vertex v, the smallest vertices which u can reach on each chain, i.e., to update each minoid[i]. Such an update comes from two sources. The first source is v itself: if v.oid is smaller than the current minoid[v.cid], the edge (u, v) allows u to reach a smaller vertex on v's chain (Line 7). The second source is the smallest vertices on other chains which v can reach. In the latter case, for each chain Ci (Line 8), we need to get the pseudo-diagonal cell (p, q) in S_{v.cid,i}, where p and v are in the same chain and p is the smallest pseudo-diagonal start point that v can reach (Line 9). Thus, q is the smallest vertex in chain Ci that u can reach via the edge (u, v). Given this, we test if q is smaller than the current smallest vertex u can reach in chain Ci, and replace it if it is (Line 10). Finally, after visiting all of u's immediate successors, we add the cell (u, minoid[i]) to S_{u.cid,i} if it is a pseudo-diagonal cell (Lines 17 and 18).

The correctness of Algorithm 1 follows from the fact that we maintain in minoid the smallest vertex of each chain that vertex u can reach, and a cell (u, minoid[i]) is a pseudo-diagonal cell iff minoid[i] is less than the smallest vertex in S_{u.cid,i} (Corollary 1). The time complexity of this algorithm is O(mk log n) in the worst case: the two outer loops, Steps 3 to 21, run m times in total, since DAG G has m edges; the loop from Steps 8 to 13 runs k times; and Step 9 takes O(log n) time for a binary search in the worst case.
4. 3-HOP LABELING FOR TRANSITIVE
CLOSURE CONTOUR
4.1 Problem Definition
Our goal in this section is to compress the transitive closure contour, Con(G), using the 3-hop strategy. For any vertex pair (u, v) ∈ Con(G), we say u is an out-anchor vertex of the contour, and v is an in-anchor vertex. We will assign each out-anchor vertex a list of intermediate “entry points” of some chains and assign each in-anchor vertex a list of intermediate “exit points” of some chains. To recover the reachability between an out-anchor u and an in-anchor v, we check whether u can reach v in three hops, i.e., the first hop from u to an intermediate entry point, the second hop to the intermediate exit point, and the third hop from the exit point to v. Formally, we introduce the 3-hop reachability labeling for the contour set Con(G) as follows.
DEFINITION 4. (3-HOP Reachability Labeling) Let Con(G) be the transitive closure contour for G with respect to a chain decomposition. Let Vout and Vin be the sets of out-anchor vertices and in-anchor vertices for Con(G), respectively. A 3-hop reachability labeling assigns each out-anchor vertex u in Vout a label Lout(u) (a set of intermediate entry points), and each in-anchor vertex v in Vin a label Lin(v) (a set of intermediate exit points), such that Lout(u), Lin(v) ⊆ V(G), for every x ∈ Lout(u), u ⇝ x, and for every y ∈ Lin(v), y ⇝ v. Furthermore, we have the following two conditions:

(1) (u, v) ∈ Con(G) ⟹ ∃x ∈ Lout(u), ∃y ∈ Lin(v), such that x, y ∈ Ci for some chain Ci, and x ⇝ y

(2) for any x ∈ Lout(u), y ∈ Lin(v): x, y ∈ Ci and x ⇝ y ⟹ u ⇝ v

The size of the labeling is defined to be

    Cost(3hop) = Σ_{u∈Vout} |Lout(u)| + Σ_{v∈Vin} |Lin(v)|

To simplify our discussion, we assume u ∈ Lout(u) and v ∈ Lin(v).
THEOREM 2. Finding a minimum 3-hop reachability labeling for a given contour set Con(G) of a DAG G is an NP-hard problem.

Proof Sketch: We simply note that 3-hop labeling is a generalization of 2-hop labeling. □
To better understand this problem, we will describe it as a generalized “factorization” problem and then transform it into the classical set-cover problem. We start by partitioning each of the two anchor sets, Vout and Vin, according to their intermediate chains:

    V^i_out = {u | u ∈ Vout and Lout(u) ∩ Ci ≠ ∅}
    V^i_in  = {v | v ∈ Vin and Lin(v) ∩ Ci ≠ ∅}

Basically, V^i_out contains those out-anchor vertices which record intermediate vertices (entry points) in chain Ci. Similarly, V^i_in contains those in-anchor vertices which record intermediate vertices (exit points) in chain Ci. Further, for each u ∈ V^i_out, we define L^i_out(u) to be the vertex of Lout(u) ∩ Ci, and for each v ∈ V^i_in we define L^i_in(v) to be the vertex of Lin(v) ∩ Ci. By Corollary 1, Lout(u) ∩ Ci (or Lin(v) ∩ Ci) contains at most one vertex. Given
Figure 6: Generalized Join and Chain-Center Bipartite Graph
this, we introduce the following generalized join operator (similar to a Cartesian product):

    V^i_out ⊗ V^i_in = {(u, v) | u ∈ V^i_out, v ∈ V^i_in, L^i_out(u) ⇝ L^i_in(v)}

In Figure 6(a), assume all the vertices on the left of chain C2 record their corresponding entry points into chain C2, and all the vertices on the right record their exit points. For v = 12, 18, 19, assume Lin(v) ∩ C2 ≠ ∅. For u = 3, 18, 13, 14, 4, assume Lout(u) ∩ C2 ≠ ∅. Then V^2_out = {3, 18, 13, 14, 4} and V^2_in = {12, 18, 19}, and V^2_out ⊗ V^2_in contains all the vertex pairs (u, v), where u is on the left and v is on the right, such that u can reach v via the edges in the graph, i.e., {(3,12), (3,18), (3,19), ···, (4,19)}. It also contains all the edges in the graph, i.e., {(3,6), (7,12), ···, (9,19)}.

We consider {V^1_out, ···, V^k_out} ⊗ {V^1_in, ···, V^k_in} = (V^1_out ⊗ V^1_in) ∪ ··· ∪ (V^k_out ⊗ V^k_in) to be a generalized factorization. Hence, we define the cost of the factorization as follows:

    Cost(factorization) = Σ_{i=1}^{k} |V^i_out| + Σ_{i=1}^{k} |V^i_in|

Given this, we can rewrite our 3-hop reachability labeling problem as a generalized “factorization” problem: by assigning a label Lout(u) to each vertex u ∈ Vout and Lin(v) to each vertex v ∈ Vin, we want to find a factorization {V^1_out, ···, V^k_out} ⊗ {V^1_in, ···, V^k_in} with minimum cost such that

    Con(G) ⊆ (V^1_out ⊗ V^1_in) ∪ ··· ∪ (V^k_out ⊗ V^k_in)

It is easy to see that the 3-hop reachability labeling problem is equivalent to the generalized factorization of Con(G), where the 3-hop indexing cost is equivalent to the corresponding factorization cost:

    Cost(3hop) = Cost(factorization)
In the following subsections, we will derive efficient algorithms to produce a minimized factorization and thus also a minimized 3-hop labeling.
4.2 A Basic Approximation Algorithm for 3-Hop Cover
In this subsection, we will transform the factorization problem
into a set-cover problem. For this purpose, we will first introduce
the notion of the chain-center bipartite graph.
DEFINITION 5. (Chain-Center Bipartite Graph) Given a DAG G and a chain decomposition C1 ∪ C2 ∪ ··· ∪ Ck, we construct the chain-center bipartite graph for each chain as follows. Let Bi = (Xi ∪ Yi, Ei) be the chain-center bipartite graph, where

    Xi = {u | ∃a ∈ Ci such that (u, a) ∈ Con(G)} ∪ {b | b ∈ Ci such that ∃v, (b, v) ∈ Con(G)}
    Yi = {v | ∃b ∈ Ci such that (b, v) ∈ Con(G)} ∪ {a | a ∈ Ci such that ∃u, (u, a) ∈ Con(G)}
    Ei = {(x, y) | x ∈ Xi and y ∈ Yi and (x, y) ∈ Con(G)}
Figure 6(b) is an example showing the bipartite graph for chain C2. Now we can transform the factorization problem into the set-cover problem as follows. Let the grounding set be Con(G). Let the set of candidates be {B̂i | B̂i is a subgraph of Bi, where 1 ≤ i ≤ k}. The weight of a candidate bipartite subgraph should reflect the related index cost, which is defined as the number of vertices in V(B̂i), i.e., weight(B̂i) = |V(B̂i)|. For example, in Figure 6, the circled bipartite subgraph has weight 5.
Then we may apply the classical greedy algorithm [8] to find a minimal set cover as follows. Let R be the set of uncovered contour pairs (initially, R = Con(G)). For each candidate set B̂i, with vertex sets X(B̂i) ⊆ Xi and Y(B̂i) ⊆ Yi and edge set E(B̂i) ⊆ Ei, we define the compression ratio of selecting B̂i as

    ρ(B̂i) = |E(B̂i) ∩ R| / weight(B̂i) = |E(B̂i) ∩ R| / (|X(B̂i)| + |Y(B̂i)|)

At each iteration, the greedy algorithm selects the candidate set with the highest compression ratio and puts it in the resulting set. Then, the algorithm updates R by removing the newly covered contour pairs, R = R \ E(B̂i). The procedure proceeds until all contour pairs are covered (i.e., R = ∅).
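The overall greedy loop is then a short routine; a minimal sketch, assuming a helper densest_subgraph(B, R) that returns (vertices, covered pairs, density) of an approximately densest subgraph of B restricted to the uncovered pairs R (the helper itself is developed below):

    def greedy_3hop_cover(bipartite_graphs, contour, densest_subgraph):
        R = set(contour)                      # uncovered contour pairs
        chosen = []                           # selected (chain, subgraph) joins
        while R:
            best = max((densest_subgraph(B, R) + (i,)
                        for i, B in enumerate(bipartite_graphs)),
                       key=lambda t: t[2])    # pick the highest density
            verts, covered, _, i = best
            chosen.append((i, verts))         # record entry/exit points on chain i
            R -= covered                      # drop newly covered pairs
        return chosen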
It has been proved that the approximation ratio of this algorithm is ln(|Con(G)|) + 1 [8]. We now link this problem and its results back to the aforementioned factorization problem. First, we note that picking a subgraph B̂i in the set cover corresponds to adding a generalized join between X(B̂i) and Y(B̂i), i.e., X(B̂i) ⊗ Y(B̂i). This is because each non-Ci vertex v in B̂i needs to record in Lout(v) an entry point to chain Ci, or to record in Lin(v) an exit point from chain Ci. It is easy to observe that non-Ci vertices account for at least half of B̂i. Given such a labeling, we can guarantee to cover all the edges of E(B̂i), i.e., X(B̂i) ⊗ Y(B̂i) ⊇ E(B̂i). Here, we may produce some edges which do not belong to the contour, but this does not affect the set-cover results. Indeed, in the factorization formulation, we may also produce extra edges which do not belong to the contour. However, those edges all belong to the complete transitive closure and thus do not affect the correctness of our reachability indexing.

Second, we note that the optimal set-cover result will choose at most one subgraph from each chain-center bipartite graph, i.e., each vertex in each bipartite graph will be selected only once. In the greedy algorithm, we may find several subgraphs which all come from the same bipartite graph. In this case, we can simply combine their label sets, and the weight of the resulting subgraph will be no higher than the sum of the weights of these individual subgraphs. Thus, the optimal result of the set-cover problem can be rewritten exactly as a factorization result with each chain having at most one join centered on it, and our approximation bound is maintained.

However, the major issue here is that the number of candidate subgraphs is exponential. A similar issue exists for 2-hop labeling. As suggested in [9], we can deal with this problem by realizing that finding the B̂i of the highest compression ratio is equivalent to finding the densest subgraph of the bipartite graph B′_i = (Xi ∪ Yi, Ei ∩ R).
Given this, the basic idea of the 3-hop labeling algorithm is as follows: in each iteration, we first find the densest subgraph of each bipartite graph B′_i, and then among these k subgraphs we choose the densest one and update the set R of uncovered contour pairs. We repeat this iteration until R is empty.
Since finding the densest subgraph forms the core of our 3-hop
labeling algorithm, we formulate it precisely here:
DEFINITION 6. (Densest Subgraph Problem) Let G = (V, E) be a graph (directed or undirected). For any subset Vs ⊆ V, let G[Vs] = (Vs, Es) be the induced subgraph of G, i.e., Es = E ∩ (Vs × Vs). The densest subgraph problem is to find a subset Vs ⊆ V such that the density of the induced subgraph Gs = (Vs, Es), d = |Es| / |Vs|, is maximized.
The fastest exact algorithm for the densest subgraph problem runs in O(|V||E| log(|V|²/|E|)) [10]. In 2-hop labeling [9], the authors suggest using a linear 2-approximation algorithm for the densest subgraph problem. Their algorithm is a simple variant of [14]. It iteratively removes a vertex with minimal degree from the graph, which yields |V| intermediate subgraphs. It returns the densest of them, which is a 2-approximate densest subgraph, and runs in time linear in the number of edges of the graph.
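A minimal sketch of this peeling procedure (for simplicity it uses a heap, giving O(m log n) rather than the bucket-based linear time; adj maps each vertex to its neighbor set):

    import heapq

    def approx_densest(adj):
        adj = {v: set(ns) for v, ns in adj.items()}    # local, mutable copy
        m = sum(len(ns) for ns in adj.values()) // 2
        heap = [(len(ns), v) for v, ns in adj.items()]
        heapq.heapify(heap)
        best, best_d = set(adj), m / max(len(adj), 1)
        while adj:
            d, v = heapq.heappop(heap)
            if v not in adj or len(adj[v]) != d:
                continue                               # stale heap entry
            for w in adj.pop(v):                       # peel a min-degree vertex
                adj[w].discard(v)
                heapq.heappush(heap, (len(adj[w]), w))
            m -= d
            if adj and m / len(adj) > best_d:
                best, best_d = set(adj), m / len(adj)
        return best, best_d

The densest of the |V| intermediate vertex sets is returned; by the argument in [14], its density is at least half the optimum.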
In the next subsection, we will introduce a new approach to iden-
tify the densest subgraph, which will allow us to prune the search
space of these candidate subgraphs significantly.
4.3 A Faster Algorithm for 3-HOP Labeling
To describe our new algorithm for densest subgraph discovery,
we introduce the rank subgraph.
DEFINITION 7. (Rank Subgraph) Let G = (V, E) be an undirected graph. Given a positive integer d, we remove all the vertices of degree less than d, together with their adjacent edges, from G, and repeat this procedure on the resulting graph. Let Gd be the final subgraph, in which each vertex is adjacent to at least d other vertices of Gd. If no vertices are left in the graph, we refer to the result as the empty graph, denoted G∅. Given this, we construct a subgraph sequence G ⊇ G1 ⊇ G2 ⊇ ··· ⊇ Gl ⊃ Gl+1 = G∅, where Gl ≠ G∅ and Gl contains at least l + 1 vertices. We define l as the rank of the graph G, and Gl as the rank subgraph of G.

Given this, we will use Gl as the approximate densest subgraph.
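A hedged sketch of computing the rank l and the rank subgraph Gl (essentially the largest d whose d-core is non-empty), again with adj mapping each vertex to its neighbor set:

    def rank_subgraph(adj):
        core = {v: set(ns) for v, ns in adj.items()}
        l, G_l = 0, {v: set(ns) for v, ns in core.items()}
        d = 1
        while core:
            queue = [v for v, ns in core.items() if len(ns) < d]
            while queue:                       # iteratively peel low-degree vertices
                v = queue.pop()
                for w in core.pop(v, set()):
                    core[w].discard(v)
                    if len(core[w]) < d:
                        queue.append(w)
            if core:                           # G_d is non-empty, so the rank is >= d
                l, G_l = d, {v: set(ns) for v, ns in core.items()}
            d += 1
        return l, G_l

In Algorithm 2 below, this computation is additionally restricted to the still-uncovered edge set R and to vertices above a rank threshold.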
LEMMA 1. Given G, let Gs be the densest subgraph of G, with density d(Gs), and let Gl be its rank subgraph, with density d(Gl). Then the density of Gl is no less than half of the density of Gs:

    d(Gl) ≥ d(Gs) / 2

Proof Sketch: We prove this by way of contradiction. Suppose d(Gl) < d(Gs)/2, which suggests

    d(Gs) > 2 · d(Gl) = 2|E(Gl)| / |V(Gl)| ≥ 2 (l · |V(Gl)| / 2) / |V(Gl)| = l

Then, we claim that each vertex in Gs must have degree more than l, i.e., for any v ∈ V(Gs), degree(v) > l. If not, assume v ∈ V(Gs) has degree dv ≤ l. Then we could simply remove this vertex to increase the density of the subgraph:

    (|E(Gs)| − dv) / (|V(Gs)| − 1) = (d(Gs)|V(Gs)| − dv) / (|V(Gs)| − 1) > (d(Gs)|V(Gs)| − d(Gs)) / (|V(Gs)| − 1) = d(Gs)

Since each vertex in Gs has degree more than l, we conclude that Gs ⊆ Gl+1. However, Gl+1 = G∅, which contradicts the assumption that there is a Gs with density more than 2 · d(Gl). □
Following this, we have the following interesting observation.

THEOREM 3. Consider k bipartite graphs B1, B2, ···, Bk. Let l1, l2, ···, lk be their respective ranks, let S1, S2, ···, Sk be their respective densest subgraphs, let G_{l1}(B1), G_{l2}(B2), ···, G_{lk}(Bk) be their respective rank graphs, and let lmax = max(l1, l2, ···, lk). Assume we have several maximum rank graphs of rank lmax. Then we claim that any maximum rank graph G_{li}(Bi) with li = lmax has a density no less than half of the density of the maximal-density subgraphs:

    d(G_{li}(Bi)) ≥ max_{1≤j≤k} d(Sj) / 2

Proof Sketch: The proof is similar to that of Lemma 1. We prove this by way of contradiction. Suppose d(G_{li}(Bi)) < max_{1≤j≤k} d(Sj) / 2. Then we can derive

    max_{1≤j≤k} d(Sj) > 2 d(G_{li}(Bi)) = 2|E(G_{li}(Bi))| / |V(G_{li}(Bi))| ≥ 2 (li · |V(G_{li}(Bi))| / 2) / |V(G_{li}(Bi))| = li = lmax

Suppose d(Sp) = max_{1≤j≤k} d(Sj). Then, by a similar argument as in the proof of Lemma 1, all vertices in Sp have degree more than lmax. Hence we conclude Sp ⊆ G_{lmax+1}(Bp). However, G_{lmax+1}(Bp) = G∅ according to the definition of the rank graph, a contradiction. □
The key implication of Theorem 3 is that we can organize all the bipartite graphs in a queue based on their ranks. If we know l is the highest rank among all the bipartite graphs, then we can return the first rank subgraph of rank l we find among these bipartite graphs as a 2-approximate densest subgraph. We employ this technique in the greedy algorithm by deriving an efficient incremental search procedure for the densest subgraph among these bipartite graphs at every iteration.
Algorithm 2 3HOPContour(G, Con(G), C1 ∪ ··· ∪ Ck)
1: Construct bipartite graphs B1, ···, Bk;
2: For each Bi, construct vertex rank groups, compute the rank r_i of Bi and the density d_i of the rank graph G_{ri}(Bi);
3: Sort all Bi into queue Q in descending order of r_i;
4: R ← Con(G);
5: Pop the first element B from the queue Q;
6: while R ≠ ∅ do
7:   while B.r < B′.r {B′ (= Q.pop()) is the next element in Q after popping the last bipartite graph; B′.r is its saved rank} do
8:     B′.r ← RankSubgraph(B′, R, B.r)
9:     if B.r < B′.r then
10:      insert B back into Q in the sorted order;
11:      B ← B′
12:    else
13:      insert B′ back into Q in the sorted order;
14:    end if
15:  end while
16:  R ← R \ E(G_r(B));
17:  Update Lout and Lin for the vertices in the selected G_r(B);
18:  B.r ← RankSubgraph(B, R, 0);
19: end while
The sketch of our 3-hop labeling construction algorithm, 3HOPContour, is given in Algorithm 2. It starts by constructing k bipartite graphs, each corresponding to a chain in 3-hop. Initially, we directly compute the rank of each bipartite graph and the density of its corresponding rank subgraph (Line 2). We then sort all the bipartite graphs by rank and put them in a queue Q
(Line 3). Our goal is to cover the entire transitive closure contour R = Con(G). The algorithm iteratively picks the densest subgraphs and removes their edges until all the edges (vertex pairs) in the transitive closure contour are covered (R = ∅). During this covering process, we can make the following observation about the rank of each bipartite graph: the rank of any bipartite graph cannot increase during the covering process. This is because, as the covering proceeds, an increasing number of edges in the contour become covered, and likewise for the edge set of each bipartite graph. Say at a certain iteration we compute the rank of a bipartite graph B, denoted B.r. If we later reevaluate its rank on the updated graph, whose edge set is E(B) ∩ R, the updated rank cannot exceed the earlier rank B.r. Indeed, we can use B.r as an upper bound on B's new rank.

To further speed up the rank subgraph search procedure, we organize the vertices of each bipartite graph into rank groups: for a given bipartite graph with rank l, let Gd be the resulting subgraph as we iteratively remove all the vertices of degree less than d. Thus, we have a subgraph sequence G ⊇ G1 ⊇ G2 ⊇ ··· ⊇ Gl ⊃ Gl+1 = G∅. We assign each vertex v a rank d if v ∈ V(Gd) and v ∉ V(Gd+1). Given this, all the vertices with the same rank are organized together in each bipartite graph. We note that the rank of each vertex cannot increase during the covering process either. Thus, using this organization, we can quickly prune the vertices with rank lower than a given threshold. This is applied to facilitate the rank graph search procedure.
The main iteration of our algorithm is the loop in Lines 6 to 19. In every iteration, we greedily select the densest subgraph among our k bipartite graphs. This is done using the queue in the while loop of Lines 7 to 15, visiting each bipartite graph in queue order (Line 7). Let B be the bipartite graph with the highest rank among all bipartite graphs visited in the current iteration. We always extract the first bipartite graph B′ from the queue Q and compare its saved rank B′.r, which is an upper bound on its real rank, with B's real rank B.r.
If B′.r ≤ B.r, we know that the current rank is the highest any bipartite graph can achieve, since none of the remaining bipartite graphs in the queue can have a rank higher than B′.r. Thus, we do the following: 1) we extract the highest-ranked subgraph G_r(B) and use it to cover R (Line 16); 2) we update Lout and Lin for the vertices in G_r(B) (Line 17); 3) we recompute the rank of B immediately and use it as the first candidate rank for the next iteration (Line 18).
However, if this is not the case (B′.r > B.r), we need to check whether the true rank of B′ is higher than B.r. Here we apply the vertex rank group organization to speed up the search: since we already have bipartite graph B with rank B.r, we are not interested in B′ if its rank is equal or lower. Thus, we invoke the RankSubgraph procedure with three parameters: B′, the targeted bipartite graph; R, the uncovered edges; and the minimal rank in which we are interested, in this case B.r, since only higher ranks matter. The procedure uses R to remove the edges of B′ that are no longer in R and to update the vertex rank groups, touching only the vertices with rank higher than B.r. This is done in Line 8. For brevity, we omit the details of the RankSubgraph procedure.
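This control flow, where cached ranks serve as stale upper bounds that are re-validated only on demand, is an instance of the classic lazy-greedy pattern. The following Python sketch isolates that pattern; the evaluate, cover, and done helpers are hypothetical stand-ins for RankSubgraph and the contour-covering bookkeeping of Algorithm 2, and we assume the candidates suffice to complete the cover:

import heapq

def lazy_greedy_cover(candidates, evaluate, cover, done):
    # Cached scores act as upper bounds: true scores only decrease as the
    # cover grows (mirroring how a bipartite graph's rank never increases
    # while the contour is being covered).
    heap = [(-evaluate(c), c) for c in candidates]   # max-heap via negation
    heapq.heapify(heap)
    while not done():
        neg_cached, c = heapq.heappop(heap)
        score = evaluate(c)                  # lazy re-evaluation
        if heap and score < -heap[0][0]:
            # The cached bound was stale; another candidate may now be
            # best, so push c back with its refreshed score and retry.
            heapq.heappush(heap, (-score, c))
            continue
        cover(c)                             # c is certifiably the best
        heapq.heappush(heap, (-evaluate(c), c))   # refresh, as in Line 18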
Putting all of this together, we can see that Algorithm 2 creates Lout and Lin for the out-anchor and in-anchor vertices of the transitive closure contour Con(G). As an example, in Figure 6 one of the densest bipartite subgraphs is the circled subgraph, which could be selected by Algorithm 2. If selected, Algorithm 2 will add 9 to Lout(4), Lout(14), and Lin(19). The complete labeling sets Lin and Lout produced by Algorithm 2 are shown in Figure 7, where we show Lin(u) (or Lout(u)) of a vertex u only if it is not empty, marking Lin sets with i and Lout sets with o.
Finally, we can state the following optimality guarantees for our 3HOPContour algorithm. Due to space constraints, we omit the proofs.
THEOREM 4. The 3HOPContour algorithm finds a 3-hop labeling for the transitive closure contour Con(G) whose size is larger than the smallest such labeling by at most an O(ln |Con(G)| + 1) = O(log n) factor, where n is the number of vertices in G.
THEOREM 5. For any DAG G, the minimum 3-hop labeling cost (defined previously as Cost(3hop)) for the transitive closure contour Con(G), Opt_3hop, is always no larger than the minimum labeling cost of 2-hop, Opt_2hop. In addition, the upper bound of the 3-hop labeling cost produced by the 3HOPContour algorithm, O((ln |Con(G)| + 1) · Opt_3hop), is always no larger than O((ln |V|² + 1) · Opt_2hop), the upper bound of the labeling cost produced by Cohen et al.'s 2-hop algorithm [9].
Figure 7: 3-Hop Labeling of Transitive Closure Contour
5. REACHABILITY QUERY PROCESSING USING 3-HOP INDEXING
In Section 4, we show how to construct the 3-hop labeling for the transitive closure contour. As a result of Algorithm 2, we get Lout(u) and Lin(v) for each out-anchor vertex u and each in-anchor vertex v, respectively. In this section we will show how to efficiently answer reachability queries using these labelings. We describe two approaches: the first approach directly applies the 3-hop labeling of the contour to achieve a worst-case time complexity of O(|Con(G)|), while the second approach utilizes segments to reduce the query processing complexity.
5.1 3-HOP Contour Query Processing
Note that the 3-hop labeling of the transitive closure contour Con(G) ensures that the reachability of any pair of vertices in a DAG G can be inferred. This is because the 3-hop labeling covers all the vertex pairs in Con(G), and Con(G) covers all the other vertex pairs in the transitive closure matrix.
Given this, to tell whether vertex u in chain Ci can reach vertex v in chain Cj, we can first recover the pseudo-diagonal of M_{Ci,Cj} using the 3-hop labeling and then test whether (u, v) is dominated by any of the pseudo-diagonal cells. However, we do not need to consider those pseudo-diagonal cells, or the closure vertex pairs, whose out-anchor vertex is smaller than u or whose in-anchor vertex is bigger than v. We can integrate these steps together and obtain the following query processing procedure:
Step 1: In chain Ci (u ∈ Ci), we collect the smallest vertex on every other chain that u can reach through an out-anchor vertex u′ with u ⪯ u′, writing Lout^{x.cid}(u′) = Lout(u′) ∩ C_{x.cid}:

    X = { x | x ∈ ∪_{u ⪯ u′} Lout(u′) and x ⪯ Lout^{x.cid}(u′) for every u′ ⪰ u }

Step 2: In chain Cj (v ∈ Cj), we collect the largest vertex on every other chain that can reach v through an in-anchor vertex v′ with v′ ⪯ v, writing Lin^{y.cid}(v′) = Lin(v′) ∩ C_{y.cid}:

    Y = { y | y ∈ ∪_{v′ ⪯ v} Lin(v′) and Lin^{y.cid}(v′) ⪯ y for every v′ ⪯ v }

Step 3: We check whether there is a pair x ∈ X, y ∈ Y such that x.cid = y.cid and x ⪯ y.
Using the highway analogy, the first step collects the entry points u can reach on the intermediate chains, the second step collects the exit points which reach v on the intermediate chains, and the third step checks whether an entry point can reach an exit point, i.e., whether they are on the same chain with the entry point having the smaller sequence number. Note that the worst-case query processing cost is O(|Con(G)|). This follows from the fact that for any out-anchor vertex u′ and any v′ ∈ Lout(u′), we have (u′, v′) ∈ Con(G) (and similarly for any in-anchor vertex). Thus, the first two steps cost at most O(|Con(G)|) time, and the third step costs O(k), where k is the number of chains in the chain decomposition.
For example, in Figure 7, to tell whether u = 2 can reach v = 20, we get set X = {6, 15} by checking Lout(3), Lout(4), and Lout(5), and set Y = {9, 13} by checking Lin(19), Lin(18), and Lin(17). Since 6 (∈ X) reaches 9 (∈ Y) in C2, we conclude that u can reach v.
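To make the three steps concrete, here is a minimal Python sketch of the contour query. The (cid, oid) dictionaries and list-of-chains representation are our assumptions, and the two explicit boundary checks (for the case where the intermediate chain is u's or v's own chain) are ours as well:

def reach_contour(u, v, cid, oid, chains, Lout, Lin):
    # cid[w], oid[w]: chain id and 1-based sequence number of vertex w;
    # chains[c]: vertices of chain c in sequence order;
    # Lout[w] / Lin[w]: contour labels of w as (chain id, oid) pairs.
    if cid[u] == cid[v]:
        return oid[u] <= oid[v]               # same chain: order decides
    X = {}                                    # chain id -> smallest entry oid
    for u2 in chains[cid[u]][oid[u] - 1:]:    # out-anchors u' with u <= u'
        for c, o in Lout.get(u2, ()):
            if c not in X or o < X[c]:
                X[c] = o
    Y = {}                                    # chain id -> largest exit oid
    for v2 in chains[cid[v]][:oid[v]]:        # in-anchors v' with v' <= v
        for c, o in Lin.get(v2, ()):
            if c not in Y or o > Y[c]:
                Y[c] = o
    # Boundary cases (our addition): the intermediate chain may be v's
    # or u's own chain.
    if cid[v] in X and X[cid[v]] <= oid[v]:
        return True
    if cid[u] in Y and oid[u] <= Y[cid[u]]:
        return True
    # Step 3: an entry point precedes an exit point on a common chain.
    return any(c in Y and X[c] <= Y[c] for c in X)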
5.2 3-HOP Segment Query Processing
In this subsection, we introduce an indexing method on top of the 3-hop contour labeling that reduces the query processing complexity; the worst-case query processing time becomes O(log n + k), where n is the number of vertices in G. The major bottleneck of the first approach lies in its first two steps. To speed them up, the new approach breaks each chain into segments. Specifically, each chain Ci is broken into outgoing segments and incoming segments.
We construct the segments of chain Ci with respect to another chain Cj based on the 3-hop contour labeling. Let Qout(i, j) be the set of out-anchor vertices of chain Ci which record an intermediate entry point in chain Cj:

    Qout(i, j) = { x | x ∈ Ci, Lout(x) ∩ Cj ≠ ∅ }

Let Qin(i, j) be the set of in-anchor vertices of chain Cj which record an intermediate exit point in chain Ci:

    Qin(i, j) = { y | y ∈ Cj, Lin(y) ∩ Ci ≠ ∅ }
Then we order the vertices x1, ..., xl in Qout(i, j) such that x1 ⪯ x2 ⪯ ··· ⪯ xl, l = |Qout(i, j)|, and the vertices y1, ..., yl′ in Qin(i, j) such that y1 ⪯ y2 ⪯ ··· ⪯ yl′, l′ = |Qin(i, j)|. Given this, we construct the outgoing segments for Ci, denoted by their sequence numbers,

    (1, x1.oid), (x1.oid + 1, x2.oid), ..., (x_{l−1}.oid + 1, xl.oid)

and the incoming segments for Cj,

    (y1.oid, y2.oid − 1), (y2.oid, y3.oid − 1), ..., (y_{l′}.oid, Cj.last().oid)
For example, in Figure 7, the outgoing segments constructed from Qout(1, 2) are (1, 3) and (4, 4), where Lout^2(1, 3) = 6 and Lout^2(4, 4) = 9. The incoming segments constructed from Qin(3, 4) are (17, 17) and (18, 20), where Lin^3(17, 17) = 11 and Lin^3(18, 20) = 13.
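A minimal Python sketch of the outgoing-segment construction for one chain pair follows (names are illustrative); it relies on the property stated in the next paragraph, namely that all vertices of a segment share the entry point recorded at the segment's closing anchor:

def outgoing_segments(chain_i, cj, Lout, oid):
    # chain_i: vertices of Ci in sequence order; cj: chain id of Cj;
    # oid[w]: 1-based sequence number of w on its chain.
    # Returns [((start_oid, end_oid), entry_oid_on_Cj), ...].
    q = [x for x in chain_i
         if any(c == cj for c, _ in Lout.get(x, ()))]    # Qout(i, j)
    segments, start = [], 1
    for x in q:
        entry = min(o for c, o in Lout[x] if c == cj)    # shared entry point
        segments.append(((start, oid[x]), entry))
        start = oid[x] + 1
    return segments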
We say a vertex v is in a segment S = (x, y) (denoted as v ∈ S) if x ≤ v.oid ≤ y. We note that all the vertices in each outgoing segment share the same entry point of chain Cj, and all the vertices in each incoming segment share the same exit point of chain Ci. Thus, we assign each outgoing segment (or incoming segment) a unique vertex on chain Cj (or Ci) as its label.
In the 3-hop segment indexing, we construct these outgoing and incoming segments of each chain with respect to every other chain. Then, for all the segments which share the same starting vertex and ending vertex, we combine their individual labels into Lout(S), where S is the combined segment. In addition, to facilitate query processing, we construct an interval tree [3] for all the outgoing segments in a single chain Ci and an interval tree for all the incoming segments in a single chain Cj. Given this, the new query processing procedure for answering whether u can reach v is as follows:
Step 1: In chain Ci (u ∈ Ci), we collect all the outgoing segments that contain u and combine their labels into X:

    X = { x | x ∈ ∪_{S ∋ u} Lout(S) and x ⪯ Lout^{x.cid}(S) for every S ∋ u }

Step 2: In chain Cj (v ∈ Cj), we collect all the incoming segments that contain v and combine their labels into Y:

    Y = { y | y ∈ ∪_{S ∋ v} Lin(S) and Lin^{y.cid}(S) ⪯ y for every S ∋ v }

Step 3: We check whether there is a pair x ∈ X, y ∈ Y such that x.cid = y.cid and x ⪯ y.
The worst-case query processing time is O(log n + k). Although the number of segments can be as large as n², the number of segments covering u or v is no more than k, and the interval tree returns the segments covering u in O(log n + k) time. Finally, we note that the extra segments contribute an O(nk) storage cost on top of the 3-hop contour labeling.
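Within a single chain pair the segments are disjoint and sorted, so in a sketch a plain binary search can stand in for the interval tree; the interval tree of [3] is needed once segments from different chain pairs are combined and may overlap. A minimal Python sketch of the per-pair lookup:

import bisect

def segment_label(segments, w_oid):
    # segments: [((start, end), label), ...], sorted by start and disjoint
    # within one chain pair; returns the label of the segment containing
    # sequence number w_oid, or None if w_oid lies in no segment.
    starts = [s for (s, _), _ in segments]
    i = bisect.bisect_right(starts, w_oid) - 1
    if i >= 0:
        (s, e), label = segments[i]
        if s <= w_oid <= e:
            return label
    return None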
6. EXPERIMENTAL EVALUATION
In this section, we empirically compare the new 3-hop labeling approach with the state-of-the-art simple graph covering approach, the path-tree cover, and with the 2-hop labeling approach, on both synthetic and real data. We also list the query times of two classical approaches, breadth-first search and depth-first search, as baselines. We are particularly interested in the following issues:
1. Index size: The major goal of this work is to derive an indexing scheme for reachability queries that can significantly compress the transitive closure when the ratio between the number of edges and the number of vertices is relatively high. Specifically, we would like to learn how much we gain by using 3-hop labeling compared with the two best available indexing approaches, path-tree and 2-hop. Since each vertex in the path-tree is labeled by three numbers (two tree-interval numbers and one depth-first order number), and each vertex in 3-hop is labeled by two numbers (cid and oid), we define the index size of the path-tree scheme for a graph G = (V, E) to be the size of the transitive closure plus 3|V|, and the index size of 3HOP-Contour to be cost(3hop) (defined in Subsection 4.1) plus 2|V|. The index size of 3HOP-Segment is the size of all segments, i.e., two times the number of segments, plus the cost of labeling; in this case each segment, rather than each vertex, carries a label. It is easy to observe that the total labeling cost of 3HOP-Segment is cost(3hop), the same as that of 3HOP-Contour (see the sketch after this list).
2. Query processing time: As we mentioned before, there is a
trade-off between the compression rate of the transitive clo-
sure and the query answering time. In order to achieve a
high compression rate, the 3-hop indexing approach clearly
requires more runtime processing for answering reachability
queries than path-tree. However, the interesting question is
how fast 3-hop can answer queries and whether it is compa-
rable with path-tree and 2-hop.
3. Construction time: A major advantage of 3-hop compared with 2-hop is that it does not require computing the full transitive closure, and it employs a new strategy to speed up the densest subgraph identification. How much can these factors speed up the labeling process of 3-hop compared with the 2-hop approach?
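As noted in issue 1 above, the index-size accounting reduces to simple arithmetic; a minimal Python sketch (function and parameter names are ours):

def index_sizes(num_vertices, tc_size, cost_3hop, num_segments):
    # tc_size: transitive closure size stored by Path-Tree;
    # cost_3hop: total 3-hop labeling cost (Subsection 4.1);
    # num_segments: number of combined segments in 3HOP-Segment.
    path_tree   = tc_size + 3 * num_vertices     # three numbers per vertex
    hop_contour = cost_3hop + 2 * num_vertices   # cid and oid per vertex
    hop_segment = cost_3hop + 2 * num_segments   # two endpoints per segment
    return path_tree, hop_contour, hop_segment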
Given this, we have compared six algorithms in the experimental evaluation: 1) the original 2-hop approach by Cohen et al. [9], denoted 2HOP; 2) the path-tree approach (PTree-1) proposed by Jin et al. [12], denoted Path-Tree; 3) the 3-hop labeling approach with 3-hop contour query processing, denoted 3HOP-Contour; 4) the 3-hop labeling approach with 3-hop segment query processing, denoted 3HOP-Segment; 5) breadth-first search; and 6) depth-first search. We have implemented all six algorithms; the Path-Tree implementation is an improved second version of the one in [12]. In addition, since 3-hop needs a chain decomposition, we implemented a heuristic algorithm developed by Jagadish (procedure-3 in [11]). All algorithms are implemented in C++ using the Standard Template Library (STL). We perform the experiments on a Linux 2.6 machine with a 2.0GHz CPU and 8.0GB RAM.
In the experiments, we collect all three measures: the index size, the query time, and the index construction time; each experiment processes 100,000 randomly generated queries.
6.1 Synthetic Datasets
Here, we run two sets of experiments using synthetic DAGs, which are generated by the random directed acyclic graph generation algorithm described in [13].
In the first experiment, we generate a set of DAGs with 2,000 vertices and vary their average density from 2 to 12. We compare all six approaches, 3HOP-Segment, 3HOP-Contour, 2HOP, Path-Tree, breadth-first search, and depth-first search, in this experiment.
From Figure 8, both 3HOP-Segment and 3HOP-Contour consistently obtain a better index size compression rate than 2HOP and Path-Tree on all synthetic datasets. Overall, the index sizes of 3HOP-Contour and 3HOP-Segment are on average about 2.7 times and 2.0 times better than the Path-Tree approach, and about 1.5 times and 1.1 times better than 2HOP.
Figure 8: Index size of Synthetic Datasets (2K). [Plot: index size vs. |E|/|V| for 3HOP-Contour, 3HOP-Segment, Path-Tree, and 2HOP on the rand2k datasets.]
Figure 9: Index size of Synthetic Datasets (10K). [Plot: index size vs. |E|/|V| for 3HOP-Contour, 3HOP-Segment, and Path-Tree on the rand10k datasets.]
Dataset     3HOP-Contour  3HOP-Segment  Path-Tree  2HOP     Breadth-First Search  Depth-First Search
rand2k_2    22.865        165.646       9.108      70.239   891.957               891.502
rand2k_4    49.354        566.175       26.051     297.801  2197.01               1796.84
rand2k_6    77.686        1092.2        33.785     514.546  4397.49               4358.84
rand2k_8    103.769       1422.82       31.626     589.059  6134.99               7553.84
rand2k_10   124.291       1661.82       28.322     574.64   7499.31               11305.3
rand2k_12   141.825       1748.04       28.411     722.005  8628.21               14917.4
Table 2: Query Time of Synthetic Datasets (2K, in ms)
Dataset   DAG #V  DAG #E  Density
Arxiv     6000    66707   11.12
Citeseer  10720   44258   4.13
Go        6793    13361   1.97
Pubmed    9000    40028   4.45
Yago      6642    42392   6.38
Table 3: Real datasets
On the other hand, in Table 2 we observe that Path-Tree has a moderately faster query time than 3HOP-Contour, as expected. However, 3HOP-Contour has not only a smaller index size but also a shorter query time than 2HOP, as shown in Table 2. It is interesting to observe that 3HOP-Contour is faster than 3HOP-Segment, even though the query time complexity of 3HOP-Segment is better. In practice, 3HOP-Segment needs more memory access operations (e.g., searching interval trees and processing search results), and the interval trees are too big to be loaded into the system caches. Thus, it is reasonable that the query time of 3HOP-Contour is better than that of 3HOP-Segment.
In terms of construction time, 3HOP-Contour and 3HOP-Segment are several orders of magnitude faster than 2HOP. In this experiment, 2HOP takes between 7 and 21 hours to construct the index for a dataset, while 3HOP-Segment and 3HOP-Contour take only 1 to 71 seconds. To explain this, note that 3-hop takes O((kn²) · |Con(G)|) construction time (recall that we have k bipartite graphs corresponding to k chains, and each bipartite graph starts as a complete bipartite graph with O(n²) edges), while 2-hop takes O(n³ · |Tc|), where |Con(G)| is the number of contour points and |Tc| is the size of the transitive closure. Although in the worst case |Con(G)| can equal |Tc|, in practice |Con(G)| is much smaller. In addition, we have developed and implemented a new technique (Theorem 3) which can speed up 3-hop labeling by up to a factor of O(k).
In the second experiment, we generate random DAGs with 10,000 vertices and vary their densities from 2 to 25. Note that we do not compare with 2HOP in this experiment because 2HOP cannot process such large-scale datasets due to memory constraints. Figure 9 shows the index sizes of the two 3-hop approaches and the path-tree approach. Here, 3HOP-Contour and 3HOP-Segment achieve up to 6.0 times and 5.3 times smaller index sizes than Path-Tree; on average, they have 3.9 times and 3.1 times smaller index sizes, respectively. The query processing time and construction time are similar to those of the first experiment, and we omit them here.
It is interesting to observe a peak at density 10 in the index size of all three algorithms. Since 3-hop labeling relies on chain decomposition and path-tree labeling depends on path decomposition, an increase in density can result in a better chain or path decomposition (i.e., one with fewer chains or paths for the DAG). This can explain the peak phenomenon.
6.2 Real Datasets
To evaluate our indexing scheme on real datasets, we have collected the five real datasets listed in Table 3. All graphs are extracted from real-world large datasets with density larger than or close to 2. Among them, arXiv is extracted from a dataset of citations among scientific papers from the arxiv.org website^1. Similarly, citeseer contains citations among scientific literature publications from the CiteSeer project^2, and pubmed was extracted from an XML registry of open-access medical publications from the PubMed Central website^3. GO contains genetic terms and their relationships from the Gene Ontology project^4. Yago describes the structure of relationships among terms in the semantic knowledge database from the YAGO project^5.
Table 4 shows the index size and query time of three methods: the two 3-hop approaches and the path-tree approach. Again, in this experiment the 2HOP approach fails by running out of memory. As shown in the table, the index sizes of 3HOP-Contour are reduced significantly with respect to Path-Tree, and the index sizes of 3HOP-Segment are smaller than Path-Tree on 3 out of 5 datasets. On average, 3HOP-Contour and 3HOP-Segment obtain 1.7 times and 1.2 times better compression rates than the Path-Tree approach. As expected, the query time of Path-Tree is better than that of the 3-hop approaches.
The 3HOP-Contour has a similar construction time to 3HOP-Segment; therefore, we only report the 3HOP-Contour construction time here. It takes 8530, 106, 25, 257, and 25 seconds to construct the index for the arXiv, citeseer, go, pubmed, and yago datasets, respectively. Path-Tree is much faster, taking only 10, 0.73, 0.2, 0.77, and 0.55 seconds, respectively, for these datasets. This is expected since the 3-hop approach is computationally more expensive. However, the new approach has an evidently higher compression rate, and its query processing time is comparable to the path-tree approach.
7. CONCLUSION
In this work, we introduce a new 3-hop indexing scheme with a high compression rate, targeting directed graphs with a higher edge-to-vertex ratio. We not only show that our index size achieves a guaranteed approximation bound, but also demonstrate its applicability through extensive experimental evaluation on both real and synthetic datasets. More importantly, we believe this method potentially opens a new way to compress the transitive closure and leads to new provocative questions. For instance, how can other simple graph structures, such as trees, serve as the intermediate hop
leads to new provocative questions. For instance, how can other
simple graph structures, such as trees, serve as the intermediate hop
(highway)? How can we derive the average complexity of these
1http://arxiv.org/
2http://citeseer.ist.psu.edu/oai.html
3http://www.pubmedcentral.nih.gov/
4http://www.geneontology.org/
5http://www.mpi-inf.mpg.de/ suchanek/downloads/yago/
825
Dataset Index Size Query Time (in ms)
3HOP-Contour 3HOP-Segment Path-Tree 3HOP-Contour 3HOP-Segment Path-Tree Breadth-First Search Depth-First Search
ArXiv 47472 64378 86855 125.382 1060.2 24.278 19029.2 129587
Citeseer 51035 72167 91820 87.763 523.488 23.32 4567.16 4781.19
Go 27764 41798 37729 53.354 250.261 10.39 2697.67 2780.23
Pubmed 54531 72215 107915 72.491 533.495 21.818 4083.08 4224.54
Yago 27038 39638 39181 44.495 229.416 12.256 2605.56 2622.23
Table 4: Comparison between 3HOP and Path-Tree
compression approaches, including the simple graph covering ap-
proaches, 2-hop, and 3-hop? We plan to investigate these problems
in the future.
8. REFERENCES
[1] R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient
management of transitive relationships in large data and
knowledge bases. In SIGMOD, pages 253–262, 1989.
[2] Renzo Angles and Claudio Gutierrez. Survey of graph
database models. ACM Comput. Surv., 40(1):1–39, 2008.
[3] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry. Springer, 2000.
[4] Li Chen, Amarnath Gupta, and M. Erdem Kurul.
Stack-based algorithms for pattern matching on dags. In
VLDB ’05: Proceedings of the 31st international conference
on Very large data bases, pages 493–504, 2005.
[5] Yangjun Chen and Yibin Chen. An efficient algorithm for
answering graph reachability queries. In ICDE, pages
893–902, 2008.
[6] Jiefeng Cheng, Jeffrey Xu Yu, Xuemin Lin, Haixun Wang,
and Philip S. Yu. Fast computation of reachability labeling
for large graphs. In EDBT, pages 961–979, 2006.
[7] Jiefeng Cheng, Jeffrey Xu Yu, Xuemin Lin, Haixun Wang,
and Philip S. Yu. Fast computing reachability labelings for
large graphs with high compression rate. In EDBT, pages
193–204, 2008.
[8] V. Chvátal. A greedy heuristic for the set-covering problem. Math. Oper. Res., 4:233–235, 1979.
[9] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick.
Reachability and distance queries via 2-hop labels. In
Proceedings of the 13th annual ACM-SIAM Symposium on
Discrete algorithms, pages 937–946, 2002.
[10] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast
parametric maximum flow algorithm and applications. SIAM
J. Comput., 18(1):30–55, 1989.
[11] H. V. Jagadish. A compression technique to materialize
transitive closure. ACM Trans. Database Syst.,
15(4):558–598, 1990.
[12] Ruoming Jin, Yang Xiang, Ning Ruan, and Haixun Wang.
Efficiently answering reachability queries on very large
directed graphs. In SIGMOD Conference, pages 595–608,
2008.
[13] Richard Johnsonbaugh and Martin Kalin. A graph generation
software package. In SIGCSE ’91: Proceedings of the
twenty-second SIGCSE technical symposium on Computer
science education, pages 151–154, New York, NY, USA,
1991. ACM.
[14] Guy Kortsarz and David Peleg. Generating sparse
2-spanners. In SWAT ’92: Proceedings of the Third
Scandinavian Workshop on Algorithm Theory, pages 73–82,
1992.
[15] R. Schenkel, A. Theobald, and G. Weikum. HOPI: An
efficient connection index for complex XML document
collections. In EDBT, 2004.
[16] K. Simon. An improved algorithm for transitive closure on
acyclic digraphs. Theor. Comput. Sci., 58(1-3):325–346,
1988.
[17] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel.
Efficient aggregation for graph summarization. In SIGMOD
Conference, 2008.
[18] Silke Trißl and Ulf Leser. Fast and practical indexing and
querying of very large graphs. In SIGMOD ’07: Proceedings
of the 2007 ACM SIGMOD international conference on
Management of data, pages 845–856, 2007.
[19] Haixun Wang, Hao He, Jun Yang, Philip S. Yu, and
Jeffrey Xu Yu. Dual labeling: Answering graph reachability
queries in constant time. In ICDE ’06: Proceedings of the
22nd International Conference on Data Engineering
(ICDE’06), page 75, 2006.
[20] Xifeng Yan, Philip S. Yu, and Jiawei Han. Substructure
similarity search in graph databases. In SIGMOD
Conference, pages 766–777, 2005.