3-HOP: A High-Compression Indexing Scheme for
Reachability Query
Ruoming Jin, Yang Xiang, Ning Ruan, and David Fuhry
Department of Computer Science, Kent State University
Kent, OH 44242, USA
{jin,yxiang,nruan,dfuhry}@cs.kent.edu
ABSTRACT
Reachability queries on large directed graphs have attracted much
attention recently. The existing work either uses spanning struc-
tures, such as chains or trees, to compress the complete transitive
closure, or utilizes the 2-hop strategy to describe the reachability.
Almost all of these approaches work well for very sparse graphs.
However, the challenging problem is that as the ratio of the number
of edges to the number of vertices increases, the size of the com-
pressed transitive closure grows very large. In this paper, we pro-
pose a new 3-hop indexing scheme for directed graphs with higher
density. The basic idea of 3-hop indexing is to use chain structures
in combination with hops to minimize the number of structures that
must be indexed. Technically, our goal is to find a 3-hop scheme
over dense DAGs (directed acyclic graphs) with minimum index
size. We develop an efficient algorithm to discover a transitive clo-
sure contour, which yields near optimal index size. Empirical stud-
ies show that our 3-hop scheme has much smaller index size than
state-of-the-art reachability query schemes such as 2-hop and path-
tree when DAGs are not very sparse, while our query time is close
to path-tree, which is considered to be one of the best reachability
query schemes.
Categories and Subject Descriptors
H.2.8 [Database management]: Database Applications—graph
indexing and querying
General Terms
Performance
Keywords
Graph indexing, Reachability queries, Transitive closure, 3-Hop,
2-Hop, Path-tree
1. INTRODUCTION
The rapid accumulation of very large graphs from a diversity
of disciplines, such as biological networks, social networks, on-
tologies, XML, and RDF databases, among others, calls for the
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.
Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.
graph database system. Important research issues, ranging from theoretical foundations, including algebra and query languages [2], to indices for various graph queries [20, 12] and, more recently, graph OLAP/summarization [17], have attracted much recent attention. Among them, graph reachability query processing has evolved into a core problem: given two vertices u and v in a directed graph, is there a path from u to v (u ⇝ v)?
Graph reachability is one of the fundamental research questions
across several disciplines in computer science, such as software en-
gineering and distributed computing. In the database research com-
munity, the initial interest in reachability queries has been driven by
the need to handle recursive queries, with focus on efficient and ef-
fective transitive closure compression. Recently, this problem has
captured the attention of database researchers again, due to the in-
creasing importance of XML data management, and fast growing
graph data, such as large scale social networks, WWW, and bio-
logical networks. For instance, in XML databases, the reachability
query is the basic building block for the typical path query format //P1//P2//···//Pm, where “//” is the ancestor-descendant search and Pi is a tag. Reachability queries also have an important role for managing/querying RDF and domain ontologies. In
bioinformatics, reachability queries can be used to answer basic
gene regulation questions in the regulatory network.
1.1 Prior Work
In order to tell whether a vertex u can reach another vertex v
in a directed graph, many approaches have been developed over
the years. For a reachability query, we can effectively transform a
directed graph into a directed acyclic graph (DAG) by coalescing
strongly connected components into vertices and utilizing the DAG
to answer the reachability queries. Thus, throughout the paper, we
will only focus on DAGs. Let G = (V, E) be the DAG for a reachability query. In Table 1, we summarize these approaches in terms of their index size, construction time, and query processing time based on worst-case analysis. Here, n is the number of vertices (n = |V|) and m is the number of edges (m = |E|). Parameter k is the width of the chain decomposition of DAG G [11], t is the number of (non-tree) edges left after removing all the edges of a spanning tree of G [19], and k′ is the width of the path decomposition [12]. These three parameters, k, t and k′, are method-specific and will be explained in more detail when we discuss their corresponding methods.
DFS/BFS and Transitive Closure Computation: We first discuss
two classical approaches for reachability query, representing two
extremes with regard to index size and query time. DFS/BFS needs
to traverse the graph online and can take up to O(n+m) time to
answer a reachability query. This is too slow for large graphs. The
second approach precomputes the transitive closure of G, i.e., it
records the reachability between every pair of vertices in advance.
                          Index Size    Construction Time   Query Time
DFS/BFS                   O(n+m)        -                   O(n+m)
Transitive Closure [16]   O(n²)         O(nm)               O(1)
Opt. Chain Cover [11]     O(nk)         O(n³)               O(log k)
Opt. Chain Cover [5]      O(nk)         O(n² + kn√k)        O(log k)
Opt. Tree Cover [1]       O(n²)         O(nm)               O(log n)
Dual Labeling [19]        O(n + t²)     O(n + m + t³)       O(1)
Labeling+SSPI [4]         O(n + m)      O(n + m)            O(m − n)
GRIPP [18]                O(m + n)      O(n + m)            O(m − n)
Path-Tree [12]            O(nk′)        O(mk′) / O(mn)      O(log² k′)
2-Hop [9]                 Õ(n√m)        O(n³ · |Tc|)        Õ(√m)

Table 1: Worst-Case Complexity
While this approach can answer reachability queries in constant time, its storage cost of O(n²) is prohibitive for large graphs.
Indeed, tackling the storage cost by effectively compressing the
transitive closure has been the major theme of index construction
for graph reachability processing. Typically, however, improved
compression comes at the cost of slower query answering time. To
find the right balance between transitive closure compression and
reasonable query answering time is the driving force of ongoing
research into graph reachability indexing.
The existing research largely falls into two categories: the first
category attempts to apply simple graph structures, such as chains
and trees, to compress the transitive closure of a DAG. The optimal
chain cover, tree cover and the recent path-tree cover all belong to
this category. The second category, referred to as 2-hop indexing,
tries to encode the reachability using a subset of vertices which
serve as intermediaries, i.e., each vertex records a list of interme-
diate vertices it can reach and a list of intermediate vertices which
can reach it. Then, 2-hop reachability means the starting vertex can
reach an intermediate vertex (the first hop) and this intermediate
vertex can reach the end vertex (the second hop). In the following,
we go through these approaches in more detail.
Optimal Chain Cover: The basic idea of optimal chain cover is
to decompose a DAG into a minimal number of pair-wise disjoint
chains, and then assign each vertex in the graph a chain ID and
its sequence number in its chain. Given this, if a vertex can reach another chain, it records only the smallest vertex it reaches in that chain. In other words, each vertex in the compressed transitive closure covers the remaining vertices (all the vertices with a higher sequence number) in its respective chain. To determine if vertex u reaches vertex v, we only need to check if u reaches any vertex (say, v′) in v's chain, and if yes, we check whether v′ has a sequence number no larger than v's. This strategy can compress the transitive closure since we need to record at most one vertex in each chain for a given vertex. If the minimal number of chains for a DAG (also referred to as the width of the DAG) is k, then this approach has O(nk) index size and O(log k) query time.
Jagadish [11] pioneered the application of chain decomposition in the database research community to compress the transitive closure. He demonstrated that the problem of finding the minimal number of chains for G can be transformed into a network flow problem, which can be solved in O(n³) time. He also proposed several heuristic algorithms for chain decomposition in order to reduce the computational cost and actual index size. Recently, Chen [5] proposed an O(n² + kn√k)-time algorithm to decompose a DAG into a minimal number of chains.
The worst-case complexity of the chain cover approach is clearly decided by the width of the DAG. If the width is high, we tend to have a lot of chains with only a small number of vertices, resulting in a high index cost. Another way to look at the compression rate is by observing that each vertex in the compressed transitive closure covers a partial chain (from the vertex itself to the last vertex in the chain). Let R(u) be the transitive closure of u. Let RC(u) be the set of vertices u records under the chain decomposition. Then, the compression ratio of the chain decomposition is defined as

    Σ_{u∈V} |R(u)| / Σ_{u∈V} |RC(u)|

Thus, we can see that the compression ratio is exactly the average size of the partial chains each vertex in the compressed transitive closure covers.
Optimal Tree Cover and Its Variants: The optimal tree cover uti-
lizes a (spanning) tree to compress the transitive closure [1]. Each
vertex in the tree is labeled by a pair of numbers, corresponding to
an interval: if a vertex is an ancestor of another vertex in the tree,
the interval labeling guarantees that the interval of the first vertex
contains the interval of the second vertex. Note that if a vertex
reaches the root of a subtree in the original DAG, it will reach all
the vertices in the subtree. Thus, for each vertex in the DAG, we
can organize all the vertices in its transitive closure, i.e., all the ver-
tices it can reach, into pair-wise disjoint subtrees. To compress the
transitive closure, for each subtree, we only need to record its root
vertex. To answer the reachability query from vertex u to vertex v, we check if the interval of v is contained in any interval associated with the subtree roots we have recorded for u.
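As an illustration, this containment test can be sketched as follows, assuming interval[x] = (start, end) from the tree labeling and roots[u] = the intervals of the subtree roots recorded for u (with u's own interval included, to cover plain tree ancestry); these structure names are ours, not the paper's:

    # Illustrative tree-cover reachability test (structure names are assumptions).
    def reaches(u, v, interval, roots):
        lo, hi = interval[v]
        # u reaches v iff v's interval lies inside some recorded root interval
        return any(s <= lo and hi <= e for (s, e) in roots[u])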
Agrawal et al. [1] formally introduced the tree cover and found
an optimized algorithm to discover a tree cover which can maxi-
mally compress the transitive closure. They also showed that the
tree cover approach can provide a better compression rate than the
optimal chain cover approach. The advantage of the tree cover ap-
proach over the chain cover approach comes from the fact that each
tree-cover vertex covers an entire subtree, while each chain-cover
vertex covers only a partial chain.
Several recent studies focus on the tree cover approach and try to improve its query processing time and/or provide a smaller index size. Wang et al. [19] develop the Dual-Labeling approach, which tries to improve the query time and index size for very sparse graphs, where the number of non-tree edges t is much smaller than the number of vertices n (t ≪ n). Their approach can reduce the index size to O(n + t²) and achieve constant query answering time. Unfortunately, many real-world graphs do not satisfy the condition required by this approach, and when t > n, this approach will not help compress the index size.

Label+SSPI [4] and GRIPP [18] aim to minimize the index construction time and index size. They achieve O(m+n) index construction time and O(m+n) index size. However, this comes at the sacrifice of query time, which can cost O(m − n). Both algorithms start by extracting a tree cover and then deploy an online search algorithm utilizing the tree structure to speed up the DFS process.
Path-Tree Cover: The latest work to use a simple graph structure
to compress transitive closure is the path-tree cover approach, pro-
posed by Jin et al. [12], which generalizes the tree cover approach.
They observe that the covering capability of each vertex in the com-
pressed transitive closure is determined by the number of parents
and children each vertex has in the simple graph structure. For in-
stance, a chain vertex has one parent and one child while a tree
vertex has one parent and multiple children. The path-tree allows
two parents and multiple children. In a path-tree cover, all vertices in the original DAG are partitioned into pair-wise disjoint paths (k′ is the number of paths in the path decomposition for a DAG G), and then those paths serve as vertices in a tree structure. In other words, the path-tree utilizes a tree-like structure, where each vertex represents a path in the original DAG. Each vertex in the path-tree needs only three numbers, two numbers for the interval label of the tree structure and one sequence number from a DFS traversal procedure, to answer the reachability query between any two vertices in the path-tree in constant time. In [12], the authors proposed two path-tree schemes, PTree-1 and PTree-2. PTree-1 utilizes the optimal tree cover and thus has O(mn) construction time, while PTree-2 has O(mk′) construction time.

Given this, to compress the transitive closure, a vertex u only needs to record vertex v such that 1) u ⇝ v and 2) there is no vertex v′ such that u ⇝ v′ and v′ can reach v in the path-tree. Theoretically, they prove that the path-tree cover always compresses the transitive closure at least as well as the optimal tree cover and chain cover approaches.
that the enhanced power of the path-tree cover is a consequence of
the increased parent/child connectivity of path-tree vertices vs. tree
cover or chain cover vertices.
2-HOP Indexing: The 2-hop labeling method proposed by Cohen
et al. [9] is quite different from the aforementioned simple graph
covering approaches. It compresses the transitive closure using a
subset of intermediate vertices. Each vertex records a list of in-
termediate vertices it can reach and a list of intermediate vertices
which can reach it. The index size is the total number of interme-
diate vertices each vertex records. They propose an approximate (greedy) algorithm based on set-covering which can produce a 2-hop cover larger than the minimum possible 2-hop indexing by at most a logarithmic factor. The minimum 2-hop index size is conjectured to be Õ(n√m).
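For later contrast with 3-hop, the 2-hop query itself is a single set intersection; a minimal sketch, assuming Lout[u] holds the intermediate vertices u reaches and Lin[v] those reaching v (each vertex listed in its own labels):

    # Illustrative 2-hop test: u ⇝ v iff some intermediate w has u ⇝ w and w ⇝ v.
    def reaches(u, v, Lout, Lin):
        return not Lout[u].isdisjoint(Lin[v])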
The major problem of the 2-hop indexing approach is its high construction cost. The greedy set-covering algorithm needs to iteratively find a subset of vertices which utilizes a candidate vertex as the intermediate hop. The subset of vertices is selected to minimize the price measure, i.e., the cost of recording such an intermediate hop for these vertices with respect to the number of uncovered reachable vertex pairs in this subset. Finding the subset of vertices with minimal price can be transformed into the problem of finding a densest subgraph in a bipartite graph. The approximation algorithm for this subproblem runs in time linear in the number of edges of the bipartite graph. Moreover, each vertex in the DAG can serve as the intermediate hop, and each corresponds to a bipartite graph. Thus, each iteration takes O(n³) to find such a desired subset of vertices. Considering that the iterations need to cover the entire transitive closure Tc, we can see its construction time is O(n³ · |Tc|).
Several approaches have been proposed to reduce its construction time. Schenkel et al. propose the HOPI algorithm, which applies a divide-and-conquer strategy to compute 2-hop labeling [15]. Recently, Cheng et al. propose several methods, such as a geometric-based algorithm [6] and a graph partition technique [7], to produce a 2-hop labeling. Though their algorithms significantly speed up the 2-hop construction time, they do not preserve the approximation bound on the labeling size that Cohen et al.'s approach provides.
1.2 Our Contribution
Almost all these approaches work reasonably well for very sparse
graphs (where the number of edges is very close to the number of
vertices). However, as the ratio of the number of edges to the num-
ber of vertices increases, the size of the compressed transitive clo-
sure of the simple graph covering approaches can grow very large.
In many real world graphs, such as citation networks, the semantic
web, and biological networks, the number of edges can be several
times the number of vertices. In general, the simple graph covering
approach works well only for those DAGs which have a structure
similar to the building-block chain, tree, or path-tree structures.
However, in many real world graphs, since edge density is much
higher than in simple graph structures, many edges will be left un-
covered. Vertices of uncovered edges likely need to be recorded as
ancillary data in the compressed transitive closure of the DAG, in-
creasing the index size. Thus, the size of the compressed transitive
closure can become very large as the density grows.
The original 2-hop [9] builds on top of the set-covering frame-
work and is theoretically appealing as it achieves a guaranteed ap-
proximation bound. However, to our knowledge, there is little the-
oretical comparison between the 2-hop approach and the simple
graph covering approaches in existing research. Most studies do
not even empirically compare the 2-hop approach and the simple
graph covering approaches. This may be due in part to the 2-hop
approach not scaling well to large graphs, even graphs with only
thousands of vertices. Specifically, since the original 2-hop needs
to compute the complete transitive closure, it becomes very expen-
sive as the edge density of the graph becomes larger. Though sev-
eral heuristic techniques [15, 6, 7] have been proposed to construct
2-hop faster, they do not guarantee any approximation bound as the
original 2-hop does. None of these methods have compared their
compression ratio directly with the optimal 2-hop approaches, even
on relatively small graphs.
To summarize, the major research challenge for existing graph
reachability indexing is how to significantly compress the transitive
closure when the ratio between the number of edges and the number
of vertices increases. Driven by this need, we propose a new 3-hop
indexing scheme for directed graphs with higher density. The basic
idea in 3-hop indexing is to utilize a simple graph structure, rather
than a sole vertex, as an intermediate hop to describe the reachabil-
ity between source vertices and destination vertices. In this paper,
we focus on the chain structure. The new indexing scheme does
not need to compute the entire transitive closure. Instead, it only
needs to compute and record a number of so-called “contour” ver-
tex pairs, which can be orders of magnitude smaller than the size
of the transitive closure. Indeed, it is even much smaller than the
compressed transitive closure of the chain cover. The connectivity
of any pair of vertices in the DAG can be answered by those con-
tour vertex pairs. Further, we “factorize” these contour vertex pairs
by recording a list of “entry points” and “exit points” on some in-
termediate chains. We derive an efficient algorithm to generate an
index which approximates the minimal 3-hop indexing by a loga-
rithmic factor. Theoretically, we show that 3-hop labeling always
has a better minimal compression ratio than 2-hop labeling, and its
construction time is much faster than that of 2-hop.
We perform a detailed experimental evaluation on both real and
synthetic datasets by comparing 3-hop labeling, 2-hop labeling and
the state-of-the-art path-tree covering approach. Empirical studies
show that our 3-hop scheme has a much smaller index size than prior state-of-the-art reachability query schemes for dense DAGs, i.e., when the number of edges is not close to the number of vertices (|E| ≉ |V|). The query processing time of 3-hop is close to path-
tree’s, which is considered to be one of the best reachability query
schemes.
2. BASIC IDEAS OF 3-HOP INDEXING
2.1 Basic 3-Hop
The 3-hop reachability indexing is analogous to the highway sys-
tem of the transportation network. To reach a destination from a
starting point, you simply need to get on an appropriate highway
and get off at the right exit to get to the destination. The high-
way system in the 3-hop labeling is simple graph structures, such
as chains or trees, as they can encode the reachability information
using a constant labeling size. In this paper, we focus on utilizing
chains, i.e., each chain serves as a different highway. Since each chain has a direction, each vertex u records a list of “entry points” (the smallest vertices) it can reach on some chains. It also records a list of “exit points” (the largest vertices) which can reach it on some chains. Here, the order of vertices in a chain is their topological order in that chain, i.e., a vertex with a smaller number can reach a vertex with a larger number.
Given this, the three hops are 1) the first hop from the starting
vertex to the entry point of some chain, 2) the second hop from the
entry point in the chain to the exit point of the chain, and finally
3) the third hop from the exit point of the chain to the destination
vertex. The goal of 3-hop indexing is to assign entry and exit points to the vertices with a minimal total number of points, so as to maximally compress the transitive closure.
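In the single-chain setting this test is one comparison; a minimal sketch, assuming hypothetical maps entry (smallest chain position a vertex reaches) and exit_ (largest chain position that reaches a vertex):

    # u reaches v through the chain iff u's entry point precedes v's exit point.
    def reaches_via_chain(u, v, entry, exit_):
        return u in entry and v in exit_ and entry[u] <= exit_[v]

In Figure 1(a) below, for example, entry[2] = 6 and exit_[9] = 7, and 6 ≤ 7 confirms 2 ⇝ 9.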
Figure 1: A simple example for 3-hop and 2-hop
Figure 1(a) shows an example using the chain 5 → 6 → 7 → 8 as the intermediate hop (or highway). Thus, each vertex not on the chain only needs to record its entry point and exit point in that chain, listing them in the set o and set i associated with each vertex, respectively. To tell if vertex 2 can reach 9, we compare 2's entry point with 9's exit point. We conclude that 2 can reach 9 because 2's entry point 6 precedes exit point 7, which then reaches vertex 9. In total, this simple 3-hop scheme records 8 vertices to encode the transitive closure using a single chain.

Figure 1(b) shows the optimal 2-hop labeling, where each vertex records a list of intermediate vertices it reaches and a list of vertices which reach it. Here, 2-hop needs to record a total of 16 vertices to encode the transitive closure. However, readers should be advised that this is a very simple and incomplete example giving the basic idea of 3-hop. Detailed definitions, algorithms and complete running examples of 3-hop will be given from now on.
2.2 Chain Decomposition for 3-Hop
A simple technique which can significantly boost the 3-hop com-
pression ratio is to apply a chain decomposition for the entire DAG
first. From the 3-hop perspective, such a decomposition would as-
sociate each vertex itself with a highway since each vertex is par-
titioned to a chain. This suggests that many vertices in the same
chain may share the same entry points and exit points of some other
chains. Thus, we do not need to explicitly record those points for
each of these vertices in the same chain, and therefore can further
compress the transitive closure. To better understand the intuition
of boosting 3-hop with a chain decomposition, let us see the run-
ning example in Figure 2.
Figure 2 is a DAG with 4 chains as a result of chain decompo-
sition. In 3-hop, each chain serves as a highway and each vertex
also belongs to a highway. In Figure 3, we show the vertices using
Figure 2: A simple DAG with a chain decomposition. (The dotted arrow 13 → 14 is not an edge in the original DAG, but an inferred one using reachability.)
chains C2 and C3 as intermediate hops (highways) to encode their
transitive closure. At the left of each chain, we draw those ver-
tices which record an entry point into that chain, and at the right of
each chain, we draw those vertices which record an exit point out
of the corresponding chain. To be more efficient, we organize into
an “outgoing” segment those consecutive vertices (on one chain)
which share the same entry point, and correspondingly we orga-
nize into an “incoming” segment those consecutive vertices (on one
chain) which share the same exit point.
Specifically, we organize all the vertices on the left which share the same entry point into an “outgoing” segment, and all the vertices on the right which share the same exit point into an “incoming” segment. Each segment corresponds to a list of consecutive vertices in a chain. For instance, the vertices in the outgoing segment from 1 to 3 all record vertex 6 in chain C2 as their entry point, and they are the first three vertices in chain C1. The vertices in the incoming segment from 17 to 20 all record vertex 11 in chain C3 as their exit point, and they are the last four vertices in chain C4.
Figure 3: Two examples of reachability between segments (through chains C2 and C3)
Intuitively, we can apply 3-hop with chain decomposition to answer a reachability query. For example, to answer whether vertex 6 can reach vertex 19, we find that vertex 6 is in segment (6,7), which can reach vertex 12 in C3, and vertex 19 is in segment (19,20), which can be reached by vertex 14 in C3. Then we say 6 can reach 19 because 6 can reach 12, 19 can be reached by 14, and 12 reaches 14 in chain C3.
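Generalizing this example, a hedged sketch of the whole test tries every chain as the highway; the structures cid/oid/entry/exit_ here are illustrative stand-ins for the index actually developed in Sections 3–5:

    def reaches(u, v, k, cid, oid, entry, exit_):
        """entry[(x, w)] / exit_[(x, w)]: positions of the entry/exit point
        recorded for x on chain w, when one exists (assumed structures)."""
        if cid[u] == cid[v]:                     # same chain: compare positions
            return oid[u] <= oid[v]
        for w in range(1, k + 1):                # try each chain as the highway
            e_in, e_out = entry.get((u, w)), exit_.get((v, w))
            if e_in is not None and e_out is not None and e_in <= e_out:
                return True
        return False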
2.3 3-Hop Indexing and Our Approach
The major research problem we will study in this paper is as follows: given a chain decomposition {C1, C2, ..., Ck} of a DAG G, how can we utilize the 3-hop strategy to maximally compress the transitive closure and answer reachability queries efficiently? Our approach addresses this problem in three steps:

1. (Section 3) Given a chain decomposition, we first derive a concise representation of the transitive closure, called the contour of the transitive closure. This representation allows us to quickly identify those vertices which share the same entry point and those which share the same exit point.

2. (Section 4) We show that a 3-hop strategy which maximally compresses the contour corresponds to a generalized “factorization” of the contour. We develop an efficient greedy algorithm to approximate the optimal result within a logarithmic factor.

3. (Section 5) We provide a query processing procedure utilizing the index based on the 3-hop compression of the transitive closure contour. We also derive a theoretically faster query processing scheme by transforming the 3-hop contour into a 3-hop segment indexing.
3. TRANSITIVE CLOSURE CONTOUR
In this section, we will study a concise representation of the tran-
sitive closure matrix based on the chain decomposition of the DAG.
This representation will form the basis for efficient construction of
the 3-HOP index. We will derive a fast algorithm to directly gener-
ate this concise representation.
3.1 Notation and Chain-Decomposition
Let G = (V, E) be a directed acyclic graph (DAG), where V = {1, 2, ···, n} is the vertex set, and E ⊆ V × V is the edge set. We use (v, w) to denote the edge from vertex v to vertex w, and we use (v0, v1, ···, vp) to denote a path from vertex v0 to vertex vp, where (vi, vi+1) is an edge (0 ≤ i ≤ p−1). In a DAG, all paths are simple paths, meaning each vertex in a path is distinct. We say vertex v is reachable from vertex u (denoted u ⇝ v) if there is a path starting from u and ending at v.

A chain is a generalization of a path: it is also a sequence of vertices (v0, v1, ···, vp), where vi+1 is reachable from vi (vi ⇝ vi+1, 0 ≤ i ≤ p−1). Clearly, any path in G is also a chain. However, the reverse is not necessarily true (see chain C3 in Figure 2). Let C1 and C2 be two chains of G. We use C1 ∩ C2 to denote the set of vertices appearing in both chains and C1 ∪ C2 to denote the set of vertices appearing in either of the chains.

DEFINITION 1. (Chain Decomposition) A chain decomposition of DAG G = (V, E) is a collection of pair-wise distinct chains, C1, C2, ···, Ck, such that C1 ∪ C2 ∪ ··· ∪ Ck = V and Ci ∩ Cj = ∅ for any i ≠ j. The integer k is called the width of the decomposition.

Given the chain decomposition, we assign to each vertex v a pair of IDs, (cid, oid), where cid is the ID of the chain vertex v belongs to, and oid is v's relative order on the chain. For any two vertices u and v in the same chain, we have u ⇝ v iff u.oid ≤ v.oid. If u.oid < v.oid, we also say u is smaller than v, and vice versa. Several algorithms have been developed to partition a DAG into a minimal number of chains to facilitate transitive closure computation [11, 5]. Our approach can utilize any of these approaches.
3.2 Transitive Closure between Two Chains
In this work, we will derive a more concise representation for
the transitive closure using the chain decomposition. We base this
representation on a key observation on how the transitive closure is
recorded in binary matrix format. Note that our approach does not
need to materialize this binary matrix representation of the transi-
tive closure.
Let M be the binary matrix representation of the transitive closure of G. Then M[vi, vj] = 1 iff vi ⇝ vj, and M[vi, vj] = 0 iff vi cannot reach vj. We define an index (i, j) of M to be a cell. If M(i, j) = 0, we say (i, j) is a 0-cell; else (i, j) is a 1-cell. Also, we order the vertices based on their chain ID, and within each chain, we sort the vertices according to their order ID (oid). Thus, the vertices in the same chain are contiguous in linearly increasing order. We also introduce the submatrix M_{Ci,Cj} for any two chains Ci and Cj, which has the rows of Ci and the columns of Cj. Clearly, the complete transitive closure M can be written as the union of the k × k submatrices:

        | M_{C1,C1}  M_{C1,C2}  ···  M_{C1,Ck} |
    M = | M_{C2,C1}  M_{C2,C2}  ···  M_{C2,Ck} |        (1)
        |    ···        ···     ···     ···    |
        | M_{Ck,C1}  M_{Ck,C2}  ···  M_{Ck,Ck} |
Figure 4: Pseudo-diagonal and pseudo-upper triangular submatrix. All blank cells are 0-cells.
It is easy to see that any M_{Ci,Ci} is a special upper triangular matrix, i.e., for any va ≤ vb, where va and vb are vertices of chain Ci, M[va, vb] = 1, and for any va > vb, M[va, vb] = 0. We refer to it as an upper uni-triangular matrix. Note that the geometry of this submatrix describes, and is equivalent to, the intra-chain reachability property. Therefore, there is no need to materialize the M_{Ci,Ci} upper uni-triangular matrices. Next, what does a submatrix M_{Ci,Cj} look like when i ≠ j?

To describe the shape of the submatrices between any two chains, we introduce the following notation. Given submatrix M_{Ci,Cj} with |Ci| rows and |Cj| columns, and two cells (x, y) and (x′, y′), where x and x′ are vertices of chain Ci and y and y′ are vertices of chain Cj, we say cell (x, y) dominates cell (x′, y′) in the matrix M_{Ci,Cj} if x′ ≤ x and y ≤ y′. In other words, a cell dominates all the cells located in its upper-right quadrant. As a simple observation, in any submatrix M_{Ci,Cj}, the collection of all the cells dominated by a cell (x, y) forms a rectangle which has (x, y) as its lower-left corner and the upper-right cell of M_{Ci,Cj} as its upper-right corner.
DEFINITION 2. (Pseudo-Diagonal and Pseudo-Upper Triangular Matrix) The pseudo-diagonal of a binary matrix (submatrix) Ms is a set of 1-cells, {(x1, y1), (x2, y2), ···, (xl, yl)}, such that 1) all the 1-cells in Ms are dominated by at least one pseudo-diagonal cell, 2) none of the 0-cells in Ms is dominated by any pseudo-diagonal cell, and 3) no pseudo-diagonal cell dominates another pseudo-diagonal cell. If a binary matrix (submatrix) has a pseudo-diagonal, we refer to it as a pseudo-upper triangular matrix (submatrix).
Clearly, not every binary matrix is a pseudo-upper triangular matrix containing a pseudo-diagonal. We next provide the following theorem to reveal the shape of a submatrix between two chains.

THEOREM 1. Let M_{Ci,Cj} be the binary submatrix of the transitive closure between two different chains Ci and Cj. M_{Ci,Cj} is a pseudo-upper triangular matrix.

Proof Sketch: Our proof is constructive. We will first construct the pseudo-diagonal explicitly. Then, we will show that the matrix is indeed pseudo-upper triangular. Let the chain Ci be (v1, v2, ···, vp). Let f(vi) be the first vertex in Cj that vi can reach. If vi does not reach any vertex in Cj, let f(vi) = +∞.

Then we construct the sequence (f(v1), f(v2), ···, f(vp)). We can show f(vi) ≤ f(vi+1) as follows: because vi reaches vi+1, vi will reach f(vi+1). Thus, f(vi) should be no larger than f(vi+1). This also suggests that f(vi) = +∞, if it exists, can only appear at the end of the sequence.

Given this, we observe the following property of the pseudo-diagonal: a 1-cell (vi, f(vi)) (1 ≤ i ≤ p−1) is in the pseudo-diagonal if and only if f(vi+1) > f(vi) and f(vi) ≠ +∞. Besides, (vp, f(vp)) is in the pseudo-diagonal if and only if it is a 1-cell. Thus, we can scan the sequence (f(v1), f(v2), ···, f(vp)) once to create the pseudo-diagonal.

Now, we only need to show that any cell which is dominated by one of the cells in the pseudo-diagonal is a 1-cell, and otherwise a 0-cell. Let (a, b) be a cell in the matrix and assume it is dominated by one of the cells in the pseudo-diagonal, (vi, f(vi)). Then, by definition, a ≤ vi and b ≥ f(vi). In other words, a ⇝ vi in Ci and f(vi) ⇝ b in Cj. We also know vi ⇝ f(vi). Thus, we have a ⇝ b, so (a, b) is a 1-cell.

Let (c, d) be a cell in the matrix and assume it is not dominated by any of the cells in the pseudo-diagonal. Basically, we have d < f(c). Since f(c) is the smallest vertex in chain Cj that c can reach, c cannot reach d, meaning (c, d) is a 0-cell. □

In Figure 4, we can see that each M_{Ci,Cj}, i ≠ j, is a pseudo-upper triangular matrix. We highlight their pseudo-diagonal cells with circles.
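The single scan in the proof translates directly into code; a minimal sketch, assuming f is the non-decreasing list [f(v1), ..., f(vp)] with math.inf for vertices that reach nothing on the other chain:

    import math

    def pseudo_diagonal(f):
        cells = []
        for i in range(len(f)):
            if f[i] == math.inf:
                break                          # +inf entries only occur at the tail
            # (v_i, f(v_i)) is on the pseudo-diagonal iff the next f value is larger
            if i == len(f) - 1 or f[i + 1] > f[i]:
                cells.append((i, f[i]))
        return cells

For instance, pseudo_diagonal([6, 6, 8, math.inf]) returns [(1, 6), (2, 8)] (0-based vertex positions): the cell for the first vertex is dominated by the cell for the second, so only the latter appears.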
COROLLARY 1. The transitive closure from any chain Ci to another chain Cj,

    | M_{Ci,Ci}  M_{Ci,Cj} |
    |            M_{Cj,Cj} |

can be described as a directed graph with vertex set V′ = V(Ci) ∪ V(Cj) and edge set E′ = E(Ci) ∪ E(Cj) ∪ {(vi, f(vi)) | (vi, f(vi)) is a pseudo-diagonal cell}, and no two edges cross, i.e., for any two pseudo-diagonal cells (vi, f(vi)) and (vj, f(vj)), we have either

    (vi.oid > vj.oid) ∧ (f(vi).oid > f(vj).oid), or
    (vi.oid < vj.oid) ∧ (f(vi).oid < f(vj).oid).

Essentially, the edge links from Ci to Cj do not cross each other. Figure 5 shows two examples: edge links between C1 and C3, and edge links between C3 and C4. Moreover, we can see that for chains Ci and Cj, the starting vertices of the pseudo-diagonal cells naturally divide chain Ci into several “outgoing” segments such that all the vertices in a segment share the same “entry point” to chain Cj. Similarly, the end vertices of these pseudo-diagonal cells divide chain Cj into several “incoming” segments where all the vertices in a segment share the same “exit point” from chain Ci. For instance, in Figure 5, for chains C3 and C4, the pseudo-diagonal cells {(11,17), (13,18), (14,19)} divide chain C3 into four outgoing segments, including (10,11), (12,13) and (14,14), and divide chain C4 into three incoming segments, (17,17), (18,18) and (19,20).
Now, we formally introduce the transitive closure contour.
Figure 5: Edge links between chains C1 and C3, and between chains C3 and C4. Dotted arrows are virtual edges (paths).
DEFINITION 3. (Transitive Closure Contour) Given DAG G and its chain decomposition C1 ∪ C2 ∪ ··· ∪ Ck, the transitive closure contour Con(G) is the set of all pseudo-diagonal cells of the pseudo-upper triangular matrices M_{Ci,Cj}, where i ≠ j.

Given a chain decomposition, we can see that the transitive closure contour precisely describes the complete transitive closure. We will utilize this concise representation of the transitive closure to build our 3-HOP indexing.
3.3 Computing Transitive Closure Contour
We now present an efficient computation which directly computes the transitive closure contour, without materializing the binary matrix, given a chain decomposition. The sketch of TransitiveClosureContour is in Algorithm 1. We use a matrix S to record the entire transitive closure contour of DAG G, Con(G). Each element S_{i,j} records the pseudo-diagonal of M_{Ci,Cj} for chains Ci and Cj.

The computation follows the reverse topological order (Loop 3–21), which broadcasts the reachability information from bottom to top. S_{i,j} is an ordered set of pseudo-diagonal cells (p, q) between chain i and chain j (in ascending order of q.oid), and S_{i,j}.head() returns the first (smallest q.oid) pseudo-diagonal cell (p, q) in S_{i,j}. For each vertex u, we use minoid[i] to record the smallest vertex it can reach in chain Ci. At the beginning, we fill minoid[i] with the smallest vertex that its own chain C_{u.cid} can reach in chain Ci. This is done in Line 4, and we can retrieve this cell by
Algorithm 1 TransitiveClosureContour(G, C1 ∪ C2 ∪ ··· ∪ Ck)
Parameter: C1 ∪ C2 ∪ ··· ∪ Ck: the chain decomposition
1: Perform the topological sort of G
2: For each i, j, 1 ≤ i, j ≤ k: S_{i,j} ← ∅
3: for u = |V(G)| downto 1 {following the reverse topological order} do
4:   For each i, 1 ≤ i ≤ k: minoid[i] ← y, where y = q.oid and (p, q) = S_{u.cid,i}.head() {y = ∞ if S_{u.cid,i} = ∅}
5:   for each v: an immediate successor of u {in topological order} do
6:     if v.oid < minoid[v.cid] ∧ v.cid ≠ u.cid then
7:       minoid[v.cid] ← v.oid
8:       for each i = 1 to k do
9:         Let y = q.oid, where (p, q) ← argmin_{(p,q) ∈ S_{v.cid,i}, p.oid ≥ v.oid} p.oid
10:        if u.cid ≠ i ∧ minoid[i] > y then
11:          minoid[i] ← y
12:        end if
13:      end for
14:    end if
15:  end for
16:  for each i = 1 to k do
17:    if i ≠ u.cid ∧ minoid[i] < y {y = q.oid and (p, q) = S_{u.cid,i}.head()} then
18:      S_{u.cid,i} ← S_{u.cid,i} ∪ {(u, minoid[i])}
19:    end if
20:  end for
21: end for
S_{u.cid,i}.head(). If S_{u.cid,i} is empty, we fill minoid[i] with ∞.

After that, we visit each of vertex u's immediate successors, v (Line 5). Our visit follows their topological order, i.e., the smallest vertex is visited first. Note that by following this order, when u has more than one immediate successor in the same chain, we only need to visit the smallest vertex among them (Line 6). Given this, the major operation is to update, using vertex v, the smallest vertices which u can reach on each chain, i.e., to update each minoid[i]. Such an update comes from two sources. The first source is v itself: if v.oid is smaller than the current minoid[v.cid], the edge (u, v) allows u to reach a smaller vertex on v's chain (Line 7). The second source is the smallest vertices on other chains which v can reach. In the latter case, for each chain Ci (Line 8), we need to get the pseudo-diagonal cell (p, q) in S_{v.cid,i}, where p and v are in the same chain and p is the smallest pseudo-diagonal start point that v can reach (Line 9). Thus, q is the smallest vertex in chain Ci that u can reach via the edge (u, v). Given this, we test if q is smaller than the current smallest vertex u can reach in chain Ci, and replace it if it is (Line 10). Finally, after visiting all of u's immediate successors, we add the cell (u, minoid[i]) to S_{u.cid,i} if it is a pseudo-diagonal cell (Lines 17 and 18).

The correctness of Algorithm 1 follows from the fact that we maintain in minoid the smallest vertex of each chain that vertex u can reach, and a cell (u, minoid[i]) is a pseudo-diagonal cell iff minoid[i] is less than the smallest vertex in S_{u.cid,i} (Corollary 1). The time complexity of this algorithm is O(mk log n) in the worst case: the two outer loops, Steps 3 to 21, run m times in total, since DAG G has m edges; the loop from Steps 8 to 13 runs k times; and Step 9 takes O(log n) time for a binary search in the worst case.
4. 3-HOP LABELING FOR TRANSITIVE
CLOSURE CONTOUR
4.1 Problem Definition
Our goal in this section is to compress the transitive closure contour, Con(G), using the 3-hop strategy. For any vertex pair (u, v) ∈ Con(G), we say u is an out-anchor vertex of the contour, and v is an in-anchor vertex. We will assign each out-anchor vertex a list of intermediate “entry points” of some chains and assign each in-anchor vertex a list of intermediate “exit points” of some chains. To recover the reachability between an out-anchor u and an in-anchor v, we check whether u can reach v in three hops, i.e., the first hop from u to an intermediate entry point, the second hop to the intermediate exit point, and the third hop from the exit point to v. Formally, we introduce the 3-hop reachability labeling for the contour set Con(G) as follows.
DEFINITION 4. (3-HOP Reachability Labeling) Let Con(G) be the transitive closure contour for G with respect to a chain decomposition. Let Vout and Vin be the sets of out-anchor vertices and in-anchor vertices for Con(G), respectively. A 3-hop reachability labeling assigns each out-anchor vertex u in Vout a label Lout(u) (a set of intermediate entry points), and each in-anchor vertex v in Vin a label Lin(v) (a set of intermediate exit points), such that Lout(u), Lin(v) ⊆ V(G), for every x ∈ Lout(u), u ⇝ x, and for every y ∈ Lin(v), y ⇝ v. Furthermore, we have the following two conditions:

(1) (u, v) ∈ Con(G) ⟹ ∃x ∈ Lout(u), ∃y ∈ Lin(v), such that x, y ∈ Ci for some chain Ci, and x ⇝ y

(2) for any x ∈ Lout(u), y ∈ Lin(v): x, y ∈ Ci and x ⇝ y ⟹ u ⇝ v

The size of the labeling is defined to be

    Cost(3hop) = Σ_{u∈Vout} |Lout(u)| + Σ_{v∈Vin} |Lin(v)|

To simplify our discussion, we assume u ∈ Lout(u) and v ∈ Lin(v).
THEOREM 2. Finding a minimum 3-hop reachability labeling for a given contour set Con(G) of a DAG G is an NP-hard problem.

Proof Sketch: We simply note that 3-hop labeling is a generalization of 2-hop labeling. □
To better understand this problem, we will describe it as a generalized “factorization” problem and then transform it into the classical set-cover problem. We start by partitioning each of the two anchor sets, Vout and Vin, according to their intermediate chains:

    V^i_out = {u | u ∈ Vout and Lout(u) ∩ Ci ≠ ∅}
    V^i_in  = {v | v ∈ Vin and Lin(v) ∩ Ci ≠ ∅}

Basically, V^i_out contains those out-anchor vertices which record intermediate vertices (entry points) in chain Ci. Similarly, V^i_in contains those in-anchor vertices which record intermediate vertices (exit points) in chain Ci. Further, for each u ∈ V^i_out, we define L^i_out(u) to be the vertex of Lout(u) ∩ Ci, and for each v ∈ V^i_in we define L^i_in(v) to be the vertex of Lin(v) ∩ Ci. By Corollary 1, Lout(u) ∩ Ci (or Lin(v) ∩ Ci) contains at most one vertex. Given
Figure 6: Generalized Join and Chain-Center Bipartite Graph
this, we introduce the following generalized join operator (similar to a Cartesian product):

    V^i_out ⊗ V^i_in = {(u, v) | u ∈ V^i_out, v ∈ V^i_in, L^i_out(u) ⇝ L^i_in(v)}

In Figure 6(a), assume all the vertices on the left of chain C2 record their corresponding entry points into chain C2, and all the vertices on the right record their exit points. For v = 12, 18, 19, assume Lin(v) ∩ C2 ≠ ∅. For u = 3, 18, 13, 14, 4, assume Lout(u) ∩ C2 ≠ ∅. Then V^2_out = {3, 18, 13, 14, 4} and V^2_in = {12, 18, 19}, and V^2_out ⊗ V^2_in contains all the vertex pairs (u, v), where u is on the left and v is on the right, such that u can reach v via the edges in the graph, i.e., {(3,12), (3,18), (3,19), ···, (4,19)}. It also contains all the edges in the graph, i.e., {(3,6), (7,12), ···, (9,19)}.

We consider {V^1_out, ···, V^k_out} ⊗ {V^1_in, ···, V^k_in} = (V^1_out ⊗ V^1_in) ∪ ··· ∪ (V^k_out ⊗ V^k_in) to be a generalized factorization. Hence, we define the cost of the factorization as follows:

    Cost(factorization) = Σ_{i=1}^{k} |V^i_out| + Σ_{i=1}^{k} |V^i_in|

Given this, we can rewrite our 3-hop reachability labeling problem as a generalized “factorization” problem: by assigning a label Lout(u) to each vertex u ∈ Vout and Lin(v) to each vertex v ∈ Vin, we want to find a factorization {V^1_out, ···, V^k_out} ⊗ {V^1_in, ···, V^k_in} with minimum cost such that

    Con(G) ⊆ (V^1_out ⊗ V^1_in) ∪ ··· ∪ (V^k_out ⊗ V^k_in)

It is easy to see that the 3-hop reachability labeling problem is equivalent to the generalized factorization of Con(G), where the 3-hop indexing cost is equivalent to the corresponding factorization cost:

    Cost(3hop) = Cost(factorization)
In the following subsections, we will derive efficient algorithms to produce a minimized factorization and thus also a minimized 3-hop labeling.
4.2 A Basic Approximation Algorithm for 3-Hop Cover
In this subsection, we will transform the factorization problem
into a set-cover problem. For this purpose, we will first introduce
the notion of the chain-center bipartite graph.
DEFINITION 5. (Chain-Center Bipartite Graph) Given a DAG G and a chain decomposition C1 ∪ C2 ∪ ··· ∪ Ck, we construct the chain-center bipartite graph for each chain as follows. Let Bi = (Xi ∪ Yi, Ei) be the chain-center bipartite graph, where

    Xi = {u | ∃a ∈ Ci such that (u, a) ∈ Con(G)} ∪ {b | b ∈ Ci such that ∃v, (b, v) ∈ Con(G)}
    Yi = {v | ∃b ∈ Ci such that (b, v) ∈ Con(G)} ∪ {a | a ∈ Ci such that ∃u, (u, a) ∈ Con(G)}
    Ei = {(x, y) | x ∈ Xi and y ∈ Yi and (x, y) ∈ Con(G)}
Figure 6(b) is an example showing the bipartite graph for chain C2. Now we can transform the factorization problem into the set-cover problem as follows. Let the grounding set be Con(G). Let the set of candidates be {B̂i | B̂i is a subgraph of Bi, where 1 ≤ i ≤ k}. The weight of a candidate bipartite subgraph should reflect the related index cost, which is defined as the number of vertices in V(B̂i), i.e., weight(B̂i) = |V(B̂i)|. For example, in Figure 6, the circled bipartite subgraph has weight 5.
Then we may apply the classical greedy algorithm [8] to find a minimal set cover as follows. Let R be the set of uncovered contour pairs (initially, R = Con(G)). For each candidate set B̂i, with vertex sets X(B̂i) ⊆ Xi and Y(B̂i) ⊆ Yi and edge set E(B̂i) ⊆ Ei, we define the compression ratio of selecting B̂i as

    ρ(B̂i) = |E(B̂i) ∩ R| / weight(B̂i) = |E(B̂i) ∩ R| / (|X(B̂i)| + |Y(B̂i)|)

At each iteration, the greedy algorithm selects the candidate set with the highest compression ratio and puts it in the resulting set. Then, the algorithm updates R by removing the newly covered contour pairs, R = R \ E(B̂i). The procedure proceeds until all contour pairs are covered (i.e., R = ∅).
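The overall greedy loop is then a short routine; a minimal sketch, assuming a helper densest_subgraph(B, R) that returns (vertices, covered pairs, density) of an approximately densest subgraph of B restricted to the uncovered pairs R (the helper itself is developed below):

    def greedy_3hop_cover(bipartite_graphs, contour, densest_subgraph):
        R = set(contour)                      # uncovered contour pairs
        chosen = []                           # selected (chain, subgraph) joins
        while R:
            best = max((densest_subgraph(B, R) + (i,)
                        for i, B in enumerate(bipartite_graphs)),
                       key=lambda t: t[2])    # pick the highest density
            verts, covered, _, i = best
            chosen.append((i, verts))         # record entry/exit points on chain i
            R -= covered                      # drop newly covered pairs
        return chosen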
It has been proved that the approximation ratio of this algorithm is ln(|Con(G)|) + 1 [8]. We now link this problem and its results back to the aforementioned factorization problem. First, we note that picking a subgraph B̂i in the set cover corresponds to adding a generalized join between X(B̂i) and Y(B̂i), i.e., X(B̂i) ⊗ Y(B̂i). This is because each non-Ci vertex v in B̂i needs to record in Lout(v) an entry point to chain Ci, or to record in Lin(v) an exit point from chain Ci. It is easy to observe that non-Ci vertices account for at least half of B̂i. Given such a labeling, we can guarantee to cover all the edges of E(B̂i), i.e., X(B̂i) ⊗ Y(B̂i) ⊇ E(B̂i). Here, we may produce some edges which do not belong to the contour, but this does not affect the set-cover results. Indeed, in the factorization formulation, we may also produce extra edges which do not belong to the contour. However, those edges all belong to the complete transitive closure and thus do not affect the correctness of our reachability indexing.

Second, we note that the optimal set-cover result will choose at most one subgraph from each chain-center bipartite graph, i.e., each vertex in each bipartite graph will be selected only once. In the greedy algorithm, we may find several subgraphs which all come from the same bipartite graph. In this case, we can simply combine their label sets, and the weight of the resulting subgraph will be no higher than the sum of the weights of these individual subgraphs. Thus, the optimal result of the set-cover problem can be rewritten exactly as a factorization result with each chain having at most one join centered on it, and our approximation bound is maintained.

However, the major issue here is that the number of candidate subgraphs is exponential. A similar issue exists for 2-hop labeling. As suggested in [9], we can deal with this problem by realizing that finding the B̂i of the highest compression ratio is equivalent to finding the densest subgraph of the bipartite graph B′_i = (Xi ∪ Yi, Ei ∩ R).
Given this, the basic idea of the 3-hop labeling algorithm is as follows: in each iteration, we first find the densest subgraph of each bipartite graph B′_i, and then among these k subgraphs we choose the densest one and update the set R of uncovered contour pairs. We repeat this iteration until R is empty.
Since finding the densest subgraph forms the core of our 3-hop
labeling algorithm, we formulate it precisely here:
DEFINITION 6. (Densest Subgraph Problem) Let G = (V, E) be a graph (directed or undirected). For any subset Vs ⊆ V, let G[Vs] = (Vs, Es) be the induced subgraph of G, i.e., Es = E ∩ (Vs × Vs). The densest subgraph problem is to find a subset Vs ⊆ V such that the density of the induced subgraph Gs = (Vs, Es), d = |Es| / |Vs|, is maximized.
The fastest exact algorithm for the densest subgraph problem runs in O(|V||E| log(|V|²/|E|)) [10]. In 2-hop labeling [9], the authors suggest using a linear 2-approximation algorithm for the densest subgraph problem. Their algorithm is a simple variant of [14]. It iteratively removes a vertex with minimal degree from the graph, which yields |V| intermediate subgraphs. It returns the densest of them, which is a 2-approximate densest subgraph, and runs in time linear in the number of edges of the graph.
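A minimal sketch of this peeling procedure (for simplicity it uses a heap, giving O(m log n) rather than the bucket-based linear time; adj maps each vertex to its neighbor set):

    import heapq

    def approx_densest(adj):
        adj = {v: set(ns) for v, ns in adj.items()}    # local, mutable copy
        m = sum(len(ns) for ns in adj.values()) // 2
        heap = [(len(ns), v) for v, ns in adj.items()]
        heapq.heapify(heap)
        best, best_d = set(adj), m / max(len(adj), 1)
        while adj:
            d, v = heapq.heappop(heap)
            if v not in adj or len(adj[v]) != d:
                continue                               # stale heap entry
            for w in adj.pop(v):                       # peel a min-degree vertex
                adj[w].discard(v)
                heapq.heappush(heap, (len(adj[w]), w))
            m -= d
            if adj and m / len(adj) > best_d:
                best, best_d = set(adj), m / len(adj)
        return best, best_d

The densest of the |V| intermediate vertex sets is returned; by the argument in [14], its density is at least half the optimum.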
In the next subsection, we will introduce a new approach to iden-
tify the densest subgraph, which will allow us to prune the search
space of these candidate subgraphs significantly.
4.3 A Faster Algorithm for 3-HOP Labeling
To describe our new algorithm for densest subgraph discovery,
we introduce the rank subgraph.
DEFINITION 7. (Rank Subgraph) Let G = (V, E) be an undirected graph. Given a positive integer d, we remove all the vertices of degree less than d, together with their adjacent edges, from G, and repeat this procedure on the resulting graph. Let Gd be the final subgraph, in which each vertex is adjacent to at least d other vertices of Gd. If no vertices are left in the graph, we refer to the result as the empty graph, denoted G∅. Given this, we construct a subgraph sequence G ⊇ G1 ⊇ G2 ⊇ ··· ⊇ Gl ⊃ Gl+1 = G∅, where Gl ≠ G∅ and Gl contains at least l + 1 vertices. We define l as the rank of the graph G, and Gl as the rank subgraph of G.

Given this, we will use Gl as the approximate densest subgraph.
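A hedged sketch of computing the rank l and the rank subgraph Gl (essentially the largest d whose d-core is non-empty), again with adj mapping each vertex to its neighbor set:

    def rank_subgraph(adj):
        core = {v: set(ns) for v, ns in adj.items()}
        l, G_l = 0, {v: set(ns) for v, ns in core.items()}
        d = 1
        while core:
            queue = [v for v, ns in core.items() if len(ns) < d]
            while queue:                       # iteratively peel low-degree vertices
                v = queue.pop()
                for w in core.pop(v, set()):
                    core[w].discard(v)
                    if len(core[w]) < d:
                        queue.append(w)
            if core:                           # G_d is non-empty, so the rank is >= d
                l, G_l = d, {v: set(ns) for v, ns in core.items()}
            d += 1
        return l, G_l

In Algorithm 2 below, this computation is additionally restricted to the still-uncovered edge set R and to vertices above a rank threshold.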
LEMMA 1. Given G, let Gs be the densest subgraph of G, with density d(Gs), and let Gl be its rank subgraph, with density d(Gl). Then the density of Gl is no less than half of the density of Gs:

    d(Gl) ≥ d(Gs) / 2

Proof Sketch: We prove this by way of contradiction. Suppose d(Gl) < d(Gs)/2, which suggests

    d(Gs) > 2 · d(Gl) = 2|E(Gl)| / |V(Gl)| ≥ 2 (l · |V(Gl)| / 2) / |V(Gl)| = l

Then, we claim that each vertex in Gs must have degree more than l, i.e., for any v ∈ V(Gs), degree(v) > l. If not, assume v ∈ V(Gs) has degree dv ≤ l. Then we could simply remove this vertex to increase the density of the subgraph:

    (|E(Gs)| − dv) / (|V(Gs)| − 1) = (d(Gs)|V(Gs)| − dv) / (|V(Gs)| − 1) > (d(Gs)|V(Gs)| − d(Gs)) / (|V(Gs)| − 1) = d(Gs)

Since each vertex in Gs has degree more than l, we conclude that Gs ⊆ Gl+1. However, Gl+1 = G∅, which contradicts the assumption that there is a Gs with density more than 2 · d(Gl). □
Following this, we have the following interesting observation.

THEOREM 3. Consider k bipartite graphs B1, B2, ···, Bk. Let l1, l2, ···, lk be their respective ranks, let S1, S2, ···, Sk be their respective densest subgraphs, let G_{l1}(B1), G_{l2}(B2), ···, G_{lk}(Bk) be their respective rank graphs, and let lmax = max(l1, l2, ···, lk). Assume we have several maximum rank graphs of rank lmax. Then we claim that any maximum rank graph G_{li}(Bi) with li = lmax has a density no less than half of the density of the maximal-density subgraphs:

    d(G_{li}(Bi)) ≥ max_{1≤j≤k} d(Sj) / 2

Proof Sketch: The proof is similar to that of Lemma 1. We prove this by way of contradiction. Suppose d(G_{li}(Bi)) < max_{1≤j≤k} d(Sj) / 2. Then we can derive

    max_{1≤j≤k} d(Sj) > 2 d(G_{li}(Bi)) = 2|E(G_{li}(Bi))| / |V(G_{li}(Bi))| ≥ 2 (li · |V(G_{li}(Bi))| / 2) / |V(G_{li}(Bi))| = li = lmax

Suppose d(Sp) = max_{1≤j≤k} d(Sj). Then, by a similar argument as in the proof of Lemma 1, all vertices in Sp have degree more than lmax. Hence we conclude Sp ⊆ G_{lmax+1}(Bp). However, G_{lmax+1}(Bp) = G∅ according to the definition of the rank graph, a contradiction. □
The key implication of Theorem 3 is that we can organize all the bipartite graphs in a queue based on their ranks. If we know l is the highest rank among all the bipartite graphs, then we can return the first rank subgraph of rank l we find among these bipartite graphs as a 2-approximate densest subgraph. We employ this technique in the greedy algorithm by deriving an efficient incremental search procedure for the densest subgraph among these bipartite graphs at every iteration.
Algorithm 2 3HOPContour(G, Con(G), C1 ∪ ··· ∪ Ck)
1: Construct bipartite graphs B1, ···, Bk;
2: For each Bi, construct vertex rank groups, compute the rank r_i of Bi and the density d_i of the rank graph G_{ri}(Bi);
3: Sort all Bi into queue Q in descending order of r_i;
4: R ← Con(G);
5: Pop the first element B from the queue Q;
6: while R ≠ ∅ do
7:   while B.r < B′.r {B′ (= Q.pop()) is the next element in Q after popping the last bipartite graph; B′.r is its saved rank} do
8:     B′.r ← RankSubgraph(B′, R, B.r)
9:     if B.r < B′.r then
10:      insert B back into Q in the sorted order;
11:      B ← B′
12:    else
13:      insert B′ back into Q in the sorted order;
14:    end if
15:  end while
16:  R ← R \ E(G_r(B));
17:  Update Lout and Lin for the vertices in the selected G_r(B);
18:  B.r ← RankSubgraph(B, R, 0);
19: end while
The sketch of our 3-hop labeling construction algorithm, 3HOPContour, is given in Algorithm 2. It starts by constructing k bipartite graphs, each corresponding to a chain in 3-hop. Initially, we directly compute the rank of each bipartite graph and the density of its corresponding rank subgraph (Line 2). We then sort all the bipartite graphs by rank and put them in a queue Q
(Line 3). Our goal is to cover the entire transitive closure contour R = Con(G). The algorithm iteratively picks the densest subgraphs and removes their edges until all the edges (vertex pairs) in the transitive closure contour are covered (R = ∅). During this covering process, we can make the following observation about the rank of each bipartite graph: the rank of any bipartite graph cannot increase during the covering process. This is because, as the covering proceeds, an increasing number of edges in the contour become covered, and likewise for the edge set of each bipartite graph. Say at a certain iteration we compute the rank of a bipartite graph B, denoted B.r. If we later reevaluate its rank on the updated graph, whose edge set is E(B) ∩ R, the updated rank cannot exceed the earlier rank B.r. Indeed, we can use B.r as an upper bound on B's new rank.

To further speed up the rank subgraph search procedure, we organize the vertices of each bipartite graph into rank groups: for a given bipartite graph with rank l, let Gd be the resulting subgraph as we iteratively remove all the vertices of degree less than d. Thus, we have a subgraph sequence G ⊇ G1 ⊇ G2 ⊇ ··· ⊇ Gl ⊃ Gl+1 = G∅. We assign each vertex v a rank d if v ∈ V(Gd) and v ∉ V(Gd+1). Given this, all the vertices with the same rank are organized together in each bipartite graph. We note that the rank of each vertex cannot increase during the covering process either. Thus, using this organization, we can quickly prune the vertices with rank lower than a given threshold. This is applied to facilitate the rank graph search procedure.
The main iteration of our algorithm is the loop in Lines 6 to 19. In every iteration, we greedily select the densest subgraph among our k bipartite graphs. This is done using the queue in the while loop of Lines 7 to 15, visiting each bipartite graph in queue order (Line 7). Let B be the bipartite graph with the highest rank among all bipartite graphs visited in the current iteration. We always extract the first bipartite graph B′ from the queue Q and compare its saved rank B′.r, which is an upper bound on its real rank, with B's real rank B.r.
If B′.r ≤ B.r, we know that the current rank is the highest any bipartite graph can achieve, since none of the remaining bipartite graphs in the queue can have a rank higher than B′.r. Thus, we do the following: 1) we extract the highest-ranked subgraph G_r(B) and use it to cover R (Line 16); 2) we update Lout and Lin for the vertices in G_r(B) (Line 17); 3) we recompute the rank of B immediately and use it as the first candidate rank for the next iteration (Line 18).
However, if this is not the case (B′.r > B.r), we need to check whether the true rank of B′ is higher than B.r. Here we apply the vertex rank group organization to speed up the search: since we already have bipartite graph B with rank B.r, we are not interested in B′ if its rank is equal or lower. Thus, we invoke the RankSubgraph procedure with three parameters: B′, the targeted bipartite graph; R, the uncovered edges; and the minimal rank in which we are interested, in this case B.r, since only higher ranks matter. The procedure uses R to remove the edges of B′ that are no longer in R and to update the vertex rank groups, touching only the vertices with rank higher than B.r. This is done in Line 8. For brevity, we omit the details of the RankSubgraph procedure.
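This control flow, where cached ranks serve as stale upper bounds that are re-validated only on demand, is an instance of the classic lazy-greedy pattern. The following Python sketch isolates that pattern; the evaluate, cover, and done helpers are hypothetical stand-ins for RankSubgraph and the contour-covering bookkeeping of Algorithm 2, and we assume the candidates suffice to complete the cover:

import heapq

def lazy_greedy_cover(candidates, evaluate, cover, done):
    # Cached scores act as upper bounds: true scores only decrease as the
    # cover grows (mirroring how a bipartite graph's rank never increases
    # while the contour is being covered).
    heap = [(-evaluate(c), c) for c in candidates]   # max-heap via negation
    heapq.heapify(heap)
    while not done():
        neg_cached, c = heapq.heappop(heap)
        score = evaluate(c)                  # lazy re-evaluation
        if heap and score < -heap[0][0]:
            # The cached bound was stale; another candidate may now be
            # best, so push c back with its refreshed score and retry.
            heapq.heappush(heap, (-score, c))
            continue
        cover(c)                             # c is certifiably the best
        heapq.heappush(heap, (-evaluate(c), c))   # refresh, as in Line 18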
Putting all of this together, we can see that Algorithm 2 creates Lout and Lin for the out-anchor and in-anchor vertices of the transitive closure contour Con(G). As an example, in Figure 6 one of the densest bipartite subgraphs is the circled subgraph, which could be selected by Algorithm 2. If selected, Algorithm 2 will add 9 to Lout(4), Lout(14), and Lin(19). The complete labeling sets Lin and Lout produced by Algorithm 2 are shown in Figure 7, where we show Lin(u) (or Lout(u)) of a vertex u only if it is not empty, marking Lin sets with i and Lout sets with o.
Finally, we can state the following optimality guarantees for our 3HOPContour algorithm. Due to space constraints, we omit the proofs.
THEOREM 4. The 3HOPContour algorithm finds a 3-hop labeling for the transitive closure contour Con(G) whose size is larger than the smallest such labeling by at most an O(ln |Con(G)| + 1) = O(log n) factor, where n is the number of vertices in G.
THEOREM 5. For any DAG G, the minimum 3-hop labeling cost (defined previously as Cost(3hop)) for the transitive closure contour Con(G), Opt_3hop, is always no larger than the minimum labeling cost of 2-hop, Opt_2hop. In addition, the upper bound of the 3-hop labeling cost produced by the 3HOPContour algorithm, O((ln |Con(G)| + 1) · Opt_3hop), is always no larger than O((ln |V|² + 1) · Opt_2hop), the upper bound of the labeling cost produced by Cohen et al.'s 2-hop algorithm [9].
Figure 7: 3-Hop Labeling of Transitive Closure Contour
5. REACHABILITY QUERY PROCESSING USING 3-HOP INDEXING
In Section 4, we show how to construct the 3-hop labeling for the transitive closure contour. As a result of Algorithm 2, we get Lout(u) and Lin(v) for each out-anchor vertex u and each in-anchor vertex v, respectively. In this section we will show how to efficiently answer reachability queries using these labelings. We describe two approaches: the first approach directly applies the 3-hop labeling of the contour to achieve a worst-case time complexity of O(|Con(G)|), while the second approach utilizes segments to reduce the query processing complexity.
5.1 3-HOP Contour Query Processing
Note that the 3-hop labeling of the transitive closure contour Con(G) ensures that the reachability of any pair of vertices in a DAG G can be inferred. This is because the 3-hop labeling covers all the vertex pairs in Con(G), and Con(G) covers all the other vertex pairs in the transitive closure matrix.
Given this, to tell whether vertex u in chain Ci can reach vertex v in chain Cj, we can first recover the pseudo-diagonal of M_{Ci,Cj} using the 3-hop labeling and then test whether (u, v) is dominated by any of the pseudo-diagonal cells. However, we do not need to consider those pseudo-diagonal cells, or the closure vertex pairs, whose out-anchor vertex is smaller than u or whose in-anchor vertex is bigger than v. We can integrate these steps together and obtain the following query processing procedure:
Step 1: In chain Ci (u ∈ Ci), we collect the smallest vertex on every other chain that u can reach through an out-anchor vertex u′ with u ⪯ u′, writing Lout^{x.cid}(u′) = Lout(u′) ∩ C_{x.cid}:

    X = { x | x ∈ ∪_{u ⪯ u′} Lout(u′) and x ⪯ Lout^{x.cid}(u′) for every u′ ⪰ u }

Step 2: In chain Cj (v ∈ Cj), we collect the largest vertex on every other chain that can reach v through an in-anchor vertex v′ with v′ ⪯ v, writing Lin^{y.cid}(v′) = Lin(v′) ∩ C_{y.cid}:

    Y = { y | y ∈ ∪_{v′ ⪯ v} Lin(v′) and Lin^{y.cid}(v′) ⪯ y for every v′ ⪯ v }

Step 3: We check whether there is a pair x ∈ X, y ∈ Y such that x.cid = y.cid and x ⪯ y.
Using the highway analogy, the first step collects the entry points u can reach on the intermediate chains, the second step collects the exit points which reach v on the intermediate chains, and the third step checks whether an entry point can reach an exit point, i.e., whether they are on the same chain with the entry point having the smaller sequence number. Note that the worst-case query processing cost is O(|Con(G)|). This follows from the fact that for any out-anchor vertex u′ and any v′ ∈ Lout(u′), we have (u′, v′) ∈ Con(G) (and similarly for any in-anchor vertex). Thus, the first two steps cost at most O(|Con(G)|) time, and the third step costs O(k), where k is the number of chains in the chain decomposition.
For example, in Figure 7, to tell whether u = 2 can reach v = 20, we get set X = {6, 15} by checking Lout(3), Lout(4), and Lout(5), and set Y = {9, 13} by checking Lin(19), Lin(18), and Lin(17). Since 6 (∈ X) reaches 9 (∈ Y) in C2, we conclude that u can reach v.
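To make the three steps concrete, here is a minimal Python sketch of the contour query. The (cid, oid) dictionaries and list-of-chains representation are our assumptions, and the two explicit boundary checks (for the case where the intermediate chain is u's or v's own chain) are ours as well:

def reach_contour(u, v, cid, oid, chains, Lout, Lin):
    # cid[w], oid[w]: chain id and 1-based sequence number of vertex w;
    # chains[c]: vertices of chain c in sequence order;
    # Lout[w] / Lin[w]: contour labels of w as (chain id, oid) pairs.
    if cid[u] == cid[v]:
        return oid[u] <= oid[v]               # same chain: order decides
    X = {}                                    # chain id -> smallest entry oid
    for u2 in chains[cid[u]][oid[u] - 1:]:    # out-anchors u' with u <= u'
        for c, o in Lout.get(u2, ()):
            if c not in X or o < X[c]:
                X[c] = o
    Y = {}                                    # chain id -> largest exit oid
    for v2 in chains[cid[v]][:oid[v]]:        # in-anchors v' with v' <= v
        for c, o in Lin.get(v2, ()):
            if c not in Y or o > Y[c]:
                Y[c] = o
    # Boundary cases (our addition): the intermediate chain may be v's
    # or u's own chain.
    if cid[v] in X and X[cid[v]] <= oid[v]:
        return True
    if cid[u] in Y and oid[u] <= Y[cid[u]]:
        return True
    # Step 3: an entry point precedes an exit point on a common chain.
    return any(c in Y and X[c] <= Y[c] for c in X)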
5.2 3-HOP Segment Query Processing
In this subsection, we introduce an indexing method on top of the 3-hop contour labeling that reduces the query processing complexity; the worst-case query processing time becomes O(log n + k), where n is the number of vertices in G. The major bottleneck of the first approach lies in its first two steps. To speed them up, the new approach breaks each chain into segments. Specifically, each chain Ci is broken into outgoing segments and incoming segments.
We construct the segments of chain Ci with respect to another chain Cj based on the 3-hop contour labeling. Let Qout(i, j) be the set of out-anchor vertices of chain Ci which record an intermediate entry point in chain Cj:

    Qout(i, j) = { x | x ∈ Ci, Lout(x) ∩ Cj ≠ ∅ }

Let Qin(i, j) be the set of in-anchor vertices of chain Cj which record an intermediate exit point in chain Ci:

    Qin(i, j) = { y | y ∈ Cj, Lin(y) ∩ Ci ≠ ∅ }
Then we order the vertices x1, ..., xl in Qout(i, j) such that x1 ⪯ x2 ⪯ ··· ⪯ xl, l = |Qout(i, j)|, and the vertices y1, ..., yl′ in Qin(i, j) such that y1 ⪯ y2 ⪯ ··· ⪯ yl′, l′ = |Qin(i, j)|. Given this, we construct the outgoing segments for Ci, denoted by their sequence numbers,

    (1, x1.oid), (x1.oid + 1, x2.oid), ..., (x_{l−1}.oid + 1, xl.oid)

and the incoming segments for Cj,

    (y1.oid, y2.oid − 1), (y2.oid, y3.oid − 1), ..., (y_{l′}.oid, Cj.last().oid)
For example, in Figure 7, the outgoing segments constructed from Qout(1, 2) are (1, 3) and (4, 4), where Lout^2(1, 3) = 6 and Lout^2(4, 4) = 9. The incoming segments constructed from Qin(3, 4) are (17, 17) and (18, 20), where Lin^3(17, 17) = 11 and Lin^3(18, 20) = 13.
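A minimal Python sketch of the outgoing-segment construction for one chain pair follows (names are illustrative); it relies on the property stated in the next paragraph, namely that all vertices of a segment share the entry point recorded at the segment's closing anchor:

def outgoing_segments(chain_i, cj, Lout, oid):
    # chain_i: vertices of Ci in sequence order; cj: chain id of Cj;
    # oid[w]: 1-based sequence number of w on its chain.
    # Returns [((start_oid, end_oid), entry_oid_on_Cj), ...].
    q = [x for x in chain_i
         if any(c == cj for c, _ in Lout.get(x, ()))]    # Qout(i, j)
    segments, start = [], 1
    for x in q:
        entry = min(o for c, o in Lout[x] if c == cj)    # shared entry point
        segments.append(((start, oid[x]), entry))
        start = oid[x] + 1
    return segments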
We say a vertex v is in a segment S = (x, y) (denoted as v ∈ S) if x ≤ v.oid ≤ y. We note that all the vertices in each outgoing segment share the same entry point of chain Cj, and all the vertices in each incoming segment share the same exit point of chain Ci. Thus, we assign each outgoing segment (or incoming segment) a unique vertex on chain Cj (or Ci) as its label.
In the 3-hop segment indexing, we construct these outgoing and incoming segments of each chain with respect to every other chain. Then, for all the segments which share the same starting vertex and ending vertex, we combine their individual labels into Lout(S), where S is the combined segment. In addition, to facilitate query processing, we construct an interval tree [3] for all the outgoing segments in a single chain Ci and an interval tree for all the incoming segments in a single chain Cj. Given this, the new query processing procedure for answering whether u can reach v is as follows:
Step 1: In chain Ci (u ∈ Ci), we collect all the outgoing segments that contain u and combine their labels into X:

    X = { x | x ∈ ∪_{S ∋ u} Lout(S) and x ⪯ Lout^{x.cid}(S) for every S ∋ u }

Step 2: In chain Cj (v ∈ Cj), we collect all the incoming segments that contain v and combine their labels into Y:

    Y = { y | y ∈ ∪_{S ∋ v} Lin(S) and Lin^{y.cid}(S) ⪯ y for every S ∋ v }

Step 3: We check whether there is a pair x ∈ X, y ∈ Y such that x.cid = y.cid and x ⪯ y.
The worst-case query processing time is O(log n + k). Although the number of segments can be as large as n², the number of segments covering u or v is no more than k, and the interval tree returns the segments covering u in O(log n + k) time. Finally, we note that the extra segments contribute an O(nk) storage cost on top of the 3-hop contour labeling.
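Within a single chain pair the segments are disjoint and sorted, so in a sketch a plain binary search can stand in for the interval tree; the interval tree of [3] is needed once segments from different chain pairs are combined and may overlap. A minimal Python sketch of the per-pair lookup:

import bisect

def segment_label(segments, w_oid):
    # segments: [((start, end), label), ...], sorted by start and disjoint
    # within one chain pair; returns the label of the segment containing
    # sequence number w_oid, or None if w_oid lies in no segment.
    starts = [s for (s, _), _ in segments]
    i = bisect.bisect_right(starts, w_oid) - 1
    if i >= 0:
        (s, e), label = segments[i]
        if s <= w_oid <= e:
            return label
    return None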
6. EXPERIMENTAL EVALUATION
In this section, we empirically compare the new 3-hop labeling approach with the state-of-the-art simple graph covering approach, the path-tree cover, and with the 2-hop labeling approach, on both synthetic and real data. We also list the query times of two classical approaches, breadth-first search and depth-first search, as baselines. We are particularly interested in the following issues:
1. Index size: The major goal of this work is to derive an indexing scheme for reachability queries that can significantly compress the transitive closure when the ratio between the number of edges and the number of vertices is relatively high. Specifically, we would like to learn how much we gain by using 3-hop labeling compared with the two best available indexing approaches, path-tree and 2-hop. Since each vertex in the path-tree is labeled by three numbers (two tree-interval numbers and one depth-first order number), and each vertex in 3-hop is labeled by two numbers (cid and oid), we define the index size of the path-tree scheme for a graph G = (V, E) to be the size of the transitive closure plus 3|V|, and the index size of 3HOP-Contour to be cost(3hop) (defined in Subsection 4.1) plus 2|V|. The index size of 3HOP-Segment is the size of all segments, i.e., two times the number of segments, plus the cost of labeling; in this case each segment, rather than each vertex, carries a label. It is easy to observe that the total labeling cost of 3HOP-Segment is cost(3hop), the same as that of 3HOP-Contour (see the sketch after this list).
2. Query processing time: As we mentioned before, there is a
trade-off between the compression rate of the transitive clo-
sure and the query answering time. In order to achieve a
high compression rate, the 3-hop indexing approach clearly
requires more runtime processing for answering reachability
queries than path-tree. However, the interesting question is
how fast 3-hop can answer queries and whether it is compa-
rable with path-tree and 2-hop.
3. Construction time: A major advantage of 3-hop compared with 2-hop is that it does not require computing the full transitive closure, and it employs a new strategy to speed up the densest subgraph identification. How much can these factors speed up the labeling process of 3-hop compared with the 2-hop approach?
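As noted in issue 1 above, the index-size accounting reduces to simple arithmetic; a minimal Python sketch (function and parameter names are ours):

def index_sizes(num_vertices, tc_size, cost_3hop, num_segments):
    # tc_size: transitive closure size stored by Path-Tree;
    # cost_3hop: total 3-hop labeling cost (Subsection 4.1);
    # num_segments: number of combined segments in 3HOP-Segment.
    path_tree   = tc_size + 3 * num_vertices     # three numbers per vertex
    hop_contour = cost_3hop + 2 * num_vertices   # cid and oid per vertex
    hop_segment = cost_3hop + 2 * num_segments   # two endpoints per segment
    return path_tree, hop_contour, hop_segment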
Given this, we have compared six algorithms in the experimental evaluation: 1) the original 2-hop approach by Cohen et al. [9], denoted 2HOP; 2) the path-tree approach (PTree-1) proposed by Jin et al. [12], denoted Path-Tree; 3) the 3-hop labeling approach with 3-hop contour query processing, denoted 3HOP-Contour; 4) the 3-hop labeling approach with 3-hop segment query processing, denoted 3HOP-Segment; 5) breadth-first search; and 6) depth-first search. We have implemented all six algorithms; the Path-Tree implementation is an improved second version of the one in [12]. In addition, since 3-hop needs a chain decomposition, we implemented a heuristic algorithm developed by Jagadish (procedure-3 in [11]). All algorithms are implemented in C++ using the Standard Template Library (STL). We perform the experiments on a Linux 2.6 machine with a 2.0GHz CPU and 8.0GB RAM.
In the experiments, we collect all three measures: the index size, the query time, and the index construction time; each experiment processes 100,000 randomly generated queries.
6.1 Synthetic Datasets
Here, we run two sets of experiments using synthetic DAGs, which are generated by the random directed acyclic graph generation algorithm described in [13].
In the first experiment, we generate a set of DAGs with 2,000 vertices and vary their average density from 2 to 12. We compare all six approaches, 3HOP-Segment, 3HOP-Contour, 2HOP, Path-Tree, breadth-first search, and depth-first search, in this experiment.
From Figure 8, both 3HOP-Segment and 3HOP-Contour consistently obtain a better index size compression rate than 2HOP and Path-Tree on all synthetic datasets. Overall, the index sizes of 3HOP-Contour and 3HOP-Segment are on average about 2.7 times and 2.0 times better than the Path-Tree approach, and about 1.5 times and 1.1 times better than 2HOP.
Figure 8: Index size of Synthetic Datasets (2K). [Plot: index size vs. |E|/|V| for 3HOP-Contour, 3HOP-Segment, Path-Tree, and 2HOP on the rand2k datasets.]
Figure 9: Index size of Synthetic Datasets (10K). [Plot: index size vs. |E|/|V| for 3HOP-Contour, 3HOP-Segment, and Path-Tree on the rand10k datasets.]
Dataset     3HOP-Contour  3HOP-Segment  Path-Tree  2HOP     Breadth-First Search  Depth-First Search
rand2k_2    22.865        165.646       9.108      70.239   891.957               891.502
rand2k_4    49.354        566.175       26.051     297.801  2197.01               1796.84
rand2k_6    77.686        1092.2        33.785     514.546  4397.49               4358.84
rand2k_8    103.769       1422.82       31.626     589.059  6134.99               7553.84
rand2k_10   124.291       1661.82       28.322     574.64   7499.31               11305.3
rand2k_12   141.825       1748.04       28.411     722.005  8628.21               14917.4
Table 2: Query Time of Synthetic Datasets (2K, in ms)
Dataset   DAG #V  DAG #E  Density
Arxiv     6000    66707   11.12
Citeseer  10720   44258   4.13
Go        6793    13361   1.97
Pubmed    9000    40028   4.45
Yago      6642    42392   6.38
Table 3: Real datasets
On the other hand, in Table 2 we observe that Path-Tree has a moderately faster query time than 3HOP-Contour, as expected. However, 3HOP-Contour has not only a smaller index size but also a shorter query time than 2HOP, as shown in Table 2. It is interesting to observe that 3HOP-Contour is faster than 3HOP-Segment, even though the query time complexity of 3HOP-Segment is better. In practice, 3HOP-Segment needs more memory access operations (e.g., searching interval trees and processing search results), and the interval trees are too big to be loaded into the system caches. Thus, it is reasonable that the query time of 3HOP-Contour is better than that of 3HOP-Segment.
In terms of construction time, 3HOP-Contour and 3HOP-Segment are several orders of magnitude faster than 2HOP. In this experiment, 2HOP takes between 7 and 21 hours to construct the index for a dataset, while 3HOP-Segment and 3HOP-Contour take only 1 to 71 seconds. To explain this, note that 3-hop takes O((kn²) · |Con(G)|) construction time (recall that we have k bipartite graphs corresponding to k chains, and each bipartite graph starts as a complete bipartite graph with O(n²) edges), while 2-hop takes O(n³ · |Tc|), where |Con(G)| is the number of contour points and |Tc| is the size of the transitive closure. Although in the worst case |Con(G)| can equal |Tc|, in practice |Con(G)| is much smaller. In addition, we have developed and implemented a new technique (Theorem 3) which can speed up 3-hop labeling by up to a factor of O(k).
In the second experiment, we generate random DAGs with 10,000 vertices and vary their densities from 2 to 25. Note that we do not compare with 2HOP in this experiment because 2HOP cannot process such large-scale datasets due to memory constraints. Figure 9 shows the index sizes of the two 3-hop approaches and the path-tree approach. Here, 3HOP-Contour and 3HOP-Segment achieve up to 6.0 times and 5.3 times smaller index sizes than Path-Tree; on average, they have 3.9 times and 3.1 times smaller index sizes, respectively. The query processing time and construction time are similar to those of the first experiment, and we omit them here.
It is interesting to observe a peak at density 10 in the index size of all three algorithms. Since 3-hop labeling relies on chain decomposition and path-tree labeling depends on path decomposition, an increase in density can result in a better chain or path decomposition (i.e., one with fewer chains or paths for the DAG). This can explain the peak phenomenon.
6.2 Real Datasets
To evaluate our indexing scheme on real datasets, we have collected the five real datasets listed in Table 3. All graphs are extracted from real-world large datasets with density larger than or close to 2. Among them, arXiv is extracted from a dataset of citations among scientific papers from the arxiv.org website^1. Similarly, citeseer contains citations among scientific literature publications from the CiteSeer project^2, and pubmed was extracted from an XML registry of open-access medical publications from the PubMed Central website^3. GO contains genetic terms and their relationships from the Gene Ontology project^4. Yago describes the structure of relationships among terms in the semantic knowledge database from the YAGO project^5.
Table 4 shows the index size and query time of three methods: the two 3-hop approaches and the path-tree approach. Again, in this experiment the 2HOP approach fails by running out of memory. As shown in the table, the index sizes of 3HOP-Contour are reduced significantly with respect to Path-Tree, and the index sizes of 3HOP-Segment are smaller than Path-Tree on 3 out of 5 datasets. On average, 3HOP-Contour and 3HOP-Segment obtain 1.7 times and 1.2 times better compression rates than the Path-Tree approach. As expected, the query time of Path-Tree is better than that of the 3-hop approaches.
The 3HOP-Contour has a similar construction time to 3HOP-Segment; therefore, we only report the 3HOP-Contour construction time here. It takes 8530, 106, 25, 257, and 25 seconds to construct the index for the arXiv, citeseer, go, pubmed, and yago datasets, respectively. Path-Tree is much faster, taking only 10, 0.73, 0.2, 0.77, and 0.55 seconds, respectively, for these datasets. This is expected since the 3-hop approach is computationally more expensive. However, the new approach has an evidently higher compression rate, and its query processing time is comparable to the path-tree approach.
7. CONCLUSION
In this work, we introduce a new 3-hop indexing scheme with a high compression rate, targeting directed graphs with a higher edge-to-vertex ratio. We not only show that our index size achieves a guaranteed approximation bound, but also demonstrate its applicability through extensive experimental evaluation on both real and synthetic datasets. More importantly, we believe this method potentially opens a new way to compress the transitive closure and leads to new provocative questions. For instance, how can other simple graph structures, such as trees, serve as the intermediate hop
leads to new provocative questions. For instance, how can other
simple graph structures, such as trees, serve as the intermediate hop
(highway)? How can we derive the average complexity of these
1http://arxiv.org/
2http://citeseer.ist.psu.edu/oai.html
3http://www.pubmedcentral.nih.gov/
4http://www.geneontology.org/
5http://www.mpi-inf.mpg.de/ suchanek/downloads/yago/
825
Dataset Index Size Query Time (in ms)
3HOP-Contour 3HOP-Segment Path-Tree 3HOP-Contour 3HOP-Segment Path-Tree Breadth-First Search Depth-First Search
ArXiv 47472 64378 86855 125.382 1060.2 24.278 19029.2 129587
Citeseer 51035 72167 91820 87.763 523.488 23.32 4567.16 4781.19
Go 27764 41798 37729 53.354 250.261 10.39 2697.67 2780.23
Pubmed 54531 72215 107915 72.491 533.495 21.818 4083.08 4224.54
Yago 27038 39638 39181 44.495 229.416 12.256 2605.56 2622.23
Table 4: Comparison between 3HOP and Path-Tree
compression approaches, including the simple graph covering ap-
proaches, 2-hop, and 3-hop? We plan to investigate these problems
in the future.
8. REFERENCES
[1] R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient
management of transitive relationships in large data and
knowledge bases. In SIGMOD, pages 253–262, 1989.
[2] Renzo Angles and Claudio Gutierrez. Survey of graph
database models. ACM Comput. Surv., 40(1):1–39, 2008.
[3] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry. Springer, 2000.
[4] Li Chen, Amarnath Gupta, and M. Erdem Kurul.
Stack-based algorithms for pattern matching on dags. In
VLDB ’05: Proceedings of the 31st international conference
on Very large data bases, pages 493–504, 2005.
[5] Yangjun Chen and Yibin Chen. An efficient algorithm for
answering graph reachability queries. In ICDE, pages
893–902, 2008.
[6] Jiefeng Cheng, Jeffrey Xu Yu, Xuemin Lin, Haixun Wang,
and Philip S. Yu. Fast computation of reachability labeling
for large graphs. In EDBT, pages 961–979, 2006.
[7] Jiefeng Cheng, Jeffrey Xu Yu, Xuemin Lin, Haixun Wang,
and Philip S. Yu. Fast computing reachability labelings for
large graphs with high compression rate. In EDBT, pages
193–204, 2008.
[8] V. Chvátal. A greedy heuristic for the set-covering problem. Math. Oper. Res., 4:233–235, 1979.
[9] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick.
Reachability and distance queries via 2-hop labels. In
Proceedings of the 13th annual ACM-SIAM Symposium on
Discrete algorithms, pages 937–946, 2002.
[10] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast
parametric maximum flow algorithm and applications. SIAM
J. Comput., 18(1):30–55, 1989.
[11] H. V. Jagadish. A compression technique to materialize
transitive closure. ACM Trans. Database Syst.,
15(4):558–598, 1990.
[12] Ruoming Jin, Yang Xiang, Ning Ruan, and Haixun Wang.
Efficiently answering reachability queries on very large
directed graphs. In SIGMOD Conference, pages 595–608,
2008.
[13] Richard Johnsonbaugh and Martin Kalin. A graph generation
software package. In SIGCSE ’91: Proceedings of the
twenty-second SIGCSE technical symposium on Computer
science education, pages 151–154, New York, NY, USA,
1991. ACM.
[14] Guy Kortsarz and David Peleg. Generating sparse
2-spanners. In SWAT ’92: Proceedings of the Third
Scandinavian Workshop on Algorithm Theory, pages 73–82,
1992.
[15] R. Schenkel, A. Theobald, and G. Weikum. HOPI: An
efficient connection index for complex XML document
collections. In EDBT, 2004.
[16] K. Simon. An improved algorithm for transitive closure on
acyclic digraphs. Theor. Comput. Sci., 58(1-3):325–346,
1988.
[17] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel.
Efficient aggregation for graph summarization. In SIGMOD
Conference, 2008.
[18] Silke Trißl and Ulf Leser. Fast and practical indexing and
querying of very large graphs. In SIGMOD ’07: Proceedings
of the 2007 ACM SIGMOD international conference on
Management of data, pages 845–856, 2007.
[19] Haixun Wang, Hao He, Jun Yang, Philip S. Yu, and
Jeffrey Xu Yu. Dual labeling: Answering graph reachability
queries in constant time. In ICDE ’06: Proceedings of the
22nd International Conference on Data Engineering
(ICDE’06), page 75, 2006.
[20] Xifeng Yan, Philip S. Yu, and Jiawei Han. Substructure
similarity search in graph databases. In SIGMOD
Conference, pages 766–777, 2005.