Title: Benchmarking for Graph Clustering and Partitioning
Name: David A. Bader1, Andrea Kappes2, Henning Meyerhenke2,
Peter Sanders2, Christian Schulz2, Dorothea Wagner2
Affil./Addr. 1: School of Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332, United States
E-mail: bader@cc.gatech.edu
Affil./Addr. 2: Institute of Theoretical Informatics
Karlsruhe Institute of Technology (KIT)
76128 Karlsruhe, Germany
E-mail: {meyerhenke, sanders,
christian.schulz, dorothea.wagner}@kit.edu
Benchmarking for Graph Clustering and
Partitioning
Synonyms
Test Instances, Graph Repository, Algorithm Evaluation
Glossary
Benchmarking: Performance evaluation for comparison to the state of the art.
Benchmark Suite: Set of instances used for benchmarking.
Definition
Benchmarking refers to a repeatable performance evaluation as a means to compare some-
body’s work to the state of the art in the respective field. As an example, benchmarking
can compare the computing performance of new and old hardware.
In the context of computing, many different benchmarks of various sorts have been
used. A prominent example is the Linpack benchmark of the TOP500 list of the fastest
computers in the world, which measures the performance of the hardware by solving a
dense linear algebra problem. Different categories of benchmarks include sequential vs.
parallel, microbenchmark vs. application, or fixed code vs. informal problem description.
See e. g. [45] for a more detailed treatment of hardware evaluation.
When it comes to benchmarking algorithms for network analysis, typical measures
of interest are solution quality and running time. The comparison process requires the
establishment of widely accepted benchmark instances on which the algorithms have
to compete. In the course of the 10th DIMACS Implementation Challenge on Graph
Partitioning and Graph Clustering [7], we have assembled a suite of graphs and graph
generators intended for comparing graph algorithms with each other. While our particular
focus has been on assembling instances for benchmarking graph partitioning and graph
clustering algorithms, we believe the suite to be useful for related fields as well. This
includes the broad field of network analysis (which includes graph clustering, also known
as community detection) and various combinatorial problems.
The purpose of DIMACS Implementation Challenges is to assess the practical
performance of algorithms, in particular in problem settings where worst case and proba-
bilistic analysis yield unrealistic results. Where analysis fails, experimentation can provide
insights into realistic algorithm performance. By evaluating different implementations on
the assembled benchmark suite, the challenges create a reproducible picture of the state
of the art in the area under consideration. This helps to foster an effective technology
transfer within the research areas of algorithms, data structures, and implementation
techniques as well as a transfer back to the original applications.
Introduction
Graph partitioning and graph clustering (or community detection) are ubiquitous sub-
tasks in many application areas. Generally speaking, both techniques aim at the identi-
fication of vertex subsets (clusters) with many internal and few external edges. In this
work we concentrate our description on aspects important to the field of network analysis,
in particular on community detection.
In its most general form, community detection requires neither a fixed number k
of clusters nor constraints on the size of the clusters. Instead, a quality function is
used that measures both the density inside clusters and the sparseness between them.
A variety of such functions has been proposed; among them, the measure modularity
has proven fairly reliable in the literature and largely in accordance with human
intuition:
Problem 1 (Modularity Maximization). Given an undirected, weighted graph G =
(V, E, ω) without parallel edges, find a partition C of V which optimizes the modularity
objective function

Q(\mathcal{C}) := \sum_{C \in \mathcal{C}} \left[ \frac{\sum_{\{u,v\} \in E,\; u,v \in C} \omega(\{u,v\})}{\sum_{e \in E} \omega(e)} \;-\; \frac{\left(\sum_{v \in C} s(v)\right)^{2}}{4 \left(\sum_{e \in E} \omega(e)\right)^{2}} \right].
Here we assume that E is a multiset of edges e = {u, v} (i. e., self-loops u = v are allowed)
and that the strength s(v) of a node v is the sum of the weights of its incident edges.
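To make the definition concrete, here is a minimal Python sketch (ours, not part of the challenge tooling; the edge-list representation and function name are illustrative) that evaluates Q(C) for a weighted, undirected simple graph:

    from collections import defaultdict

    def modularity(edges, cluster_of):
        """Modularity Q of a clustering of a weighted, undirected simple graph.

        edges:      iterable of (u, v, weight), each undirected edge listed once,
                    without self-loops or parallel edges.
        cluster_of: dict mapping every node to its cluster id.
        """
        total = 0.0                    # sum of all edge weights
        intra = defaultdict(float)     # weight of edges inside each cluster
        strength = defaultdict(float)  # summed node strengths s(v) per cluster

        for u, v, w in edges:
            total += w
            strength[cluster_of[u]] += w
            strength[cluster_of[v]] += w
            if cluster_of[u] == cluster_of[v]:
                intra[cluster_of[u]] += w

        return sum(intra[c] / total - strength[c] ** 2 / (4.0 * total ** 2)
                   for c in strength)

For example, a graph consisting of two disjoint unit-weight triangles, clustered into the two triangles, yields Q = 0.5.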
Recently, criticism of modularity has emerged. Fortunato and Barthélemy [18]
demonstrate that global modularity optimization cannot distinguish between a single
community and a group of smaller communities. Berry et al. [10] provide a weighting
mechanism and Lambiotte a coarse-graining parametrization [27] that both somewhat al-
leviate this resolution limit. Yet, other problems remain [20; 29]. That is one
reason why the 10th DIMACS Implementation Challenge had a second graph clustering
category apart from modularity maximization. In this second category, algorithms are
compared with respect to four different objective functions, with the goal of explicitly
inviting clustering algorithms that are not based on a specific objective function. These
objective functions are performance [42], intracluster density, intercluster conductance,
and intercluster expansion [22].
In contrast to graph clustering, the term graph partitioning usually implies that
the number of blocks is fixed in advance and the task is to partition the vertex set into blocks
of (almost) equal size. Its main applications do not belong to network analysis; a very
popular one is the preprocessing of data for parallel computing. The objective functions
used for the partitioning sub-challenges are the number of edges between the blocks and
the maximum communication volume [12] of the partition.
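To fix the notions, the following Python sketch (ours; not the evaluation code used in the challenge, and one of several equivalent ways to state the communication volume) computes both objectives for a given partition:

    from collections import defaultdict

    def cut_and_max_comm_volume(edges, block_of):
        """Edge cut and maximum communication volume of a partition.

        edges:    iterable of (u, v) pairs of an undirected graph.
        block_of: dict mapping every node to its block id.
        The communication volume of a block is taken here as the sum, over its
        nodes, of the number of *other* blocks in which the node has neighbors.
        """
        cut = 0
        foreign_blocks = defaultdict(set)   # node -> blocks of its foreign neighbors
        for u, v in edges:
            if block_of[u] != block_of[v]:
                cut += 1
                foreign_blocks[u].add(block_of[v])
                foreign_blocks[v].add(block_of[u])

        volume = defaultdict(int)           # block -> communication volume
        for node, blocks in foreign_blocks.items():
            volume[block_of[node]] += len(blocks)

        return cut, max(volume.values(), default=0)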
Participants of the challenge were invited to submit solutions to the different
challenge categories on graph partitioning and graph clustering. This way different al-
gorithms and implementations were tested against the benchmark instances. Thereby
future researchers are enabled to identify techniques that are most effective for a respec-
tive partitioning or clustering problem—by using our benchmark set and by comparing
their results to the challenge results and to those in the literature published afterwards.
In this chapter we describe the benchmark suite and its assembly process. More-
over, we sketch some of the results obtained by the challenge participants using the
benchmark graphs.
Key Points
The instances for the benchmark suite were collected with two main aspects
in mind: diversity of source applications and diversity of instance sizes. Moreover, some
graphs have been frequently used in previous work, whereas others are new or fairly
recent. Some instances are based on real-world inputs, while others have been created
using a generator. The generated graphs also vary in how closely they resemble real-
world counterparts. All instances have been long-term archived with public access [7].
The solutions generated by the challenge participants using the benchmark suite
constitute a valuable picture of the state of the art in graph partitioning and graph
clustering. To better suit algorithms that do not explicitly optimize a traditional objec-
tive function (and to circumvent known flaws in these traditional objective functions),
additional criteria to assess the quality of the submitted clusterings were evaluated.
Moreover, scores were assigned to solvers in a nondiscriminatory way that takes
both running time and solution quality into account.
Historical Background
Previous DIMACS Implementation Challenges addressed a large variety of algorithmic
problems, several of them involving graphs and networks. Graph repositories similar to
our benchmark suite exist as well. However, they often lack the size, broadness of source
applications, and connection to a quality-driven competition.
An example repository widely used in combinatorial scientific computing is Chris
Walshaw’s graph partitioning archive [39]. It stores 34 graphs and the best known graph
partitions computed for these graphs. This archive has substantially simplified the im-
provement of graph partitioning algorithms over the last 15 years. Today, however, the
instances contained therein have to be deemed rather small and also somewhat limited
in terms of application areas. For example there are no social networks contained in this
archive.
The University of Florida Sparse Matrix Collection [15], maintained by Tim Davis,
is broader in terms of application areas and matrix sizes. Although social networks have
been included recently as well, most matrices stem from technical applications.
Graph collections focusing on scale-free graphs such as social networks do exist, e.
g. [5; 34]. Two more recent but also prominent examples are the Stanford Large Network
Dataset Collection SNAP [32] and the Koblenz Network Collection (KONECT) [26]. In
most cases these existing collections lack a significant comparison of how a larger number
of different algorithms perform on the data – at least regarding inexact solutions to
complex problems such as graph clustering and partitioning.
Proposed Solution and Methodology
With the 10th DIMACS Implementation Challenge and its graph collection we addressed
both of these issues. Our collection contains more than 100 graphs of various origins,
assembled in different categories. (In addition, we link to the Walshaw archive and
the University of Florida Sparse Matrix Collection.) We took care that our collection
contains instances best suited for partitioning in technical applications as well as instances
particularly intended for clustering and related network analysis tasks. Additionally the
challenge results provide guidance as to how different algorithms perform on different
classes of graphs.
The driving considerations during the assembly of the graph collection were to
include a sufficiently large variety of application sources (thereby instance structures)
and graph sizes. Along these lines, we identified three higher-level classes from
which to select: purely random graphs, generated graphs with close resemblance to data
from real-world applications, and actual real-world data. Our intent is to offer a good
diversity in order to provide a meaningful benchmark for network analysis and graph
partitioning algorithms.
With generators at hand, an experimenter can scale to (nearly) arbitrary graph
sizes, retaining the general structure of the graphs while increasing their sizes. This is for
example important when performing weak scaling studies for the experimental analysis of
parallel algorithms. It is worth mentioning that, since the instance sizes are only limited
by architectural constraints, generators provide a means to “grow” instance sizes with
future architectural improvements. This way the current state of a collection does not
age as quickly as without generators. More details on the graphs generated to resemble
real-world inputs follow below.
Generated random graphs offer an additional benefit: they are usually easier to
analyze with theoretical methods than other graph types. The Erdős-Rényi (ER) model,
for example, has received considerable attention in theoretical work. Many important
properties of these graphs were proved in this long course of research, see e. g.
Bollobás [11]. Due to the lack of resemblance of typical ER graphs to real-world inputs, an
active line of research is developing alternative models. The random graphs we included
and their generators (as well as some recent developments) are described later in this
section in more detail.
Finally, real-world graphs add the necessary confidence that an algorithm’s perfor-
mance in terms of running time and quality on the collection resembles its performance
on the represented real-world applications. The real-world graphs we included are also
described below.
In the remainder of this section, we first explain the preprocessing performed to
unify the instances in our collection. After that, the individual categories of the collection
are described.
Preprocessing
Graph partitioning and, with some exceptions, also graph clustering are usually applied
to undirected graphs. A common preprocessing step is therefore to symmetrize the graph
prior to partitioning, i. e. to make the graph undirected by including an undirected edge
between two vertices a and b if and only if there exists an edge from a to b or from b to a.
If both directions are present in the original graph, there are several possibilities to assign
a weight to the resulting undirected edge. We chose the following approach: if the input
graph is unweighted, an edge between two vertices is considered as the information that
they are related in some sense, independent of the strength of this connection. Hence, if
an edge in both directions exists, it is translated into an unweighted, undirected edge.
On the other hand, if the input graph is weighted, we add up the edge weights of both
directions, as the connection between the vertices is typically stronger if both directions
exist. Analogously, the weight of parallel edges is summed up only in case of weighted
networks, otherwise parallel edges are removed. Self-loops are always removed (with the
exception of some synthetic Kronecker graphs, for which versions with self-loops and
parallel edges and versions without exist). Only a handful of the real-world networks
included in the benchmark contain (few) parallel edges and self-loops; therefore, this
decision does not alter the structure of the networks considerably.
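As an illustration, a possible implementation of these rules in Python (our sketch; names and the edge-list representation are illustrative, and the handling of zero- and infinite-weight edges anticipates the discussion below):

    import math
    from collections import defaultdict

    def symmetrize(directed_edges, weighted):
        """Turn a directed edge list into an undirected simple graph.

        directed_edges: iterable of (u, v, weight) triples.
        weighted:       if True, weights of antiparallel and parallel edges are summed;
                        if False, every surviving undirected edge gets weight 1.
        Self-loops and edges of weight 0 or infinity are dropped.
        """
        acc = defaultdict(float)
        for u, v, w in directed_edges:
            if u == v or w == 0 or math.isinf(w):
                continue
            acc[(min(u, v), max(u, v))] += w    # both directions map to the same key

        if weighted:
            return [(u, v, w) for (u, v), w in acc.items()]
        return [(u, v, 1) for (u, v) in acc]    # collapse to unweighted edges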
Furthermore, the graph format used in the benchmark is a slight extension of
the format that some well-established partitioners such as Metis [23] use. This format
supports only integer edge weights, but some of the real-world graphs use fractional
weights in very different orders of magnitude. It would have been possible to define an
extension of this format to allow for fractional weights. However, this might have prevented
some solvers from entering the challenge. Multiplying the edge weights by a suitable power of
10 to get integer weights would have been another approach. Yet, as the edge weights
are of very different ranges, each graph would have needed its own normalization and
without rounding, the resulting edge weights could be too large to fit in standard integer
types. Although we are aware that this causes a loss of information, we felt that the
neatest solution was to make the respective graphs unweighted. One of the benchmark
graphs (cond-mat-2005) originally contains edges with weight inf. As their meaning
is not clear and none of the objective functions is well-defined in the case of infinite weights,
we discarded these edges, as well as all edges with an edge weight of 0.
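For illustration, a small hypothetical instance in the (unextended) format, based on our reading of the Metis format (the challenge website [7] contains the authoritative description): the first line gives the number of vertices, the number of undirected edges, and a format code indicating that edge weights are present; line i then lists the neighbors of vertex i (vertices are 1-indexed), each followed by the integer weight of the connecting edge.

    4 4 001
    2 3 3 5
    1 3 3 2
    1 5 2 2 4 7
    3 7

This hypothetical graph has four vertices and four weighted edges: {1,2} with weight 3, {1,3} with weight 5, {2,3} with weight 2, and {3,4} with weight 7.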
Random Graphs
The Erdős-Rényi random graph generator in the collection creates graphs according to
the well-known G(n, p) model (which is very similar to the original model proposed by
Erdős and Rényi, but was actually devised by Gilbert [19]). The included graphs have
been generated with p = 1.5 ln(n)/n, where the value of p is chosen with the intent to obtain
connected graphs with high probability. This class of graphs is well studied in theory. It is
also known that typical Erdős-Rényi graphs do not resemble real-world graphs. The class
was included nevertheless for its theoretical importance and due to the easy generation
of large graphs with high average degree.
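A graph of this type can be reproduced, for instance, with networkx (our sketch; this is not the generator used to build the collection):

    import math
    import networkx as nx

    def er_benchmark_graph(n, seed=None):
        # p = 1.5 ln(n) / n, chosen so that the graph is connected with high probability
        p = 1.5 * math.log(n) / n
        return nx.fast_gnp_random_graph(n, p, seed=seed)

    G = er_benchmark_graph(1 << 16, seed=42)  # 65536 vertices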
The graphs in the category Kronecker are generated using the Graph500 bench-
mark [6]. This benchmark’s purpose is to measure the performance of computer systems
when processing graph-structured workloads. More specifically, our instances are derived
from an R-MAT generator which is part of the benchmark. R-MAT graphs [14] are gen-
erated by sampling from a perturbed Kronecker product. They are scale-free and reflect
many properties of real social networks. All files have been generated with the R-MAT
parameters A=0.57, B=0.19, C=0.19, and D=0.05 and edge factor 48, i. e. the number of
edges equals 48n, where n is the number of vertices. The original Kronecker files contain
self-loops and multiple edges. These properties are also present in real-world data sets.
However, as some tools cannot handle these “artifacts”, we present “cleansed” versions of
the data sets (yielding simple graphs) as well.
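The core of R-MAT sampling is easy to sketch; the following Python snippet (ours; it omits the parameter noise that the Graph500 reference generator adds and uses the stated parameters as defaults) picks one quadrant of the recursively divided adjacency matrix per bit of the vertex ids:

    import random

    def rmat_edges(scale, edge_factor=48, a=0.57, b=0.19, c=0.19, d=0.05, seed=None):
        """Yield edge_factor * 2**scale directed edges of an R-MAT graph."""
        rng = random.Random(seed)
        n = 1 << scale
        for _ in range(edge_factor * n):
            u = v = 0
            for _ in range(scale):          # choose one quadrant per recursion level
                r = rng.random()
                u <<= 1
                v <<= 1
                if r < a:                   # top-left quadrant: both bits stay 0
                    pass
                elif r < a + b:             # top-right
                    v |= 1
                elif r < a + b + c:         # bottom-left
                    u |= 1
                else:                       # bottom-right (d = 1 - a - b - c is implicit)
                    u |= 1
                    v |= 1
            yield u, v                      # self-loops and parallel edges may occur

The raw output thus corresponds to the uncleansed versions of the Kronecker instances; symmetrizing and removing self-loops and duplicates yields the cleansed ones.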
Delaunay and random geometric graphs are taken from KaPPa [21] (Karlsruhe
Parallel Partitioner). Here, rggX is a random geometric graph with 2^X nodes. Each node
represents a random point in the unit square and edges connect nodes whose Euclidean
distance is below 0.55 · sqrt(ln(n)/n). This threshold is chosen in order to ensure that the graph
is almost connected. The graph DelaunayX is the Delaunay triangulation of 2^X random
points in the unit square.
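A random geometric graph of this kind can be sketched with networkx as follows (ours; again not the KaPPa generator itself):

    import math
    import networkx as nx

    def rgg(x, seed=None):
        """Random geometric graph in the unit square with 2**x nodes (cf. rggX)."""
        n = 1 << x
        radius = 0.55 * math.sqrt(math.log(n) / n)
        return nx.random_geometric_graph(n, radius, seed=seed)

    # A DelaunayX-like instance can be built analogously by triangulating 2**x random
    # points, e.g. with scipy.spatial.Delaunay, and reading the edges off the simplices.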
Other noteworthy generative models that are not, or only marginally, included in the benchmark set
are Barabási-Albert [3], BTER [38], Chung-Lu [1], Dorogovtsev-Mendes [16], and random
hyperbolic graphs [25]. We mention them here to give a more comprehensive overview.
The LFR model [30] is specifically designed for benchmarking graph clustering
algorithms. Its implementation accepts, among others, vertex degrees and community
sizes as parameters and generates a clustered graph accordingly.
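As a sketch, an LFR instance with ground-truth communities can be generated, e.g., with the networkx reimplementation (ours, not the LFR authors' original code; the parameter values below are merely illustrative):

    import networkx as nx

    # tau1/tau2: power-law exponents of the degree and community-size distributions,
    # mu: fraction of inter-community edges per node (the "mixing" parameter).
    G = nx.LFR_benchmark_graph(n=250, tau1=3, tau2=1.5, mu=0.1,
                               average_degree=5, min_community=20, seed=10)
    # The ground-truth community of each node is stored as a node attribute.
    communities = {frozenset(G.nodes[v]["community"]) for v in G}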
Most graphs from the above models can be generated with the network analy-
sis toolkit NetworKit 4.0 [40] – except for BTER and the two generators available in
KaPPa. BTER also has a public implementation [24]. For some models there exist fur-
ther improved implementations. For Barabási-Albert, for example, Meyer and Penschuck
proposed an external-memory and GPGPU variant [33], which was later improved and
implemented in distributed memory by Sanders and Schulz [37]. Also the Chung-Lu model
has a parallel generator for distributed memory [2]. The random hyperbolic graph gener-
ator by von Looz et al. [43], in turn, is only shared-memory parallel but also capable of
generating billions of edges quickly.
Generated Graphs with Real-World Structure
Each graph in the star mixture section of the benchmark represents a star-like structure
of different graphs S_0, ..., S_t. Here the graphs S_1, ..., S_t are weakly connected to the
center S_0 by random edges. The total number of random edges added between each S_i
and S_0 is less than 3% of the total number of edges in S_i. The graphs are mixtures
of the following structures: social networks, finite-element graphs, VLSI chips, peer-to-
peer networks, and matrices from optimization solvers. These graphs were submitted by
Safro et al. and are included in the benchmark because they are potentially hard graphs
for graph partitioning. A similar motivation is pursued by Gutfraind et al.’s multiscale
generator Musketeer, which takes a real-world graph and obfuscates it by random changes.
Two classic random models of social networks are preferential attachment [8] and
small world [44]. In the context of graph clustering, planted partition or G(n, p_in, p_out)
graphs are frequently used to validate algorithms [28]. These networks do not exhibit com-
mon properties of real-world social networks like a power-law degree distribution (which
can be provided by the aforementioned LFR generator). However, their use is typically
motivated by the knowledge of a ground-truth clustering that is used in the generation
process and can be used to compare algorithms independent of specific objective func-
tions. We included one graph of each category in the benchmark set as we deemed it
interesting to see to what extent algorithmic behavior on these graphs coincides with the
behavior on real-world data. Since the creation of the original benchmark set, LFR has
become another established generator for such ground-truth data, at least for smaller
networks, and we recommend its use as well.
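A planted partition instance can be produced, for example, with networkx (our sketch; the parameters are illustrative and not those of the benchmark graph G_n_pin_pout):

    import networkx as nx

    # 10 groups of 100 vertices; edges appear with probability p_in inside a group
    # and p_out between groups, so internal edges clearly dominate.
    G = nx.planted_partition_graph(l=10, k=100, p_in=0.1, p_out=0.005, seed=7)
    ground_truth = G.graph["partition"]   # list of vertex sets, one per planted group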
Although we do not store dynamic graphs, three so-called frames (static instances
within the same dynamic sequence) from three dynamic mesh sequences each are in-
cluded in our collection. These sequences resemble two-dimensional adaptive numerical
simulations. The generator is explained in some detail by Marquardt and Schamberger.
Computational task graphs model temporal dependencies between tasks to be
solved, here for applications working on data streams. The generated graphs can be
used for performance analysis of algorithms and the development of improved hardware
parameters. These graphs have been submitted by Ajwani et al.
Real-World Graphs
The benchmark includes a large number of real-world networks stemming from many
different applications. Since scientific computing is a major application area using graph
partitioning, we included graphs that have been used in numerical simulations of various
kinds.
The partitioning of road networks is an important technique when it comes to
preprocessing for shortest path algorithms [9]. The graphs that can be found in this
section are road networks from whole continents, e. g. Europe, as well as from whole
countries, e. g. Germany. These graphs were submitted by Kobitzsch and are based on
data from the OpenStreetMap project.
Parallel direct methods for solving linear systems yield another important appli-
cation of graph partitioning. We therefore included a subset of graph representations of
matrices from the University of Florida Sparse Matrix Collection [15].
In the context of graph clustering, the analysis of social networks is one of the most
important applications. The part of the benchmark suite especially addressed to clustering
algorithms reflects this by including a variety of real-world social networks. Most of these
are taken from the webpages of Newman [34] and Arenas [5] and have been previously
used to compare and evaluate clustering in the context of modularity-maximization.
A special subcategory of social networks are coauthorship networks. In a scien-
tific context, coauthorship networks link scientists who have coauthored at least one
publication. The DIMACS Benchmark includes coauthorship graphs from the field of as-
trophysics, condensed matter and high-energy theory, network science, and computer and
information science. Closely related to these are copaper and citation networks. Copaper
graphs are compiled analogously to coauthorship graphs by linking papers if they share
at least one author. In contrast, citation networks link two papers if one cites the other.
The benchmark set contains graphs of both kinds based on publications in computer and
information science.
Graph clustering has also been successfully applied to web graphs, where edges link
webpages based on hyperlinks. A subset of the web graphs we included were gathered by the
Laboratory for Web Algorithmics in Milan through domain-wise crawls performed between 2000
and 2007. In the context of the challenge, these networks are particularly interesting due
to their size; in fact, the graph combining twelve monthly snapshots of the .uk domain
comprises over 3 billion edges, which makes it the largest network in the whole benchmark
set.
Apart from these, the part of the benchmark set explicitly addressed to clustering
contains a variety of (mainly) small networks from various application areas such as biol-
ogy and political science. All of these are well-known in the modularity-based clustering
community. For details on particular networks and references, we refer to the challenge
webpage [7].
The graphs in the redistricting category represent US states. They are used for
solving the redistricting problem, i. e., determining new electoral boundaries, for example
due to population changes. Each node represents a block from the 2010 census. Two nodes
share an edge if their blocks are adjacent.
Illustrative Example
As running times would have been prohibitive for the whole set of benchmark instances,
participants of the competition were only required to submit clusterings for a subset of
instances, the final challenge testbed. This subset was announced two weeks before the
deadline. To illustrate the performance of different algorithms on graphs from different
categories, Table 1 shows the best modularity values achieved by the submitted solvers
on the final challenge testbed. Of the fifteen solvers in this category, two clearly lead
the field. CGGCi_RG [35] iteratively combines several high-quality clusterings to find a
solution with higher quality. In contrast, VNS [4] uses the metaheuristic variable
neighborhood search, a variant of local search. With few exceptions, VNS achieves the best
results on networks with up to approximately 100,000 vertices, but is outperformed by
CGGCi_RG on larger networks. An interesting observation is that ParMod [13], a technique
based on recursive bipartitions, attains the best modularity values on two graphs. Neither
the size nor the density of these graphs is exceptional, but unlike the majority of graphs
used for this competition, they exhibit a mesh-like structure.
In addition to quality, running time is also an important aspect when choosing an
algorithm for a certain application. This is why the DIMACS Challenge included a second
subchallenge for each objective function, where both quality and speed contributed to the
final scores. More specifically, the scoring is based on the Pareto Count of a submitted
algorithm on an instance, i. e. the number of competing algorithms that are both faster
and achieve a higher quality. In this category, a relatively fast agglomerative solver named
RG [35] obtained the best scores. While the differences in running time might not seem very
important in the context of small instances, they were in fact huge on larger instances.
Considering, for example, the raw running times on the web graph uk-2002: RG needs
approximately 13 minutes to compute a clustering, more than 600 times faster
than CGGCi_RG, while the difference in modularity is less than 0.001.
This running time can be further improved by using parallel algorithms. For example, one
of the submissions is able to cluster this instance in only 30 seconds by using a GPU [17],
with a modularity that is still larger than 0.97.
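The scoring rule is easy to state in code; a minimal sketch (ours, ignoring any normalization or tie-breaking the challenge may have applied):

    def pareto_count(results, algo):
        """Number of competitors that are both faster and better on one instance.

        results: dict mapping algorithm name -> (running_time, quality).
        """
        time, quality = results[algo]
        return sum(1 for other, (t, q) in results.items()
                   if other != algo and t < time and q > quality)

    # Lower counts are better; an algorithm on the time-quality Pareto front scores 0.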
Graph                    Modularity   Solver
as-22july06              0.678267     CGGCi_RG [35]
astro-ph                 0.744621     VNS [4]
audikw1                  0.917983     VNS [4]
belgium.osm              0.994940     CGGCi_RG [35]
cage15                   0.903173     CGGCi_RG [35]
caidaRouterLevel         0.872042     CGGCi_RG [35]
celegans_metabolic       0.453248     VNS [4]
citationCiteseer         0.823930     CGGCi_RG [35]
coAuthorsCiteseer        0.905297     CGGCi_RG [35]
cond-mat-2005            0.746254     CGGCi_RG [35]
coPapersDBLP             0.866794     CGGCi_RG [35]
email                    0.582829     VNS [4]
er-fact1.5-scale25       0.077934     comm-el [36]
eu-2005                  0.941554     CGGCi_RG [35]
G_n_pin_pout             0.500098     CGGCi_RG [35]
in-2004                  0.980622     CGGCi_RG [35]
kron_g500-s-logn16       0.065056     VNS [4]
kron_g500-s-logn20       0.050350     CGGCi_RG [35]
ldoor                    0.969370     ParMod [13]
luxembourg.osm           0.989621     VNS [4]
memplus                  0.700473     CGGCi_RG [35]
PGPgiantcompo            0.886564     CGGCi_RG [35]
polblogs                 0.427105     VNS [4]
power                    0.940851     VNS [4]
preferentialAttachment   0.315994     VNS [4]
rgg_n_2_17_s0            0.978324     VNS [4]
smallworld               0.793042     VNS [4]
uk-2002                  0.990301     CGGCi_RG [35]
uk-2007-05               0.480210     comm-el-xmt2 [36]
333SP                    0.989095     ParMod [13]
Table 1. Best modularity scores achieved by challenge participants on the challenge testbed.

Consequently, the question of which algorithm is the “best” cannot always be
answered globally. Instead, the answer often depends on application-specific parameters like
the size and structure of certain instances, as well as the available hardware and a custom
trade-off between quality and running time. Comparing the results of different algorithms
on various benchmark instances can assist the choice of an appropriate algorithm.
The results of the DIMACS Challenge have initiated new algorithmic work in
the area of parallel computing. We choose to highlight two parallel modularity-driven
community detection algorithms, PLM [41] and Nerstrand [31]. They currently offer the
best tradeoff between running time and solution quality on commodity hardware, with
Nerstrand being slightly faster. The largest graph in the benchmark set, the web graph
uk-2007-05 with 3.3 billion edges, can be clustered by these two in less than three minutes
on a shared-memory server with a modularity value above 0.99.
Future Directions
With the graph archive of the 10th DIMACS Implementation Challenge on Graph Par-
titioning and Graph Clustering we have introduced a comprehensive collection of graphs
that can be used for the assessment of graph partitioning and network analysis algorithms.
With the archive we hope to simplify the development of improved solution techniques
in these areas by allowing algorithm engineers to compare the performance of their im-
plementations to the state of the art.
A deliberate limitation is that we do not consider dynamic graphs, directed graphs, or
hypergraphs. As instances of these types were not part of the challenge, they were
not included in the collection either. We nevertheless consider all of these omitted graph
types useful, and they represent interesting applications.
Of particular interest for network analysis are directed and dynamic graphs. There
is no lack of data (although not all of it is publicly accessible). As an example, the
dynamic interaction of social network users over time constitutes a dynamic graph that
is of particular interest to social media enterprises and online marketers. When compiling
dynamic instances into a collection, it should be considered that dynamic graphs are more
difficult to assemble or generate in a consistent way—issues such as a suitable interval
length and space-saving storage formats arise.
We encourage interested colleagues to start a new collection using a similar
methodology, this time focusing on the types of graphs we omitted. Such an effort would
certainly be beneficial to the network analysis community.
Acknowledgements. The authors would like to thank all contributors to the 10th DIMACS Implemen-
tation Challenge graph collection. Tim Davis provided valuable guidelines for preprocessing the data.
Financial support by the sponsors DIMACS, the Command, Control, and Interoperability Center for
Advanced Data Analysis (CCICADA), Pacific Northwest National Laboratory, Sandia National Labora-
tories, Intel Corporation and Deutsche Forschungsgemeinschaft (DFG) is gratefully acknowledged.
Cross References
Clustering Algorithms, 00138
Communities Discovery and Analysis in Online and Offline Social Networks: Link-
Based Overlapping and Non-overlapping Communities, 0006
Community Detection, Current and Future Research Trends, 00027
Community Discovery and Analysis in Large-Scale Online/Offline Social Networks,
00215
Extracting and Inferring communities via link analysis, 00218
Extracting Social Networks from Data, 00011
Large Networks, Analysis of, 00031
Simulated Datasets, 00164
Social Network Datasets, 00122
Sources of network data, 00313
Synthetic Datasets, 00169
References
1. William Aiello, Fan Chung, and Linyuan Lu. A random graph model for power law graphs. Experi-
mental Mathematics, 10(1):53–66, 2001.
2. Maksudul Alam and Maleq Khan. Parallel algorithms for generating random networks with given
degree sequences. International Journal of Parallel Programming, pages 1–19. To appear.
3. R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of modern physics,
74(1):47, 2002.
4. Daniel Aloise, Gilles Caporossi, Sylvain Perron, Pierre Hansen, Leo Liberti, and Manuel Ruiz. Modu-
larity maximization in networks by variable neighborhood search. In 10th DIMACS Impl. Challenge
Workshop. Georgia Inst. of Technology, Atlanta, GA, 2012.
5. Alex Arenas. Network data sets. http://deim.urv.cat/~aarenas/data/welcome.htm. [Online;
accessed 28-September-2012].
6. D. A. Bader, Jonathan Berry, Simon Kahan, Richard Murphy, E. Jason Riedy, and Jeremiah Will-
cock. Graph 500 benchmark 1 (“search”), version 1.1. Technical report, Graph 500, 2010.
7. David Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner. 10th DIMACS implemen-
tation challenge. http://www.cc.gatech.edu/dimacs10/, 2012. [Online; accessed 17-April-2016].
8. Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286:509–
512, 1999.
9. Reinhard Bauer, Daniel Delling, Peter Sanders, Dennis Schieferdecker, Dominik Schultes, and
Dorothea Wagner. Combining hierarchical and goal-directed speed-up techniques for Dijkstra’s al-
gorithm. ACM Journal of Experimental Algorithmics, 15, 2010.
10. J.W. Berry, B. Hendrickson, R.A. LaViolette, and C.A. Phillips. Tolerating the community detection
resolution limit with edge weighting. Phys. Rev. E, 83:056119, May 2011.
11. B. Bollobás. Random Graphs. London: Academic Press, 1985.
12. Ümit V. Çatalyürek and Cevdet Aykanat. Decomposing irregularly sparse matrices for parallel
matrix-vector multiplication. In Alfonso Ferreira, José Rolim, Yousef Saad, and Tao Yang, editors,
Parallel Algorithms for Irregularly Structured Problems, volume 1117 of Lecture Notes in Computer
Science, pages 75–86. Springer Berlin / Heidelberg, 1996. 10.1007/BFb0030098.
13. Ümit V. Çatalyürek, Kamer Kaya, Johannes Langguth, and Bora Uçar. A divisive clustering tech-
nique for maximizing the modularity. In 10th DIMACS Impl. Challenge Workshop. Georgia Inst. of
Technology, Atlanta, GA, 2012.
14. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proc.
4th SIAM Intl. Conf. on Data Mining (SDM), Orlando, FL, April 2004. SIAM.
15. T. Davis. The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/
research/sparse/matrices, 2016. [Online; accessed 17-April-2016].
16. Sergei N. Dorogovtsev and José F. F. Mendes. Evolution of networks: From biological nets to the
Internet and WWW. Oxford University Press, 2003.
17. B. O. Fagginger Auer and R. H. Bisseling. Graph coarsening and clustering on the GPU. In 10th
DIMACS Impl. Challenge Workshop. Georgia Inst. of Technology, Atlanta, GA, 2012.
18. S. Fortunato and M. Barthélemy. Resolution limit in community detection. Proceedings of the
National Academy of Science, 104:36–41, January 2007.
19. Edgar N. Gilbert. Random Graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
20. Benjamin H. Good, Yves-Alexandre de Montjoye, and Aaron Clauset. Performance of modularity
maximization in practical contexts. Phys. Rev. E, 81:046106, Apr 2010.
21. M. Holtgrewe, P. Sanders, and C. Schulz. Engineering a Scalable High Quality Graph Partitioner.
24th IEEE International Parallal and Distributed Processing Symposium, 2010.
22. Ravi Kannan, Santosh Vempala, and Adrian Vetta. On Clusterings: Good, Bad, Spectral. Journal
of the ACM, 51(3):497–515, May 2004.
23. G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular
Graphs. SIAM Journal on Scientific Computing, 20:359–392, 1999.
24. Tamara G. Kolda, Ali Pinar, Todd Plantenga, and C. Seshadhri. A scalable generative graph model
with community structure. arXiv preprint arXiv:1302.6636, 2013.
25. Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá.
Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, Sep 2010.
26. Jérôme Kunegis. KONECT: the Koblenz Network Collection. In Leslie Carr, Alberto H. F. Laen-
der, Bernadette Farias Lóscio, Irwin King, Marcus Fontoura, Denny Vrandecic, Lora Aroyo, José
Palazzo M. de Oliveira, Fernanda Lima, and Erik Wilde, editors, 22nd International World Wide
Web Conference, WWW '13, Rio de Janeiro, Brazil, May 13-17, 2013, Companion Volume, pages
1343–1350. International World Wide Web Conferences Steering Committee / ACM, 2013.
27. Renaud Lambiotte. Multi-scale modularity in complex networks. In 8th International Symposium
on Modeling and Optimization in Mobile, Ad-Hoc and Wireless Networks (WiOpt 2010), May 31 -
June 4, 2010, University of Avignon, Avignon, France, pages 546–553. IEEE, 2010.
28. Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: A comparative anal-
ysis. Physical Review E, 80(5), November 2009.
29. Andrea Lancichinetti and Santo Fortunato. Limits of modularity maximization in community de-
tection. Phys. Rev. E, 84:066122, Dec 2011.
30. Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing com-
munity detection algorithms. Physical review E, 78(4):046110, 2008.
31. Dominique LaSalle and George Karypis. Multi-threaded modularity based graph clustering using
the multilevel paradigm. J. Parallel Distrib. Comput., 76:66–80, 2015.
32. J. Leskovec. Stanford Network Analysis Package (SNAP). http://snap.stanford.edu/index.
html. [Online; accessed 17-April-2016].
33. Ulrich Meyer and Manuel Penschuck. Generating massive scale-free networks under resource con-
straints. In Michael T. Goodrich and Michael Mitzenmacher, editors, Proceedings of the Eighteenth
Workshop on Algorithm Engineering and Experiments, ALENEX 2016, Arlington, Virginia, USA,
January 10, 2016, pages 39–52. SIAM, 2016.
34. Marc Newman. Network data. http://www-personal.umich.edu/~mejn/netdata/. [Online; ac-
cessed 28-Septembre-2012].
35. Michael Ovelgönne and Andreas Geyer-Schulz. An ensemble learning strategy for graph clustering.
In 10th DIMACS Impl. Challenge Workshop. Georgia Inst. of Technology, Atlanta, GA, 2012.
36. E. Jason Riedy, Henning Meyerhenke, David Ediger, and David A. Bader. Parallel community detec-
tion for massive graphs. In 10th DIMACS Impl. Challenge Workshop. Georgia Inst. of Technology,
Atlanta, GA, 2012.
37. Peter Sanders and Christian Schulz. Scalable generation of scale-free graphs. Information Processing
Letters, 116(7):489 – 491, 2016.
38. C. Seshadhri, Tamara G. Kolda, and Ali Pinar. Community structure and scale-free collections of
Erdős-Rényi graphs. Physical Review E, 85(5), May 2012.
39. A.J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation
approach to graph-partitioning. Journal of Global Optimization, 29(2):225–241, 2004.
40. Christian Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. NetworKit: A tool suite for large-scale
complex network analysis. CoRR, abs/1403.3005, November 2015.
41. Christian L. Staudt and Henning Meyerhenke. Engineering parallel algorithms for community de-
tection in massive networks. IEEE Trans. Parallel Distrib. Syst., 27(1):171–184, 2016.
42. Stijn M. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.
43. Moritz von Looz, Henning Meyerhenke, and Roman Prutkin. Generating random hyperbolic graphs
in subquadratic time. In Khaled M. Elbassioni and Kazuhisa Makino, editors, Algorithms and
Computation - 26th International Symposium, ISAAC 2015, Nagoya, Japan, December 9-11, 2015,
Proceedings, volume 9472 of Lecture Notes in Computer Science, pages 467–478. Springer, 2015.
44. Duncan J. Watts and Steven H. Strogatz. Collective Dynamics of “Small-World” Networks. Nature,
393:440–442, 1998.
45. Reinhold Weicker. Benchmarking. In Maria Calzarossa and Salvatore Tucci, editors, Performance
Evaluation of Complex Systems: Techniques and Tools, volume 2459 of Lecture Notes in Computer
Science, pages 231–242. Springer Berlin / Heidelberg, 2002.