Efficiently Monitoring Bandwidth and Latency
in IP Networks
Yuri Breitbart, Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi, Avi Silberschatz
Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974
{yuri, cychan, minos, rastogi, avi}@bell-labs.com
Abstract—Effective monitoring of network utilization and performance indicators is a key enabling technology for proactive and reactive resource management, flexible accounting, and intelligent planning in next-generation IP networks. In this paper, we address the challenging problem of efficiently monitoring bandwidth utilization and path latencies in an IP data network. Unlike earlier approaches, our measurement architecture assumes a single point-of-control in the network (corresponding to the Network Operations Center) that is responsible for gathering bandwidth and latency information using widely-deployed management tools, like SNMP, RMON/NetFlow, and explicitly-routed IP probe packets. Our goal is to identify effective techniques for monitoring (a) bandwidth usage for a given set of links or packet flows, and (b) path latencies for a given set of paths, while minimizing the overhead imposed by the management tools on the underlying production network. We demonstrate that minimizing overheads under our measurement model gives rise to new combinatorial optimization problems, most of which prove to be NP-hard. We also propose novel approximation algorithms for these optimization problems and prove guaranteed upper bounds on their worst-case performance. Our simulation results validate our approach, demonstrating the effectiveness of our novel monitoring algorithms over a wide range of network topologies.
I. INTRODUCTION
THE explosive growth in Internet and intranet deployment for a constantly growing variety of applications has created a massive increase in demand for bandwidth, performance, predictable Quality of Service (QoS), and differentiated network services. Simultaneously, the need has emerged for measurement technology that will support this growth by providing IP network managers with effective tools for monitoring network utilization and performance. Bandwidth and latency are clearly the two key performance parameters and utilization indicators for any modern IP network. Knowledge of the up-to-date bandwidth utilizations and path latencies is critical for numerous important network management tasks, including application and user profiling, proactive and reactive resource management and traffic engineering, as well as providing and verifying QoS guarantees for end-user applications.
Indeed, these observations have led to a recent flurry of both
research and industrial activity in the area of developing novel
tools and infrastructures for measuring network bandwidth and
latency parameters. Examples include SNMP and RMON mea-
surement probes [1], Cisco’s NetFlow tools [2], the IDMaps [3],
[4] and Network Distance Maps [5] efforts for measuring end-
to-end network latencies, the pathchar tool for estimating In-
ternet link characteristics [6], [7], and packet-pair algorithms
for measuring link bandwidth [8], [9]. A crucial requirement
for such monitoring tools is that they be deployed in an intelli-
gent manner in order to avoid placing undue strain on the shared
resources of the production network.
As an example, Cisco’s NetFlow measurement tool al-
lows NetFlow-enabled routers to collect detailed traffic data
on packet flows between source-destination node pairs [2].
NetFlow-enabled routers can generate large volumes of export
data due to the size and distributed nature of large data net-
works, the granularity of the recorded flow data, and the rapid
data traffic growth. The key mechanism for enhancing NetFlow
data volume manageability is the careful planning of NetFlow
deployment. Cisco suggests that NetFlow be deployed incre-
mentally (i.e., interface by interface) and strategically (i.e., on
carefully-chosen routers), instead of being widely deployed on
every router in the network [2]. Cisco domain experts can work
with customers to determine such “key” routers and interfaces
for NetFlow deployment based on the customers’ traffic flow
patterns and network topology and architecture [2]. Similar ob-
servations hold for the deployment of SNMP agents [1], since
processing SNMP queries can adversely impact router perfor-
mance and SNMP data transfers can result in significant vol-
umes of additional network traffic. In particular, as modern
Network Management Systems (NMS) shift their focus toward
service- and application-level management, the network moni-
toring process requires more data to be collected and at much
higher frequencies. In such scenarios, the SNMP-polling fre-
quency needs to be high enough not to miss relevant changes or
degradations in application behavior or service availability [10].
(In fact, even for failure monitoring, Stallings [1] suggests that
short polling intervals are often required in order for the NMS
to be responsive to problems in the network.) When such high
SNMP-polling frequencies are prescribed, the overhead that a
polled SNMP agent imposes on the underlying router can be
significant and can adversely impact the router’s throughput.
Further, the problem is only exacerbated for mid- to low-end routers (e.g., routers that implement large parts of their routing functionality in software). As an example, our experiments with a Cisco 4000-series router on our local network showed the throughput of the router to drop substantially during a polling cycle (where repeated getnext queries are issued to gather link-utilization data). Obviously, polling such a router at reasonably high frequencies can severely impact its performance. Also,
note that the network bandwidth consumed by such frequent
SNMP polling for detailed router/application/service monitor-
ing can be significant, primarily due to the large number of
polling messages that need to traverse the network from/to the
NMS to/from the polled routers. In fact, this is the main motiva-
tion behind work on distributed polling engines (e.g., [11]) and
more recent proposals on “batching” SNMP-polling messages
[10] and more effective SNMP-polling primitives [12].
As another motivating example, the IDMaps [3], [4] and Network Distance Maps [5] efforts aim to produce "latency maps" of the Internet by introducing measurement servers/tracers that continuously probe each other to determine their distance. In order to make their approach scale in terms of both the storage requirement and the extra probing load imposed on the network, both approaches suggest techniques for pruning the distance map based on heuristic observations [3], graph-theoretic ideas like t-spanners [4], or hierarchical clustering of the measurement servers [5]. Minimizing monitoring overheads is also critical in order to avoid "Heisenberg effects", in which the additional traffic imposed by the network monitors perturbs the network's performance metrics and biases the resulting analysis in unpredictable ways [13].
In this paper, we address the challenging problem of effi-
ciently monitoring bandwidth utilization and path latencies in
an IP data network. Earlier proposals for measuring network
utilization characteristics typically assume that the measurement
instrumentation can be either (a) intelligently distributed at dif-
ferent points in the underlying network [3], [4], [5] or (b) placed
at the endpoints (source and/or destination node) of the end-to-
end path whose characteristics are of interest [6], [7], [8], [9].
In contrast, the monitoring algorithms proposed in this paper
assume a much more common and, we believe, realistic mea-
surement model in which a single, predefined point in the net-
work (corresponding to the Network Operations Center (NOC))
is responsible for actively gathering bandwidth and latency in-
formation from network elements. Thus, rather than requiring
the distribution of specialized instrumentation software and/or
hardware (which can be cumbersome and expensive to deploy
and manage) inside the production network, our algorithms en-
able a network administrator to efficiently monitor utilization
statistics from a single point-of-control. More specifically, we propose effective, low-overhead strategies for collecting the following utilization statistics as a function of time:
1. Bandwidth usage for a given (sub)set of (a) links, and (b)
aggregate packet flows between ingress-egress routers in the
network. Link-bandwidth utilization information is obviously
critical for a number of network management tasks, such as
identifying and relieving congestion points in the network.
Flow-bandwidth usage, on the other hand, provides bandwidth-
utilization data at a finer granularity which can be invaluable,
e.g., for usage-based customer billing and Service Level Agree-
ment (SLA) verification.
2. Path latencies for a given (sub)set of (possibly overlapping)
source-destination paths in the network. Once again, knowl-
edge of the delays that packets experience along certain routes
is important, e.g., in determining effective communication paths
for applications with low-latency QoS requirements or dynami-
cally routing the clients of a replicated service to their “closest”
replica [3].
Our statistics collection methodology is based on exploiting
existing, widely-deployed software tools for managing IP net-
works, like SNMP and RMON/NetFlow agents [1], [2] and
explicitly-routed IP probe packets [14]. The target applica-
tion domain for our monitoring strategies is large ISP networks,
comprising hundreds of routers and several thousand network
links. Such large ISP installations are typically characterized by
high resource-utilization levels, which means that scalable mon-
itoring strategies that minimize the impact of collecting utiliza-
tion information on the underlying network are of the essence.
This is especially true since this information needs to be col-
lected periodically (e.g., every fifteen minutes) in order to con-
tinuously monitor the state and evolution of the network. The
main contributions of our work can be summarized as follows.
Novel Algorithms for Efficiently Monitoring Link and
Flow Bandwidth Utilization. We demonstrate that the problem of collecting link-bandwidth utilization information from an underlying network while minimizing the required number of SNMP probes gives rise to a novel, NP-hard generalization of the traditional Vertex Cover (VC) problem [15], termed Weak VC. Abstractly, Weak VC is a VC problem enriched with a linear system of equations for edge variables representing additional "knowledge" that can be used to reduce the size of the cover. We propose a new, polynomial-time heuristic algorithm for Weak VC that is provably near-optimal (with a logarithmic worst-case performance bound). Furthermore, we show that our heuristic is in fact very general and can be adapted to guarantee logarithmic approximation bounds for other NP-hard problems that arise in efficient bandwidth monitoring, including the problem of minimizing the RMON/NetFlow overhead for collecting flow-bandwidth usage information from the network.
Novel Algorithms for Efficiently Monitoring Path Laten-
cies. We develop flexible techniques that are based on transmit-
ting explicitly-routed IP probe packets from the NOC to accu-
rately measure the latency of an arbitrary set of network paths.
By allowing IP probes to be shared among the various paths, our
probing techniques enable efficient measurement of their laten-
cies. We prove that the problem of computing the (optimal) set
of probes for measuring the latency of a set of paths that im-
poses minimum load on network links is

-hard. Fortunately,
we are able to demonstrate that our optimal probe computation
problem can be mapped to the well-known Facility Location
Problem (FLP), which allows us to use the polynomial-time ap-
proximation algorithm of Hochbaum [16] to obtain a provably
near-optimal set of IP probes.
Simulation Results Validating our Monitoring Strategies.
In order to gauge the effectiveness of our monitoring algorithms,
we have conducted a series of simulation experiments on a broad
range of network graphs generated using the Waxman topol-
ogy model [17]. For link-bandwidth measurements, we find
that, compared to a naive approach based on simple VC, our
Weak VC-based heuristic results in reductions as high as 57%
in the number of SNMP-agent activations. Our experiences with
latency measurements are similar, showing that, compared to
naive probing strategies, our FLP-based heuristic returns sets of
probes that, in several cases, traverse 20% fewer network links.
The remainder of this paper is organized as follows. Sec-
tion II introduces our system model and the notational conven-
tions used in the paper. The two optimization problems that we
address in this paper are presented in Section III and Section IV,
respectively, for the link/flow bandwidth measurement prob-
lem and the path latency measurement problem. In Section V,
we present simulation results to validate our proposed approach.
Finally, we present our conclusions in Section VI. Due to space
constraints, proofs of theoretical results can be found in the full
version of this paper [18].
II. SYSTEM MODEL AND NOTATION
Our abstract model of a data network is an undirected graph G = (V, E), where V = {v_1, ..., v_n} denotes the set of network nodes (i.e., routers) and E = {e_1, ..., e_m} represents the set of edges (i.e., physical links) connecting the routers. We let n = |V| and m = |E| denote the number of G's nodes and edges, respectively. We also use deg(v) to denote the degree (i.e., total number of incident edges) of node v ∈ V. The location of the Network Operations Center (NOC) is denoted by the "special" node v_0 where, without loss of generality, we assume that v_0 ∈ V. Further, for a node u ∈ V, we denote the shortest path (in terms of the number of links) from v_0 to u by SP_u. Also, for paths P_1 and P_2, |P_1| is the number of links in P_1 and P_1 · P_2 is the path resulting from the concatenation of P_1 and P_2. Finally, given an edge e = (u, v) in E, b(e) stands for the bandwidth utilization at the corresponding link of the network. Table I summarizes the notation used throughout the paper with a brief description of its semantics. We provide detailed definitions of some of these parameters in the text. Additional notation will be introduced when necessary.
Symbol               Semantics
G = (V, E)           Network graph
n, m                 Number of nodes/edges in G
u, v, ...            Generic nodes/routers in the network graph
e, e_1, e_2, ...     Generic edges/links in the network graph
P, P_1, P_2, ...     Generic paths in the network graph
SP_v                 Shortest path from v_0 to v
|P|                  Number of links in path P
P_1 · P_2            Concatenation of paths P_1 and P_2
F = {f_1, f_2, ...}  Set of traffic flows in the network
F_v                  Set of flows in F routed through router v
F_e                  Set of flows in F routed through link e
b(e)                 Bandwidth utilization for network link e
b(f)                 Bandwidth utilization for traffic flow f
TABLE I
NOTATION.
For our bandwidth-monitoring schemes that make use of flow information, we assume that all data traffic in the monitored network is distributed among a set of packet flows F; that is, every data packet routed in G belongs to some flow f ∈ F. Each such flow f is essentially a directed path from a source/ingress router to a destination/egress router in G. Note that, for a given pair of ingress-egress nodes, there may be multiple packet flows between them. Intuitively, each flow represents the aggregate traffic involving a set of source-destination IP address pairs. Edge ingress/egress routers typically serve a wide range of IP addresses, and traffic between different source/destination addresses may be split at the network's edge routers along multiple flows, e.g., for traffic engineering or accounting purposes. We let F_v (F_e) denote the set of packet flows routed through router v (resp., link e) in G. We also use b(f) to denote the bandwidth usage of flow f ∈ F in G.
III. MONITORING LINK AND FLOW BANDWIDTH
An IP router is typically managed by activating an SNMP
agent on that router. Over time, the SNMP agent collects vari-
ous operational statistics for the router which are stored in the
router’s SNMP Management Information Base (MIB) and can
be dispatched (on demand or periodically) to the NOC site [1].
SNMP MIBs store detailed information on the total number of
bytes received or transmitted on each interface of an SNMP-
enabled network element. Thus, periodically querying router
SNMP MIBs provides us with a straightforward method of
observing link-bandwidth usage in the underlying IP network.
More specifically, assume that, using SNMP, we periodically download the total number of bytes received (bytes_rcvd) and bytes transmitted (bytes_xmit) on a given router interface every t units of time. The average bandwidth usage for the incoming (outgoing) link attached to that interface over the measurement interval is then bytes_rcvd / t (resp., bytes_xmit / t). A naive, "brute-force" solution to our link-bandwidth monitoring problem would therefore consist of (1) activating an SNMP agent on every network router in G, and (2) periodically downloading the number of bytes observed on each interface to the NOC by issuing appropriate SNMP queries to all routers.
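As a concrete illustration of the byte-counter arithmetic above, the following sketch (ours, not part of the paper's architecture) computes the average utilization of one link direction from two successive samples of a standard MIB-II octet counter such as ifInOctets or ifOutOctets. The SNMP retrieval itself is assumed to happen elsewhere, and the 32-bit counter width is an assumption (high-speed interfaces would use the 64-bit ifHCInOctets/ifHCOutOctets counters instead).

def avg_link_bandwidth_bps(prev_octets, curr_octets, interval_secs, counter_bits=32):
    """Average bandwidth (bits/sec) on one interface direction over a polling
    interval, from two samples of an SNMP octet counter; a single counter
    wrap-around between the samples is absorbed by the modular subtraction."""
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8.0 / interval_secs

# Example: two ifOutOctets samples taken 900 seconds (15 minutes) apart.
outgoing_bps = avg_link_bandwidth_bps(prev_octets=1_200_000,
                                      curr_octets=91_200_000,
                                      interval_secs=900)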
There are two serious problems with such a “brute-force” ap-
proach. First, running an SNMP agent and answering periodic
SNMP queries from the NOC typically has a significant associ-
ated overhead that can adversely impact the performance char-
acteristics of a router. Using such a naive bandwidth-monitoring
strategy means that the routing performance of every router in
the network is affected. Second, periodically downloading
SNMP link-traffic data from every router can result in a sub-
stantial increase in the observed volume of network traffic. We
are therefore interested in finding link-bandwidth monitoring
schemes that minimize the SNMP overhead on the underlying
IP network. More formally, our problem can be stated as fol-
lows.
Problem Statement [Low-Overhead Link-Bandwidth Monitoring]: Given a network G = (V, E), determine a minimum subset of nodes W ⊆ V such that enabling and monitoring SNMP agents on these nodes is sufficient to infer the link-bandwidth usage for every link of G. (Without loss of generality, we assume that all links of G are to be monitored; if only the edges in a subset E' ⊆ E are of interest, then G is understood to be the network subgraph spanned by E'.)
For flow-bandwidth monitoring, RMON [1] or NetFlow
agents [2] can be enabled on routers to measure the num-
ber of data packets shipped through any of the router’s inter-
faces between specific pairs of source-destination IP addresses.
Like SNMP, however, deploying and periodically querying
RMON/NetFlow agents comes at a cost which can substantially
impact the performance of the router and the observed volume
of network traffic. In fact, both these problems are exacerbated
for RMON and NetFlow compared to simple SNMP, since the
measurements are collected and stored at a much finer granular-
ity resulting in much larger volumes of management data. Thus,
monitoring bandwidth usage at the level of packet flows gives
rise to similar overhead-minimization problems.
In this section, we propose novel formulations and algorith-
mic solutions to the problem of low-overhead bandwidth moni-
toring for network links and packet flows.
A. A Vertex Cover Formulation
A simple examination of the naive method of activating SNMP agents on every network router reveals that it is really overkill. Abstractly, to monitor all links in G, what is needed is to select a subset of SNMP-enabled routers such that every link in G is "covered"; that is, there is an SNMP agent running on at least one of the link's two endpoints. This is an instance of the well-known Vertex Cover (VC) problem over the network graph G [15]. Figure 1(a) depicts an example network graph and the nodes corresponding to a minimum VC of size 4. Even though VC is known to be NP-hard, it is possible to approximate the optimal VC within a factor of 2 using an O(m) algorithm based on determining a maximal matching of G [19].
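For completeness, a minimal sketch of this classical matching-based 2-approximation is given below (the function name and the edge list in the example are ours and purely illustrative).

def vertex_cover_2approx(edges):
    """2-approximate Vertex Cover via a maximal matching: scan the edges and,
    whenever an edge has neither endpoint covered yet (i.e., it can join the
    matching), add both of its endpoints to the cover."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Example (illustrative edge list):
print(vertex_cover_2approx([("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v1")]))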
B. Exploiting Knowledge of Router Mechanics and Traffic
Flows: The Weak Vertex Cover Formulation
Using a VC of the network graph G to determine the set of nodes on which to run SNMP agents can obviously result in a
substantial reduction in the number of activated SNMP agents needed to monitor link-bandwidth usage in G. Nevertheless, it is possible to do even better by exploiting knowledge of the traffic flows in the network and the mechanics of packet forwarding. To simplify the exposition, we start by describing our novel problem formulation and algorithmic solutions assuming a directed network graph model G, with the direction of each link capturing the flow of data packets into or out of each router node. We then demonstrate how our results can be extended to the more realistic scenario of an undirected graph model.
Fig. 1. (a) Network graph G and a minimum VC (four nodes). (b) A directed version of G and a minimum Weak VC (two nodes).
B.1 The Weak Vertex Cover Problem for Directed Graphs
Consider a router v in the directed network graph G and let I_v (O_v) denote the set of incoming (resp., outgoing) edges incident to v in G. The key observation here is that each such router satisfies a flow-conservation law which, simply put, states that the sum of the traffic flowing into v is approximately the same as the sum of the traffic flowing out of v. More formally, the flow-conservation law for a non-leaf node v (for simplicity, we assume that all links crossing the ISP network boundary are terminated by distinct leaf nodes in G) can be stated as the following equation:
$$\sum_{e \in I_v} b(e) \;=\; \sum_{e \in O_v} b(e) \qquad (1)$$
Note that, in practice, the above flow-conservation equation holds only approximately, since there can be (a) traffic directed to/from the router (e.g., OSPF protocol exchanges, management traffic, and ARP queries), (b) multicast traffic that is replicated along many output interfaces, and (c) delays and dropped packets in the router (under certain extreme congestion conditions). We believe, however, that these are infrequent conditions for routers in an ISP network that comprise only a very small proportion of the overall observed data traffic. Therefore, given a sufficiently large monitoring period, we expect the flow-conservation equation at each router to be approximately correct. Several measurements over backbone routers in Lucent's network have corroborated our expectations, showing that flow conservation holds with a relative error that is consistently below 0.05% [18].
The importance of the flow-conservation law for network monitoring lies in the observation that we no longer need to ensure that all edges of a router are "covered" by an SNMP agent: if a router has k links incident on it and the bandwidth utilization of k − 1 of the links is known, then the bandwidth utilization of the remaining link can be derived from the flow-conservation equation for that router. This observation leads to a novel vertex-covering formulation, termed Weak Vertex Cover.
Definition III.1: [Weak Vertex Cover] Given a directed network graph G, we define a set W of nodes to be a Weak Vertex Cover of G if, after initially marking each node in W as covered, it is possible to mark every node in G by iteratively performing the following two steps: (1) Mark every edge in G that is incident on a covered node as covered; and, (2) For any non-leaf node v in G, if deg(v) − 1 of the edges incident to v are marked covered, then mark vertex v as covered.
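The marking procedure of Definition III.1 transcribes almost literally into code; the sketch below (our own illustration, with edges represented as node pairs) checks whether a candidate set W is a Weak VC by iterating the two marking steps to a fixpoint.

def is_weak_vertex_cover(nodes, edges, W):
    """Literal transcription of Definition III.1: starting from the covered
    nodes W, repeat (1) mark every edge incident on a covered node and
    (2) mark any non-leaf node with at most one unmarked incident edge,
    until nothing changes; succeed if every node ends up marked."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    covered_nodes, covered_edges = set(W), set()
    changed = True
    while changed:
        changed = False
        for e in edges:                                   # step (1)
            if e not in covered_edges and (e[0] in covered_nodes or e[1] in covered_nodes):
                covered_edges.add(e)
                changed = True
        for v in nodes:                                   # step (2)
            if v not in covered_nodes and deg[v] > 1:
                marked = sum(1 for e in edges if v in e and e in covered_edges)
                if marked >= deg[v] - 1:
                    covered_nodes.add(v)
                    changed = True
    return covered_nodes == set(nodes)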
Based on the law of flow conservation, it is obvious that activating SNMP agents on a Weak VC for G is sufficient to derive the bandwidth usage on every link in the network graph. Thus, given flow conservation at each router of G, our efficient link-bandwidth monitoring problem becomes equivalent to determining a minimum Weak VC for G. For example, Figure 1(b) depicts a directed network G and the two nodes corresponding to a minimum Weak VC of G.
Note that every VC of a graph G is trivially also a Weak VC of G, but not necessarily a minimal one. (Figure 1 again provides a good example.) In fact, there are graphs for which the size of an optimal VC is arbitrarily larger than that of a Weak VC for the same graph (consider, for example, a long directed chain). Thus, exploiting the flow-conservation law can substantially improve the SNMP-monitoring overhead over a simple VC approach.
To the best of our knowledge, our Weak VC formulation represents a novel optimization problem that has not been studied in earlier research on combinatorial or graph algorithms. Unfortunately, as the following theorem shows, it is highly unlikely that we can find a minimum Weak VC in an efficient manner.
Theorem III.1: Given a directed graph G, discovering a minimum Weak VC is NP-hard.
A Near-Optimal Heuristic for Weak VC. An alternative way of viewing our Weak VC formulation is as follows. The law of flow conservation for every (non-leaf) router in G provides us with additional knowledge for the link-bandwidth unknowns (the b(e)'s) in the form of a linear system of equations that we can exploit to determine the values of all b(e)'s. The problem then is to determine a minimum subset of nodes W ⊆ V such that, when the b(e)'s incident to all nodes in W are determined, the linear system of flow-conservation equations can be solved for all the remaining b(e)'s. We now present a provably near-optimal greedy heuristic for Weak VC that is motivated by the above-stated formulation.
Let A · x = c denote the n × m linear system of flow-conservation equations corresponding to the (non-leaf) nodes in G (Equation (1)). (Without loss of generality, we assume that n is the number of non-leaf nodes in G.) Note that, initially, c = 0 (i.e., the zero n-vector), but this is not necessarily the case in the later stages of our algorithm, where some unknowns may have been specified by the routers selected in the cover. Also, let rank(A) denote the rank of matrix A, i.e., the number of linearly independent flow-conservation equations in the system.
Note that, if rank(A) = m, then the linear system of flow-conservation equations can be directly solved to determine the values of all unknown link bandwidths b(e), which obviously means that no nodes need to be selected for monitoring. Otherwise, the minimum required number of link-bandwidth variables that need to be specified in order to make the flow-conservation system solvable is exactly m − rank(A). Selecting a node v to run an SNMP agent means that all link-bandwidth variables attached to v become known and the flow-conservation equation for v becomes redundant. Thus, the original n × m system of flow-conservation equations is reduced to an (n − 1) × (m − deg(v)) system, where deg(v) is the degree of node v in G.
Consider step i of the node-selection process (i.e., after enabling SNMP at i selected nodes of G) and let G_i denote the network graph after the selected i nodes (and incident edges) are removed from G. Also, let A_i · x_i = c_i denote the n_i × m_i system of flow-conservation equations for G_i. Finally, R_i = m_i − rank(A_i) denotes the minimum number of link variables that need to be (directly) specified so that the remainder network graph G_i is fully covered (i.e., the flow-conservation system for G_i becomes solvable). Our greedy algorithm for Weak VC, termed GREEDYRANK, selects at each step the node that results in the maximum possible reduction in the minimum number of link variables required to make the flow-conservation equations solvable. That is, we select the node that maximizes the difference R_i − R_{i+1}. More formally, if we let A_i^v denote the (n_i − 1) × (m_i − deg_{G_i}(v)) matrix resulting from the deletion of the flow-conservation equation for v and all its attached link variables from A_i, then the GREEDYRANK strategy can be stated as depicted in Figure 2.
Algorithm GREEDYRANK(G)
Input: G = (V, E) is the directed network graph comprising n nodes and m links.
Output: W ⊆ V, a Weak VC of G.
1) Let A_0 · x = c_0 denote the n × m linear system of flow-conservation equations for G;
2) Set W := ∅, G_0 := G, n_0 := n, m_0 := m, i := 0;
3) while m_i − rank(A_i) > 0 do
4)    Set v_best := nil and maxRed := −∞;
      /* find the node v that maximizes the reduction in the required number of variables */
5)    for each node v in G_i do
6)       red(v) := (m_i − rank(A_i)) − ((m_i − deg_{G_i}(v)) − rank(A_i^v));
7)       if red(v) > maxRed then
8)          Set v_best := v and maxRed := red(v);
9)    Set W := W ∪ {v_best}, G_{i+1} := G_i − {v_best}, A_{i+1} := A_i^{v_best},
         n_{i+1} := n_i − 1, m_{i+1} := m_i − deg_{G_i}(v_best), and i := i + 1;
Fig. 2. Finding a Near-Optimal Weak Vertex Cover.
The following theorem bounds the worst-case behavior of our
GREEDYRANK algorithm.
Theorem III.2: Algorithm GREEDYRANK returns a solution to the Weak VC problem that is guaranteed to be within a logarithmic factor, O(log(m − rank(A))), of the optimal solution, where A is the coefficient matrix of the n flow-conservation equations in G.
Time Complexity. GREEDYRANK requires repeated matrix-rank computations in order to determine the "locally optimal" node to place in the cover at each step. However, as we show in the full version of this paper [18], the specific form of the coefficient matrix A allows us to reduce matrix-rank computation to a simple search for (undirected) connected components in G_i, which can be carried out very efficiently (essentially, in time near-linear in the size of G_i). Consequently, the worst-case running time of our GREEDYRANK algorithm remains a low-order polynomial in n and m.
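To make the greedy strategy concrete, the sketch below (our own illustration, not the paper's implementation) renders GREEDYRANK directly as described above, using a generic dense matrix-rank computation (numpy) instead of the connected-components optimization from the full paper; all function and variable names are ours.

import numpy as np

def build_system(nodes, edges):
    """Flow-conservation system: one row per non-leaf node, one column per
    directed edge; +1 if the edge enters the node, -1 if it leaves it."""
    deg = {v: 0 for v in nodes}
    for u, w in edges:
        deg[u] += 1
        deg[w] += 1
    rows = [v for v in nodes if deg[v] > 1]
    ridx = {v: i for i, v in enumerate(rows)}
    A = np.zeros((len(rows), len(edges)))
    for j, (u, w) in enumerate(edges):
        if w in ridx:
            A[ridx[w], j] += 1.0
        if u in ridx:
            A[ridx[u], j] -= 1.0
    return A, rows

def deficiency(A):
    """m_i - rank(A_i): link variables still needed before the system is solvable."""
    if A.size == 0:
        return A.shape[1]
    return A.shape[1] - np.linalg.matrix_rank(A)

def reduce_system(A, rows, edges, v):
    """Selecting v for SNMP: drop v's equation and the columns of its incident links."""
    keep_r = [i for i, u in enumerate(rows) if u != v]
    keep_c = [j for j, e in enumerate(edges) if v not in e]
    return (A[np.ix_(keep_r, keep_c)],
            [u for u in rows if u != v],
            [e for e in edges if v not in e])

def greedy_rank(nodes, edges):
    """At each step, pick the node whose selection most reduces the deficiency."""
    edges = list(edges)
    A, rows = build_system(nodes, edges)
    cover = []
    while deficiency(A) > 0:
        candidates = set(rows) | {u for e in edges for u in e}
        best = min(candidates,
                   key=lambda v: deficiency(reduce_system(A, rows, edges, v)[0]))
        cover.append(best)
        A, rows, edges = reduce_system(A, rows, edges, best)
    return cover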
B.2 Exploiting Knowledge of Network Flows
So far, our Weak VC formulation makes use of the law of flow conservation on each router, but it does not exploit knowledge of traffic flows in the network when trying to estimate link-bandwidth usage. This flow information essentially consists of the paths along which packets get routed in the network and can be computed using routing-protocol control information (e.g., the link-state database in OSPF or label switched paths in MPLS). In this section, we demonstrate how knowledge of the traffic flows in our directed network graph G can be exploited in conjunction with flow conservation to further reduce the required SNMP overhead for monitoring link bandwidth.
Consider a router v in G and let E_v and F_v denote the (sub)sets of links incident on v and packet flows routed through v, respectively. We can always uniquely partition E_v into a maximal collection of k_v subsets E_v^1, ..., E_v^{k_v} such that each flow f ∈ F_v only involves (a pair of) links in a single subset E_v^i, for some i. We say that such a partitioning of the links in E_v satisfies the non-overlapping flow property. An example of this flow-based link partitioning (with k_v = 2) for a node v is depicted in Figure 3. (In the worst case k_v = 1; i.e., a node's links cannot be partitioned into non-overlapping flows and, thus, they all belong to the same partition.) The key observation here is that the law of flow conservation in fact holds for each individual link partition E_v^i, i = 1, ..., k_v; thus, node v can be marked as covered as long as |E_v^i| − 1 links in E_v^i are covered for each i = 1, ..., k_v. This essentially means that we can infer the link-bandwidth utilization on each link incident to v based on knowing the bandwidth usage on only |E_v| − k_v links of v. (Note, of course, that these |E_v| − k_v links have to satisfy the condition outlined above; that is, only one link may be left unspecified in each partition E_v^i, i = 1, ..., k_v.) As an example, knowing the bandwidth usage on the three outgoing links in Figure 3 is sufficient to infer the bandwidth load on the two incoming links. This leads us to a generalized formulation of our Weak VC problem.
Fig. 3. Partitioning v's edges into k_v = 2 partitions (E_v^1 and E_v^2) satisfying the non-overlapping flow property.
Definition III.2: [Partitioned Weak Vertex Cover] Given a directed network graph G and a partitioning of the links in G that satisfies the non-overlapping flow property, we define a set W of nodes to be a Partitioned Weak VC of G if, after initially marking each node in W as covered, it is possible to mark every node in G by iteratively performing the following three steps: (1) Mark every link in G that is incident on a covered node as covered; (2) For any node v in G, if in any link partition E_v^i (1 ≤ i ≤ k_v) there are |E_v^i| − 1 links marked covered, then mark the remaining link in E_v^i as covered; and, (3) For any node v in G, if in every link partition E_v^i (1 ≤ i ≤ k_v) there are at least |E_v^i| − 1 links marked covered, then mark vertex v as covered.
Generating the maximal link partitioning of E_v (for each v in G) that satisfies the non-overlapping flow property is fairly straightforward: the idea is to start by placing each link in its own partition and iteratively "merge" partitions that share flows, as sketched below [18]. Based on the law of flow conservation for each link partition, it is easy to see that activating SNMP agents on a Partitioned Weak VC for G is sufficient to derive the bandwidth usage on every link in the network graph. Thus, given traffic-flow information in the network, our low-overhead link-bandwidth monitoring problem becomes equivalent to determining a minimum Partitioned Weak VC for G. This problem is clearly NP-hard (being a generalization of Weak VC); however, a version of our GREEDYRANK algorithm can be used to give a fast approximate solution with a guaranteed logarithmic worst-case performance bound. (Note that, for Partitioned Weak VC, removing a node v from the linear system of flow-conservation equations means that all k_v equations corresponding to v are removed.) Due to lack of space, the full details can be found in [18].
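The partition-merging step is essentially a union-find computation. In the sketch below (our own illustration; link and flow labels are hypothetical), every incident link starts in its own partition and the partitions of any two links used by the same flow at the router are merged.

def link_partitions(incident_links, flows_through_v):
    """Maximal partitioning of a router's incident links such that every flow
    through the router only touches links of a single partition (the
    non-overlapping flow property)."""
    parent = {e: e for e in incident_links}
    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]     # path compression
            e = parent[e]
        return e
    def union(a, b):
        parent[find(a)] = find(b)
    for flow_links in flows_through_v:        # the links this flow uses at the router
        links = [e for e in flow_links if e in parent]
        for a, b in zip(links, links[1:]):
            union(a, b)
    groups = {}
    for e in incident_links:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())

# Hypothetical example with incoming links i1, i2 and outgoing links o1, o2, o3,
# where flow f1 uses (i1, o1), f2 uses (i1, o2), and f3 uses (i2, o3):
# link_partitions(["i1", "i2", "o1", "o2", "o3"], [("i1", "o1"), ("i1", "o2"), ("i2", "o3")])
# -> [["i1", "o1", "o2"], ["i2", "o3"]]   (k_v = 2 partitions, as in Figure 3)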
Theorem III.3: For the Partitioned Weak VC problem, algorithm GREEDYRANK runs in low-order polynomial time and returns a solution that is guaranteed to be within a logarithmic factor, O(log(m − rank(A))), of the optimal solution, where A is the coefficient matrix of the (per-partition) flow-conservation equations in G.
B.3 Extension to Undirected Network Graphs
Our discussion of link-bandwidth monitoring has so far focused on the case of a directed network graph model G, where packet traffic on each physical link is uni-directional and known beforehand. In general, physical network links are bi-directional, with data packets flowing both to and from routers on the same link. We now briefly describe how our results and algorithmic techniques extend to this more general and realistic scenario of an undirected network graph model.
The basic idea is to "expand" the network graph G into a directed graph G' by modeling each bi-directional physical link in G as two directed edges (in opposing directions) in G', thus capturing both directions of packet flow for each link. Of course, one (or both) of the directed edges for a link can be left out of the model if we know that it is not being used for actual traffic in the network, e.g., based on knowledge of the traffic flows. The flow-conservation law (Equation (1)) then holds for each router in G' and its directed in- and out-links created in this manner. Thus, our solutions for Weak VC and Partitioned Weak VC can be directly applied to this "expanded" directed network graph G'. More details can be found in [18].
C. Monitoring Flow-Bandwidth Utilization
Consider the (undirected) network graph G with a set of packet flows F = {f_1, f_2, ...} routed through its nodes. Enabling RMON or NetFlow on a router v of G enables the bandwidth utilization for the set F_v of all flows routed through v to be measured; i.e., v directly covers all the flows in F_v. It is straightforward to see that the problem of determining a minimum subset of RMON/NetFlow-enabled routers such that every flow in F is directly covered is essentially an instance of the NP-hard Set Cover problem [15]. Thus, a greedy Set-Cover heuristic (sketched below) can be used to return an approximate solution that is guaranteed to be within a logarithmic factor, O(log |F|), of the optimal in time polynomial in n and |F| [18].
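A minimal sketch of this greedy heuristic is given below (our own illustration): each candidate router v is described by the set F_v of flows it can directly cover, and routers are enabled until every flow is covered.

def greedy_flow_cover(flows_by_router, all_flows):
    """Greedy Set Cover: repeatedly enable RMON/NetFlow on the router that
    directly covers the largest number of still-uncovered flows."""
    uncovered = set(all_flows)
    chosen = []
    while uncovered:
        router = max(flows_by_router, key=lambda v: len(uncovered & flows_by_router[v]))
        gained = uncovered & flows_by_router[router]
        if not gained:          # remaining flows traverse no candidate router
            break
        chosen.append(router)
        uncovered -= gained
    return chosen

# flows_by_router maps each router v to the set F_v of flows routed through it.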
Exploiting Link Bandwidth Information for Covering Flows. The performance overhead imposed by tools like RMON or NetFlow on routers and network traffic is typically significantly higher than that of SNMP, mainly due to their much finer granularity of data collection. The problem, of course, is that SNMP agents cannot collect and provide traffic data at the required granularity of packet flows; only aggregate information on link-bandwidth utilization can be obtained through SNMP. The crucial observation here is that knowledge of aggregate link bandwidths (obtained via SNMP) provides us with a system of per-link linear equations on the unknown flow-bandwidth utilizations that can be exploited to significantly reduce the number of RMON/NetFlow probes required for monitoring flow-bandwidth usage. (Each such equation basically states that the aggregate bandwidth of a link e is equal to the sum of the bandwidth utilizations of the flows traversing it, i.e., b(e) = Σ_{f ∈ F_e} b(f) [18].) The resulting problem is similar to Weak VC, except that we are interested in a minimum subset W of routers such that determining the bandwidth utilization of flows passing through routers in W renders the system of per-link equations solvable. The following theorem establishes the intractability of this optimization problem and the near-optimality of our general GREEDYRANK strategy. Due to lack of space, the full details can be found in [18].
Theorem III.4: Given knowledge of the aggregate link-bandwidth utilizations in G, the problem of determining a minimum subset W of routers such that enabling RMON/NetFlow on every v ∈ W allows the determination of the flow-bandwidth utilizations for every flow in F is NP-hard. Further, an appropriately modified version of our GREEDYRANK strategy returns a solution to this problem that is guaranteed to be within a logarithmic factor, O(log(|F| − rank(B))), of the optimal, where B is the m × |F| coefficient matrix of the per-link flow-bandwidth equations in G. This approximation factor is the best possible (assuming P ≠ NP).
IV. MONITORING NETWORK LATENCY
We next turn our attention to the problem of measuring round-trip latencies for a set of network paths R in the (undirected) network graph G, where each path is a sequence of adjacent links in G. Such network latency measurements are crucial for providing QoS guarantees to end applications (e.g., voice over IP), traffic engineering, ensuring SLA compliance, fault and congestion detection, performance debugging, network operations, and dynamic replica selection on the Web.
Most previous proposals for measuring round-trip times of
network paths rely on probes, which are simply ICMP echo re-
quest packets. Existing systems typically belong to one of two
categories. The first category includes systems like WIPM [20],
AMP [21] and IDMaps [3] that deploy special measurement
servers at strategic locations in the network. Round-trip times
between each pair of servers are measured using probes, and
these times are subsequently used to approximate the latency
of arbitrary network paths. The measurement server approach,
while popular, suffers from the following two drawbacks. First, the cost of deploying and managing hundreds of geographically
distributed servers can be significant due to the required hard-
ware and software infrastructure as well as the human resources.
Second, the accuracy of the latency measurements is highly
dependent on the number of measurement servers. With few
servers, it is possible for significant errors to be introduced when
round-trip times between the servers are used to approximate
arbitrary path latencies. The second category of tools for mea-
suring path latencies include pathchar [6] and skitter [22].
Both tools measure the round-trip times for paths originating at
a small set of sources (between one and ten) by sending probes
with increasing TTL values from each source to a large set of
destinations. A shortcoming of these tools is that they can only
measure latencies of a limited set of paths that begin at one of
the sources from which ICMP probes are sent.
In this section, we present our probing-based technique that
alleviates the drawbacks of previous methods. In our approach,
path latencies are measured by transmitting probes from a single
point-of-control (i.e., the NOC). Consequently, since our tech-
nique does not require special instrumentation to be installed in
the network, it is cost-effective and easy to deploy. Further, un-
like existing approaches, our method allows for the latencies of an arbitrary set of network paths R to be measured exactly, and is thus both accurate and flexible. Our schemes achieve this by exploiting the ability within IP to explicitly route packets using either source routing or encapsulation of "IP in IP". We demonstrate that, for measuring the latency of a given set of paths R, there exists a wide range of probing strategies that impose varying amounts of load on the network infrastructure. While the problem of selecting the optimal set of probes that minimizes the network bandwidth consumed is NP-hard, we show that this problem can be mapped to the well-known Facility Location Problem (FLP), for which efficient approximation algorithms with guaranteed performance ratios have been proposed in the literature [16], [23].
A. Overview and Problem Formulation
In our approach for measuring latency, explicitly routed probes are transmitted along paths originating at the NOC (i.e., node v_0). (We discuss source routing and IP encapsulation, the two mechanisms within IP for controlling the path traversed by a packet, in more detail in Section IV-D.) The round-trip latency of a single path P is measured by sending the following two probe packets:
1. The first probe packet is sent from v_0 to one of the end nodes of P, say u, along the shortest path SP_u between them in G. The probe then returns to v_0 along the reverse of SP_u.
2. The second probe packet is sent from v_0 to the other end node of P (via u) along the path SP_u · P. The probe then returns to v_0 along the reverse of SP_u · P.
The round-trip latency of path P is computed as the difference of the round-trip times of the two probes traversing the paths SP_u and SP_u · P.
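In symbols (with RTT(·) denoting the measured round-trip time of a probe, a shorthand we introduce only for this remark), the scheme computes
$$\mathrm{lat}(P) \;=\; \mathrm{RTT}(SP_u \cdot P) \;-\; \mathrm{RTT}(SP_u),$$
where u is the end node of P visited by the base probe.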
In the remainder of this paper, we will represent each probe by the forward path traversed by it from v_0, since the return path is symmetric to the forward path (in the reverse direction). Note that the first node of each probe is always v_0. Also, in this paper, we will not consider complex probing techniques for measuring latency in which probes follow arbitrary paths that cannot be decomposed into symmetric forward and reverse path segments. (One such complex probing technique for measuring the latency of a path P would be to send a pair of probes: the first probe makes a round-trip from v_0 to an internal node, say w, of P; the second probe travels from v_0 to one of the end nodes of P via w, then to the other end node of P, and finally back to v_0 via w.)
Fig. 4. Sets of Probes for Measuring Latency of Paths.
In order to measure the latency of a set of paths R, we need to employ a set of probes Q such that, for each path P ∈ R, Q contains a pair of probes SP_u and SP_u · P, where u is an end node of P. We refer to SP_u as the base probe for P and to SP_u · P as the measurement probe for P. Further, we refer to probes in Q that are not measurement probes for any path in R as anchor probes, and to the last node visited by an anchor probe as its anchor node.
There are a number of different sets of probes that can be used to measure the latency of R. One obvious choice for Q is the set that contains, for each path P ∈ R, the following two probes: SP_u and SP_u · P (assuming that u is the end node of P that is closest to v_0). However, as the following example illustrates, there are a number of other possibilities for set Q, several of which contain a smaller number of probes and traverse fewer links.
Example IV.1: Consider the network graph shown in Figure 4(a), containing the set of paths R = {P1, P2, P3}. The obvious set of probes Q for measuring the latency of R is illustrated in Figure 4(a), where the measurement of each path is optimized independently by sending the base probe to the end node closest to v_0. This results in a distinct pair of probes, SP_u and SP_u · P, for each path P ∈ R. Figure 4(b) illustrates a different set of probes Q' for measuring R that is optimal with respect to both the number of probes and the number of traversed links. In Q', two of the paths share the same base probe, and the measurement probe for one path also serves as the base probe for an adjoining path. (Note that Q' contains a single anchor probe, whereas Q contains three distinct anchor probes.) This sharing of probes among paths reduces the number of probes. Although two of the paths are measured with longer measurement probes in Q' than in Q, this overhead is offset by the savings due to the sharing of probes in Q', thereby resulting in an overall reduction in the number of traversed links.
Ideally, we would prefer a set Q of probes that traverses as few links as possible to measure R. This is because the total number of links traversed by the probes in Q is a good measure of the additional load that the probes impose on network links. Minimizing this additional network traffic due to probes is extremely important, since we need to monitor path latencies continuously, causing probes to be transmitted frequently (e.g., every fifteen minutes). Thus, our efficient latency-monitoring problem can be formally stated as follows.
Problem Statement [Low-Overhead Path-Latency Monitoring]: Given a set of paths R, compute a set of probes Q such that (1) Q measures the latency of R; that is, for every path P ∈ R, Q contains a pair of probes SP_u and SP_u · P; and (2) Q is optimal; that is, the total number of links traversed by probes in Q is minimum.
In the following subsection, we address the above problem of computing the optimal set of probes for measuring the latency of the paths in R. We assume that, for any pair of paths P and P' in R, P is not a prefix (or suffix) of P'. The reason for this assumption will become clear in the next subsection. Note, however, that this assumption is not restrictive since, if P is a prefix of P', then P' can be split into two non-overlapping path segments, P and the remainder of P', and its latency can be computed as the sum of the latencies of the two segments.
B. Computing an Optimal Set of Probes
As illustrated in Example IV.1, a naive approach that adds to Q the optimal pair of base and measurement probes for each path in R considered independently may not result in the optimal set of probes. This is because (1) measurement probes for multiple paths can share a common base probe, and (2) the measurement probe for one path can serve as a base probe for an adjoining path. Thus, more sophisticated algorithms are needed for computing an optimal solution. Unfortunately, as the following theorem states, the problem of computing the optimal set of probes is NP-hard even if every path in R is restricted to be a single link.
Theorem IV.1: Given a graph G and a set of paths R, the problem of computing the optimal set of probes to measure the latency of R is NP-hard.
In the following, we map the problem of computing the optimal set of probes to the Facility Location Problem (FLP). Since efficient polynomial-time algorithms for approximating the FLP exist in the literature, these can then be utilized to compute a near-optimal set of probes.
Before we present our FLP reduction, we develop some additional notation. For a path P, we denote by s(P) and t(P) the first and last node of P, respectively. Further, G_R = (V_R, E_R) denotes the undirected, distance-weighted graph induced by R; thus, V_R = {s(P), t(P) : P ∈ R} is the set of all the end nodes of the paths in R, and E_R contains, for every path in R, an edge connecting its two end nodes. (Note that G_R is actually a multigraph, since there may be multiple edges between a pair of nodes in V_R.) Each edge in E_R is labeled with its corresponding path, say P, in R, and has an associated weight equal to |P|. For a pair of nodes u and v in V_R, we denote by W_{u,v} the shortest path (with respect to the sum of edge weights) from u to v in G_R; essentially, W_{u,v} is the path from u to v in the shortest-path tree rooted at u in G_R. Note that, since every edge in G_R corresponds to some path in R, W_{u,v} can be viewed as a concatenation of paths in R. Finally, we use |W_{u,v}| to denote the sum of edge weights for W_{u,v} in G_R.
We are now in a position to characterize the composition of sets of probes. A set Q of probes for measuring the latency of paths in R consists of the following two disjoint subsets:
1. A set of anchor probes Q_A (corresponding to a set of anchor nodes A ⊆ V_R), and
2. A set of measurement probes Q_M for measuring the latency of paths in R.

Algorithm OPTIMALPROBES(G, R, A)
Input: G is a network graph.
       R = a set of paths whose latency is to be measured.
       A = a set of anchor nodes.
Output: Q_M = a set of measurement probes that is optimal with respect to A and R.
1) Q_M := ∅;
2) for each path P ∈ R do
3)    Let a ∈ A and u ∈ {s(P), t(P)} be such that |SP_a| + |W_{a,u}| is minimum
      (in case of a tie between nodes of A, the node with the smaller index is chosen);
4)    Q_M := Q_M ∪ {SP_a · W_{a,u} · P};
Fig. 5. Finding an Optimal Set of Probes for Anchor Nodes A.
Since the shortest possible anchor probe in Q_A corresponding to each anchor node a ∈ A is SP_a, we take Q_A = {SP_a : a ∈ A}. In Q_M, there is a separate measurement probe for each path in R (since, for any pair of paths P, P' ∈ R, P cannot be a prefix of P'). Further, every measurement probe in Q_M is a concatenation of a single anchor probe and one or more paths in R. The shortest possible measurement probe in Q_M to measure the latency of a path P has length equal to
$$\min_{a \in A,\; u \in \{s(P),\, t(P)\}} \big\{\, |SP_a| + |W_{a,u}| \,\big\} \;+\; |P| .$$
Here, the minimization essentially captures the shortest possible path from v_0 to one of the end nodes u of P that begins with an anchor probe SP_a and is followed by paths in R.
Thus, for a given set of anchor nodes, it is possible to compute the optimal set of measurement probes Q_M for measuring R. Algorithm OPTIMALPROBES in Figure 5 computes this optimal set Q_M for a given A and R. The computed set Q_M is optimal since, for each path P ∈ R, OPTIMALPROBES adds to Q_M the measurement probe containing the smallest number of links. In addition, we can prove that the set Q_A ∪ Q_M measures the latency of all paths in R. For this, we need to show that, for every path P, the base probe SP_a · W_{a,u} is either in Q_A or is added to Q_M. If W_{a,u} is empty (i.e., u = a), then the base probe SP_a is in Q_A. Otherwise, if P' is the final path in W_{a,u}, then SP_a · W_{a,u} is the measurement probe for P' with the smallest number of links and is thus added to Q_M. Thus, the set of probes Q_A ∪ Q_M measures the latency of R.
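A compact rendition of OPTIMALPROBES is sketched below (our own illustration, not code from the paper): it builds the induced multigraph G_R, computes the distances |W_{a,u}| from every anchor with Dijkstra's algorithm, and then picks, for each path, the anchor and end node minimizing |SP_a| + |W_{a,u}|. It assumes that every path's end nodes are reachable from some anchor in G_R (e.g., one anchor per connected component), and ties are broken arbitrarily rather than by node index.

import heapq

def optimal_probes(anchor_dists, paths):
    """OPTIMALPROBES (Figure 5): for each path P, choose the anchor a and end
    node u of P minimizing |SP_a| + |W_{a,u}|; the measurement probe is then
    SP_a . W_{a,u} . P, of length |SP_a| + |W_{a,u}| + |P|.

    anchor_dists : {a: |SP_a|}            shortest-path length from the NOC to anchor a
    paths        : {pid: (s, t, length)}  end nodes and link count of each path in R"""
    adj = {}                              # induced multigraph G_R, edge weights |P|
    for s, t, w in paths.values():
        adj.setdefault(s, []).append((t, w))
        adj.setdefault(t, []).append((s, w))

    def dijkstra(src):                    # |W_{src,u}| for every node u reachable in G_R
        dist, heap = {src: 0}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for v, w in adj.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    reach = {a: dijkstra(a) for a in anchor_dists}
    probes = {}
    for pid, (s, t, length) in paths.items():
        a, u = min(((a, u) for a in anchor_dists for u in (s, t) if u in reach[a]),
                   key=lambda au: anchor_dists[au[0]] + reach[au[0]][au[1]])
        probes[pid] = {"anchor": a, "end_node": u,
                       "probe_links": anchor_dists[a] + reach[a][u] + length}
    return probes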
Theorem IV.2: Given a set of anchor nodes A and paths R, Algorithm OPTIMALPROBES computes an optimal set of measurement probes Q_M such that Q_A ∪ Q_M measures the latency of R.
From Theorem IV.2, it follows that, for a given set of anchor nodes A, the set Q = Q_A ∪ Q_M is the optimal set of probes for measuring R among sets for which Q_A is the set of anchor probes. Further, the number of links traversed by Q is:
$$\sum_{a \in A} |SP_a| \;+\; \sum_{P \in R} \Big( \min_{a \in A,\; u \in \{s(P),\, t(P)\}} \big\{\, |SP_a| + |W_{a,u}| \,\big\} + |P| \Big). \qquad (2)$$
Thus, if A is a set of anchor nodes for an optimal set of probes, then Q_A ∪ Q_M (where Q_M is computed by OPTIMALPROBES) is an optimal set of probes for measuring R; that is, Q_A ∪ Q_M minimizes the value of Equation (2). As a result, we have transformed the problem of computing the optimal set of probes into that of computing a set of anchor nodes A that minimizes Equation (2). Once A is known, algorithm OPTIMALPROBES can be used to compute the measurement probes Q_M such that Q_A ∪ Q_M is the optimal set of probes.
The above minimization problem maps naturally to the Facility Location Problem (FLP) [16], [23]. The FLP is formulated as follows: let C be a set of clients and F be a set of facilities such that each facility can "serve" every client. There is a cost cost(f) of "choosing" a facility f ∈ F and a cost d(f, c) of serving client c ∈ C by facility f ∈ F. The problem asks us to choose a subset of facilities F' ⊆ F such that the sum of the costs of the chosen facilities plus the sum of the costs of serving every client by its closest chosen facility is minimized; that is, to minimize
$$\sum_{f \in F'} \mathrm{cost}(f) \;+\; \sum_{c \in C}\; \min_{f \in F'} d(f, c).$$
The problem of computing the set of anchor nodes A that minimizes Equation (2) can be mapped to the FLP as follows. Let C be the set of paths R and let F be the set of candidate anchor nodes V_R. The cost of choosing a facility a, cost(a), is |SP_a|, the length of the shortest path from v_0 to a. The cost of serving client P from facility a, d(a, P), is
$$|SP_a| \;+\; \min_{u \in \{s(P),\, t(P)\}} |W_{a,u}| \;+\; |P| ,$$
which is the sum of the lengths of SP_a and P and of the shortest path from a to one of the end nodes of P in G_R. Thus, the set F' computed for the FLP corresponds to our desired optimal set A of anchor nodes.
The FLP is NP-hard; however, it can be reduced to an instance of the Set Cover problem and then approximated within a factor of O(log |C|) in time polynomial in |C| and |F| [16]. Thus, we can compute a provably near-optimal set Q of probes for measuring the paths in R by first using Hochbaum's FLP heuristic [16] to find a near-optimal set of anchor nodes A, and then running OPTIMALPROBES to find the optimal set of measurement probes Q_M for A. The set of probes Q = Q_A ∪ Q_M is then guaranteed to be within a factor of O(log |R|) of the optimal solution for measuring R [18].
C. Minimizing the Number of Probes
Suppose that, instead of minimizing the number of traversed links, we are interested in computing the set Q with the minimum number of probes. In this case, it is possible to compute the optimal set of probes by invoking algorithm OPTIMALPROBES with the set of paths R whose latency is to be measured and a set A that contains one (arbitrary) node from each connected component of G_R. The final set Q = Q_A ∪ Q_M contains one anchor probe per connected component of G_R and one measurement probe per path, which is optimal with respect to the number of probes.
D. Implementation Issues
Our approach for measuring latency is highly dependent on
being able to explicitly route probe packets along specific paths.
Loose source routing and encapsulation of IP in IP are two
mechanisms for controlling routes followed by packets. We pre-
fer encapsulation over loose source routing due to the following
reasons [24]. First, Internet routers exhibit performance prob-
lems when forwarding packets that contain IP options, includ-
ing the IP source routing option. Second, the source routing
option is frequently disabled on Internet routers due to security
problems associated with its use. Finally, IP allows for at most
40 bytes of options, which restricts the number of IP addresses
through which a packet can be routed using source routing to be
no more than 10.
While encapsulation addresses some of the problems with
source routing, unwrapping the header in encapsulated packets
still incurs overhead at routers and encapsulated packets are typ-
ically larger than source routed packets. Both processing over-
head and packet sizes can be reduced significantly by using as
few headers as possible in each probe packet. We can achieve
this by splitting the path for a probe packet into maximal disjoint
path segments, such that each path segment is consistent with
the route computed by the underlying routing protocol (e.g.,
OSPF). Then the probe packet can be routed along the path by
using one header per path segment that contains the IP address
of the endpoint of the segment that is not shared with the pre-
vious segment. Note that the final measured round-trip times
must be adjusted to account for the overhead of processing the
encapsulated packets at intermediate routers.
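The segment-splitting idea can be sketched as a greedy scan of the probe
path. In the sketch below, "consistent with the route computed by the
underlying routing protocol" is approximated by "is a shortest path under
unit link weights"; this ignores OSPF link weights and equal-cost multipath,
and the function names are ours.

    from collections import deque

    def hop_dist(adj, src, dst):
        """Hop-count shortest-path length between src and dst (BFS)."""
        if src == dst:
            return 0
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    if v == dst:
                        return dist[v]
                    queue.append(v)
        return float("inf")

    def split_into_segments(adj, probe_path):
        """Greedily split probe_path into maximal segments, each of which is
        a shortest path in the graph and hence needs only one encapsulation
        header carrying the far endpoint of the segment."""
        segments = []
        s = 0
        while s < len(probe_path) - 1:
            e = s + 1
            # Extend while the sub-path v_s .. v_{e+1} is still shortest.
            while (e + 1 < len(probe_path) and
                   hop_dist(adj, probe_path[s], probe_path[e + 1]) == e + 1 - s):
                e += 1
            segments.append(probe_path[s:e + 1])
            s = e
        return segments

    # One encapsulation header per segment, addressed to the segment endpoint
    # not shared with the previous segment:
    # headers = [seg[-1] for seg in split_into_segments(adj, path)]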
V. SIMULATION RESULTS
In this section, we present simulation results comparing the
performance of the various algorithms that we have devel-
oped for both the link-bandwidth and path-latency measurement
problems. The main objective of the simulation results is to
demonstrate that our proposed algorithmic solutions are not only
theoretically sound, with good guaranteed worst-case bounds, but
also give significant benefits over naive solutions in prac-
tice (i.e., on average) for a wide variety of realistic network
topologies. The simulations are based on network topologies
generated using the Waxman Model [17], which is a popular
topology model for networking research (e.g., [4]). Different
network topologies are generated by varying three parameters:
(1) $n$, the number of nodes in the network graph; (2) $\alpha$, a parameter
that controls the density of short edges in the network; and (3) $\beta$, a
parameter that controls the average node degree.
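A minimal generator following Waxman's original formulation [17] is sketched
below, assuming nodes placed uniformly in the unit square: pair $(u, v)$ is
connected with probability beta * exp(-d(u, v) / (alpha * L)), where d is
Euclidean distance and L the maximum pairwise distance. The parameter values
used in the simulations reported here are not reproduced, and the function
name is ours.

    import math
    import random

    def waxman_graph(n, alpha, beta, seed=None):
        """Waxman random topology: beta scales the average node degree;
        alpha controls how quickly edge probability decays with distance
        (i.e., the density of short versus long edges)."""
        rng = random.Random(seed)
        pos = [(rng.random(), rng.random()) for _ in range(n)]
        L = max((math.dist(pos[u], pos[v])
                 for u in range(n) for v in range(u + 1, n)), default=1.0)
        adj = {u: set() for u in range(n)}
        for u in range(n):
            for v in range(u + 1, n):
                d = math.dist(pos[u], pos[v])
                if rng.random() < beta * math.exp(-d / (alpha * L)):
                    adj[u].add(v)
                    adj[v].add(u)
        return adj, pos

For the bandwidth experiments, each undirected edge would additionally be
given a random orientation, as noted in the footnote of Sec. V-A.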
A. Bandwidth Measurement
For the link-bandwidth measurement problem, we compare the performance of
three algorithms: the maximal matching heuristic for simple VC (Sec. III-A),
and two algorithms based on our Weak VC formulation: a variant of the maximal
matching heuristic and our GREEDYRANK algorithm (Sec. III-B.1). Our maximal
matching variant for Weak VC basically ensures that all transitively-specified
edges (based on flow conservation) are eliminated from the edge set whenever a
new edge enters the matching. The comparison is in terms of the number of
nodes that need to run SNMP in order to measure the bandwidth of each link in
the generated network graphs.⁴ We denote the number of SNMP activations for
these three algorithms by $S_{\mathit{match}}$, $S_{\mathit{weak}}$, and
$S_{\mathit{rank}}$, respectively.
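For concreteness, the matching-based baseline ($S_{\mathit{match}}$) can be
sketched as the classical maximal-matching 2-approximation for Vertex Cover;
the Weak VC variant would additionally discard, after every matching step,
edges whose bandwidth becomes derivable through flow conservation (that
bookkeeping depends on Sec. III-B and is omitted here). The function name is
ours.

    def matching_vertex_cover(adj):
        """Maximal-matching heuristic for Vertex Cover: scan the edges and,
        whenever both endpoints of an edge are still uncovered, add both to
        the cover. The endpoints of the resulting maximal matching cover
        every edge (a 2-approximation); each chosen node is an SNMP
        activation site."""
        cover = set()
        for u in adj:
            for v in adj[u]:
                if u not in cover and v not in cover:
                    cover.add(u)
                    cover.add(v)
        return cover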
Table II presents one set of simulation results; we have obtained similar
results for other parameter settings. The first column in the table
represents the average degree of the nodes in the generated network graph
(which increases with larger values of $\beta$). Our results indicate that
GREEDYRANK is the clear winner, reducing the number of SNMP activations by as
much as 67% over the naive, "brute-force" approach, and by as much as 35%
over its closest matching-based competitor.
⁴ For the bandwidth measurement simulations, each undirected graph generated
by the Waxman model is converted into a directed graph by randomly fixing the
direction of each of its edges.
TABLE II
COMPARISON OF LINK-BANDWIDTH MEASUREMENT ALGORITHMS

Avg. Degree   S_match   S_weak   S_rank   S_rank / n
4.4           387       255      165      0.33
8.6           441       372      254      0.51
12.6          453       408      307      0.61
16.9          466       431      334      0.67
B. Latency Measurement
For the latency measurement simulations, we compare the performance of two
algorithms: the naive approach, where the optimal probes are computed
independently for each path (Sec. IV-A), and our FLP-based approach
(Sec. IV-B). We compare the performance of these algorithms in terms of both
the total number of links traversed by the probe packets, denoted by $L$, and
the number of probe packets transmitted, denoted by $N$, where
$N = |Q_A| + |Q_M|$.

For each network graph generated using the Waxman model, a random set of 20
paths is considered. We vary the "topology" of the set of generated paths
using a parameter $n_e$, which represents the number of end nodes that serve
as starting points for the 20 generated paths. Thus, a smaller value of $n_e$
means that more paths are terminated by the same end node. The node
representing the NOC is a randomly selected node that is not incident on any
of the paths.
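Such path sets can be mimicked with a simple random-walk generator; the
sketch below is only indicative, since the exact path-length range used in
the simulations is not reproduced, and the parameter defaults and function
name are placeholders of ours.

    import random

    def random_path_set(adj, num_paths=20, num_end_nodes=4, max_links=10,
                        seed=None):
        """Pick num_end_nodes starting end nodes and grow each path as a
        simple random walk of at most max_links links; paths are assigned to
        starting nodes round-robin, so a smaller num_end_nodes makes more
        paths share an end node."""
        rng = random.Random(seed)
        starts = rng.sample(sorted(adj), num_end_nodes)
        paths = []
        for i in range(num_paths):
            walk = [starts[i % num_end_nodes]]
            while len(walk) <= max_links:
                nxt = [v for v in adj[walk[-1]] if v not in walk]
                if not nxt:
                    break
                walk.append(rng.choice(nxt))
            if len(walk) > 1:
                paths.append(walk)
        return paths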
Table III presents one set of simulation results; we have observed similar
trends for other parameter settings. The results indicate that our FLP-based
heuristic is more effective than the naive approach in terms of both the
total number of links traversed and the total number of probe packets
transmitted.
TABLE III
COMPARISON OF LATENCY MEASUREMENT ALGORITHMS

n_e   L_naive   L_FLP   L_FLP / L_naive   N_naive   N_FLP   N_FLP / N_naive
2     684       542     0.79              37        22      0.59
4     672       560     0.83              37        24      0.65
8     678       594     0.88              38        28      0.74
16    680       628     0.92              39        32      0.82
VI. CONCLUSIONS
In this paper, we have addressed the problem of efficiently
monitoring bandwidth utilization and path latencies in IP net-
works. Unlike earlier approaches, our measurement architecture
assumes a single point-of-control in the network (correspond-
ing to the NOC) that is responsible for gathering bandwidth and
latency information using widely-deployed management tools,
like SNMP, RMON/NetFlow, and explicitly-routed IP probes.
We have demonstrated that our measurement model gives rise
to new optimization problems, most of which prove to be NP-hard.
We have also developed novel approximation algorithms
for these optimization problems and proved guaranteed upper
bounds on their worst-case performance. Finally, we have ver-
ified the effectiveness of our monitoring algorithms through a
preliminary simulation evaluation.
Although this paper has focused on a single point-of-control
measurement architecture, our approach is also readily appli-
cable to a distributed-monitoring setting, where a number of
NOCs/“monitoring boxes” have been distributed over a large
network area with each NOC responsible for monitoring a
smaller region of the network. Our algorithms can then be used
to minimize the monitoring overhead within each individual re-
gion. The problem of optimal distribution and placement of
NOCs across a large network can be formulated as a variant
of the well-known “k-center problem” (with an appropriately-
defined distance function) [4].
Acknowledgement: We would like to thank Amit Kumar for suggesting the
efficient rank-computation algorithm used in GREEDYRANK.
REFERENCES
[1] W. Stallings, “SNMP, SNMPv2, SNMPv3, and RMON 1 and 2”, Addison-
Wesley Longman, Inc., 1999, (Third Edition).
[2] “NetFlow Services and Applications,” Cisco Systems White Paper, 1999.
[3] P. Francis, S. Jamin, V. Paxson, L. Zhang, D. F. Gryniewicz, and Y. Jin,
“An Architecture for a Global Internet Host Distance Estimation Service,”
in Proc. of IEEE INFOCOM’99, March 1999.
[4] S. Jamin, C. Jin, Y. Jin, Y. Raz, Y. Shavitt, and L. Zhang, “On the Place-
ment of Internet Instrumentation,” in Proc. of IEEE INFOCOM’2000,
March 2000.
[5] W. Theilmann and K. Rothermel, “Dynamic Distance Maps of the Inter-
net,” in Proc. of IEEE INFOCOM’2000, March 2000.
[6] V. Jacobson, “pathchar - A Tool to Infer Characteristics of Internet Paths,”
April 1997, ftp://ftp.ee.lbl.gov/pathchar.
[7] A.B. Downey, “Using pathchar to Estimate Internet Link Characteristics,”
in Proc. of ACM SIGCOMM’99, August 1999.
[8] J.-C. Bolot, “End-to-End Packet Delay and Loss Behavior in the Internet,”
in Proc. of ACM SIGCOMM’93, September 1993.
[9] K. Lai and M. Baker, “Measuring Bandwidth,” in Proc. of IEEE INFO-
COM’99, March 1999.
[10] M. Cheikhrouhou, J. Labetoulle, “An Efficient Polling Layer for SNMP,”
Proc. 2000 IEEE/IFIP Network Operations & Management Symposium,
April 2000.
[11] Y. Yemini, G. Goldszmidt, S. Yemini, “Network Management by Dele-
gation,” Proc. Intl Symposium on Integrated Network Management, April
1991.
[12] D. Breitgand, D. Raz, Y. Shavitt, “SNMP GetPrev: An Efficient Way to
Access Data in Large MIB Tables,” Bell Labs Tech. Memorandum, August
2000.
[13] V. Paxson, “Towards a Framework for Defining Internet Performance Met-
rics,” in Proceedings of INET’96, 1996.
[14] S. Keshav, “An Engineering Approach to Computer Networking”,
Addison-Wesley Professional Computing Series, 1997.
[15] M.R. Garey and D.S. Johnson, “Computers and Intractability: A Guide to
the Theory of NP-Completeness”, W.H. Freeman, 1979.
[16] D.S. Hochbaum, “Heuristics for the Fixed Cost Median Problem,” Math-
ematical Programming, vol.22, pp. 148–162, 1982.
[17] B.M. Waxman, “Routing of Multipoint Connections,” IEEE Jrnl. on Se-
lected Areas in Communications, vol. 6, no. 9, pp. 1617–1622, December
1988.
[18] Y. Breitbart, C.-Y. Chan, M. Garofalakis, R. Rastogi, and A. Silberschatz,
“Efficiently Monitoring Bandwidth and Latency in IP Networks,” Bell
Labs Tech. Memorandum, July 2000.
[19] V. V. Vazirani, “Approximation Algorithms”, Springer-Verlag, 2000, (To
appear).
[20] R. Caceres, N.G. Duffield, A. Feldmann, J. Friedmann, A. Greenberg,
R. Greer, T. Johnson, C. Kalmanek, B. Krishnamurthy, D. Lavelle, P.P.
Mishra, K.K. Ramakrishnan, J. Rexford, F. True, and J.E. van der Merwe,
“Measurement and Analysis of IP Network Usage and Behaviour,” IEEE
Communications Magazine, pp. 144–151, May 2000.
[21] T. McGregor, H.-W. Braun, and J. Brown, “The NLANR Network Anal-
ysis Infrastructure,” IEEE Communications Magazine, pp. 122–128, May
2000.
[22] Cooperative Association for Internet Data Analysis (CAIDA),
http://www.caida.org/.
[23] M. Charikar and S. Guha, “Improved Combinatorial Algorithms for the
Facility Location and k-Median Problems,” in Proc. of IEEE FOCS’99,
October 1999.
[24] C. Perkins, “IP encapsulation within IP,” Internet RFC-2003 (available
from http://www.ietf.org/rfc/), October 1996.