Efficiently Monitoring Bandwidth and Latency
in IP Networks
Yuri Breitbart, Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi, Avi Silberschatz
Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974
{yuri, cychan, minos, rastogi, avi}@bell-labs.com
Abstract—Effective monitoring of network utilization and performance indicators is a key enabling technology for proactive and reactive resource management, flexible accounting, and intelligent planning in next-generation IP networks. In this paper, we address the challenging problem of efficiently monitoring bandwidth utilization and path latencies in an IP data network. Unlike earlier approaches, our measurement architecture assumes a single point-of-control in the network (corresponding to the Network Operations Center) that is responsible for gathering bandwidth and latency information using widely-deployed management tools, like SNMP, RMON/NetFlow, and explicitly-routed IP probe packets. Our goal is to identify effective techniques for monitoring (a) bandwidth usage for a given set of links or packet flows, and (b) path latencies for a given set of paths, while minimizing the overhead imposed by the management tools on the underlying production network. We demonstrate that minimizing overheads under our measurement model gives rise to new combinatorial optimization problems, most of which prove to be NP-hard. We also propose novel approximation algorithms for these optimization problems and prove guaranteed upper bounds on their worst-case performance. Our simulation results validate our approach, demonstrating the effectiveness of our novel monitoring algorithms over a wide range of network topologies.
I. INTRODUCTION
THE explosive growth in Internet and intranet deployment for a constantly growing variety of applications has created a massive increase in demand for bandwidth, performance, predictable Quality of Service (QoS), and differentiated network services. Simultaneously, the need has emerged for measurement technology that will support this growth by providing IP network managers with effective tools for monitoring network utilization and performance. Bandwidth and latency are clearly the two key performance parameters and utilization indicators for any modern IP network. Knowledge of the up-to-date bandwidth utilizations and path latencies is critical for numerous important network management tasks, including application and user profiling, proactive and reactive resource management and traffic engineering, as well as providing and verifying QoS guarantees for end-user applications.
Indeed, these observations have led to a recent flurry of both
research and industrial activity in the area of developing novel
tools and infrastructures for measuring network bandwidth and
latency parameters. Examples include SNMP and RMON mea-
surement probes [1], Cisco’s NetFlow tools [2], the IDMaps [3],
[4] and Network Distance Maps [5] efforts for measuring end-
to-end network latencies, the pathchar tool for estimating In-
ternet link characteristics [6], [7], and packet-pair algorithms
for measuring link bandwidth [8], [9]. A crucial requirement
for such monitoring tools is that they be deployed in an intelli-
gent manner in order to avoid placing undue strain on the shared
resources of the production network.
As an example, Cisco’s NetFlow measurement tool al-
lows NetFlow-enabled routers to collect detailed traffic data
on packet flows between source-destination node pairs [2].
NetFlow-enabled routers can generate large volumes of export
data due to the size and distributed nature of large data net-
works, the granularity of the recorded flow data, and the rapid
data traffic growth. The key mechanism for enhancing NetFlow
data volume manageability is the careful planning of NetFlow
deployment. Cisco suggests that NetFlow be deployed incre-
mentally (i.e., interface by interface) and strategically (i.e., on
carefully-chosen routers), instead of being widely deployed on
every router in the network [2]. Cisco domain experts can work
with customers to determine such “key” routers and interfaces
for NetFlow deployment based on the customers’ traffic flow
patterns and network topology and architecture [2]. Similar ob-
servations hold for the deployment of SNMP agents [1], since
processing SNMP queries can adversely impact router perfor-
mance and SNMP data transfers can result in significant vol-
umes of additional network traffic. In particular, as modern
Network Management Systems (NMS) shift their focus toward
service- and application-level management, the network moni-
toring process requires more data to be collected and at much
higher frequencies. In such scenarios, the SNMP-polling fre-
quency needs to be high enough not to miss relevant changes or
degradations in application behavior or service availability [10].
(In fact, even for failure monitoring, Stallings [1] suggests that
short polling intervals are often required in order for the NMS
to be responsive to problems in the network.) When such high
SNMP-polling frequencies are prescribed, the overhead that a
polled SNMP agent imposes on the underlying router can be
significant and can adversely impact the router’s throughput.
Further, the problem is only exacerbated for mid- to low-end routers (e.g., routers that implement large parts of their routing functionality in software). As an example, our experiments with a Cisco 4000-series router on our local network showed the throughput of the router to drop substantially during a polling cycle (where repeated getnext queries are issued to gather link-utilization data). Obviously, polling such a router at reasonably high frequencies can severely impact its performance. Also,
note that the network bandwidth consumed by such frequent
SNMP polling for detailed router/application/service monitor-
ing can be significant, primarily due to the large number of
polling messages that need to traverse the network from/to the
NMS to/from the polled routers. In fact, this is the main motiva-
tion behind work on distributed polling engines (e.g., [11]) and
more recent proposals on “batching” SNMP-polling messages
[10] and more effective SNMP-polling primitives [12].
As another motivating example, the IDMaps [3], [4] and Network Distance Maps [5] efforts aim to produce "latency maps" of the Internet by introducing measurement servers/tracers that continuously probe each other to determine their distance. In order to make their approach scale in terms of both the storage requirement and the extra probing load imposed on the network, both approaches suggest techniques for pruning the distance map based on heuristic observations [3], graph-theoretic ideas like t-spanners [4], or hierarchical clustering of the measurement servers [5]. Minimizing monitoring overheads is also critical in order to avoid "Heisenberg effects", in which the additional traffic imposed by the network monitors perturbs the network's performance metrics and biases the resulting analysis in unpredictable ways [13].
In this paper, we address the challenging problem of effi-
ciently monitoring bandwidth utilization and path latencies in
an IP data network. Earlier proposals for measuring network
utilization characteristics typically assume that the measurement
instrumentation can be either (a) intelligently distributed at dif-
ferent points in the underlying network [3], [4], [5] or (b) placed
at the endpoints (source and/or destination node) of the end-to-
end path whose characteristics are of interest [6], [7], [8], [9].
In contrast, the monitoring algorithms proposed in this paper
assume a much more common and, we believe, realistic mea-
surement model in which a single, predefined point in the net-
work (corresponding to the Network Operations Center (NOC))
is responsible for actively gathering bandwidth and latency in-
formation from network elements. Thus, rather than requiring
the distribution of specialized instrumentation software and/or
hardware (which can be cumbersome and expensive to deploy
and manage) inside the production network, our algorithms en-
able a network administrator to efficiently monitor utilization
statistics from a single point-of-control. More specifically, we propose effective, low-overhead strategies for collecting the following utilization statistics as a function of time:
1. Bandwidth usage for a given (sub)set of (a) links, and (b)
aggregate packet flows between ingress-egress routers in the
network. Link-bandwidth utilization information is obviously
critical for a number of network management tasks, such as
identifying and relieving congestion points in the network.
Flow-bandwidth usage, on the other hand, provides bandwidth-
utilization data at a finer granularity which can be invaluable,
e.g., for usage-based customer billing and Service Level Agree-
ment (SLA) verification.
2. Path latencies for a given (sub)set of (possibly overlapping)
source-destination paths in the network. Once again, knowl-
edge of the delays that packets experience along certain routes
is important, e.g., in determining effective communication paths
for applications with low-latency QoS requirements or dynami-
cally routing the clients of a replicated service to their “closest”
replica [3].
Our statistics collection methodology is based on exploiting
existing, widely-deployed software tools for managing IP net-
works, like SNMP and RMON/NetFlow agents [1], [2] and
explicitly-routed IP probe packets [14]. The target applica-
tion domain for our monitoring strategies is large ISP networks,
comprising hundreds of routers and several thousand network
links. Such large ISP installations are typically characterized by
high resource-utilization levels, which means that scalable mon-
itoring strategies that minimize the impact of collecting utiliza-
tion information on the underlying network are of the essence.
This is especially true since this information needs to be col-
lected periodically (e.g., every fifteen minutes) in order to con-
tinuously monitor the state and evolution of the network. The
main contributions of our work can be summarized as follows.
Novel Algorithms for Efficiently Monitoring Link and
Flow Bandwidth Utilization. We demonstrate that the problem of collecting link-bandwidth utilization information from an underlying network while minimizing the required number of SNMP probes gives rise to a novel, NP-hard generalization of the traditional Vertex Cover (VC) problem [15], termed Weak VC. Abstractly, Weak VC is a VC problem enriched with a linear system of equations for edge variables representing additional "knowledge" that can be used to reduce the size of the cover. We propose a new, polynomial-time heuristic algorithm for Weak VC that is provably near-optimal (with a logarithmic worst-case performance bound). Furthermore, we show that our heuristic is in fact very general and can be adapted to guarantee logarithmic approximation bounds for other NP-hard problems that arise in efficient bandwidth monitoring, including the problem of minimizing the RMON/NetFlow overhead for collecting flow-bandwidth usage information from the network.
Novel Algorithms for Efficiently Monitoring Path Laten-
cies. We develop flexible techniques that are based on transmit-
ting explicitly-routed IP probe packets from the NOC to accu-
rately measure the latency of an arbitrary set of network paths.
By allowing IP probes to be shared among the various paths, our
probing techniques enable efficient measurement of their laten-
cies. We prove that the problem of computing the (optimal) set
of probes for measuring the latency of a set of paths that im-
poses minimum load on network links is

-hard. Fortunately,
we are able to demonstrate that our optimal probe computation
problem can be mapped to the well-known Facility Location
Problem (FLP), which allows us to use the polynomial-time ap-
proximation algorithm of Hochbaum [16] to obtain a provably
near-optimal set of IP probes.
Simulation Results Validating our Monitoring Strategies.
In order to gauge the effectiveness of our monitoring algorithms,
we have conducted a series of simulation experiments on a broad
range of network graphs generated using the Waxman topol-
ogy model [17]. For link-bandwidth measurements, we find
that, compared to a naive approach based on simple VC, our
Weak VC-based heuristic results in reductions as high as 57%
in the number of SNMP-agent activations. Our experiences with
latency measurements are similar, showing that, compared to
naive probing strategies, our FLP-based heuristic returns sets of
probes that, in several cases, traverse 20% fewer network links.
The remainder of this paper is organized as follows. Sec-
tion II introduces our system model and the notational conven-
tions used in the paper. The two optimization problems that we
address in this paper are presented in Section III and Section IV,
respectively, for the link/flow bandwidth measurement prob-
lem and the path latency measurement problem. In Section V,
we present simulation results to validate our proposed approach.
Finally, we present our conclusions in Section VI. Due to space
constraints, proofs of theoretical results can be found in the full
version of this paper [18].
II. SYSTEM MODEL AND NOTATION
Our abstract model of a data network is an undirected graph G = (V, E), where V = {v_1, ..., v_n} denotes the set of network nodes (i.e., routers) and E = {e_1, ..., e_m} represents the set of edges (i.e., physical links) connecting the routers. We let n = |V| and m = |E| denote the number of G's nodes and edges, respectively. We also use deg(v) to denote the degree (i.e., total number of incident edges) of node v ∈ V. The location of the Network Operations Center (NOC) is denoted by the "special" node v_0 where, without loss of generality, we assume that v_0 ∈ V. Further, for a node u ∈ V, we denote the shortest path (in terms of the number of links) from v_0 to u by SP_u. Also, for paths P_1 and P_2, |P_1| is the number of links in P_1 and P_1 · P_2 is the path resulting from the concatenation of P_1 and P_2. Finally, given an edge e = (u, v) in E, b(e) stands for the bandwidth utilization at the corresponding link of the network. Table I summarizes the notation used throughout the paper with a brief description of its semantics. We provide detailed definitions of some of these parameters in the text. Additional notation will be introduced when necessary.
Symbol               Semantics
G = (V, E)           Network graph
n, m                 Number of nodes/edges in G
u, v, ...            Generic nodes/routers in the network graph
e, e_1, e_2, ...     Generic edges/links in the network graph
P, P_1, P_2, ...     Generic paths in the network graph
SP_v                 Shortest path from v_0 to v
|P|                  Number of links in path P
P_1 · P_2            Concatenation of paths P_1 and P_2
F = {f_1, f_2, ...}  Set of traffic flows in the network
F_v                  Set of flows in F routed through router v
F_e                  Set of flows in F routed through link e
b(e)                 Bandwidth utilization for network link e
b(f)                 Bandwidth utilization for traffic flow f
TABLE I
NOTATION.
For our bandwidth-monitoring schemes that make use of flow information, we assume that all data traffic in the monitored network is distributed among a set of packet flows F; that is, every data packet routed in G belongs to some flow f ∈ F. Each such flow f is essentially a directed path from a source/ingress router to a destination/egress router in G. Note that, for a given pair of ingress-egress nodes, there may be multiple packet flows between them. Intuitively, each flow represents the aggregate traffic involving a set of source-destination IP address pairs. Edge ingress/egress routers typically serve a wide range of IP addresses, and traffic between different source/destination addresses may be split at the network's edge routers along multiple flows, e.g., for traffic engineering or accounting purposes. We let F_v (F_e) denote the set of packet flows routed through router v (resp., link e) in G. We also use b(f) to denote the bandwidth usage of flow f ∈ F in G.
III. MONITORING LINK AND FLOW BANDWIDTH
An IP router is typically managed by activating an SNMP
agent on that router. Over time, the SNMP agent collects vari-
ous operational statistics for the router which are stored in the
router’s SNMP Management Information Base (MIB) and can
be dispatched (on demand or periodically) to the NOC site [1].
SNMP MIBs store detailed information on the total number of
bytes received or transmitted on each interface of an SNMP-
enabled network element. Thus, periodically querying router
SNMP MIBs provides us with a straightforward method of
observing link-bandwidth usage in the underlying IP network.
More specifically, assume that, using SNMP, we periodically download the total number of bytes received (bytes_rcvd) and bytes transmitted (bytes_xmit) on a given router interface every t units of time. The average bandwidth usage for the incoming (outgoing) link attached to that interface over the measurement interval is then bytes_rcvd / t (resp., bytes_xmit / t). A naive, "brute-force" solution to our link-bandwidth monitoring problem would therefore consist of (1) activating an SNMP agent on every network router in G, and (2) periodically downloading the number of bytes observed on each interface to the NOC by issuing appropriate SNMP queries to all routers.
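As a concrete illustration of the byte-counter arithmetic above, the following sketch (ours, not part of the paper's architecture) computes the average utilization of one link direction from two successive samples of a standard MIB-II octet counter such as ifInOctets or ifOutOctets. The SNMP retrieval itself is assumed to happen elsewhere, and the 32-bit counter width is an assumption (high-speed interfaces would use the 64-bit ifHCInOctets/ifHCOutOctets counters instead).

def avg_link_bandwidth_bps(prev_octets, curr_octets, interval_secs, counter_bits=32):
    """Average bandwidth (bits/sec) on one interface direction over a polling
    interval, from two samples of an SNMP octet counter; a single counter
    wrap-around between the samples is absorbed by the modular subtraction."""
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8.0 / interval_secs

# Example: two ifOutOctets samples taken 900 seconds (15 minutes) apart.
outgoing_bps = avg_link_bandwidth_bps(prev_octets=1_200_000,
                                      curr_octets=91_200_000,
                                      interval_secs=900)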
There are two serious problems with such a “brute-force” ap-
proach. First, running an SNMP agent and answering periodic
SNMP queries from the NOC typically has a significant associ-
ated overhead that can adversely impact the performance char-
acteristics of a router. Using such a naive bandwidth-monitoring
strategy means that the routing performance of every router in
the network is affected. Second, periodically downloading
SNMP link-traffic data from every router can result in a sub-
stantial increase in the observed volume of network traffic. We
are therefore interested in finding link-bandwidth monitoring
schemes that minimize the SNMP overhead on the underlying
IP network. More formally, our problem can be stated as fol-
lows.
Problem Statement [Low-Overhead Link-Bandwidth Monitoring]: Given a network G = (V, E), determine a minimum subset of nodes W ⊆ V such that enabling and monitoring SNMP agents on these nodes is sufficient to infer the link-bandwidth usage for every link of G. (Without loss of generality, we assume that all links of G are to be monitored; if only the edges in a subset E' ⊆ E are of interest, then G is understood to be the network subgraph spanned by E'.)
For flow-bandwidth monitoring, RMON [1] or NetFlow
agents [2] can be enabled on routers to measure the num-
ber of data packets shipped through any of the router’s inter-
faces between specific pairs of source-destination IP addresses.
Like SNMP, however, deploying and periodically querying
RMON/NetFlow agents comes at a cost which can substantially
impact the performance of the router and the observed volume
of network traffic. In fact, both these problems are exacerbated
for RMON and NetFlow compared to simple SNMP, since the
measurements are collected and stored at a much finer granular-
ity resulting in much larger volumes of management data. Thus,
monitoring bandwidth usage at the level of packet flows gives
rise to similar overhead-minimization problems.
In this section, we propose novel formulations and algorith-
mic solutions to the problem of low-overhead bandwidth moni-
toring for network links and packet flows.
A. A Vertex Cover Formulation
A simple examination of the naive method of activating SNMP agents on every network router reveals that it is really overkill. Abstractly, to monitor all links in G, what is needed is to select a subset of SNMP-enabled routers such that every link in G is "covered"; that is, there is an SNMP agent running on at least one of the link's two endpoints. This is an instance of the well-known Vertex Cover (VC) problem over the network graph G [15]. Figure 1(a) depicts an example network graph and the nodes corresponding to a minimum VC of size 4. Even though VC is known to be NP-hard, it is possible to approximate the optimal VC within a factor of 2 using an O(m) algorithm based on determining a maximal matching of G [19].
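For completeness, a minimal sketch of this classical matching-based 2-approximation is given below (the function name and the edge list in the example are ours and purely illustrative).

def vertex_cover_2approx(edges):
    """2-approximate Vertex Cover via a maximal matching: scan the edges and,
    whenever an edge has neither endpoint covered yet (i.e., it can join the
    matching), add both of its endpoints to the cover."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Example (illustrative edge list):
print(vertex_cover_2approx([("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v1")]))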
B. Exploiting Knowledge of Router Mechanics and Traffic
Flows: The Weak Vertex Cover Formulation
Using a VC of the network graph G to determine the set of nodes on which to run SNMP agents can obviously result in a
substantial reduction in the number of activated SNMP agents needed to monitor link-bandwidth usage in G. Nevertheless, it is possible to do even better by exploiting knowledge of the traffic flows in the network and the mechanics of packet forwarding. To simplify the exposition, we start by describing our novel problem formulation and algorithmic solutions assuming a directed network graph model G, with the direction of each link capturing the flow of data packets into or out of each router node. We then demonstrate how our results can be extended to the more realistic scenario of an undirected graph model.
Fig. 1. (a) Network graph G and a minimum VC (four nodes). (b) A directed version of G and a minimum Weak VC (two nodes).
B.1 The Weak Vertex Cover Problem for Directed Graphs
Consider a router v in the directed network graph G and let I_v (O_v) denote the set of incoming (resp., outgoing) edges incident to v in G. The key observation here is that each such router satisfies a flow-conservation law which, simply put, states that the sum of the traffic flowing into v is approximately the same as the sum of the traffic flowing out of v. More formally, the flow-conservation law for a non-leaf node v (for simplicity, we assume that all links crossing the ISP network boundary are terminated by distinct leaf nodes in G) can be stated as the following equation:
$$\sum_{e \in I_v} b(e) \;=\; \sum_{e \in O_v} b(e) \qquad (1)$$
Note that, in practice, the above flow-conservation equation holds only approximately, since there can be (a) traffic directed to/from the router (e.g., OSPF protocol exchanges, management traffic, and ARP queries), (b) multicast traffic that is replicated along many output interfaces, and (c) delays and dropped packets in the router (under certain extreme congestion conditions). We believe, however, that these are infrequent conditions for routers in an ISP network that comprise only a very small proportion of the overall observed data traffic. Therefore, given a sufficiently large monitoring period, we expect the flow-conservation equation at each router to be approximately correct. Several measurements over backbone routers in Lucent's network have corroborated our expectations, showing that flow conservation holds with a relative error that is consistently below 0.05% [18].
The importance of the flow-conservation law for network monitoring lies in the observation that we no longer need to ensure that all edges of a router are "covered" by an SNMP agent: if a router has k links incident on it and the bandwidth utilization of k − 1 of the links is known, then the bandwidth utilization of the remaining link can be derived from the flow-conservation equation for that router. This observation leads to a novel vertex-covering formulation, termed Weak Vertex Cover.
Definition III.1: [Weak Vertex Cover] Given a directed network graph G, we define a set W of nodes to be a Weak Vertex Cover of G if, after initially marking each node in W as covered, it is possible to mark every node in G by iteratively performing the following two steps: (1) Mark every edge in G that is incident on a covered node as covered; and, (2) For any non-leaf node v in G, if deg(v) − 1 of the edges incident to v are marked covered, then mark vertex v as covered.
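The marking procedure of Definition III.1 transcribes almost literally into code; the sketch below (our own illustration, with edges represented as node pairs) checks whether a candidate set W is a Weak VC by iterating the two marking steps to a fixpoint.

def is_weak_vertex_cover(nodes, edges, W):
    """Literal transcription of Definition III.1: starting from the covered
    nodes W, repeat (1) mark every edge incident on a covered node and
    (2) mark any non-leaf node with at most one unmarked incident edge,
    until nothing changes; succeed if every node ends up marked."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    covered_nodes, covered_edges = set(W), set()
    changed = True
    while changed:
        changed = False
        for e in edges:                                   # step (1)
            if e not in covered_edges and (e[0] in covered_nodes or e[1] in covered_nodes):
                covered_edges.add(e)
                changed = True
        for v in nodes:                                   # step (2)
            if v not in covered_nodes and deg[v] > 1:
                marked = sum(1 for e in edges if v in e and e in covered_edges)
                if marked >= deg[v] - 1:
                    covered_nodes.add(v)
                    changed = True
    return covered_nodes == set(nodes)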
Based on the law of flow conservation, it is obvious that activating SNMP agents on a Weak VC for G is sufficient to derive the bandwidth usage on every link in the network graph. Thus, given flow conservation at each router of G, our efficient link-bandwidth monitoring problem becomes equivalent to determining a minimum Weak VC for G. For example, Figure 1(b) depicts a directed network G and the two nodes corresponding to a minimum Weak VC of G.
Note that every VC of a graph G is trivially also a Weak VC of G, but not necessarily a minimal one. (Figure 1 again provides a good example.) In fact, there are graphs for which the size of an optimal VC is arbitrarily larger than that of a Weak VC for the same graph (consider, for example, a long directed chain). Thus, exploiting the flow-conservation law can substantially improve the SNMP-monitoring overhead over a simple VC approach.
To the best of our knowledge, our Weak VC formulation represents a novel optimization problem that has not been studied in earlier research on combinatorial or graph algorithms. Unfortunately, as the following theorem shows, it is highly unlikely that we can find a minimum Weak VC in an efficient manner.
Theorem III.1: Given a directed graph G, discovering a minimum Weak VC is NP-hard.
A Near-Optimal Heuristic for Weak VC. An alternative way of viewing our Weak VC formulation is as follows. The law of flow conservation for every (non-leaf) router in G provides us with additional knowledge for the link-bandwidth unknowns (the b(e)'s) in the form of a linear system of equations that we can exploit to determine the values of all b(e)'s. The problem then is to determine a minimum subset of nodes W ⊆ V such that, when the b(e)'s incident to all nodes in W are determined, the linear system of flow-conservation equations can be solved for all the remaining b(e)'s. We now present a provably near-optimal greedy heuristic for Weak VC that is motivated by the above-stated formulation.
Let A · x = c denote the n × m linear system of flow-conservation equations corresponding to the (non-leaf) nodes in G (Equation (1)). (Without loss of generality, we assume that n is the number of non-leaf nodes in G.) Note that, initially, c = 0 (i.e., the zero n-vector), but this is not necessarily the case in the later stages of our algorithm, where some unknowns may have been specified by the routers selected in the cover. Also, let rank(A) denote the rank of matrix A, i.e., the number of linearly independent flow-conservation equations in the system.
Note that, if rank(A) = m, then the linear system of flow-conservation equations can be directly solved to determine the values of all unknown link bandwidths b(e), which obviously means that no nodes need to be selected for monitoring. Otherwise, the minimum required number of link-bandwidth variables that need to be specified in order to make the flow-conservation system solvable is exactly m − rank(A). Selecting a node v to run an SNMP agent means that all link-bandwidth variables attached to v become known and the flow-conservation equation for v becomes redundant. Thus, the original n × m system of flow-conservation equations is reduced to an (n − 1) × (m − deg(v)) system, where deg(v) is the degree of node v in G.
Consider step i of the node-selection process (i.e., after enabling SNMP at i selected nodes of G) and let G_i denote the network graph after the selected i nodes (and incident edges) are removed from G. Also, let A_i · x_i = c_i denote the n_i × m_i system of flow-conservation equations for G_i. Finally, R_i = m_i − rank(A_i) denotes the minimum number of link variables that need to be (directly) specified so that the remainder network graph G_i is fully covered (i.e., the flow-conservation system for G_i becomes solvable). Our greedy algorithm for Weak VC, termed GREEDYRANK, selects at each step the node that results in the maximum possible reduction in the minimum number of link variables required to make the flow-conservation equations solvable. That is, we select the node that maximizes the difference R_i − R_{i+1}. More formally, if we let A_i^v denote the (n_i − 1) × (m_i − deg_{G_i}(v)) matrix resulting from the deletion of the flow-conservation equation for v and all its attached link variables from A_i, then the GREEDYRANK strategy can be stated as depicted in Figure 2.
Algorithm GREEDYRANK(G)
Input: G = (V, E) is the directed network graph comprising n nodes and m links.
Output: W ⊆ V, a Weak VC of G.
1) Let A_0 · x = c_0 denote the n × m linear system of flow-conservation equations for G;
2) Set W := ∅, G_0 := G, n_0 := n, m_0 := m, i := 0;
3) while m_i − rank(A_i) > 0 do
4)    Set v_best := nil and maxRed := −∞;
      /* find the node v that maximizes the reduction in the required number of variables */
5)    for each node v in G_i do
6)       red(v) := (m_i − rank(A_i)) − ((m_i − deg_{G_i}(v)) − rank(A_i^v));
7)       if red(v) > maxRed then
8)          Set v_best := v and maxRed := red(v);
9)    Set W := W ∪ {v_best}, G_{i+1} := G_i − {v_best}, A_{i+1} := A_i^{v_best},
         n_{i+1} := n_i − 1, m_{i+1} := m_i − deg_{G_i}(v_best), and i := i + 1;
Fig. 2. Finding a Near-Optimal Weak Vertex Cover.
The following theorem bounds the worst-case behavior of our
GREEDYRANK algorithm.
Theorem III.2: Algorithm GREEDYRANK returns a solution to the Weak VC problem that is guaranteed to be within a logarithmic factor, O(log(m − rank(A))), of the optimal solution, where A is the coefficient matrix of the n flow-conservation equations in G.
Time Complexity. GREEDYRANK requires repeated matrix-rank computations in order to determine the "locally optimal" node to place in the cover at each step. However, as we show in the full version of this paper [18], the specific form of the coefficient matrix A allows us to reduce matrix-rank computation to a simple search for (undirected) connected components in G_i, which can be carried out very efficiently (essentially, in time near-linear in the size of G_i). Consequently, the worst-case running time of our GREEDYRANK algorithm remains a low-order polynomial in n and m.
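To make the greedy strategy concrete, the sketch below (our own illustration, not the paper's implementation) renders GREEDYRANK directly as described above, using a generic dense matrix-rank computation (numpy) instead of the connected-components optimization from the full paper; all function and variable names are ours.

import numpy as np

def build_system(nodes, edges):
    """Flow-conservation system: one row per non-leaf node, one column per
    directed edge; +1 if the edge enters the node, -1 if it leaves it."""
    deg = {v: 0 for v in nodes}
    for u, w in edges:
        deg[u] += 1
        deg[w] += 1
    rows = [v for v in nodes if deg[v] > 1]
    ridx = {v: i for i, v in enumerate(rows)}
    A = np.zeros((len(rows), len(edges)))
    for j, (u, w) in enumerate(edges):
        if w in ridx:
            A[ridx[w], j] += 1.0
        if u in ridx:
            A[ridx[u], j] -= 1.0
    return A, rows

def deficiency(A):
    """m_i - rank(A_i): link variables still needed before the system is solvable."""
    if A.size == 0:
        return A.shape[1]
    return A.shape[1] - np.linalg.matrix_rank(A)

def reduce_system(A, rows, edges, v):
    """Selecting v for SNMP: drop v's equation and the columns of its incident links."""
    keep_r = [i for i, u in enumerate(rows) if u != v]
    keep_c = [j for j, e in enumerate(edges) if v not in e]
    return (A[np.ix_(keep_r, keep_c)],
            [u for u in rows if u != v],
            [e for e in edges if v not in e])

def greedy_rank(nodes, edges):
    """At each step, pick the node whose selection most reduces the deficiency."""
    edges = list(edges)
    A, rows = build_system(nodes, edges)
    cover = []
    while deficiency(A) > 0:
        candidates = set(rows) | {u for e in edges for u in e}
        best = min(candidates,
                   key=lambda v: deficiency(reduce_system(A, rows, edges, v)[0]))
        cover.append(best)
        A, rows, edges = reduce_system(A, rows, edges, best)
    return cover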
B.2 Exploiting Knowledge of Network Flows
So far, our Weak VC formulation makes use of the law of flow conservation on each router, but it does not exploit knowledge of traffic flows in the network when trying to estimate link-bandwidth usage. This flow information essentially consists of the paths along which packets get routed in the network and can be computed using routing-protocol control information (e.g., the link-state database in OSPF or label switched paths in MPLS). In this section, we demonstrate how knowledge of the traffic flows in our directed network graph G can be exploited in conjunction with flow conservation to further reduce the required SNMP overhead for monitoring link bandwidth.
Consider a router v in G and let E_v and F_v denote the (sub)sets of links incident on v and packet flows routed through v, respectively. We can always uniquely partition E_v into a maximal collection of k_v subsets E_v^1, ..., E_v^{k_v} such that each flow f ∈ F_v only involves (a pair of) links in a single subset E_v^i, for some i. We say that such a partitioning of the links in E_v satisfies the non-overlapping flow property. An example of this flow-based link partitioning (with k_v = 2) for a node v is depicted in Figure 3. (In the worst case k_v = 1; i.e., a node's links cannot be partitioned into non-overlapping flows and, thus, they all belong to the same partition.) The key observation here is that the law of flow conservation in fact holds for each individual link partition E_v^i, i = 1, ..., k_v; thus, node v can be marked as covered as long as |E_v^i| − 1 links in E_v^i are covered for each i = 1, ..., k_v. This essentially means that we can infer the link-bandwidth utilization on each link incident to v based on knowing the bandwidth usage on only |E_v| − k_v links of v. (Note, of course, that these |E_v| − k_v links have to satisfy the condition outlined above; that is, only one link may be left unspecified in each partition E_v^i, i = 1, ..., k_v.) As an example, knowing the bandwidth usage on the three outgoing links in Figure 3 is sufficient to infer the bandwidth load on the two incoming links. This leads us to a generalized formulation of our Weak VC problem.
Fig. 3. Partitioning v's edges into k_v = 2 partitions (E_v^1 and E_v^2) satisfying the non-overlapping flow property.
Definition III.2: [Partitioned Weak Vertex Cover] Given a directed network graph G and a partitioning of the links in G that satisfies the non-overlapping flow property, we define a set W of nodes to be a Partitioned Weak VC of G if, after initially marking each node in W as covered, it is possible to mark every node in G by iteratively performing the following three steps: (1) Mark every link in G that is incident on a covered node as covered; (2) For any node v in G, if in any link partition E_v^i (1 ≤ i ≤ k_v) there are |E_v^i| − 1 links marked covered, then mark the remaining link in E_v^i as covered; and, (3) For any node v in G, if in every link partition E_v^i (1 ≤ i ≤ k_v) there are at least |E_v^i| − 1 links marked covered, then mark vertex v as covered.
Generating the maximal link partitioning of E_v (for each v in G) that satisfies the non-overlapping flow property is fairly straightforward: the idea is to start by placing each link in its own partition and iteratively "merge" partitions that share flows, as sketched below [18]. Based on the law of flow conservation for each link partition, it is easy to see that activating SNMP agents on a Partitioned Weak VC for G is sufficient to derive the bandwidth usage on every link in the network graph. Thus, given traffic-flow information in the network, our low-overhead link-bandwidth monitoring problem becomes equivalent to determining a minimum Partitioned Weak VC for G. This problem is clearly NP-hard (being a generalization of Weak VC); however, a version of our GREEDYRANK algorithm can be used to give a fast approximate solution with a guaranteed logarithmic worst-case performance bound. (Note that, for Partitioned Weak VC, removing a node v from the linear system of flow-conservation equations means that all k_v equations corresponding to v are removed.) Due to lack of space, the full details can be found in [18].
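The partition-merging step is essentially a union-find computation. In the sketch below (our own illustration; link and flow labels are hypothetical), every incident link starts in its own partition and the partitions of any two links used by the same flow at the router are merged.

def link_partitions(incident_links, flows_through_v):
    """Maximal partitioning of a router's incident links such that every flow
    through the router only touches links of a single partition (the
    non-overlapping flow property)."""
    parent = {e: e for e in incident_links}
    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]     # path compression
            e = parent[e]
        return e
    def union(a, b):
        parent[find(a)] = find(b)
    for flow_links in flows_through_v:        # the links this flow uses at the router
        links = [e for e in flow_links if e in parent]
        for a, b in zip(links, links[1:]):
            union(a, b)
    groups = {}
    for e in incident_links:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())

# Hypothetical example with incoming links i1, i2 and outgoing links o1, o2, o3,
# where flow f1 uses (i1, o1), f2 uses (i1, o2), and f3 uses (i2, o3):
# link_partitions(["i1", "i2", "o1", "o2", "o3"], [("i1", "o1"), ("i1", "o2"), ("i2", "o3")])
# -> [["i1", "o1", "o2"], ["i2", "o3"]]   (k_v = 2 partitions, as in Figure 3)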
Theorem III.3: For the Partitioned Weak VC problem, algorithm GREEDYRANK runs in low-order polynomial time and returns a solution that is guaranteed to be within a logarithmic factor, O(log(m − rank(A))), of the optimal solution, where A is the coefficient matrix of the (per-partition) flow-conservation equations in G.
B.3 Extension to Undirected Network Graphs
Our discussion of link-bandwidth monitoring has so far focused on the case of a directed network graph model G, where packet traffic on each physical link is uni-directional and known beforehand. In general, physical network links are bi-directional, with data packets flowing both to and from routers on the same link. We now briefly describe how our results and algorithmic techniques extend to this more general and realistic scenario of an undirected network graph model.
The basic idea is to "expand" the network graph G into a directed graph G' by modeling each bi-directional physical link in G as two directed edges (in opposing directions) in G', thus capturing both directions of packet flow for each link. Of course, one (or both) of the directed edges for a link can be left out of the model if we know that it is not being used for actual traffic in the network, e.g., based on knowledge of the traffic flows. The flow-conservation law (Equation (1)) then holds for each router in G' and its directed in- and out-links created in this manner. Thus, our solutions for Weak VC and Partitioned Weak VC can be directly applied to this "expanded" directed network graph G'. More details can be found in [18].
C. Monitoring Flow-Bandwidth Utilization
Consider the (undirected) network graph G with a set of packet flows F = {f_1, f_2, ...} routed through its nodes. Enabling RMON or NetFlow on a router v of G enables the bandwidth utilization for the set F_v of all flows routed through v to be measured; i.e., v directly covers all the flows in F_v. It is straightforward to see that the problem of determining a minimum subset of RMON/NetFlow-enabled routers such that every flow in F is directly covered is essentially an instance of the NP-hard Set Cover problem [15]. Thus, a greedy Set-Cover heuristic (sketched below) can be used to return an approximate solution that is guaranteed to be within a logarithmic factor, O(log |F|), of the optimal in time polynomial in n and |F| [18].
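A minimal sketch of this greedy heuristic is given below (our own illustration): each candidate router v is described by the set F_v of flows it can directly cover, and routers are enabled until every flow is covered.

def greedy_flow_cover(flows_by_router, all_flows):
    """Greedy Set Cover: repeatedly enable RMON/NetFlow on the router that
    directly covers the largest number of still-uncovered flows."""
    uncovered = set(all_flows)
    chosen = []
    while uncovered:
        router = max(flows_by_router, key=lambda v: len(uncovered & flows_by_router[v]))
        gained = uncovered & flows_by_router[router]
        if not gained:          # remaining flows traverse no candidate router
            break
        chosen.append(router)
        uncovered -= gained
    return chosen

# flows_by_router maps each router v to the set F_v of flows routed through it.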
Exploiting Link Bandwidth Information for Covering Flows. The performance overhead imposed by tools like RMON or NetFlow on routers and network traffic is typically significantly higher than that of SNMP, mainly due to their much finer granularity of data collection. The problem, of course, is that SNMP agents cannot collect and provide traffic data at the required granularity of packet flows; only aggregate information on link-bandwidth utilization can be obtained through SNMP. The crucial observation here is that knowledge of aggregate link bandwidths (obtained via SNMP) provides us with a system of per-link linear equations on the unknown flow-bandwidth utilizations that can be exploited to significantly reduce the number of RMON/NetFlow probes required for monitoring flow-bandwidth usage. (Each such equation basically states that the aggregate bandwidth of a link e is equal to the sum of the bandwidth utilizations of the flows traversing it, i.e., b(e) = Σ_{f ∈ F_e} b(f) [18].) The resulting problem is similar to Weak VC, except that we are interested in a minimum subset W of routers such that determining the bandwidth utilization of flows passing through routers in W renders the system of per-link equations solvable. The following theorem establishes the intractability of this optimization problem and the near-optimality of our general GREEDYRANK strategy. Due to lack of space, the full details can be found in [18].
Theorem III.4: Given knowledge of the aggregate link-bandwidth utilizations in G, the problem of determining a minimum subset W of routers such that enabling RMON/NetFlow on every v ∈ W allows the determination of the flow-bandwidth utilizations for every flow in F is NP-hard. Further, an appropriately modified version of our GREEDYRANK strategy returns a solution to this problem that is guaranteed to be within a logarithmic factor, O(log(|F| − rank(B))), of the optimal, where B is the m × |F| coefficient matrix of the per-link flow-bandwidth equations in G. This approximation factor is the best possible (assuming P ≠ NP).
IV. MONITORING NETWORK LATENCY
We next turn our attention to the problem of measuring round-trip latencies for a set of network paths R in the (undirected) network graph G, where each path is a sequence of adjacent links in G. Such network latency measurements are crucial for providing QoS guarantees to end applications (e.g., voice over IP), traffic engineering, ensuring SLA compliance, fault and congestion detection, performance debugging, network operations, and dynamic replica selection on the Web.
Most previous proposals for measuring round-trip times of
network paths rely on probes, which are simply ICMP echo re-
quest packets. Existing systems typically belong to one of two
categories. The first category includes systems like WIPM [20],
AMP [21] and IDMaps [3] that deploy special measurement
servers at strategic locations in the network. Round-trip times
between each pair of servers are measured using probes, and
these times are subsequently used to approximate the latency
of arbitrary network paths. The measurement server approach,
while popular, suffers from the following two drawbacks. First, the cost of deploying and managing hundreds of geographically
distributed servers can be significant due to the required hard-
ware and software infrastructure as well as the human resources.
Second, the accuracy of the latency measurements is highly
dependent on the number of measurement servers. With few
servers, it is possible for significant errors to be introduced when
round-trip times between the servers are used to approximate
arbitrary path latencies. The second category of tools for mea-
suring path latencies include pathchar [6] and skitter [22].
Both tools measure the round-trip times for paths originating at
a small set of sources (between one and ten) by sending probes
with increasing TTL values from each source to a large set of
destinations. A shortcoming of these tools is that they can only
measure latencies of a limited set of paths that begin at one of
the sources from which ICMP probes are sent.
In this section, we present our probing-based technique that
alleviates the drawbacks of previous methods. In our approach,
path latencies are measured by transmitting probes from a single
point-of-control (i.e., the NOC). Consequently, since our tech-
nique does not require special instrumentation to be installed in
the network, it is cost-effective and easy to deploy. Further, un-
like existing approaches, our method allows for the latencies of an arbitrary set of network paths R to be measured exactly, and is thus both accurate and flexible. Our schemes achieve this by exploiting the ability within IP to explicitly route packets using either source routing or encapsulation of "IP in IP". We demonstrate that, for measuring the latency of a given set of paths R, there exists a wide range of probing strategies that impose varying amounts of load on the network infrastructure. While the problem of selecting the optimal set of probes that minimizes the network bandwidth consumed is NP-hard, we show that this problem can be mapped to the well-known Facility Location Problem (FLP), for which efficient approximation algorithms with guaranteed performance ratios have been proposed in the literature [16], [23].
A. Overview and Problem Formulation
In our approach for measuring latency, explicitly routed probes are transmitted along paths originating at the NOC (i.e., node v_0). (We discuss source routing and IP encapsulation, the two mechanisms within IP for controlling the path traversed by a packet, in more detail in Section IV-D.) The round-trip latency of a single path P is measured by sending the following two probe packets:
1. The first probe packet is sent from v_0 to one of the end nodes of P, say u, along the shortest path SP_u between them in G. The probe then returns to v_0 along the reverse of SP_u.
2. The second probe packet is sent from v_0 to the other end node of P (via u) along the path SP_u · P. The probe then returns to v_0 along the reverse of SP_u · P.
The round-trip latency of path P is computed as the difference of the round-trip times of the two probes traversing the paths SP_u and SP_u · P.
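In symbols (with RTT(·) denoting the measured round-trip time of a probe, a shorthand we introduce only for this remark), the scheme computes
$$\mathrm{lat}(P) \;=\; \mathrm{RTT}(SP_u \cdot P) \;-\; \mathrm{RTT}(SP_u),$$
where u is the end node of P visited by the base probe.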
In the remainder of this paper, we will represent each probe by the forward path traversed by it from v_0, since the return path is symmetric to the forward path (in the reverse direction). Note that the first node of each probe is always v_0. Also, in this paper, we will not consider complex probing techniques for measuring latency in which probes follow arbitrary paths that cannot be decomposed into symmetric forward and reverse path segments. (One such complex probing technique for measuring the latency of a path P would be to send a pair of probes: the first probe makes a round-trip from v_0 to an internal node, say w, of P; the second probe travels from v_0 to one of the end nodes of P via w, then to the other end node of P, and finally back to v_0 via w.)
Fig. 4. Sets of Probes for Measuring Latency of Paths.
In order to measure the latency of a set of paths R, we need to employ a set of probes Q such that, for each path P ∈ R, Q contains a pair of probes SP_u and SP_u · P, where u is an end node of P. We refer to SP_u as the base probe for P and to SP_u · P as the measurement probe for P. Further, we refer to probes in Q that are not measurement probes for any path in R as anchor probes, and to the last node visited by an anchor probe as its anchor node.
There are a number of different sets of probes that can be used to measure the latency of R. One obvious choice for Q is the set that contains, for each path P ∈ R, the following two probes: SP_u and SP_u · P (assuming that u is the end node of P that is closest to v_0). However, as the following example illustrates, there are a number of other possibilities for set Q, several of which contain a smaller number of probes and traverse fewer links.
Example IV.1: Consider the network graph shown in Figure 4(a), containing the set of paths R = {P1, P2, P3}. The obvious set of probes Q for measuring the latency of R is illustrated in Figure 4(a), where the measurement of each path is optimized independently by sending the base probe to the end node closest to v_0. This results in a distinct pair of probes, SP_u and SP_u · P, for each path P ∈ R. Figure 4(b) illustrates a different set of probes Q' for measuring R that is optimal with respect to both the number of probes and the number of traversed links. In Q', two of the paths share the same base probe, and the measurement probe for one path also serves as the base probe for an adjoining path. (Note that Q' contains a single anchor probe, whereas Q contains three distinct anchor probes.) This sharing of probes among paths reduces the number of probes. Although two of the paths are measured with longer measurement probes in Q' than in Q, this overhead is offset by the savings due to the sharing of probes in Q', thereby resulting in an overall reduction in the number of traversed links.
Ideally, we would prefer a set Q of probes that traverses as few links as possible to measure R. This is because the total number of links traversed by the probes in Q is a good measure of the additional load that the probes impose on network links. Minimizing this additional network traffic due to probes is extremely important, since we need to monitor path latencies continuously, causing probes to be transmitted frequently (e.g., every fifteen minutes). Thus, our efficient latency-monitoring problem can be formally stated as follows.
Problem Statement [Low-Overhead Path-Latency Monitoring]: Given a set of paths R, compute a set of probes Q such that (1) Q measures the latency of R; that is, for every path P ∈ R, Q contains a pair of probes SP_u and SP_u · P; and (2) Q is optimal; that is, the total number of links traversed by probes in Q is minimum.
In the following subsection, we address the above problem of computing the optimal set of probes for measuring the latency of the paths in R. We assume that, for any pair of paths P and P' in R, P is not a prefix (or suffix) of P'. The reason for this assumption will become clear in the next subsection. Note, however, that this assumption is not restrictive since, if P is a prefix of P', then P' can be split into two non-overlapping path segments, P and the remainder of P', and its latency can be computed as the sum of the latencies of the two segments.
B. Computing an Optimal Set of Probes
As illustrated in Example IV.1, a naive approach that adds to Q the optimal pair of base and measurement probes for each path in R considered independently may not result in the optimal set of probes. This is because (1) measurement probes for multiple paths can share a common base probe, and (2) the measurement probe for one path can serve as a base probe for an adjoining path. Thus, more sophisticated algorithms are needed for computing an optimal solution. Unfortunately, as the following theorem states, the problem of computing the optimal set of probes is NP-hard even if every path in R is restricted to be a single link.
Theorem IV.1: Given a graph G and a set of paths R, the problem of computing the optimal set of probes to measure the latency of R is NP-hard.
In the following, we map the problem of computing the optimal set of probes to the Facility Location Problem (FLP). Since efficient polynomial-time algorithms for approximating the FLP exist in the literature, these can then be utilized to compute a near-optimal set of probes.
Before we present our FLP reduction, we develop some additional notation. For a path P, we denote by s(P) and t(P) the first and last node of P, respectively. Further, G_R = (V_R, E_R) denotes the undirected, distance-weighted graph induced by R; thus, V_R = {s(P), t(P) : P ∈ R} is the set of all the end nodes of the paths in R, and E_R contains, for every path in R, an edge connecting its two end nodes. (Note that G_R is actually a multigraph, since there may be multiple edges between a pair of nodes in V_R.) Each edge in E_R is labeled with its corresponding path, say P, in R, and has an associated weight equal to |P|. For a pair of nodes u and v in V_R, we denote by W_{u,v} the shortest path (with respect to the sum of edge weights) from u to v in G_R; essentially, W_{u,v} is the path from u to v in the shortest-path tree rooted at u in G_R. Note that, since every edge in G_R corresponds to some path in R, W_{u,v} can be viewed as a concatenation of paths in R. Finally, we use |W_{u,v}| to denote the sum of edge weights for W_{u,v} in G_R.
We are now in a position to characterize the composition of sets of probes. A set Q of probes for measuring the latency of paths in R consists of the following two disjoint subsets:
1. A set of anchor probes Q_A (corresponding to a set of anchor nodes A ⊆ V_R), and
2. A set of measurement probes Q_M for measuring the latency of paths in R.

Algorithm OPTIMALPROBES(G, R, A)
Input: G is a network graph.
       R = a set of paths whose latency is to be measured.
       A = a set of anchor nodes.
Output: Q_M = a set of measurement probes that is optimal with respect to A and R.
1) Q_M := ∅;
2) for each path P ∈ R do
3)    Let a ∈ A and u ∈ {s(P), t(P)} be such that |SP_a| + |W_{a,u}| is minimum
      (in case of a tie between nodes of A, the node with the smaller index is chosen);
4)    Q_M := Q_M ∪ {SP_a · W_{a,u} · P};
Fig. 5. Finding an Optimal Set of Probes for Anchor Nodes A.
Since the shortest possible anchor probe in Q_A corresponding to each anchor node a ∈ A is SP_a, we take Q_A = {SP_a : a ∈ A}. In Q_M, there is a separate measurement probe for each path in R (since, for any pair of paths P, P' ∈ R, P cannot be a prefix of P'). Further, every measurement probe in Q_M is a concatenation of a single anchor probe and one or more paths in R. The shortest possible measurement probe in Q_M to measure the latency of a path P has length equal to
$$\min_{a \in A,\; u \in \{s(P),\, t(P)\}} \big\{\, |SP_a| + |W_{a,u}| \,\big\} \;+\; |P| .$$
Here, the minimization essentially captures the shortest possible path from v_0 to one of the end nodes u of P that begins with an anchor probe SP_a and is followed by paths in R.
Thus, for a given set of anchor nodes, it is possible to compute the optimal set of measurement probes Q_M for measuring R. Algorithm OPTIMALPROBES in Figure 5 computes this optimal set Q_M for a given A and R. The computed set Q_M is optimal since, for each path P ∈ R, OPTIMALPROBES adds to Q_M the measurement probe containing the smallest number of links. In addition, we can prove that the set Q_A ∪ Q_M measures the latency of all paths in R. For this, we need to show that, for every path P, the base probe SP_a · W_{a,u} is either in Q_A or is added to Q_M. If W_{a,u} is empty (i.e., u = a), then the base probe SP_a is in Q_A. Otherwise, if P' is the final path in W_{a,u}, then SP_a · W_{a,u} is the measurement probe for P' with the smallest number of links and is thus added to Q_M. Thus, the set of probes Q_A ∪ Q_M measures the latency of R.
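A compact rendition of OPTIMALPROBES is sketched below (our own illustration, not code from the paper): it builds the induced multigraph G_R, computes the distances |W_{a,u}| from every anchor with Dijkstra's algorithm, and then picks, for each path, the anchor and end node minimizing |SP_a| + |W_{a,u}|. It assumes that every path's end nodes are reachable from some anchor in G_R (e.g., one anchor per connected component), and ties are broken arbitrarily rather than by node index.

import heapq

def optimal_probes(anchor_dists, paths):
    """OPTIMALPROBES (Figure 5): for each path P, choose the anchor a and end
    node u of P minimizing |SP_a| + |W_{a,u}|; the measurement probe is then
    SP_a . W_{a,u} . P, of length |SP_a| + |W_{a,u}| + |P|.

    anchor_dists : {a: |SP_a|}            shortest-path length from the NOC to anchor a
    paths        : {pid: (s, t, length)}  end nodes and link count of each path in R"""
    adj = {}                              # induced multigraph G_R, edge weights |P|
    for s, t, w in paths.values():
        adj.setdefault(s, []).append((t, w))
        adj.setdefault(t, []).append((s, w))

    def dijkstra(src):                    # |W_{src,u}| for every node u reachable in G_R
        dist, heap = {src: 0}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for v, w in adj.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    reach = {a: dijkstra(a) for a in anchor_dists}
    probes = {}
    for pid, (s, t, length) in paths.items():
        a, u = min(((a, u) for a in anchor_dists for u in (s, t) if u in reach[a]),
                   key=lambda au: anchor_dists[au[0]] + reach[au[0]][au[1]])
        probes[pid] = {"anchor": a, "end_node": u,
                       "probe_links": anchor_dists[a] + reach[a][u] + length}
    return probes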
Theorem IV.2: Given a set of anchor nodes A and paths R, Algorithm OPTIMALPROBES computes an optimal set of measurement probes Q_M such that Q_A ∪ Q_M measures the latency of R.
From Theorem IV.2, it follows that, for a given set of anchor nodes A, the set Q = Q_A ∪ Q_M is the optimal set of probes for measuring R among sets for which Q_A is the set of anchor probes. Further, the number of links traversed by Q is:
$$\sum_{a \in A} |SP_a| \;+\; \sum_{P \in R} \Big( \min_{a \in A,\; u \in \{s(P),\, t(P)\}} \big\{\, |SP_a| + |W_{a,u}| \,\big\} + |P| \Big). \qquad (2)$$
Thus, if A is a set of anchor nodes for an optimal set of probes, then Q_A ∪ Q_M (where Q_M is computed by OPTIMALPROBES) is an optimal set of probes for measuring R; that is, Q_A ∪ Q_M minimizes the value of Equation (2). As a result, we have transformed the problem of computing the optimal set of probes into that of computing a set of anchor nodes A that minimizes Equation (2). Once A is known, algorithm OPTIMALPROBES can be used to compute the measurement probes Q_M such that Q_A ∪ Q_M is the optimal set of probes.
The above minimization problem maps naturally to the Facility Location Problem (FLP) [16], [23]. The FLP is formulated as follows: let C be a set of clients and F be a set of facilities such that each facility can "serve" every client. There is a cost cost(f) of "choosing" a facility f ∈ F and a cost d(f, c) of serving client c ∈ C by facility f ∈ F. The problem asks us to choose a subset of facilities F' ⊆ F such that the sum of the costs of the chosen facilities plus the sum of the costs of serving every client by its closest chosen facility is minimized; that is, to minimize
$$\sum_{f \in F'} \mathrm{cost}(f) \;+\; \sum_{c \in C}\; \min_{f \in F'} d(f, c).$$
The problem of computing the set of anchor nodes A that minimizes Equation (2) can be mapped to the FLP as follows. Let C be the set of paths R and let F be the set of candidate anchor nodes V_R. The cost of choosing a facility a, cost(a), is |SP_a|, the length of the shortest path from v_0 to a. The cost of serving client P from facility a, d(a, P), is
$$|SP_a| \;+\; \min_{u \in \{s(P),\, t(P)\}} |W_{a,u}| \;+\; |P| ,$$
which is the sum of the lengths of SP_a and P and of the shortest path from a to one of the end nodes of P in G_R. Thus, the set F' computed for the FLP corresponds to our desired optimal set A of anchor nodes.
The FLP is NP-hard; however, it can be reduced to an instance of the Set Cover problem and then approximated within a factor of O(log |C|) in time polynomial in |C| and |F| [16]. Thus, we can compute a provably near-optimal set Q of probes for measuring the paths in R by first using Hochbaum's FLP heuristic [16] to find a near-optimal set of anchor nodes A, and then running OPTIMALPROBES to find the optimal set of measurement probes Q_M for A. The set of probes Q = Q_A ∪ Q_M is then guaranteed to be within a factor of O(log |R|) of the optimal solution for measuring R [18].
C. Minimizing the Number of Probes
Suppose that, instead of minimizing the number of traversed links, we are interested in computing the set Q with the minimum number of probes. In this case, it is possible to compute the optimal set of probes by invoking algorithm OPTIMALPROBES with the set of paths R whose latency is to be measured and a set A that contains one (arbitrary) node from each connected component of G_R. The final set Q = Q_A ∪ Q_M contains one anchor probe per connected component of G_R and one measurement probe per path, which is optimal with respect to the number of probes.
D. Implementation Issues
Our approach for measuring latency is highly dependent on
being able to explicitly route probe packets along specific paths.
Loose source routing and encapsulation of IP in IP are two
mechanisms for controlling routes followed by packets. We pre-
fer encapsulation over loose source routing due to the following
reasons [24]. First, Internet routers exhibit performance prob-
lems when forwarding packets that contain IP options, includ-
ing the IP source routing option. Second, the source routing
option is frequently disabled on Internet routers due to security
problems associated with its use. Finally, IP allows for at most
40 bytes of options, which restricts the number of IP addresses
through which a packet can be routed using source routing to be
no more than 10.
While encapsulation addresses some of the problems with
source routing, unwrapping the header in encapsulated packets
still incurs overhead at routers and encapsulated packets are typ-
ically larger than source routed packets. Both processing over-
head and packet sizes can be reduced significantly by using as
few headers as possible in each probe packet. We can achieve
this by splitting the path for a probe packet into maximal disjoint
path segments, such that each path segment is consistent with
the route computed by the underlying routing protocol (e.g.,
OSPF). Then the probe packet can be routed along the path by
using one header per path segment that contains the IP address
of the endpoint of the segment that is not shared with the pre-
vious segment. Note that the final measured round-trip times
must be adjusted to account for the overhead of processing the
encapsulated packets at intermediate routers.
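The segment-splitting idea can be sketched as a greedy scan of the probe
path. In the sketch below, "consistent with the route computed by the
underlying routing protocol" is approximated by "is a shortest path under
unit link weights"; this ignores OSPF link weights and equal-cost multipath,
and the function names are ours.

    from collections import deque

    def hop_dist(adj, src, dst):
        """Hop-count shortest-path length between src and dst (BFS)."""
        if src == dst:
            return 0
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    if v == dst:
                        return dist[v]
                    queue.append(v)
        return float("inf")

    def split_into_segments(adj, probe_path):
        """Greedily split probe_path into maximal segments, each of which is
        a shortest path in the graph and hence needs only one encapsulation
        header carrying the far endpoint of the segment."""
        segments = []
        s = 0
        while s < len(probe_path) - 1:
            e = s + 1
            # Extend while the sub-path v_s .. v_{e+1} is still shortest.
            while (e + 1 < len(probe_path) and
                   hop_dist(adj, probe_path[s], probe_path[e + 1]) == e + 1 - s):
                e += 1
            segments.append(probe_path[s:e + 1])
            s = e
        return segments

    # One encapsulation header per segment, addressed to the segment endpoint
    # not shared with the previous segment:
    # headers = [seg[-1] for seg in split_into_segments(adj, path)]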
V. SIMULATION RESULTS
In this section, we present simulation results comparing the
performance of the various algorithms that we have devel-
oped for both the link-bandwidth and path-latency measurement
problems. The main objective of the simulation results is to
demonstrate that our proposed algorithmic solutions are not only
theoretically sound, with good guaranteed worst-case bounds, but
also give significant benefits over naive solutions in prac-
tice (i.e., on average) for a wide variety of realistic network
topologies. The simulations are based on network topologies
generated using the Waxman Model [17], which is a popular
topology model for networking research (e.g., [4]). Different
network topologies are generated by varying three parameters:
(1) $n$, the number of nodes in the network graph; (2) $\alpha$, a parameter
that controls the density of short edges in the network; and (3) $\beta$, a
parameter that controls the average node degree.
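A minimal generator following Waxman's original formulation [17] is sketched
below, assuming nodes placed uniformly in the unit square: pair $(u, v)$ is
connected with probability beta * exp(-d(u, v) / (alpha * L)), where d is
Euclidean distance and L the maximum pairwise distance. The parameter values
used in the simulations reported here are not reproduced, and the function
name is ours.

    import math
    import random

    def waxman_graph(n, alpha, beta, seed=None):
        """Waxman random topology: beta scales the average node degree;
        alpha controls how quickly edge probability decays with distance
        (i.e., the density of short versus long edges)."""
        rng = random.Random(seed)
        pos = [(rng.random(), rng.random()) for _ in range(n)]
        L = max((math.dist(pos[u], pos[v])
                 for u in range(n) for v in range(u + 1, n)), default=1.0)
        adj = {u: set() for u in range(n)}
        for u in range(n):
            for v in range(u + 1, n):
                d = math.dist(pos[u], pos[v])
                if rng.random() < beta * math.exp(-d / (alpha * L)):
                    adj[u].add(v)
                    adj[v].add(u)
        return adj, pos

For the bandwidth experiments, each undirected edge would additionally be
given a random orientation, as noted in the footnote of Sec. V-A.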
A. Bandwidth Measurement
For the link-bandwidth measurement problem, we compare the performance of
three algorithms: the maximal matching heuristic for simple VC (Sec. III-A),
and two algorithms based on our Weak VC formulation: a variant of the maximal
matching heuristic and our GREEDYRANK algorithm (Sec. III-B.1). Our maximal
matching variant for Weak VC basically ensures that all transitively-specified
edges (based on flow conservation) are eliminated from the edge set whenever a
new edge enters the matching. The comparison is in terms of the number of
nodes that need to run SNMP in order to measure the bandwidth of each link in
the generated network graphs.⁴ We denote the number of SNMP activations for
these three algorithms by $S_{\mathit{match}}$, $S_{\mathit{weak}}$, and
$S_{\mathit{rank}}$, respectively.
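For concreteness, the matching-based baseline ($S_{\mathit{match}}$) can be
sketched as the classical maximal-matching 2-approximation for Vertex Cover;
the Weak VC variant would additionally discard, after every matching step,
edges whose bandwidth becomes derivable through flow conservation (that
bookkeeping depends on Sec. III-B and is omitted here). The function name is
ours.

    def matching_vertex_cover(adj):
        """Maximal-matching heuristic for Vertex Cover: scan the edges and,
        whenever both endpoints of an edge are still uncovered, add both to
        the cover. The endpoints of the resulting maximal matching cover
        every edge (a 2-approximation); each chosen node is an SNMP
        activation site."""
        cover = set()
        for u in adj:
            for v in adj[u]:
                if u not in cover and v not in cover:
                    cover.add(u)
                    cover.add(v)
        return cover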
Table II presents one set of simulation results; we have obtained similar
results for other parameter settings. The first column in the table
represents the average degree of the nodes in the generated network graph
(which increases with larger values of $\beta$). Our results indicate that
GREEDYRANK is the clear winner, reducing the number of SNMP activations by as
much as 67% over the naive, "brute-force" approach, and by as much as 35%
over its closest matching-based competitor.
⁴ For the bandwidth measurement simulations, each undirected graph generated
by the Waxman model is converted into a directed graph by randomly fixing the
direction of each of its edges.
TABLE II
COMPARISON OF LINK-BANDWIDTH MEASUREMENT ALGORITHMS

Avg. Degree   S_match   S_weak   S_rank   S_rank / n
4.4           387       255      165      0.33
8.6           441       372      254      0.51
12.6          453       408      307      0.61
16.9          466       431      334      0.67
B. Latency Measurement
For the latency measurement simulations, we compare the performance of two
algorithms: the naive approach, where the optimal probes are computed
independently for each path (Sec. IV-A), and our FLP-based approach
(Sec. IV-B). We compare the performance of these algorithms in terms of both
the total number of links traversed by the probe packets, denoted by $L$, and
the number of probe packets transmitted, denoted by $N$, where
$N = |Q_A| + |Q_M|$.

For each network graph generated using the Waxman model, a random set of 20
paths is considered. We vary the "topology" of the set of generated paths
using a parameter $n_e$, which represents the number of end nodes that serve
as starting points for the 20 generated paths. Thus, a smaller value of $n_e$
means that more paths are terminated by the same end node. The node
representing the NOC is a randomly selected node that is not incident on any
of the paths.
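Such path sets can be mimicked with a simple random-walk generator; the
sketch below is only indicative, since the exact path-length range used in
the simulations is not reproduced, and the parameter defaults and function
name are placeholders of ours.

    import random

    def random_path_set(adj, num_paths=20, num_end_nodes=4, max_links=10,
                        seed=None):
        """Pick num_end_nodes starting end nodes and grow each path as a
        simple random walk of at most max_links links; paths are assigned to
        starting nodes round-robin, so a smaller num_end_nodes makes more
        paths share an end node."""
        rng = random.Random(seed)
        starts = rng.sample(sorted(adj), num_end_nodes)
        paths = []
        for i in range(num_paths):
            walk = [starts[i % num_end_nodes]]
            while len(walk) <= max_links:
                nxt = [v for v in adj[walk[-1]] if v not in walk]
                if not nxt:
                    break
                walk.append(rng.choice(nxt))
            if len(walk) > 1:
                paths.append(walk)
        return paths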
Table III presents one set of simulation results; we have observed similar
trends for other parameter settings. The results indicate that our FLP-based
heuristic is more effective than the naive approach in terms of both the
total number of links traversed and the total number of probe packets
transmitted.
TABLE III
COMPARISON OF LATENCY MEASUREMENT ALGORITHMS

n_e   L_naive   L_FLP   L_FLP / L_naive   N_naive   N_FLP   N_FLP / N_naive
2     684       542     0.79              37        22      0.59
4     672       560     0.83              37        24      0.65
8     678       594     0.88              38        28      0.74
16    680       628     0.92              39        32      0.82
VI. CONCLUSIONS
In this paper, we have addressed the problem of efficiently
monitoring bandwidth utilization and path latencies in IP net-
works. Unlike earlier approaches, our measurement architecture
assumes a single point-of-control in the network (correspond-
ing to the NOC) that is responsible for gathering bandwidth and
latency information using widely-deployed management tools,
like SNMP, RMON/NetFlow, and explicitly-routed IP probes.
We have demonstrated that our measurement model gives rise
to new optimization problems, most of which prove to be NP-hard.
We have also developed novel approximation algorithms
for these optimization problems and proved guaranteed upper
bounds on their worst-case performance. Finally, we have ver-
ified the effectiveness of our monitoring algorithms through a
preliminary simulation evaluation.
Although this paper has focused on a single point-of-control
measurement architecture, our approach is also readily appli-
cable to a distributed-monitoring setting, where a number of
NOCs/“monitoring boxes” have been distributed over a large
network area with each NOC responsible for monitoring a
smaller region of the network. Our algorithms can then be used
to minimize the monitoring overhead within each individual re-
gion. The problem of optimal distribution and placement of
NOCs across a large network can be formulated as a variant
of the well-known “k-center problem” (with an appropriately-
defined distance function) [4].
Acknowledgement: We would like to thank Amit Kumar for suggesting the
efficient rank-computation algorithm used in GREEDYRANK.
REFERENCES
[1] W. Stallings, “SNMP, SNMPv2, SNMPv3, and RMON 1 and 2”, Addison-
Wesley Longman, Inc., 1999, (Third Edition).
[2] “NetFlow Services and Applications,” Cisco Systems White Paper, 1999.
[3] P. Francis, S. Jamin, V. Paxson, L. Zhang, D. F. Gryniewicz, and Y. Jin,
“An Architecture for a Global Internet Host Distance Estimation Service,”
in Proc. of IEEE INFOCOM’99, March 1999.
[4] S. Jamin, C. Jin, Y. Jin, Y. Raz, Y. Shavitt, and L. Zhang, “On the Place-
ment of Internet Instrumentation,” in Proc. of IEEE INFOCOM’2000,
March 2000.
[5] W. Theilmann and K. Rothermel, “Dynamic Distance Maps of the Inter-
net,” in Proc. of IEEE INFOCOM’2000, March 2000.
[6] V. Jacobson, “pathchar - A Tool to Infer Characteristics of Internet Paths,”
April 1997, ftp://ftp.ee.lbl.gov/pathchar.
[7] A.B. Downey, “Using pathchar to Estimate Internet Link Characteristics,”
in Proc. of ACM SIGCOMM’99, August 1999.
[8] J.-C. Bolot, “End-to-End Packet Delay and Loss Behavior in the Internet,”
in Proc. of ACM SIGCOMM’93, September 1993.
[9] K. Lai and M. Baker, “Measuring Bandwidth,” in Proc. of IEEE INFO-
COM’99, March 1999.
[10] M. Cheikhrouhou, J. Labetoulle, “An Efficient Polling Layer for SNMP,”
Proc. 2000 IEEE/IFIP Network Operations & Management Symposium,
April 2000.
[11] Y. Yemini, G. Goldszmidt, S. Yemini, “Network Management by Dele-
gation,” Proc. Intl Symposium on Integrated Network Management, April
1991.
[12] D. Breitgand, D. Raz, Y. Shavitt, “SNMP GetPrev: An Efficient Way to
Access Data in Large MIB Tables,” Bell Labs Tech. Memorandum, August
2000.
[13] V. Paxson, “Towards a Framework for Defining Internet Performance Met-
rics,” in Proceedings of INET’96, 1996.
[14] S. Keshav, “An Engineering Approach to Computer Networking”,
Addison-Wesley Professional Computing Series, 1997.
[15] M.R. Garey and D.S. Johnson, “Computers and Intractability: A Guide to
the Theory of NP-Completeness”, W.H. Freeman, 1979.
[16] D.S. Hochbaum, “Heuristics for the Fixed Cost Median Problem,” Math-
ematical Programming, vol.22, pp. 148–162, 1982.
[17] B.M. Waxman, “Routing of Multipoint Connections,” IEEE Jrnl. on Se-
lected Areas in Communications, vol. 6, no. 9, pp. 1617–1622, December
1988.
[18] Y. Breitbart, C.-Y. Chan, M. Garofalakis, R. Rastogi, and A. Silberschatz,
“Efficiently Monitoring Bandwidth and Latency in IP Networks,” Bell
Labs Tech. Memorandum, July 2000.
[19] V. V. Vazirani, “Approximation Algorithms”, Springer-Verlag, 2000, (To
appear).
[20] R. Caceres, N.G. Duffield, A. Feldmann, J. Friedmann, A. Greenberg,
R. Greer, T. Johnson, C. Kalmanek, B. Krishnamurthy, D. Lavelle, P.P.
Mishra, K.K. Ramakrishnan, J. Rexford, F. True, and J.E. van der Merwe,
“Measurement and Analysis of IP Network Usage and Behaviour,” IEEE
Communications Magazine, pp. 144–151, May 2000.
[21] T. McGregor, H.-W. Braun, and J. Brown, “The NLANR Network Anal-
ysis Infrastructure,” IEEE Communications Magazine, pp. 122–128, May
2000.
[22] Cooperative Association for Internet Data Analysis (CAIDA),
http://www.caida.org/.
[23] M. Charikar and S. Guha, “Improved Combinatorial Algorithms for the
Facility Location and k-Median Problems,” in Proc. of IEEE FOCS’99,
October 1999.
[24] C. Perkins, “IP encapsulation within IP,” Internet RFC-2003 (available
from http://www.ietf.org/rfc/), October 1996.