
LatLong: Diagnosing Wide-Area Latency Changes for CDNs

September 2012
IEEE Transactions on Network and Service Management 9(3):333-345

DOI:10.1109/TNSM.2012.070412.110180

Authors:

Jennifer Rexford

Princeton University




IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, ACCEPTED FOR PUBLICATION 1

LatLong: Diagnosing Wide-Area Latency Changes for CDNs

Yaping Zhu, Benjamin Helsley, Jennifer Rexford, Aspi Siganporia, and Sridhar Srinivasan

Abstract—Minimizing user-perceived latency is crucial for

Content Distribution Networks (CDNs) hosting interactive services. Latency may increase for many reasons, such as interdomain routing changes and the CDN's own load-balancing policies. CDNs need greater visibility into the causes of latency increases, so they can adapt by directing traffic to different servers or paths. In this paper, we propose a tool for CDNs to diagnose large latency increases, based on passive measurements of performance, traffic, and routing. Separating the many causes from the effects is challenging. We propose a decision tree for classifying latency changes, and determine how to distinguish traffic shifts from increases in latency for existing servers, routers, and paths. Another challenge is that network operators group related clients to reduce measurement and control overhead, but the clients in a region may use multiple servers and paths during a measurement interval. We propose metrics that quantify the latency contributions across sets of servers and routers. Based on the design, we implement the LatLong tool for diagnosing large latency increases for CDNs. We use LatLong to analyze a month of data from Google's CDN, and find that nearly 1% of the daily latency changes increase delay by more than 100 msec. Note that a latency increase of 100 msec is significant, since these are daily averages over groups of clients, and we only focus on latency-sensitive traffic for our study. More than 40% of these increases coincide with interdomain routing changes, and more than one-third involve a shift in traffic to different servers. This is the first work to diagnose latency problems in a large, operational CDN from purely passive measurements. Through case studies of individual events, we identify research challenges for managing wide-area latency for CDNs.

Index Terms—Network diagnosis, latency increases, content

distribution networks (CDNs).

I. INTRODUCTION

Content Distribution Networks (CDNs) offer users access to a wide variety of services, running on geographically distributed servers. Many web services are delay-sensitive interactive applications (e.g., search, games, and collaborative editing). CDN administrators go to great lengths to minimize user-perceived latency, by overprovisioning server resources, directing clients to nearby servers, and shifting traffic away from overloaded servers. Yet, CDNs are quite vulnerable to increases in the wide-area latency between their servers and the clients, due to interdomain routing changes or congestion in other domains. The CDN administrators need to detect and diagnose these large increases in round-trip time, and adapt to alleviate the problem (e.g., by directing clients to a different front-end server or adjusting routing policies to select a different path).

Manuscript received August 26, 2011; revised November 30, 2011 and March 20, 2012; accepted May 3, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was C. Wang.

Y. Zhu and J. Rexford are with Princeton University (e-mail: {yapingz, jrex}@cs.princeton.edu).

B. Helsley, A. Siganporia, and S. Srinivasan are with Google Inc. (e-mail: {bhelsley, aspi, sridhars}@google.com).

Digital Object Identifier 10.1109/TNSM.2012.12.110180

Fig. 1. CDN architecture and measurements.

To detect and diagnose latency problems, CDNs could

deploy a large-scale active-monitoring infrastructure to collect

performance measurements from synthetic clients all over the

world. Instead, this paper explores how CDNs can diagnose

latency problems based on measurements they can readily and

efﬁciently collect—passive measurements of performance [1],

trafﬁc [2], and routing from their own networks. Our goal is

to design the system to maximize the information the CDN

can glean from these sources of data. By joining data collected

from different locations, the CDN can determine where a client

request enters the CDN’s network, which front-end server

handles the request, and what egress router and interdomain

path carry the response trafﬁc, as shown in Figure 1. Using

this data, we analyze changes in wide-area latency between the

clients and the front-end servers; the rest of the user-perceived

latency, between the front and back-end servers, is already

under the CDN’s direct control.

Finding the root cause of latency increases is difﬁcult. Many

factors can contribute to higher delays, including internal

factors like how the CDN selects servers for the clients,

and external factors such as interdomain routing changes.

Moreover, separating cause from effect is a major challenge.

For example, directing a client to a different front-end server

naturally changes where trafﬁc enters and leaves the network,

but the routing system is not to blame for any resulting

increase in latency. After detecting large increases in latency,

our classiﬁcation must ﬁrst determine whether client requests

shifted to different front-end servers, or the latency to reach the

existing servers increased. Only then can we analyze why these

changes happened. For example, the front-end server may

change because the CDN determined that the client is closer to

1932-4537/12/$31.00 © 2012 IEEE


a different server, or because a load-balancing policy needed

to shift clients away from an overloaded server. Similarly,

if the round-trip time to a speciﬁc server increases, routing

changes along the forward or reverse path (or both!) could be

responsible.

The scale of large CDNs also introduces challenges. To

measure and control communication with hundreds of millions

of users, CDNs typically group clients by preﬁx or geographic

region. For example, a CDN may collect round-trip times

and trafﬁc volumes by IP preﬁx, or direct clients to front-end

servers by region. During any measurement interval, a group of clients may send requests to multiple front-end servers, and the traffic may traverse multiple ingress and egress routers. Thus, to analyze latency increases for groups of requests, we need to define metrics that isolate the changes attributable to an individual router or server.

In designing our tool LatLong for classifying large latency

increases, we make the following contributions:

Decision tree for separating cause from effect: A key

contribution of this paper is that we determine the causal

relationship among the various factors which lead to latency

increases. We propose a decision tree for separating the causes

of latency changes from their effects, and identify the data

sets needed for each step in the analysis. We analyze the

measurement data to identify suitable thresholds to identify

large latency changes and to distinguish one possible cause

from another.

Metrics to analyze over sets of servers and routers: Our

tool LatLong can analyze latency increases and trafﬁc shifts

over sets of servers and routers. For all potential causes of

the latency increase in the decision tree, we propose metrics

to quantify the contribution of the latency increases. For

each potential cause, we deﬁne the metric to quantify the

contribution of latency increases by a single router or server,

as well as a way to summarize the contributions across all

routers and servers.

Deployment of LatLong in Google's CDN: We apply our tool to one month of traffic, performance, and routing data from Google's CDN, and focus our studies on the large latency increases that last a long time and affect a large number of clients. Note that our tool could be applied to study latency increases at any granularity. We focus on the large increases because these are the events that cause serious performance degradation for the clients. We choose 100 msec as the threshold for a large latency increase; this is significant because the number is a daily average aggregated over a group of clients. We also focus on the latency-sensitive traffic for interactive applications, which does not include video traffic (e.g., YouTube). We found that nearly 1% of the daily latency changes increase delay by more than 100 msec. Our results show that 73.9% of these large increases in latency were explained (at least in part) by a large increase in latency to reach an existing front-end server, with 42.2% coinciding with a change in the ingress router or egress router (or both); around 34.7% of the large increases in latency involved a significant shift of client traffic to different front-end servers, often due to load-balancing decisions or changes in the CDN's own view of the closest server.

Case studies to highlight challenges in CDN management: We present several events in greater detail to highlight the challenges of measuring and managing wide-area performance. These case studies illustrate the difficulty of building an accurate latency map to direct clients to nearby servers, the extra latency clients experience when flash crowds force some requests to distant front-end servers, and the risks of relying on AS path length as an indicator of performance. Although many of these problems are known already, our case studies highlight that these issues arise in practice and are responsible for very large increases in latency affecting real users.

Our tool is complementary to the recent work on WhyHigh [3]. WhyHigh uses active measurements, combined with routing and traffic data, to study persistent performance problems where some clients in a geographic region have much higher latency than others. In contrast, we use passive measurements to analyze large latency changes affecting entire groups of clients. WhyHigh does not study the dynamics of latency increases caused by changes in server selection and interdomain routing.

The rest of the paper is organized as follows. Section II

provides an overview of the architecture of the Google CDN,

and the datasets we gathered. Section III describes our design

of LatLong using decision-tree based classiﬁcation. Section IV

presents a high-level characterization of the latency changes in Google's CDN, and identifies the large latency events

we study. Next, we present the results of classiﬁcation using

LatLong in Section V, followed by several case studies in

Section VI. Then, we discuss the future research directions in

Section VII, and present related work in Section VIII. Finally,

we conclude the paper in Section IX.

II. GOOGLE'S CDN AND MEASUREMENT DATA

In this section, we ﬁrst provide a high-level overview of the

network architecture of Google’s CDN. Then, we describe the

measurement dataset we gathered as the input of our tool.

A. Google’s CDN Architecture

The infrastructure of Google’s CDN consists of many

servers in the data centers spread across the globe. The

client requests are ﬁrst served at a front-end (FE) server,

which provides caching, content assembly, pipelining, request

redirection, and proxy functions for the client requests. To

have greater control over network performance, CDN admin-

istrators typically place front-end servers in managed hosting

locations, or ISP points of presence, in geographic regions near the clients. The client requests are terminated at the

FEs, and (when necessary) served at the backend servers

which implement the corresponding application logic. Inside

the CDN’s internal network, servers are connected by the

routers, and IP packets enter and leave the network at edge

routers that connect to neighboring ISPs.

Figure 1 presents a simpliﬁed view of the path of a client

request. A client request is directed to an FE, based on

proximity and server capacity. Each IP packet enters the CDN

network at an ingress router and travels to the chosen FE.

After receiving responses from the back-end servers, the FE

directs response trafﬁc to the client. These packets leave the

CDN at an egress router and follow an AS path through one or more Autonomous Systems (ASes) en route to the client. The user-perceived latency is affected by several factors: the location of the servers, the path from the client to the ingress router, and the path from the egress router back to the client. From the CDN's perspective, the visible factors are: the ingress router, the selection of the servers, the egress router, and the AS path.

TABLE I
MEASUREMENTS OF WIDE-AREA PERFORMANCE, TRAFFIC, AND ROUTING

| Data Set | Collection Point | Logged Information |
|---|---|---|
| Performance | front ends (FEs) | (client /24 prefix, country, RPD, average RTT) |
| Traffic | ingress routers | (client IP address, FE IP address, bytes-in) |
| Traffic | egress routers | (FE IP address, client IP address, bytes-out) |
| Routing | egress routers | (client IP prefix, AS path) |
| Joint data | | (client IP prefix, FE, RPD, RTT, {ingress, bytes-in}, {egress, AS path, bytes-out}) |

Like many CDNs, Google uses DNS to direct clients to

front-end servers, based ﬁrst on a latency map (preferring

the FE with the smallest network latency) and second on

a load-balancing policy (that selects another nearby FE if

the closest FE is overloaded) [4]. To periodically construct

the latency map, the CDN collects round-trip statistics by

passively monitoring TCP transfers to a subset of the IP

preﬁxes. In responding to a DNS request, the CDN identiﬁes

the IP preﬁx associated with the DNS resolver and returns the

IP address of the selected FE, under the assumption that end

users are relatively close to their local DNS servers. Changes

in the latency map can lead to shifts in trafﬁc to different

FEs. The latency between the front-end and back-end servers

is a known and predictable quantity, and so our study focuses

on the network latency—speciﬁcally, the round-trip time—

between the FEs and the clients.
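The two-step selection described above (latency map first, load balancing second) can be sketched as follows. This is only an illustration under stated assumptions, not Google's implementation: the ranked candidate list, the `demand` and `capacity` dictionaries, and the function name `select_fe` are all hypothetical, and the rule "overloaded means demand has reached capacity" is a simplification.

```python
from typing import Dict, List, Optional


def select_fe(
    ranked_fes: List[str],
    demand: Dict[str, float],
    capacity: Dict[str, float],
) -> Optional[str]:
    """Pick an FE for a client prefix.

    ranked_fes: candidate FEs for the prefix, lowest latency first
                (hypothetical input derived from a latency map).
    demand / capacity: current load and capacity per FE (hypothetical units).
    """
    for fe in ranked_fes:
        # Prefer the closest FE; skip it only if it has no spare capacity.
        if demand.get(fe, 0.0) < capacity.get(fe, 0.0):
            return fe
    # Assumption: if every candidate is overloaded, fall back to the closest.
    return ranked_fes[0] if ranked_fes else None


if __name__ == "__main__":
    fes = ["fe-a", "fe-b"]  # fe-a is closest
    print(select_fe(fes, {"fe-a": 100.0, "fe-b": 10.0},
                    {"fe-a": 100.0, "fe-b": 50.0}))
```

In this sketch a client whose closest FE is full is sent to the next candidate, which is exactly the kind of shift that the later metrics attribute to load balancing rather than to the latency map.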

B. Passive Measurements of the CDN

The measurement data sets, which are routinely collected at the servers and routers, are summarized in Table I. The three main datasets (performance, traffic, and routing measurements) are collected by different systems. The measurement data cover latency-sensitive traffic for the interactive applications. We do not include video traffic (e.g., YouTube) in our study, because it is not latency sensitive.

Client performance (at the FEs): The FEs monitor the

round-trip time (RTT) for a subset of the TCP connections

by measuring the time between sending the SYN-ACK and

receiving an ACK from the client. In cases when the SYN-

ACK or ACK packet is lost, this SYN-ACK RTT value

would be invalid. In these cases, the RTT for data transfers

in the same TCP connection would be used instead. These

measurements capture the propagation and queuing delays

along both the forward and reverse paths to the clients. Each

FE also counts the number of requests, producing a daily

summary of the round-trip time (RTT) and the requests per

day (RPD) for each /24 IP preﬁx. Each /24 preﬁx is associated

with a speciﬁc country, using an IP-geo database. We use it to

group preﬁxes in nearby geographical regions for our study.
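The per-prefix daily summary can be illustrated with a short sketch. It assumes, for simplicity, that every logged sample is one request carrying one RTT value; in the system described above the RTT is measured only for a subset of connections while requests are counted separately. The record layout and the function name `daily_summary` are hypothetical.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def daily_summary(
    samples: Iterable[Tuple[str, float]],
) -> Dict[str, Tuple[int, float]]:
    """Map each client /24 prefix to (requests per day, mean RTT in msec).

    samples: (client IPv4 address, RTT in msec) pairs for one day.
    """
    count: Dict[str, int] = defaultdict(int)
    rtt_sum: Dict[str, float] = defaultdict(float)
    for client_ip, rtt_ms in samples:
        # strict=False masks the host bits, giving the enclosing /24.
        prefix = str(ipaddress.ip_network(f"{client_ip}/24", strict=False))
        count[prefix] += 1
        rtt_sum[prefix] += rtt_ms
    return {p: (count[p], rtt_sum[p] / count[p]) for p in count}


if __name__ == "__main__":
    day = [("192.0.2.10", 40.0), ("192.0.2.200", 60.0), ("198.51.100.7", 100.0)]
    for prefix, (rpd, rtt) in sorted(daily_summary(day).items()):
        print(prefix, rpd, rtt)
```

The addresses come from documentation ranges and are not measurement data.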

Netﬂow trafﬁc (at edge routers): The edge routers collect

trafﬁc measurements using Netﬂow [2]. The client is the

source address for incoming trafﬁc and the destination address

for outgoing trafﬁc; similarly, the FE is the destination for

incoming trafﬁc, and the source for outgoing trafﬁc. Netﬂow

performs packet sampling, so the trafﬁc volumes are estimates

after correcting for the sampling rate. This leads to records that

summarize trafﬁc in each ﬁfteen-minute interval, indicating the

client IP address, front-end server address, and trafﬁc volume.

Trafﬁc for a single client address may be associated with

multiple routers or FEs during the interval. We use bytes as the Netflow metric, because bytes represent the traffic volume we care about. However, our techniques could also be applied to the number of flows; we do not see any significant distinction between using bytes and flows as the metric.

BGP routing (at egress routers): The edge routers also

collect BGP routing updates that indicate the sequence of

Autonomous Systems (ASes) along the path to each client IP

preﬁx. (Because BGP routing is destination based, the routers

cannot collect similar information about the forward path from

clients to the FEs.) A dump of the BGP routing table every

ﬁfteen minutes, aligned with the measurement interval for the

Netﬂow data, indicates the AS-PATH of the BGP route used

to reach each IP preﬁx from each egress router.

Joint data set: The joint data set used in our analysis

combines the performance, trafﬁc, and routing data, using the

client IP preﬁx and FE IP address as keys in the join process.

First, the trafﬁc and routing data at the egress routers are

joined by matching the client IP address from the Netﬂow

data with the longest-matching preﬁx in the routing data.

Second, the combined trafﬁc and routing data are aggregated

into summaries and joined with the performance data, by

matching the /24 preﬁx in the performance data with the

longest-matching prefix from the routing data. The resulting joint data set captures the traffic, routing, and performance for each client IP prefix and front-end server, as summarized in Table I. The data set is aggregated to the prefix level. In addition, the data do not contain any user-identifiable information (such as packet payloads or timings of individual requests). The data set we study is based on a sample, and does not cover all of the CDN network.
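The longest-prefix-match step of the join can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, assuming the routing data is a dictionary from BGP prefix to AS path; the names `longest_match` and `routes`, and the AS numbers, are hypothetical. A linear scan is used for clarity, whereas a production join would more likely use a prefix trie.

```python
import ipaddress
from typing import Dict, Optional, Tuple


def longest_match(prefix: str, routes: Dict[str, Tuple[int, ...]]) -> Optional[str]:
    """Return the most specific routing prefix that covers `prefix`, if any."""
    target = ipaddress.ip_network(prefix)
    best = None
    best_len = -1
    for route in routes:
        net = ipaddress.ip_network(route)
        # subnet_of() requires both networks to have the same IP version.
        if net.version == target.version and target.subnet_of(net):
            if net.prefixlen > best_len:
                best, best_len = route, net.prefixlen
    return best


if __name__ == "__main__":
    # Hypothetical routing table: prefix -> AS path (private-use AS numbers).
    routes = {
        "203.0.113.0/24": (64500, 64501),
        "203.0.0.0/16": (64500,),
    }
    for p in ["203.0.113.0/24", "203.0.5.0/24", "198.51.100.0/24"]:
        match = longest_match(p, routes)
        print(p, "->", match, routes.get(match))
```

Joining a performance record for a /24 prefix with the routing data then amounts to looking up the AS path stored under the returned key.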

The data have some unavoidable limitations, imposed by

the systems that collect the measurements: the performance

data does not indicate which ingress and egress router were

used to carry the trafﬁc, since the front-end servers do not

have access to this information. This explains why the joint

data set has a set of ingress and egress routers. Fortunately,

the Netﬂow measurements allow us to estimate the request

rate for the individual ingress routers, egress routers, and AS

paths from the observed trafﬁc volumes; however, we cannot


directly observe how the RTT varies based on the choice of

ingress and egress routers. Still, the joint data set provides

a wealth of information that can shed light on the causes of

large latency increases.

Latency map, and front-end server capacity and de-

mand: In addition to the joint data set, we analyze changes to

the latency map used to drive DNS-based server selection, as

discussed in more detail in Section III-B. We also collect logs

of server capacity and demand at all front-end servers. We use

the logs to determine whether a speciﬁc FE was overloaded

at a given time (when the demand exceeded capacity, and

requests were load balanced to other front-end servers).

III. DESIGN OF THE LATLONG TOOL

Analyzing wide-area latency increases is difﬁcult, because

multiple inter-related factors can lead to higher round-trip

times. Also, our analysis should account for the fact that

clients may direct trafﬁc to multiple front ends, either because

the front-end server changes or because different clients in the

same region use different front-end servers.

In this section, we present the design of the decision tree

which LatLong uses to analyze latency increases, as illustrated

in Figure 2. We propose metrics for distinguishing FE changes

from latency changes that affect individual FEs. Then, we

describe the techniques to identify the cause of FE changes

(the latency map, or load balancing). Lastly, we present the

method to correlate the latency increases that affect individual

FEs with routing changes. The classification is general: the method does not depend on a specific way of aggregating users or on a specific timescale. Therefore, LatLong can diagnose latency changes at different granularities (e.g., different ways to aggregate users, and different timescales). Table II summarizes the notation used in this paper.

A. Front-End Server Change vs. Latency Increase

The average round-trip time could increase for one of two

main reasons:

• Front-end server changes ($\Delta FE$): The clients switch from one front-end server to another, where the new server has a higher RTT. This change could be caused by an FE failure or a change in the CDN's server-selection policies, as shown in the upper branch of Figure 2.

• Front-end latency changes ($\Delta Lat$): The clients could continue using the same FE, but have a higher RTT for reaching that server. The increased latency could stem from changes along the forward or reverse path to the client, as shown in the lower branch of Figure 2.

The analysis is difﬁcult because a group of clients could con-

tact multiple front-end servers, and the RTT and RPD for each

server changes. Correctly distinguishing all of these factors

requires grappling with sets of front-ends and weighting the

RTT measurements appropriately.

The average round-trip time experienced by the clients is

the average over the requests sent to multiple front-ends, each

with its own average round-trip time. For example, consider a region of clients experiencing an average round-trip time of $RTT_1$ at time 1, with a request rate of $RPD_{1i}$ and round-trip time $RTT_{1i}$ for each front-end server $i$. Then,

$$RTT_1 = \sum_i RTT_{1i} \cdot \frac{RPD_{1i}}{RPD_1}$$

where $RPD_1 = \sum_i RPD_{1i}$ is the total number of requests from that region, across all front-end servers, for time period 1. A similar equation holds for the second time period, with the subscripts changed to consider round-trip times and request rates at time 2.

The increase in average round-trip time from time 1 to time 2 (i.e., $\Delta RTT = RTT_2 - RTT_1$) is, then,

$$\Delta RTT = \sum_i \left( RTT_{2i} \cdot \frac{RPD_{2i}}{RPD_2} - RTT_{1i} \cdot \frac{RPD_{1i}}{RPD_1} \right)$$

The equation shows how the latency increases could come either from a higher round-trip time for the same server (i.e., $RTT_{2i} > RTT_{1i}$) or a shift in the fraction of requests directed to each FE (i.e., $RPD_{2i}/RPD_2$ vs. $RPD_{1i}/RPD_1$), or both. To tease these two factors apart, consider one FE $i$, and the term inside the summation. We can split the term into two parts that sum to the same expression, where the first captures the impact on the round-trip time from traffic shifting toward front-end server $i$:

$$\Delta FE_i = RTT_{2i} \cdot \left( \frac{RPD_{2i}}{RPD_2} - \frac{RPD_{1i}}{RPD_1} \right)$$

where $\Delta FE_i$ is high if the fraction of traffic directed to front-end server $i$ increases, or if the round-trip time is high at time 2. The second term captures the impact of the latency to front-end server $i$ increasing:

$$\Delta Lat_i = (RTT_{2i} - RTT_{1i}) \cdot \frac{RPD_{1i}}{RPD_1}$$

where the latency is weighted by the fraction of requests directed to front-end server $i$, to capture the relative impact of this FE on the total increase in latency. Through simple algebraic manipulation, we can show that

$$\Delta RTT = \sum_i (\Delta FE_i + \Delta Lat_i).$$

As such, we can quantify the contribution to the latency change that comes from shifts between FEs:

$$\Delta FE = \sum_i \Delta FE_i / \Delta RTT$$

and latency changes for individual front-end servers

$$\Delta Lat = \sum_i \Delta Lat_i / \Delta RTT$$

where the factors sum to 1. For example, if the FE change

contributes 0.85 and the latency change contributes 0.15, we

can conclude that the latency increase was primarily caused

by a trafﬁc shift between front-end servers. If the FE change

contributes -0.1 and the latency change contributes 1.1, we

can conclude that the latency increase was due to an increase

in latency to reach the front-end servers rather than a trafﬁc

shift; if anything, the -0.1 suggests that some trafﬁc shifted to

FEs with lower latency, but this effect was dwarfed by one or

more FEs experiencing an increase in latency.
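The decomposition above can be checked numerically with a short sketch. The function below computes the change in average RTT and the two normalized contributions from per-FE round-trip times and request counts. The function name, the dictionary inputs, and the example numbers are illustrative assumptions; in particular, an FE that receives no requests at time 2 has no measured RTT there, and the sketch assumes its time-1 RTT in that case, which does not affect the total because the term is multiplied by a zero request fraction.

```python
from typing import Dict, Tuple


def decompose_rtt_change(
    rtt1: Dict[str, float], rpd1: Dict[str, float],
    rtt2: Dict[str, float], rpd2: Dict[str, float],
) -> Tuple[float, float, float]:
    """Return (delta_rtt, fe_share, lat_share) for one client region.

    rtt1/rtt2: average RTT per FE at time 1 / time 2 (msec).
    rpd1/rpd2: requests per FE at time 1 / time 2.
    fe_share and lat_share are the normalized contributions and sum to 1.
    """
    total1 = sum(rpd1.values())
    total2 = sum(rpd2.values())
    sum_fe = 0.0
    sum_lat = 0.0
    for fe in set(rpd1) | set(rpd2):
        frac1 = rpd1.get(fe, 0.0) / total1
        frac2 = rpd2.get(fe, 0.0) / total2
        # Assumption: reuse the other period's RTT when an FE was unused.
        r2 = rtt2.get(fe, rtt1.get(fe, 0.0))
        r1 = rtt1.get(fe, r2)
        sum_fe += r2 * (frac2 - frac1)   # traffic-shift term for this FE
        sum_lat += (r2 - r1) * frac1     # latency-change term for this FE
    delta_rtt = sum_fe + sum_lat
    if delta_rtt == 0:
        raise ValueError("no change in average RTT; shares are undefined")
    return delta_rtt, sum_fe / delta_rtt, sum_lat / delta_rtt


if __name__ == "__main__":
    # Made-up region: traffic shifts from a nearby FE "A" to a distant FE "B".
    print(decompose_rtt_change(
        {"A": 20.0, "B": 100.0}, {"A": 90, "B": 10},
        {"A": 20.0, "B": 100.0}, {"A": 50, "B": 50},
    ))
```

In the made-up example the per-FE RTTs do not change, so the whole increase is attributed to the traffic shift; holding the request mix fixed and raising one FE's RTT instead moves the whole contribution to the latency term.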


Fig. 2. LatLong system design: classiﬁcation of large latency changes.

TABLE II
SUMMARY OF KEY NOTATION

| Symbol | Meaning |
|---|---|
| $RTT_1$, $RTT_2$ | round-trip time for a client region at time 1 and time 2 |
| $\Delta RTT$ | change in RTT from time 1 to time 2 (i.e., $RTT_2 - RTT_1$) |
| $RTT_{1i}$, $RTT_{2i}$ | round-trip time for requests to FE $i$ at time 1 and time 2 |
| $RPD_1$, $RPD_2$ | requests for a client region at time 1 and time 2 |
| $RPD_{1i}$, $RPD_{2i}$ | requests to FE $i$ at time 1 and time 2 |
| $\Delta FE_i$ | latency change contribution from traffic shifts at FE $i$ |
| $\Delta Lat_i$ | latency change contribution from latency changes at FE $i$ |
| $\Delta FE$ | latency change contribution from traffic shifts at all FEs |
| $\Delta Lat$ | latency change contribution from latency changes at all FEs |
| $r_{1i}$, $r_{2i}$ | fraction of requests served at FE $i$ predicted by the latency map at time 1 and time 2 |
| $\Delta LatMap$ | fraction of requests shifting FEs predicted by the latency map |
| $\Delta FEDist$ | actual fraction of requests shifting FEs |
| $LoadBalance_1$ | fraction of requests shifting FEs by the load balancer at time 1 |
| $\Delta LoadBal$ | difference of the fraction of requests shifting FEs by the load balancer from time 1 to time 2 |
| $\Delta Ingress$ | fraction of the traffic shifting ingress router at a specific FE |
| $\Delta EgressASPath$ | fraction of the traffic shifting (egress router, AS path) at a specific FE |

In the following subsections, we present the method to

identify the causes of the FE changes: the latency map and

load balancing.

B. Front-End Changes by the Latency Map

Google's CDN periodically constructs a latency map to direct

clients to the closest front-end server. The CDN constructs the

latency map by measuring the round-trip time for each /24

preﬁx to different front-end servers, resulting in a list mapping

each /24 preﬁx to a single, closest FE. From the latency map,

we can compute the target distribution of requests over the

front-end servers for groups of co-located clients in two time

intervals. To combine this information across all /24 preﬁxes in

the same region, we weight by the requests per day (RPD) for

each /24 prefix. This results in a distribution of the fraction of requests $r_{1i}$ from the client region directed to front-end server $i$, at time 1.

As the latency map and the request rates change, the region may have a different distribution $\{r_{2i}\}$ at time 2. To analyze changes in the latency map, we consider the fraction of requests that should shift to different front-end servers:

$$\Delta LatMap = \sum_i |r_{2i} - r_{1i}| / 2$$

Note that we divide the difference by two, to avoid double counting the fraction of requests that move away from one FE (i.e., $r_{2i} - r_{1i}$ decreasing for one front-end server $i$) and towards another (i.e., $r_{2i} - r_{1i}$ increasing for some other front-end server).
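As an illustration, the target distribution and the map-change metric can be computed as in the sketch below. It assumes the latency map is available as a dictionary from /24 prefix to its single closest FE and that per-prefix request counts are known; the function names, prefixes, and FE labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def target_distribution(
    latency_map: Dict[str, str], rpd: Dict[str, float]
) -> Dict[str, float]:
    """RPD-weighted fraction of a region's requests that the map sends to each FE."""
    total = sum(rpd[prefix] for prefix in latency_map)
    share: Dict[str, float] = defaultdict(float)
    for prefix, fe in latency_map.items():
        share[fe] += rpd[prefix] / total
    return dict(share)


def shift_fraction(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Half the sum of absolute differences between two FE distributions."""
    fes = set(before) | set(after)
    return sum(abs(after.get(fe, 0.0) - before.get(fe, 0.0)) for fe in fes) / 2


if __name__ == "__main__":
    rpd = {"192.0.2.0/24": 60, "198.51.100.0/24": 40}
    map1 = {"192.0.2.0/24": "fe-a", "198.51.100.0/24": "fe-b"}
    map2 = {"192.0.2.0/24": "fe-a", "198.51.100.0/24": "fe-a"}
    r1 = target_distribution(map1, rpd)
    r2 = target_distribution(map2, rpd)
    print(r1, r2, shift_fraction(r1, r2))
```

In this made-up case the second prefix, which carries 40% of the region's requests, is remapped from one FE to the other, so the metric reports a shift of 0.4 rather than the 0.8 that the undivided sum would give.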

C. Front-End Changes by Load Balancing

In practice, the actual distribution of requests to front-end

servers does not necessarily follow the latency map. Some

FEs may be overloaded, or unavailable due to maintenance.

To understand how the trafﬁc distribution changes in practice,

we quantify the changes in front-end servers as follows:

$$\Delta FEDist = \sum_i \left| \frac{RPD_{2i}}{RPD_2} - \frac{RPD_{1i}}{RPD_1} \right| / 2$$

That is, we calculate the fraction of requests to FE $i$ at time 1 and time 2, and compute the difference, summing over all front-end servers. As with the equation for $\Delta LatMap$, we divide the sum by two to avoid double counting shifts away from one front-end server and shifts toward another.

The differences are caused by the CDN’s own load-

balancing policy, which directs trafﬁc away from busy front-

end servers. This may be necessary during planned mainte-

nance. For example, an FE may consist of a cluster of com-

puters; if some of these machines go down for maintenance,

the aggregate server capacity decreases temporarily. In other

cases, a surge in client demand may temporarily overload the

closest front-end server. In both cases, directing some clients


to an alternate front-end server is important for preventing

degradation in performance. A slight increase in round-trip

time for some clients is preferable to all clients experiencing

slower downloads due to congestion.

To estimate the fraction of requests shifted by the load

balancer, we identify front-end servers that handle a lower

fraction of requests than what is suggested by the latency

map. The latency map indicates that front-end server $i$ should handle a fraction $r_{1i}$ of the requests for the clients at time 1. In reality, the server handles $RPD_{1i}/RPD_1$.

$$LoadBalance_1 = \sum_i \left[ r_{1i} - \frac{RPD_{1i}}{RPD_1} \right]^+$$

where $[\,]^+$ indicates that the sum only includes the positive values, with the target request load in excess of the actual load. Similarly, we define the fraction of queries load balanced at time 2 as $LoadBalance_2$.

If many more requests are load balanced on the second day, then more requests are directed to alternative FEs that are further away, leading to higher round-trip times. Thus, we use the difference of the load balancer metric to capture the additional load-balanced traffic at time 2:

$$\Delta LoadBal = LoadBalance_2 - LoadBalance_1$$

We expect the load-balancing policy to routinely trigger

some small shifts in trafﬁc.
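As an illustration of the load-balancing metric defined above, the following is a minimal sketch rather than the LatLong implementation. The function names and the dictionary-based inputs (a latency-map target share and a request count per front-end server) are assumptions made for this example.

```python
def load_balance_fraction(target_fractions, requests_per_fe):
    """Fraction of requests served away from the FEs the latency map prefers.

    target_fractions: dict FE -> r_i, the share the latency map assigns to FE i
                      (hypothetical input format).
    requests_per_fe:  dict FE -> RPD_i, the requests FE i actually served.
    """
    total = sum(requests_per_fe.values())
    if total == 0:
        return 0.0
    shifted = 0.0
    for fe, target in target_fractions.items():
        actual = requests_per_fe.get(fe, 0) / total
        # [x]^+ : count only FEs whose target share exceeds their actual share.
        shifted += max(target - actual, 0.0)
    return shifted


def delta_load_bal(targets_1, rpd_1, targets_2, rpd_2):
    """Change in the load-balanced fraction from time 1 to time 2."""
    return (load_balance_fraction(targets_2, rpd_2)
            - load_balance_fraction(targets_1, rpd_1))


# Day 1: all requests reach the preferred FE. Day 2: 40% are served elsewhere.
example = delta_load_bal(
    {"fe_a": 1.0}, {"fe_a": 1000},
    {"fe_a": 1.0}, {"fe_a": 600, "fe_b": 400},
)
```

In this toy input the preferred server keeps its full target share on the first day and falls 0.4 short of it on the second, so the sketch reports a ΔLoadBal of about 0.4.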

D. Inter-domain Routing Changes

Next, our analysis focuses on events where the RTT jumps

signiﬁcantly for speciﬁc FEs. These increases in round-trip

time could be caused by routing changes, or by congestion

along the paths to and from the client. Since the CDN does not

have direct visibility into congestion outside its own network,

we correlate the RTT increases only with the routing changes

visible to the CDN—changes of the ingress router where client

trafﬁc enters the CDN network, and changes of the egress

router and the AS path used to reach the client.

Recall that the latency metric ΔLat can be broken down into the sum of latency metrics at individual FEs (i.e., $\Delta Lat_i$). We focus our attention on the FE with the highest value of $\Delta Lat_i$, because the latency change for requests to this FE has the most impact on the latency increase seen by the clients. Then, we define metrics to capture what fraction of the traffic destined to this FE experiences a visible routing change. Focusing on the front-end server $i$ with the highest latency increase, we consider where the traffic enters the network. Given all the traffic from the client region to the front-end server, we can compute the fractions $f_{1j}$ and $f_{2j}$ entering at ingress router $j$ at time 1 and time 2, respectively. Note that we compute these fractions from the “bytes-in” statistics from the Netflow data, since the front-end server cannot differentiate the requests per day (RPD) by which ingress router carried the traffic.

To quantify how traffic shifted to different ingress routers, we compute:

\[ \Delta Ingress = \sum_j |f_{2j} - f_{1j}| / 2 \]

Note that the difference between the fractions is divided by two, to avoid double counting traffic that shifts away from one ingress router and toward another. Similarly, we define a metric to measure the fraction of traffic to an FE that switches to a different egress router or AS path. Suppose the fraction of traffic to (egress router, AS path) $k$ is $g_{1k}$ at time 1 and $g_{2k}$ at time 2. Then,

\[ \Delta EgressASPath = \sum_k |g_{2k} - g_{1k}| / 2 \]

similar to the equation for analyzing the ingress routers. These metrics allow us to correlate large increases in latency to server $i$ with observable routing changes. Note that the analysis can only establish a correlation between latency increases and routing changes, rather than definitively “blaming” the routing change for the higher delay, since the performance measurements cannot distinguish RTT by which ingress or egress router carried the traffic.
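Both routing metrics have the same form: half the sum of absolute differences between two traffic distributions. A minimal sketch follows, assuming the Netflow byte counts are available as dictionaries keyed by ingress router, or by an (egress router, AS path) tuple; the function names and router labels are hypothetical.

```python
def byte_fractions(bytes_by_key):
    """Normalize byte counts (e.g., Netflow "bytes-in") into traffic fractions."""
    total = sum(bytes_by_key.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in bytes_by_key.items()}


def traffic_shift(fractions_1, fractions_2):
    """Half the L1 distance between two traffic distributions.

    Keys may be ingress routers (for the ingress metric) or
    (egress router, AS path) tuples (for the egress/AS-path metric).
    A key missing on one day is treated as carrying a fraction of 0.
    """
    keys = set(fractions_1) | set(fractions_2)
    total = sum(abs(fractions_2.get(k, 0.0) - fractions_1.get(k, 0.0)) for k in keys)
    return total / 2.0


# Half of the traffic moves from router "r1" to router "r2".
day1 = byte_fractions({"r1": 8000})
day2 = byte_fractions({"r1": 4000, "r2": 4000})
shift = traffic_shift(day1, day2)
```

Dividing by two means that traffic leaving one router and arriving at another is counted once, so the result can be read directly as the fraction of traffic that moved (0.5 in this toy input).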

IV. DISTRIBUTION OF LATENCY CHANGES

In the rest of the paper, we apply our tool to measurement

data from Google’s CDN. The BGP and Netﬂow data are col-

lected and joined on a 15-minute timescale; the performance

data is collected daily, and joined with the routing and trafﬁc

data to form a joint data set for each day in June 2010. For our

analysis, we focus on the large latency increases which last for

a long time and affect a large number of clients. We pick daily

changes as the timescale, because the measurement data we get

is aggregated daily. We group clients by “region,” combining

all IP addresses with the same origin AS and located in the

same country. In this section, we describe how we preprocess

the data, and characterize the distribution of daily increases in

latency to identify the most signiﬁcant events which last for

days. We also determine the threshold for the large latency

increases we study.

As our datasets are proprietary, we are not able to reveal the exact number of regions or events, and instead report percentages in our tables and graphs; we believe percentages are more meaningful, since the exact numbers of events and regions naturally differ from one CDN to another. In addition, the granularity of the data, both spatially (i.e., by region) and temporally (i.e., by day), is beyond our control; these choices are not fundamental to our methodology, which could easily be applied to finer-grain measurement data.

A. Aggregating Measurements by Region

Our joint dataset has trafﬁc and performance data at the

level of BGP preﬁxes, leading to approximately 250K groups

of clients to consider. Many of these preﬁxes generate very

little trafﬁc, making it difﬁcult to distinguish meaningful

changes in latency from statistical noise. In addition, CDN

administrators understandably prefer to have more concise

summaries of signiﬁcant latency changes that affect many

clients, rather than reports for hundreds of thousands of

preﬁxes.

Combining preﬁxes with the same origin AS seems like a

natural way to aggregate the data, because many routing and

trafﬁc changes take place at the AS level. Yet, some ASes

are quite large in their own right, spanning multiple countries.

We combine preﬁxes that share the same country and origin

AS (which we deﬁne as a region), for our analysis. From the

performance measurements, we know the country for each /24

preﬁx, allowing us to identify the country (or set of countries)

associated with each BGP preﬁx. A preﬁx spanning multiple

countries could have large variations in average RTT simply

due to differences in the locations of the active clients. As

such, we ﬁlter the small number of BGP preﬁxes spanning

multiple countries. This ﬁlters approximately 2K preﬁxes,

which contribute 3.2% of client requests and 3.3% of the trafﬁc

volume.

After aggregating clients by region, some regions still

contribute very little trafﬁc. For each region, we calculate the

minimum number of requests per day (RPD) over the month

of June 2010. We choose a threshold for the minimum RPD

to ﬁlter the regions with very low client demand. This process

improves statistical accuracy, because it makes sure that we

have enough samples of requests for the regions we study. This

also helps focus our attention on regions with many clients,

and reduce the volume of the measurement data we analyze.

This process helps us exclude the regions in the long tail of the traffic distribution. After this preprocessing step, our experiments still cover 94% of the traffic while including only 15% of the regions. We also ensure that these regions cover all the major geographical areas globally and all the major ASes.

Hence, for the rest of our analysis, we focus on clients

aggregated by region (i.e., by country and origin AS), and

regions generating a signiﬁcant number of requests per day.

Note that our analysis methodology could be applied equally

well to alternate ways of aggregating the clients and ﬁltering

the data.
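The preprocessing described in this subsection can be sketched as follows. This is an illustration under stated assumptions, not the code used in our study: the record layout (keys `origin_as`, `countries`, `daily_rpd`) and the threshold value in the example are hypothetical, and every prefix is assumed to report the same number of days.

```python
def aggregate_by_region(prefix_records, min_rpd_threshold):
    """Group BGP prefixes into (origin AS, country) regions and filter small ones.

    prefix_records: iterable of dicts with hypothetical keys
        'origin_as' (int), 'countries' (set of str), 'daily_rpd' (list of int,
        one entry per day, same length for every prefix).
    Returns a dict mapping region -> list of daily request counts.
    """
    region_daily = {}
    for rec in prefix_records:
        # Drop prefixes whose clients span more than one country.
        if len(rec["countries"]) != 1:
            continue
        (country,) = rec["countries"]
        region = (rec["origin_as"], country)
        if region not in region_daily:
            region_daily[region] = list(rec["daily_rpd"])
        else:
            region_daily[region] = [
                a + b for a, b in zip(region_daily[region], rec["daily_rpd"])
            ]
    # Keep only regions whose minimum daily demand meets the threshold.
    return {
        region: daily
        for region, daily in region_daily.items()
        if min(daily) >= min_rpd_threshold
    }


records = [
    {"origin_as": 64500, "countries": {"US"}, "daily_rpd": [100, 120]},
    {"origin_as": 64500, "countries": {"US"}, "daily_rpd": [50, 30]},
    {"origin_as": 64500, "countries": {"US", "CA"}, "daily_rpd": [1000, 1000]},
    {"origin_as": 64501, "countries": {"MY"}, "daily_rpd": [5, 500]},
]
regions = aggregate_by_region(records, min_rpd_threshold=100)
```

In this toy input, the multi-country prefix is discarded, the two single-country prefixes of AS 64500 merge into one region, and the second region is filtered because its minimum daily request count falls below the threshold.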

The measurement results we present in the following sec-

tions cover all the days in the month of June 2010. The data

represents the global trafﬁc we receive at Google, and we

ensure that all the major geographical areas and large ASes

are covered.

B. Identifying Large Latency Increases

To gain an initial understanding of latency changes, we first characterize the differences in latency from one day to the next throughout the month of June 2010, across all the client regions we selected. We consider both the absolute change (i.e., $RTT_2 - RTT_1$) and the relative change (i.e., $(RTT_2 - RTT_1)/RTT_1$), as shown in Figures 3(a) and 3(b), respectively. The graphs plot only the increases in latency, because the distributions of daily increases and decreases are symmetric.

The two graphs are plotted as complementary cumulative distributions, with a logarithmic scale on both axes, to highlight the large outliers. Figure 3(a) shows that latency increases by less than 10 msec in 79.4% of cases. Yet, nearly 1% of the latency increases exceed 100 msec, and every so often latency increases by more than one second. Figure 3(b) shows that the RTT increases by less than 10% in 80.0% of cases. Yet, the daily RTT at least doubles (i.e., a relative increase of 1 or more) in 0.45% of cases, and we see occasional increases by a factor of ten.
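For readers who wish to reproduce this kind of plot on their own data, the following sketch computes an empirical complementary cumulative distribution. The function name and the sample values are assumptions for illustration; our datasets are not reproduced here.

```python
def empirical_ccdf(values):
    """Return (x, P[X > x]) pairs for each distinct value, in increasing x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for idx, x in enumerate(ordered):
        # For repeated values, emit a single point at the last occurrence.
        if idx + 1 < n and ordered[idx + 1] == x:
            continue
        points.append((x, (n - idx - 1) / n))
    return points


# Hypothetical daily RTT increases in msec.
ccdf = empirical_ccdf([5, 20, 150, 20])
```

The largest sample always has a CCDF value of 0, which cannot be drawn on a logarithmic y-axis, so that final point is typically dropped before plotting.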

TABLE III
EVENTS WITH A LARGE DAILY RTT INCREASE

| Category | % Events |
|---|---|
| Absolute RTT Increase ≥ 100 ms | 76.9% |
| Relative RTT Increase ≥ 1 | 35.6% |
| Total large events | 100% |

TABLE IV
CAUSES OF LARGE LATENCY INCREASES (WHERE LATENCY MORE THAN DOUBLES, OR INCREASES BY MORE THAN 100 MSEC), RELATIVE TO PREVIOUS DAY. NOTE THAT NEARLY 9% OF EVENTS INVOLVE BOTH FE LATENCY INCREASES AND FE SERVER CHANGES.

| Category | % Events |
|---|---|
| FE latency increase | 73.9% |
| - Ingress router | 10.3% |
| - (Egress, AS Path) | 14.5% |
| - Both | 17.4% |
| - Unknown | 31.5% |
| FE server change | 34.7% |
| - Latency map | 14.2% |
| - Load balancing | 2.9% |
| - Both | 9.3% |
| - Unknown | 8.4% |
| Total | 100.0% |

We define an event to be a daily RTT increase over a threshold for a specific region. Table III summarizes the events we selected to characterize the latency increase. We choose the threshold of absolute RTT increase as 100 ms and the threshold of relative RTT increase as 1, leading to a combined list of hundreds of events corresponding to the most significant increases in latency: 76.9% of the events exceed the absolute RTT increase threshold, 35.6% exceed the relative RTT increase threshold, and 12.5% exceed both thresholds.
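The event definition above amounts to a simple test on two consecutive daily averages. A minimal sketch follows, with the two thresholds taken from the text; the function name and the returned dictionary are assumptions for illustration, and the first-day RTT is assumed to be positive.

```python
ABS_THRESHOLD_MS = 100.0  # absolute RTT increase threshold from the text
REL_THRESHOLD = 1.0       # relative RTT increase threshold (RTT at least doubles)


def classify_increase(rtt_day1_ms, rtt_day2_ms):
    """Check a region's daily RTT change against both thresholds.

    Assumes rtt_day1_ms > 0.
    """
    absolute = rtt_day2_ms - rtt_day1_ms
    relative = absolute / rtt_day1_ms
    over_absolute = absolute >= ABS_THRESHOLD_MS
    over_relative = relative >= REL_THRESHOLD
    return {
        "absolute_ms": absolute,
        "relative": relative,
        "over_absolute": over_absolute,
        "over_relative": over_relative,
        # An event exceeds at least one of the two thresholds.
        "is_event": over_absolute or over_relative,
    }


doubled = classify_increase(50.0, 110.0)    # small absolute, large relative change
slow_path = classify_increase(300.0, 420.0) # large absolute, small relative change
```

Because an event may exceed either threshold, the percentages in Table III overlap: 76.9% + 35.6% − 12.5% = 100%.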

V. LATLONG DIAGNOSIS OF LATENCY INCREASES

In this section, we apply our tool to study the events with large latency increases identified in the previous section. We first classify them into FE changes and latency increases at individual FEs. Then, we further classify the FE-change events according to whether they are caused by the latency map or by load balancing, and classify the FE latency increases according to the inter-domain routing changes they coincide with.

Our high-level results in this section are summarized in Table IV. Nearly three-quarters of these events were explained (at least in part) by a large increase in latency to reach an existing front-end server. These latency increases often coincided with a change in the ingress router or egress router (or both!); still, many had no visible interdomain routing change and were presumably caused by BGP routing changes on the forward path or by congestion or intradomain routing changes. Around one-third of the events involved a significant shift of client traffic to different front-end servers, often due to load-balancing decisions or changes in the CDN’s own view of the closest server. Nearly 9% of events involved both an “FE latency increase” and an “FE server change,” which is why the two categories sum to more than 100%.

A. FE Change vs. Latency Increase

Applying our tool to each event identified in the last section, we see that large increases in latency to reach existing servers (i.e., ΔLat) are responsible for more than two-thirds of the events with a large increase in round-trip time. To identify the cause of latency increases, we first show the CDF of ΔFE (traffic shift) and ΔLat (latency increase) for the events we study in Figure 4. The distributions are a reflection of each other (on both the x and y axes), because ΔFE and ΔLat sum to 1 for each event.

[Figure: CCDF of the daily RTT increase in 2010/06, on logarithmic axes; (a) Absolute RTT increase (ms), (b) Relative RTT increase]
Fig. 3. Distribution of daily RTT increase

[Figure: CDF (all events) of ΔFE and ΔLat; x-axis from −1 to 2]
Fig. 4. ΔFE and ΔLat for large events.

The graph shows that about half of the events have ΔFE

below 0.1, implying that shifts in trafﬁc from one FE to

another are not the major cause of large-latency events.

Still, trafﬁc shifts are responsible for some of the latency

increases—one event has a ΔFE of 5.83! (Note that we do

not show the very few points with extreme ΔFE or ΔLat

values, so we can illustrate the majority of the distribution

more clearly in the graph). In comparison, ΔLat is often fairly

high—in fact, more than 70% of these events have a ΔLat

higher than 0.5.

To classify these events, we apply a threshold to both distributions and identify whether ΔFE or ΔLat (or both) exceeds the threshold. Table V summarizes the results for thresholds 0.3, 0.4, and 0.5. These results show that, for a range of thresholds, around two-thirds of the events are explained primarily by an increase in latency between the clients and the FEs. For example, using a threshold of 0.4 for both distributions, 65% of events have a large ΔLat and another 9% of events have large values for both metrics, resulting in nearly three-quarters of the events caused (in large part) by increases in RTTs to select front-end servers. In the rest of the paper, we apply a threshold of 0.4 to divide events into the three categories in Table V. We choose 0.4 because a threshold of 0.5 makes the two categories mutually exclusive (ΔFE and ΔLat sum to 1, so both cannot exceed 0.5), while a threshold of 0.3 (where one factor contributes only 30% of the latency increase) admits less significant contributions than 0.4.

TABLE V
EVENTS CLASSIFIED BY ΔLat AND ΔFE

| | Threshold 0.3 | Threshold 0.4 | Threshold 0.5 |
|---|---|---|---|
| ΔLat | 61% | 65% | 71% |
| ΔFE | 23% | 26% | 29% |
| Both | 16% | 9% | 0% |
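The three-way classification can be sketched as below. The function name and category labels are assumptions for illustration; the comparison uses ≥, matching the “ΔFE ≥ 0.4” and “ΔLat ≥ 0.4” notation used for these event sets.

```python
def classify_event(delta_fe, delta_lat, threshold=0.4):
    """Label an event by which contribution(s) reach the threshold.

    delta_fe and delta_lat are the normalized contributions of traffic shifts
    and of latency increases at existing FEs; they sum to 1 for each event.
    """
    fe_large = delta_fe >= threshold
    lat_large = delta_lat >= threshold
    if fe_large and lat_large:
        return "both"
    if lat_large:
        return "latency increase"
    if fe_large:
        return "FE change"
    # Unreachable when the inputs sum to 1 and threshold <= 0.5;
    # kept as a guard against malformed input.
    return "neither"


labels = [
    classify_event(0.979, 0.021),  # dominated by a shift between FEs
    classify_event(0.27, 0.73),    # dominated by latency to an existing FE
    classify_event(0.45, 0.55),    # both contributions are large
    classify_event(5.83, -4.83),   # extreme ΔFE offset by a negative ΔLat
]
```

Since the two contributions sum to 1, at least one of them is always 0.5 or more, so every event falls into one of the three named categories for any threshold up to 0.5. This is consistent with each column of Table V summing to 100%.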

B. Normal Front-End Changes

To understand the normal distribution of latency-map changes, we calculate ΔLatMap for all of the regions (whether or not they experience a large increase in latency) on two consecutive days in June 2010. Figure 5 shows the results. For 76.9% of the regions, less than 10% of the requests change FEs because of changes to the latency map. For 85.7% of regions, less than 30% of traffic shifts to different front-end servers. Less than 10% of the regions see more than half of the requests changing front-end servers. Often, these changes involve shifts to another front-end server in a nearby geographic region.

However, note that the distribution of ΔLatMap has a long

tail, with some regions having 80% to 90% of the requests

shifting FEs. For these regions, changes in the measured

latency lead to changes in the latency map which, in turn,

lead to shifts in trafﬁc to different front-end servers. These

outliers are not necessarily a problem, though, since the FEs

on the second day may be very close to the FEs on the ﬁrst

day. To understand the impact of these trafﬁc shifts, we need

to consider the resulting latency experienced by the clients.

Figure 5 also shows the resulting distribution of ΔFEDist (i.e., the actual FE changes) for all client regions for one pair of consecutive days in June 2010. As expected, the distribution matches relatively closely with the distribution for ΔLatMap, though some significant differences exist. Sometimes the traffic shifts even though the latency map does not change. This is evident in the lower left part of the graph, where 40% of the client regions see little or no change to the latency map, but more than half of the regions experience as much as a 5% shift in traffic.

[Figure: CDF (normal cases) of ΔLatencyMap and ΔFEDistribution; x-axis from 0 to 1]
Fig. 5. Distribution of ΔLatMap and ΔFEDist across all client regions for one day in June 2010.

[Figure: CDF of ΔLoadBalance, with curves “Normal Cases” and “ΔFE Events”; x-axis from −0.6 to 0.6]
Fig. 6. Distribution of ΔLoadBalance for normal cases and events ΔFE ≥ 0.4.

We expect the load-balancing policy to routinely trigger

some small shifts in trafﬁc. Figure 6 plots the distribution of

ΔLoadBal for all client regions for a single day in June 2010,

as shown in the “Normal Cases” curve. As expected, around

30% of the client regions are directed to the closest front-

end server, as indicated by the clustering of the distribution

around ΔLoadBal = 0. In the next subsection, we show that

the large latency events coincide with larger shifts in trafﬁc,

as illustrated by the “ΔFE Events” curve in Figure 6.

C. Front-End Changes During Events

To understand the influence of traffic shifts during the events, we analyze the large-latency events where front-end changes are a significant contributor to the increase in latency (i.e., ΔFE ≥ 0.4); 35% of the events fall into this category, as shown earlier in Table V. Figure 7 plots the distributions of ΔLatMap and ΔFEDist for these events. For these events, the FE distribution still mostly agrees with the latency map. Compared with the curves in Figure 5, the events which experienced large latency increases have a stronger correlation with FE changes. According to the latency map, only 14% of events have fewer than 10% of requests changing FEs; 46% of the events have more than half of queries shifting FEs. Note that FE changes (e.g., to FEs in nearby geographical locations) do not necessarily lead to large latency increases, and may even improve user-perceived throughput by avoiding busy servers. That said, these FE changes can cause increases in round-trip time, so we need to understand how and why they happen.

[Figure: CDF (ΔFE events) of ΔLatencyMap and ΔFEDistribution; x-axis from 0 to 1]
Fig. 7. Distribution of ΔLatencyMap and ΔFEDistribution for events ΔFE ≥ 0.4.

We then calculate ΔLoadBal, the difference in the fraction of traffic directed by the load balancer from one day to the next. Figure 6 shows the distribution of ΔLoadBal for these events and for all client regions. As illustrated in the figure, 92.5% of the normal cases have less than 10% of requests shifted away from the closest front-end server. In contrast, for the ΔFE events, 27.7% of the events have a ΔLoadBal value greater than 10%; more than 9% of the events have a ΔLoadBal in excess of 0.3, suggesting that the load-balancing policy is responsible for many of the large increases in latency.

Based on the ΔLatMap and ΔLoadBal metrics, we classify the events into four categories: (i) correlated only with latency map changes, (ii) correlated only with load balancing changes, (iii) correlated with both latency-map changes and load balancing, and (iv) unknown. We choose the 85th and 90th percentiles of the distributions for the normal cases as the thresholds for ΔLatMap and ΔLoadBal. Table VI summarizes the results. Based on the 85th-percentile thresholds, 26.7% of the events are correlated with both changes to the latency map and load balancing; 40.8% of the events only with changes in the latency map; 8.3% of the events only with load balancing; and 24.3% of the events fall into the unknown category. The table also shows results for the 90th-percentile thresholds.

Note that in the “unknown” category, although the fraction of traffic shifting FEs is low, this does not mean that the FE change is not responsible for the latency increases, because what matters is the latency difference between the FEs, not only the fraction of traffic shifting FEs. For these events in the unknown category, we still need to analyze how much the latency differs between the FEs from one day to the next; we suspect that, while the fraction of traffic shifting is small, the absolute increase in latency may be high. Completing this analysis is part of our ongoing work.
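The four-way classification used in this subsection can be sketched as follows. Several details are assumptions made for this example rather than facts from our tool: the nearest-rank percentile method, the strict comparison against the thresholds, the function names, and the reading of the Table VI threshold pair (0.27, 0.06) as (ΔLatMap, ΔLoadBal).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_fe_change(delta_lat_map, delta_load_bal, map_threshold, lb_threshold):
    """Assign an FE-change event to one of the four categories."""
    map_changed = delta_lat_map > map_threshold
    load_balanced = delta_load_bal > lb_threshold
    if map_changed and load_balanced:
        return "both"
    if map_changed:
        return "latency map"
    if load_balanced:
        return "load balancing"
    return "unknown"


# Thresholds would normally come from the normal-case distributions, e.g.
# percentile(normal_delta_lat_map, 85); here we plug in the 85th-percentile
# pair reported in Table VI.
label = classify_fe_change(0.005, 0.323, map_threshold=0.27, lb_threshold=0.06)
```

With a ΔLatMap of 0.005 and a ΔLoadBal of 0.323, the example is labeled as load balancing. The same two-metric pattern would apply to the ingress and egress/AS-path metrics, with thresholds drawn from their own normal-case distributions.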

D. Inter-domain Routing Changes

In this subsection, we study the events where the round-trip time increases to existing front-end servers. We characterize the events based on the ΔIngress and ΔEgressASPath metrics, and classify events based on changes to the ingress router, the egress router and AS path, or both.

TABLE VI
CLASSIFICATION OF EVENTS WITH ΔFE ≥ 0.4

| Threshold (Percentile) | (0.27, 0.06) (85th) | (0.53, 0.08) (90th) |
|---|---|---|
| Latency Map | 40.8% | 23.8% |
| Load Balancing | 8.3% | 12.6% |
| Both | 26.7% | 18.4% |
| Unknown | 24.3% | 45.1% |

[Figure: CCDF of ΔIngress, with curves “Normal Cases” and “ΔLat Events”; y-axis cropped at 0.45]
Fig. 8. Ingress router shifts (ΔIngress).

For better insight into whether routing changes are respon-

sible for latency increases, we ﬁrst consider the prevalence of

routing changes for all client regions—when latency does not

necessarily increase signiﬁcantly—for a pair of consecutive

days in June 2010. Figure 8 shows the CCDF, with the y-axis

cropped at 0.45, to highlight the tail of the distribution where

clients experience a large shift in ingress routers. Signiﬁcant

routing changes are relatively rare for the “Normal Cases.” In

fact, 76.6% of the client regions experience no change in the

distribution of trafﬁc across ingress routers. Less than 7% of

the regions experience a shift of more than 10%. As such, we

see that shifts in where trafﬁc enters Google’s CDN network

do not occur often, and usually affect a relatively small fraction

of the trafﬁc.

However, large shifts in ingress routers are more common for the events where the round-trip time to a front-end server increases significantly (i.e., ΔLat ≥ 0.4), as shown by the “ΔLat Events” curve in Figure 8. The events we study have a much stronger correlation with changes in the ingress routers, compared with the normal cases. Though 55% of these events do not experience any change in ingress routers, 22.2% of events see more than a 10% shift, and 6.7% of the events see more than half of the traffic shifting ingress routers.

Similarly, we calculate ΔEgressASPath for both the normal cases and the ΔLat events, as illustrated in Figure 9. Compared with ingress changes, we see more egress and AS path changes, in part because we can distinguish routing changes at a finer level of detail since we see the AS path. For the normal cases, 63% of the client regions see no change in the egress router or the AS path; 91% see less than 10% of the traffic shifting egress router or AS path. In comparison, for the “ΔLat Events,” only 39% of the events see no changes in the egress routers and AS paths; 32% of the events see more than 10% of the traffic changing egress router or AS path, and 10% of the events see more than half of the traffic shifting egress routers and/or AS paths.

[Figure: CCDF of ΔEgressASPath, with curves “Normal Cases” and “ΔLat Events”]
Fig. 9. Egress router and AS path shifts (ΔEgressASPath).

TABLE VII
CLASSIFICATION OF EVENTS WITH ΔLat ≥ 0.4

| Thresholds (Percentile) | (0.025, 0.05) (85th) | (0.06, 0.09) (90th) |
|---|---|---|
| Ingress | 13.9% | 12.6% |
| Egress/AS-path | 19.6% | 17.4% |
| Both | 23.7% | 17.6% |
| Unknown | 42.7% | 52.5% |

Based on both of the routing indicators, we classify the events into four categories: (i) correlated only with ingress router changes, (ii) correlated only with changes in the egress router and AS path, (iii) correlated with both ingress changes and egress/AS-path changes, and (iv) unknown. To identify significant shifts, we look to the distributions for “Normal Cases” and consider the 85th and 90th percentiles for shifts in both ΔIngress and ΔEgressASPath. Table VII summarizes the results. Based on the 85th-percentile thresholds, 23.7% of the events are associated with large shifts in both the ingress routers and the egress/AS-path; 13.9% of the events are associated with ingress-router shifts; 19.6% of the events are associated with shifts in the egress router and AS path; and 42.7% of the events fall into the unknown category. We also show results using the 90th-percentile thresholds.

Note that around half of the events fall into the unknown

category, where we could not correlate latency increases

with large, visible changes to interdomain routing. Potential

explanations include AS-level routing changes on the forward

path (from the client to the front-end server) that do not affect

where trafﬁc enters Google’s CDN network. Intradomain

routing changes in individual ASes could also cause increases

in round-trip time without changing the ingress router, egress

router, or AS path seen by the CDN. Finally, congestion along

either the forward or reverse path could be responsible. These

results suggest that CDNs should supplement BGP and trafﬁc

data with ﬁner-grain measurements of the IP-level forwarding

path (e.g., using traceroute and reverse traceroute [5]) both

for better accuracy in diagnosing latency increases and to

drive new BGP path-selection techniques that make routing

decisions based on direct observations of performance.

VI. CASE STUDIES

For a better understanding of large latency increases, we

explore several events in greater detail. These case studies

illustrate the general challenges CDNs face in minimizing

wide-area latency and point to directions for future work.

Although many of these problems are known already, our case

studies highlight that these issues arise in practice and are

responsible for very large increases in latency affecting real

users.

A. Latency-Map Inaccuracies

During one day in June 2010, an ISP in the United States

saw the average round-trip time increase by 111 msec. Our

analysis shows that the RTT increased because of a shift of

trafﬁc to different front-end servers; in particular, ΔFE was

1.01. These shifts were triggered primarily by a change in

the latency map; in particular, ΔLatMap was 0.90. Looking

at the latency map in more detail revealed the reason for

the change. On the ﬁrst day, 78% of client requests were

directed to front-end servers in the United States, and 22%

were directed to servers in Europe. In contrast, on the second

day, all requests were directed to front-end servers in Europe.

Hence, the average latency increased because the clients were

directed to servers that were further away. The situation was

temporary, and the clients were soon directed to closer front-

end servers.

This case study points to the challenges of identifying the

closest servers and using DNS to direct clients to servers—

topics explored by several other research studies [6], [4], [7],

[8], [9]. Clients do not necessarily reside near their local DNS

servers, especially with the increasing use of services like

GoogleDNS and OpenDNS. Similarly, client IP addresses do

not necessarily fall in the same IP preﬁx as their local DNS

server. Further, DNS caching causes the local DNS server to

return the same IP address to many clients over a period of

time. All of these limitations of DNS make it difﬁcult for a

CDN to exert ﬁne-grain control over server selection. Recent

work at the IETF proposes extensions to DNS so requests

from local DNS servers include the client’s IP address [10],

which should go a long way toward addressing this problem.

Still, further research on efﬁcient measurement techniques and

efﬁcient, ﬁne-grain control over server selection would be very

useful.

B. Flash Crowd Leads to Load Balancing to Distant Front-End Servers

As another example, we saw the average round-trip time double for an ISP in Malaysia. The RTT increase was caused by a traffic shift to different front-end servers; in particular, ΔFE was 0.979. To understand why, we looked at the metrics for front-end server changes. First, we noticed that ΔLatMap was 0.005, suggesting that changes in the latency map were not responsible. Second, we observed that ΔFEDist = 0.34 and ΔLoadBal = 0.323, suggesting that load balancing was responsible for the shift in traffic. Looking at the client request rate, we noticed that the requests per day jumped significantly from the first day to the second; in particular, $RPD_2/RPD_1 = 2.5$. On the first day, all requests were served at front-end servers close to the clients; however, on the second day, 40% of requests were directed to alternate front-end servers that were farther away. This led to a large increase in the average round-trip time for the whole region.

This case study points to a general limitation of relying

on round-trip times as a measure of client performance. If,

on the second day, Google’s CDN had directed all client

requests to the closest front-end server, the user-perceived

performance would likely have been worse. Sending more

requests to an already-overloaded server would lead to slow

downloads for a very large number of clients. Directing some

requests to another server—even one that is further away—

can result in higher throughput for the clients, including the

clients using the remote front-end server. Understanding these

effects requires more detailed measurements of download

performance, and accurate ways to predict the impact of

alternate load-balancing strategies on client performance. We

believe these are exciting avenues for future work, to enable

CDNs to handle ﬂash crowds and other shifts in user demand

as effectively as possible.

C. Shift to Ingress Router Further Away from the Front-End Server

One day in June 2010, an ISP in Iran experienced an increase of 387 msec in the average RTT. We first determined that the RTT increase was mainly caused by a large increase in latency to reach a particular front-end server in western Europe. This front-end server handled 65% of the requests on both days. However, $\Delta Lat_i$ for this server was 0.73, meaning 73% of the increase in RTT was caused by an increase in latency to reach this front-end server. Looking at the routing changes, we saw a ΔIngress of 0.38. Analyzing the traffic by ingress router, we found that, on the first day, all of the traffic to this front-end server entered the CDN’s network at a nearby ingress router in western Europe. However, on the second day, nearly 40% of the traffic entered at different locations that were farther away: 21% in eastern Europe and 17% in the United States. Thus, the increase in RTT was likely caused by extra latency between the ingress router and the front-end server, and perhaps also by changes in latency for the clients to reach these ingress routers.
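As a consistency check, the reported ingress shift can be recomputed from the traffic fractions in this case study. The worked equation below assumes that the remaining 62% of the traffic on the second day still entered at the original ingress router in western Europe, which the text implies but does not state explicitly.

```latex
\[
\Delta Ingress
  = \frac{|0.62 - 1.00| + |0.21 - 0| + |0.17 - 0|}{2}
  = \frac{0.38 + 0.21 + 0.17}{2}
  = 0.38
\]
```

Under that assumption, the result agrees with the observed ΔIngress of 0.38.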

This case study points to a larger difﬁculty in controlling

inbound trafﬁc using BGP. To balance load over the ingress

routers, and generally reduce latency, a large AS typically

announces its preﬁxes at many locations. This allows other

ASes to select interdomain routes with short AS paths and

nearby peering locations. However, an AS has relatively

little control over whether other ASes can (and do) make

good decisions. In some cases, a CDN may be able to use

the Multi-Exit Discriminator (MED) attribute in BGP to

control how individual neighbor ASes direct trafﬁc, or perform

selective AS prepending or selective preﬁx announcements to

make some entry points more attractive than others. Still, this

is an area that is ripe for future research, to give CDNs more

control over how clients reach their services.

D. Shorter AS Paths Not Always Better

On another day in June 2010, an ISP in Mauritius experi-

enced a 113 msec increase in the average round-trip time. On

both days, more than half of the client requests were handled

by a front-end server in Asia—60% on the ﬁrst day and 74%

on the second day. However, on the second day, the latency

to reach this front-end server increased substantially. Looking

at the routing data, we see that trafﬁc shifted to a different

egress router and AS path. On the ﬁrst day, 56% of the trafﬁc

left Google’s CDN network in Asia. On the second day,

this number dropped to 10%, and nearly two-thirds of the

trafﬁc left the network in Europe over a shorter AS path.

Presumably, upon learning a BGP route with a shorter AS

path, the routers preferred this route over the “longer” path

through Asia. However, AS-path length is (at best) loosely

correlated with round-trip time, and in this case the “shorter”

path had a much higher latency.
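To see how a shift in egress traffic alone can raise a region's average round-trip time, the change in the traffic-weighted mean can be split into a part due to latency changes on each path and a part due to traffic moving between paths. The following is a minimal sketch of that generic decomposition, not the metrics defined earlier in the paper. The traffic shares and per-path RTT values are hypothetical round numbers, not the measured values from this case study.

```python
import math


def mean_rtt(shares, rtts):
    """Traffic-weighted mean RTT over egress paths."""
    return sum(shares[p] * rtts[p] for p in shares)


def decompose(shares1, rtts1, shares2, rtts2):
    """Split the change in mean RTT into (latency_term, shift_term).

    latency_term: day-1 shares times the per-path RTT change.
    shift_term:   change in shares times the day-2 per-path RTT.
    The two terms sum to mean_rtt(day 2) - mean_rtt(day 1).
    Assumes an RTT is known for every path on both days.
    """
    paths = set(shares1) | set(shares2)
    latency_term = sum(shares1.get(p, 0.0) * (rtts2[p] - rtts1[p]) for p in paths)
    shift_term = sum(
        (shares2.get(p, 0.0) - shares1.get(p, 0.0)) * rtts2[p] for p in paths
    )
    return latency_term, shift_term


# Hypothetical inputs: per-path RTTs (ms) are unchanged across the two days,
# but most traffic moves from the low-latency egress to the high-latency one.
rtts = {"egress-asia": 200.0, "egress-europe": 450.0}
day1 = {"egress-asia": 0.5, "egress-europe": 0.5}
day2 = {"egress-asia": 0.1, "egress-europe": 0.9}

latency_term, shift_term = decompose(day1, rtts, day2, rtts)
total = mean_rtt(day2, rtts) - mean_rtt(day1, rtts)
assert math.isclose(latency_term + shift_term, total)
print(f"latency term: {latency_term:.1f} ms, shift term: {shift_term:.1f} ms")
```

With these inputs the per-path latencies do not change at all, yet the mean rises by about 100 ms, all of it attributable to the shift term.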

This case study points to a larger problem with today’s interdomain routing system: routing decisions do not consider performance. The BGP decision process uses AS-path length as a (very) crude measure of performance, rather than considering measurements of actual performance along the end-to-end paths. Future work could explore lightweight techniques for measuring the performance along different interdomain paths, including the paths not currently selected for carrying traffic to clients. For example, recent work [11] introduces a “route injection” mechanism for sampling the performance on alternative paths. Once path performance is known, CDNs can optimize interdomain path selection based on performance, load, and cost. However, large CDNs with their own backbone network introduce two interesting twists on the problem of intelligent route control. First, the CDN selects interdomain routes at multiple egress points, rather than a single location. Second, the CDN can jointly control server selection and route selection for much greater flexibility in directing traffic.
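As a rough illustration of joint server and route selection, the sketch below scores each (front-end server, egress path) pair by a weighted sum of sampled RTT, server load, and path cost, and picks the best pair whose server is not overloaded. The `Option` type, the weights, the load threshold, and all names and numbers are assumptions made for this example. It is not a description of how any CDN selects routes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    server: str     # front-end server (hypothetical name)
    egress: str     # egress router / interdomain path (hypothetical name)
    rtt_ms: float   # sampled round-trip time for this pair
    load: float     # server utilization in [0, 1]
    cost: float     # relative transit cost of the path


def score(opt, w_rtt=1.0, w_load=100.0, w_cost=10.0):
    """Lower is better; the weights are illustrative assumptions."""
    return w_rtt * opt.rtt_ms + w_load * opt.load + w_cost * opt.cost


def best_option(options, max_load=0.9):
    """Pick the lowest-scoring (server, egress) pair among non-overloaded servers."""
    feasible = [o for o in options if o.load <= max_load]
    if not feasible:
        raise ValueError("no feasible (server, egress) pair")
    return min(feasible, key=score)


options = [
    Option("fe-asia", "egress-asia", rtt_ms=180.0, load=0.5, cost=2.0),
    Option("fe-asia", "egress-europe", rtt_ms=300.0, load=0.5, cost=1.0),
    Option("fe-europe", "egress-europe", rtt_ms=220.0, load=0.95, cost=1.0),
]
choice = best_option(options)
print(choice.server, choice.egress)  # fe-asia egress-asia
```

Treating the server and the egress path as one decision lets the selection avoid a pair that looks good on one dimension (for example, a cheaper path) but poor overall.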

VII. FUTURE RESEARCH DIRECTIONS

In this section, we briefly discuss several natural directions for future work on diagnosing wide-area latency increases for CDNs.

Direct extensions of our measurement study: First, we plan to extend our design in Section III to distinguish routing changes that affect the egress router from those that only change the AS path. Second, as discussed at the end of Section V-C, we plan to further explore the unexplained shifts in traffic from one front-end server to another. We suspect that some of these shifts are caused by a relatively small fraction of traffic shifting to a much more distant front-end server. To analyze this further, we plan to incorporate the RTT differences between front-end servers as part of our metrics for studying FE changes. Third, our case studies in Section VI required manual exploration, after automatically computing the various metrics. We plan to conduct more case studies and automate the analysis to generate reports for the network operators.

More accurate diagnosis: First, we plan to work with the groups that collect the measurement data to provide the data on a smaller timescale (to enable finer-grain analysis) and in real time (to enable real-time analysis). Second, we plan to explore better ways to track the performance data (including RTT and RPD) separately for each ingress router and egress/AS-path. Currently, the choice of ingress and egress routers is not visible to the front-end servers, where the performance data are collected. Third, we will explore techniques for correlating across latency increases affecting multiple customer regions. For example, correlating across interdomain routing changes that affect the AS paths for multiple client prefixes may enable us to better identify the root cause [12].
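One simple way to correlate routing changes across prefixes is to count, for each AS, how many affected prefixes gained or lost that AS on their path. The sketch below is a naive heuristic under that assumption and is not the inference algorithm of [12]. The function names are hypothetical, and the prefixes and AS numbers come from ranges reserved for documentation.

```python
from collections import Counter


def changed_ases(old_path, new_path):
    """ASes that appear on exactly one of the two paths."""
    return set(old_path) ^ set(new_path)


def rank_candidates(path_changes):
    """Rank ASes by the number of prefixes whose path change involves them.

    path_changes maps a client prefix to an (old_path, new_path) pair,
    where each path is a tuple of AS numbers.
    """
    counts = Counter()
    for old_path, new_path in path_changes.values():
        counts.update(changed_ases(old_path, new_path))
    return counts.most_common()


changes = {
    "192.0.2.0/24": ((64500, 64501, 64505), (64500, 64502, 64505)),
    "198.51.100.0/24": ((64500, 64501, 64506), (64500, 64502, 64506)),
    "203.0.113.0/24": ((64500, 64501, 64507), (64500, 64503, 64507)),
}
print(rank_candidates(changes)[0])  # (64501, 3)
```

Here AS 64501 disappears from the path of all three prefixes, which makes it the first place to look. A real diagnosis would also need to account for policy effects and limited vantage points.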

Incorporating additional data sets: We plan to investigate techniques for improving the visibility of the routing and performance changes from outside the CDN network. For example, active measurements, such as performance probes and traceroute (including both forward and reverse traceroute [13]), would help explain the “unknown” category for the ΔLat events, which we could not correlate with visible routing changes. In addition, measurements from the front-end servers could help estimate the performance of alternate paths, to drive changes to the CDN’s routing decisions to avoid interdomain paths offering poor performance.

VIII. RELATED WORK

CDNs have been widely deployed to serve Web content. In these systems, clients are directed to different servers to reduce latency and balance load. Our classification reveals the main causes of high latency between the clients and the servers. Early work [6] studied the effectiveness of DNS redirection and URL rewriting in improving client performance. That work characterized the size and number of the web objects served by CDNs, the number of distinct IP addresses used in DNS redirection, and content download time, and compared the performance of a number of CDNs. More recent work [14] evaluated the performance of two large-scale CDNs, Akamai and Limelight. Instead of measuring CDNs from end hosts, we design and evaluate techniques for a CDN to diagnose wide-area latency problems, using readily available traffic, performance, and routing data.

WhyHigh [3] combines active measurements with routing and traffic data to identify causes of persistent performance problems for some CDN clients. For example, WhyHigh identifies configuration problems and side-effects of traffic engineering that lead some clients to much higher latency than others in the same region. In contrast, our work focuses on detecting and diagnosing large changes in performance over time, and also considers several causes of traffic shifts from one front-end server to another. WhyHigh does not study the dynamics of latency increases caused by changes in FE server selection, load balancing, and interdomain routing. WISE [15] predicts the effects of possible configuration and deployment changes in the CDN. Our work is complementary in that, instead of studying planned maintenance and operations, we study how to detect and diagnose unplanned increases in latency.

PlanetSeer [16] uses passive monitoring to detect network path anomalies in the wide area, and uses active probes to characterize these anomalies (temporary vs. persistent, loops, routing changes). The focus of our work is different in that, instead of characterizing the end-to-end effects of performance anomalies, we study how to classify them according to their causes. Recent work [17] measures wide-area performance for CoralCDN using kernel-level TCP statistics, and identifies causes of performance problems such as server limits and the congestion window. In comparison, we focus on the causes of performance problems at the IP layer, related to the CDN network design and Internet routing.

Note that managing the wide-area performance of CDN services is a relatively new and heavily commercial topic, so few published papers describe how CDN management is done today.

IX. CONCLUSION

The Internet is increasingly a platform for users to access online services hosted on servers distributed throughout the world. Today, ensuring good user-perceived performance is a challenging task for the operators of large Content Distribution Networks (CDNs). In this paper, we presented the system design for automatically classifying large changes in wide-area latency for CDNs, and the results from applying our methodology to traffic, routing, and performance data from Google. Our techniques enable network operators to learn quickly about significant changes in user-perceived performance for accessing their services, and adjust their routing and server-selection policies to alleviate the problem.

Using only measurement data readily available to the CDN, we can automatically trace latency changes to shifts in traffic to different front-end servers (due to load-balancing policies or changes in the CDN’s own view of the closest server) and changes in the interdomain paths (to and from the clients). Our analysis and case studies suggest exciting avenues for future research to make the Internet a better platform for accessing and managing online services.

X. ACKNOWLEDGMENTS

We thank Roshan Baliga, Andre Broido and Mukarram

Tariq for their valuable feedback in the early stages of this

work, as well as Bo Fu for iterating on the implementation

details of the prototype. Special thanks to Ankur Jain for his

insightful comments on FE changes. We are also grateful to

Murtaza Motiwala, Srinivas Narayana, Vytautas Valancius, the

anonymous reviewers and the editors for their comments and

suggestions.

REFERENCES

[1] M. Szymaniak, D. Presotto, G. Pierre, and M. van Steen, “Practical large-scale latency estimation,” Computer Networks, 2008.

[2] Cisco NetFlow, http://www.cisco.com/en/US/products/ps6601/products_ios_protocol_group_home.html.

[3] R. Krishnan, H. V. Madhyastha, S. Srinivasan, and S. Jain, “Moving beyond end-to-end path information to optimize CDN performance,” in Proc. 2009 Internet Measurement Conference.

[4] Z. M. Mao, C. Cranor, F. Douglis, M. Rabinovich, O. Spatscheck, and J. Wang, “A precise and efficient evaluation of the proximity between web clients and their local DNS servers,” in Proc. 2002 USENIX Annual Technical Conference.

[5] E. Katz-Bassett, H. V. Madhyastha, V. K. Adhikari, C. Scott, J. Sherry, P. van Wesep, T. Anderson, and A. Krishnamurthy, “Reverse traceroute,” in Proc. 2010 Networked Systems Design and Implementation.

[6] B. Krishnamurthy, C. Wills, and Y. Zhang, “On the use and performance of content distribution networks,” in Proc. 2001 Internet Measurement Workshop.

[7] A. Shaikh, R. Tewari, and M. Agarwal, “On the effectiveness of DNS-based server selection,” in Proc. 2002 IEEE INFOCOM.

[8] J. Pang, A. Akella, A. Shaikh, B. Krishnamurthy, and S. Seshan, “On the responsiveness of DNS-based network control,” in Proc. 2004 Internet Measurement Conference.

[9] B. Ager, W. Muehlbauer, G. Smaragdakis, and S. Uhlig, “Comparing DNS resolvers in the wild,” in Proc. 2010 Internet Measurement Conference.

[10] C. Contavalli, W. van der Gaast, S. Leach, and D. Rodden, “Client IP information in DNS requests,” May 2010, Internet Draft, draft-vandergaast-edns-client-ip-01.

[11] Z. Zhang, M. Zhang, A. Greenberg, Y. C. Hu, R. Mahajan, and B. Christian, “Optimizing cost and performance in online service provider networks,” in Proc. 2010 Networked Systems Design and Implementation.

[12] A. Feldmann, O. Maennel, Z. M. Mao, A. Berger, and B. Maggs, “Locating Internet routing instabilities,” in Proc. 2004 ACM SIGCOMM.

[13] E. Katz-Bassett, H. V. Madhyastha, V. K. Adhikari, C. Scott, J. Sherry, P. van Wesep, T. Anderson, and A. Krishnamurthy, “Reverse traceroute,” in Proc. 2010 USENIX/ACM NSDI.

[14] C. Huang, A. Wang, J. Li, and K. W. Ross, “Measuring and evaluating large-scale CDNs,” Microsoft Research Technical Report MSR-TR-2008-106, 2008.

[15] M. B. Tariq, A. Zeitoun, V. Valancius, N. Feamster, and M. Ammar, “Answering ‘what-if’ deployment and configuration questions with WISE,” in Proc. 2008 ACM SIGCOMM.

[16] M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet path failure monitoring and characterization in wide-area services,” in Proc. 2004 OSDI.

[17] P. Sun, M. Yu, M. Freedman, and J. Rexford, “Identifying performance bottlenecks in CDNs through TCP-level monitoring,” in Proc. 2011 SIGCOMM Workshop on Measurements up the Stack.

Yaping Zhu is a software engineer at Google Inc. Before joining Google,

she was a research assistant in the Network Systems Group at Princeton

University. She also worked at AT&T Labs Research and NEC Labs America

on research and development of network management tools. Yaping Zhu

received her BS in computer science from Peking University in 2005, and

her MA and Ph.D. degrees in computer science from Princeton University in

2007 and 2011.

Benjamin Helsley is a software engineer at Google Inc., where he works on systems for network monitoring and configuration management. Prior to joining Google, Benjamin worked at MacDonald Dettwiler and Safe Software on algorithms for UAV cooperation and GIS data transformation. He received his BS in computer science from the University of British Columbia in 2008.

Jennifer Rexford is a Professor in the Computer Science department at Princeton University. From 1996 to 2004, she was a member of the Network Management and Performance department at AT&T Labs–Research. Jennifer is co-author of the book Web Protocols and Practice (Addison-Wesley, May 2001). She served as the chair of ACM SIGCOMM from 2003 to 2007, and is a senior member of the IEEE. Jennifer received her BSE degree in electrical engineering from Princeton University in 1991, and her MSE and Ph.D. degrees in computer science and electrical engineering from the University of Michigan in 1993 and 1996, respectively. She was the 2004 winner of ACM’s Grace Murray Hopper Award for outstanding young computer professional.

Aspi Siganporia is director of engineering at Google. From 1990 to 2006, he worked at various start-up and established networking companies in Silicon Valley, such as Netsys and Cisco Systems. Aspi received his BTech degree in electrical engineering from IIT Mumbai in 1982, and his MS degree in computer science from the University of Louisiana in 1986.

Sridhar Srinivasan is a software engineer at Google. Before joining Google,

he was a research assistant in the Networking and Telecommunications Group

at Georgia Tech. Sridhar received his B.E. in computer engineering from the

Delhi Institute of Technology, M.S. in computer science from Iowa State

University and Ph.D. in computer science from Georgia Tech.
