QuickCast: Fast and Efficient Inter-Datacenter
Transfers using Forwarding Tree Cohorts
Mohammad Noormohammadpour1, Cauligi S. Raghavendra1, Srikanth Kandula2, Sriram Rao2
1Ming Hsieh Department of Electrical Engineering, University of Southern California
2Microsoft
Abstract—Large inter-datacenter transfers are crucial for
cloud service efficiency and are increasingly used by organiza-
tions that have dedicated wide area networks between datacen-
ters. A recent work uses multicast forwarding trees to reduce
the bandwidth needs and improve completion times of point-to-
multipoint transfers. Using a single forwarding tree per transfer,
however, leads to poor performance because the slowest receiver
dictates the completion time for all receivers. Using multiple
forwarding trees per transfer alleviates this concern: the average receiver could finish early; however, if done naively, bandwidth usage would also increase, and it is a priori unclear how best
to partition receivers, how to construct the multiple trees and
how to determine the rate and schedule of flows on these trees.
This paper presents QuickCast, a first solution to these problems.
Using simulations on real-world network topologies, we see that
QuickCast can speed up the average receiver's completion time by as much as 10× while only using 1.04× more bandwidth; further, the completion time for all receivers also improves by as much as 1.6× at high loads.
Index Terms—Software Defined WAN; Datacenter; Schedul-
ing; Completion Times; Replication
I. INTRODUCTION
Software Defined Networking (SDN) is increasingly
adopted across Wide Area Networks (WANs) [1]. SDN allows
careful monitoring, management and control of status and
behavior of networks offering improved performance, agility
and ease of management. Consequently, large cloud providers,
such as Microsoft [2] and Google [3], have built dedicated
large scale WAN networks that can be operated using SDN
which we refer to as SD-WAN. These networks connect
dozens of datacenters for increased reliability and availability
as well as improved utilization of network bandwidth reducing
communication costs [4], [5].
Employing geographically distributed datacenters has many
benefits in supporting users and applications. Replicating
objects across multiple datacenters improves user-access la-
tency, availability and fault tolerance. For example, Content
Delivery Networks (CDNs) replicate objects (e.g. multimedia
files) across many cache locations, search engines distribute
large index updates across many locations regularly, and
VMs are replicated across multiple locations for scale out
of applications. In this context, Point to Multipoint (P2MP)
transfers (also known as One-to-Many transfers) are necessary.
A P2MP transfer is a special case of multicasting with a
single sender and a fixed set of receivers that are known a priori. (This is an extended version of a paper accepted for publication in IEEE INFOCOM 2018, Honolulu, HI, USA.) These properties together provide an opportunity for
network optimizations, such as sizable reductions in bandwidth
usage and faster completion times by using carefully selected
forwarding trees.
We review several approaches for performing P2MP trans-
fers. One can perform P2MP transfers as many independent
point-to-point transfers [4]–[7] which can lead to wasted
bandwidth and increased completion times. Internet multicas-
ting approaches [8] build multicast trees gradually as new
receivers join the multicast sessions, and do not consider the
distribution of load across network links while connecting
new receivers to the tree. This can lead to multicast trees that are far from optimal and larger than necessary, as well as poor load balancing. Application layer multicasting, such as [9], focuses
on use of overlay networks for building virtual multicast trees.
This may lead to poor performance due to limited visibility
into network link level status and lack of control over how
traffic is directed in the network. Peer-to-peer file distribution
techniques [10], [11] try to maximize throughput per receiver
locally and greedily which can be far from a globally optimal
solution. Centralized multicast tree selection approaches have
been proposed [12], [13] that operate on regular and structured
topologies of networks inside datacenters which cannot be
directly applied to inter-DC networks. Other related research either does not consider the elasticity of inter-DC transfers, which allows them to change their transmission rate according to available bandwidth [14], [15], or the interplay among many inter-DC transfers for global network-wide optimization [16], [17].
We recently presented a solution called DCCast [18] that
reduces tail completion times for P2MP transfers. DCCast
employs a central traffic engineering server with a global view
of network status, topology and resources to select forwarding
trees over which traffic flows from senders to all receivers.
This approach combines good visibility into and control over
networks offered by SDN with bandwidth savings achieved
by reducing the number of links used to make data transfers.
OpenFlow [19], a dominant SDN protocol, has supported
forwarding to multiple outgoing ports since v1.1 [20]; for
example, by using Group Tables and the OFPGT_ALL flag.
Several recent SDN switches support this feature [21]–[24].
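To make this concrete, the following is a minimal sketch (not from the paper) of how such a packet-replication group could be installed using the Ryu OpenFlow 1.3 controller framework; the function name, group id and port list are illustrative assumptions.

```python
# Sketch (assumption: Ryu controller, OpenFlow 1.3). Installs a group of type
# ALL whose buckets replicate packets to every listed output port, which is
# the switch-level mechanism used to branch a forwarding tree.
def install_replication_group(datapath, group_id, out_ports):
    ofp = datapath.ofproto
    parser = datapath.ofproto_parser
    # One bucket per outgoing branch of the forwarding tree at this switch.
    buckets = [parser.OFPBucket(0, ofp.OFPP_ANY, ofp.OFPG_ANY,
                                [parser.OFPActionOutput(port)])
               for port in out_ports]
    # OFPGT_ALL executes every bucket, i.e., replicates to all branches.
    datapath.send_msg(parser.OFPGroupMod(datapath, ofp.OFPGC_ADD,
                                         ofp.OFPGT_ALL, group_id, buckets))
```

A flow entry whose action list contains OFPActionGroup(group_id) would then steer a transfer's packets into this group at each branching switch.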
In this paper, we propose a new rate-allocation and tree
selection technique called QuickCast with the aim of minimiz-
ing average completion times of inter-DC transfers. QuickCast
reduces completion times by replacing a large forwarding tree
with multiple smaller trees each connected to a subset of
Fig. 1. Partitioning receivers into subsets can improve mean completion times
(this figure assumes FCFS rate-allocation policy, dashed blue transfer arrives
slightly earlier, all links have equal capacity of 1)
receivers which we refer to as a cohort of forwarding trees.
Next, QuickCast applies the Fair Sharing scheduling policy
which we show through simulations and examples minimizes
contention for available bandwidth across P2MP transfers
compared to other scheduling policies.
Despite offering bandwidth savings and reduced tail com-
pletion times, DCCast can suffer from significant increase in
completion times when P2MP transfers have many receivers.
As the number of receivers increases, forwarding trees grow
large creating many overlapping edges across transfers, in-
creasing contention and completion times. Since the edge with minimal bandwidth determines the overall throughput of a forwarding tree, we refer to this as the "Weakest Link" problem. We demonstrate the weakest link problem using a simple example. Figure 1 shows a scenario where two senders (the top two nodes) are transmitting over forwarding trees and they share a link (x → y). According to DCCast, which uses the First Come First Serve (FCFS) policy, the dashed blue transfer (on the left), which arrived just before the green one (on the right), is scheduled first, delaying the beginning of the green transfer until time T. As a result, all green receivers finish at time 2T. By using multiple trees, in this case two for the green sender, each tree can be scheduled independently, thereby reducing the completion time of the two receivers on the right from 2T to T. That is, we create a new tree that does not have the weakest link, which in this scenario is x → y for both the blue and green transfers.
To replace a large tree with a cohort of trees, we propose
partitioning receiver sets of P2MP transfers into multiple
subsets and using a separate forwarding tree per subset. This
approach can significantly reduce completion times of P2MP
transfers. We performed an experiment with DCCast over
random topologies with 50 nodes to determine if there is
benefit in partitioning all P2MP transfers. We simply grouped
receivers into two subsets according to proximity, i.e., shortest
path hop count between receiver pairs, and attached each
partition with an independent forwarding tree (DCCast+2CL).
As shown in Figure 2, this reduced completion times by 50%
while increasing bandwidth usage by 6% (not shown).
To further reduce completion times and decrease additional
bandwidth usage, partitioning should be only applied to P2MP
transfers that benefit from it. For example, using more than
one tree for the dashed blue transfer (on left) in Figure 1
will increase contention and will hurt completion times. By
carefully selecting P2MP transfers that benefit from partition-
Fig. 2. Comparison of completion times (mean and tail) of DCCast vs. DCCast+2CL (two-clustering according to proximity) for P2MP transfers with 10 and 20 receivers per transfer (normalized)
ing and using an independent forwarding tree per partition,
we can considerably improve average completion times with
small extra bandwidth usage. We refer to this approach as
selective partitioning. We limit the number of partitions per
transfer to two to minimize bandwidth overhead of additional
edges, minimize contention due to overlapping trees and limit
the number of network forwarding entries.
In addition to using multiple forwarding trees, we investigate various scheduling policies, namely FCFS (used in DCCast), Shortest Remaining Processing Time (SRPT), and Fair Sharing based on Max-Min Fairness (MMF) [25]. Although
SRPT is optimal for minimizing mean completion times while
scheduling traffic on a single link, we find that MMF beats
SRPT by a large factor as forwarding trees grow (which causes
many trees to share every link) and as offered load increases
(which increases contention for using available bandwidth).
Using a cohort of forwarding trees per P2MP transfer can
also increase reliability in two ways. First, by mitigating the
effect of weak links, a number of receivers will complete
reception earlier which reduces the probability of data loss
due to failures. Second, in case of a link failure, using more
than one tree reduces the probability of all receivers being
affected. In case subsequent trees are constructed in a link
disjoint manner, no single link failure can affect all receivers.
In summary, we make the following contributions:
1) We formulate a general optimization problem for mini-
mizing mean completion times while scheduling P2MP
transfers over a general network.
2) We propose QuickCast, which mitigates the “Weakest
Link” problem by partitioning receivers into multiple
subsets and attaching each partition with an independent
forwarding tree. Using multiple forwarding trees can
improve average completion time and reliability with
only a little additional bandwidth usage. We show that
partitioning should only be applied to transfers that
benefit from it, i.e., partitioning some transfers leads to
increased contention and worse completion times.
3) We explore well-known scheduling policies (FCFS,
SRPT and Fair Sharing) for central rate-allocation over
forwarding trees and find MMF to be the most effective
in maximizing overall bandwidth utilization and reduc-
ing completion times.
4) We perform extensive simulations using both synthetic
and real inter-datacenter traffic patterns over real WAN
topologies. We find that the performance gain of QuickCast depends on the offered load and show that under heavy loads, QuickCast can improve mean times by as much as 10× and tail times by as much as 1.57× while imposing a very small increase in typical bandwidth usage (only 4%) compared to DCCast.
The rest of this paper is organized as follows. In Section II,
we state the P2MP scheduling problem, explain the constraints
and variables used in the paper and formulate the problem as
an optimization scenario. In Section III, we present QuickCast
and the two procedures it is based on. In Section IV, we
perform abstract simulations to evaluate QuickCast and in
Section V we discuss practical considerations and issues. At
the end, in Section VI, we provide a comprehensive overview
of related research efforts.
II. ASSUMPTIONS AND PROBLEM STATEMENT
Similar to prior work on inter-datacenter networks [4]–[7], [18], [26], [27], we assume an SD-WAN managed by a Traffic Engineering (TE) server which receives transfer requests from end-points, performs rate-allocations and manages the forwarding plane. Transfers arrive at the TE server in an online fashion and are serviced as they arrive. Requests are specified with four parameters: arrival time, source, set of receivers, and size (volume in bytes). End-points apply rate-limiting to minimize congestion. We consider a slotted timeline to allow for flexible rate-allocation while limiting the number of rate changes, which allows time to converge to specified rates and minimizes rate-allocation overhead [6], [26]. We focus
on long running transfers that deliver large objects to many
datacenters such as applications in [5]. For such transfers,
small delays are usually acceptable, including overhead of
centralized scheduling and network configurations. To reduce
configuration overhead (e.g. latency and control plane failures
[28]), we assume that a forwarding tree is not changed once
configured on the forwarding plane.
To reduce complexity we assume that end-points can ac-
curately rate-limit data flows and that they quickly converge
to required rates; that there are no packet losses due to
congestion, corruption or errors; and that scheduling is done
for a specific class of traffic meaning all requests have the
same priority. In Section V, we will discuss ways to deal with
cases when some of these assumptions do not hold.
A. Problem Formulation
Earlier we showed that partitioning of receiver sets can
improve completion times via decreasing network contention.
However, to further improve completion times, partitioning
should be done according to transfer properties, topology
and network status. Optimal partitioning for minimization of completion times is an open problem and requires finding a solution to a complex joint optimization model that takes into account forwarding tree selection and rate-allocation.
TABLE I
DEFINITION OF VARIABLES

Variable | Definition
t, t_now | Some timeslot and the current timeslot
N | Total number of receivers per transfer
e | A directed edge
C_e | Capacity of e in bytes per second
(x, y) | A directed edge from x to y
G | A directed inter-datacenter network graph
T | Some directed tree connecting a sender to its receivers
V_G, V_T | Sets of vertices of G and T
E_G, E_T | Sets of edges of G and T
B_e | Current available bandwidth on edge e
B_T | Current available bandwidth over tree T
δ | Width of a timeslot in seconds
R_i | A transfer request, where i ∈ I
I | Set of request indices, I = {1, ..., I}
O_i | Data object associated with R_i
S_{R_i} | Source datacenter of R_i
A_{R_i} | Arrival time of R_i
V_{R_i} | Original volume of R_i in bytes
V^r_{R_i} | Residual volume of R_i (V^r_{R_i} = V_{R_i} at t = A_{R_i})
D_{R_i} | Set of destinations of R_i
n | Maximum number of subsets (partitions) allowed per receiver set
P^j_i | Set of receivers of R_i in partition j ∈ {1, ..., n}
q_i | Total bytes used to deliver O_i to D_{R_i}
T^j_i | Forwarding tree for partition j ∈ {1, ..., n} of R_i
L_e | e's total outstanding load, i.e., L_e = Σ_{i,j : e ∈ T^j_i} V^r_{R_i}
f^j_i(t) | Transmission rate of R_i on T^j_i at timeslot t
γ^j_i(t) | Whether R_i is transmitted over T^j_i at timeslot t
θ^j_{i,e} | Whether edge e ∈ E_G is on T^j_i
ν_{i,j,v} | Whether v ∈ D_{R_i} is in P^j_i
M^j_i | {∇ | ∇ ⊂ V_G, ∇ ∩ (P^j_i ∪ {S_{R_i}}) ≠ ∅, (V_G − ∇) ∩ (P^j_i ∪ {S_{R_i}}) ≠ ∅}
E(∇) | {e = (x, y) | x ∈ ∇, y ∈ (V_G − ∇)}
We first point out the most basic constraint that arises from using forwarding trees. Table I provides the list of variables we will use. For any tree T, packets flow from the source to the receivers with the same rate r_T, which satisfies:
r_T ≤ B_T = min_{e ∈ E_T} (B_e)    (1)
We formulate the problem as an online optimization scenario. Assuming a set of requests R_i, i ∈ {1, ..., I}, already in the system, upon arrival of a new request R_{I+1}, an optimization problem needs to be formulated and solved to find rate-allocations, partitions and forwarding trees. Figure 3 shows the overall optimization problem. This model considers up to n ≥ 1 partitions per transfer. Demands of existing requests are updated to their residuals upon arrival of new requests. The objective is formulated in a hierarchical fashion, giving higher priority to minimizing mean completion times and then to reducing bandwidth usage. The purpose of the γ^j_i(t) indicator variables is to calculate the mean times: the latest timeslot over which we have f^j_i(t) > 0 determines the completion time of partition j of request i. These completion times are
minimize  Σ_{i ∈ I} Σ_{j ∈ {1,...,n}} |P^j_i| Σ_{t > A_{R_{I+1}}} (t − A_{R_{I+1}}) γ^j_i(t) Π_{t' > t} (1 − γ^j_i(t'))
          + (Σ_{i ∈ I} q_i) / (n |E_G| Σ_{i ∈ I} V_{R_i})

subject to

Calculate total bandwidth usage:
(1)  q_i = Σ_{e ∈ E_G} (Σ_{j ∈ {1,...,n}} θ^j_{i,e}) V^r_{R_i}          ∀i

Demand satisfaction constraints:
(2)  Σ_t γ^j_i(t) f^j_i(t) = V^r_{R_i} / δ                              ∀i, j

Capacity constraints:
(3)  Σ_{i ∈ I} Σ_{j ∈ {1,...,n}} θ^j_{i,e} f^j_i(t) ≤ C_e               ∀t, e

Steiner tree constraints [29]:
(4)  Σ_{e ∈ E(∇)} θ^j_{i,e} ≥ 1                                         ∀i, j, ∇ ∈ M^j_i

Basic range constraints:
(5)  γ^j_i(t) = 0, f^j_i(t) = 0                                         ∀i, j, t < A_{R_{I+1}}
(6)  f^j_i(t) ≥ 0                                                       ∀i, j, t
(7)  θ^j_{i,e} ∈ {0, 1}                                                 ∀i, j, e
(8)  γ^j_i(t) ∈ {0, 1}                                                  ∀i, j, t
(9)  ν_{i,j,v} ∈ {0, 1}                                                 ∀i, j, v ∈ D_{R_i}
(10) γ^j_i(t) = 0                                                       ∀i, j, t < A_{R_i}

Fig. 3. Online optimization model; variables defined in Table I
then multiplied by the partition size to obtain the total sum of completion times per receiver. Constraint (4) ensures that there is a connected subgraph across the sender and receivers of each partition of each request, which is similar to the constraints used to find minimal edge Steiner trees [29]. Since our objective is an increasing function of bandwidth, these constraints eventually lead to minimal trees connecting any sender to its receivers while not increasing mean completion times (the second part of the objective, which minimizes bandwidth, is necessary to ensure no stray edges).
B. Challenges
We focus on two objectives: first minimizing the completion times of data transfers, and then minimizing total bandwidth usage. This is a complex optimization problem for a variety of reasons. First, breaking a receiver set into several partitions leads to an exponential number of possibilities. Moreover, the optimization version of the Steiner tree problem, which aims to find minimal edge or minimal weight trees, is a hard problem [29]. Completion times of transfers then depend on how
partitions are formed and which trees are used to connect
partitions to senders. In addition, the scenario is naturally an
online problem which means even if we were able to compute
an optimal solution for a given set of transfers in a short
amount of time, we still would not be able to compute a
solution that is optimal over longer periods of time due to
incomplete knowledge of future arrivals.
III. QUICKCAST
We present our heuristic approach, called QuickCast, in Algorithm 1, with the objective of reducing mean completion times of elastic P2MP inter-DC transfers. We first review the concepts behind the design of this heuristic, namely rate-allocation, partitioning, forwarding tree selection and selective partitioning. Next, we discuss how Algorithm 1 realizes these concepts using two procedures, one executed upon arrival of new transfers and the other once per timeslot.
A. Rate-allocation
To compute rates per timeslot, we explore well-known
scheduling policies: FCFS, SRPT and Fair Sharing. Although
simple, FCFS can lead to increased mean times if large
transfers block multiple edges by fully utilizing them. SRPT
is known to offer optimal mean times over a single link but
may lead to starvation of large transfers. QuickCast uses Fair
Sharing based on MMF policy.
To understand the effect of different scheduling policies, let us consider the following example. Figure 4 shows a scenario where multiple senders have initiated trees with multiple branches and they share links along the way to their receivers. SRPT gives a higher priority to the top transfer with size 10 and then to the next smallest transfer and so on. When the first transfer is being sent, all other transfers are blocked due to shared links. This occurs again when the next transfer begins. Scheduling according to FCFS leads to the same result. In this example, the mean completion times for both FCFS and SRPT are about 1.16× larger than with MMF. In Section IV, we perform simulation experiments that confirm the outcome in this example. We find that as trees grow larger and under high utilization, the benefit of using MMF over FCFS or SRPT becomes more significant due to increased contention. We also observe that tail times grow much faster for SRPT compared to both MMF and FCFS (since it also suffers from the starvation problem) while scheduling over forwarding trees with many receivers, which increases SRPT's mean completion times.
B. Partitioning
There are two configuration parameters for grouping re-
ceivers into multiple partitions: partitioning criteria and num-
ber of partitions. In general, partitioning may lead to higher
bandwidth usage. However, it may be the case that a small
increase in bandwidth usage can considerably improve com-
pletion times. Efficient and effective partitioning to minimize
completion times and bandwidth usage is a complex open
problem. We discuss our approach in the following.
Partitioning Criteria: We focus on minimizing extra
bandwidth usage while breaking large forwarding trees via
partitioning. Generally, one could select partitions according
to current network conditions, such as distribution of load
Fig. 4. Fair Sharing can offer better mean times compared to both SRPT and FCFS while scheduling over forwarding trees; all links have capacity of 1. Four transfers with volumes 10, 11, 12 and 13 and arrival times 0, 1, 2 and 3 deliver to receivers A, B, C and D over trees that pairwise share links (A&B, A&C, A&D, B&C, B&D, C&D). Completion times: FCFS/SRPT give A = 10, B = 21, C = 33, D = 46 (mean 27.5); MMF gives A = 19, B = 23, C = 26, D = 27.5 (mean 23.87).
across edges. However, we notice that network conditions are continuously changing as current transfers finish and new transfers arrive. Minimizing bandwidth, on the other hand, appears to be a globally desirable objective for partitioning and was hence chosen. To find partitions, QuickCast groups receivers according to proximity until we are left with the desired number of groups, each forming a partition. The distance between two receivers is computed as the number of hops on the shortest path between them. With this approach, a partition requires a minimal number of edges to connect the nodes within it. Reassigning any receiver to another partition would increase the number of edges and thus the consumed bandwidth.
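For illustration, the following is a minimal sketch, under our own simplifying assumptions (an undirected view of the topology, the networkx and scipy libraries, at least two receivers), of grouping receivers by hop-count proximity with average-linkage agglomerative clustering; the function partition_receivers and its arguments are hypothetical names, not from the paper.

```python
# Sketch: group receivers by hop-count proximity using average-linkage
# agglomerative clustering (assumptions: undirected topology, >= 2 receivers).
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def partition_receivers(G, receivers, k=2):
    # Pairwise hop-count distances between receivers (shortest path lengths).
    hops = dict(nx.all_pairs_shortest_path_length(G))
    dist = np.array([[hops[a][b] for b in receivers] for a in receivers], float)
    # Condensed distance matrix -> average-linkage hierarchy -> k clusters.
    labels = fcluster(linkage(squareform(dist), method='average'),
                      t=k, criterion='maxclust')
    groups = {}
    for node, label in zip(receivers, labels):
        groups.setdefault(label, []).append(node)
    return list(groups.values())
```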
Number of Partitions: The right number of partitions
per transfer depends on factors such as topology, number
of receivers, forwarding trees of other transfers, source and
destinations of a transfer, and overall network load. In the
extreme case of N partitions, a P2MP transfer is broken into N unicast transfers, which significantly increases bandwidth usage. Partitioning is most effective if the forwarding trees assigned to partitions do not increase overall contention for network bandwidth, i.e., the number of overlapping edges across new trees is minimal. Therefore, increasing the number of partitions beyond the connectivity degree of datacenters may offer minor gains or even degrade performance (e.g., in the case of Google B4 [5], the minimum and maximum connectivity degrees are 2 and 4, respectively). From a practical standpoint, the number of partitions, and hence forwarding trees, determines the number of Group Table rules that need to be set up in network switches. Therefore, we focus on partitioning receivers into up to 2 groups, each assigned an independent forwarding tree. Exploration of the effects of more partitions is left for future work.
C. Forwarding Tree Selection
After computing two partitions, QuickCast assigns an in-
dependent forwarding tree per partition using tree selection
approach presented in [18] which was shown to provide
high bandwidth savings and improved completion times. It
operates by giving a weight of We= (Le+VR)(see Table
Ifor definition of variables) and then selecting the minimum
weight forwarding tree. This technique allows load balancing
of P2MP data transfers over existing trees according to total
bandwidth scheduled on the edges. It also takes into account
transfer volumes while selecting trees. Particularly, larger
transfers are most likely assigned to smaller trees to minimize
bandwidth usage while smaller transfers are assigned to least
loaded trees (regardless of tree size). This approach becomes
more effective when a small number of transfers are orders
of magnitude larger than median [30], as number of larger
forwarding trees is usually significantly larger than smaller
trees on any graph.
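The following is a minimal sketch of this weight assignment and tree selection step, assuming an undirected graph and using networkx's Steiner tree approximation in place of an exact minimum weight Steiner tree; the helper name select_forwarding_tree and the load dictionary are illustrative, not from the paper.

```python
# Sketch: assign W_e = (L_e + V_R) and pick an approximate minimum weight
# Steiner tree spanning the sender and one partition's receivers.
# Assumptions: undirected graph; 'load' maps edges to outstanding load L_e;
# networkx's 2-approximation stands in for an exact Steiner tree.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def select_forwarding_tree(G, load, source, receivers, volume):
    for u, v in G.edges():
        # Penalize edges that already carry outstanding traffic, scaled by
        # the size of the new transfer (W_e = L_e + V_R).
        G[u][v]['weight'] = load.get((u, v), load.get((v, u), 0.0)) + volume
    T = steiner_tree(G, [source] + list(receivers), weight='weight')
    total_weight = sum(G[u][v]['weight'] for u, v in T.edges())
    return list(T.edges()), total_weight
```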
D. Selective Partitioning
Partitioning is beneficial only if it decreases or minimally
increases bandwidth usage and contention over resources
which necessitates selectively partitioning the receivers. When
we chose the two partitions by grouping receivers and after
selecting a forwarding tree for every group, QuickCast cal-
culates the total weight of each forwarding tree by summing
up weights of their edges. We then compare sum of these
two weights with no partitioning case where a single tree was
used. If the total weight of two smaller trees is less than some
partitioning factor (shown as pf) of the single tree case, we
accept to use two trees. If pfis close to 1.0, partitioning occurs
only if it incurs minimal extra weight, i.e., (pf1) times
weight of the single forwarding tree that would have been
chosen if we applied no partitioning. With this approach, we
most likely avoid selection of subsequent trees that are either
much larger or much more loaded than the initial tree in no
partitioning case. Generally, an effective pfis a function of
traffic distribution and topology. According to our experiments
with several traffic distributions and topologies, choosing it in
the range of 1.0pf1.1offers the best completion times
and minimal bandwidth usage.
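Combining the previous two sketches, the selective-partitioning test could look as follows; this is an illustrative sketch that reuses the hypothetical helpers above and omits the edge-load update that would follow acceptance of a plan.

```python
# Sketch: accept two partitions only if their combined tree weight is within
# a factor pf of the single-tree weight (selective partitioning).
def plan_transfer(G, load, source, receivers, volume, pf=1.1):
    single_tree, w_single = select_forwarding_tree(G, load, source,
                                                   receivers, volume)
    groups = partition_receivers(G, list(receivers), k=2)
    candidates = [select_forwarding_tree(G, load, source, grp, volume)
                  for grp in groups]
    if sum(w for _, w in candidates) <= pf * w_single:
        # Two smaller trees, each scheduled independently per partition.
        return list(zip(groups, [tree for tree, _ in candidates]))
    # Otherwise keep a single tree spanning all receivers.
    return [(list(receivers), single_tree)]
```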
E. QuickCast Algorithm
A TE server is responsible for managing elastic transfers
over inter-DC network. Each partition of a transfer is man-
aged independently of other partitions. We refer to a transfer
partition as active if it has not been completed yet. TE server
keeps a list of active transfer partitions and tracks them at
every timeslot. A TE server running QuickCast algorithm uses
two procedures as shown in Algorithm 1.
Submit(R,n,pf): This procedure is executed upon arrival of a new P2MP transfer R. It performs partitioning and forwarding tree selection for the new transfer given its volume, source and destinations. We consider the general case where we may have up to n partitions per transfer. First, we compute edge weights based on the current traffic matrix and the volume of the new transfer. We then build the agglomerative hierarchy of receivers using average linkage and considering proximity as the clustering metric. Agglomerative clustering is a bottom-up approach where, at every level, the two closest clusters are merged into one cluster. The distance of any two clusters is computed using average linkage, which is the average over pairwise distances of nodes in the two clusters. The distance between every pair of receivers is the number of edges on the shortest path from one to the other. It should be noted that although our networks are directed, all edges are considered to be bidirectional and so the distance in either direction between any two nodes is the same. When the hierarchy is ready, we start from the level where there are n clusters (or at the bottom if the total number of receivers is less than or equal to n) and compute the total weight of n forwarding trees (minimum weight Steiner trees) to these clusters. We move forward with this partitioning if the total weight is less than pf times the weight of the forwarding tree that would have been selected if we grouped all receivers into one partition. Otherwise, the same process is repeated while moving up one level in the clustering hierarchy (one less cluster). If we accept a partitioning, this procedure first assigns a forwarding tree to every partition while continuously updating edge weights. It then returns the partitions and their forwarding trees.
DispatchRates(): This procedure is executed at the beginning of every timeslot. It calculates rates per active transfer partition according to the MMF rate-allocation policy. New transfers arriving somewhere within a timeslot are allocated rates starting the next timeslot. To calculate the residual demands needed for rate calculations, senders report back the actual volume of data delivered during the past timeslot per partition. This allows QuickCast to cope with inaccurate rate-limiting and packet losses, which may prevent a transfer from fully utilizing its allotted share of bandwidth.
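As a concrete illustration, the following is a minimal sketch of a per-timeslot max-min fair rate computation in the spirit of procedure DispatchRates (Algorithm 1), assuming unit edge capacities and simple dictionaries for trees and residual demands; the names are illustrative, not the paper's implementation.

```python
# Sketch: max-min fair rates over forwarding trees, mirroring the structure of
# procedure DispatchRates. Each active partition p has a tree trees[p] (a set
# of edges) and a residual demand residual[p]; every edge has capacity 1.0.
def dispatch_rates(trees, residual, delta=1.0):
    # Number of forwarding trees sharing each edge.
    count = {}
    for p, edges in trees.items():
        for e in edges:
            count[e] = count.get(e, 0) + 1
    cap = {e: 1.0 for e in count}
    rate = {}
    remaining = set(trees)
    while remaining:
        # Fair share of each remaining partition = bottleneck of cap/count.
        share = {p: min(cap[e] / count[e] for e in trees[p]) for p in remaining}
        p0 = min(share, key=share.get)            # most constrained partition
        rate[p0] = min(share[p0], residual[p0] / delta)
        remaining.remove(p0)
        for e in trees[p0]:                        # release its claim on edges
            count[e] -= 1
            cap[e] -= rate[p0]
    return rate
```

Replaying the example of Figure 4 with this routine yields fair shares of 0.5 whenever two or more trees overlap on a link, which is consistent with the completion times of 19, 23, 26 and 27.5 reported there.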
IV. EVALUATIONS
We considered various topologies and transfer size distri-
butions as in Tables II and III. For simplicity, we considered a uniform capacity of 1.0 for all edges, accurate rate-limiting at end-points, no dropped packets due to congestion or corruption, and no link failures. Transfer arrivals followed a Poisson distribution with rate λ. For all simulations, we considered a partitioning factor of pf = 1.1 and a timeslot length of δ = 1.0. Unless otherwise stated, we assumed a fixed λ = 1.0. Also, for all traffic distributions, we considered an average demand equal to the volume of 20 full timeslots per transfer. For the heavy-tailed distribution, which is based on the Pareto distribution, we used a minimum transfer size equal to that of 2 full timeslots. Finally, to prevent the generation of intractably large transfers, we limited the maximum transfer volume to that of 2000 full timeslots. We focus on scenarios with no link failures to evaluate gains.
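As an illustration of this setup, the following is a minimal sketch of one way to generate such a workload with numpy; the function name and the derivation of the Pareto shape parameter are our own assumptions rather than the paper's implementation.

```python
# Sketch: synthetic workload matching the stated setup (assumption: numpy).
# Poisson arrivals with rate lam; Pareto sizes with minimum 2 timeslots,
# mean 20 timeslots, capped at 2000 timeslots.
import numpy as np

def generate_workload(num_transfers, lam=1.0, mean_size=20.0, min_size=2.0,
                      max_size=2000.0, seed=0):
    rng = np.random.default_rng(seed)
    # Poisson process: exponential inter-arrival times with rate lam.
    arrivals = np.cumsum(rng.exponential(1.0 / lam, num_transfers))
    # Pareto(alpha, x_m) has mean alpha*x_m/(alpha-1); choose alpha so the
    # (uncapped) mean equals mean_size given scale x_m = min_size.
    alpha = mean_size / (mean_size - min_size)
    sizes = np.minimum((rng.pareto(alpha, num_transfers) + 1.0) * min_size,
                       max_size)
    return list(zip(arrivals, sizes))
```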
A. Comparison of Scheduling Policies over Forwarding Trees
We first compare the performance of three well-known
scheduling policies of FCFS, SRPT and Fair Sharing (based on
Algorithm 1: QuickCast

Submit (R, n, pf)
  Input: R(V_R, S_R, D_R), n (= 2 in this paper), pf, G, L_e for e ∈ E_G (variables defined in Table I)
  Output: Pairs of (Partition, Forwarding Tree)

  ∀α, β ∈ D_R, α ≠ β: DIST_{α,β} ← number of edges on the shortest path from α to β;
  To every edge e ∈ E_G, assign weight W_e = (L_e + V_R);
  Find the minimum weight Steiner tree T_R that connects S_R ∪ D_R, and its total weight W_{T_R};
  for k = n to k = 2 by −1 do
    Agglomeratively cluster D_R using average linkage and the distance metric DIST calculated above, until only k clusters are left, forming P^i_R, i ∈ {1, ..., k};
    foreach i ∈ {1, ..., k} do
      Find W_{T_{P^i_R}}, the weight of the minimum weight Steiner tree that connects S_R ∪ P^i_R;
    if Σ_{i ∈ {1,...,k}} W_{T_{P^i_R}} ≤ pf × W_{T_R} then
      foreach i ∈ {1, ..., k} do
        Find the minimum weight Steiner tree T_{P^i_R} that connects S_R ∪ P^i_R;
        L_e ← L_e + V_R, ∀e ∈ T_{P^i_R};
        Update W_e = (L_e + V_R) for all e ∈ E_G;
      return P^i_R as well as T_{P^i_R}, ∀i ∈ {1, ..., k};
  L_e ← L_e + V_R, ∀e ∈ T_R;
  return D_R and T_R;

DispatchRates ()
  Input: Set of active request partitions P, their current residual demands V^r_P and forwarding trees T_P, ∀P ∈ P, and timeslot width δ
  Output: Rate per active request partition for the next timeslot

  COUNT_e ← number of forwarding trees T_P, P ∈ P, sharing edge e, ∀e ∈ E_G;
  P' ← P and CAP_e ← 1, ∀e ∈ E_G;
  while |P'| > 0 do
    foreach P ∈ P' do
      SHARE_P ← min_{e ∈ T_P} (CAP_e / COUNT_e);
    P_0 ← a partition P with minimum SHARE_P value;
    RATE_{P_0} ← min(SHARE_{P_0}, V^r_{P_0} / δ);
    P' ← P' − {P_0};
    COUNT_e ← COUNT_e − 1, ∀e ∈ T_{P_0};
    CAP_e ← CAP_e − RATE_{P_0}, ∀e ∈ T_{P_0};
  return RATE_P, ∀P ∈ P
TABLE II
VARIOUS TOPOLOGIES USED IN EVALUATION

Name | Description
Random | Randomly generated and strongly connected with 50 nodes and 150 edges. Each node has a minimum connectivity of two.
GScale [5] | Connects Google datacenters across the globe with 12 nodes and 19 links.
Cogent [31] | A large backbone and transit network that spans USA and Europe with 197 nodes and 243 links.

TABLE III
TRANSFER SIZE DISTRIBUTIONS USED IN EVALUATION

Name | Description
Light-tailed | According to Exponential distribution.
Heavy-tailed | According to Pareto distribution.
Facebook Cache-Follower [30] | Generated across Facebook inter-datacenter networks running cache applications.
Facebook Hadoop [30] | Generated across Facebook inter-datacenter networks running geo-distributed analytics.
MMF). We used the weight assignment in [18] for forwarding tree selection and considered the Random topology in Table II. We considered both light-tailed and heavy-tailed distributions. All policies used an almost identical amount of bandwidth (not shown). Under light load, we obtained results similar to scheduling traffic on a single link, where SRPT performs better than Fair Sharing (not shown). Figure 5 shows the results of our experiment under heavy load. When the number of receivers is small, SRPT is the best policy to minimize mean times. However, as we increase the number of receivers (larger trees), Fair Sharing offers better mean and tail times. This simply occurs because the contention due to overlapping trees, caused by prioritizing transfers over one another (either according to residual size in the case of SRPT or arrival order in the case of FCFS), increases as more transfers are submitted or as transfers grow in size.
B. Bandwidth usage of partitioning techniques
We considered three partitioning techniques and measured
the average bandwidth usage over multiple runs and many
timeslots. We used the topologies in Table II and traffic
patterns of Table III. Figure 6 shows the results. We calculated the lower bound by considering a single minimal edge Steiner tree per transfer. The other schemes considered are: Random(Uniform Dist), which breaks each set of receivers into two partitions by randomly assigning each receiver to one of the two partitions with equal probability; Agg(proximity between receivers), which clusters receivers according to closeness to each other; and Agg(closeness to source), which clusters receivers according to their distance from the source (receivers closer to the source are bundled together). As can be seen, Agg(proximity between receivers), which is used by QuickCast, provides the least bandwidth overhead (up to 17% above the lower bound). In general, breaking receivers into subsets that are attached to a sender with minimal bandwidth usage is an open problem.
Fig. 5. Performance of three well-known scheduling policies under heavy
load (forwarding tree selection according to DCCast)
TABLE IV
EVALUATION OF PARTITIONING TECHNIQUES FOR P2MP TRANSFERS

Scheme | Details
QuickCast | Algorithm 1 (Selective Partitioning).
QuickCast(NP) | QuickCast with no partitioning applied.
QuickCast(TWO) | QuickCast with two partitions always.
C. QuickCast with different partitioning approaches
We compare the three partitioning approaches shown in Table IV. We considered receiver sets of 5 and 10 with both light-tailed and heavy-tailed distributions. We show both mean (top row) and tail (bottom row) completion times in the form of a CDF in Figure 7. As expected, when there is no partitioning, all receivers complete at the same time (vertical line in the CDF). When partitioning is always applied, completion times can jump far beyond the no-partitioning case due to unnecessary creation of additional weak links. The benefit of QuickCast is that it applies partitioning selectively. The amount of benefit obtained is a function of the partitioning factor pf (which, for the topologies and traffic patterns considered here, was found to be most effective between 1.0 and 1.1 according to our experiments; we used 1.1). With QuickCast, the fastest receiver can complete up to 41% faster than the slowest receiver, and even the slowest receiver completes up to 10% faster than when no partitioning is applied.
D. QuickCast vs. DCCast
We now compare QuickCast with DCCast using real topolo-
gies and inter-datacenter transfer size distributions, namely
GScale(Hadoop) and Cogent(Cache-Follower) shown in Tables II and III. Figure 8 shows the results. We considered 10
receivers per P2MP transfer. In all cases, QuickCast uses up
Fig. 6. Bandwidth usage of various partitioning techniques, normalized by the lower bound (lower is better); 10 receivers per transfer, for the topology and transfer size distribution combinations of Tables II and III. Schemes: Random(Uniform Dist), Agg(proximity between receivers), Agg(closeness to source).
Fig. 7. Comparison of partitioning approaches in Table IV: CDFs of completion times (normalized) across receivers, mean (top row) and tail (bottom row), for light-tailed and heavy-tailed distributions with 5 and 10 receivers per transfer. Schemes: QuickCast(NP), QuickCast(TWO), QuickCast (uses Selective Partitioning).
to 4% more bandwidth. For lightly loaded scenarios where λ = 0.01, QuickCast performs up to 78% better in mean times, but about 35% worse in tail times. The loss in tail times is a result of the rate-allocation policy: FCFS performs better in tail times compared to Fair Sharing under light loads, where contention due to overlapping trees is negligible (similar to the single link case when all transfers compete for one resource). For heavily loaded scenarios where λ = 1, network contention due to overlapping trees is considerable and therefore QuickCast has been able to reduce mean times by about 10× and tail times by about 57%. This performance
by about 10×and tail times by about 57%. This performance
gap continues to increase in favor of QuickCast as offered load
grows further. In general, operators aim to maximize network
utilization over dedicated WANs [4], [5] which could lead to
heavily loaded time periods. Such scenarios may also appear
as a result of bursty transfer arrivals.
In a different experiment, we studied the effect of the number of replicas of data objects on the performance of DCCast and QuickCast, as shown in Figure 9 (note the difference in vertical scale across charts). The bandwidth usage of both schemes was almost identical (QuickCast used less than 4% extra bandwidth in the worst case). We considered two operating modes: lightly to moderately loaded (λ = 0.01) and moderately to heavily loaded (λ = 0.1). QuickCast offers the most benefit when the number of copies grows. When the number of copies is small, breaking receivers into multiple sets may provide limited benefit or even degrade performance as the resulting partitions will be too small. This is why mean and tail times degrade by up to 5% and 40%, respectively, across both topologies and traffic patterns when the network is lightly loaded. Under increasing load and with more copies, QuickCast can reduce mean times significantly, i.e., by as much as 6× for Cogent(Cache-Follower) and as much as 16× for GScale(Hadoop). The sudden increase in tail times for the GScale topology is because this network has only 12 nodes, which means that partitioning while making 10 copies will most likely lead to overlapping edges across partitioned trees and increased completion times. To address this problem, one could reduce pf to minimize unnecessary partitioning.
E. Complexity
We discuss two complexity metrics of run-time and number
of Group Table entries needed to realize QuickCast.
Computational Complexity: We computed run-time on a machine with a Core-i7 6700 CPU and 24 GB of memory using JRE 8. We used the same transfer size properties mentioned at the beginning of this section. With the GScale(Hadoop) setting and λ = 0.01, the run-time of procedure Submit increased from 1.44 ms to 2.37 ms on average while increasing the number of copies from 2 to 10. With the same settings, the run-time of procedure DispatchRates stayed below 2 µs for varying numbers of copies. Next, we increased both load and network size by
Fig. 8. Comparison of completion times of and bandwidth used by QuickCast vs. DCCast (normalized by the minimum in each category); panels: (a) Heavy Load (λ = 1) and (b) Light Load (λ = 0.01), each showing BW, Mean and Tail for GScale(Hadoop) and Cogent(Cache-Follower).
switching to the Cogent(Cache-Follower) setting and λ = 1.0. This time, the run-time of procedure Submit increased from 3.5 ms to 35 ms and that of procedure DispatchRates increased from 0.75 ms to 1.6 ms on average while increasing the number of copies from 2 to 10. Although these run-times are significantly smaller than the timeslot widths used in prior works, which are in the range of minutes [7], [26], a more efficient implementation of the proposed techniques may reduce run-time even further. Finally, with our implementation, the memory usage of the QuickCast algorithm is on the order of hundreds of megabytes on average.
Group Table Entries: We performed simulations over 2000 timeslots with λ = 1.0 (heavily loaded scenario) and the number of copies set to 10. With the GScale(Hadoop) setting, the maximum number of required Group Table entries was 455, and the average over all timeslots of the maximum number of rules across all nodes was 166. With the Cogent(Cache-Follower) setting, which is more than 10 times larger than GScale, we observed a maximum of 165 and an average of 9 for the maximum number of Group Table entries observed across all nodes over all timeslots. We considered the highly loaded case as it leads to a higher number of concurrent forwarding trees. Currently, most switches that support Group Tables offer a maximum of 512 or 1024 entries in total. In this experiment, a maximum of one Group Table entry per transfer per node was enough as we considered switches that support up to 32 action buckets per entry [32], which is more than the total number of receivers we chose per transfer. In general, we may need more than one entry per node per transfer or we may have to limit the branching factor of selected forwarding trees, for example when Group Table
Fig. 9. Comparison of completion times of QuickCast and DCCast (normalized by the minimum in each category) by number of object copies; mean and tail times vs. number of receivers per transfer, for λ = 0.01 and λ = 0.1, on (a) Cogent(Cache-Follower) and (b) GScale(Hadoop).
entries support up to 8 action buckets [33].
V. DISCUSSION
The focus of this paper is on algorithm design and abstract
evaluations. In this section, we dive a bit further into practical
details and issues that may arise in a real setting.
Timeslot duration: One configuration factor is timeslot
length δ. In general, smaller timeslots allow for faster response
to changes and arrival of new requests, but add the overhead of
rate computations. The minimum possible timeslot length depends on how long it takes for senders to converge to centrally allocated rates.
Handling rate-limiting inaccuracies: Rate-limiting is gen-
erally not very accurate, especially if done in software [34]. To
deal with inaccuracies and errors, every sender has to report
back to the TE server at the end of every timeslot and specify
how much traffic it was able to deliver. The TE server deducts this from the residual demands of requests to obtain the new residual demands. Rate-allocation continues until a request is completely satisfied.
Receiver feedback to sender: Forwarding trees allow the flow of data from a sender to receivers, but receivers also need to communicate with senders. Forwarding rules can be installed so that receivers can send feedback to the sender over the same tree but in the reverse direction. There is no need to use Group Tables in the reverse direction since no replication is performed. One can also use a simple point-to-point scheme for receiver feedback. Since we propose applying forwarding trees over wired networks with rate-limiting, dropped packets due to congestion and corruption are expected to be rare. This means that if a scheme such as Negative Acknowledgement (NAK) is used, receiver feedback should be tiny and can be easily handled by leaving small spare capacity over edges.
Handling network capacity loss: Link/switch failures may
occur in a real setting. In case of a failure, the TE server can be
notified by a network element that detects the failure. The TE
server can then exclude the faulty link/switch from topology
and re-allocate all requests routed on that link using their
residual demands. After new rates are allocated and forwarding trees are recomputed, the forwarding plane can be updated and new rates can be given to the end-points for rate-limiting.
TE server failure: Another failure scenario is when the TE server stops working. It is helpful if end-points are equipped with some distributed congestion control mechanism. If the TE server fails, end-points can roll back to the distributed mode and determine their rates according to network feedback.
VI. RELATED WORK
IP multicasting [8], CM [35], TCP-SMO [36] and NORM
[37] are instances of approaches where receivers can join
groups anytime to receive required data and multicast trees
are updated as nodes join or leave. This may lead to trees far
from optimal. Also, since load distribution is not taken into
account, network capacity may be poorly utilized.
Having knowledge of the topology, centralized management
allows for more careful selection of multicast trees and im-
proved utilization via rate-control and bandwidth reservation.
CastFlow [38] precalculates multicast trees which can then
be used at request arrival time for rule installation. ODPA
[39] presents algorithms for dynamic adjustment of multicast
spanning trees according to specific metrics. These approaches
however do not apply rate-control. MPMC [16], [17] proposes the use of multiple multicast trees for faster delivery of files to many receivers and then applies coding for improved performance. MPMC does not consider the interplay between transfers when many P2MP requests are initiated from different source datacenters. In addition, MPMC requires continuous changes to the multicast tree, which incurs significant control plane overhead as the number of chunks and transfers increases. [40] focuses on rate and buffer size calculation for senders. This work does not propose any solution for tree calculation.
RAERA [14] is an approximation algorithm to find Steiner
trees that minimize data recovery costs for multicasting given
a set of recovery nodes. We do not have recovery nodes in our
scenario. MTRSA [15] is an approximation algorithm for mul-
ticast trees that satisfy a minimum available rate over a general
network given available bandwidth over all edges. This work
assumes a constant rate requirement for transfers and focuses on minimizing bandwidth usage rather than completion times.
For some regular and structured topologies, such as FatTree
intra-datacenter networks, it is possible to find optimal (or
close to optimal) multicast trees efficiently. Datacast [13]
sends data over edge-disjoint Steiner trees found by pruning
spanning trees over various topologies of FatTree, BCube and
Torus. AvRA [12] focuses on Tree and FatTree topologies and
builds minimal edge Steiner trees that connect the sender to
all receivers as they join. MCTCP [41] reactively schedules
flows according to link utilization.
As an alternative to in-network multicasting, one can use
overlay networks where hosts perform forwarding. RDCM
[42] populates backup overlay networks as nodes join and
transmits lost packets in a peer-to-peer fashion over them.
NICE [9] creates hierarchical clusters of multicast peers and
aims to minimize control traffic overhead. AMMO [43] allows
applications to specify performance constraints for selection
of multi-metric overlay trees. DC2 [44] is a hierarchy-aware
group communication technique to minimize cross-hierarchy
communication. SplitStream [45] builds forests of multicast
trees to distribute load across many machines. Due to lack
of complete knowledge of underlying network topology and
status (e.g. link utilizations, congestion or even failures),
overlay systems are limited in reducing bandwidth usage and
managing distribution of traffic.
Alternatives to multicasting for bulk data distribution in-
clude peer-to-peer [10], [11], [46] and store-and-forward [47]–
[50] approaches. Peer-to-peer approaches do not consider
careful link level traffic allocation and scheduling and do not
focus on minimizing bandwidth usage. The main focus of peer-
to-peer schemes is to locally and greedily optimize completion
times rather than global optimization over many transfers
across a network. Store-and-forward approaches focus on
minimizing costs by utilizing diurnal traffic patterns while
delivering bulk objects and incur additional bandwidth and
storage costs on intermediate datacenters. Coflows are another
related concept where flows with a collective objective are
jointly scheduled for improved performance [51]. Coflows
however do not aim at bandwidth savings.
There are recent solutions for management of P2MP trans-
fers with deadlines. DDCCast [52] uses a single forwarding
tree and the As Late As Possible (ALAP) scheduling policy for
admission control of multicast deadline traffic. In [53], the authors propose using a few parallel forwarding trees from the source to all receivers to increase throughput and meet more deadlines, considering transfer priorities and volumes. The techniques we proposed in QuickCast for receiver set partitioning can be applied to these prior works for further performance gains.
In the design of QuickCast, we did not focus on throughput-optimality, which guarantees network stability for any offered load in the capacity region. We believe our tree selection approach, which balances load across many existing forwarding trees, aids in moving towards throughput-optimality. One could consider developing a P2MP joint scheduling and routing scheme with throughput-optimality as the objective. However, in general, throughput-optimality does not necessarily lead to the highest performance (e.g., lowest latency) [54], [55].
Reliability: Various techniques have been proposed to make
multicasting reliable including use of coding and receiver
(negative or positive) acknowledgements or a combination of
them [56]. In-network caching has also been used to reduce
recovery delay, network bandwidth usage and to address
the ACK/NAK implosion problem [13], [57]. Using positive
ACKs does not lead to ACK implosion for medium scale (sub-
thousand) receiver groups [36]. TCP-XM [58] allows reliable
delivery by using a combination of IP multicast and unicast
for data delivery and re-transmissions. MCTCP [41] applies
standard TCP mechanisms for reliability. Receivers may also
send NAKs upon expiration of some inactivity timer [37].
NAK suppression to address implosion can be done by routers
[57]. Forward Error Correction (FEC) can be used for reliable delivery, whereby a sender encodes k pieces of an object into n pieces (k ≤ n), where any k out of the n pieces allow for recovery [59], [60]. FEC has been used to reduce retransmissions [37] and improve completion times [61].
Some popular FEC codes are Raptor Codes [62] and Tornado
Codes [63].
Congestion Control: PGMCC [64], MCTCP [41] and TCP-
SMO [36] use window-based TCP like congestion control to
compete fairly with other flows. NORM [37] uses an equation-
based rate control scheme. Datacast [13] determines the rate
according to duplicate interest requests for data packets. All
of these approaches track the slowest receiver.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented the QuickCast algorithm to reduce completion times of P2MP transfers across datacenters. We showed that by breaking the receiver sets of P2MP transfers with many receivers into smaller subsets and using a separate tree per subset, we can reduce completion times. We proposed partitioning according to proximity as an effective approach for finding such receiver subsets, and showed that partitioning needs to be applied to transfers selectively. To do so, we proposed
a partitioning factor that can be tuned according to topology
and traffic distribution. Further investigation is required on
finding metrics to selectively apply partitioning per transfer.
Also, investigation of partitioning techniques that optimize
network performance metrics as well as study of optimality
bounds of such techniques are left as part of future work.
Next, we discovered that while managing P2MP transfers with many receivers, the FCFS policy used in DCCast creates significant contention as a result of overlapping forwarding trees, considerably reducing utilization. We applied the Fair Sharing policy, reducing network contention and improving completion times. Finally, we performed experiments with well-known rate-allocation policies and found that Max-Min Fairness provides much lower completion times compared to SRPT while scheduling over large forwarding trees. More research is needed on the best rate-allocation policy for P2MP transfers. Alternatively, one may derive a more effective joint partitioning, rate-allocation and forwarding tree selection algorithm by approximating a solution to the optimization model we proposed.
REFERENCES
[1] “Predicting sd-wan adoption,” http://bit.ly/gartner-sdwan, visited on July
21, 2017.
[2] “Microsoft azure: Cloud computing platform & services,” https://azure.microsoft.com/.
[3] “Compute engine - iaas - google cloud platform,” https://cloud.google.com/compute/.
[4] C.-Y. Hong, S. Kandula, R. Mahajan et al., “Achieving high utilization
with software-driven wan,” in SIGCOMM. ACM, 2013, pp. 15–26.
[5] S. Jain, A. Kumar et al., “B4: Experience with a globally-deployed
software defined wan,” SIGCOMM, vol. 43, no. 4, pp. 3–14, 2013.
[6] S. Kandula, I. Menache, R. Schwartz, and S. R. Babbula, “Calendaring
for wide area networks,” SIGCOMM, vol. 44, no. 4, pp. 515–526, 2015.
[7] X. Jin, Y. Li, D. Wei, S. Li, J. Gao, L. Xu, G. Li, W. Xu, and
J. Rexford, “Optimizing bulk transfers with software-defined optical
wan,” in SIGCOMM. ACM, 2016, pp. 87–100.
[8] S. Deering, “Host Extensions for IP Multicasting,” Internet Requests for Comments, pp. 1–16, 1989. [Online]. Available: https://tools.ietf.org/html/rfc1112
[9] S. Banerjee, B. Bhattacharjee, and C. Kommareddy, “Scalable applica-
tion layer multicast,” in SIGCOMM. ACM, 2002, pp. 205–217.
[10] R. Sherwood, R. Braud, and B. Bhattacharjee, “Slurpie: a cooperative
bulk data transfer protocol,” in INFOCOM, vol. 2, 2004, pp. 941–951.
[11] J. Pouwelse, P. Garbacki, D. Epema et al., The Bittorrent P2P File-Sharing System: Measurements and Analysis. Springer Berlin Heidelberg, 2005.
[12] A. Iyer, P. Kumar, and V. Mann, “Avalanche: Data center multicast using
software defined networking,” in COMSNETS. IEEE, 2014, pp. 1–8.
[13] J. Cao, C. Guo, G. Lu, Y. Xiong, Y. Zheng, Y. Zhang, Y. Zhu, C. Chen, and Y. Tian, “Datacast: A scalable and efficient reliable group data delivery service for data centers,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2632–2645, 2013.
[14] S. H. Shen, L. H. Huang, D. N. Yang, and W. T. Chen, “Reliable
multicast routing for software-defined networks,” in INFOCOM, April
2015, pp. 181–189.
[15] L. H. Huang, H. C. Hsu, S. H. Shen, D. N. Yang, and W. T. Chen, “Mul-
ticast traffic engineering for software-defined networks,” in INFOCOM.
IEEE, 2016, pp. 1–9.
[16] A. Nagata, Y. Tsukiji, and M. Tsuru, “Delivering a file by multipath-
multicast on openflow networks,” in International Conference on Intel-
ligent Networking and Collaborative Systems, 2013, pp. 835–840.
[17] K. Ogawa, T. Iwamoto, and M. Tsuru, “One-to-many file transfers
using multipath-multicast with coding at source,” in IEEE International
Conference on High Performance Computing and Communications,
2016, pp. 687–694.
[18] M. Noormohammadpour, C. S. Raghavendra, S. Rao, and S. Kandula,
“Dccast: Efficient point to multipoint transfers across datacenters,” in
HotCloud. USENIX Association, 2017.
[19] N. McKeown, T. Anderson et al., “Openflow: Enabling innovation in
campus networks,” SIGCOMM, vol. 38, no. 2, pp. 69–74, 2008.
[20] B. Pfaff, B. Lantz, B. Heller et al., “Openflow switch specification, ver-
sion 1.1.0 implemented (wire protocol 0x02),” http://archive.openflow.
org/documents/openflow-spec-v1.1.0.pdf, 2011.
[21] “Omniswitch aos release 8 switch management guide,” http://bit.ly/sdn-
lucent, visited on July 21, 2017.
[22] “Openflow v1.3.1 compliance matrix for devices running junos os,” http:
//bit.ly/sdn-juniper, visited on July 21, 2017.
[23] “Hp openflow 1.3 administrator guide,” http://bit.ly/sdn-hp, visited on
July 21, 2017.
[24] “Network os software defined networking (sdn) configuration guide,”
http://bit.ly/sdn-brocade, visited on July 21, 2017.
[25] D. Bertsekas and R. Gallager, “Data networks,” 1987.
[26] H. Zhang, K. Chen, W. Bai et al., “Guaranteeing deadlines for inter-
datacenter transfers,” in EuroSys. ACM, 2015, p. 20.
[27] M. Noormohammadpour, C. S. Raghavendra, and S. Rao, “Dcroute:
Speeding up inter-datacenter traffic allocation while guaranteeing dead-
lines,” in High Performance Computing, Data, and Analytics (HiPC).
IEEE, 2016.
[28] H. H. Liu, S. Kandula, R. Mahajan, M. Zhang, and D. Gelernter, “Traffic
engineering with forward fault correction,” in SIGCOMM. ACM, 2014,
pp. 527–538.
[29] M. Stanojević and M. Vujošević, “An exact algorithm for Steiner tree
problem on graphs,” International Journal of Computers Communica-
tions & Control, vol. 1, no. 1, pp. 41–46, 2006.
[30] A. Roy, H. Zeng, J. Bagga, G. Porter, and A. C. Snoeren, “Inside the
social network’s (datacenter) network,” in SIGCOMM. ACM, 2015,
pp. 123–137.
[31] “The internet topology zoo (cogent),” http://www.topology-zoo.org/files/
Cogentco.gml, visited on July 19, 2017.
[32] “Understanding how the openflow group action works,”
https://www.juniper.net/documentation/en_US/junos/topics/concept/
junos-sdn-openflow-groups.html, visited on March 14, 2017.
[33] “Hp 5130 ei switch series openflow configuration guide,” https:
//support.hpe.com/hpsc/doc/public/display?docId=c04217797&lang=
en-us&cc=us, visited on Dec 8, 2017.
[34] M. Noormohammadpour and C. S. Raghavendra, “Datacenter Traffic
Control: Understanding Techniques and Trade-offs,” arXiv preprint
arXiv:1712.03530, 2017. [Online]. Available: https://arxiv.org/abs/1712.
03530
[35] S. Keshav and S. Paul, “Centralized multicast,” in International Confer-
ence on Network Protocols. IEEE, 1999, pp. 59–68.
[36] S. Liang and D. Cheriton, “Tcp-smo: extending tcp to support medium-
scale multicast applications,” in Proceedings of the Twenty-First Annual
Joint Conference of the IEEE Computer and Communications Societies,
vol. 3, 2002, pp. 1356–1365.
[37] B. Adamson, C. Bormann, M. Handley, and J. Macker, “Nack-oriented
reliable multicast (norm) transport protocol,” 2009.
[38] C. A. C. Marcondes, T. P. C. Santos, A. P. Godoy, C. C. Viel, and
C. A. C. Teixeira, “Castflow: Clean-slate multicast approach using in-
advance path processing in programmable networks,” in IEEE Sympo-
sium on Computers and Communications, 2012, pp. 94–101.
[39] J. Ge, H. Shen, E. Yuepeng, Y. Wu, and J. You, “An openflow-
based dynamic path adjustment algorithm for multicast spanning trees,”
in IEEE International Conference on Trust, Security and Privacy in
Computing and Communications, 2013, pp. 1478–1483.
[40] X. Ji, Y. Liang, M. Veeraraghavan, and S. Emmerson, “File-stream
distribution application on software-defined networks (sdn),” in IEEE
Annual Computer Software and Applications Conference, vol. 2, July
2015, pp. 377–386.
[41] T. Zhu, F. Wang, Y. Hua, D. Feng et al., “Mctcp: Congestion-aware
and robust multicast tcp in software-defined networks,” in International
Symposium on Quality of Service, June 2016, pp. 1–10.
[42] D. Li, M. Xu, M.-C. Zhao, C. Guo, Y. Zhang, and M.-Y. Wu, “Rdcm:
Reliable data center multicast,” in INFOCOM. IEEE, 2011, pp. 56–60.
[43] A. Rodriguez, D. Kostic, and A. Vahdat, “Scalability in adaptive multi-
metric overlays,” in International Conference on Distributed Computing
Systems, 2004, pp. 112–121.
[44] K. Nagaraj, H. Khandelwal, C. Killian, and R. R. Kompella, “Hierarchy-
aware distributed overlays in data centers using dc2,” in COMSNETS.
IEEE, 2012, pp. 1–10.
[45] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron,
and A. Singh, “Splitstream: High-bandwidth multicast in cooperative
environments,” in SOSP. ACM, 2003, pp. 298–313.
[46] M. Hefeeda, A. Habib, B. Botev et al., “Promise: Peer-to-peer media
streaming using collectcast,” in MULTIMEDIA. ACM, 2003, pp. 45–54.
[47] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, “Inter-datacenter
bulk transfers with netstitcher,” in SIGCOMM. ACM, 2011, pp. 74–85.
[48] S. Su, Y. Wang, S. Jiang, K. Shuang, and P. Xu, “Efficient algorithms
for scheduling multiple bulk data transfers in inter-datacenter networks,”
International Journal of Communication Systems, vol. 27, no. 12, 2014.
[49] N. Laoutaris, G. Smaragdakis, R. Stanojevic, P. Rodriguez, and R. Sun-
daram, “Delay-tolerant bulk data transfers on the internet,” IEEE/ACM
TON, vol. 21, no. 6, 2013.
[50] Y. Wang, S. Su et al., “Multiple bulk data transfers scheduling among
datacenters,” Computer Networks, vol. 68, pp. 123–137, 2014.
[51] M. Chowdhury and I. Stoica, “Coflow: A networking abstraction for
cluster applications,” in Workshop on Hot Topics in Networks. ACM,
2012, pp. 31–36.
[52] M. Noormohammadpour and C. S. Raghavendra, “DDCCast: Meeting
Point to Multipoint Transfer Deadlines Across Datacenters using ALAP
Scheduling Policy,” Department of Computer Science, University of
Southern California, Tech. Rep. 17-972, 2017.
[53] S. Ji, “Resource optimization across geographically distributed datacen-
ters,” Master’s thesis, University of Toronto, 2017.
[54] A. Yekkehkhany, “Near Data Scheduling for Data Centers with Multi
Levels of Data Locality,” arXiv preprint arXiv:1702.07802, 2017.
[55] A. Yekkehkhany, A. Hojjati, and M. H. Hajiesmaili, “Gb-pandas:
Throughput and heavy-traffic optimality analysis for affinity scheduling,”
arXiv preprint arXiv:1709.08115, 2017.
[56] M. Handley, L. Vicisano, M. Luby, B. Whetten, and R. Kermode,
“The reliable multicast design space for bulk data transfer,” Internet
Requests for Comments, pp. 1–22, 2000. [Online]. Available:
https://tools.ietf.org/html/rfc2887
[57] L. H. Lehman, S. J. Garland, and D. L. Tennenhouse, “Active reliable
multicast,” in INFOCOM, vol. 2, Mar 1998, pp. 581–589.
[58] K. Jeacle and J. Crowcroft, “Tcp-xm: unicast-enabled reliable multicast,”
in ICCCN, 2005, pp. 145–150.
[59] M. Luby, L. Vicisano, J. Gemmell et al., “The use of forward error
correction (fec) in reliable multicast,” 2002.
[60] M. Luby, J. Gemmell, L. Vicisano, L. Rizzo, and J. Crowcroft, “Asyn-
chronous layered coding (alc) protocol instantiation,” 2002.
[61] C. Gkantsidis, J. Miller, and P. Rodriguez, “Comprehensive view of a
live network coding p2p system,” in IMC. ACM, 2006, pp. 177–188.
[62] A. Shokrollahi, “Raptor codes,” IEEE Transactions on Information
Theory, vol. 52, no. 6, pp. 2551–2567, 2006.
[63] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege, “A digital
fountain approach to reliable distribution of bulk data,” in SIGCOMM.
ACM, 1998, pp. 56–67.
[64] L. Rizzo, “Pgmcc: A tcp-friendly single-rate multicast congestion con-
trol scheme,” in SIGCOMM, 2000.