QuickCast: Fast and Efficient Inter-Datacenter
Transfers using Forwarding Tree Cohorts
Mohammad Noormohammadpour1, Cauligi S. Raghavendra1, Srikanth Kandula2, Sriram Rao2
1Ming Hsieh Department of Electrical Engineering, University of Southern California
2Microsoft
Abstract—Large inter-datacenter transfers are crucial for
cloud service efficiency and are increasingly used by organiza-
tions that have dedicated wide area networks between datacen-
ters. A recent work uses multicast forwarding trees to reduce
the bandwidth needs and improve completion times of point-to-
multipoint transfers. Using a single forwarding tree per transfer,
however, leads to poor performance because the slowest receiver
dictates the completion time for all receivers. Using multiple
forwarding trees per transfer alleviates this concern: the average receiver could finish early; however, if done naively, bandwidth usage would also increase, and it is a priori unclear how best
to partition receivers, how to construct the multiple trees and
how to determine the rate and schedule of flows on these trees.
This paper presents QuickCast, a first solution to these problems.
Using simulations on real-world network topologies, we see that
QuickCast can speed up the average receiver's completion time by as much as 10× while only using 1.04× more bandwidth; further, the completion time for all receivers also improves by as much as 1.6× at high loads.
Index Terms—Software Defined WAN; Datacenter; Schedul-
ing; Completion Times; Replication
I. INTRODUCTION
Software Defined Networking (SDN) is increasingly
adopted across Wide Area Networks (WANs) [1]. SDN allows
careful monitoring, management and control of status and
behavior of networks offering improved performance, agility
and ease of management. Consequently, large cloud providers,
such as Microsoft [2] and Google [3], have built dedicated
large scale WAN networks that can be operated using SDN
which we refer to as SD-WAN. These networks connect
dozens of datacenters for increased reliability and availability
as well as improved utilization of network bandwidth reducing
communication costs [4], [5].
Employing geographically distributed datacenters has many
benefits in supporting users and applications. Replicating
objects across multiple datacenters improves user-access la-
tency, availability and fault tolerance. For example, Content
Delivery Networks (CDNs) replicate objects (e.g. multimedia
files) across many cache locations, search engines distribute
large index updates across many locations regularly, and
VMs are replicated across multiple locations for scale out
of applications. In this context, Point to Multipoint (P2MP)
transfers (also known as One-to-Many transfers) are necessary.
A P2MP transfer is a special case of multicasting with a
single sender and a fixed set of receivers that are known a priori. (This is an extended version of a paper accepted for publication in IEEE INFOCOM 2018, Honolulu, HI, USA.) These properties together provide an opportunity for
network optimizations, such as sizable reductions in bandwidth
usage and faster completion times by using carefully selected
forwarding trees.
We review several approaches for performing P2MP trans-
fers. One can perform P2MP transfers as many independent
point-to-point transfers [4]–[7] which can lead to wasted
bandwidth and increased completion times. Internet multicas-
ting approaches [8] build multicast trees gradually as new
receivers join the multicast sessions, and do not consider the
distribution of load across network links while connecting
new receivers to the tree. This can lead to multicast trees that are far from optimal and larger than necessary, as well as poor load balancing. Application layer multicasting, such as [9], focuses
on use of overlay networks for building virtual multicast trees.
This may lead to poor performance due to limited visibility
into network link level status and lack of control over how
traffic is directed in the network. Peer-to-peer file distribution
techniques [10], [11] try to maximize throughput per receiver
locally and greedily which can be far from a globally optimal
solution. Centralized multicast tree selection approaches have
been proposed [12], [13] that operate on regular and structured
topologies of networks inside datacenters which cannot be
directly applied to inter-DC networks. Other related research either does not consider the elasticity of inter-DC transfers, which allows them to change their transmission rate according to available bandwidth [14], [15], or the interplay among many inter-DC transfers for global network-wide optimization [16], [17].
We recently presented a solution called DCCast [18] that
reduces tail completion times for P2MP transfers. DCCast
employs a central traffic engineering server with a global view
of network status, topology and resources to select forwarding
trees over which traffic flows from senders to all receivers.
This approach combines good visibility into and control over
networks offered by SDN with bandwidth savings achieved
by reducing the number of links used to make data transfers.
OpenFlow [19], a dominant SDN protocol, has supported
forwarding to multiple outgoing ports since v1.1 [20]; for
example, by using Group Tables and the OFPGT_ALL flag.
Several recent SDN switches support this feature [21]–[24].
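To make this concrete, the following is a minimal sketch (not from the paper) of how such a packet-replication group could be installed using the Ryu OpenFlow 1.3 controller framework; the function name, group id and port list are illustrative assumptions.

```python
# Sketch (assumption: Ryu controller, OpenFlow 1.3). Installs a group of type
# ALL whose buckets replicate packets to every listed output port, which is
# the switch-level mechanism used to branch a forwarding tree.
def install_replication_group(datapath, group_id, out_ports):
    ofp = datapath.ofproto
    parser = datapath.ofproto_parser
    # One bucket per outgoing branch of the forwarding tree at this switch.
    buckets = [parser.OFPBucket(0, ofp.OFPP_ANY, ofp.OFPG_ANY,
                                [parser.OFPActionOutput(port)])
               for port in out_ports]
    # OFPGT_ALL executes every bucket, i.e., replicates to all branches.
    datapath.send_msg(parser.OFPGroupMod(datapath, ofp.OFPGC_ADD,
                                         ofp.OFPGT_ALL, group_id, buckets))
```

A flow entry whose action list contains OFPActionGroup(group_id) would then steer a transfer's packets into this group at each branching switch.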
In this paper, we propose a new rate-allocation and tree
selection technique called QuickCast with the aim of minimiz-
ing average completion times of inter-DC transfers. QuickCast
reduces completion times by replacing a large forwarding tree
with multiple smaller trees each connected to a subset of
Fig. 1. Partitioning receivers into subsets can improve mean completion times
(this figure assumes FCFS rate-allocation policy, dashed blue transfer arrives
slightly earlier, all links have equal capacity of 1)
receivers which we refer to as a cohort of forwarding trees.
Next, QuickCast applies the Fair Sharing scheduling policy
which we show through simulations and examples minimizes
contention for available bandwidth across P2MP transfers
compared to other scheduling policies.
Despite offering bandwidth savings and reduced tail com-
pletion times, DCCast can suffer from significant increase in
completion times when P2MP transfers have many receivers.
As the number of receivers increases, forwarding trees grow
large creating many overlapping edges across transfers, in-
creasing contention and completion times. Since the edge with minimal bandwidth determines the overall throughput of a forwarding tree, we refer to this as the "Weakest Link" problem. We demonstrate the weakest link problem using a simple example. Figure 1 shows a scenario where two senders (the top two nodes) are transmitting over forwarding trees and they share a link (x → y). According to DCCast, which uses the First Come First Serve (FCFS) policy, the dashed blue transfer (on the left), which arrived just before the green one (on the right), is scheduled first, delaying the beginning of the green transfer until time T. As a result, all green receivers finish at time 2T. By using multiple trees, in this case two for the green sender, each tree can be scheduled independently, thereby reducing the completion time of the two receivers on the right from 2T to T. That is, we create a new tree that does not have the weakest link, which in this scenario is x → y for both the blue and green transfers.
To replace a large tree with a cohort of trees, we propose
partitioning receiver sets of P2MP transfers into multiple
subsets and using a separate forwarding tree per subset. This
approach can significantly reduce completion times of P2MP
transfers. We performed an experiment with DCCast over
random topologies with 50 nodes to determine if there is
benefit in partitioning all P2MP transfers. We simply grouped
receivers into two subsets according to proximity, i.e., shortest
path hop count between receiver pairs, and attached each
partition with an independent forwarding tree (DCCast+2CL).
As shown in Figure 2, this reduced completion times by 50%
while increasing bandwidth usage by 6% (not shown).
To further reduce completion times and decrease additional
bandwidth usage, partitioning should be only applied to P2MP
transfers that benefit from it. For example, using more than
one tree for the dashed blue transfer (on left) in Figure 1
will increase contention and will hurt completion times. By
carefully selecting P2MP transfers that benefit from partition-
Fig. 2. Comparison of completion times (mean and tail) of DCCast vs. DCCast+2CL (two-clustering according to proximity) for P2MP transfers with 10 and 20 receivers per transfer (normalized)
ing and using an independent forwarding tree per partition,
we can considerably improve average completion times with
small extra bandwidth usage. We refer to this approach as
selective partitioning. We limit the number of partitions per
transfer to two to minimize bandwidth overhead of additional
edges, minimize contention due to overlapping trees and limit
the number of network forwarding entries.
In addition to using multiple forwarding trees, we investigate various scheduling policies, namely FCFS (used in DCCast), Shortest Remaining Processing Time (SRPT), and Fair Sharing based on Max-Min Fairness (MMF) [25]. Although
SRPT is optimal for minimizing mean completion times while
scheduling traffic on a single link, we find that MMF beats
SRPT by a large factor as forwarding trees grow (which causes
many trees to share every link) and as offered load increases
(which increases contention for using available bandwidth).
Using a cohort of forwarding trees per P2MP transfer can
also increase reliability in two ways. First, by mitigating the
effect of weak links, a number of receivers will complete
reception earlier which reduces the probability of data loss
due to failures. Second, in case of a link failure, using more
than one tree reduces the probability of all receivers being
affected. In case subsequent trees are constructed in a link
disjoint manner, no single link failure can affect all receivers.
In summary, we make the following contributions:
1) We formulate a general optimization problem for mini-
mizing mean completion times while scheduling P2MP
transfers over a general network.
2) We propose QuickCast, which mitigates the “Weakest
Link” problem by partitioning receivers into multiple
subsets and attaching each partition with an independent
forwarding tree. Using multiple forwarding trees can
improve average completion time and reliability with
only a little additional bandwidth usage. We show that
partitioning should only be applied to transfers that
benefit from it, i.e., partitioning some transfers leads to
increased contention and worse completion times.
3) We explore well-known scheduling policies (FCFS,
SRPT and Fair Sharing) for central rate-allocation over
forwarding trees and find MMF to be the most effective
in maximizing overall bandwidth utilization and reduc-
ing completion times.
4) We perform extensive simulations using both synthetic
and real inter-datacenter traffic patterns over real WAN
topologies. We find that the performance gain of QuickCast depends on the offered load and show that under heavy loads, QuickCast can improve mean times by as much as 10× and tail times by as much as 1.57× while imposing a very small increase in typical bandwidth usage (only 4%) compared to DCCast.
The rest of this paper is organized as follows. In Section II,
we state the P2MP scheduling problem, explain the constraints
and variables used in the paper and formulate the problem as
an optimization scenario. In Section III, we present QuickCast
and the two procedures it is based on. In Section IV, we
perform abstract simulations to evaluate QuickCast and in
Section V we discuss practical considerations and issues. At
the end, in Section VI, we provide a comprehensive overview
of related research efforts.
II. ASSUMPTIONS AND PROBLEM STATEMENT
Similar to prior work on inter-datacenter networks [4]–[7], [18], [26], [27], we assume an SD-WAN managed by a Traffic Engineering (TE) server which receives transfer requests from end-points, performs rate-allocations and manages the forwarding plane. Transfers arrive at the TE server in an online fashion and are serviced as they arrive. Requests are specified with four parameters: arrival time, source, set of receivers, and size (volume in bytes). End-points apply rate-limiting to minimize congestion. We consider a slotted timeline to allow for flexible rate-allocation while limiting the number of rate changes, which allows time to converge to specified rates and minimizes rate-allocation overhead [6], [26]. We focus
on long running transfers that deliver large objects to many
datacenters such as applications in [5]. For such transfers,
small delays are usually acceptable, including overhead of
centralized scheduling and network configurations. To reduce
configuration overhead (e.g. latency and control plane failures
[28]), we assume that a forwarding tree is not changed once
configured on the forwarding plane.
To reduce complexity we assume that end-points can ac-
curately rate-limit data flows and that they quickly converge
to required rates; that there are no packet losses due to
congestion, corruption or errors; and that scheduling is done
for a specific class of traffic meaning all requests have the
same priority. In Section V, we will discuss ways to deal with
cases when some of these assumptions do not hold.
A. Problem Formulation
Earlier we showed that partitioning of receiver sets can
improve completion times via decreasing network contention.
However, to further improve completion times, partitioning
should be done according to transfer properties, topology
and network status. Optimal partitioning for minimization of completion times is an open problem and requires finding a solution to a complex joint optimization model that takes into account forwarding tree selection and rate-allocation.
TABLE I
DEFINITION OF VARIABLES

Variable | Definition
t, t_now | Some timeslot and the current timeslot
N | Total number of receivers per transfer
e | A directed edge
C_e | Capacity of e in bytes per second
(x, y) | A directed edge from x to y
G | A directed inter-datacenter network graph
T | Some directed tree connecting a sender to its receivers
V_G, V_T | Sets of vertices of G and T
E_G, E_T | Sets of edges of G and T
B_e | Current available bandwidth on edge e
B_T | Current available bandwidth over tree T
δ | Width of a timeslot in seconds
R_i | A transfer request, where i ∈ I
I | Set of request indices, I = {1, ..., I}
O_i | Data object associated with R_i
S_{R_i} | Source datacenter of R_i
A_{R_i} | Arrival time of R_i
V_{R_i} | Original volume of R_i in bytes
V^r_{R_i} | Residual volume of R_i (V^r_{R_i} = V_{R_i} at t = A_{R_i})
D_{R_i} | Set of destinations of R_i
n | Maximum number of subsets (partitions) allowed per receiver set
P^j_i | Set of receivers of R_i in partition j ∈ {1, ..., n}
q_i | Total bytes used to deliver O_i to D_{R_i}
T^j_i | Forwarding tree for partition j ∈ {1, ..., n} of R_i
L_e | e's total outstanding load, i.e., L_e = Σ_{i,j : e ∈ T^j_i} V^r_{R_i}
f^j_i(t) | Transmission rate of R_i on T^j_i at timeslot t
γ^j_i(t) | Whether R_i is transmitted over T^j_i at timeslot t
θ^j_{i,e} | Whether edge e ∈ E_G is on T^j_i
ν_{i,j,v} | Whether v ∈ D_{R_i} is in P^j_i
M^j_i | {∇ | ∇ ⊂ V_G, ∇ ∩ (P^j_i ∪ {S_{R_i}}) ≠ ∅, (V_G − ∇) ∩ (P^j_i ∪ {S_{R_i}}) ≠ ∅}
E(∇) | {e = (x, y) | x ∈ ∇, y ∈ (V_G − ∇)}
We first point out the most basic constraint that arises from using forwarding trees. Table I provides the list of variables we will use. For any tree T, packets flow from the source to the receivers with the same rate r_T, which satisfies:
r_T ≤ B_T = min_{e ∈ E_T} (B_e)    (1)
We formulate the problem as an online optimization scenario. Assuming a set of requests R_i, i ∈ {1, ..., I}, already in the system, upon arrival of a new request R_{I+1}, an optimization problem needs to be formulated and solved to find rate-allocations, partitions and forwarding trees. Figure 3 shows the overall optimization problem. This model considers up to n ≥ 1 partitions per transfer. Demands of existing requests are updated to their residuals upon arrival of new requests. The objective is formulated in a hierarchical fashion, giving higher priority to minimizing mean completion times and then to reducing bandwidth usage. The purpose of the γ^j_i(t) indicator variables is to calculate the mean times: the latest timeslot over which we have f^j_i(t) > 0 determines the completion time of partition j of request i. These completion times are
minimize  Σ_{i ∈ I} Σ_{j ∈ {1,...,n}} |P^j_i| Σ_{t > A_{R_{I+1}}} (t − A_{R_{I+1}}) γ^j_i(t) Π_{t' > t} (1 − γ^j_i(t'))
          + (Σ_{i ∈ I} q_i) / (n |E_G| Σ_{i ∈ I} V_{R_i})

subject to

Calculate total bandwidth usage:
(1)  q_i = Σ_{e ∈ E_G} (Σ_{j ∈ {1,...,n}} θ^j_{i,e}) V^r_{R_i}          ∀i

Demand satisfaction constraints:
(2)  Σ_t γ^j_i(t) f^j_i(t) = V^r_{R_i} / δ                              ∀i, j

Capacity constraints:
(3)  Σ_{i ∈ I} Σ_{j ∈ {1,...,n}} θ^j_{i,e} f^j_i(t) ≤ C_e               ∀t, e

Steiner tree constraints [29]:
(4)  Σ_{e ∈ E(∇)} θ^j_{i,e} ≥ 1                                         ∀i, j, ∇ ∈ M^j_i

Basic range constraints:
(5)  γ^j_i(t) = 0, f^j_i(t) = 0                                         ∀i, j, t < A_{R_{I+1}}
(6)  f^j_i(t) ≥ 0                                                       ∀i, j, t
(7)  θ^j_{i,e} ∈ {0, 1}                                                 ∀i, j, e
(8)  γ^j_i(t) ∈ {0, 1}                                                  ∀i, j, t
(9)  ν_{i,j,v} ∈ {0, 1}                                                 ∀i, j, v ∈ D_{R_i}
(10) γ^j_i(t) = 0                                                       ∀i, j, t < A_{R_i}

Fig. 3. Online optimization model; variables defined in Table I
then multiplied by the partition size to obtain the total sum of completion times per receiver. Constraint (4) ensures that there is a connected subgraph across the sender and receivers of each partition of each request, which is similar to the constraints used to find minimal edge Steiner trees [29]. Since our objective is an increasing function of bandwidth, these constraints eventually lead to minimal trees connecting any sender to its receivers while not increasing mean completion times (the second part of the objective, which minimizes bandwidth, is necessary to ensure no stray edges).
B. Challenges
We focus on two objectives: first minimizing the completion times of data transfers, and then minimizing total bandwidth usage. This is a complex optimization problem for a variety of reasons. First, breaking a receiver set into several partitions leads to an exponential number of possibilities. Moreover, the optimization version of the Steiner tree problem, which aims to find minimal edge or minimal weight trees, is a hard problem [29]. Completion times of transfers then depend on how
partitions are formed and which trees are used to connect
partitions to senders. In addition, the scenario is naturally an
online problem which means even if we were able to compute
an optimal solution for a given set of transfers in a short
amount of time, we still would not be able to compute a
solution that is optimal over longer periods of time due to
incomplete knowledge of future arrivals.
III. QUICKCAST
We present our heuristic approach, called QuickCast, in Algorithm 1, with the objective of reducing mean completion times of elastic P2MP inter-DC transfers. We first review the concepts behind the design of this heuristic, namely rate-allocation, partitioning, forwarding tree selection and selective partitioning. Next, we discuss how Algorithm 1 realizes these concepts using two procedures, one executed upon arrival of new transfers and the other once per timeslot.
A. Rate-allocation
To compute rates per timeslot, we explore well-known
scheduling policies: FCFS, SRPT and Fair Sharing. Although
simple, FCFS can lead to increased mean times if large
transfers block multiple edges by fully utilizing them. SRPT
is known to offer optimal mean times over a single link but
may lead to starvation of large transfers. QuickCast uses Fair
Sharing based on MMF policy.
To understand the effect of different scheduling policies, let us consider the following example. Figure 4 shows a scenario where multiple senders have initiated trees with multiple branches and they share links along the way to their receivers. SRPT gives a higher priority to the top transfer with size 10 and then to the next smallest transfer and so on. When the first transfer is being sent, all other transfers are blocked due to shared links. This occurs again when the next transfer begins. Scheduling according to FCFS leads to the same result. In this example, the mean completion times for both FCFS and SRPT are about 1.16× larger than with MMF. In Section IV, we perform simulation experiments that confirm the outcome in this example. We find that as trees grow larger and under high utilization, the benefit of using MMF over FCFS or SRPT becomes more significant due to increased contention. We also observe that tail times grow much faster for SRPT compared to both MMF and FCFS (since it also suffers from the starvation problem) while scheduling over forwarding trees with many receivers, which increases SRPT's mean completion times.
B. Partitioning
There are two configuration parameters for grouping re-
ceivers into multiple partitions: partitioning criteria and num-
ber of partitions. In general, partitioning may lead to higher
bandwidth usage. However, it may be the case that a small
increase in bandwidth usage can considerably improve com-
pletion times. Efficient and effective partitioning to minimize
completion times and bandwidth usage is a complex open
problem. We discuss our approach in the following.
Partitioning Criteria: We focus on minimizing extra
bandwidth usage while breaking large forwarding trees via
partitioning. Generally, one could select partitions according
to current network conditions, such as distribution of load
Fig. 4. Fair Sharing can offer better mean times compared to both SRPT and FCFS while scheduling over forwarding trees; all links have capacity of 1. Four transfers with volumes 10, 11, 12 and 13 and arrival times 0, 1, 2 and 3 deliver to receivers A, B, C and D over trees that pairwise share links (A&B, A&C, A&D, B&C, B&D, C&D). Completion times: FCFS/SRPT give A = 10, B = 21, C = 33, D = 46 (mean 27.5); MMF gives A = 19, B = 23, C = 26, D = 27.5 (mean 23.87).
across edges. However, we notice that network conditions are continuously changing as current transfers finish and new transfers arrive. Minimizing bandwidth, on the other hand, appears to be a globally desirable objective for partitioning and was hence chosen. To find partitions, QuickCast groups receivers according to proximity until we are left with the desired number of groups, each forming a partition. The distance between two receivers is computed as the number of hops on the shortest path between them. With this approach, a partition requires a minimal number of edges to connect the nodes within it. Reassigning any receiver to another partition would increase the number of edges and thus the consumed bandwidth.
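For illustration, the following is a minimal sketch, under our own simplifying assumptions (an undirected view of the topology, the networkx and scipy libraries, at least two receivers), of grouping receivers by hop-count proximity with average-linkage agglomerative clustering; the function partition_receivers and its arguments are hypothetical names, not from the paper.

```python
# Sketch: group receivers by hop-count proximity using average-linkage
# agglomerative clustering (assumptions: undirected topology, >= 2 receivers).
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def partition_receivers(G, receivers, k=2):
    # Pairwise hop-count distances between receivers (shortest path lengths).
    hops = dict(nx.all_pairs_shortest_path_length(G))
    dist = np.array([[hops[a][b] for b in receivers] for a in receivers], float)
    # Condensed distance matrix -> average-linkage hierarchy -> k clusters.
    labels = fcluster(linkage(squareform(dist), method='average'),
                      t=k, criterion='maxclust')
    groups = {}
    for node, label in zip(receivers, labels):
        groups.setdefault(label, []).append(node)
    return list(groups.values())
```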
Number of Partitions: The right number of partitions
per transfer depends on factors such as topology, number
of receivers, forwarding trees of other transfers, source and
destinations of a transfer, and overall network load. In the
extreme case of N partitions, a P2MP transfer is broken into N unicast transfers, which significantly increases bandwidth usage. Partitioning is most effective if the forwarding trees assigned to partitions do not increase overall contention for network bandwidth, i.e., the number of overlapping edges across new trees is minimal. Therefore, increasing the number of partitions beyond the connectivity degree of datacenters may offer minor gains or even degrade performance (e.g., in the case of Google B4 [5], the minimum and maximum connectivity degrees are 2 and 4, respectively). From a practical standpoint, the number of partitions, and hence forwarding trees, determines the number of Group Table rules that need to be set up in network switches. Therefore, we focus on partitioning receivers into up to 2 groups, each assigned an independent forwarding tree. Exploration of the effects of more partitions is left for future work.
C. Forwarding Tree Selection
After computing two partitions, QuickCast assigns an in-
dependent forwarding tree per partition using tree selection
approach presented in [18] which was shown to provide
high bandwidth savings and improved completion times. It
operates by giving a weight of We= (Le+VR)(see Table
Ifor definition of variables) and then selecting the minimum
weight forwarding tree. This technique allows load balancing
of P2MP data transfers over existing trees according to total
bandwidth scheduled on the edges. It also takes into account
transfer volumes while selecting trees. Particularly, larger
transfers are most likely assigned to smaller trees to minimize
bandwidth usage while smaller transfers are assigned to least
loaded trees (regardless of tree size). This approach becomes
more effective when a small number of transfers are orders
of magnitude larger than median [30], as number of larger
forwarding trees is usually significantly larger than smaller
trees on any graph.
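The following is a minimal sketch of this weight assignment and tree selection step, assuming an undirected graph and using networkx's Steiner tree approximation in place of an exact minimum weight Steiner tree; the helper name select_forwarding_tree and the load dictionary are illustrative, not from the paper.

```python
# Sketch: assign W_e = (L_e + V_R) and pick an approximate minimum weight
# Steiner tree spanning the sender and one partition's receivers.
# Assumptions: undirected graph; 'load' maps edges to outstanding load L_e;
# networkx's 2-approximation stands in for an exact Steiner tree.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def select_forwarding_tree(G, load, source, receivers, volume):
    for u, v in G.edges():
        # Penalize edges that already carry outstanding traffic, scaled by
        # the size of the new transfer (W_e = L_e + V_R).
        G[u][v]['weight'] = load.get((u, v), load.get((v, u), 0.0)) + volume
    T = steiner_tree(G, [source] + list(receivers), weight='weight')
    total_weight = sum(G[u][v]['weight'] for u, v in T.edges())
    return list(T.edges()), total_weight
```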
D. Selective Partitioning
Partitioning is beneficial only if it decreases or minimally
increases bandwidth usage and contention over resources
which necessitates selectively partitioning the receivers. When
we chose the two partitions by grouping receivers and after
selecting a forwarding tree for every group, QuickCast cal-
culates the total weight of each forwarding tree by summing
up weights of their edges. We then compare sum of these
two weights with no partitioning case where a single tree was
used. If the total weight of two smaller trees is less than some
partitioning factor (shown as pf) of the single tree case, we
accept to use two trees. If pfis close to 1.0, partitioning occurs
only if it incurs minimal extra weight, i.e., (pf1) times
weight of the single forwarding tree that would have been
chosen if we applied no partitioning. With this approach, we
most likely avoid selection of subsequent trees that are either
much larger or much more loaded than the initial tree in no
partitioning case. Generally, an effective pfis a function of
traffic distribution and topology. According to our experiments
with several traffic distributions and topologies, choosing it in
the range of 1.0pf1.1offers the best completion times
and minimal bandwidth usage.
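Combining the previous two sketches, the selective-partitioning test could look as follows; this is an illustrative sketch that reuses the hypothetical helpers above and omits the edge-load update that would follow acceptance of a plan.

```python
# Sketch: accept two partitions only if their combined tree weight is within
# a factor pf of the single-tree weight (selective partitioning).
def plan_transfer(G, load, source, receivers, volume, pf=1.1):
    single_tree, w_single = select_forwarding_tree(G, load, source,
                                                   receivers, volume)
    groups = partition_receivers(G, list(receivers), k=2)
    candidates = [select_forwarding_tree(G, load, source, grp, volume)
                  for grp in groups]
    if sum(w for _, w in candidates) <= pf * w_single:
        # Two smaller trees, each scheduled independently per partition.
        return list(zip(groups, [tree for tree, _ in candidates]))
    # Otherwise keep a single tree spanning all receivers.
    return [(list(receivers), single_tree)]
```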
E. QuickCast Algorithm
A TE server is responsible for managing elastic transfers
over inter-DC network. Each partition of a transfer is man-
aged independently of other partitions. We refer to a transfer
partition as active if it has not been completed yet. TE server
keeps a list of active transfer partitions and tracks them at
every timeslot. A TE server running QuickCast algorithm uses
two procedures as shown in Algorithm 1.
Submit(R,n,pf): This procedure is executed upon arrival of a new P2MP transfer R. It performs partitioning and forwarding tree selection for the new transfer given its volume, source and destinations. We consider the general case where we may have up to n partitions per transfer. First, we compute edge weights based on the current traffic matrix and the volume of the new transfer. We then build the agglomerative hierarchy of receivers using average linkage and considering proximity as the clustering metric. Agglomerative clustering is a bottom-up approach where, at every level, the two closest clusters are merged into one cluster. The distance of any two clusters is computed using average linkage, which is the average over pairwise distances of nodes in the two clusters. The distance between every pair of receivers is the number of edges on the shortest path from one to the other. It should be noted that although our networks are directed, all edges are considered to be bidirectional and so the distance in either direction between any two nodes is the same. When the hierarchy is ready, we start from the level where there are n clusters (or at the bottom if the total number of receivers is less than or equal to n) and compute the total weight of n forwarding trees (minimum weight Steiner trees) to these clusters. We move forward with this partitioning if the total weight is less than pf times the weight of the forwarding tree that would have been selected if we grouped all receivers into one partition. Otherwise, the same process is repeated while moving up one level in the clustering hierarchy (one less cluster). If we accept a partitioning, this procedure first assigns a forwarding tree to every partition while continuously updating edge weights. It then returns the partitions and their forwarding trees.
DispatchRates(): This procedure is executed at the beginning of every timeslot. It calculates rates per active transfer partition according to the MMF rate-allocation policy. New transfers arriving somewhere within a timeslot are allocated rates starting the next timeslot. To calculate the residual demands needed for rate calculations, senders report back the actual volume of data delivered during the past timeslot per partition. This allows QuickCast to cope with inaccurate rate-limiting and packet losses, which may prevent a transfer from fully utilizing its allotted share of bandwidth.
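As a concrete illustration, the following is a minimal sketch of a per-timeslot max-min fair rate computation in the spirit of procedure DispatchRates (Algorithm 1), assuming unit edge capacities and simple dictionaries for trees and residual demands; the names are illustrative, not the paper's implementation.

```python
# Sketch: max-min fair rates over forwarding trees, mirroring the structure of
# procedure DispatchRates. Each active partition p has a tree trees[p] (a set
# of edges) and a residual demand residual[p]; every edge has capacity 1.0.
def dispatch_rates(trees, residual, delta=1.0):
    # Number of forwarding trees sharing each edge.
    count = {}
    for p, edges in trees.items():
        for e in edges:
            count[e] = count.get(e, 0) + 1
    cap = {e: 1.0 for e in count}
    rate = {}
    remaining = set(trees)
    while remaining:
        # Fair share of each remaining partition = bottleneck of cap/count.
        share = {p: min(cap[e] / count[e] for e in trees[p]) for p in remaining}
        p0 = min(share, key=share.get)            # most constrained partition
        rate[p0] = min(share[p0], residual[p0] / delta)
        remaining.remove(p0)
        for e in trees[p0]:                        # release its claim on edges
            count[e] -= 1
            cap[e] -= rate[p0]
    return rate
```

Replaying the example of Figure 4 with this routine yields fair shares of 0.5 whenever two or more trees overlap on a link, which is consistent with the completion times of 19, 23, 26 and 27.5 reported there.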
IV. EVALUATIONS
We considered various topologies and transfer size distri-
butions as in Tables II and III. For simplicity, we considered a uniform capacity of 1.0 for all edges, accurate rate-limiting at end-points, no dropped packets due to congestion or corruption, and no link failures. Transfer arrivals followed a Poisson distribution with rate λ. For all simulations, we considered a partitioning factor of pf = 1.1 and a timeslot length of δ = 1.0. Unless otherwise stated, we assumed a fixed λ = 1.0. Also, for all traffic distributions, we considered an average demand equal to the volume of 20 full timeslots per transfer. For the heavy-tailed distribution, which is based on the Pareto distribution, we used a minimum transfer size equal to that of 2 full timeslots. Finally, to prevent the generation of intractably large transfers, we limited the maximum transfer volume to that of 2000 full timeslots. We focus on scenarios with no link failures to evaluate gains.
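As an illustration of this setup, the following is a minimal sketch of one way to generate such a workload with numpy; the function name and the derivation of the Pareto shape parameter are our own assumptions rather than the paper's implementation.

```python
# Sketch: synthetic workload matching the stated setup (assumption: numpy).
# Poisson arrivals with rate lam; Pareto sizes with minimum 2 timeslots,
# mean 20 timeslots, capped at 2000 timeslots.
import numpy as np

def generate_workload(num_transfers, lam=1.0, mean_size=20.0, min_size=2.0,
                      max_size=2000.0, seed=0):
    rng = np.random.default_rng(seed)
    # Poisson process: exponential inter-arrival times with rate lam.
    arrivals = np.cumsum(rng.exponential(1.0 / lam, num_transfers))
    # Pareto(alpha, x_m) has mean alpha*x_m/(alpha-1); choose alpha so the
    # (uncapped) mean equals mean_size given scale x_m = min_size.
    alpha = mean_size / (mean_size - min_size)
    sizes = np.minimum((rng.pareto(alpha, num_transfers) + 1.0) * min_size,
                       max_size)
    return list(zip(arrivals, sizes))
```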
A. Comparison of Scheduling Policies over Forwarding Trees
We first compare the performance of three well-known
scheduling policies of FCFS, SRPT and Fair Sharing (based on
Algorithm 1: QuickCast

Submit (R, n, pf)
  Input: R(V_R, S_R, D_R), n (= 2 in this paper), pf, G, L_e for e ∈ E_G (variables defined in Table I)
  Output: Pairs of (Partition, Forwarding Tree)

  ∀α, β ∈ D_R, α ≠ β: DIST_{α,β} ← number of edges on the shortest path from α to β;
  To every edge e ∈ E_G, assign weight W_e = (L_e + V_R);
  Find the minimum weight Steiner tree T_R that connects S_R ∪ D_R, and its total weight W_{T_R};
  for k = n to k = 2 by −1 do
    Agglomeratively cluster D_R using average linkage and the distance metric DIST calculated above, until only k clusters are left, forming P^i_R, i ∈ {1, ..., k};
    foreach i ∈ {1, ..., k} do
      Find W_{T_{P^i_R}}, the weight of the minimum weight Steiner tree that connects S_R ∪ P^i_R;
    if Σ_{i ∈ {1,...,k}} W_{T_{P^i_R}} ≤ pf × W_{T_R} then
      foreach i ∈ {1, ..., k} do
        Find the minimum weight Steiner tree T_{P^i_R} that connects S_R ∪ P^i_R;
        L_e ← L_e + V_R, ∀e ∈ T_{P^i_R};
        Update W_e = (L_e + V_R) for all e ∈ E_G;
      return P^i_R as well as T_{P^i_R}, ∀i ∈ {1, ..., k};
  L_e ← L_e + V_R, ∀e ∈ T_R;
  return D_R and T_R;

DispatchRates ()
  Input: Set of active request partitions P, their current residual demands V^r_P and forwarding trees T_P, ∀P ∈ P, and timeslot width δ
  Output: Rate per active request partition for the next timeslot

  COUNT_e ← number of forwarding trees T_P, P ∈ P, sharing edge e, ∀e ∈ E_G;
  P' ← P and CAP_e ← 1, ∀e ∈ E_G;
  while |P'| > 0 do
    foreach P ∈ P' do
      SHARE_P ← min_{e ∈ T_P} (CAP_e / COUNT_e);
    P_0 ← a partition P with minimum SHARE_P value;
    RATE_{P_0} ← min(SHARE_{P_0}, V^r_{P_0} / δ);
    P' ← P' − {P_0};
    COUNT_e ← COUNT_e − 1, ∀e ∈ T_{P_0};
    CAP_e ← CAP_e − RATE_{P_0}, ∀e ∈ T_{P_0};
  return RATE_P, ∀P ∈ P
TABLE II
VARIOUS TOPOLOGIES USED IN EVALUATION

Name | Description
Random | Randomly generated and strongly connected with 50 nodes and 150 edges. Each node has a minimum connectivity of two.
GScale [5] | Connects Google datacenters across the globe with 12 nodes and 19 links.
Cogent [31] | A large backbone and transit network that spans USA and Europe with 197 nodes and 243 links.

TABLE III
TRANSFER SIZE DISTRIBUTIONS USED IN EVALUATION

Name | Description
Light-tailed | According to Exponential distribution.
Heavy-tailed | According to Pareto distribution.
Facebook Cache-Follower [30] | Generated across Facebook inter-datacenter networks running cache applications.
Facebook Hadoop [30] | Generated across Facebook inter-datacenter networks running geo-distributed analytics.
MMF). We used the weight assignment in [18] for forwarding tree selection and considered the Random topology in Table II. We considered both light-tailed and heavy-tailed distributions. All policies used an almost identical amount of bandwidth (not shown). Under light load, we obtained results similar to scheduling traffic on a single link, where SRPT performs better than Fair Sharing (not shown). Figure 5 shows the results of our experiment under heavy load. When the number of receivers is small, SRPT is the best policy to minimize mean times. However, as we increase the number of receivers (larger trees), Fair Sharing offers better mean and tail times. This simply occurs because the contention due to overlapping trees, caused by prioritizing transfers over one another (either according to residual size in the case of SRPT or arrival order in the case of FCFS), increases as more transfers are submitted or as transfers grow in size.
B. Bandwidth usage of partitioning techniques
We considered three partitioning techniques and measured
the average bandwidth usage over multiple runs and many
timeslots. We used the topologies in Table II and traffic
patterns of Table III. Figure 6 shows the results. We calculated the lower bound by considering a single minimal edge Steiner tree per transfer. The other schemes considered are: Random(Uniform Dist), which breaks each set of receivers into two partitions by randomly assigning each receiver to one of the two partitions with equal probability; Agg(proximity between receivers), which clusters receivers according to closeness to each other; and Agg(closeness to source), which clusters receivers according to their distance from the source (receivers closer to the source are bundled together). As can be seen, Agg(proximity between receivers), which is used by QuickCast, provides the least bandwidth overhead (up to 17% above the lower bound). In general, breaking receivers into subsets that are attached to a sender with minimal bandwidth usage is an open problem.
Fig. 5. Performance of three well-known scheduling policies under heavy
load (forwarding tree selection according to DCCast)
TABLE IV
EVALUATION OF PARTITIONING TECHNIQUES FOR P2MP TRANSFERS

Scheme | Details
QuickCast | Algorithm 1 (Selective Partitioning).
QuickCast(NP) | QuickCast with no partitioning applied.
QuickCast(TWO) | QuickCast with two partitions always.
C. QuickCast with different partitioning approaches
We compare the three partitioning approaches shown in Table IV. We considered receiver sets of 5 and 10 with both light-tailed and heavy-tailed distributions. We show both mean (top row) and tail (bottom row) completion times in the form of a CDF in Figure 7. As expected, when there is no partitioning, all receivers complete at the same time (vertical line in the CDF). When partitioning is always applied, completion times can jump far beyond the no-partitioning case due to unnecessary creation of additional weak links. The benefit of QuickCast is that it applies partitioning selectively. The amount of benefit obtained is a function of the partitioning factor pf (which, for the topologies and traffic patterns considered here, was found to be most effective between 1.0 and 1.1 according to our experiments; we used 1.1). With QuickCast, the fastest receiver can complete up to 41% faster than the slowest receiver, and even the slowest receiver completes up to 10% faster than when no partitioning is applied.
D. QuickCast vs. DCCast
We now compare QuickCast with DCCast using real topolo-
gies and inter-datacenter transfer size distributions, namely
GScale(Hadoop) and Cogent(Cache-Follower) shown in Tables II and III. Figure 8 shows the results. We considered 10
receivers per P2MP transfer. In all cases, QuickCast uses up
Fig. 6. Bandwidth usage of various partitioning techniques, normalized by the lower bound (lower is better); 10 receivers per transfer, for the topology and transfer size distribution combinations of Tables II and III. Schemes: Random(Uniform Dist), Agg(proximity between receivers), Agg(closeness to source).
Fig. 7. Comparison of partitioning approaches in Table IV: CDFs of completion times (normalized) across receivers, mean (top row) and tail (bottom row), for light-tailed and heavy-tailed distributions with 5 and 10 receivers per transfer. Schemes: QuickCast(NP), QuickCast(TWO), QuickCast (uses Selective Partitioning).
to 4% more bandwidth. For lightly loaded scenarios where λ = 0.01, QuickCast performs up to 78% better in mean times, but about 35% worse in tail times. The loss in tail times is a result of the rate-allocation policy: FCFS performs better in tail times compared to Fair Sharing under light loads, where contention due to overlapping trees is negligible (similar to the single link case when all transfers compete for one resource). For heavily loaded scenarios where λ = 1, network contention due to overlapping trees is considerable and therefore QuickCast has been able to reduce mean times by about 10× and tail times by about 57%. This performance
by about 10×and tail times by about 57%. This performance
gap continues to increase in favor of QuickCast as offered load
grows further. In general, operators aim to maximize network
utilization over dedicated WANs [4], [5] which could lead to
heavily loaded time periods. Such scenarios may also appear
as a result of bursty transfer arrivals.
In a different experiment, we studied the effect of the number of replicas of data objects on the performance of DCCast and QuickCast, as shown in Figure 9 (note the difference in vertical scale across charts). The bandwidth usage of both schemes was almost identical (QuickCast used less than 4% extra bandwidth in the worst case). We considered two operating modes: lightly to moderately loaded (λ = 0.01) and moderately to heavily loaded (λ = 0.1). QuickCast offers the most benefit when the number of copies grows. When the number of copies is small, breaking receivers into multiple sets may provide limited benefit or even degrade performance as the resulting partitions will be too small. This is why mean and tail times degrade by up to 5% and 40%, respectively, across both topologies and traffic patterns when the network is lightly loaded. Under increasing load and with more copies, QuickCast can reduce mean times significantly, i.e., by as much as 6× for Cogent(Cache-Follower) and as much as 16× for GScale(Hadoop). The sudden increase in tail times for the GScale topology is because this network has only 12 nodes, which means that partitioning while making 10 copies will most likely lead to overlapping edges across partitioned trees and increased completion times. To address this problem, one could reduce pf to minimize unnecessary partitioning.
E. Complexity
We discuss two complexity metrics of run-time and number
of Group Table entries needed to realize QuickCast.
Computational Complexity: We computed run-time on a machine with a Core-i7 6700 CPU and 24 GB of memory using JRE 8. We used the same transfer size properties mentioned at the beginning of this section. With the GScale(Hadoop) setting and λ = 0.01, the run-time of procedure Submit increased from 1.44 ms to 2.37 ms on average while increasing the number of copies from 2 to 10. With the same settings, the run-time of procedure DispatchRates stayed below 2 µs for varying numbers of copies. Next, we increased both load and network size by
Fig. 8. Comparison of completion times of and bandwidth used by QuickCast vs. DCCast (normalized by the minimum in each category); panels: (a) Heavy Load (λ = 1) and (b) Light Load (λ = 0.01), each showing BW, Mean and Tail for GScale(Hadoop) and Cogent(Cache-Follower).
switching to the Cogent(Cache-Follower) setting and λ = 1.0. This time, the run-time of procedure Submit increased from 3.5 ms to 35 ms and that of procedure DispatchRates increased from 0.75 ms to 1.6 ms on average while increasing the number of copies from 2 to 10. Although these run-times are significantly smaller than the timeslot widths used in prior works, which are in the range of minutes [7], [26], a more efficient implementation of the proposed techniques may reduce run-time even further. Finally, with our implementation, the memory usage of the QuickCast algorithm is on the order of hundreds of megabytes on average.
Group Table Entries: We performed simulations over 2000 timeslots with λ = 1.0 (heavily loaded scenario) and the number of copies set to 10. With the GScale(Hadoop) setting, the maximum number of required Group Table entries was 455, and the average over all timeslots of the maximum number of rules across all nodes was 166. With the Cogent(Cache-Follower) setting, which is more than 10 times larger than GScale, we observed a maximum of 165 and an average of 9 for the maximum number of Group Table entries observed across all nodes over all timeslots. We considered the highly loaded case as it leads to a higher number of concurrent forwarding trees. Currently, most switches that support Group Tables offer a maximum of 512 or 1024 entries in total. In this experiment, a maximum of one Group Table entry per transfer per node was enough as we considered switches that support up to 32 action buckets per entry [32], which is more than the total number of receivers we chose per transfer. In general, we may need more than one entry per node per transfer or we may have to limit the branching factor of selected forwarding trees, for example when Group Table
Fig. 9. Comparison of completion times of QuickCast and DCCast (normalized by the minimum in each category) by number of object copies; mean and tail times vs. number of receivers per transfer, for λ = 0.01 and λ = 0.1, on (a) Cogent(Cache-Follower) and (b) GScale(Hadoop).
entries support up to 8 action buckets [33].
V. DISCUSSION
The focus of this paper is on algorithm design and abstract
evaluations. In this section, we dive a bit further into practical
details and issues that may arise in a real setting.
Timeslot duration: One configuration factor is timeslot
length δ. In general, smaller timeslots allow for faster response
to changes and arrival of new requests, but add the overhead of
rate computations. The minimum possible timeslot length depends on how long it takes for senders to converge to centrally allocated rates.
Handling rate-limiting inaccuracies: Rate-limiting is gen-
erally not very accurate, especially if done in software [34]. To
deal with inaccuracies and errors, every sender has to report
back to the TE server at the end of every timeslot and specify
how much traffic it was able to deliver. The TE server deducts this from the residual demands of requests to obtain the new residual demands. Rate-allocation continues until a request is completely satisfied.
Receiver feedback to sender: Forwarding trees allow the flow of data from a sender to receivers, but receivers also need to communicate with senders. Forwarding rules can be installed so that receivers can send feedback to the sender over the same tree but in the reverse direction. There is no need to use Group Tables in the reverse direction since no replication is performed. One can also use a simple point-to-point scheme for receiver feedback. Since we propose applying forwarding trees over wired networks with rate-limiting, dropped packets due to congestion and corruption are expected to be rare. This means that if a scheme such as Negative Acknowledgement (NAK) is used, receiver feedback should be tiny and can be easily handled by leaving small spare capacity over edges.
Handling network capacity loss: Link/switch failures may
occur in a real setting. In case of a failure, the TE server can be
notified by a network element that detects the failure. The TE
server can then exclude the faulty link/switch from topology
and re-allocate all requests routed on that link using their
residual demands. After new rates are allocated and forwarding trees are recomputed, the forwarding plane can be updated and new rates can be given to the end-points for rate-limiting.
TE server failure: Another failure scenario is when the TE server stops working. It is helpful if end-points are equipped with some distributed congestion control mechanism. If the TE server fails, end-points can roll back to the distributed mode and determine their rates according to network feedback.
VI. RELATED WORK
IP multicasting [8], CM [35], TCP-SMO [36] and NORM
[37] are instances of approaches where receivers can join
groups anytime to receive required data and multicast trees
are updated as nodes join or leave. This may lead to trees far
from optimal. Also, since load distribution is not taken into
account, network capacity may be poorly utilized.
Having knowledge of the topology, centralized management
allows for more careful selection of multicast trees and im-
proved utilization via rate-control and bandwidth reservation.
CastFlow [38] precalculates multicast trees which can then
be used at request arrival time for rule installation. ODPA
[39] presents algorithms for dynamic adjustment of multicast
spanning trees according to specific metrics. These approaches
however do not apply rate-control. MPMC [16], [17] proposes the use of multiple multicast trees for faster delivery of files to many receivers and then applies coding for improved performance. MPMC does not consider the interplay between transfers when many P2MP requests are initiated from different source datacenters. In addition, MPMC requires continuous changes to the multicast tree, which incurs significant control plane overhead as the number of chunks and transfers increases. [40] focuses on rate and buffer size calculation for senders. This work does not propose any solution for tree calculation.
RAERA [14] is an approximation algorithm to find Steiner
trees that minimize data recovery costs for multicasting given
a set of recovery nodes. We do not have recovery nodes in our
scenario. MTRSA [15] is an approximation algorithm for mul-
ticast trees that satisfy a minimum available rate over a general
network given available bandwidth over all edges. This work
assumes a constant rate requirement for transfers and focuses on minimizing bandwidth usage rather than completion times.
For some regular and structured topologies, such as FatTree
intra-datacenter networks, it is possible to find optimal (or
close to optimal) multicast trees efficiently. Datacast [13]
sends data over edge-disjoint Steiner trees found by pruning
spanning trees over various topologies of FatTree, BCube and
Torus. AvRA [12] focuses on Tree and FatTree topologies and
builds minimal edge Steiner trees that connect the sender to
all receivers as they join. MCTCP [41] reactively schedules
flows according to link utilization.
As an alternative to in-network multicasting, one can use
overlay networks where hosts perform forwarding. RDCM
[42] populates backup overlay networks as nodes join and
transmits lost packets in a peer-to-peer fashion over them.
NICE [9] creates hierarchical clusters of multicast peers and
aims to minimize control traffic overhead. AMMO [43] allows
applications to specify performance constraints for selection
of multi-metric overlay trees. DC2 [44] is a hierarchy-aware
group communication technique to minimize cross-hierarchy
communication. SplitStream [45] builds forests of multicast
trees to distribute load across many machines. Due to lack
of complete knowledge of underlying network topology and
status (e.g. link utilizations, congestion or even failures),
overlay systems are limited in reducing bandwidth usage and
managing distribution of traffic.
Alternatives to multicasting for bulk data distribution in-
clude peer-to-peer [10], [11], [46] and store-and-forward [47]–
[50] approaches. Peer-to-peer approaches do not consider
careful link level traffic allocation and scheduling and do not
focus on minimizing bandwidth usage. The main focus of peer-
to-peer schemes is to locally and greedily optimize completion
times rather than global optimization over many transfers
across a network. Store-and-forward approaches focus on
minimizing costs by utilizing diurnal traffic patterns while
delivering bulk objects and incur additional bandwidth and
storage costs on intermediate datacenters. Coflows are another
related concept where flows with a collective objective are
jointly scheduled for improved performance [51]. Coflows
however do not aim at bandwidth savings.
There are recent solutions for management of P2MP trans-
fers with deadlines. DDCCast [52] uses a single forwarding
tree and the As Late As Possible (ALAP) scheduling policy for
admission control of multicast deadline traffic. In [53], the authors propose using a few parallel forwarding trees from the source to all receivers to increase throughput and meet more deadlines, considering transfer priorities and volumes. The techniques we proposed in QuickCast for receiver set partitioning can be applied to these prior works for further performance gains.
In the design of QuickCast, we did not focus on throughput-optimality, which guarantees network stability for any offered load in the capacity region. We believe our tree selection approach, which balances load across many existing forwarding trees, aids in moving towards throughput-optimality. One could consider developing a P2MP joint scheduling and routing scheme with throughput-optimality as the objective. However, in general, throughput-optimality does not necessarily lead to the highest performance (e.g., lowest latency) [54], [55].
Reliability: Various techniques have been proposed to make
multicasting reliable including use of coding and receiver
(negative or positive) acknowledgements or a combination of
them [56]. In-network caching has also been used to reduce
recovery delay, network bandwidth usage and to address
the ACK/NAK implosion problem [13], [57]. Using positive
ACKs does not lead to ACK implosion for medium scale (sub-
thousand) receiver groups [36]. TCP-XM [58] allows reliable
delivery by using a combination of IP multicast and unicast
for data delivery and re-transmissions. MCTCP [41] applies
standard TCP mechanisms for reliability. Receivers may also
send NAKs upon expiration of some inactivity timer [37].
NAK suppression to address implosion can be done by routers
[57]. Forward Error Correction (FEC) can be used for reliable delivery, whereby a sender encodes k pieces of an object into n pieces (k ≤ n), where any k out of the n pieces allow for recovery [59], [60]. FEC has been used to reduce retransmissions [37] and improve completion times [61].
Some popular FEC codes are Raptor Codes [62] and Tornado
Codes [63].
Congestion Control: PGMCC [64], MCTCP [41] and TCP-
SMO [36] use window-based TCP like congestion control to
compete fairly with other flows. NORM [37] uses an equation-
based rate control scheme. Datacast [13] determines the rate
according to duplicate interest requests for data packets. All
of these approaches track the slowest receiver.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented the QuickCast algorithm to reduce completion times of P2MP transfers across datacenters. We showed that by breaking the receiver sets of P2MP transfers with many receivers into smaller subsets and using a separate tree per subset, we can reduce completion times. We proposed partitioning according to proximity as an effective approach for finding such receiver subsets, and showed that partitioning needs to be applied to transfers selectively. To do so, we proposed
a partitioning factor that can be tuned according to topology
and traffic distribution. Further investigation is required on
finding metrics to selectively apply partitioning per transfer.
Also, investigation of partitioning techniques that optimize
network performance metrics as well as study of optimality
bounds of such techniques are left as part of future work.
Next, we discovered that while managing P2MP transfers with many receivers, the FCFS policy used in DCCast creates significant contention as a result of overlapping forwarding trees, considerably reducing utilization. We applied the Fair Sharing policy, reducing network contention and improving completion times. Finally, we performed experiments with well-known rate-allocation policies and found that Max-Min Fairness provides much lower completion times compared to SRPT while scheduling over large forwarding trees. More research is needed on the best rate-allocation policy for P2MP transfers. Alternatively, one may derive a more effective joint partitioning, rate-allocation and forwarding tree selection algorithm by approximating a solution to the optimization model we proposed.
REFERENCES
[1] “Predicting sd-wan adoption,” http://bit.ly/gartner-sdwan, visited on July
21, 2017.
[2] “Microsoft azure: Cloud computing platform & services,” https://azure.microsoft.com/.
[3] “Compute engine - iaas - google cloud platform,” https://cloud.google.com/compute/.
[4] C.-Y. Hong, S. Kandula, R. Mahajan et al., “Achieving high utilization
with software-driven wan,” in SIGCOMM. ACM, 2013, pp. 15–26.
[5] S. Jain, A. Kumar et al., “B4: Experience with a globally-deployed
software defined wan,” SIGCOMM, vol. 43, no. 4, pp. 3–14, 2013.
[6] S. Kandula, I. Menache, R. Schwartz, and S. R. Babbula, “Calendaring
for wide area networks,” SIGCOMM, vol. 44, no. 4, pp. 515–526, 2015.
[7] X. Jin, Y. Li, D. Wei, S. Li, J. Gao, L. Xu, G. Li, W. Xu, and
J. Rexford, “Optimizing bulk transfers with software-defined optical
wan,” in SIGCOMM. ACM, 2016, pp. 87–100.
[8] S. Deering, “Host Extensions for IP Multicasting,” Internet Requests for Comments, pp. 1–16, 1989. [Online]. Available: https://tools.ietf.org/html/rfc1112
[9] S. Banerjee, B. Bhattacharjee, and C. Kommareddy, “Scalable applica-
tion layer multicast,” in SIGCOMM. ACM, 2002, pp. 205–217.
[10] R. Sherwood, R. Braud, and B. Bhattacharjee, “Slurpie: a cooperative
bulk data transfer protocol,” in INFOCOM, vol. 2, 2004, pp. 941–951.
[11] J. Pouwelse, P. Garbacki, D. Epema et al., The Bittorrent P2P File-Sharing System: Measurements and Analysis. Springer Berlin Heidelberg, 2005.
[12] A. Iyer, P. Kumar, and V. Mann, “Avalanche: Data center multicast using
software defined networking,” in COMSNETS. IEEE, 2014, pp. 1–8.
[13] J. Cao, C. Guo, G. Lu, Y. Xiong, Y. Zheng, Y. Zhang, Y. Zhu, C. Chen, and Y. Tian, “Datacast: A scalable and efficient reliable group data delivery service for data centers,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2632–2645, 2013.
[14] S. H. Shen, L. H. Huang, D. N. Yang, and W. T. Chen, “Reliable
multicast routing for software-defined networks,” in INFOCOM, April
2015, pp. 181–189.
[15] L. H. Huang, H. C. Hsu, S. H. Shen, D. N. Yang, and W. T. Chen, “Mul-
ticast traffic engineering for software-defined networks,” in INFOCOM.
IEEE, 2016, pp. 1–9.
[16] A. Nagata, Y. Tsukiji, and M. Tsuru, “Delivering a file by multipath-
multicast on openflow networks,” in International Conference on Intel-
ligent Networking and Collaborative Systems, 2013, pp. 835–840.
[17] K. Ogawa, T. Iwamoto, and M. Tsuru, “One-to-many file transfers
using multipath-multicast with coding at source,” in IEEE International
Conference on High Performance Computing and Communications,
2016, pp. 687–694.
[18] M. Noormohammadpour, C. S. Raghavendra, S. Rao, and S. Kandula,
“Dccast: Efficient point to multipoint transfers across datacenters,” in
HotCloud. USENIX Association, 2017.
[19] N. McKeown, T. Anderson et al., “Openflow: Enabling innovation in
campus networks,” SIGCOMM, vol. 38, no. 2, pp. 69–74, 2008.
[20] B. Pfaff, B. Lantz, B. Heller et al., “Openflow switch specification, ver-
sion 1.1.0 implemented (wire protocol 0x02),” http://archive.openflow.
org/documents/openflow-spec-v1.1.0.pdf, 2011.
[21] “Omniswitch aos release 8 switch management guide,” http://bit.ly/sdn-
lucent, visited on July 21, 2017.
[22] “Openflow v1.3.1 compliance matrix for devices running junos os,” http:
//bit.ly/sdn-juniper, visited on July 21, 2017.
[23] “Hp openflow 1.3 administrator guide,” http://bit.ly/sdn-hp, visited on
July 21, 2017.
[24] “Network os software defined networking (sdn) configuration guide,”
http://bit.ly/sdn-brocade, visited on July 21, 2017.
[25] D. Bertsekas and R. Gallager, “Data networks,” 1987.
[26] H. Zhang, K. Chen, W. Bai et al., “Guaranteeing deadlines for inter-
datacenter transfers,” in EuroSys. ACM, 2015, p. 20.
[27] M. Noormohammadpour, C. S. Raghavendra, and S. Rao, “Dcroute:
Speeding up inter-datacenter traffic allocation while guaranteeing dead-
lines,” in High Performance Computing, Data, and Analytics (HiPC).
IEEE, 2016.
[28] H. H. Liu, S. Kandula, R. Mahajan, M. Zhang, and D. Gelernter, “Traffic
engineering with forward fault correction,” in SIGCOMM. ACM, 2014,
pp. 527–538.
[29] M. Stanojević and M. Vujošević, “An exact algorithm for Steiner tree
problem on graphs,” International Journal of Computers Communica-
tions & Control, vol. 1, no. 1, pp. 41–46, 2006.
[30] A. Roy, H. Zeng, J. Bagga, G. Porter, and A. C. Snoeren, “Inside the
social network’s (datacenter) network,” in SIGCOMM. ACM, 2015,
pp. 123–137.
[31] “The internet topology zoo (cogent),” http://www.topology-zoo.org/files/
Cogentco.gml, visited on July 19, 2017.
[32] “Understanding how the openflow group action works,”
https://www.juniper.net/documentation/en_US/junos/topics/concept/
junos-sdn-openflow-groups.html, visited on March 14, 2017.
[33] “Hp 5130 ei switch series openflow configuration guide,” https:
//support.hpe.com/hpsc/doc/public/display?docId=c04217797&lang=
en-us&cc=us, visited on Dec 8, 2017.
[34] M. Noormohammadpour and C. S. Raghavendra, “Datacenter Traffic
Control: Understanding Techniques and Trade-offs,” arXiv preprint
arXiv:1712.03530, 2017. [Online]. Available: https://arxiv.org/abs/1712.
03530
[35] S. Keshav and S. Paul, “Centralized multicast,” in International Confer-
ence on Network Protocols. IEEE, 1999, pp. 59–68.
[36] S. Liang and D. Cheriton, “Tcp-smo: extending tcp to support medium-
scale multicast applications,” in Proceedings of the Twenty-First Annual
Joint Conference of the IEEE Computer and Communications Societies,
vol. 3, 2002, pp. 1356–1365.
[37] B. Adamson, C. Bormann, M. Handley, and J. Macker, “Nack-oriented
reliable multicast (norm) transport protocol,” 2009.
[38] C. A. C. Marcondes, T. P. C. Santos, A. P. Godoy, C. C. Viel, and
C. A. C. Teixeira, “Castflow: Clean-slate multicast approach using in-
advance path processing in programmable networks,” in IEEE Sympo-
sium on Computers and Communications, 2012, pp. 94–101.
[39] J. Ge, H. Shen, E. Yuepeng, Y. Wu, and J. You, “An openflow-
based dynamic path adjustment algorithm for multicast spanning trees,”
in IEEE International Conference on Trust, Security and Privacy in
Computing and Communications, 2013, pp. 1478–1483.
[40] X. Ji, Y. Liang, M. Veeraraghavan, and S. Emmerson, “File-stream
distribution application on software-defined networks (sdn),” in IEEE
Annual Computer Software and Applications Conference, vol. 2, July
2015, pp. 377–386.
[41] T. Zhu, F. Wang, Y. Hua, D. Feng et al., “Mctcp: Congestion-aware
and robust multicast tcp in software-defined networks,” in International
Symposium on Quality of Service, June 2016, pp. 1–10.
[42] D. Li, M. Xu, M.-C. Zhao, C. Guo, Y. Zhang, and M.-Y. Wu, “Rdcm:
Reliable data center multicast,” in INFOCOM. IEEE, 2011, pp. 56–60.
[43] A. Rodriguez, D. Kostic, and A. Vahdat, “Scalability in adaptive multi-
metric overlays,” in International Conference on Distributed Computing
Systems, 2004, pp. 112–121.
[44] K. Nagaraj, H. Khandelwal, C. Killian, and R. R. Kompella, “Hierarchy-
aware distributed overlays in data centers using dc2,” in COMSNETS.
IEEE, 2012, pp. 1–10.
[45] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron,
and A. Singh, “Splitstream: High-bandwidth multicast in cooperative
environments,” in SOSP. ACM, 2003, pp. 298–313.
[46] M. Hefeeda, A. Habib, B. Botev et al., “Promise: Peer-to-peer media
streaming using collectcast,” in MULTIMEDIA. ACM, 2003, pp. 45–54.
[47] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, “Inter-datacenter
bulk transfers with netstitcher,” in SIGCOMM. ACM, 2011, pp. 74–85.
[48] S. Su, Y. Wang, S. Jiang, K. Shuang, and P. Xu, “Efficient algorithms
for scheduling multiple bulk data transfers in inter-datacenter networks,”
International Journal of Communication Systems, vol. 27, no. 12, 2014.
[49] N. Laoutaris, G. Smaragdakis, R. Stanojevic, P. Rodriguez, and R. Sun-
daram, “Delay-tolerant bulk data transfers on the internet,” IEEE/ACM
TON, vol. 21, no. 6, 2013.
[50] Y. Wang, S. Su et al., “Multiple bulk data transfers scheduling among
datacenters,” Computer Networks, vol. 68, pp. 123–137, 2014.
[51] M. Chowdhury and I. Stoica, “Coflow: A networking abstraction for
cluster applications,” in Workshop on Hot Topics in Networks. ACM,
2012, pp. 31–36.
[52] M. Noormohammadpour and C. S. Raghavendra, “DDCCast: Meeting
Point to Multipoint Transfer Deadlines Across Datacenters using ALAP
Scheduling Policy,” Department of Computer Science, University of
Southern California, Tech. Rep. 17-972, 2017.
[53] S. Ji, “Resource optimization across geographically distributed datacen-
ters,” Master’s thesis, University of Toronto, 2017.
[54] A. Yekkehkhany, “Near Data Scheduling for Data Centers with Multi
Levels of Data Locality,” arXiv preprint arXiv:1702.07802, 2017.
[55] A. Yekkehkhany, A. Hojjati, and M. H. Hajiesmaili, “Gb-pandas:
Throughput and heavy-traffic optimality analysis for affinity scheduling,”
arXiv preprint arXiv:1709.08115, 2017.
[56] M. Handley, L. Vicisano, M. Luby, B. Whetten, and R. Kermode,
“The reliable multicast design space for bulk data transfer,” Internet
Requests for Comments, pp. 1–22, 2000. [Online]. Available:
https://tools.ietf.org/html/rfc2887
[57] L. H. Lehman, S. J. Garland, and D. L. Tennenhouse, “Active reliable
multicast,” in INFOCOM, vol. 2, Mar 1998, pp. 581–589.
[58] K. Jeacle and J. Crowcroft, “Tcp-xm: unicast-enabled reliable multicast,”
in ICCCN, 2005, pp. 145–150.
[59] M. Luby, L. Vicisano, J. Gemmell et al., “The use of forward error
correction (fec) in reliable multicast,” 2002.
[60] M. Luby, J. Gemmell, L. Vicisano, L. Rizzo, and J. Crowcroft, “Asyn-
chronous layered coding (alc) protocol instantiation,” 2002.
[61] C. Gkantsidis, J. Miller, and P. Rodriguez, “Comprehensive view of a
live network coding p2p system,” in IMC. ACM, 2006, pp. 177–188.
[62] A. Shokrollahi, “Raptor codes,” IEEE Transactions on Information
Theory, vol. 52, no. 6, pp. 2551–2567, 2006.
[63] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege, “A digital
fountain approach to reliable distribution of bulk data,” in SIGCOMM.
ACM, 1998, pp. 56–67.
[64] L. Rizzo, “Pgmcc: A tcp-friendly single-rate multicast congestion con-
trol scheme,” in SIGCOMM, 2000.