Creek: Inter Many-to-Many Coflows Scheduling
for Datacenter Networks
Hengky Susanto1, Ahmed M. Abdelmoniem2, Hao Jin3, Brahim Bensaou4
HKUST1,2,3,4, Assiut University2, Texas A&M University3
hsusanto@cs.uml.edu1, amas@cse.ust.hk2, haojin@tamu.edu3 , brahim@cse.ust.hk4
Abstract: Datacenter networked applications often require multiple data transfer flows that semantically constitute a coflow group. A coflow is thus considered completed when all the transfers in the coflow are completed. Hence, application performance is optimized whenever the completion time of a coflow is minimized, rather than that of the individual flows composing it. Currently, popular coflow scheduling algorithms are mostly centralized, and they incur high overheads. The decentralized approach in the "many-to-many" scenario also incurs high communication overheads due to the communication among the local controllers. In this paper, we present a coflow scheduling mechanism that aims to minimize the coflow completion time for coflows that show a many-to-many communication pattern, and as a byproduct the communication overhead cost is also minimized. Our algorithm preserves compatibility with existing commodity switches and network protocols and improves the coflow completion times on average by 1.8 times compared to the baseline, as demonstrated via testbed implementation and large-scale simulation.
1 This work has appeared in the IEEE International Conference on Communications (ICC), 2019. DOI: 10.1109/ICC.2019.8762027.
I. INTRODUCTION
Recently, the term coflow has been coined to provide a
meaningful semantic that translates application performance
requirements in datacenter networks into performance metrics
that can be understood at network level (e.g., in the data
plane). In networking context, a coflow consists of a set of
concurrently active data flows set to complete a specific data
transfer started by the application. Typically, the completion
of data transfer of all flows within the same coflow signifies
the completion of the communication stage for the
application. Applications strive to achieve faster completion
of their communication tasks, which translates into
minimizing coflows' completion times (CCT). However, due
to the simultaneity of the flows in the network, minimizing
greedily the CCT may induce inter-coflow bottlenecks, which
can in turn severely degrade the performance at the
application level.
To address these dependency problems, many recent
proposals put this problem into the form of CCT
minimization. The popular approaches are usually designed in
centralized manner [4,5,6,7,8,9] where a single centralized
scheduler is responsible for scheduling the coflows of the
entire network. However, a high overhead cost is incurred for
maintaining such a centralized system in large datacenters. As
an alternative, various decentralized state-of-the-art solutions
have been proposed. For instance, Baraat [3] requires switch
modifications where the task of scheduling coflows is
performed at the switches. Baraat, however, lacks access to
coflow level information because switches only have access
to information at flow level, which leads to sub-optimal
outcome. In addition, because this decentralized solution
requires elaborate software modifications in the switches, it is
harder to deploy. Stream [27] does not require switch
modification but requires local controllers of the same coflow
to exchange information, which may result in extra
communication overhead cost. Moreover, decentralized
schemes also commonly suffer from sub-optimal outcome
because of the lack of a complete picture of coflow states and
the inability to achieve global coordination between the local
controllers. In this paper we present Creek, a decentralized
inter coflow scheduler for coflows that exhibit a many-to-
many communication pattern, without requiring hardware
modifications while imposing only minimal communication
overheads. Creek is designed to resolve the challenges
encountered in decentralized scheduling systems, while
possessing the key advantages of centralized systems. Creek is
capable of acquiring a more complete picture of coflow states
and accomplishes an approximate global coordination, to
achieve near optimal performance, without the overhead cost
experienced by centralized solutions.
The key to the solution depends on understanding the
communication pattern which provides insights to achieve the
objective of minimizing CCTs effectively. Many-to-one is a
pattern where a single node receives data transfers from many
senders, which together form a single coflow [22,23,24]. Many-to-many
is a pattern where many receivers receive data transfers from
many senders [18,20]. In other words, a single many-to-many
coflow consists of multiple many-to-one coflows; this
many-to-many pattern is the focus of this paper.
To achieve its targets, Creek acquires the necessary
information on coflows at the receiver end. The scheduling policy
is enforced and communicated by leveraging existing network
components (e.g., functionalities that are commonly available
in commodity switches) and the mechanics of existing
transport protocols such as TCP/IP. For inter-coflow
scheduling decisions, Creek invokes the well-known Smallest
Task First (STF) policy. To reduce the communication
overhead incurred by the receivers of a coflow in
communicating with each other, Creek outsources the
information management to a third party, which can be a
designated node that stores coflow information.
In our performance analysis, we evaluate our solution
through actual testbed experiments and large-scale simulation
experiments. In the testbed experiments, we implement Creek
and deploy the prototype in a small datacenter testbed, which
also shows that the solution is friendly to production
deployments. Moreover, the experiments demonstrate that Creek
outperforms the baseline schedulers by at least 1.8 times. In
the large-scale simulation, we evaluate Creek’s performance
by replaying an actual trace of coflow traffic workload
collected from 3000 servers (150 racks) in Facebook
production datacenter [4].
Specifically, the evaluation is performed by using widely
accepted traces from Facebook along with two benchmarks:
TPC-DS [5] query and Facebook’s Tao structure [28]. In our
evaluation, Creek exceeds both Baraat and the traditional per-flow fair sharing scheme by 1.85x on average and achieves comparable performance with the centralized scheme. As for mice coflow CCT, Creek is up to 28x better than per-flow fair sharing and up to 18x better than Baraat. Here, Creek also achieves a similar outcome to centralized systems. Finally, the findings in [4] show that priority-based schemes exhibit diminishing-return behavior, and in this paper, we provide insight into this behavior through theoretical and experimental (testbed and simulation) results.
Our contributions can be summarized as follows:
1. We propose a coflow scheduling scheme for coflows with
many-to-many communication patterns, which minimizes
the communication overhead between receivers.
2. We deploy our solution in our mini datacenter and evaluate
it in a large-scale setting via simulation.
The rest of the paper is organized as follows. We present
previous related work in section II and the system model in
section III. Then, we describe Creek in section IV. Simulation
results are presented in section V, then concluding remarks are
given in section VI.
II. RELATED WORK
One of the early works on this theme is Orchestra [6],
where the semantics among flows are taken into account in the
design of flow transfer optimization in datacenters. By
adopting the smallest-total-size-first scheduling policy,
Sincronia [2], Varys [4], Aalo [5], and NC-DRF [21] improve
the performance compared to [6]. RAPIER [7] extends [4] by
incorporating routing algorithms into the scheduling scheme.
Likewise, CODA [12] also extends the problem in [4] by
integrating machine learning into the coflow scheduling
scheme. In later development, the authors of [8,9] extended
the problem in [4] by taking into account the importance level
of different coflows and reformulated the problem into a
weighted CCTs minimization problem. The aforementioned
schemes fall into the centralized scheduling category that
typically provide near optimal scheduling. However, these
approaches are criticized for incurring very high cost of
centralized management and are generally hard to realize in
practice because they require significant switch modifications
and/or a complex control plane.
On the other hand, as an alternative, there is the
decentralized approach. In this approach, Baraat [3]
dominates as the state-of-the-art decentralized coflow
scheduling system. Baraat relies on various heuristics and is
based on a multiplexed First-In First-Out (FIFO) principle.
In Baraat, whenever large coflows are observed in the
network, mice flows are processed in the background.
Otherwise, mice flows are processed according to the trivial
FIFO scheduling. Even though Baraat proves to be effective, it
has several drawbacks: first, its scheduling decisions are made
locally at the switches, which limits the scheduler's access to
flow-level information only, so the scheduler has incomplete
information about coflow states, which results in sub-optimal
performance; second, Baraat also requires modifications to the
switches, which makes it not deployment-friendly.

Table 1. Terminology
Flow size: the length of a flow.
Coflow size: the sum of all flow sizes in a coflow, in bytes.
Coflow width: the number of parallel flows in a coflow.
Coflow length: the largest (longest) flow in the coflow, in bytes.

Fig. 1. Data shuffle between mappers and reducers in Hadoop [18].
Fig. 2. CDF of (a) coflow size, (b) coflow length, and (c) coflow width from Facebook [4], and (d) coflow size in Microsoft Bing [3].
Stream [27] is another recently proposed decentralized
scheduler that opportunistically chooses the receiver in many-
to-one and many-to-many communication patterns. However,
since Stream requires its receivers of a same coflow to
communicate with each other for coordination, it incurs a high
communication overhead cost. In our work, we adopt a
different approach where we solve the general coflow
scheduling problem in a decentralized manner for many-to-many
patterns, without requiring hardware modifications and with
minimal communication overhead.
III. SYSTEM MODEL
In this section, we discuss the coflow abstraction,
describe the characteristics of coflows as observed in
production datacenter environments, and then introduce the
network model used in the study.
Coflow Abstraction. A coflow state is generally
characterized by three parameters, the number of its
concurrent flows (usually called the width), the total number
of bytes transferred (referred to as the size), and the longest flow
in bytes (called the coflow length). For example, a coflow
state can be known by tracking the number of completed
flows of the coflow, the number of bytes transferred/received
of the coflow, and so on.
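To make these parameters concrete, the following is a minimal C++ sketch (our illustration, not code from the paper) of the per-coflow state a receiver could track; all names are hypothetical.

#include <cstdint>
#include <string>

// Illustrative per-coflow state tracked at a receiver (hypothetical layout).
struct CoflowState {
  std::string coflow_id;        // globally unique coflow identifier
  uint32_t width = 0;           // number of concurrent flows (coflow width)
  uint64_t size_bytes = 0;      // total bytes over all flows (coflow size)
  uint64_t length_bytes = 0;    // bytes of the longest flow (coflow length)
  uint32_t completed_flows = 0; // flows of the coflow that have finished
  uint64_t bytes_received = 0;  // bytes received so far across all flows
};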
Coflows in Production. In [4], it is observed that coflow sizes
in production environments (a Facebook datacenter) follow a
heavy-tailed distribution. More precisely, large coflows, of
at least 10 Gb and the ones of at least 1 Gb amount to only 8%
and 15% respectively of all coflows, in spite of being
respectively responsible for 98% and 99.6% of the total traffic
in the datacenter. This implies that most coflows in the
datacenter are small in size and contribute the least bytes to
the network. This is illustrated by Fig. 2a, 2b, and 2c. The
same findings are observed in [3, 6] from Microsoft’s
datacenter, as illustrated by Fig. 2d. In [14] data-mining
application traffic is studied, and here also the distribution of
flow sizes is found to be heavily tailed, with 95% of all data
bytes coming from flows larger than 35MB, which make for
only 3.6% of all flows. This confirms that data-mining
applications also generate mostly small-sized flows, while most of the
traffic in the network comes from the few large-sized flows.
Network Model. In this work, we consider a Tree-based
topology [3, 7, 8, 10,19,25]. We conduct the experiments in
a testbed and via NS-3 simulation using the FatTree topology
[30]. From the experiments, we find that the processing and
queuing times are significant in the aggregation and core
switches which agrees with the findings in [10,11,25].
Moreover, we find that the bottleneck has shifted from ToR
switches to become more evenly distributed among different
layers. This is due to the high speed NICs matching the
speeds of core switch ports.
IV. SCHEDULING SCHEME
A. Problem Formulation
The problem for the offline case of coflow scheduling can
be formulated as follows: there are n coflows, numbered 1, 2,
..., n, in the system. The CCT minimization problem can then be
written as:

  \min \sum_{k=1}^{n} T_k                                          (1)
  \text{s.t.} \quad \sum_{f \in l} r_f \le C_l, \quad \forall l \in L,      (1.a)
  \quad\quad\;\; r_f(t) > 0, \quad \forall f \in k, \ \text{while}\ t \le T_k,   (1.b)

where T_k refers to the completion time of coflow k, defined as
T_k = \max\{T_f : \forall f \in k\}, and T_f refers to the completion time
of a flow f in k. That is, T_k is simply the completion time of
the slowest flow in coflow k. Constraint (1.a) expresses the link
capacity constraints, i.e., it ensures that the aggregate rate of the flows on
link l does not exceed the link capacity C_l. Constraint (1.b)
ensures that starvation and packet out-of-order problems are
eliminated, since every flow of an unfinished coflow keeps receiving a
non-zero rate. Observe that the above formulation for CCT
minimization is an NP-Hard problem [3,4] as it is reducible to
the well-known Open Shop Scheduling Problem [4].
B. Decentralized Coflow Scheduling Mechanism
Prior works focused on the many-to-one scenario with
the assumption that coflow size is unknown a priori. In this
paper, however, we address the more difficult many-to-many
scenarios.
Generally, Creek uses the STF scheduler to reduce the
CCTs by simply giving a high priority to smaller coflows
over larger ones. The problem is that coflow size is unknown
a priori, as prior size measurement is not possible. So, we can
dynamically compare the coflow size to a threshold \delta at the
receiver's end. Then, if the coflow size exceeds \delta, the
coflow is demoted. Also, initially we assign all new coflows
to the highest priority and then dynamically downgrade their
priority based on the number of bytes received. The receiver
then updates the workers with the new priority values by
piggybacking this priority on the ACK packets.
Creek also takes into account the coflow condition when
assigning the coflow to a priority group (e.g., the number of
completed flows). It also ensures compatibility with
commodity switches, by performing the scheduling at the
receiver side as the information on the coflow and its flows
is readily available at the receiver side.
Creek enforces STF scheduling policy by relying on the
multiple priority queues commonly found in most
commodity switches, to realize a multi-level feedback queue
(MLFQ). As pointed out in [5], MLFQ may result in the
starvation of some flows and Weighted Fair Queuing (WFQ)
may provide a better solution. In Creek, MLFQ is adopted
because priority queues provide better in-network
prioritization and potentially achieves lower CCT. Moreover,
WFQ may introduce the out-of-order problem for TCP flows.
Having said that, later we propose an algorithm that ensures
starvation free operation for Creek.
Coflow Priority Decision. Consider a commodity switch
with Q priority queues [1]. Given a coflow k, the notation p_f^i
denotes that the i-th priority queue is assigned to flow f \in k,
with 1 \le i \le Q, where p_f^1 is the highest priority and p_f^Q is
the lowest. Each priority level i is associated with a threshold
\delta_i. Note that most existing commodity switches only support a
maximum of 8 priority queues [1]. Let p_f denote the priority
assigned to flow f, such that p_f = p_f^i for some i. Initially, all
flows f are assigned to p_f^1, i.e., \forall f \in k, p_f = p_f^1. Thereafter,
given the number of bytes sent by flow f (its observed flow size),
the priority p_f evolves as described below.
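As an illustration of this demotion rule, the sketch below (our own, under the assumption of 8 queues and placeholder threshold values) maps the bytes received so far for a coflow to a priority level; the actual thresholds in Creek are chosen empirically, as discussed later.

#include <array>
#include <cstdint>

// Illustrative demotion rule: map bytes received so far to one of Q = 8
// priority levels (1 = highest). The threshold values below are placeholders.
constexpr int kNumQueues = 8;
constexpr std::array<uint64_t, kNumQueues - 1> kDelta = {
    100ULL << 10, 1ULL << 20, 10ULL << 20, 50ULL << 20,
    100ULL << 20, 1ULL << 30, 10ULL << 30};  // delta_1 .. delta_7, in bytes

int PriorityFor(uint64_t bytes_received) {
  int level = 1;  // every new coflow starts at the highest priority
  for (uint64_t delta : kDelta) {
    if (bytes_received <= delta) break;
    ++level;  // the coflow has outgrown this level's threshold, so demote it
  }
  return level;  // 1 (highest) .. kNumQueues (lowest)
}

The receiver would then piggyback the returned level on outgoing ACKs so that workers tag subsequent packets of the coflow accordingly.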
Coflow management. In coflows that create many-to-many
communication patterns, a coflow typically consists of
many sub-coflows. In such a case, there are many
receivers in a single coflow. Hence, sub-coflows of the same
coflow are considered as a single entity, and the completion
of the coflow relies on the completion of all of its sub-coflows.
Among the many scheduling challenges with this
pattern in decentralized settings are how to keep track of the
relationship among sub-coflows of the same coflow, how to decide
appropriate priority values when coflow information is
sparse, and the fact that a sub-coflow may not know about some of the
other sub-coflows. To address these challenges, Creek
utilizes shared-storage to allow sub-coflows of the same
coflow to easily exchange the necessary status information
with each other. In other words, the receivers of the same
coflow will share and access the same data storage.
A task manager allocates a small amount of space at a
designated storage space in a server to every new coflow.
Thus, all receivers of this coflow use this storage to provide
information, such as updates and queries on the total bytes
that have been sent. Hence, the number of communications
within a coflow can be reduced from O(n^2) down to O(n),
where n is the number of receivers of a coflow. However,
one of the practical challenges in doing this is how to
synchronize the receivers of a coflow, such that the
information can be updated appropriately without running
into a race condition. That is, there are multiple receivers
sharing a common buffer but only one of them can update the
information at any given time. This problem can be solved
using a locking mechanism, such as a mutex, that allows only a
single receiver to update and modify the information in the
shared storage space at a time. We also utilize a mutex
(semaphore-based) locking mechanism to resolve the race condition
between receivers of a coflow in our testbed implementation.
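As an illustration of this coordination, the following is a minimal sketch (assumed structure, not the paper's implementation) of a mutex-protected shared record that receivers of the same coflow could update and query.

#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Illustrative shared store: one byte counter per coflow, updated by all of
// the coflow's receivers. A mutex serializes updates to avoid race conditions.
class SharedCoflowStore {
 public:
  // A receiver reports additional bytes received for its sub-coflow and gets
  // back the coflow-wide total, which can drive the priority decision.
  uint64_t AddBytes(const std::string& coflow_id, uint64_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    return total_bytes_[coflow_id] += bytes;
  }

  uint64_t TotalBytes(const std::string& coflow_id) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = total_bytes_.find(coflow_id);
    return it == total_bytes_.end() ? 0 : it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::string, uint64_t> total_bytes_;  // coflow ID -> bytes sent
};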
Starvation Mitigation. To resolve the starvation problem,
when the wait exceeds a waiting threshold, the worker of the
starving flow retransmits packets that have not been
acknowledged, using a higher priority assignment. Duplicate
packets, if any, are dropped at the receiver by TCP [29]. By
doing so, the solution also avoids the TCP packet out-of-order
problem. The process is repeated until the flow escapes
starvation. Upon receiving a packet from the starving flow, the
receiver compares the priority of the recently sent packet with
the priority currently assigned to the starving flow. If they do
not match, the receiver increases that coflow's priority and
notifies the worker of the starving flow of the new priority
through the ACK packet. ECN could help in mitigating
starvation, but it may accidentally mark packets from mice
coflows, because ECN is not designed to be coflow-aware.
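A minimal sketch of one possible sender-side realization of this check (our assumption, with hypothetical names): if a flow has made no progress for longer than a waiting threshold, its unacknowledged data is retransmitted at the next higher priority.

#include <chrono>

// Illustrative starvation check at the worker. The waiting threshold is
// assumed to be set below the TCP RTO, per the rule of thumb given below.
struct FlowProgress {
  std::chrono::steady_clock::time_point last_ack_advance;  // last ACK progress
  int priority;                                            // 1 (highest) .. Q
};

bool ShouldEscalate(const FlowProgress& flow,
                    std::chrono::milliseconds waiting_threshold) {
  auto waited = std::chrono::steady_clock::now() - flow.last_ack_advance;
  return waited > waiting_threshold && flow.priority > 1;
}
// If true, the worker retransmits the unacknowledged segments tagged with
// priority (flow.priority - 1); the receiver drops duplicates and confirms
// the new priority through the next ACK.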
Setting the threshold. The value of the threshold is important in
determining system performance. If the threshold is too
small or too large, packets of short flows may experience long
queuing delays behind elephant flows. Although thresholds are
commonly used in system design [3,4,10,25,28], there is very
little study on how a threshold should be set such that the
system achieves optimality.
From our experiments, we derive two observations: (i)
thresholds should be able to quickly direct traffic into the
appropriate queue; (ii) to mitigate starvation, the wait at the
lowest priority should not exceed the TCP retransmission
timeout (RTO) [29]. Using these two rules of thumb, our
threshold leads to very minimal starvation in our testbed
experiments. At this point, however, the threshold is decided
by using exhaustive search, which may imply a higher
overhead cost for larger systems. We will further investigate
setting of thresholds using machine learning techniques as
proposed in [26] in our future work.
Data structure. One challenge in implementing Creek is
keeping track of the number of bytes sent by a large
number of coflows. In practice, multiple coflows arrive and
complete their tasks continuously. Thus, information on coflows must be
added to or removed from the data structure when coflows start and
complete, respectively. For this reason, the data structure must
be adaptive to the dynamics of this start-complete cycle while at
the same time keeping the computation cost low (e.g., for lookup
operations).
In our testbed implementation, we use two dynamic arrays
available in the C++ standard library (i.e., vector) to track the
bytes sent by coflows and sub-coflows. We assume that the coflow ID is
globally unique. Then, Creek uses these IDs as the coflow index, and
the information is inserted such that the IDs are kept sorted in
increasing order, which takes linear time using existing techniques.
Since the structure changes dynamically due to the start-complete
cycle, straightforward hashing is not suitable for the lookup
operation. To resolve this, Creek uses binary search [32],
which takes O(log n) time, with n being the array size. This is
possible because the array is sorted.
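The following is a minimal sketch (illustrative; the layout and names are ours) of this sorted-array bookkeeping, using ordered insertion for new coflows and binary search for lookups.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative tracker: coflow IDs are kept sorted so that lookups take
// O(log n) via binary search; a parallel array holds the bytes sent.
class CoflowTracker {
 public:
  void OnCoflowStart(uint64_t coflow_id) {
    auto it = std::lower_bound(ids_.begin(), ids_.end(), coflow_id);
    bytes_.insert(bytes_.begin() + (it - ids_.begin()), 0);
    ids_.insert(it, coflow_id);  // keeps ids_ sorted in increasing order
  }

  void AddBytes(uint64_t coflow_id, uint64_t bytes) {
    auto it = std::lower_bound(ids_.begin(), ids_.end(), coflow_id);
    if (it != ids_.end() && *it == coflow_id)
      bytes_[it - ids_.begin()] += bytes;
  }

  void OnCoflowComplete(uint64_t coflow_id) {
    auto it = std::lower_bound(ids_.begin(), ids_.end(), coflow_id);
    if (it != ids_.end() && *it == coflow_id) {
      bytes_.erase(bytes_.begin() + (it - ids_.begin()));
      ids_.erase(it);
    }
  }

 private:
  std::vector<uint64_t> ids_;    // sorted coflow IDs
  std::vector<uint64_t> bytes_;  // bytes sent, aligned with ids_
};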
Without careful coordination between insertion, deletion,
update, and information retrieval operations, the system may
end up in a race condition where different threads compete to
modify the same information or data structure. This can result
in inaccurate information updates. For example, two threads
performing simultaneous updates at the same location in
memory may cause the new information to reflect only one of
the updates instead of both.
We mitigate this issue by utilizing strict-priority-queue
non-preemptive scheduling (SPQ-NS). Here, an operation
(e.g., delete, update, insert, or read) cannot be interrupted
while it is being performed, even when there is an operation
waiting in a higher-priority queue. Operations in each
queue of SPQ-NS are performed in first-in-first-out order, and
the coordination between operations is done using a mutex.
Here, information update and insert operations are assigned
the highest priority, information retrieval is assigned a
lower priority, and the delete operation is assigned the lowest
priority. Moreover, deletion is only performed once per time
interval (e.g., every second). By doing this we prevent
race conditions.
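A minimal sketch (our interpretation, with hypothetical names) of the SPQ-NS idea: operations are enqueued into strict-priority FIFO queues and drained one at a time, so that an operation always runs to completion before the next one, drawn from the highest non-empty queue, is started.

#include <functional>
#include <mutex>
#include <queue>

// Illustrative SPQ-NS executor: three FIFO queues (update/insert, read,
// delete, from highest to lowest priority) drained non-preemptively.
class SpqNsExecutor {
 public:
  enum Priority { kUpdateInsert = 0, kRead = 1, kDelete = 2 };  // 0 = highest

  void Submit(Priority p, std::function<void()> op) {
    std::lock_guard<std::mutex> lock(mu_);
    queues_[p].push(std::move(op));
  }

  // Picks the highest-priority pending operation and runs it to completion.
  bool RunOne() {
    std::function<void()> op;
    {
      std::lock_guard<std::mutex> lock(mu_);
      for (auto& q : queues_) {
        if (!q.empty()) {
          op = std::move(q.front());
          q.pop();
          break;
        }
      }
    }
    if (!op) return false;
    op();  // non-preemptive: never interrupted by a higher-priority arrival
    return true;
  }

 private:
  std::mutex mu_;
  std::queue<std::function<void()>> queues_[3];
};

Under this discipline, deletions can additionally be batched (e.g., run once per second) by submitting the delete operations on a timer, as described above.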
V. EVALUATION
The performance of the proposed scheduling scheme is
evaluated via a number of experiments in a 10Gbps testbed
as well as via large-scale network simulation using NS3 with
Facebook traces from [4,5]. The main metrics used for
evaluation are the average CCT and the performance
improvement ratio, which is calculated as the ratio of the
target scheme’s CCT to the CCT achieved by Creek. So, if
the improvement x is greater (smaller) than one, then Creek
is faster (slower) than the target scheme by x times.
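For concreteness, the improvement ratio used throughout this section can be written as

  \text{improvement} = \mathrm{CCT}_{\text{target scheme}} / \mathrm{CCT}_{\text{Creek}},

so, for example, an average CCT of 14.9 ms for the target scheme against 8.1 ms for Creek (the numbers of the first testbed scenario below) corresponds to an improvement of about 1.8x.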
The main findings are summarized below:
1. In the testbed experiments, Creek significantly reduces the average coflow CCTs relative to TCP by up to 1.8x and the average mice-flow FCTs by up to 1.833x.
2. In the simulation experiments, Creek outperforms decentralized approaches such as Baraat, Per-Flow Fair Sharing (FS), and Stream by up to 1.82x, while achieving comparable outcomes to Aalo's.
A. Testbed Experiment
Prototype: Creek prototype is built on top of the existing
TCP implementation and synthesized as a loadable kernel
module in Linux. Then, we implement a client-server
model to emulate multiple workers and receivers by utilizing
socket programming at the application level. In this model,
packets are transmitted from clients acting as workers to
server acting as receivers. Our prototype randomly generates
216 and 432 TCP flows with different sizes according to a
heavy tailed distribution; then, these flows are randomly
clustered into 20 and 30 coflows, respectively, with each
coflow having 2 and 3 receivers. In this experiment, the TCP
kernel module is modified so that the coflow ID can be
inserted into the IP options field of the packet header [29].
Moreover, we used local memory to store coflow
information, such as the total bytes sent.

Fig. 3. Testbed experiment. Scenario 1: there are 30 coflows, each coflow has 3 receivers, and each receiver serves 5 flows. Scenario 2: there are 20 coflows, each coflow has 2 receivers, and each receiver serves 2 to 3 flows.
Testbed: Fig. 3 shows the testbed used in the evaluation. It
consists of 12 datacenter-scale servers connected together via
a ToR 48-port 1 Gigabit Ethernet switch (Pica8 P-3297) and
a control-plane 4-port 10 Gigabit Ethernet switch. The ToR
switch supports strict priority queuing with at most 8 classes
of services queue [1]. Each server is a HUAWEI RH1288 V2
with 24-core Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz,
64G memory, a 2T hard disk, and Broadcom BCM5719
NetXtreme Gigabit Ethernet NIC. Each server runs Ubuntu
14.04.2 LTS with Linux 4.0 kernel. In the ToR switch, strict
priority queues are enforced, and packets are classified based
on the DSCP field [1,29].
Experiment: To evaluate Creek, we create two experimental
scenarios. In the experiments, 10 machines are running the
client application sending data to an 11th machine running the
server application. In the first one, the experiment is
conducted by starting 432 TCP flows which are classified
into 30 coflows. In the second one, 216 TCP flows are
initiated to make up 20 coflows. In both scenarios, to
reflect a more realistic environment, the 12th server is used to
generate background traffic using iperf, which is a popular
Linux traffic generator, at the speed of 500 Mbps (which is
the equivalent of 50% of the link capacity). This is a common
traffic pattern seen in many datacenters [11]. In both
scenarios, we compare the CCTs of our scheduling scheme to
the CCTs of regular TCP [29]. This set of experiments
is conducted using 8 priority queues. Later in this section, we
conduct another experiment to measure the performance of
using different number of priority queues. One of the
challenges in performing testbed experiments is to generate
sufficient traffic load to reproduce bursty traffic patterns
without causing Denial of Service (DoS) [29]. In our testbed,
traffic with 435 connections or larger causes Denial of
Service.
The testbed results, as shown in Fig. 3(a), show that,
compared to TCP, Creek improves the average performance
by 1.8x and 1.533x in the first and second scenario,
respectively. Specifically, the average CCT with TCP is
14.9 milliseconds in the first scenario (30 coflows) and 9.73
milliseconds in the second scenario (20 coflows); on the other
hand, the average CCT under our scheduling methodology is
8.1 and 6.47 milliseconds, respectively. Similarly, Fig. 3(b)
depicts that our coflow scheduling also improves the average
performance of mice flows by 1.8x and 1.7x with 20 and 30
coflows, respectively. This experiment shows that the
proposed scheme performs better than TCP, especially in
networks with higher traffic load.

Fig. 4. Large-scale experiments using (a) TPC-DS and (b) FB-Tao benchmarks.

Table 2. Five categories of coflows with different sizes in the many-to-many pattern (size B).
Group I: 6 MB - 1 GB
Group II: 1 GB - 10 GB
Group III: 10 GB - 100 GB
Group IV: 100 GB - 1 TB
Group V: >1 TB
B. Large Scale Simulation Experiments
To evaluate our proposed scheduling scheme in a large-scale
network, we develop a flow-level simulator that takes
into account coflow arrival and departure events at the flow
level. It updates the rate and remaining volume of each flow
when the event occurs. We model a data center with 3465
hosts and 720 switches of 10 Gigabit (10G) link speed in Fat-
tree topology [30] of size k=24.
In the simulation experiments, Creek’s performance is
compared to the baseline Per-Flow-Fair-Sharing, Baraat [3],
Stream [27], and Aalo [5]. Per-Flow-Fair-Sharing (PFS)
mechanism is a scheduling scheme that divides the resource
capacity equally among flows traversing the same link, which
is also the baseline in our analysis. Baraat is a First-In-First-Out
(FIFO) scheme with limited multiplexing. Stream is also a
decentralized scheduling scheme, which opportunistically
leverages coflow communication pattern.
Realistic traffic pattern and load. Creek is evaluated using
real traffic patterns and traffic load by replaying 526 coflows
from actual production traffic traces from 3000 servers in
Facebook production datacenter [4,5], which capture a one-
hour Hive/MapReduce trace. In our simulation, Equal-cost
multi-path routing (ECMP) [29], which is used in datacenters
to route and load balance network traffic, is also used.
Moreover, TCP is the dominant transport protocol in
datacenters; hence, we implement rate limiters that act like
TCP for all the schemes, except for Baraat, whose rate limiter
follows its own design [3].
Traffic Pattern. To run the simulations, Cloudera’s
Industrial benchmark is used. Specifically, the TPC-DS
query-42 (TPC-DS) [4] and Facebook Tao (FB-Tao) [28,31]
traces are used to create the many-to-many scenario (because
the Facebook trace only consists of coflows with many-to-one patterns).
We use these benchmarks and insights from [3,4,23,24,31] to
synthesize, from the original trace, a realistic trace with
many-to-many patterns. The coflow sizes for the many-to-many
pattern are shown in Table 2.
Scenario 1: TPC-DS benchmark. Fig. 4(a) shows that
Creek is at least 1.82x better than Baraat and FS, and it
shows similar performance to the centralized scheme Aalo.
Creek also outperforms Baraat, FS, and Aalo in Group I by
almost 1.8x, 1.6x, and 1.2x, respectively. All in all,
compared to Baraat and FS, Creek is at least 1.83x better,
while Creek's and Aalo's performance are comparable.
Scenario 2: FB-Tao benchmark. Fig. 4(b) shows that, on
average, Creek exceeds Baraat, FS, and Stream by 1.6x,
1.2x, and 1.1x, respectively. For small coflows, Creek
is within 1% of Aalo. In conclusion, Creek is better than
both Baraat and FS by at least 1.2x across all groups. This
is because Creek can achieve similar performance to Stream
but without its communication overhead. Creek also has
comparable performance to Aalo across the various groups.
Creek’s ability to quickly differentiate coflows according
to their states with information at the sub-coflow level allows
it to achieve better results compared to Baraat and FS. This
allows Creek to quickly divert coflows and allocate
appropriate resources earlier, which avoids delay. In contrast,
Baraat and FS suffer from longer delays. Moreover, by
outsourcing the information management to a third party,
Creek achieves slightly better performance compared to
Stream, but with significantly less communication
overhead (i.e., O(n) instead of O(n^2)).
On average, Creek's overall performance is comparable to
that of the centralized scheme Aalo. This is because Aalo only realizes
a coflow is a mice coflow when it is completed; this means
mice coflows are processed together with larger coflows in
Aalo. Creek on the other hand is a sub-coflow based system,
and therefore mice coflows can be quickly recognized as soon
as a sub-coflow is completed. This enables Creek to prioritize
mice coflows before their completion and to quickly separate
them from larger coflows, which results in lower CCTs. This
approach takes advantage of the fact that the sub-coflows of a
mice coflow are typically small. For large coflows consisting
of many mice sub-coflows, one of the parents of the mice
sub-coflows can recognize and separate them.
Finally, Aalo performs better than Creek (by ~0.1x)
because it is a centralized scheme with global information
(i.e., Aalo can be more precise in distinguishing coflows with
similar characteristics, which benefits these two categories).
However, Creek compensates for this by achieving superior
performance in all categories compared to the decentralized
schemes.
VI. CONCLUSION
Creek is a decentralized coflow scheduler that aims to
minimize CCT for many-to-many communication patterns
and the communication overhead between receivers. The
results from both testbed and large-scale network simulation
experiments show that Creek's simple but effective
coordination between receivers can improve applications'
performance in datacenters. Creek outperforms decentralized
schemes like Baraat, FS, and Stream, and performs
comparably well to centralized schedulers like Aalo.
References
[1] http://www.pica8.com/documents/pica8-datasheet-picos.pdf
[2] S. Agarwal, et al., "Sincronia: Near-Optimal Network Design for Coflows", ACM SIGCOMM, 2018.
[3] F. Dogar, et al., "Decentralized Task-Aware Scheduling for Data Center Networks", ACM SIGCOMM, 2014.
[4] M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient Coflow Scheduling with Varys", ACM SIGCOMM, 2014.
[5] M. Chowdhury and I. Stoica, "Efficient Coflow Scheduling Without Prior Knowledge", ACM SIGCOMM, 2015.
[6] M. Chowdhury, et al., "Managing Data Transfers in Computer Clusters with Orchestra", ACM SIGCOMM, 2011.
[7] Y. Zhao, et al., "RAPIER: Integrating Routing and Scheduling for Coflow-aware Data Center Networks", IEEE INFOCOM, 2015.
[8] Z. Huang, et al., "Need for Speed: CORA Scheduler for Optimizing Completion Time in the Cloud", IEEE INFOCOM, 2015.
[9] Z. Qiu, et al., "Minimizing the Total Weighted Completion Time of Coflows in Datacenter Networks", ACM SPAA, 2015.
[10] M. Alizadeh, et al., "pFabric: Minimal Near-Optimal Datacenter Transport", ACM SIGCOMM, 2013.
[11] M. Alizadeh, et al., "Data Center TCP (DCTCP)", ACM SIGCOMM, 2010.
[12] H. Zhang, et al., "CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark", ACM SIGCOMM, 2016.
[13] M. Alizadeh, et al., "CONGA: Distributed Congestion-Aware Load Balancing for Datacenters", ACM SIGCOMM, 2014.
[14] A. Greenberg, et al., "VL2: A Scalable and Flexible Data Center Network", ACM SIGCOMM, 2009.
[15] M. Chowdhury and I. Stoica, "Coflow: A Networking Abstraction for Cluster Applications", ACM HotNets, 2012.
[16] A. Munir, et al., "Friends, not Foes: Synthesizing Existing Transport Strategies for Data Center Networks", ACM SIGCOMM, 2014.
[17] T. Benson, A. Akella, and D. A. Maltz, "Network Traffic Characteristics of Data Centers in the Wild", ACM IMC, 2010.
[18] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", USENIX OSDI, 2004.
[19] Z. Liu, et al., "Enabling Work-conserving Bandwidth Guarantees for Multi-tenant Datacenters via Dynamic Tenant-Queue Binding", IEEE INFOCOM, 2018.
[20] M. Zaharia, et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", USENIX NSDI, 2012.
[21] L. Wang and W. Wang, "Fair Coflow Scheduling without Prior Knowledge", IEEE ICDCS, 2018.
[22] R. Chaiken, et al., "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets", VLDB, 2008.
[23] G. Malewicz, et al., "Pregel: A System for Large-Scale Graph Processing", ACM SIGMOD, 2010.
[24] Y. Low, et al., "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud", PVLDB, 2012.
[25] W. Bai, et al., "Information-Agnostic Flow Scheduling for Commodity Data Centers", USENIX NSDI, 2015.
[26] P. Poupart, et al., "Online Flow Size Prediction for Improved Network Routing", IEEE ICNP, 2016.
[27] H. Susanto, H. Jin, and K. Chen, "Stream: Decentralized Inter-Coflow Scheduling for Datacenter Networks", IEEE ICNP, 2016.
[28] N. Bronson, et al., "TAO: Facebook's Distributed Data Store for the Social Graph", USENIX ATC, 2013.
[29] J. Kurose and K. Ross, "Computer Networking: A Top-Down Approach", 6th edition, Pearson, 2013.
[30] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity Data Center Network Architecture", ACM SIGCOMM, 2008.
[31] A. Roy, et al., "Inside the Social Network's (Datacenter) Network", ACM SIGCOMM, 2015.
[32] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, "Introduction to Algorithms", MIT Press, 2001.
[33] Z. Liu, et al., "Managing Recurrent Virtual Updates in Multi-Tenant Datacenters: A System Perspective", IEEE TPDS, 2019.
Many existing data center network (DCN) flow scheduling schemes, that minimize flow completion times (FCT) assume prior knowledge of flows and custom switch functions, making them superior in performance but hard to implement in practice. By contrast, we seek to minimize FCT with no prior knowledge and existing commodity switch hardware. To this end, we present PIAS, a DCN flow scheduling mechanism that aims to minimize FCT by mimicking shortest job first (SJF) on the premise that flow size is not known a priori. At its heart, PIAS leverages multiple priority queues available in existing commodity switches to implement a multiple level feedback queue, in which a PIAS flow is gradually demoted from higher-priority queues to lower-priority queues based on the number of bytes it has sent. As a result, short flows are likely to be finished in the first few high-priority queues and thus be prioritized over long flows in general, which enables PIAS to emulate SJF without knowing flow sizes beforehand. We have implemented a PIAS prototype and evaluated PIAS through both testbed experiments and ns-2 simulations. We show that PIAS is readily deployable with commodity switches and backward compatible with legacy TCP/IP stacks. Our evaluation results show that PIAS significantly outperforms existing information-agnostic schemes, for example, it reduces FCT by up to 50% compared to DCTCP [11] and L2DCT [32]; and it only has a 1.1% performance gap to an ideal information-aware scheme, pFabric [13], for short flows under a production DCN workload.