Creek: Inter Many-to-Many Coflows Scheduling
for Datacenter Networks
Hengky Susanto1, Ahmed M. Abdelmoniem2, Hao Jin3, Brahim Bensaou4
HKUST1,2,3,4, Assiut University2, Texas A&M University3
hsusanto@cs.uml.edu1, amas@cse.ust.hk2, haojin@tamu.edu3 , brahim@cse.ust.hk4
Abstract: Datacenter networked applications often require multiple data transfer flows that semantically constitute a coflow group. A coflow is thus considered completed when all the transfers in the coflow are completed. Hence, application performance is optimized whenever the completion time of a coflow is minimized, rather than that of the individual flows composing it. Currently, popular coflow scheduling algorithms are mostly centralized, and they incur high overheads. The decentralized approach in the "many-to-many" scenario also incurs high communication overheads due to the communication among the local controllers. In this paper, we present a coflow scheduling mechanism that aims to minimize the coflow completion time for coflows that show a many-to-many communication pattern, and as a byproduct the communication overhead cost is also minimized. Our algorithm preserves compatibility with existing commodity switches and network protocols and improves the coflow completion times on average by 1.8 times compared to the baseline, as demonstrated via testbed implementation and large-scale simulation.
1 This work has appeared in the IEEE International Conference on Communications (ICC), 2019. DOI: 10.1109/ICC.2019.8762027.
I. INTRODUCTION
Recently, the term coflow has been coined to provide a
meaningful semantic that translates application performance
requirements in datacenter networks into performance metrics
that can be understood at network level (e.g., in the data
plane). In networking context, a coflow consists of a set of
concurrently active data flows set to complete a specific data
transfer started by the application. Typically, the completion
of data transfer of all flows within the same coflow signifies
the completion of the communication stage for the
application. Applications strive to achieve faster completion
of their communication tasks, which translates into
minimizing coflows' completion times (CCT). However, due
to the simultaneity of the flows in the network, minimizing
greedily the CCT may induce inter-coflow bottlenecks, which
can in turn severely degrade the performance at the
application level.
To address these dependency problems, many recent
proposals put this problem into the form of CCT
minimization. The popular approaches are usually designed in
centralized manner [4,5,6,7,8,9] where a single centralized
scheduler is responsible for scheduling the coflows of the
entire network. However, a high overhead cost is incurred for
maintaining such a centralized system in large datacenters. As
an alternative, various decentralized state-of-the-art solutions
have been proposed. For instance, Baraat [3] requires switch
modifications where the task of scheduling coflows is
performed at the switches. Baraat, however, lacks access to
coflow level information because switches only have access
to information at flow level, which leads to sub-optimal
outcome. In addition, because this decentralized solution
requires elaborate software modifications in the switches, it is
harder to deploy. Stream [27] does not require switch
modification but requires local controllers of the same coflow
to exchange information, which may result in extra
communication overhead cost. Moreover, decentralized
schemes also commonly suffer from sub-optimal outcome
because of the lack of a complete picture of coflow states and
the inability to achieve global coordination between the local
controllers. In this paper we present Creek, a decentralized
inter coflow scheduler for coflows that exhibit a many-to-
many communication pattern, without requiring hardware
modifications while imposing only minimal communication
overheads. Creek is designed to resolve the challenges
encountered in decentralized scheduling systems, while
possessing the key advantages of centralized systems. Creek is
capable of acquiring a more complete picture of coflow states
and accomplishes an approximate global coordination, to
achieve near optimal performance, without the overhead cost
experienced by centralized solutions.
The key to the solution depends on understanding the
communication pattern which provides insights to achieve the
objective of minimizing CCTs effectively. Many-to-one is a
pattern where a single node receives data transfers from many
senders, which together form a single coflow [22,23,24]. Many-to-many
is a pattern where many receivers receive data transfers from
many senders [18,20]. In other words, a single many-to-many
coflow consists of multiple many-to-one coflows; this
many-to-many pattern is the focus of this paper.
To achieve its targets, Creek acquires the necessary
information on coflows at the receiver end. The scheduling policy
is enforced and communicated by leveraging existing network
components (e.g., functionalities that are commonly available
in commodity switches) and the mechanics of existing
transport protocols such as TCP/IP. For inter-coflow
scheduling decisions, Creek invokes the well-known Smallest
Task First (STF) policy. To reduce the communication
overhead incurred by the receivers of a coflow in
communicating with each other, Creek outsources the
information management to a third party, which can be a
designated node that stores coflow information.
In our performance analysis, we evaluate our solution
through actual testbed experiments and large-scale simulation
experiments. In the testbed experiments, we implement Creek
and deploy the prototype in a small datacenter testbed, which
also shows that the solution is friendly to production
deployments. Moreover, the experiments demonstrate that Creek
outperforms the baseline schedulers by at least 1.8 times. In
the large-scale simulation, we evaluate Creek’s performance
by replaying an actual trace of coflow traffic workload
collected from 3000 servers (150 racks) in Facebook
production datacenter [4].
Specifically, the evaluation is performed by using widely
accepted traces from Facebook along with two benchmarks:
TPC-DS [5] query and Facebook’s Tao structure [28]. In our
evaluation, Creek exceeds both Baraat and the traditional per-flow fair sharing scheme by 1.85x on average and achieves comparable performance with the centralized scheme. As for mice coflow CCT, Creek is up to 28x better than per-flow fair sharing and up to 18x better than Baraat. Here, Creek also achieves a similar outcome to centralized systems. Finally, the findings in [4] show that priority-based schemes exhibit diminishing-return behavior, and in this paper, we provide insight into this behavior through theoretical and experimental (testbed and simulation) results.
Our contributions can be summarized as follows:
1. We propose a coflow scheduling scheme for coflows with
many-to-many communication patterns, which minimizes
the communication overhead between receivers.
2. We deploy our solution in our mini datacenter and evaluate
it in a large-scale setting via simulation.
The rest of the paper is organized as follows. We present
previous related work in section II and the system model in
section III. Then, we describe Creek in section IV. Simulation
results are presented in section V, then concluding remarks are
given in section VI.
II. RELATED WORK
One of the early works on this theme is Orchestra [6],
where the semantics among flows are taken into account in the
design of flow transfer optimization in datacenters. By
adopting the smallest-total-size-first scheduling policy,
Sincronia [2], Varys [4], Aalo [5], and NC-DRF [21] improve
the performance compared to [6]. RAPIER [7] extends [4] by
incorporating routing algorithms into the scheduling scheme.
Likewise, CODA [12] also extends the problem in [4] by
integrating machine learning into the coflow scheduling
scheme. In later development, the authors of [8,9] extended
the problem in [4] by taking into account the importance level
of different coflows and reformulated the problem into a
weighted CCTs minimization problem. The aforementioned
schemes fall into the centralized scheduling category that
typically provide near optimal scheduling. However, these
approaches are criticized for incurring very high cost of
centralized management and are generally hard to realize in
practice because they require significant switch modifications
and/or a complex control plane.
On the other hand, as an alternative, there is the
decentralized approach. In this approach, Baraat [3]
dominates as the state-of-the-art decentralized coflow
scheduling system. Baraat relies on various heuristics and is
based on a multiplexed First-In First-Out (FIFO) principle.
In Baraat, whenever large coflows are observed in the
network, mice flows are processed in the background.
Otherwise, mice flows are processed according to the trivial
FIFO scheduling. Even though Baraat proves to be effective, it
has several drawbacks: first, its scheduling decisions are made
locally at the switches, which limits the scheduler's access to
flow-level information only, so the scheduler has incomplete
information about coflow states, which results in sub-optimal
performance; second, Baraat also requires modifications to the
switches, which makes it not deployment-friendly.

Table 1. Terminology
Flow size: the length of a flow.
Coflow size: the sum of all flow sizes in a coflow, in bytes.
Coflow width: the number of parallel flows in a coflow.
Coflow length: the largest (longest) flow in the coflow, in bytes.

Fig. 1. Data shuffle between mappers and reducers in Hadoop [18].
Fig. 2. CDF of (a) coflow size, (b) coflow length, and (c) coflow width from Facebook [4], and (d) coflow size in Microsoft Bing [3].
Stream [27] is another recently proposed decentralized
scheduler that opportunistically chooses the receiver in many-
to-one and many-to-many communication patterns. However,
since Stream requires its receivers of a same coflow to
communicate with each other for coordination, it incurs a high
communication overhead cost. In our work, we adopt a
different approach where we solve the general coflow
scheduling problem in a decentralized manner for many-to-many
patterns, without requiring hardware modifications and with
minimal communication overhead.
III. SYSTEM MODEL
In this section, we discuss the coflow abstraction,
describe the characteristics of coflows as observed in
production datacenter environments, and then introduce the
network model used in the study.
Coflow Abstraction. A coflow state is generally
characterized by three parameters, the number of its
concurrent flows (usually called the width), the total number
of bytes transferred (referred to as the size), and the longest flow
in bytes (called the coflow length). For example, a coflow
state can be known by tracking the number of completed
flows of the coflow, the number of bytes transferred/received
of the coflow, and so on.
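To make these parameters concrete, the following is a minimal C++ sketch (our illustration, not code from the paper) of the per-coflow state a receiver could track; all names are hypothetical.

#include <cstdint>
#include <string>

// Illustrative per-coflow state tracked at a receiver (hypothetical layout).
struct CoflowState {
  std::string coflow_id;        // globally unique coflow identifier
  uint32_t width = 0;           // number of concurrent flows (coflow width)
  uint64_t size_bytes = 0;      // total bytes over all flows (coflow size)
  uint64_t length_bytes = 0;    // bytes of the longest flow (coflow length)
  uint32_t completed_flows = 0; // flows of the coflow that have finished
  uint64_t bytes_received = 0;  // bytes received so far across all flows
};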
Coflows in Production. In [4], it is observed that coflow sizes
in production environments (a Facebook datacenter) follow a
heavy-tailed distribution. More precisely, large coflows, of
at least 10 Gb and the ones of at least 1 Gb amount to only 8%
and 15% respectively of all coflows, in spite of being
respectively responsible for 98% and 99.6% of the total traffic
in the datacenter. This implies that most coflows in the
datacenter are small in size and contribute the least bytes to
the network. This is illustrated by Fig. 2a, 2b, and 2c. The
same findings are observed in [3, 6] from Microsoft’s
datacenter, as illustrated by Fig. 2d. In [14] data-mining
application traffic is studied, and here also the distribution of
flow sizes is found to be heavily tailed, with 95% of all data
bytes coming from flows larger than 35MB, which make for
only 3.6% of all flows. This confirms that data-mining
applications also generate mostly small-sized flows, while most of the
traffic in the network comes from the few large-sized flows.
Network Model. In this work, we consider a Tree-based
topology [3, 7, 8, 10,19,25]. We conduct the experiments in
a testbed and via NS-3 simulation using the FatTree topology
[30]. From the experiments, we find that the processing and
queuing times are significant in the aggregation and core
switches which agrees with the findings in [10,11,25].
Moreover, we find that the bottleneck has shifted from ToR
switches to become more evenly distributed among different
layers. This is due to the high speed NICs matching the
speeds of core switch ports.
IV. SCHEDULING SCHEME
A. Problem Formulation
The problem for the offline case of coflow scheduling can
be formulated as follows: there are n coflows, numbered 1, 2,
..., n, in the system. The CCT minimization problem can then be
written as:

  \min \sum_{k=1}^{n} T_k                                          (1)
  \text{s.t.} \quad \sum_{f \in l} r_f \le C_l, \quad \forall l \in L,      (1.a)
  \quad\quad\;\; r_f(t) > 0, \quad \forall f \in k, \ \text{while}\ t \le T_k,   (1.b)

where T_k refers to the completion time of coflow k, defined as
T_k = \max\{T_f : \forall f \in k\}, and T_f refers to the completion time
of a flow f in k. That is, T_k is simply the completion time of
the slowest flow in coflow k. Constraint (1.a) expresses the link
capacity constraints, i.e., it ensures that the aggregate rate of the flows on
link l does not exceed the link capacity C_l. Constraint (1.b)
ensures that starvation and packet out-of-order problems are
eliminated, since every flow of an unfinished coflow keeps receiving a
non-zero rate. Observe that the above formulation for CCT
minimization is an NP-Hard problem [3,4] as it is reducible to
the well-known Open Shop Scheduling Problem [4].
B. Decentralized Coflow Scheduling Mechanism
Prior works focused on the many-to-one scenario with
the assumption that coflow size is unknown a priori. In this
paper, however, we address the more difficult many-to-many
scenarios.
Generally, Creek uses the STF scheduler to reduce the
CCTs by simply giving a high priority to smaller coflows
over larger ones. The problem is that coflow size is unknown
a priori, as prior size measurement is not possible. So, we can
dynamically compare the coflow size to a threshold \delta at the
receiver's end. Then, if the coflow size exceeds \delta, the
coflow is demoted. Also, initially we assign all new coflows
to the highest priority and then dynamically downgrade their
priority based on the number of bytes received. The receiver
then updates the workers with the new priority values by
piggybacking this priority on the ACK packets.
Creek also takes into account the coflow condition when
assigning the coflow to a priority group (e.g., the number of
completed flows). It also ensures compatibility with
commodity switches, by performing the scheduling at the
receiver side as the information on the coflow and its flows
is readily available at the receiver side.
Creek enforces STF scheduling policy by relying on the
multiple priority queues commonly found in most
commodity switches, to realize a multi-level feedback queue
(MLFQ). As pointed out in [5], MLFQ may result in the
starvation of some flows and Weighted Fair Queuing (WFQ)
may provide a better solution. In Creek, MLFQ is adopted
because priority queues provide better in-network
prioritization and potentially achieves lower CCT. Moreover,
WFQ may introduce the out-of-order problem for TCP flows.
Having said that, later we propose an algorithm that ensures
starvation free operation for Creek.
Coflow Priority Decision. Consider a commodity switch
with Q priority queues [1]. Given a coflow k, the notation p_f^i
denotes that the i-th priority queue is assigned to flow f \in k,
with 1 \le i \le Q, where p_f^1 is the highest priority and p_f^Q is
the lowest. Each priority level i is associated with a threshold
\delta_i. Note that most existing commodity switches only support a
maximum of 8 priority queues [1]. Let p_f denote the priority
assigned to flow f, such that p_f = p_f^i for some i. Initially, all
flows f are assigned to p_f^1, i.e., \forall f \in k, p_f = p_f^1. Thereafter,
given the number of bytes sent by flow f (its observed flow size),
the priority p_f evolves as described below.
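As an illustration of this demotion rule, the sketch below (our own, under the assumption of 8 queues and placeholder threshold values) maps the bytes received so far for a coflow to a priority level; the actual thresholds in Creek are chosen empirically, as discussed later.

#include <array>
#include <cstdint>

// Illustrative demotion rule: map bytes received so far to one of Q = 8
// priority levels (1 = highest). The threshold values below are placeholders.
constexpr int kNumQueues = 8;
constexpr std::array<uint64_t, kNumQueues - 1> kDelta = {
    100ULL << 10, 1ULL << 20, 10ULL << 20, 50ULL << 20,
    100ULL << 20, 1ULL << 30, 10ULL << 30};  // delta_1 .. delta_7, in bytes

int PriorityFor(uint64_t bytes_received) {
  int level = 1;  // every new coflow starts at the highest priority
  for (uint64_t delta : kDelta) {
    if (bytes_received <= delta) break;
    ++level;  // the coflow has outgrown this level's threshold, so demote it
  }
  return level;  // 1 (highest) .. kNumQueues (lowest)
}

The receiver would then piggyback the returned level on outgoing ACKs so that workers tag subsequent packets of the coflow accordingly.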
Coflow management. In coflows that create many-to-many
communication patterns, a coflow typically consists of
many sub-coflows. In such a case, there are many
receivers in a single coflow. Hence, sub-coflows of the same
coflow are considered as a single entity, and the completion
of the coflow relies on the completion of all of its sub-coflows.
Among the many scheduling challenges with this
pattern in decentralized settings are how to keep track of the
relationship among sub-coflows of the same coflow, how to decide
appropriate priority values when coflow information is
sparse, and the fact that a sub-coflow may not know about some of the
other sub-coflows. To address these challenges, Creek
utilizes shared-storage to allow sub-coflows of the same
coflow to easily exchange the necessary status information
with each other. In other words, the receivers of the same
coflow will share and access the same data storage.
A task manager allocates a small amount of space at a
designated storage space in a server to every new coflow.
Thus, all receivers of this coflow use this storage to provide
information, such as updates and queries on the total bytes
that have been sent. Hence, the number of communications
within a coflow can be reduced from O(n^2) down to O(n),
where n is the number of receivers of a coflow. However,
one of the practical challenges in doing this is how to
synchronize the receivers of a coflow, such that the
information can be updated appropriately without running
into a race condition. That is, there are multiple receivers
sharing a common buffer but only one of them can update the
information at any given time. This problem can be solved
using a locking mechanism, such as a mutex, that allows only a
single receiver to update and modify the information in the
shared storage space at a time. We also utilize a mutex
(semaphore-based) locking mechanism to resolve the race condition
between receivers of a coflow in our testbed implementation.
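As an illustration of this coordination, the following is a minimal sketch (assumed structure, not the paper's implementation) of a mutex-protected shared record that receivers of the same coflow could update and query.

#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Illustrative shared store: one byte counter per coflow, updated by all of
// the coflow's receivers. A mutex serializes updates to avoid race conditions.
class SharedCoflowStore {
 public:
  // A receiver reports additional bytes received for its sub-coflow and gets
  // back the coflow-wide total, which can drive the priority decision.
  uint64_t AddBytes(const std::string& coflow_id, uint64_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    return total_bytes_[coflow_id] += bytes;
  }

  uint64_t TotalBytes(const std::string& coflow_id) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = total_bytes_.find(coflow_id);
    return it == total_bytes_.end() ? 0 : it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::string, uint64_t> total_bytes_;  // coflow ID -> bytes sent
};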
Starvation Mitigation. To resolve the starvation problem,
when the wait exceeds a waiting threshold, the worker of the
starving flow retransmits packets that have not been
acknowledged, using a higher priority assignment. Duplicate
packets, if any, are dropped at the receiver by TCP [29]. By
doing so, the solution also avoids the TCP packet out-of-order
problem. The process is repeated until the flow escapes
starvation. Upon receiving a packet from the starving flow, the
receiver compares the priority of the recently sent packet with
the priority currently assigned to the starving flow. If they do
not match, the receiver increases that coflow's priority and
notifies the worker of the starving flow of the new priority
through the ACK packet. ECN could help in mitigating
starvation, but it may accidentally mark packets from mice
coflows, because ECN is not designed to be coflow-aware.
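A minimal sketch of one possible sender-side realization of this check (our assumption, with hypothetical names): if a flow has made no progress for longer than a waiting threshold, its unacknowledged data is retransmitted at the next higher priority.

#include <chrono>

// Illustrative starvation check at the worker. The waiting threshold is
// assumed to be set below the TCP RTO, per the rule of thumb given below.
struct FlowProgress {
  std::chrono::steady_clock::time_point last_ack_advance;  // last ACK progress
  int priority;                                            // 1 (highest) .. Q
};

bool ShouldEscalate(const FlowProgress& flow,
                    std::chrono::milliseconds waiting_threshold) {
  auto waited = std::chrono::steady_clock::now() - flow.last_ack_advance;
  return waited > waiting_threshold && flow.priority > 1;
}
// If true, the worker retransmits the unacknowledged segments tagged with
// priority (flow.priority - 1); the receiver drops duplicates and confirms
// the new priority through the next ACK.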
Setting the threshold. The value of the threshold is important in
determining system performance. If the threshold is too
small or too large, packets of short flows may experience long
queuing delays behind elephant flows. Although thresholds are
commonly used in system design [3,4,10,25,28], there is very
little study on how a threshold should be set such that the
system achieves optimality.
From our experiments, we derive two observations: (i)
thresholds should be able to quickly direct traffic into the
appropriate queue; (ii) to mitigate starvation, the wait at the
lowest priority should not exceed the TCP retransmission
timeout (RTO) [29]. Using these two rules of thumb, our
threshold leads to very minimal starvation in our testbed
experiments. At this point, however, the threshold is decided
by using exhaustive search, which may imply a higher
overhead cost for larger systems. We will further investigate
setting of thresholds using machine learning techniques as
proposed in [26] in our future work.
Data structure. One challenge in implementing Creek is
keeping track of the number of bytes sent by a large
number of coflows. In practice, multiple coflows arrive and
complete their tasks continuously. Thus, information on coflows must be
added to or removed from the data structure when coflows start and
complete, respectively. For this reason, the data structure must
be adaptive to the dynamics of this start-complete cycle while at
the same time keeping the computation cost low (e.g., for lookup
operations).
In our testbed implementation, we use two dynamic arrays
available in the C++ standard library (i.e., vector) to track the
bytes sent by coflows and sub-coflows. We assume that the coflow ID is
globally unique. Then, Creek uses these IDs as the coflow index, and
the information is inserted such that the IDs are kept sorted in
increasing order, which takes linear time using existing techniques.
Since the structure changes dynamically due to the start-complete
cycle, straightforward hashing is not suitable for the lookup
operation. To resolve this, Creek uses binary search [32],
which takes O(log n) time, with n being the array size. This is
possible because the array is sorted.
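The following is a minimal sketch (illustrative; the layout and names are ours) of this sorted-array bookkeeping, using ordered insertion for new coflows and binary search for lookups.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative tracker: coflow IDs are kept sorted so that lookups take
// O(log n) via binary search; a parallel array holds the bytes sent.
class CoflowTracker {
 public:
  void OnCoflowStart(uint64_t coflow_id) {
    auto it = std::lower_bound(ids_.begin(), ids_.end(), coflow_id);
    bytes_.insert(bytes_.begin() + (it - ids_.begin()), 0);
    ids_.insert(it, coflow_id);  // keeps ids_ sorted in increasing order
  }

  void AddBytes(uint64_t coflow_id, uint64_t bytes) {
    auto it = std::lower_bound(ids_.begin(), ids_.end(), coflow_id);
    if (it != ids_.end() && *it == coflow_id)
      bytes_[it - ids_.begin()] += bytes;
  }

  void OnCoflowComplete(uint64_t coflow_id) {
    auto it = std::lower_bound(ids_.begin(), ids_.end(), coflow_id);
    if (it != ids_.end() && *it == coflow_id) {
      bytes_.erase(bytes_.begin() + (it - ids_.begin()));
      ids_.erase(it);
    }
  }

 private:
  std::vector<uint64_t> ids_;    // sorted coflow IDs
  std::vector<uint64_t> bytes_;  // bytes sent, aligned with ids_
};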
Without careful coordination between insertion, deletion,
update, and information retrieval operations, the system may
end up in a race condition where different threads compete to
modify the same information or data structure. This can result
in inaccurate information updates. For example, two threads
performing simultaneous updates at the same location in
memory may cause the new information to reflect only one of
the updates instead of both.
We mitigate this issue by utilizing strict-priority-queue
non-preemptive scheduling (SPQ-NS). Here, an operation
(e.g., delete, update, insert, or read) cannot be interrupted
while it is being performed, even when there is an operation
waiting in a higher-priority queue. Operations in each
queue of SPQ-NS are performed in first-in-first-out order, and
the coordination between operations is done using a mutex.
Here, information update and insert operations are assigned
the highest priority, information retrieval is assigned a
lower priority, and the delete operation is assigned the lowest
priority. Moreover, deletion is only performed once per time
interval (e.g., every second). By doing this we prevent
race conditions.
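A minimal sketch (our interpretation, with hypothetical names) of the SPQ-NS idea: operations are enqueued into strict-priority FIFO queues and drained one at a time, so that an operation always runs to completion before the next one, drawn from the highest non-empty queue, is started.

#include <functional>
#include <mutex>
#include <queue>

// Illustrative SPQ-NS executor: three FIFO queues (update/insert, read,
// delete, from highest to lowest priority) drained non-preemptively.
class SpqNsExecutor {
 public:
  enum Priority { kUpdateInsert = 0, kRead = 1, kDelete = 2 };  // 0 = highest

  void Submit(Priority p, std::function<void()> op) {
    std::lock_guard<std::mutex> lock(mu_);
    queues_[p].push(std::move(op));
  }

  // Picks the highest-priority pending operation and runs it to completion.
  bool RunOne() {
    std::function<void()> op;
    {
      std::lock_guard<std::mutex> lock(mu_);
      for (auto& q : queues_) {
        if (!q.empty()) {
          op = std::move(q.front());
          q.pop();
          break;
        }
      }
    }
    if (!op) return false;
    op();  // non-preemptive: never interrupted by a higher-priority arrival
    return true;
  }

 private:
  std::mutex mu_;
  std::queue<std::function<void()>> queues_[3];
};

Under this discipline, deletions can additionally be batched (e.g., run once per second) by submitting the delete operations on a timer, as described above.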
V. EVALUATION
The performance of the proposed scheduling scheme is
evaluated via a number of experiments in a 10Gbps testbed
as well as via large-scale network simulation using NS3 with
Facebook traces from [4,5]. The main metrics used for
evaluation are the average CCT and the performance
improvement ratio, which is calculated as the ratio of the
target scheme’s CCT to the CCT achieved by Creek. So, if
the improvement x is greater (smaller) than one, then Creek
is faster (slower) than the target scheme by x times.
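For concreteness, the improvement ratio used throughout this section can be written as

  \text{improvement} = \mathrm{CCT}_{\text{target scheme}} / \mathrm{CCT}_{\text{Creek}},

so, for example, an average CCT of 14.9 ms for the target scheme against 8.1 ms for Creek (the numbers of the first testbed scenario below) corresponds to an improvement of about 1.8x.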
The main findings are summarized below:
1. In the testbed experiments, Creek significantly reduces the average coflow CCTs relative to TCP by up to 1.8x and the average mice-flow FCTs by up to 1.833x.
2. In the simulation experiments, Creek outperforms decentralized approaches such as Baraat, Per-Flow Fair Sharing (FS), and Stream by up to 1.82x, while achieving comparable outcomes to Aalo's.
A. Testbed Experiment
Prototype: Creek prototype is built on top of the existing
TCP implementation and synthesized as a loadable kernel
module in Linux. Then, we implement a client-server
model to emulate multiple workers and receivers by utilizing
socket programming at the application level. In this model,
packets are transmitted from clients acting as workers to
server acting as receivers. Our prototype randomly generates
216 and 432 TCP flows with different sizes according to a
heavy tailed distribution; then, these flows are randomly
clustered into 20 and 30 coflows, respectively, with each
coflow having 2 and 3 receivers. In this experiment, the TCP
kernel module is modified so that the coflow ID can be
inserted into the IP options field of the packet header [29].
Moreover, we used local memory to store coflow
information, such as the total bytes sent.

Fig. 3. Testbed experiment. Scenario 1: there are 30 coflows, each coflow has 3 receivers, and each receiver serves 5 flows. Scenario 2: there are 20 coflows, each coflow has 2 receivers, and each receiver serves 2 to 3 flows.
Testbed: Fig. 3 shows the testbed used in the evaluation. It
consists of 12 datacenter-scale servers connected together via
a ToR 48-port 1 Gigabit Ethernet switch (Pica8 P-3297) and
a control-plane 4-port 10 Gigabit Ethernet switch. The ToR
switch supports strict priority queuing with at most 8 classes
of services queue [1]. Each server is a HUAWEI RH1288 V2
with 24-core Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz,
64G memory, a 2T hard disk, and Broadcom BCM5719
NetXtreme Gigabit Ethernet NIC. Each server runs Ubuntu
14.04.2 LTS with Linux 4.0 kernel. In the ToR switch, strict
priority queues are enforced, and packets are classified based
on the DSCP field [1,29].
Experiment: To evaluate Creek, we create two experimental
scenarios. In the experiments, 10 machines are running the
client application sending data to an 11th machine running the
server application. In the first one, the experiment is
conducted by starting 432 TCP flows which are classified
into 30 coflows. In the second one, 216 TCP flows are
initiated to make up 20 coflows. In both scenarios, to
reflect a more realistic environment, the 12th server is used to
generate background traffic using iperf, which is a popular
Linux traffic generator, at the speed of 500 Mbps (which is
the equivalent of 50% of the link capacity). This is a common
traffic pattern seen in many datacenters [11]. In both
scenarios, we compare the CCTs of our scheduling scheme to
the CCTs of regular TCP [29]. This set of experiments
is conducted using 8 priority queues. Later in this section, we
conduct another experiment to measure the performance of
using different number of priority queues. One of the
challenges in performing testbed experiments is to generate
sufficient traffic load to reproduce bursty traffic patterns
without causing Denial of Service (DoS) [29]. In our testbed,
traffic with 435 connections or larger causes Denial of
Service.
The testbed results, as shown in Fig. 3(a), show that,
compared to TCP, Creek improves the average performance
by 1.8x and 1.533x in the first and second scenario,
respectively. Specifically, the average CCT with TCP is
14.9 milliseconds in the first scenario (30 coflows) and 9.73
milliseconds in the second scenario (20 coflows); on the other
hand, the average CCT under our scheduling methodology is
8.1 and 6.47 milliseconds, respectively. Similarly, Fig. 3(b)
depicts that our coflow scheduling also improves the average
performance of mice flows by 1.8x and 1.7x with 20 and 30
coflows, respectively. This experiment shows that the
proposed scheme performs better than TCP, especially in
networks with higher traffic load.

Fig. 4. Large-scale experiments using (a) TPC-DS and (b) FB-Tao benchmarks.

Table 2. Five categories of coflows with different sizes in the many-to-many pattern (size B).
Group I: 6 MB - 1 GB
Group II: 1 GB - 10 GB
Group III: 10 GB - 100 GB
Group IV: 100 GB - 1 TB
Group V: >1 TB
B. Large Scale Simulation Experiments
To evaluate our proposed scheduling scheme in a large-scale
network, we develop a flow-level simulator that takes
into account coflow arrival and departure events at the flow
level. It updates the rate and remaining volume of each flow
when the event occurs. We model a data center with 3465
hosts and 720 switches of 10 Gigabit (10G) link speed in Fat-
tree topology [30] of size k=24.
In the simulation experiments, Creek’s performance is
compared to the baseline Per-Flow-Fair-Sharing, Baraat [3],
Stream [27], and Aalo [5]. Per-Flow-Fair-Sharing (PFS)
mechanism is a scheduling scheme that divides the resource
capacity equally among flows traversing the same link, which
is also the baseline in our analysis. Baraat is a First-In-First-Out
(FIFO) scheme with limited multiplexing. Stream is also a
decentralized scheduling scheme, which opportunistically
leverages coflow communication pattern.
Realistic traffic pattern and load. Creek is evaluated using
real traffic patterns and traffic load by replaying 526 coflows
from actual production traffic traces from 3000 servers in
Facebook production datacenter [4,5], which capture a one-
hour Hive/MapReduce trace. In our simulation, Equal-cost
multi-path routing (ECMP) [29], which is used in datacenters
to route and load balance network traffic, is also used.
Moreover, TCP is the dominant transport protocol in
datacenters; hence, we implement rate limiters that act like
TCP for all the schemes, except for Baraat, whose rate limiter
follows its own design [3].
Traffic Pattern. To run the simulations, Cloudera’s
Industrial benchmark is used. Specifically, the TPC-DS
query-42 (TPC-DS) [4] and Facebook Tao (FB-Tao) [28,31]
traces are used to create the many-to-many scenario (because
the Facebook trace only consists of coflows with many-to-one patterns).
We use these benchmarks and insights from [3,4,23,24,31] to
synthesize, from the original trace, a realistic trace with
many-to-many patterns. The coflow sizes for the many-to-many
pattern are shown in Table 2.
Scenario 1: TPC-DS benchmark. Fig. 4(a) shows that
Creek is at least 1.82x better than Baraat and FS, and it
shows similar performance to the centralized scheme Aalo.
Creek also outperforms Baraat, FS, and Aalo in Group I by
almost 1.8x, 1.6x, and 1.2x, respectively. All in all,
compared to Baraat and FS, Creek is at least 1.83x better,
while Creek's and Aalo's performance are comparable.
Scenario 2: FB-Tao benchmark. Fig. 4(b) shows that, on
average, Creek exceeds Baraat, FS, and Stream by 1.6x,
1.2x, and 1.1x, respectively. For small coflows, Creek
is within 1% of Aalo. In conclusion, Creek is better than
both Baraat and FS by at least 1.2x across all groups. This
is because Creek can achieve similar performance to Stream
but without its communication overhead. Creek also has
comparable performance to Aalo across the various groups.
Creek’s ability to quickly differentiate coflows according
to their states with information at the sub-coflow level allows
it to achieve better results compared to Baraat and FS. This
allows Creek to quickly divert coflows and allocate
appropriate resources earlier, which avoids delay. In contrast,
Baraat and FS suffer from longer delays. Moreover, by
outsourcing the information management to a third party,
Creek achieves slightly better performance compared to
Stream, but with significantly less communication
overhead (i.e., O(n) instead of O(n^2)).
On average, Creek's overall performance is comparable to
that of the centralized scheme Aalo. This is because Aalo only realizes
a coflow is a mice coflow when it is completed; this means
mice coflows are processed together with larger coflows in
Aalo. Creek on the other hand is a sub-coflow based system,
and therefore mice coflows can be quickly recognized as soon
as a sub-coflow is completed. This enables Creek to prioritize
mice coflows before their completion and to quickly separate
them from larger coflows, which results in lower CCTs. This
approach takes advantage of the fact that the sub-coflows of a
mice coflow are typically small. For large coflows consisting
of many mice sub-coflows, one of the parents of the mice
sub-coflows can recognize and separate them.
Finally, Aalo performs better than Creek (by ~0.1x)
because it is a centralized scheme with global information
(i.e., Aalo can be more precise in distinguishing coflows with
similar characteristics, which benefits these two categories).
However, Creek compensates for this by achieving superior
performance in all categories compared to the decentralized
schemes.
VI. CONCLUSION
Creek is a decentralized coflow scheduler that aims to
minimize CCT for many-to-many communication patterns
and the communication overhead between receivers. The
results from both testbed and large-scale network simulation
experiments show that Creek's simple but effective
coordination between receivers can improve applications'
performance in datacenters. Creek outperforms decentralized
schemes like Baraat, FS, and Stream, and performs
comparably well to centralized schedulers like Aalo.
References
[1] http://www.pica8.com/documents/pica8-datasheet-picos.pdf
[2] S. Agarwal, et al., "Sincronia: Near-Optimal Network Design for Coflows", ACM SIGCOMM, 2018.
[3] F. Dogar, et al., "Decentralized Task-Aware Scheduling for Data Center Networks", ACM SIGCOMM, 2014.
[4] M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient Coflow Scheduling with Varys", ACM SIGCOMM, 2014.
[5] M. Chowdhury and I. Stoica, "Efficient Coflow Scheduling Without Prior Knowledge", ACM SIGCOMM, 2015.
[6] M. Chowdhury, et al., "Managing Data Transfers in Computer Clusters with Orchestra", ACM SIGCOMM, 2011.
[7] Y. Zhao, et al., "RAPIER: Integrating Routing and Scheduling for Coflow-aware Data Center Networks", IEEE INFOCOM, 2015.
[8] Z. Huang, et al., "Need for Speed: CORA Scheduler for Optimizing Completion Time in the Cloud", IEEE INFOCOM, 2015.
[9] Z. Qiu, et al., "Minimizing the Total Weighted Completion Time of Coflows in Datacenter Networks", ACM SPAA, 2015.
[10] M. Alizadeh, et al., "pFabric: Minimal Near-Optimal Datacenter Transport", ACM SIGCOMM, 2013.
[11] M. Alizadeh, et al., "Data Center TCP (DCTCP)", ACM SIGCOMM, 2010.
[12] H. Zhang, et al., "CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark", ACM SIGCOMM, 2016.
[13] M. Alizadeh, et al., "CONGA: Distributed Congestion-Aware Load Balancing for Datacenters", ACM SIGCOMM, 2014.
[14] A. Greenberg, et al., "VL2: A Scalable and Flexible Data Center Network", ACM SIGCOMM, 2009.
[15] M. Chowdhury and I. Stoica, "Coflow: A Networking Abstraction for Cluster Applications", ACM HotNets, 2012.
[16] A. Munir, et al., "Friends, not Foes: Synthesizing Existing Transport Strategies for Data Center Networks", ACM SIGCOMM, 2014.
[17] T. Benson, A. Akella, and D. A. Maltz, "Network Traffic Characteristics of Data Centers in the Wild", ACM IMC, 2010.
[18] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", USENIX OSDI, 2004.
[19] Z. Liu, et al., "Enabling Work-conserving Bandwidth Guarantees for Multi-tenant Datacenters via Dynamic Tenant-Queue Binding", IEEE INFOCOM, 2018.
[20] M. Zaharia, et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", USENIX NSDI, 2012.
[21] L. Wang and W. Wang, "Fair Coflow Scheduling without Prior Knowledge", IEEE ICDCS, 2018.
[22] R. Chaiken, et al., "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets", VLDB, 2008.
[23] G. Malewicz, et al., "Pregel: A System for Large-Scale Graph Processing", ACM SIGMOD, 2010.
[24] Y. Low, et al., "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud", PVLDB, 2012.
[25] W. Bai, et al., "Information-Agnostic Flow Scheduling for Commodity Data Centers", USENIX NSDI, 2015.
[26] P. Poupart, et al., "Online Flow Size Prediction for Improved Network Routing", IEEE ICNP, 2016.
[27] H. Susanto, H. Jin, and K. Chen, "Stream: Decentralized Inter-Coflow Scheduling for Datacenter Networks", IEEE ICNP, 2016.
[28] N. Bronson, et al., "TAO: Facebook's Distributed Data Store for the Social Graph", USENIX ATC, 2013.
[29] J. Kurose and K. Ross, "Computer Networking: A Top-Down Approach", 6th edition, Pearson, 2013.
[30] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity Data Center Network Architecture", ACM SIGCOMM, 2008.
[31] A. Roy, et al., "Inside the Social Network's (Datacenter) Network", ACM SIGCOMM, 2015.
[32] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, "Introduction to Algorithms", MIT Press, 2001.
[33] Z. Liu, et al., "Managing Recurrent Virtual Updates in Multi-Tenant Datacenters: A System Perspective", IEEE TPDS, 2019.
Many existing data center network (DCN) flow scheduling schemes, that minimize flow completion times (FCT) assume prior knowledge of flows and custom switch functions, making them superior in performance but hard to implement in practice. By contrast, we seek to minimize FCT with no prior knowledge and existing commodity switch hardware. To this end, we present PIAS, a DCN flow scheduling mechanism that aims to minimize FCT by mimicking shortest job first (SJF) on the premise that flow size is not known a priori. At its heart, PIAS leverages multiple priority queues available in existing commodity switches to implement a multiple level feedback queue, in which a PIAS flow is gradually demoted from higher-priority queues to lower-priority queues based on the number of bytes it has sent. As a result, short flows are likely to be finished in the first few high-priority queues and thus be prioritized over long flows in general, which enables PIAS to emulate SJF without knowing flow sizes beforehand. We have implemented a PIAS prototype and evaluated PIAS through both testbed experiments and ns-2 simulations. We show that PIAS is readily deployable with commodity switches and backward compatible with legacy TCP/IP stacks. Our evaluation results show that PIAS significantly outperforms existing information-agnostic schemes, for example, it reduces FCT by up to 50% compared to DCTCP [11] and L2DCT [32]; and it only has a 1.1% performance gap to an ideal information-aware scheme, pFabric [13], for short flows under a production DCN workload.