
DCQUIC: Flexible and Reliable Software-defined Data Center Transport

Lizhuang Tan, Wei Su, Yanwen Liu
NGIID
Beijing Jiaotong University
{lzhtan, wsu, 19120081}@bjtu.edu.cn
Xiaochuan Gao
China Unicom
gaoxc50@chinaunicom.cn
Wei Zhang
Shandong Computer Science Center
National Supercomputer Center in Ji’nan
wzhang@sdas.org
Abstract—Numerous innovations based on data center TCP have supported the rapid development of data centers. However, with changes in topology, scale and traffic patterns, data centers now require transport protocols that are more agile and more reliable. Improving transport performance by patching TCP seems to be hitting a bottleneck. We explored the possibility of applying QUIC inside datacenter networking. This paper proposes a new data center transport scheme based on QUIC, called data center QUIC (DCQUIC). In particular, we propose a proactive connection migration mechanism suitable for datacenter networking. Combining the efficiency of UDP with the reliability of TCP, DCQUIC exhibits exciting performance and scalability, and may become a potential transport technology supporting the development and innovation of data centers in the future.
1. Introduction
In the past decade, TCP has supported the rapid development of data centers [1]. Numerous network innovations
based on TCP continue to emerge, covering transmission
architecture [2], congestion control [3], [4], [5], streaming
task scheduling [6], [7], TCP acceleration [8], [9], [10], and
load balancing [11], [12]. These efforts have significantly
improved the efficiency of datacenter networking.
However, with the evolution of datacenter networking topology, scale, and applications [13], further efficiency improvements have hit a bottleneck [14]. Datacenter networking innovation faces many challenges, including but not limited to:
1) Have we really discovered the characteristics and changing patterns of datacenter networking traffic? Are these phenomena merely appearance, or substance [15]?
2) Do our innovative schemes follow the principle that "great truths are always simple"? Are they really easy to deploy?
3) Can our solutions support the long-term evolution of datacenter networking in actual deployment, rather than adding to its burden?
In this paper, we analyze the feasibility of applying the Quick UDP Internet Connections (QUIC) protocol [16] inside datacenter networking, which we call data center QUIC (DCQUIC). We tested the performance of DCQUIC in real datacenter networks, and the results are encouraging. There are three test items: the first is a performance comparison of DCTCP and DCQUIC; the second is differentiated packet size transport for long and short flows; the third is connection migration to improve transport reliability. We open-sourced the first version of DCQUIC [17], which is implemented in C and can be used directly by data center applications.
The experimental results show that DCQUIC has advantages in connection establishment speed, flexibility and reliability. Specifically:
1) Compared with DCTCP, the transport efficiency of DCQUIC increases by 49.59%-73.02%.
2) By using a large packet size for short flows and a small packet size for long flows, differentiated packet size transport in DCQUIC reduces the completion time by 9.17%-63.2%.
3) Connection migration effectively improves transport reliability, even in the case of network failures or virtual machine migration. Compared with the existing application-layer switching mechanism, connection migration reduces the task completion time by 67.92%.
2. Existing Work
Numerous TCP innovations have emerged for datacenter networking, such as flow classification, flow-based load balancing, flow priority-based scheduling, etc. Table 1 classifies and summarizes some typical works. Note that in the UDP-based DCQUIC protocol, some of these innovations [18], [19] can be retained, while others [1], [20], [21], [22], [23] need to be redesigned.
At present, most application-level service frameworks rely on TCP, such as Spring Boot/Spring Cloud, Google gRPC, Thrift, Finagle and Dubbo/Dubbox. These frameworks are responsible for service discovery, load balancing, fault tolerance, network transmission, serialization and other functions that support numerous data center services. Application developers can ignore the specific network communication process when calling services, which is taken over
TABLE 1. TYPICAL TCP-BASED INNOVATIONS OF DATA CENTER TRANSMISSION.
Transport mechanism              Typical work
Congestion avoidance             DCTCP [1], D2TCP [20], ...
Flow control                     PIAS [21], HyFabric [23], ...
Bandwidth allocation             D3 [18], FCTcon [19], ...
Flow classification              ElasticSketch [22], Memento [24], ...
Flow priority-based scheduling   PIAS [21], HyFabric [23], ...
Flow-based load balancing        Clove [25], IntFlow [26], ...
by remote communication protocols such as RMI, Socket,
SOAP (HTTP XML) and REST (HTTP JSON). Most of
these remote communication protocols are based on TCP.
The benefits of TCP as the data center transport layer protocol are:
1) Friendly to development. Some businesses can
directly use existing mature development models
and communication frameworks, thereby reducing
the workload of developers.
2) Mature congestion control and other mechanisms. Some congestion control, flow control, and
reliability mechanisms that have performed well in
other areas can continue to be used in data centers,
and they still seem to perform well.
Although these TCP-based frameworks and mechanisms are popular, and they try hard to reduce network transmission costs through long-lived connections, multiplexing, etc., tens of thousands of TCP connections still cause huge overheads in computing, storage and networking. Moreover, the number of TCP connections per second that the OS can process is limited, which becomes a TCP performance bottleneck. Issues that still have room for improvement include:
1) The overhead is still not small enough. In datacenter networking, is it really necessary to sacrifice the efficiency of connection establishment and termination in exchange for WAN-style reliability? TCP Fast Open (TFO) [27] can save one RTT by exchanging data during the TCP handshake, but it still cannot achieve 0-RTT. Long-lived connections reduce response time and network congestion, but may harm the overall performance (computing resources and concurrency) of the server.
2) Congestion control is too conservative. In datacenter networking, do we really need to probe link bottlenecks step by step? When a micro-burst [28] occurs, do we really know the limit of rapid recovery? The TCP-based transport layer makes it inconvenient to verify these questions.
3) The development and deployment of new features is slow. Changing one small module affects the whole system: the intertwined protocol dependencies make it very difficult to modify even a tiny component, which is why data center owners would rather tolerate inefficiencies than risk mistakes.
4) Patching causes confusion in operation and maintenance. In pursuit of excellent congestion control, the protocols running on data center hosts and switches have been patched beyond recognition.
3. DCQUIC
In this section, we first give the design goals of DCQUIC. Then, we present the design of the DCQUIC transport system for datacenter networking. Finally, we discuss some key technologies and characteristics of DCQUIC that can be applied directly to datacenter networking, and improve the connection migration technology of standard QUIC to make it more suitable for datacenter networking.
3.1. Design Goals
The design goal of DCQUIC is to replace TCP with UDP in datacenter networking, and to separate and open up the original TCP congestion control, flow control and connection mechanisms to support QUIC-based innovation, including:
1) More efficient than DCTCP. There are many DCTCP congestion control solutions, but industry still uses a few of the most primitive and mature ones. What restricts the deployment of these schemes is that changing DCTCP involves the OS on both ends. By supporting pluggable congestion control, DCQUIC can provide the most suitable congestion control algorithm for each data center application.
2) More reliable than DCUDP. Although UDP is already very efficient thanks to connectionless communication, it only considers sending and not receiving. DCQUIC aims to further improve transmission reliability: through mechanisms such as multiplexing and fast retransmission, even if packets are accidentally lost in the network, DCQUIC will not suffer a performance collapse.
These innovations support incremental deployment, from the TCP era, through an era in which TCP and UDP coexist, to a final UDP era. They let the data plane focus on forwarding efficiency and overturn some of the existing flow control, congestion control and development models; these functions become independent of the kernel and can be software-defined by application developers and administrators.
3.2. Standard QUIC
QUIC is a UDP-based transport protocol designed by
Google. HTTP/3 has chosen to use QUIC instead of TCP
as its transport layer protocol. QUIC has new features such
as low-latency connection establishment, improved congestion control, multiplexing without head-of-line blocking, forward error correction, and connection migration, which can significantly improve transport efficiency and reliability [29]. In 2016, the IETF started standardizing QUIC. By 2018, QUIC carried 35% of Google's north-south traffic.
Figure 1. Protocol stack and functional modules of DCTCP, DCUDP and DCQUIC.
3.3. Architecture of DCQUIC
Standard QUIC targets the Long Fat Network (LFN), whereas datacenter networking is a Short Fat Network (SFN). In DCQUIC, the more important features are flow control, congestion control, retransmission, and connection migration. Since the kernel is not involved, DCQUIC inherently allows data center owners to develop and deploy their own protocols, such as forward error correction and redundant transmission. The packet types and protocol format of DCQUIC are basically the same as QUIC. Compared with existing QUIC, the aspects of DCQUIC that can still be improved are listed in Section 5. As shown in Figure 1, compared with DCTCP and DCUDP [30], DCQUIC has two advantages:
1) The UDP-based DCQUIC protocol stack is simple, with low coupling and high cohesion. It inherits the efficiency of UDP and the reliability of TCP. DCQUIC is implemented in user mode rather than the kernel, so it can be iterated quickly without changing the operating system.
2) DCQUIC supports modular development of flow control, congestion control, multiplexing, retransmission and other mechanisms, which will promote a new round of agile deployment of datacenter networking innovations (see the sketch below).
We developed a prototype of DCQUIC [17], which is optimized specifically for datacenter networking on the basis of standard QUIC.
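The modular congestion control in advantage 2) can be pictured as a table of callbacks that each algorithm fills in. The following C sketch is only an illustration under our own naming assumptions (dcquic_cc_ops and a trivial AIMD placeholder); it is not taken from the released prototype [17].

```c
/* Illustrative sketch of a pluggable congestion-control interface for a
 * user-space transport such as DCQUIC. All names are assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef struct dcquic_cc_ops {
    const char *name;
    void (*init)(void *state, uint64_t mss);            /* called once per connection */
    void (*on_ack)(void *state, uint64_t acked_bytes,
                   uint64_t rtt_us);                     /* grow the window            */
    void (*on_loss)(void *state, uint64_t lost_bytes);   /* shrink the window          */
    uint64_t (*cwnd)(const void *state);                 /* current congestion window  */
    size_t state_size;                                   /* per-connection state bytes */
} dcquic_cc_ops;

/* A trivial AIMD placeholder standing in for Cubic/DCTCP-style algorithms. */
typedef struct { uint64_t cwnd; uint64_t mss; } aimd_state;

static void aimd_init(void *s, uint64_t mss) {
    aimd_state *st = s; st->mss = mss; st->cwnd = 10 * mss;          /* IW10 */
}
static void aimd_on_ack(void *s, uint64_t acked, uint64_t rtt_us) {
    (void)rtt_us;
    aimd_state *st = s; st->cwnd += (st->mss * acked) / st->cwnd;    /* additive increase */
}
static void aimd_on_loss(void *s, uint64_t lost) {
    (void)lost;
    aimd_state *st = s;
    st->cwnd = st->cwnd / 2 > st->mss ? st->cwnd / 2 : st->mss;      /* multiplicative decrease */
}
static uint64_t aimd_cwnd(const void *s) { return ((const aimd_state *)s)->cwnd; }

const dcquic_cc_ops dcquic_cc_aimd = {
    "aimd", aimd_init, aimd_on_ack, aimd_on_loss, aimd_cwnd, sizeof(aimd_state)
};
```

Switching a connection to a different algorithm then only means pointing it at another dcquic_cc_ops table; no kernel change is involved.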
3.4. Connection Migration in DCQUIC
DCQUIC reuses the connection migration mechanism of QUIC: a connection ID identifies a connection between client and server. Connection migration helps the data center realize dynamic migration of virtual machines at L3/L4. In addition, it allows servers, VMs or containers to achieve congestion avoidance and redundant transmission, improving transmission reliability.
In DCQUIC, connection migration is divided into proactive connection migration and passive connection migration. The former allows multi-NIC servers, VMs or containers to keep a connection alive; the latter allows a client to keep its connection after migration.
3.4.1. Proactive Connection Migration. When a drastic increase in network delay or an interruption of responses is detected, proactive connection migration promptly switches to another available source IP so as to maintain the existing connection with the communicating server and avoid service interruption.
Stream State Management maintains stream information, including stream establishment time, end time, and total transmitted bytes, expressed as the tuple <StreamID, EstTime, EndTime, TotalBytes>. StreamID is a variable-length unsigned integer. EstTime records the time of stream creation and is initialized by Create Stream. EndTime is the end time of the stream, i.e., the moment when a frame with FIN=1 is received. TotalBytes is the total number of bytes transmitted on one stream; its purpose is to normalize and compare streams of different lengths. TotalBytes is calculated from the frame with FIN=1 of each stream:
$$\mathrm{TotalBytes} = \mathrm{Offset} + \mathrm{DataLength}. \qquad (1)$$
We use the stream completion time per byte to evaluate efficiency; if all streams carried the same number of bytes, we could directly use the response time:
$$\mathrm{StreamComplTimePerByte} = \frac{\mathrm{EndTime} - \mathrm{EstTime}}{\mathrm{TotalBytes}}. \qquad (2)$$
For streams with different numbers of bytes, Eq. 2 lets us compare completion times across streams. In addition to the completion time, proactive connection migration also records the mapping between a stream and its UDP source, expressed as <StreamID, UDPSourceIP>.
Proactive connection migration evaluates fluctuations in network quality by tracking the completion times of different streams of the same connection. When the fluctuation exceeds a threshold, it actively switches NICs to avoid more serious long-term network congestion. StreamComplTimePerByte(m, n) denotes the average per-byte completion time from the m-th stream to the n-th stream:
$$\mathrm{StreamComplTimePerByte}(m, n) = \frac{\sum_{i=m}^{n} \mathrm{StreamComplTimePerByte}(i)}{n - m}. \qquad (3)$$
Assume the long-term observation window is W, the short-term observation window is w (w < W), the latest stream id is n, and the network fluctuation threshold is α. When Equation 4 is satisfied, network performance is considered to have deteriorated drastically, and the server needs to start proactive connection migration:
$$\frac{\mathrm{StreamComplTimePerByte}(n - w,\; n)}{\mathrm{StreamComplTimePerByte}(n - W,\; n - w)} > \alpha. \qquad (4)$$
We analyze the performance of proactive connection migration under switch failure in Section 4.3.
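To make Eqs. 1-4 concrete, the C sketch below shows the per-stream bookkeeping and the window-ratio trigger. It is an illustration under our own naming assumptions (stream_record, should_migrate, a fixed-size ring of records), not code from the released prototype; in the experiment of Section 4.3 the parameters are W = 50, w = 10 and α = 2.

```c
/* Illustrative sketch of the proactive-migration bookkeeping (Eqs. 1-4).
 * Names and data layout are assumptions, not the prototype's actual code. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_STREAMS 4096

typedef struct {
    uint64_t stream_id;
    double   est_time;     /* stream creation time (s)                   */
    double   end_time;     /* time the frame with FIN=1 was received (s) */
    uint64_t total_bytes;  /* Eq. 1: Offset + DataLength of that frame   */
} stream_record;

static stream_record streams[MAX_STREAMS];   /* indexed by stream id modulo size */

/* Eq. 2: per-byte completion time of one stream. */
static double compl_time_per_byte(const stream_record *s) {
    return (s->end_time - s->est_time) / (double)s->total_bytes;
}

/* Eq. 3: average per-byte completion time over streams m..n
 * (denominator n - m, exactly as written in the equation). */
static double window_avg(uint64_t m, uint64_t n) {
    double sum = 0.0;
    for (uint64_t i = m; i <= n; i++)
        sum += compl_time_per_byte(&streams[i % MAX_STREAMS]);
    return sum / (double)(n - m);
}

/* Eq. 4: trigger migration when the short window degrades by more than alpha
 * relative to the long window. n is the latest stream id, w < W. */
static bool should_migrate(uint64_t n, uint64_t W, uint64_t w, double alpha) {
    if (n < W) return false;   /* not enough history yet */
    return window_avg(n - w, n) / window_avg(n - W, n - w) > alpha;
}
```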
3.4.2. Passive Connection Migration. After a VM is migrated, in addition to L2 technologies such as VXLAN, passive connection migration allows the VM to continue using its new source IP address to maintain the existing connection with the communicating server.
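Both migration modes rely on the same property: DCQUIC, like QUIC, demultiplexes incoming packets by connection ID rather than by the UDP 4-tuple, so a change of source IP or port does not break the mapping. A minimal lookup of this kind is sketched below; the types and table layout are illustrative assumptions.

```c
/* Minimal sketch: look up a connection by connection ID instead of by
 * (src IP, src port, dst IP, dst port), so a packet arriving from a new
 * source address after migration still maps to the existing connection. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  conn_id[20];   /* QUIC connection IDs are at most 20 bytes */
    uint8_t  conn_id_len;
    void    *conn_state;    /* streams, congestion control, crypto keys, ... */
} dcquic_conn;

static dcquic_conn conn_table[1024];
static size_t      conn_count;

dcquic_conn *dcquic_lookup(const uint8_t *cid, uint8_t cid_len) {
    for (size_t i = 0; i < conn_count; i++) {
        if (conn_table[i].conn_id_len == cid_len &&
            memcmp(conn_table[i].conn_id, cid, cid_len) == 0)
            return &conn_table[i];   /* source IP/port are not part of the key */
    }
    return NULL;
}
```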
4. Experiment and Practice
In this section, we conduct three experiments to verify the performance of DCQUIC. First, we compare the performance of DCQUIC and DCTCP. Second, we verify that transport performance can be improved by negotiating the DCQUIC maximum packet size. Third, we verify the effect of maintaining connections through proactive connection migration when a switch fails.
4.1. Performance Comparison of DCTCP and DCQUIC
We deployed DCTCP and DCQUIC in a private data center, AWS and Tencent Cloud to verify their performance differences in the same business scenario. We measured the completion times of the handshake, a 10KB JSON data request, and a 100KB JSON data request for DCTCP and DCQUIC during RPC calls. To ensure a fair comparison, the handshake time of DCTCP includes the TLS 1.3 handshake time.
Setting Private data center: Server and client are configured with an 8-core Intel Xeon CPU E3-1245 V2 @ 3.40GHz, 4GB memory and an Intel Corporation 82583V Gigabit NIC, running 64-bit Ubuntu 16.04 (4.4.0-187 kernel). The link bandwidth B is 1 Gbps.
AWS: Server and client are two EC2 instances with an Intel Xeon CPU E5-2676 v3 @ 2.40GHz, 1GB memory and a virtual NIC, running 64-bit Ubuntu 16.04 (4.4.0-189-generic). The link bandwidth B is 1 Gbps.
Tencent Cloud: Server and client are two CVMs with an Intel Xeon CPU E5-26xx v4 @ 2.40GHz, 2GB memory and a virtual NIC, running 64-bit Ubuntu 16.04 (4.4.0-157-generic). The link bandwidth B is 1 Gbps.
All completion times are recorded by the clients. For each of the three measurement items, we carried out 10,000 experiments. The congestion control algorithm is Cubic, and the ECN marking threshold is 20.
Experimental phenomena and analysis As shown in Figure 2, the average single-request completion time of DCQUIC is ahead of DCTCP in all three measurement items. Compared with DCTCP + TLS 1.3, DCQUIC's handshake completion time is reduced by 12.59ms/13.09ms/14.79ms (about 73.02%/68.96%/68.82%), which means that DCQUIC can establish a connection faster. When the request object size is 10KB, the completion time of DCQUIC (including connection establishment time) is 12.56ms/13.06ms/20.34ms (about 63.70%/60.85%/67.88%) ahead of DCTCP. This shows that the performance difference between DCQUIC and DCTCP for small tasks mainly comes from the handshake time during connection establishment. When the RPC object is 100KB, the server needs to split it into about 80 data packets and send them sequentially. DCQUIC is still ahead of DCTCP, by 13.59ms/14.11ms/30.60ms (about 50.92%/49.59%/66.23%), but the gap is no longer as large as in the first two test items. This is because as
the size of the request object increases, the data transport latency, rather than connection establishment, becomes the dominant component of the task completion time.
Figure 2. Performance comparison of DCTCP and DCQUIC. (a) Private data center. (b) AWS. (c) Tencent Cloud.
4.2. Improve Transport Efficiency by Negotiating DCQUIC Packet Size
The total length field in the UDP header is 2 bytes, so the total length of a UDP packet is limited to 65535 bytes, which fits into a single IP packet and keeps the UDP/IP protocol stack very simple and efficient. The maximum UDP payload is 65527 bytes, giving a payload ratio of up to 99.98%. Although the DCQUIC protocol header and control fields occupy part of the UDP data field, this DCQUIC information is necessary and can be defined by the data center owner.
Under the TCP/IP architecture of the data center, fragmentation is handled by the IP protocol in the kernel, so many different types of applications share the same transport parameters, such as fragment length and send/receive buffers. As a result, the TCP packet size is often set to 1452 bytes. However, existing data centers already support jumbo frames; for example, many Amazon EC2 instances support a 9001-byte MTU. We found that 50 of the 52 instances we counted in the Amazon Ningxia region support a 9001-byte MTU. Based on this fact, we implemented a mechanism to negotiate the packet size.
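One input such a negotiation needs is the MTU of the local interface, which upper-bounds any usable MPS. The paper does not specify how the prototype discovers it; the sketch below shows one standard way to read it on Linux via the SIOCGIFMTU ioctl.

```c
/* Read the MTU of a local interface (e.g., "eth0"). A negotiated DCQUIC
 * Maximum Packet Size must not exceed this value minus IP/UDP header overhead. */
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int get_interface_mtu(const char *ifname) {
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) { close(fd); return -1; }
    close(fd);
    return ifr.ifr_mtu;   /* e.g., 1500, or 9001 on jumbo-frame instances */
}
```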
Firstly, we measured the transport performance of tasks of different sizes under different DCQUIC Maximum Packet Sizes (MPS) and UDP Socket buffer sizes.
Figure 3. Experimental topology: Servers #1-#4 connected by 1000M links via aggregation and core switches, with a network damage simulator in the network.
Setting Figure 3 shows the small-scale testbed, which consists of 4 servers and 4 switches. The switches are BMv2 switches, which implement strict FIFO forwarding. Each server is an ADLINK aTCA-8214 blade server with an 8-core Intel Xeon CPU E3-1245 V2 @ 3.40GHz, 4GB memory and an Intel Corporation 82583V Gigabit NIC, running 64-bit Ubuntu 16.04 (4.4.0-187 kernel). The link bandwidth B is 1 Gbps and the background traffic is 800 Mbps. The two-way delay D is about 50 µs. The task size L ranges from 5KB to 10GB. The congestion control is Cubic.
Experimental phenomena and analysis We ignore the buffers of the intermediate switches because they can be regarded as part of the UDP Socket buffer on the receiving side. The measurement results are shown in Figure 4. With a common instance configuration (UDP Socket buffer = 200KB, QUIC Maximum Packet Size = 1252B), the actual transport completion time is 1.8X-9.6X the ideal completion time T_ideal = L/B. Many factors affect transport efficiency, such as MTU, receiver buffer, switch buffer, available bandwidth, congestion control and flow control; here we discuss the impact of the first two.
For small transfer tasks, the transport bottleneck is the fixed connection establishment delay. Especially when the task is smaller than 100KB, the task completion time is significantly reduced by increasing the MPS.
For large transfer tasks, only increasing the receiving UDP Socket buffer or only increasing the MPS often cannot achieve the ideal performance. When MPS = 9KB and the UDP Socket buffer = 1MB, the actual transport time is very close to the ideal transport time.
As shown in Figure 5, the throughput results are consistent with the task completion times. We also found that transport performance does not increase linearly with the receiving UDP Socket buffer and the MPS. It is very difficult to find this turning point, but we can give an empirical value for reference: for small tasks (L < 1MB), MPS can be 9KB; for large tasks (L ≥ 1MB), MPS can be 5KB with a 1MB UDP Socket buffer. This empirical value balances transport efficiency, the computational overhead of sending and receiving, and the packet loss rate. Setting the Maximum Packet Size in DCQUIC requires only 2 lines of code.
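As a rough illustration of how little configuration this involves, the sketch below applies the empirical values above: a larger MPS in a DCQUIC configuration object and a 1MB receiving UDP Socket buffer. The dcquic_config struct and its field are hypothetical placeholders, not the prototype's actual API; only the SO_RCVBUF call is standard sockets code.

```c
/* Hypothetical illustration of the empirical settings discussed above. */
#include <stdint.h>
#include <sys/socket.h>

struct dcquic_config { uint32_t max_packet_size; /* ... other knobs ... */ };

void apply_empirical_settings(struct dcquic_config *cfg, int udp_fd) {
    cfg->max_packet_size = 9000;                  /* MPS for small tasks (L < 1MB)   */
    int rcvbuf = 1 * 1024 * 1024;                 /* 1MB receiving UDP Socket buffer */
    setsockopt(udp_fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}
```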
Figure 4. The impact of the maximum packet size and receiver buffer on completion time.
Figure 5. Throughput (Mbps) over time for MPS = 1200B, 2500B, 5000B and 9000B with a 1MB receive buffer: (a) Task is 10KB. (b) Task is 1MB. (c) Task is 100MB. (d) Task is 1GB.
Secondly, we deployed the DCQUIC Maximum Packet Size negotiation mechanism on the testbed, which selects an appropriate MPS according to the size of the task.
Setting We use two common workloads: a web search workload and a data mining workload. The web search traffic is composed of two file block sizes (100B and 10KB), each accounting for 50%. The data mining traffic is composed of three file block sizes (0.5MB, 1MB and 2MB), each accounting for 33.3%. The two workload services are randomly distributed on Server #1 and Server #3, and 200 connections coexist during operation. We counted the task completion time (TCT) under two transport schemes. The first uses a UDP Socket buffer of 200KB and an MPS of 1252B. The second uses a UDP Socket buffer of 1MB and an MPS of 5KB (if L ≥ 1MB) or 9KB (if L < 1MB).
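A minimal sketch of the size-based selection used by the second scheme is given below, assuming the sender knows the task size L in advance; the function name and the MTU cap are our own illustrative additions.

```c
/* Pick the DCQUIC Maximum Packet Size for a task of L bytes, following the
 * rule of the second scheme (9KB if L < 1MB, 5KB otherwise), without
 * exceeding what the path MTU allows. Illustrative only. */
#include <stdint.h>

static uint32_t negotiate_mps(uint64_t task_bytes, uint32_t path_mtu) {
    uint32_t want = (task_bytes < 1 * 1024 * 1024) ? 9000u : 5000u;
    uint32_t cap  = (path_mtu > 28) ? path_mtu - 28 : 0;   /* IPv4 (20B) + UDP (8B) headers */
    return (want < cap) ? want : cap;
}
```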
Experimental phenomena and analysis Figure 6 shows the TCT results for the different scenarios. We draw three conclusions:
Figure 6. Comparison of task completion time: (a) Web search workload.
(b) Data mining workload.
1) In both the web search and data mining workload scenarios, negotiating the DCQUIC Maximum Packet Size is an effective way to reduce task completion time. For transport tasks of different sizes, the average TCT dropped by 9.17%-63.2%.
2) In a datacenter network dominated by small tasks, e.g. web search services, the TCT reduction from increasing the DCQUIC Maximum Packet Size is more significant.
3) Because small tasks use an even larger DCQUIC Maximum Packet Size, they do not fall behind when competing with large tasks. This indicates that the performance degradation that larger packet sizes for large tasks would otherwise impose on small tasks can be offset by increasing the Maximum Packet Size of small tasks even more aggressively. Therefore, DCQUIC is fair and friendly to both large and small tasks.
4.3. The Performance of Proactive Connection Migration in DCQUIC
In datacenter networking, network equipment failure is a normal phenomenon, and among all network device failures, switch failures are the dominant type in terms of downtime. A data center can effectively avoid network service interruptions caused by switch failures by configuring multi-NIC servers and using multi-path topologies. However, even with these methods, a traditional DCTCP-based datacenter network still inevitably experiences short interruptions, mainly consisting of the application server/client detecting the interruption and re-establishing the connection. In this section, we use DCQUIC to address, from the client side, the service unavailability caused by an intermediate switch failure, so that the client can smoothly use multiple NICs to maintain its connection with the server [31].
Setting In the topology shown in Figure 3, Server #1, Server #2 and Server #3 are three RPC clients, each calling an RPC service on Server #4. Server #1 and Server #2 use TCP to transmit the RPC object; when Server #2 detects that the network is unavailable, it continues to request the service by switching IPs at the application layer. Server #3 uses an RPC service framework based on DCQUIC, and uses proactive connection migration to maintain the connection by changing its source IP/port. We use a network damage simulator to add forward delay to emulate switch congestion and failure. The experimental parameters are W = 50, w = 10, α = 2. The request object size is 100KB.
Experimental phenomena and analysis As shown in Figure 7, from the 200th to the 400th epoch the network damage simulator increases the one-way delay by 100ms, representing congestion-induced delay; from the 500th to the 700th epoch it raises the packet loss rate to 100%, representing a failure of switch #2. Server #2 keeps requesting Server #4 through switch #2, and even when the delay increases it does not trigger a link switch at the application layer. After Server #3 experiences a brief delay increase, it switches to the path through switch #3, thereby avoiding a more serious, sustained delay increase; that is, Server #3 smoothly migrated the connection away from the path with poor network quality through proactive connection migration. At about the 500th epoch, Server #2 re-establishes a new connection with Server #4 through switch #1 after continuous packet loss and retransmission. Over all 1000 request epochs, DCQUIC reduced the task completion time by 67.92%.
5. Some Open Issues
Although DCQUIC may become a future data center transport solution, several key technologies still need to be addressed.
DCQUIC Offload The CPU and bus resources that DCQUIC/UDP/IP consumes during data transport need to be measured and evaluated, including crypto, connection establishment and teardown, packet reordering, and
packet header formatting. Accelerating DCQUIC through offloading based on hardware/software co-design seems a promising direction [32].
Figure 7. Experimental results of maintaining connections through connection migration. (a) Delay and packet loss rate of the network damage simulator. (b) Task completion time of the three RPC clients.
Modular DCQUIC DCQUIC was designed from the beginning to support arbitrary extension, including function development and protocol extension. In fact, our first prototype [17] also follows this principle.
Congestion Control Hybrid Deployment To ensure fairness among flows, the same congestion control mechanism is often used inside a data center [33]. Since DCQUIC supports modular congestion control, in datacenter networks that mix TCP and UDP deployments and different congestion control algorithms [34], it is necessary to study the efficiency [35] and fairness among different tenants and applications.
6. Conclusion
In this paper, we proposed a new data center transport scheme, DCQUIC. DCQUIC retains some features of QUIC, and supports extensive scalability and deployability. DCQUIC not only brings clear performance improvements, but also supports numerous future innovative solutions.
Acknowledgments
This work was supported in part by the National Key
R&D Program of China under Grant No.2018YFB1800305,
the National Natural Science Foundation of China under
Grant No. 61802233, and the Shandong Provincial Natural Science Foundation under Grant No. ZR2019LZH013.
References
[1] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel,
B. Prabhakar, S. Sengupta, and M. Sridharan, “Data Center TCP
(DCTCP),” in SIGCOMM ’10. ACM, Aug. 2010, pp. 63–74,
https://doi.org/10.1145/1851182.1851192.
[2] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, “Re-Architecting Datacenter Networks and Stacks for Low Latency and High Performance,” in SIGCOMM ’17. ACM, Aug. 2017, pp. 29–42, https://doi.org/10.1145/3098822.3098825.
[3] W. Bai, L. Chen, K. Chen, and H. Wu, “Enabling
ECN in multi-service multi-queue data centers,” in NSDI
’16. USENIX Association, Mar. 2016, pp. 537–
549, https://www.usenix.org/conference/nsdi16/technical-
sessions/presentation/bai.
[4] M. Dong, T. Meng, D. Zarchy, E. Arslan, Y. Gilad, B. Godfrey, and
M. Schapira, “PCC vivace: Online-learning congestion control,” in
NSDI ’18. Renton, WA: USENIX Association, Apr. 2018, pp. 343–
356, https://www.usenix.org/conference/nsdi18/presentation/dong.
[5] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao,
M. Zhang, F. Kelly, M. Alizadeh, and M. Yu, “HPCC: High Precision
Congestion Control,” in SIGCOMM ’19. ACM, Aug 2019, pp. 44–
58, https://doi.org/10.1145/3341302.3342085.
[6] L. Chen, J. Lingys, K. Chen, and F. Liu, “Auto: Scaling deep
reinforcement learning for datacenter-scale automatic traffic op-
timization,” in SIGCOMM ’18. ACM, 2018, pp. 191–205,
https://doi.org/10.1145/3230543.3230551.
[7] K. Kaffes, T. Chong, J. T. Humphries, A. Belay, D. Mazières, and C. Kozyrakis, “Shinjuku: Preemptive Scheduling for µsecond-scale Tail Latency,” in NSDI ’19. Boston, MA: USENIX Association, Feb. 2019, pp. 345–360, https://www.usenix.org/conference/nsdi19/presentation/kaffes.
[8] B. Li, T. Cui, Z. Wang, W. Bai, and L. Zhang, “Socksdirect: Datacen-
ter Sockets Can Be Fast and Compatible,” in SIGCOMM ’19. ACM,
Aug. 2019, pp. 90–103, https://doi.org/10.1145/3341302.3342071.
[9] Y. Moon, S. Lee, M. A. Jamshed, and K. Park, “AccelTCP: Acceler-
ating Network Applications with Stateful TCP Offloading ,” in NSDI
’20. Santa Clara, CA: USENIX Association, Feb. 2020, pp. 77–92,
https://www.usenix.org/conference/nsdi20/presentation/moon.
[10] J. Hwang, Q. Cai, A. Tang, and R. Agarwal, “TCP ≈ RDMA: CPU-efficient Remote Storage Access with i10,” in NSDI ’20. Santa Clara, CA: USENIX Association, Feb. 2020, pp. 127–140, https://www.usenix.org/conference/nsdi20/presentation/hwang.
[11] S. Ghorbani, Z. Yang, P. B. Godfrey, Y. Ganjali, and A. Firoozshahian,
“DRILL: Micro Load Balancing for Low-Latency Data Center
Networks,” in SIGCOMM ’17. ACM, 2017, pp. 225–238,
https://doi.org/10.1145/3098822.3098839.
[12] H. Zhang, J. Zhang, W. Bai, K. Chen, and M. Chowdhury, “Resilient
Datacenter Load Balancing in the Wild,” in SIGCOMM ’17. ACM,
2017, pp. 253–266, https://doi.org/10.1145/3098822.3098841.
[13] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Ban-
non, S. Boving et al., “Jupiter Rising: A Decade of Clos Topologies
and Centralized Control in Google’s Datacenter Network,” in SIG-
COMM ’15. London, United Kingdom: ACM, 2015, pp. 183–197,
https://doi.org/10.1145/2829988.2787508.
[14] C. Raiciu and G. Antichi, “NDP: Rethinking Datacenter Networks and
Stacks Two Years After,” SIGCOMM Comput. Commun. Rev., vol. 49,
no. 5, pp. 112–114, 2019, https://doi.org/10.1145/3371934.3371968.
[15] A. Akella, T. Benson, B. Chandrasekaran, C. Huang, B. Maggs,
and D. Maltz, “A Universal Approach to Data Center Network
Design,” in ICDCN ’15. Goa, India: ACM, 2015, pp. 1–10,
https://doi.org/10.1145/2684464.2684505.
[16] A. Langley, J. Iyengar, J. Bailey, J. Dorfman, and I. Swett, “The QUIC Transport Protocol: Design and Internet-Scale Deployment,” in SIGCOMM ’17, Los Angeles, CA, USA, Aug 2017, pp. 183–196, https://doi.org/10.1145/3098822.3098842.
[17] X. Gao, “libgquic,” 2020, https://github.com/Gscienty/libgquic.
[18] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron, “Better Never
than Late: Meeting Deadlines in Datacenter Networks,” in SIGCOMM
’11. Toronto, Ontario, Canada: ACM, August 2011, pp. 50–61,
https://doi.org/10.1145/2018436.2018443.
[19] K. Zheng, Y. Bai, and X. Wang, “FCTcon: Dynamic Con-
trol of Flow Completion Time in Data Center Networks for
Power Efficiency,” IEEE Transactions on Cloud Computing, 2019,
10.1109/TCC.2019.2912969.
[20] B. Vamanan, J. Hasan, and T. Vijaykumar, “Deadline-Aware Datacen-
ter Tcp (D2TCP),” SIGCOMM Comput. Commun. Rev., vol. 42, no. 4,
pp. 115–126, Aug. 2012, https://doi.org/10.1145/2377677.2377709.
[21] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang, “Information-agnostic flow scheduling for commodity data centers,” in NSDI ’15. Oakland, CA: USENIX Association, May 2015, pp. 455–468, https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/bai.
[22] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao,
X. Li, and S. Uhlig, “Elastic sketch: Adaptive and fast network-wide
measurements,” in SIGCOMM ’18. Budapest, Hungary: ACM, Aug
2018, pp. 561–575, https://doi.org/10.1145/3230543.3230544.
[23] J. Bao, D. Dong, B. Zhao, and S. Huang, “HyFabric: Minimizing
FCT in Optical and Electrical Hybrid Data Center Networks,” in
SIGCOMM Posters and Demos ’19. Beijing, China: ACM, 2019,
pp. 57–59, https://doi.org/10.1145/3342280.3342306.
[24] R. B. Basat, G. Einziger, I. Keslassy, A. Orda, S. Vargaftik, and
E. Waisbard, “Memento: Making Sliding Windows Efficient for
Heavy Hitters,” in CoNEXT ’18. Heraklion, Greece: ACM, 2018,
pp. 254–266, https://doi.org/10.1145/3281411.3281427.
[25] N. Katta, A. Ghag, M. Hira, I. Keslassy, A. Bergman, C. Kim, and
J. Rexford, “Clove: Congestion-Aware Load Balancing at the Virtual
Edge,” in CoNEXT ’17. Incheon, Republic of Korea: ACM, 2017,
pp. 323–335, https://doi.org/10.1145/3143361.3143401.
[26] Q. Shi, F. Wang, and D. Feng, “Intflow: Integrating per-packet
and per-flowlet switching strategy for load balancing in datacenter
networks,” IEEE Transactions on Network and Service Management,
pp. 1–1, 2020, https://doi.org/10.1109/TNSM.2020.2990868.
[27] S. Radhakrishnan, Y. Cheng, J. Chu, A. Jain, and B. Raghavan, “Tcp
fast open,” in CoNEXT ’11. Tokyo, Japan: ACM, 2011, pp. 1–12,
https://doi.org/10.1145/2079296.2079317.
[28] D. Shan, F. Ren, P. Cheng, R. Shu, and C. Guo, “Observing and Mitigating Micro-Burst Traffic in Data Center Networks,” IEEE/ACM Transactions on Networking, vol. 28, no. 1, pp. 98–111, 2019, https://doi.org/10.1109/TNET.2019.2953793.
[29] Y. Cui, T. Li, C. Liu, X. Wang, and M. Kühlewind, “Innovating transport with QUIC: Design approaches and research challenges,” IEEE Internet Computing, vol. 21, no. 2, pp. 72–76, 2017, https://doi.org/10.1109/MIC.2017.44.
[30] L. Ye, L. Mhamdi, and M. Hamdi, “Efficient udp-based con-
gestion aware transport for data center traffic,” in Cloud-
Net ’14, Luxembourg, Luxembourg, Oct 2014, pp. 46–51,
https://doi.org/10.1109/CloudNet.2014.6968967.
[31] Q. De Coninck and O. Bonaventure, “Multipath QUIC: Design and
Evaluation,” in CoNEXT ’17. Incheon, Republic of Korea: ACM,
Nov 2017, pp. 160–166, https://doi.org/10.1145/3143361.3143370.
[32] X. Yang, L. Eggert, J. Ott, S. Uhlig, Z. Sun, and G. Antichi, “Making
QUIC Quicker With NIC Offload,” in EPIQ ’20. Virtual Event, USA:
ACM, 2020, pp. 21–27, https://doi.org/10.1145/3405796.3405827.
[33] G. Kumar, N. Dukkipati, K. Jang, H. M. G. Wassel, X. Wu, B. Mon-
tazeri, Y. Wang, K. Springborn, C. Alfeld, M. Ryan, D. Wetherall,
and A. Vahdat, “Swift: Delay is Simple and Effective for Congestion
Control in the Datacenter,” in SIGCOMM ’20. Virtual Event, USA:
ACM, 2020, pp. 514–528, https://doi.org/10.1145/3387514.3406591.
[34] R. Al-Saadi, G. Armitage, J. But, and P. Branch, “A Survey of
Delay-Based and Hybrid TCP Congestion Control Algorithms,” IEEE
Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3609–3638,
2019.
[35] T. A. N. Nguyen, S. Gangadhar, and J. P. G. Sterbenz, “Performance
evaluation of tcp congestion control algorithms in data center net-
works,” in CFI ’16. Nanjing, China: ACM, June 2016, pp. 21–28,
https://doi.org/10.1145/2935663.2935669.
