High Performance Wide Area Data Transfers Over High Performance
Networks
Phillip Dickens
Department of Computer Science
Illinois Institute of Technology

William Gropp
Mathematics and Computer Science Division
Argonne National Laboratory

Paul Woodward
Laboratory for Computational Science and Engineering
University of Minnesota
Abstract
This paper introduces a new user-level
communication protocol designed to provide
high-performance data transfers across high-
bandwidth, high-delay networks. The protocol
incorporates the most important enhancements
defined by the networking community to improve
the performance of TCP for this environment,
and also defines enhancements unique to this
protocol. In terms of the so-called “Large
Window” extensions to TCP, this protocol
implements a communication window that is
essentially infinite, and provides a selective
acknowledgement window that spans the entire
data transfer. In terms of user-level extensions to
TCP, it implements a user-defined
acknowledgement frequency, a user-defined
“batch sending” window, and a simple
framework within which the user can define the
algorithm that determines the next data packet to
be sent out across all eligible packets. We
present experimental results demonstrating data
throughput on the order of 85% - 92% of the
maximum available bandwidth across both short
haul and long haul high-performance networks.
1 Introduction
The Internet2 initiative [19] promises the
development of high-performance networking
applications in diverse areas such as distributed
collaboration, visualization of scientific data,
high performance grid-based computations, and
Internet telephony. Abilene [20], the high-
performance backbone network associated with
the Internet2 project, provides an OC-48
connection between the regional aggregation
points (Gigapops) that it connects. Thus there is
significant bandwidth available for such
advanced networking applications, and the issue
becomes one of actually realizing the available
bandwidth.
It has been well documented that, in practice,
user-level distributed applications connected by
Abilene achieve only a small percentage of the
available bandwidth [1,3,4,5,6,7,10]. The
primary reason for this poor performance is that
the Transmission Control Protocol (TCP) [11],
the communication mechanism of choice for
most applications, is not well suited for high-
performance, wide area data transfers
[3,4,5,6,13,14,15]. Thus one critical area of
current research is the development of
mechanisms to improve the performance of TCP
in a high-bandwidth, high-delay environment,
and another is to study alternative
communication protocols that can circumvent the
problems associated with TCP.
In this paper, we present the results of our
efforts to achieve high-performance wide area
data transfers between selected sites connected
by the Abilene backbone network. In particular,
we developed and tested a very simple user-level
communication protocol, utilizing a single UDP
stream with a simple acknowledgement and
retransmission mechanism, designed specifically
for this type of environment. We tested our user-
level protocol against (an optimized) TCP stream
on data transfers between Argonne National
Laboratory and the Laboratory for Computational
Science and Engineering (LCSE) at the
University of Minnesota, and between Argonne
National Laboratory and the Center for
Advanced Computing Research (CACR) at the
California Institute of Technology.
Our results are encouraging. We obtained
over 90% of the maximum available bandwidth
on data transfers between ANL and LCSE using
both approaches: that is, using a single
(optimized) TCP stream and our simple user-
level protocol that utilizes a single UDP stream.
On data transfers between ANL and CACR
however, only the user-level protocol was able to
obtain such a high percentage of this maximum
bandwidth. In particular, the user-level protocol
obtained up to 85% of the maximum available
bandwidth while the optimized TCP stream
obtained only on the order of 10% of this
maximum.
There are three primary contributions of this
paper. First, it outlines a simple user-level
protocol that was shown to provide excellent
performance across both short and long haul
high-performance networks. As noted, these
results were obtained with a single UDP data
stream in conjunction with a simple
acknowledgement and retransmission scheme.
This is in contrast to the alternative approaches
that utilize multiple TCP streams (up to eight
streams per host) to improve performance in this
setting [1,10,13]. A single data stream has the
advantage that it does not require the kernel to
multiplex multiple TCP streams, where such
multiplexing may negatively impact the
performance of applications executing on that
particular host.
Another contribution of this paper is that it
provides a detailed study of the impact on
performance due to the various parameters that
can be controlled at the user level. This provides
insight into the relationship between the
acknowledgement frequency and the amount of
network resources “wasted” due to the greedy
nature of the algorithm, and also helps to explain
the performance of TCP in this environment.
Thirdly, the end-points of the experiments
conducted between ANL and LCSE were two
Pentium3-based Windows 2000 boxes running
the “off-the-shelf” Winsock2 API. Windows
2000 supports the “Large Window” extensions to
TCP [6], and it is important to look at the
performance of the Winsock2 API, with all of its
support for high-performance data transfers,
given that the vast majority of published
bandwidth studies employ either Unix or Linux
TCP implementations.
The rest of this paper is organized as follows.
In Section 2 we discuss other approaches to
providing high-performance data transfers in a
high-delay, high-bandwidth environment. In
Section 3, the experimental design is presented.
In Section 4, the user-level communication
protocol is presented. In Section 5 we present the
results of our experimental studies, and we
provide conclusions and future research in
Section 6.
2 Related Work
There is a significant amount of research relating
to obtaining (a high percentage of) the available
bandwidth in high-bandwidth, high-delay
networks. This research is proceeding along two
fronts: One approach is fundamental research
into mechanisms to improve the performance of
TCP itself. The other approach is to develop
techniques at the application level that
circumvent the performance problems associated
with TCP. We discuss each approach in turn.
As discussed in [13,14,15], the size of the
TCP window is the single most important factor
in achieving good performance over high-
bandwidth, high-delay networks. To keep such
“fat” pipes full, the TCP window size should be
as large as the product of the bandwidth and the
round-trip delay. This has led to research in
automatically tuning the size of the TCP socket
buffers at runtime [16]. Also, it has led to the
development of commercial TCP
implementations that allow the system
administrator to significantly increase the size of
the TCP window to achieve better performance
[14]. Another area of active research is the use of
a Selective Acknowledgement mechanism
[8,14,18] rather than the standard cumulative
acknowledgement scheme. In this approach, the
receiving TCP sends to the sending TCP a
Selective Acknowledgement (SACK) packet that
specifies exactly those packets that have been
received, allowing the sender to retransmit only
those segments that are missing. Additionally,
“fast retransmit” and “fast recovery” algorithms
have been developed that allow a TCP sender to
retransmit a packet before the retransmission
timer expires, and to increase the size of its
congestion control window, when three duplicate
acknowledgement packets are received (without
intervening acknowledgements) [8,18]. An excellent source
of information, detailing which commercial and
experimental versions of TCP support which
particular TCP extensions, may be found in [14].
At the user level, the allocation of multiple
TCP streams has been investigated. PSockets
[13] employs multiple TCP sockets to increase
the size of the TCP window. As discussed by the
authors, the limitations on TCP window sizes
are on a per socket basis, and thus striping the
data across multiple sockets provides an
aggregate TCP buffer size that is closer to the
(ideal size) of the bandwidth times round-trip
delay. A similar approach has been investigated
within the domain of satellite-based information
systems [10]. The Globus project [17] developed
a GridFTP [1] tool that employs multiple TCP
streams per host, with (perhaps) multiple hosts.
This again significantly increases the size of the
TCP window. It is also worth noting that using
multiple TCP sockets increases the probability
that, at any given time, there will be at least one
TCP stream that is ready to fire.
3 Experimental Design
We investigated (reasonably) large-scale data
transfers on two high-performance network
connections: one between Argonne National
Laboratory (ANL) and the Laboratory for
Computational Science and Engineering (LCSE)
at the University of Minnesota, and one between
ANL and the Center for Advanced Computing
Research (CACR) at the California Institute of
Technology. ANL is connected to both of these
sites across Abilene. The endpoints at both ANL
and LCSE were Intel Pentium3-based PCs
running Windows 2000 and using the Winsock2
API. We did not have access to a Windows 2000
machine at CACR at the time of this writing, and
used instead an SGI Origin200 (with two
225 MHz MIPS R10000 processors) running
IRIX 6.5. Similarly, we did not have access to an
IRIX 6.5 machine at ANL, and thus were forced
to run the experiments between a Windows 2000
machine at one end and an IRIX 6.5 machine at
the other.
As noted on the Pittsburgh Supercomputing
Center website [14], IRIX 6.5 (like Windows 2000)
supports the RFC 1323 [6] “Large Window”
extensions to TCP. Both machines also support
path MTU discovery [9], where the segment size
is determined by the maximum packet size that
can be transmitted across the complete path
without fragmentation, rather than simply using
a pre-determined segment size that may be
smaller (and thus less efficient) than this value.
Also, both machines support TCP Selective
Acknowledgements [8,18]. However, the default
TCP window size on the SGI Origin200 is
(approximately) 64KB, and we did not have
system-level access to the machine that would
have allowed us to increase this window size.
The TCP window under Windows 2000 is (or
can very easily be) extended to one Gigabyte
[7,14]. The slowest link on either pair of
connections was 100 Mb/sec, which was
incurred between the desktop PC at ANL used in
these experiments, and the Mathematics and
Computer Science Division’s external router.
The round-trip delay between ANL and
LCSE was measured (using traceroute) to be on
the order of 26 milliseconds, and we (loosely)
categorized this as a short haul network. The
round-trip delay between ANL and CACR was
on the order of 65 milliseconds, which we (again
loosely) categorized as a long haul network. The
transmitted data size for the experiments
between ANL and LCSE was 40 MB, and was
20 MB between ANL and CACR. Similar to the
results found in [13], the amount of data
transferred did not have a significant impact on
the throughput rate, so we opted for a smaller
data size on the long haul connection to decrease
the cost of experimentation. It is interesting to
note that the bandwidth delay product for the
ANL to LCSE connection was 1.04 Gigabytes,
which was only fractionally larger than the TCP
window size used by Windows 2000. The
bandwidth delay product for the connection
between ANL and CACR was 1.3 Gigabytes,
which is orders of magnitude larger than the
default (approximately) 64 KB buffer allocated
by IRIX6.5. As will be seen, this significant
difference in the size of the TCP window had a
tremendous impact on the performance of the
TCP algorithm.
In our experiments, we ran thirty trials
of sending either 20 MB of data (across the long
haul network) or 40 MB of data (across the short
haul network). The metric of interest was the
percentage of the maximum available bandwidth
(which is 100 Mb/sec) that was obtained for each
approach. A byte re-ordering cost was
necessitated by the architectural differences
between the SGI Origin 200 and the Windows
2000 machine, and this cost was included in the
final bandwidth values reported.
4 Communication Protocols
We tested two communication protocols: an
optimized (single) TCP stream and a user-level
UDP-based protocol. We found that tuning the
performance of the Windows-based TCP
implementation was very simple. The primary
optimization was to simply request a socket
buffer size greater than 64 KB, which
automatically enabled the “Large Window” TCP
extensions [7,14]. Additionally, we disabled the
Nagle algorithm to avoid delays in placing
packets on the network, and experimented with
decomposing the data into smaller “chunk sizes”
(thus requiring multiple calls to the TCP send
routine to complete the data transfer). We were
very limited in optimizing TCP as implemented
on the SGI Origin200 since we did not have
system-level access to the TCP stack.
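As a concrete illustration of these tuning steps, the sketch below requests a socket buffer larger than 64 KB and disables the Nagle algorithm. It is written against Python's socket module rather than the Winsock2 C API used in our experiments, and the endpoint address, port, and buffer size are illustrative values only.

import socket

# Sketch of the TCP tuning described above (illustrative values only).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Requesting a send buffer larger than 64 KB allows the stack to use the
# RFC 1323 "Large Window" extensions on systems that support them.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)

# Disable the Nagle algorithm so small writes are not delayed.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Hypothetical receiver endpoint.
sock.connect(("receiver.example.org", 5000))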
4.1 User-Level UDP Protocol
The user-level protocol we developed
incorporates, at the user level, many of the
important extensions defined for TCP in a high-
bandwidth, high-delay network environment.
One important assumption made by the user-
level algorithm however is that both the sender
and the receiver have pre-allocated buffers large
enough to accommodate the complete data
transfer. This seems to be a very reasonable
assumption, and certainly applies to the
applications in which we are interested. In
particular, it applies to wide-area MPI [21]
(where for every send there is a matching
receive), the File Transfer Protocol (FTP, where
the disk is used as a data buffer), and data
visualization applications (where the generated
data is consumed by the data receiver). Given
this very important characteristic of the
applications of interest, our user-level protocol
pushes to the limit the idea of “Large Window”
extensions developed for TCP: that is, the
window size is essentially infinite since it spans
the entire data buffer (albeit at the user level). It
also pushes to the limit the idea of selective
acknowledgements. Given a pre-allocated
receive buffer and constant packet sizes, each
data packet in the entire buffer can be numbered.
The data receiver can then maintain a very
simple data structure with one byte (or even one
bit) allocated per data packet, where this data
structure tracks the received/not received status
of every packet to be received. This information
can then be sent to the data sending process at a
user-defined acknowledgement frequency. Thus
the selective acknowledgement window is also in
a sense infinite. That is, the data sender is
provided with enough information to determine
exactly those packets, across the entire data
transfer, that have not yet been received (or at
least not received at the time the
acknowledgement packet was created and sent).
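A minimal sketch of this bookkeeping follows, assuming one byte of state per packet and a fixed packet numbering; the class and function names are our own and are not taken from the implementation described here.

class AckState:
    """Receiver-side record of which packets have arrived."""

    def __init__(self, total_packets):
        # One byte per packet: 0 = not yet received, 1 = received.
        self.received = bytearray(total_packets)

    def mark_received(self, packet_number):
        duplicate = self.received[packet_number] == 1
        self.received[packet_number] = 1
        return duplicate

    def build_ack(self):
        # The acknowledgement carries the complete history of the
        # transfer, so it acts as a selective acknowledgement that
        # spans the entire data buffer.
        return bytes(self.received)


def missing_packets(ack_payload):
    # Sender side: recover the packets still outstanding from an
    # acknowledgement packet.
    return [i for i, flag in enumerate(ack_payload) if flag == 0]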
There are also features unique to our protocol
that have been determined (experimentally) to
have a very significant impact on performance.
One such feature is the acknowledgement
frequency, which, as discussed above,
determines the frequency at which an
acknowledgement packet (containing the
complete history of the data transfer) is sent to
the data sender. Another issue is the algorithm
that determines the next data packet, across all
eligible packets (to be defined below), to be sent
next. (To understand the importance of this
issue consider that when the data sender receives
an acknowledgment packet, it must determine
whether to perhaps re-send a data packet that
was lost, or to send a “new” data packet that has
not yet been sent for the first time.) The third
user-level parameter has to do with the number
of packets the data sender should transmit before
checking for (and processing) an
acknowledgement packet.
4.2 Algorithm Executed by Data Sender
The total data buffer was divided into pre-
determined, equal, and fixed-sized packets of
1024 bytes. This packet size was determined by
executing the path MTU discovery algorithm
along the paths from Argonne National
Laboratory to both LCSE and CACR. Thus in
the case of the 20 MB data buffer there were
19,532 packets, and there were 39,063 packets
with the 40 MB data buffer. One UDP socket was
used to transmit data from the sender to the
receiver, and another UDP socket was used to
send acknowledgement packets from the receiver
to the sender.
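The sketch below illustrates this packet layout and the two UDP sockets. The 1024-byte packet size and the resulting packet counts follow the values given above; the bind address and port number are hypothetical.

import math
import socket

PACKET_SIZE = 1024                      # fixed size from path MTU discovery

def packet_count(buffer_length):
    # With 1 MB = 10^6 bytes: 20 MB -> 19,532 packets; 40 MB -> 39,063.
    return math.ceil(buffer_length / PACKET_SIZE)

# One UDP socket carries data packets from sender to receiver; a second
# UDP socket carries acknowledgement packets from receiver to sender.
data_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ack_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ack_sock.bind(("0.0.0.0", 6001))        # sender listens for acks here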
The data-sending algorithm iterates over three
basic phases. In the first phase, the data sender
employs some algorithm to determine the
number of data packets to be placed onto the
network before looking for, and processing if
available, an acknowledgement packet. This is
referred to as a “batch-sending” operation since
all such packets are placed onto the network
without interruption (although the select system
call is used to ensure adequate buffer space for
the packet). It is very important to note that after
a batch-send operation the data sender looks for,
but does not block for, an acknowledgement
packet.
In the second phase of the algorithm, the data
sender looks for, and if available, processes an
acknowledgement packet. Processing of an
acknowledgement packet entails updating the
receive/not received status for each data packet
acknowledged, and determining the number of
packets that were received by the data receiver
between the time it created the previous
acknowledgement packet and the time it created
the current acknowledgement packet. This
information can then be used to determine the
number of packets to send in the next batch-send
operation. If no acknowledgement packet is
available, the information from the most recently
processed acknowledgement is used to determine
the number of packets to send in the next
batch-send operation. Note that a repeated
batch-sending operation with zero packets is
logically equivalent to blocking on an
acknowledgement.
In the third phase of the algorithm, the data
sender executes some user-defined algorithm to
choose the next packet, out of all
unacknowledged packets, to be placed onto the
network.
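The three phases can be summarized in a loop such as the sketch below. For simplicity the sketch uses a fixed batch size (two packets per batch-send operation, the value found to work best in Section 4.2.1) rather than adapting it from acknowledgement information, and it accepts the phase-three packet-selection algorithm as a parameter (a circular-buffer version is sketched in Section 4.2.1 below). The acknowledgement format follows the per-packet byte array described in Section 4.1; the packet framing and names are our own illustrative choices, not the actual implementation.

import select

PACKET_SIZE = 1024      # fixed packet size, as in Section 4.2

def send_loop(data, data_sock, ack_sock, dest, choose_next_packet,
              batch_size=2):
    """Sketch of the three-phase sending algorithm described above."""
    total = (len(data) + PACKET_SIZE - 1) // PACKET_SIZE
    acked = bytearray(total)          # sender's view: 1 = acknowledged
    cursor = 0                        # current position in the data buffer

    while not all(acked):
        # Phase 1: batch-send a small number of packets without blocking.
        for _ in range(batch_size):
            pkt = choose_next_packet(acked, cursor, total)
            cursor = (pkt + 1) % total
            payload = data[pkt * PACKET_SIZE:(pkt + 1) * PACKET_SIZE]
            data_sock.sendto(pkt.to_bytes(4, "big") + payload, dest)

        # Phase 2: look for, but do not block on, an acknowledgement
        # (a zero timeout makes select return immediately).
        readable, _, _ = select.select([ack_sock], [], [], 0)
        if readable:
            ack_payload, _ = ack_sock.recvfrom(65536)
            for i, flag in enumerate(ack_payload[:total]):
                if flag:
                    acked[i] = 1
        # Phase 3, choosing the next packet to send, is carried out by
        # the user-supplied choose_next_packet function in phase 1.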
4.2.1 Parameters for the Send Algorithm.
The first parameter studied was the number of
packets to be sent in the next batch-send
operation. Intuitively, one would expect that the
data sender should check for an
acknowledgement packet on a very frequent
basis, thus limiting the number of packets to be
placed onto the network in a given batch-send
operation. Our experimental results supported
this intuition, finding that two packets per batch-
send operation provided the best performance.
We therefore used this number in all subsequent
experimentation.
We also performed extensive experimentation
to determine which particular packet, out of all
unacknowledged packets, should next be placed
onto the network. We tried several algorithms,
and, in the end, it became quite clear that the best
approach (by far) was to treat the data as a
circular buffer. That is, the algorithm never went
back to re-transmit a packet that was not yet
acknowledged, if there were any packets that had
not yet been sent for the first time. Similarly, a
given packet was re-transmitted for the n+1st
time only if all other unacknowledged packets
had been re-transmitted n times.
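Under the framing assumptions of the earlier sketches, this circular-buffer policy might look as follows; the function name is ours, and it is compatible with the send_loop sketch above, which takes the selection policy as an argument.

def choose_next_packet(acked, cursor, total):
    # Circular-buffer selection: scan forward from the current position
    # and return the first packet that has not yet been acknowledged.
    # Because the scan wraps around, a packet is re-transmitted for the
    # (n+1)st time only after every other unacknowledged packet has been
    # sent n times, and no packet is re-sent while some packet has not
    # yet been sent at all.
    for offset in range(total):
        candidate = (cursor + offset) % total
        if not acked[candidate]:
            return candidate
    return cursor      # nothing outstanding; the caller stops sending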
As can be seen, the algorithm executed by the
sender is very greedy, continuing to transmit (or
re-transmit) packets (without blocking) until it
receives an acknowledgement packet from the
data receiver specifying that all data has been
successfully received. Thus a reasonable
question to ask is how wasteful of network
resources is this approach. One measure of
wasted resources applicable to this approach is
the number of duplicate packets received by the
data receiver over the course of the entire data
transfer. We did track this information, where the
data receiver maintained a simple counter that
was incremented every time it received a packet
that had already been marked as having been
received. In hindsight, it would also have
been useful to track the number of messages still
in the pipe when the receiver determined it had
obtained all of the data (and thus had stopped
trying to read packets off of the network). In
future research, we will track this information,
and also attempt to track the number of packets
lost in the network due to contention.
4.3 Algorithm Executed by Data Receiver
The data receiver iterates over a loop with
the select system call at the top of the loop. The
select system call takes as parameters a set of
socket descriptors (in the case of Windows
2000), or a set of file descriptors (in the case of
Unix). The select call also takes as a parameter a
timer. The select call returns when one of the
sockets is available for reading or writing, or
when the timer expires. We set the timeout
value for the data receiver at 1.5 seconds, which
was determined experimentally. The basic
algorithm is as follows.
1) Use the select system call to
determine if a data packet is
available.
2) If the select call times out send an
acknowledgement packet (that
contains a complete history of the
data transfer).
3) If a packet is available, read it off
of the network and determine if it is
a duplicate (i.e. has already been
received and acknowledged). If it is
a duplicate packet, discard it and
increment a counter tracking the
number of duplicate packets
received. If it is not a duplicate
packet, place the packet into its
proper position within the data
buffer using the packet number as
an offset.
4) If the data packet is not a duplicate,
increment a counter tracking the
total number of packets received by
the data receiver. If this value
exceeds some user-defined
threshold, send an
acknowledgement packet and reset
the counter to zero.
5) If the packet is a duplicate, and the
number of duplicate packets
received exceeds some threshold
value, then send an
acknowledgement packet and reset
the duplicate counter to zero.
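The numbered steps above are summarized in the sketch below. The 1.5-second select timeout and the duplicate threshold of 50 follow the values reported in the text, and the acknowledgement threshold is the user-defined parameter varied in the experiments; the socket setup, packet framing, and helper names are illustrative assumptions of ours rather than the actual implementation.

import select

PACKET_SIZE = 1024
SELECT_TIMEOUT = 1.5                 # seconds, determined experimentally

def receive_loop(total_packets, data_sock, ack_sock, sender_addr,
                 ack_threshold=900, duplicate_threshold=50):
    """Sketch of the receiver algorithm in steps 1 through 5 above."""
    buffer = bytearray(total_packets * PACKET_SIZE)
    received = bytearray(total_packets)     # received/not-received flags
    completed = 0
    new_count = 0                           # new packets since the last ack
    dup_count = 0                           # duplicates since the last ack
    total_duplicates = 0                    # metric reported in the results

    while completed < total_packets:
        # Step 1: wait (with a timeout) for a data packet.
        readable, _, _ = select.select([data_sock], [], [], SELECT_TIMEOUT)

        if not readable:
            # Step 2: timeout, so send an acknowledgement containing the
            # complete history of the data transfer.
            ack_sock.sendto(bytes(received), sender_addr)
            continue

        packet, _ = data_sock.recvfrom(PACKET_SIZE + 4)
        number = int.from_bytes(packet[:4], "big")

        if received[number]:
            # Steps 3 and 5: duplicate packet; discard it, count it, and
            # acknowledge if the sender appears badly out of date.
            dup_count += 1
            total_duplicates += 1
            if dup_count >= duplicate_threshold:
                ack_sock.sendto(bytes(received), sender_addr)
                dup_count = 0
        else:
            # Steps 3 and 4: new packet; place it by packet number and
            # acknowledge once the user-defined threshold is reached.
            received[number] = 1
            completed += 1
            start = number * PACKET_SIZE
            buffer[start:start + len(packet) - 4] = packet[4:]
            new_count += 1
            if new_count >= ack_threshold:
                ack_sock.sendto(bytes(received), sender_addr)
                new_count = 0

    # A final acknowledgement tells the sender that all data has arrived.
    ack_sock.sendto(bytes(received), sender_addr)
    return buffer, total_duplicates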
4.3.1 Parameters for the Data Receiver. The
most important parameter with respect to the
data receiver is the number of new packets
received before generating and sending an
acknowledgement packet. The frequency with
which the data receiver sends acknowledgement
packets essentially determines the level of
synchronization between the two processes. A
small value (and thus a high level of
synchronization) implies that the data receiver
must frequently stop pulling packets off of the
network to create and send acknowledgement
packets. Given that the algorithm is UDP-based,
those packets missed while creating and sending
an acknowledgement will, in all likelihood, be
lost. A very high value, and thus a very low level
of synchronization, results in both the data
sender and data receiver spending virtually all of
their time placing packets on, and reading
packets off, of the network. As will be seen, this
is actually a very good approach when the pipe is
completely clear.
The acknowledgement frequency, as
described above, takes into consideration only
the number of new data packets received. We
found that it is also necessary to place a bound
on the number of duplicate packets received
before sending an acknowledgement packet. The
data sender only sends duplicate packets when it
cannot correctly determine (or anticipate) the
packets that have not yet been received by the
data receiver. Thus when the data sender is
clearly “off the mark” in terms of the packets it
is selecting to send (or re-send), it is helpful for
the data receiver to send an acknowledgement
packet to provide an updated view of the state of
the data transfer. Our experimentation suggested
that sending an acknowledgement packet
whenever the duplicate count reached 50
provided the best performance over both network
connections.
5 Experimental Results
We compared the performance of the user-
level UDP protocol against that of an optimized
TCP implementation (when both end-points
were executing the Winsock2 API), and against a
TCP implementation that we were unable to
optimize in any meaningful way (i.e., the IRIX
6.5 TCP implementation). As noted, we were
unable to significantly modify the window size
on the SGI Origin200 since we did not have
system-level access on the machine. First,
consider the performance of the user-level UDP
protocol shown in Figure 1. This figure depicts
the performance of the approach as a function of
the number of packets received before triggering
an acknowledgement. As can be seen, this simple
protocol, involving a single UDP stream (for
data sending), provides excellent performance
across both platforms and across both the short
haul and the long haul connections. In
particular, the protocol achieved a throughput of
over 90% of the maximum available bandwidth
on the connection between ANL and LCSE.
Also, it obtained on the order of 85% of the
available bandwidth on the connection between
ANL and CACR. It is important to remember that these
results were obtained using a single (optimized)
UDP stream.

Figure 1. This figure depicts the percentage of the maximum available
bandwidth obtained by the user-level UDP protocol as a function of the
number of packets received before sending an acknowledgement packet
(50 to 2500), for both the short haul and the long haul networks.
It is interesting to note the impact on
performance due to the acknowledgement
frequency. When this frequency was very high
(e.g. 1/50), there was a significant detrimental
impact on performance. This is due to the fact
that when the data sender and data receiver were
tightly synchronized, both processes were
spending a non-trivial amount of time preparing,
sending, receiving, and processing
acknowledgement packets. Even though the cost
of preparing/processing acknowledgements was
not necessarily large, this extra time devoted to
processing such packets (and thus not
sending/receiving packets) turned out to have a
significant impact (at least on the high-
performance networks we tested). As the
acknowledgement frequency decreased (down to
a minimum value of one out of every 3000
packets), the performance began to improve. The
throughput on the short haul network peaked out
at a little over 90% of the available bandwidth.
In the case of the long haul network, the
throughput peaked out at approximately 85% of
the available bandwidth.
It is also interesting to look at the number of
duplicate packets received by the data receiver
across the complete data transfer (as noted
however, this value does not reflect packets in
the pipe when the data receiver stopped looking
for more packets). This is shown in Figure 2. In
the case of the short haul network, it was rare to
observe more than one or two duplicate packets
for all acknowledgement frequencies studied.
Thus the pipe was very clear, and there were
virtually no packets being lost in the network.
This was somewhat surprising given that the
trials were conducted during normal business
hours, although the summer students at Argonne
National Laboratory had departed (vastly
reducing the load on the ANL networks). As can
be seen however, the number of duplicate
packets did significantly increase when the data
was transferred over the long haul network. This
number was not large when the data sender and
data receiver were tightly synchronized, but as
the level of synchronization began to decrease,
the number of duplicate packets began to
increase (and rather dramatically after the
acknowledgement frequency was less than
1/900). The number of duplicate packets reached
a value of around 550 (when the
acknowledgement frequency was reduced to
1/2500). It is interesting to note that even though
the number of duplicate packets significantly
increased with a decreased acknowledgement
frequency, this did not have a significant
negative impact on performance. This can be
best understood by considering that even with
550 duplicate packets, this represented only 0.5
MB of additional data on the network. This, in
turn, represented less than 3% of the total data
transfer.
5.1 TCP Performance
Figure 3 depicts the performance of TCP across
the short and long haul networks. As can be
seen, the results obtained using the Windows
2000 TCP implementation (across the short haul
network) were quite impressive, providing
approximately the same performance as that of
the user-level protocol. These results certainly
help emphasize the fact that large TCP windows
are imperative over high-bandwidth, high-delay
networks. The other factor allowing TCP to
obtain such good performance was most likely
due to the absence of contention in the network.
The fact that virtually no packets were duplicated
in the UDP protocol strongly suggests
that TCP experienced very little packet loss
across this same network. Thus the TCP
congestion control mechanisms were not
triggered, allowing the TCP window to be
advanced without (any significant) delay. Clearly
the research in optimizing TCP for this type of
network has produced dramatic improvements in
performance.
As can be seen however, the performance of
TCP drops dramatically over the long haul
network. There were two reasons for this: First,
the TCP window size was significantly (by
several orders of magnitude) smaller than the
bandwidth delay product. As noted, we were not
able to modify this window size since we did not
have system-level access. Secondly, judging by
the packet loss incurred by the user-level
protocol, it is likely that TCP also experienced
packet loss during the data transfer. Thus the
TCP congestion control algorithms were most
likely triggered, resulting in a significant drop in
performance. When we secure a high-
performance Windows 2000 machine at the other
end-point of this connection, we should be able
to determine how much of the degradation was
due to the window size and how much was due to
packet loss and the subsequent triggering of the
congestion control mechanisms.
We were also interested in whether breaking
the data into smaller data units (“chunks”), and
invoking the TCP send algorithm multiple times,
would have an impact on performance. As can be
seen, this approach did result in some
performance improvement over the short haul
network, but did not appear to have any impact
on performance across the long haul network.
6 Conclusions and Future
Research
In this paper, we have reported on the design and
implementation of a user-level UDP protocol
designed for high-delay, high-bandwidth
networks. The most important features of this
algorithm include a (logically) infinite window
size, a (logically) infinite selective
acknowledgement window, and a user-defined
acknowledgement frequency. This algorithm was
shown to achieve a throughput of over 90% of
the maximum bandwidth when executed across a
short haul, high-performance network. Over a
long haul network, it was still able to achieve
throughput on the order of 85% of this
maximum.
This research also clearly demonstrates the
importance of the window size in the
performance of TCP. When the “Large Window”
extensions to TCP were enabled, a single TCP
stream was able to achieve on the order of 90%
of the available bandwidth. This performance
was dramatically reduced however when the
TCP window was significantly smaller than the
bandwidth delay product, and when packet loss
was introduced into the transfer.
Currently, we are investigating the significant
performance decline suffered by TCP across the
long haul network. We are interested in sorting
out the impact on performance due to the TCP
window size and due to the triggering of the TCP
congestion control algorithms. We are also
interested in providing a better definition and a
more robust measurement of any waste in
network resources due to the greedy nature of the
user-level algorithm. Finally, we wish to study
the impact on the performance of other
applications (executing on the same processor)
based on the communication protocol being
employed (i.e. the user-level protocol defined
herein, versus the use of multiple TCP streams).
Figure 2. This figure shows the number of duplicate packets received by
the data receiver as a function of the acknowledgement frequency (50 to
2500 packets received before sending an acknowledgement), for the short
haul and long haul networks.
References
[1] Allcock, B., Bester, J., Bresnahan, J., Chervenak, A.,
Foster, I., Kesselman, C., Meder, S.,
Nefedova, V., Quesnel, D., and S. Tuecke.
Secure, Efficient Data Transport and
Replica Management for High-Performance
Data-Intensive Computing. Preprint
ANL/MCS-P871-0201, Feb. 2001.
[2] Allman, M., Paxson, V., and W.Stevens.
TCP Congestion Control. RFC 2581, April
1999.
[3] Feng, W. and P. Tinnakornsrisuphap. The
Failure of TCP in High-Performance
Computational Grids. In the Proceedings of
the Super Computing 2000 (SC2000).
[4] Hobby, R. Internet2 End-to-End Performance
Initiative (or Fat Pipes Are Not Enough).
URL: http://www.internet2.org.
[5] Irwin, B. and M. Mathis. Web100:
Facilitating High-Performance Network Use.
White Paper for the Internet2 End-to-End
Performance Initiative.
URL: http://www.internet2.edu/e2epi/web02/p_web100.shtml
[6] Jacobson, V., Braden, R., and D. Borman.
TCP Extensions for high performance. RFC
1323, May 1992.
[7] MacDonald, D. and W. Barkley. Microsoft
Windows 2000 TCP/IP Implementation
Details. White Paper, May 2000.
[8] Mathis, M., Mahdavi, J., Floyd, S., and A.
Romanow. TCP Selective Acknowledgement
Options. RFC 2018, October 1996.
[9] Mogul, J. and S. Deering, "Path MTU
Discovery", RFC 1191,
November 1990.
[10] Ostermann, S., Allman, M., and H. Kruse. An
Application-Level solution to TCP’s Satellite
Inefficiencies. Workshop on Satellite-based
Information Services (WOSBIS), November,
1996.
[11] J. Postel, Transmission Control Protocol,
RFC793, September 1981.
[12] Semke, J., Mahdavi, J., and M. Mathis.
Automatic TCP Buffer Tuning. Computer
Communications Review, a publication of ACM
SIGCOMM, volume 28, number 4, October 1998.
[13] Sivakumar, H., Bailey, S., and R. Grossman.
PSockets: The Case for Application-level
Network Striping for Data Intensive
Applications using High Speed Wide Area
Networks. In Proceedings of Super
Computing 2000 (SC2000).
[14] URL: http://www.psc.edu/networking/perf_tune.html#intro.
Enabling High Performance Data Transfers on
Hosts (Notes for Users and System
Administrators).
[15] URL: http://dast.nlanr.net/Articles/GettingStarted/TCP_window_size.html
[16] URL: http://dast.nlanr.net/Projects/Autobuf_v1.0/autotcp.html.
Automatic TCP Window Tuning and Applications.
[17] URL: http://www.globus.org
[18] URL: http://www.psc.edu/networking/all_sack.html.
List of SACK implementations.
[19] URL: http://www.internet2.org
[20] URL: http://www.internet2.edu/abilene
[21] URL: http://www-unix.mcs.anl.gov/mpi/mpich/
Figure 3. This figure shows the percentage of maximum bandwidth attained
by TCP across short and long haul high-performance network connections,
as a function of the “chunk” size (5 KB to 4 MB). The “chunk” size
reflects a decomposition of the total buffer size into smaller data units.