High Performance Wide Area Data Transfers Over High Performance
Networks
Phillip Dickens
Department of Computer Science
Illinois Institute of Technology

William Gropp
Mathematics and Computer Science Division
Argonne National Laboratory

Paul Woodward
Laboratory for Computational Science and Engineering
University of Minnesota
Abstract
This paper introduces a new user-level
communication protocol designed to provide
high-performance data transfers across high-
bandwidth, high-delay networks. The protocol
incorporates the most important enhancements
defined by the networking community to improve
the performance of TCP for this environment,
and also defines enhancements unique to this
protocol. In terms of the so-called “Large
Window” extensions to TCP, this protocol
implements a communication window that is
essentially infinite, and provides a selective
acknowledgement window that spans the entire
data transfer. In terms of user-level extensions to
TCP, it implements a user-defined
acknowledgement frequency, a user-defined
“batch sending” window, and a simple
framework within which the user can define the
algorithm that determines the next data packet to
be sent out across all eligible packets. We
present experimental results demonstrating data
throughput on the order of 85% - 92% of the
maximum available bandwidth across both short
haul and long haul high-performance networks.
1 Introduction
The Internet2 initiative [19] promises the
development of high-performance networking
applications in diverse areas such as distributed
collaboration, visualization of scientific data,
high performance grid-based computations, and
Internet telephony. Abilene [20], the high-
performance backbone network associated with
the Internet2 project, provides an OC-48
connection between the regional aggregation
points (Gigapops) that it connects. Thus there is
significant bandwidth available for such
advanced networking applications, and the issue
becomes one of actually realizing the available
bandwidth.
It has been well documented that, in practice,
user-level distributed applications connected by
Abilene achieve only a small percentage of the
available bandwidth [1,3,4,5,6,7,10]. The
primary reason for this poor performance is that
the Transmission Control Protocol (TCP) [11],
the communication mechanism of choice for
most applications, is not well suited for high-
performance, wide area data transfers
[3,4,5,6,13,14,15]. Thus one critical area of
current research is the development of
mechanisms to improve the performance of TCP
in a high-bandwidth, high-delay environment,
and another is to study alternative
communication protocols that can circumvent the
problems associated with TCP.
In this paper, we present the results of our
efforts to achieve high-performance wide area
data transfers between selected sites connected
by the Abilene backbone network. In particular,
we developed and tested a very simple user-level
communication protocol, utilizing a single UDP
stream with a simple acknowledgement and
retransmission mechanism, designed specifically
for this type of environment. We tested our user-
level protocol against (an optimized) TCP stream
on data transfers between Argonne National
Laboratory and the Laboratory for Computational
Science and Engineering (LCSE) at the
University of Minnesota, and between Argonne
National Laboratory and the Center for
Advanced Computing Research (CACR) at the
California Institute of Technology.
Our results are encouraging. We obtained
over 90% of the maximum available bandwidth
on data transfers between ANL and LCSE using
both approaches: that is, using a single
(optimized) TCP stream and our simple user-
level protocol that utilizes a single UDP stream.
On data transfers between ANL and CACR
however, only the user-level protocol was able to
obtain such a high percentage of this maximum
bandwidth. In particular, the user-level protocol
obtained up to 85% of the maximum available
bandwidth while the optimized TCP stream
obtained only on the order of 10% of this
maximum.
There are three primary contributions of this
paper. First, it outlines a simple user-level
protocol that was shown to provide excellent
performance across both short and long haul
high-performance networks. As noted, these
results were obtained with a single UDP data
stream in conjunction with a simple
acknowledgement and retransmission scheme.
This is in contrast to the alternative approaches
that utilize multiple TCP streams (up to eight
streams per host) to improve performance in this
setting [1,10,13]. A single data stream has the
advantage that it does not require the kernel to
multiplex multiple TCP streams, where such
multiplexing may negatively impact the
performance of applications executing on that
particular host.
Another contribution of this paper is that it
provides a detailed study of the impact on
performance due to the various parameters that
can be controlled at the user level. This provides
insight into the relationship between the
acknowledgement frequency and the amount of
network resources “wasted” due to the greedy
nature of the algorithm, and also helps to explain
the performance of TCP in this environment.
Thirdly, the end-points of the experiments
conducted between ANL and LCSE were two
Pentium3-based Windows 2000 boxes running
the “off-the-shelf” Winsock2 API. Windows
2000 supports the “Large Window” extensions to
TCP [6], and it is important to look at the
performance of the Winsock2 API, with all of its
support for high-performance data transfers,
given that the vast majority of published
bandwidth studies employ either Unix or Linux
TCP implementations.
The rest of this paper is organized as follows.
In Section 2 we discuss other approaches to
providing high-performance data transfers in a
high-delay, high-bandwidth environment. In
Section 3, the experimental design is presented.
In Section 4, the user-level communication
protocol is presented. In Section 5 we present the
results of our experimental studies, and we
provide conclusions and future research in
Section 6.
2 Related Work
There is a significant amount of research relating
to obtaining (a high percentage of) the available
bandwidth in high-bandwidth, high-delay
networks. This research is proceeding along two
fronts: One approach is fundamental research
into mechanisms to improve the performance of
TCP itself. The other approach is to develop
techniques at the application level that
circumvent the performance problems associated
with TCP. We discuss each approach in turn.
As discussed in [13,14,15], the size of the
TCP window is the single most important factor
in achieving good performance over high-
bandwidth, high-delay networks. To keep such
“fat” pipes full, the TCP window size should be
as large as the product of the bandwidth and the
round-trip delay. This has led to research in
automatically tuning the size of the TCP socket
buffers at runtime [16]. Also, it has led to the
development of commercial TCP
implementations that allow the system
administrator to significantly increase the size of
the TCP window to achieve better performance
[14]. Another area of active research is the use of
a Selective Acknowledgement mechanism
[8,14,18] rather than the standard cumulative
acknowledgement scheme. In this approach, the
receiving TCP sends to the sending TCP a
Selective Acknowledgement (SACK) packet that
specifies exactly those packets that have been
received, allowing the sender to retransmit only
those segments that are missing. Additionally,
“fast retransmit” and “fast recovery” algorithms
have been developed that allow a TCP sender to
retransmit a packet before the retransmission
timer expires, and to increase the size of its
congestion control window, when three duplicate
acknowledgement packets are received (without
intervening acknowledgements) [8,18]. An excellent source
of information, detailing which commercial and
experimental versions of TCP support which
particular TCP extensions, may be found in [14].
At the user level, the allocation of multiple
TCP streams has been investigated. PSockets
[13] employs multiple TCP sockets to increase
the size of the TCP window. As discussed by the
authors, the limitations on TCP window sizes
are on a per socket basis, and thus striping the
data across multiple sockets provides an
aggregate TCP buffer size that is closer to the
(ideal size) of the bandwidth times round-trip
delay. A similar approach has been investigated
within the domain of satellite-based information
systems [10]. The Globus project [17] developed
a GridFTP [1] tool that employs multiple TCP
streams per host, with (perhaps) multiple hosts.
This again significantly increases the size of the
TCP window. It is also worth noting that using
multiple TCP sockets increases the probability
that, at any given time, there will be at least one
TCP stream that is ready to fire.
3 Experimental Design
We investigated (reasonably) large-scale data
transfers on two high-performance network
connections: one between Argonne National
Laboratory (ANL) and the Laboratory for
Computational Science and Engineering (LCSE)
at the University of Minnesota, and one between
ANL and the Center for Advanced Computing
Research (CACR) at the California Institute of
Technology. ANL is connected to both of these
sites across Abilene. The endpoints at both ANL
and LCSE were Intel Pentium3-based PCs
running Windows 2000 and using the Winsock2
API. We did not have access to a Windows 2000
machine at CACR at the time of this writing, and
used instead an SGI Origin200 (with two
225 MHz MIPS R10000 processors) running
IRIX 6.5. Similarly, we did not have access to an
IRIX 6.5 machine at ANL, and thus were forced
to run the experiments between a Windows 2000
machine at one end and an IRIX 6.5 machine at
the other.
As noted on the Pittsburgh Supercomputing
Center website [14], IRIX 6.5 (like Windows 2000)
supports the RFC 1323 [6] “Large Window”
extensions to TCP. Both machines also support
path MTU discovery [9], where the segment size
is determined by the maximum packet size that
can be transmitted across the complete path
without fragmentation, rather than simply using
a pre-determined segment size that may be
smaller (and thus less efficient) than this value.
Also, both machines support TCP Selective
Acknowledgements [8,18]. However, the default
TCP window size on the SGI Origin200 is
(approximately) 64KB, and we did not have
system-level access to the machine that would
have allowed us to increase this window size.
The TCP window under Windows 2000 is (or
can very easily be) extended to one Gigabyte
[7,14]. The slowest link on either pair of
connections was 100 Mb/sec, which was
incurred between the desktop PC at ANL used in
these experiments, and the Mathematics and
Computer Science Division’s external router.
The round-trip delay between ANL and
LCSE was measured (using traceroute) to be on
the order of 26 milliseconds, and we (loosely)
categorized this as a short haul network. The
round-trip delay between ANL and CACR was
on the order of 65 milliseconds, which we (again
loosely) categorized as a long haul network. The
transmitted data size for the experiments
between ANL and LCSE was 40 MB, and was
20 MB between ANL and CACR. Similar to the
results found in [13], the amount of data
transferred did not have a significant impact on
the throughput rate, so we opted for a smaller
data size on the long haul connection to decrease
the cost of experimentation. It is interesting to
note that the bandwidth delay product for the
ANL to LCSE connection was 1.04 Gigabytes,
which was only fractionally larger than the TCP
window size used by Windows 2000. The
bandwidth delay product for the connection
between ANL and CACR was 1.3 Gigabytes,
which is orders of magnitude larger than the
default (approximately) 64 KB buffer allocated
by IRIX6.5. As will be seen, this significant
difference in the size of the TCP window had a
tremendous impact on the performance of the
TCP algorithm.
In our experiments, we ran thirty trials
of sending either 20 MB of data (across the long
haul network) or 40 MB of data (across the short
haul network). The metric of interest was the
percentage of the maximum available bandwidth
(which is 100 Mb/sec) that was obtained for each
approach. A byte re-ordering cost was
necessitated by the architectural differences
between the SGI Origin 200 and the Windows
2000 machine, and this cost was included in the
final bandwidth values reported.
4 Communication Protocols
We tested two communication protocols: an
optimized (single) TCP stream and a user-level
UDP-based protocol. We found that tuning the
performance of the Windows-based TCP
implementation was very simple. The primary
optimization was to simply request a socket
buffer size greater than 64 KB, which
automatically enabled the “Large Window” TCP
extensions [7,14]. Additionally, we disabled the
Nagle algorithm to avoid delays in placing
packets on the network, and experimented with
decomposing the data into smaller “chunk sizes”
(thus requiring multiple calls to the TCP send
routine to complete the data transfer). We were
very limited in optimizing TCP as implemented
on the SGI Origin200 since we did not have
system-level access to the TCP stack.
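As a concrete illustration of these tuning steps, the sketch below requests a socket buffer larger than 64 KB and disables the Nagle algorithm. It is written against Python's socket module rather than the Winsock2 C API used in our experiments, and the endpoint address, port, and buffer size are illustrative values only.

import socket

# Sketch of the TCP tuning described above (illustrative values only).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Requesting a send buffer larger than 64 KB allows the stack to use the
# RFC 1323 "Large Window" extensions on systems that support them.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)

# Disable the Nagle algorithm so small writes are not delayed.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Hypothetical receiver endpoint.
sock.connect(("receiver.example.org", 5000))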
4.1 User-Level UDP Protocol
The user-level protocol we developed
incorporates, at the user level, many of the
important extensions defined for TCP in a high-
bandwidth, high-delay network environment.
One important assumption made by the user-
level algorithm however is that both the sender
and the receiver have pre-allocated buffers large
enough to accommodate the complete data
transfer. This seems to be a very reasonable
assumption, and certainly applies to the
applications in which we are interested. In
particular, it applies to wide-area MPI [21]
(where for every send there is a matching
receive), the File Transfer Protocol (FTP, where
the disk is used as a data buffer), and data
visualization applications (where the generated
data is consumed by the data receiver). Given
this very important characteristic of the
applications of interest, our user-level protocol
pushes to the limit the idea of “Large Window”
extensions developed for TCP: that is, the
window size is essentially infinite since it spans
the entire data buffer (albeit at the user level). It
also pushes to the limit the idea of selective
acknowledgements. Given a pre-allocated
receive buffer and constant packet sizes, each
data packet in the entire buffer can be numbered.
The data receiver can then maintain a very
simple data structure with one byte (or even one
bit) allocated per data packet, where this data
structure tracks the received/not received status
of every packet to be received. This information
can then be sent to the data sending process at a
user-defined acknowledgement frequency. Thus
the selective acknowledgement window is also in
a sense infinite. That is, the data sender is
provided with enough information to determine
exactly those packets, across the entire data
transfer, that have not yet been received (or at
least not received at the time the
acknowledgement packet was created and sent).
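A minimal sketch of this bookkeeping follows, assuming one byte of state per packet and a fixed packet numbering; the class and function names are our own and are not taken from the implementation described here.

class AckState:
    """Receiver-side record of which packets have arrived."""

    def __init__(self, total_packets):
        # One byte per packet: 0 = not yet received, 1 = received.
        self.received = bytearray(total_packets)

    def mark_received(self, packet_number):
        duplicate = self.received[packet_number] == 1
        self.received[packet_number] = 1
        return duplicate

    def build_ack(self):
        # The acknowledgement carries the complete history of the
        # transfer, so it acts as a selective acknowledgement that
        # spans the entire data buffer.
        return bytes(self.received)


def missing_packets(ack_payload):
    # Sender side: recover the packets still outstanding from an
    # acknowledgement packet.
    return [i for i, flag in enumerate(ack_payload) if flag == 0]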
There are also features unique to our protocol
that have been determined (experimentally) to
have a very significant impact on performance.
One such feature is the acknowledgement
frequency, which, as discussed above,
determines the frequency at which an
acknowledgement packet (containing the
complete history of the data transfer) is sent to
the data sender. Another issue is the algorithm
that determines the next data packet, across all
eligible packets (to be defined below), to be sent
next. (To understand the importance of this
issue consider that when the data sender receives
an acknowledgment packet, it must determine
whether to perhaps re-send a data packet that
was lost, or to send a “new” data packet that has
not yet been sent for the first time.) The third
user-level parameter has to do with the number
of packets the data sender should transmit before
checking for (and processing) an
acknowledgement packet.
4.2 Algorithm Executed by Data Sender
The total data buffer was divided into pre-
determined, equal, and fixed-sized packets of
1024 bytes. This packet size was determined by
executing the path MTU discovery algorithm
along the paths from Argonne National
Laboratory to both LCSE and CACR. Thus in
the case of the 20 MB data buffer there were
19,532 packets, and there were 39,063 packets
with the 40 MB data buffer. One UDP socket was
used to transmit data from the sender to the
receiver, and another UDP socket was used to
send acknowledgement packets from the receiver
to the sender.
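The sketch below illustrates this packet layout and the two UDP sockets. The 1024-byte packet size and the resulting packet counts follow the values given above; the bind address and port number are hypothetical.

import math
import socket

PACKET_SIZE = 1024                      # fixed size from path MTU discovery

def packet_count(buffer_length):
    # With 1 MB = 10^6 bytes: 20 MB -> 19,532 packets; 40 MB -> 39,063.
    return math.ceil(buffer_length / PACKET_SIZE)

# One UDP socket carries data packets from sender to receiver; a second
# UDP socket carries acknowledgement packets from receiver to sender.
data_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ack_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ack_sock.bind(("0.0.0.0", 6001))        # sender listens for acks here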
The data-sending algorithm iterates over three
basic phases. In the first phase, the data sender
employs some algorithm to determine the
number of data packets to be placed onto the
network before looking for, and processing if
available, an acknowledgement packet. This is
referred to as a “batch-sending” operation since
all such packets are placed onto the network
without interruption (although the select system
call is used to ensure adequate buffer space for
the packet). It is very important to note that after
a batch-send operation the data sender looks for,
but does not block for, an acknowledgement
packet.
In the second phase of the algorithm, the data
sender looks for, and if available, processes an
acknowledgement packet. Processing of an
acknowledgement packet entails updating the
receive/not received status for each data packet
acknowledged, and determining the number of
packets that were received by the data receiver
between the time it created the previous
acknowledgement packet and the time it created
the current acknowledgement packet. This
information can then be used to determine the
number of packets to send in the next batch-send
operation. If no acknowledgement packet is
available, the information from the most recently
processed acknowledgement is used to determine
the number of packets to send in the next
batch-send operation. Note that a repeated
batch-sending operation with zero packets is
logically equivalent to blocking on an
acknowledgement.
In the third phase of the algorithm, the data
sender executes some user-defined algorithm to
choose the next packet, out of all
unacknowledged packets, to be placed onto the
network.
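The three phases can be summarized in a loop such as the sketch below. For simplicity the sketch uses a fixed batch size (two packets per batch-send operation, the value found to work best in Section 4.2.1) rather than adapting it from acknowledgement information, and it accepts the phase-three packet-selection algorithm as a parameter (a circular-buffer version is sketched in Section 4.2.1 below). The acknowledgement format follows the per-packet byte array described in Section 4.1; the packet framing and names are our own illustrative choices, not the actual implementation.

import select

PACKET_SIZE = 1024      # fixed packet size, as in Section 4.2

def send_loop(data, data_sock, ack_sock, dest, choose_next_packet,
              batch_size=2):
    """Sketch of the three-phase sending algorithm described above."""
    total = (len(data) + PACKET_SIZE - 1) // PACKET_SIZE
    acked = bytearray(total)          # sender's view: 1 = acknowledged
    cursor = 0                        # current position in the data buffer

    while not all(acked):
        # Phase 1: batch-send a small number of packets without blocking.
        for _ in range(batch_size):
            pkt = choose_next_packet(acked, cursor, total)
            cursor = (pkt + 1) % total
            payload = data[pkt * PACKET_SIZE:(pkt + 1) * PACKET_SIZE]
            data_sock.sendto(pkt.to_bytes(4, "big") + payload, dest)

        # Phase 2: look for, but do not block on, an acknowledgement
        # (a zero timeout makes select return immediately).
        readable, _, _ = select.select([ack_sock], [], [], 0)
        if readable:
            ack_payload, _ = ack_sock.recvfrom(65536)
            for i, flag in enumerate(ack_payload[:total]):
                if flag:
                    acked[i] = 1
        # Phase 3, choosing the next packet to send, is carried out by
        # the user-supplied choose_next_packet function in phase 1.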
4.2.1 Parameters for the Send Algorithm.
The first parameter studied was the number of
packets to be sent in the next batch-send
operation. Intuitively, one would expect that the
data sender should check for an
acknowledgement packet on a very frequent
basis, thus limiting the number of packets to be
placed onto the network in a given batch-send
operation. Our experimental results supported
this intuition, finding that two packets per batch-
send operation provided the best performance.
We therefore used this number in all subsequent
experimentation.
We also performed extensive experimentation
to determine which particular packet, out of all
unacknowledged packets, should next be placed
onto the network. We tried several algorithms,
and, in the end, it became quite clear that the best
approach (by far) was to treat the data as a
circular buffer. That is, the algorithm never went
back to re-transmit a packet that was not yet
acknowledged, if there were any packets that had
not yet been sent for the first time. Similarly, a
given packet was re-transmitted for the n+1st
time only if all other unacknowledged packets
had been re-transmitted n times.
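Under the framing assumptions of the earlier sketches, this circular-buffer policy might look as follows; the function name is ours, and it is compatible with the send_loop sketch above, which takes the selection policy as an argument.

def choose_next_packet(acked, cursor, total):
    # Circular-buffer selection: scan forward from the current position
    # and return the first packet that has not yet been acknowledged.
    # Because the scan wraps around, a packet is re-transmitted for the
    # (n+1)st time only after every other unacknowledged packet has been
    # sent n times, and no packet is re-sent while some packet has not
    # yet been sent at all.
    for offset in range(total):
        candidate = (cursor + offset) % total
        if not acked[candidate]:
            return candidate
    return cursor      # nothing outstanding; the caller stops sending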
As can be seen, the algorithm executed by the
sender is very greedy, continuing to transmit (or
re-transmit) packets (without blocking) until it
receives an acknowledgement packet from the
data receiver specifying that all data has been
successfully received. Thus a reasonable
question to ask is how wasteful of network
resources is this approach. One measure of
wasted resources applicable to this approach is
the number of duplicate packets received by the
data receiver over the course of the entire data
transfer. We did track this information, where the
data receiver maintained a simple counter that
was incremented every time it received a packet
that had already been marked as having been
received. In hindsight, it would also have
been useful to track the number of messages still
in the pipe when the receiver determined it had
obtained all of the data (and thus had stopped
trying to read packets off of the network). In
future research, we will track this information,
and also attempt to track the number of packets
lost in the network due to contention.
4.3 Algorithm Executed by Data Receiver
The data receiver iterates over a loop with
the select system call at the top of the loop. The
select system call takes as parameters a set of
socket descriptors (in the case of Windows
2000), or a set of file descriptors (in the case of
Unix). The select call also takes as a parameter a
timer. The select call returns when one of the
sockets is available for reading or writing, or
when the timer expires. We set the timeout
value for the data receiver at 1.5 seconds, which
was determined experimentally. The basic
algorithm is as follows.
1) Use the select system call to
determine if a data packet is
available.
2) If the select call times out send an
acknowledgement packet (that
contains a complete history of the
data transfer).
3) If a packet is available, read it off
of the network and determine if it is
a duplicate (i.e. has already been
received and acknowledged). If it is
a duplicate packet, discard it and
increment a counter tracking the
number of duplicate packets
received. If it is not a duplicate
packet, place the packet into its
proper position within the data
buffer using the packet number as
an offset.
4) If the data packet is not a duplicate,
increment a counter tracking the
total number of packets received by
the data receiver. If this value
exceeds some user-defined
threshold, send an
acknowledgement packet and reset
the counter to zero.
5) If the packet is a duplicate, and the
number of duplicate packets
received exceeds some threshold
value, then send an
acknowledgement packet and reset
the duplicate counter to zero.
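The numbered steps above are summarized in the sketch below. The 1.5-second select timeout and the duplicate threshold of 50 follow the values reported in the text, and the acknowledgement threshold is the user-defined parameter varied in the experiments; the socket setup, packet framing, and helper names are illustrative assumptions of ours rather than the actual implementation.

import select

PACKET_SIZE = 1024
SELECT_TIMEOUT = 1.5                 # seconds, determined experimentally

def receive_loop(total_packets, data_sock, ack_sock, sender_addr,
                 ack_threshold=900, duplicate_threshold=50):
    """Sketch of the receiver algorithm in steps 1 through 5 above."""
    buffer = bytearray(total_packets * PACKET_SIZE)
    received = bytearray(total_packets)     # received/not-received flags
    completed = 0
    new_count = 0                           # new packets since the last ack
    dup_count = 0                           # duplicates since the last ack
    total_duplicates = 0                    # metric reported in the results

    while completed < total_packets:
        # Step 1: wait (with a timeout) for a data packet.
        readable, _, _ = select.select([data_sock], [], [], SELECT_TIMEOUT)

        if not readable:
            # Step 2: timeout, so send an acknowledgement containing the
            # complete history of the data transfer.
            ack_sock.sendto(bytes(received), sender_addr)
            continue

        packet, _ = data_sock.recvfrom(PACKET_SIZE + 4)
        number = int.from_bytes(packet[:4], "big")

        if received[number]:
            # Steps 3 and 5: duplicate packet; discard it, count it, and
            # acknowledge if the sender appears badly out of date.
            dup_count += 1
            total_duplicates += 1
            if dup_count >= duplicate_threshold:
                ack_sock.sendto(bytes(received), sender_addr)
                dup_count = 0
        else:
            # Steps 3 and 4: new packet; place it by packet number and
            # acknowledge once the user-defined threshold is reached.
            received[number] = 1
            completed += 1
            start = number * PACKET_SIZE
            buffer[start:start + len(packet) - 4] = packet[4:]
            new_count += 1
            if new_count >= ack_threshold:
                ack_sock.sendto(bytes(received), sender_addr)
                new_count = 0

    # A final acknowledgement tells the sender that all data has arrived.
    ack_sock.sendto(bytes(received), sender_addr)
    return buffer, total_duplicates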
4.3.1 Parameters for the Data Receiver. The
most important parameter with respect to the
data receiver is the number of new packets
received before generating and sending an
acknowledgement packet. The frequency with
which the data receiver sends acknowledgement
packets essentially determines the level of
synchronization between the two processes. A
small value (and thus a high level of
synchronization) implies that the data receiver
must frequently stop pulling packets off of the
network to create and send acknowledgement
packets. Given that the algorithm is UDP-based,
those packets missed while creating and sending
an acknowledgement will, in all likelihood, be
lost. A very high value, and thus a very low level
of synchronization, results in both the data
sender and data receiver spending virtually all of
their time placing packets on, and reading
packets off, of the network. As will be seen, this
is actually a very good approach when the pipe is
completely clear.
The acknowledgement frequency, as
described above, takes into consideration only
the number of new data packets received. We
found that it is also necessary to place a bound
on the number of duplicate packets received
before sending an acknowledgement packet. The
data sender only sends duplicate packets when it
cannot correctly determine (or anticipate) the
packets that have not yet been received by the
data receiver. Thus when the data sender is
clearly “off the mark” in terms of the packets it
is selecting to send (or re-send), it is helpful for
the data receiver to send an acknowledgement
packet to provide an updated view of the state of
the data transfer. Our experimentation suggested
that sending an acknowledgement packet
whenever the duplicate count reached 50
provided the best performance over both network
connections.
5 Experimental Results
We compared the performance of the user-
level UDP protocol against that of an optimized
TCP implementation (when both end-points
were executing the Winsock2 API), and against a
TCP implementation that we were unable to
optimize in any meaningful way (i.e., the IRIX
6.5 TCP implementation). As noted, we were
unable to significantly modify the window size
on the SGI Origin200 since we did not have
system-level access on the machine. First,
consider the performance of the user-level UDP
protocol shown in Figure 1. This figure depicts
the performance of the approach as a function of
the number of packets received before triggering
an acknowledgement. As can be seen, this simple
protocol, involving a single UDP stream (for
data sending), provides excellent performance
across both platforms and across both the short
haul and the long haul connections. In
particular, the protocol achieved a throughput of
over 90% of the maximum available bandwidth
on the connection between ANL and LCSE.
Also, it obtained on the order of 85% of the
available bandwidth on the connection between
ANL and CACR. It is important to remember that these
results were obtained using a single (optimized)
UDP stream.

Figure 1. This figure depicts the percentage of the maximum available
bandwidth obtained by the user-level UDP protocol as a function of the
number of packets received before sending an acknowledgement packet
(50 to 2500), for both the short haul and the long haul networks.
It is interesting to note the impact on
performance due to the acknowledgement
frequency. When this frequency was very high
(e.g. 1/50), there was a significant detrimental
impact on performance. This is due to the fact
that when the data sender and data receiver were
tightly synchronized, both processes were
spending a non-trivial amount of time preparing,
sending, receiving, and processing
acknowledgement packets. Even though the cost
of preparing/processing acknowledgements was
not necessarily large, this extra time devoted to
processing such packets (and thus not
sending/receiving packets) turned out to have a
significant impact (at least on the high-
performance networks we tested). As the
acknowledgement frequency decreased (down to
a minimum value of one out of every 3000
packets), the performance began to improve. The
throughput on the short haul network peaked out
at a little over 90% of the available bandwidth.
In the case of the long haul network, the
throughput peaked out at approximately 85% of
the available bandwidth.
It is also interesting to look at the number of
duplicate packets received by the data receiver
across the complete data transfer (as noted
however, this value does not reflect packets in
the pipe when the data receiver stopped looking
for more packets). This is shown in Figure 2. In
the case of the short haul network, it was rare to
observe more than one or two duplicate packets
for all acknowledgement frequencies studied.
Thus the pipe was very clear, and there were
virtually no packets being lost in the network.
This was somewhat surprising given that the
trials were conducted during normal business
hours, although the summer students at Argonne
National Laboratory had departed (vastly
reducing the load on the ANL networks). As can
be seen however, the number of duplicate
packets did significantly increase when the data
was transferred over the long haul network. This
number was not large when the data sender and
data receiver were tightly synchronized, but as
the level of synchronization began to decrease,
the number of duplicate packets began to
increase (and rather dramatically after the
acknowledgement frequency was less than
1/900). The number of duplicate packets reached
a value of around 550 (when the
acknowledgement frequency was reduced to
1/2500). It is interesting to note that even though
the number of duplicate packets significantly
increased with a decreased acknowledgement
frequency, this did not have a significant
negative impact on performance. This can be
best understood by considering that even with
550 duplicate packets, this represented only 0.5
MB of additional data on the network. This, in
turn, represented less than 3% of the total data
transfer.
5.1 TCP Performance
Figure 3 depicts the performance of TCP across
the short and long haul networks. As can be
seen, the results obtained using the Windows
2000 TCP implementation (across the short haul
network) were quite impressive, providing
approximately the same performance as that of
the user-level protocol. These results certainly
help emphasize the fact that large TCP windows
are imperative over high-bandwidth, high-delay
networks. The other factor allowing TCP to
obtain such good performance was most likely
due to the absence of contention in the network.
The fact that virtually no packets were duplicated
in the UDP protocol strongly suggests
that TCP experienced very little packet loss
across this same network. Thus the TCP
congestion control mechanisms were not
triggered, allowing the TCP window to be
advanced without (any significant) delay. Clearly
the research in optimizing TCP for this type of
network has produced dramatic improvements in
performance.
As can be seen however, the performance of
TCP drops dramatically over the long haul
network. There were two reasons for this: First,
the TCP window size was significantly (by
several orders of magnitude) smaller than the
bandwidth delay product. As noted, we were not
able to modify this window size since we did not
have system-level access. Secondly, judging by
the packet loss incurred by the user-level
protocol, it is likely that TCP also experienced
packet loss during the data transfer. Thus the
TCP congestion control algorithms were most
likely triggered, resulting in a significant drop in
performance. When we secure a high-
performance Windows 2000 machine at the other
end-point of this connection, we should be able
to determine how much of the degradation was
due to the window size and how much was due to
packet loss and the subsequent triggering of the
congestion control mechanisms.
We were also interested in whether breaking
the data into smaller data units (“chunks”), and
invoking the TCP send algorithm multiple times,
would have an impact on performance. As can be
seen, this approach did result in some
performance improvement over the short haul
network, but did not appear to have any impact
on performance across the long haul network.
6 Conclusions and Future
Research
In this paper, we have reported on the design and
implementation of a user-level UDP protocol
designed for high-delay, high-bandwidth
networks. The most important features of this
algorithm include a (logically) infinite window
size, a (logically) infinite selective
acknowledgement window, and a user-defined
acknowledgement frequency. This algorithm was
shown to achieve a throughput of over 90% of
the maximum bandwidth when executed across a
short haul, high-performance network. Over a
long haul network, it was still able to achieve
throughput on the order of 85% of this
maximum.
This research also clearly demonstrates the
importance of the window size in the
performance of TCP. When the “Large Window”
extensions to TCP were enabled, a single TCP
stream was able to achieve on the order of 90%
of the available bandwidth. This performance
was dramatically reduced however when the
TCP window was significantly smaller than the
bandwidth delay product, and when packet loss
was introduced into the transfer.
Currently, we are investigating the significant
performance decline suffered by TCP across the
long haul network. We are interested in sorting
out the impact on performance due to the TCP
window size and due to the triggering of the TCP
congestion control algorithms. We are also
interested in providing a better definition and a
more robust measurement of any waste in
network resources due to the greedy nature of the
user-level algorithm. Finally, we wish to study
the impact on the performance of other
applications (executing on the same processor)
based on the communication protocol being
employed (i.e. the user-level protocol defined
herein, versus the use of multiple TCP streams).
Figure 2. This figure shows the number of duplicate packets received by
the data receiver as a function of the acknowledgement frequency (50 to
2500 packets received before sending an acknowledgement), for the short
haul and long haul networks.
References
[1] Allcock, B., Bester, J., Bresnahan, J., Chervenak, A.,
Foster, I., Kesselman, C., Meder, S.,
Nefedova, V., Quesnel, D., and S. Tuecke.
Secure, Efficient Data Transport and
Replica Management for High-Performance
Data-Intensive Computing. Preprint
ANL/MCS-P871-0201, Feb. 2001.
[2] Allman, M., Paxson, V., and W.Stevens.
TCP Congestion Control. RFC 2581, April
1999.
[3] Feng, W. and P. Tinnakornsrisuphap. The
Failure of TCP in High-Performance
Computational Grids. In the Proceedings of
the Super Computing 2000 (SC2000).
[4] Hobby, R. Internet2 End-to-End Performance
Initiative (or Fat Pipes Are Not Enough).
URL: http://www.internet2.org.
[5] Irwin, B. and M. Mathis. Web100:
Facilitating High-Performance Network Use.
White Paper for the Internet2 End-to-End
Performance Initiative.
URL: http://www.internet2.edu/e2epi/web02/p_web100.shtml
[6] Jacobson, V., Braden, R., and D. Borman.
TCP Extensions for high performance. RFC
1323, May 1992.
[7] MacDonald, D. and W. Barkley. Microsoft
Windows 2000 TCP/IP Implementation
Details. White Paper, May 2000.
[8] Mathis, M., Mahdavi, J., Floyd, S., and A.
Romanow. TCP Selective Acknowledgement
Options. RFC 2018, October 1996.
[9] Mogul, J. and S. Deering, "Path MTU
Discovery", RFC 1191,
November 1990.
[10] Ostermann, S., Allman, M., and H. Kruse. An
Application-Level solution to TCP’s Satellite
Inefficiencies. Workshop on Satellite-based
Information Services (WOSBIS), November,
1996.
[11] J. Postel, Transmission Control Protocol,
RFC793, September 1981.
[12] Semke, J., Mahdavi, J., and M. Mathis.
Automatic TCP Buffer Tuning. Computer
Communications Review, a publication of ACM
SIGCOMM, volume 28, number 4, October 1998.
[13] Sivakumar, H., Bailey, S., and R. Grossman.
PSockets: The Case for Application-level
Network Striping for Data Intensive
Applications using High Speed Wide Area
Networks. In Proceedings of Super
Computing 2000 (SC2000).
[14] URL: http://www.psc.edu/networking/perf_tune.html#intro.
Enabling High Performance Data Transfers on
Hosts (Notes for Users and System
Administrators).
[15] URL: http://dast.nlanr.net/Articles/GettingStarted/TCP_window_size.html
[16] URL: http://dast.nlanr.net/Projects/Autobuf_v1.0/autotcp.html.
Automatic TCP Window Tuning and Applications.
[17] URL: http://www.globus.org
[18] URL: http://www.psc.edu/networking/all_sack.html.
List of SACK implementations.
[19] URL: http://www.internet2.org
[20] URL: http://www.internet2.edu/abilene
[21] URL: http://www-unix.mcs.anl.gov/mpi/mpich/
Figure 3. This figure shows the percentage of maximum bandwidth attained
by TCP across short and long haul high-performance network connections,
as a function of the “chunk” size (5 KB to 4 MB). The “chunk” size
reflects a decomposition of the total buffer size into smaller data units.