Prepared for submission to JINST
Achieving reliable UDP transmission at 10 Gb/s using
BSD sockets for data acquisition systems
M.J. Christensen^a,1 and T. Richter^a

^a European Spallation Source, Data Management and Software Centre,
Ole Maaløes vej 3, 2200 Copenhagen N, Denmark

E-mail: mortenchristensen@esss.se
Abstract: User Datagram Protocol (UDP) is a commonly used protocol for data transmission in
small embedded systems. UDP as such is unreliable and packet losses can occur. The achievable
data rates can suffer if optimal packet sizes are not used. The alternative, Transmission Control
Protocol (TCP), guarantees the ordered delivery of data and automatically adjusts transmission to
match the capability of the transmission link. Nevertheless, UDP is often favored over TCP due
to its simplicity and its small memory and instruction footprints. Both UDP and TCP are implemented
in all larger operating systems and commercial embedded frameworks. In addition, UDP is also
supported on a variety of small hardware platforms such as Digital Signal Processors (DSPs) and
Field Programmable Gate Arrays (FPGAs). This is not as common for TCP. This paper describes how high
speed UDP based data transmission with very low packet error ratios was achieved. The near-reliable
communications link is used in a data acquisition (DAQ) system for the next generation, extremely
intense neutron source, the European Spallation Source. This paper presents measurements of UDP
performance and reliability as achieved by employing several optimizations. The measurements
were performed on Xeon E5 based CentOS (Linux) servers. The measured data rates are very close
to the 10 Gb/s line rate, and zero packet loss was achieved. The performance was obtained utilizing
a single processor core as transmitter and a single core as receiver. The results show that support
for transmitting large data packets is a key parameter for good performance. The optimizations for
throughput are: the MTU, packet sizes, tuned Linux kernel parameters, thread affinity, core locality
and efficient timers.
Keywords: Computing (architecture, farms, GRID for recording, storage, archiving, and distribution
of data), Data acquisition concepts, Software architectures (event data models, frameworks and
databases)
^1 Corresponding author.
arXiv:1706.00333v1 [physics.ins-det] 1 Jun 2017
Contents

1 Introduction
2 TCP and UDP pros and cons
 2.1 Congestion
 2.2 Connections
 2.3 Addressing
3 Performance optimizations
 3.1 Transmission of data
 3.2 Network buffers and packet loss
 3.3 Core locality
 3.4 Timers
4 Testbed for the experiments
 4.1 Experimental limitations
5 Performance
 5.1 Data Speed
 5.2 Packet error ratios
 5.3 CPU load
6 Conclusion
A Source code
B System configuration
1 Introduction
The European Spallation Source (ESS) [1] is a next generation neutron source currently being developed
in Lund, Sweden. The facility will initially support about 16 different instruments for neutron
scattering. In addition to the instrument infrastructure, the ESS Data Management and Software
Centre (DMSC), located in Copenhagen, provides infrastructure and computational support for the
acquisition, event formation and long term storage of the experimental data. At the heart of each
instrument is a neutron detector and its associated readout system. Both detectors and readout
systems are currently in the design phase and various prototypes have already been produced [2–5].
During experiments, data is produced at high rates: detector data is read out by custom
electronics and the readings are converted into UDP packets by the readout system and sent to event
formation servers over 10 Gb/s optical Ethernet links. The event formation servers are based on
general purpose CPUs, and it is anticipated that most, if not all, data reduction at ESS will be done in
software. This includes reception of raw readout data, threshold rejection, clustering and event
formation. UDP is a simple protocol for connectionless data transmission [6] and packet loss can
occur during transmission. Nevertheless UDP is widely used, for example in the RD51 Scalable
Readout System [7], or the CMS trigger readout [8], both using 1 Gb/s Ethernet. The two central
components are the readout system and the event formation system. The readout system is a hybrid
of analog and digital electronics. The electronics convert deposited charges into electric signals
which are digitized and timestamped. In the digital domain simple data reduction such as zero
suppression and threshold based rejection can be performed. The event formation system receives
these timestamped digital readouts and performs the necessary steps to determine the position of
the neutron. These processing steps are different for each detector type. The performance of UDP
over 10G Ethernet has been the subject of previous studies [9, 10], which measured TCP and UDP
performance and CPU usage on Linux using commodity hardware. Both studies apply a certain set
of optimizations but otherwise use standard Linux. In [9] the transmitting process is found to be
a bottleneck in terms of CPU usage, whereas a comparison between Ethernet and InfiniBand [10]
reinforces the earlier results and concludes that Ethernet is a serious contender for use in a readout
system. This study is aimed at characterizing the performance of a prototype data acquisition
system based on UDP. The study is not primarily concerned with transmitter performance, as we
expect to receive data from an FPGA based platform capable of transmitting at wire speed at all
packet sizes. Instead, comparisons between the measured and theoretically possible throughput
and measurements of packet error ratios are presented. Finally, this paper presents strategies for
optimizing the performance of data transmission between the readout system and the event formation
system.
2 TCP and UDP pros and cons
Since TCP is reliable and has good performance, whereas UDP is unreliable, why not always just
use TCP? The pros and cons are discussed in the following. Both TCP and UDP
are designed to provide end-to-end communications between hosts connected over a network of
packet forwarders. Originally these forwarders were routers, but today the group of forwarders
includes firewalls, load balancers, switches, Network Address Translator (NAT) devices, etc. TCP
is connection oriented, whereas UDP is connectionless. This means that TCP requires that a
connection is setup before data can be transmitted. It also implies that TCP data can only be sent
from a single transmitter to a single receiver. In contrast UDP does not have a connection concept
and UDP data can be transmitted as either Internet Protocol (IP) broadcast or IP multicast. As
mentioned earlier the main argument for UDP is that it is often supported on smaller systems where
TCP is not. A notable example is FPGA based systems (see [11] for one such example). However, some
of the TCP features do not actually improve performance and reliability in the case of special
network topologies as explained below.
2.1 Congestion
Any forwarder is potentially subject to congestion and can drop packets when unable to cope with
the traffic load. TCP was designed to react to this congestion. Firstly TCP has a slow start algorithm
whereby the data rate is ramped up gradually in order not to contribute to the network congestion
itself. Secondly TCP will back off and reduce its transmission rate when congestion is detected.
In a readout system such as ours the network only consists of a data sender and a data receiver
with an optional switch connecting them. Thus the only places where congestion occurs are at the
sender or receiver. The readout system will typically produce data at near constant rates during
measurements so congestion at the receiver will result in reduced data rates by the transmitter when
using TCP. This first causes buffering at the transmitting application until the buffer is full, and
eventually packets are lost.
For some detector readout it is not even evident that guaranteed delivery is necessary. In one
detector prototype we discarded around 24% of the data due to threshold suppression, so spending
extra time making an occasional retransmission may not be worth the added complexity.
2.2 Connections
Since TCP requires the establishment of a connection, both the receiving and transmitting
applications must implement additional state to detect the possible loss of a connection, for example
upon reset of the readout system after a software upgrade or a parameter change. With UDP the
receiver will just ’listen’ on a specified UDP port whenever it is ready and receive data when it
arrives. Correspondingly, the transmitter can send data whenever it is ready. UDP reception supports
many-to-one communication, allowing for example two or more readout systems to feed a single
receiver. For TCP to support this would require handling multiple TCP connections.
2.3 Addressing
UDP can be transmitted over IP as multicast. This means that a single sender can reach multiple
receivers without any additional programming effort. This can be used for seamless switchovers,
redundancy, load distribution, monitoring, etc. Implementing this in TCP would add complexity
to the transmitter.
In summary, for our purposes UDP appears to have more relevant features than TCP. Thus it is
preferred, provided we can achieve the desired performance and reliability.
3 Performance optimizations
This section explains the factors that contribute to limiting performance, reproducibility or accuracy
of the measurements. Here we also discuss the optimization strategies used to achieve the results.
3.1 Transmission of data
An Ethernet frame consists of a fixed 14 byte header, the Ethernet payload, padding and a 4 byte
checksum field. Padding is applied to ensure a minimum Ethernet packet size of 64 bytes. There
is a minimum gap between Ethernet frames of 20 bytes, called the Inter Frame Gap (IFG).
Standard Ethernet supports Ethernet payloads from 1 to 1500 bytes. Ethernet frames with payload
sizes above 1500 bytes are called jumbo frames. Some Ethernet hardware supports payload sizes of
9000 bytes, corresponding to Ethernet frame sizes of 9018 bytes when including the header and checksum
fields. This is shown in Figure 1 (top). The Ethernet payload consists of IP and UDP headers
as well as user data, as illustrated in Figure 1 (bottom). For any data to be transmitted over
Ethernet, the factors influencing the packet and data rates are the link speed, the IFG and the payload
size. The largest supported Ethernet payload is called the Maximum Transmission Unit (MTU). For
further information see [12] and [13].
Figure 1. (top) Ethernet frames are separated by a 20 byte inter frame gap. (bottom) The Ethernet, IP and
UDP headers take up 46 bytes. The largest UDP user data size is 1472 bytes on most Ethernet interfaces
due to a default MTU of 1500. This can be extended on some equipment to 8972 bytes by the use of jumbo
frames.
Sending data larger than the MTU will result in the data being split in chunks of size MTU
before transmission. Given a specific link speed and packet size, the packet rate is given by
\[
\mathrm{rate}\ [\mathrm{packets\ per\ second}] = \frac{ls}{8 \cdot (ps + ifg)}
\]
where ls is the link speed in b/s, ps the packet size and ifg the inter frame gap. Thus for a 10 Gb/s
Ethernet link, the packet rate for 64 byte packets is 14.88 M packets per second (pps) as is shown
in Table 1.
Table 1. Packet rates as function of packet sizes for 10 Gb/s Ethernet
User data size [B] 1 18 82 210 466 978 1472 8972
Packet size [B] 64 64 128 256 512 1024 1518 9018
Overhead [%] 98.8 78.6 44.6 23.9 12.4 5.5 4.3 0.7
Frame rate [Mpps] 14.88 14.88 8.45 4.53 2.35 1.20 0.81 0.14
Packets arriving at a data acquisition system are subject to a nearly constant per-packet pro-
cessing overhead. This is due to interrupt handling, context switching, checksum validations and
header processing. At almost 15 M packets per second this processing alone can consume most of
the available CPU resources. In order to achieve maximum performance, data from the electronics
readout should be bundled into jumbo frames if at all possible. Using the maximum Ethernet packet
size of 9018 bytes reduces the per-packet overhead by a factor of 100. This does, however, come at
the cost of larger latency. For example the transmission time of 64 bytes + IFG is 67 ns, whereas
for 9018 bytes + IFG it is approximately 7.2 µs. For applications sensitive to latency a tradeoff must
be made between low packet rates and low latency.
Not all transmitted data is of interest to the receiver, and some of it can be considered overhead;
packet headers are one such example. The Ethernet, IP and UDP headers are always present and take
up a total of 46 bytes, as shown in Figure 1 (bottom). The utilization of an Ethernet link can be
calculated as

\[
U = \frac{d}{d + 46 + ifg + pad}
\]

where U is the link utilization, d the user data size, ifg the inter frame gap and pad the padding
mentioned earlier. For user data larger than 18 bytes no padding is applied. This means that
for small user payloads the overhead can be significant, making it impossible to achieve high
throughput. For example, transmitting a 32 bit counter over UDP will take up 84 bytes on the wire
(20 bytes IFG + 64 bytes for a minimum Ethernet frame) and the overhead will account for approximately
95% of the available bandwidth. In contrast, when sending 8972 bytes of user data the overhead is as
low as 0.73%.
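As an illustration, the following short C++ sketch (not part of the EFU software, shown purely to make the two formulas above concrete) reproduces the packet rates and overheads listed in Table 1 for a 10 Gb/s link:

#include <algorithm>
#include <cstdio>

int main() {
  const double link_speed = 10e9;       // link speed ls in b/s
  const int ifg = 20;                   // inter frame gap in bytes
  const int headers = 14 + 20 + 8 + 4;  // Ethernet + IP + UDP headers and checksum: 46 bytes
  const int user_sizes[] = {1, 18, 82, 210, 466, 978, 1472, 8972};
  for (int d : user_sizes) {
    int frame = std::max(d + headers, 64);                 // padding enforces 64 byte minimum frames
    double rate = link_speed / (8.0 * (frame + ifg));      // packets per second
    double util = static_cast<double>(d) / (frame + ifg);  // fraction of the link carrying user data
    std::printf("user %5d B  frame %5d B  %6.2f Mpps  overhead %5.1f %%\n",
                d, frame, rate / 1e6, (1.0 - util) * 100.0);
  }
}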
3.2 Network buffers and packet loss
A UDP packet can be dropped in any part of the communications chain: The sender, the receiver,
intermediate systems such as routers, firewalls, switches, load balancers, etc. This makes it difficult
in general to rely on UDP for high speed communications. However for simple network topologies
such as the ones found in detector readout systems it is possible to achieve very reliable UDP
communications. When, for example, the system comprises two hosts (sender and receiver) connected
via a switch of high quality, the packet loss is mainly caused by the Ethernet NIC transmit queue
and the socket receive buffer size. Fortunately these can be optimized. The main parameters for
controlling socket buffers are rmem_max and wmem_max. The former is the size of the UDP socket
receive buffer, whereas the latter is the size of the UDP socket transmit buffer. To change these
values from an application, use setsockopt(), for example:
int buffer = 4000000;
setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buffer, sizeof(buffer));
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buffer, sizeof(buffer));
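The kernel silently caps the requested sizes at rmem_max and wmem_max, so it is worth reading back the value that was actually applied. A minimal sketch (using the same socket s as above, not taken from the paper's code base) does this with getsockopt():
// requires <sys/socket.h> and <cstdio>
int applied = 0;
socklen_t optlen = sizeof(applied);
if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &applied, &optlen) == 0)
    printf("SO_RCVBUF in effect: %d bytes\n", applied); // on Linux typically about twice the request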
In addition there is an internal queue for packet reception whose size (in packets) is named
netdev_max_backlog, and a network interface parameter, txqueuelen, which were also adjusted.
The default values of these parameters on Linux are not optimized for high speed data links
such as 10 Gb/s Ethernet, so for this investigation the following values were used:
net.core.rmem_max=12582912
net.core.wmem_max=12582912
net.core.netdev_max_backlog=5000
txqueuelen 10000
These values have largely been determined by experimentation. We also configured the systems
with an MTU of 9000 allowing user payloads up to 8972 bytes when taking into account that IP
and UDP headers are also transmitted.
3.3 Core locality
Modern CPUs rely heavily on cache memories to achieve performance. This holds for both
instruction and data access. Xeon E5 processors have three levels of cache: some are
shared between instructions and data, some are dedicated. The L3 cache is shared across all cores
and hyperthreads, whereas the L1 cache is only shared between two hyperthreads. The way to ensure
that the transmit and receive applications always use the same caches is to 'lock' the applications
to specific cores. For this we use the Linux command taskset and the pthread API function
pthread_setaffinity_np(). This prevents the application processes from being moved to other cores,
which would interrupt the data processing, but it does not prevent other processes from being
swapped onto the same core.
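A minimal sketch of per-thread pinning with pthread_setaffinity_np() is shown below; it is a generic illustration rather than the exact EFU code, and the core number is whatever the caller passes in:

// pthread_setaffinity_np() is a GNU extension (enable _GNU_SOURCE if needed).
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core; returns 0 on success.
static int pin_to_core(int core) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}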
3.4 Timers
The transmitter and receiver applications for this investigation periodically print the measured
data speed, PER and other parameters. Initially the standard C++ chrono class timer was used
(version: libstdc++.so.6), but profiling showed that significant time was spent here, enough to
affect the measurements at high loads. Instead we decided to use the CPU's hardware based Time
Stamp Counter (TSC). The TSC is a 64 bit counter running at the CPU clock frequency. Since processor
speeds are subject to throttling, the TSC cannot be directly relied upon to measure time. In this
investigation time checking is a two-step process: first we estimate when it is time to do the periodic
update based on the inaccurate TSC value, then we use the more expensive C++ chrono functions
to calculate the elapsed time used in the rate calculations. An example of this is shown in the source
code, which is publicly available; see Section A for instructions on how to obtain it.
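The sketch below illustrates the idea; the TSC threshold is an arbitrary example value and the bookkeeping is simplified with respect to the published code:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h> // __rdtsc() on GCC/Clang, x86 only

// Roughly one second on a ~3 GHz core; only used as a coarse trigger.
constexpr uint64_t kTscInterval = 3000000000ULL;

void maybe_report(uint64_t &last_tsc, std::chrono::steady_clock::time_point &last_time,
                  uint64_t &bytes) {
  if (__rdtsc() - last_tsc < kTscInterval)
    return;                                     // cheap check, done on every packet
  auto now = std::chrono::steady_clock::now();  // accurate clock, used only here
  double secs = std::chrono::duration<double>(now - last_time).count();
  std::printf("rate: %.2f Mb/s\n", 8.0 * bytes / secs / 1e6);
  last_tsc = __rdtsc();
  last_time = now;
  bytes = 0;
}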
4 Testbed for the experiments
The experimental configuration is shown in Figure 2. It consists of two hosts, one acting as a UDP
data generator and the other as a UDP receiver. The hosts are HPE ProLiant DL360 Gen9 servers
connected to a 10 Gb/s Ethernet switch using short (2 m) single mode fiber cables. The switch is
an HP E5406 switch equipped with a J9538A 8-port SFP+ module. The server specifications are
shown in Table 2. Except for the processors, the servers are equipped with identical hardware.
Figure 2. Experimental setup.
The data generator is a small C++ program using BSD sockets, specifically the sendto() system
call for transmission of UDP data. The data receiver is based on a DAQ and event formation system
developed at ESS as a prototype. The system, named the Event Formation Unit (EFU), supports
loadable processing pipelines. A special UDP 'instrument' pipeline was created for the purpose
of these tests.

Table 2. Hardware components for the testbed
Motherboard: HPE ProLiant DL360 Gen9
Processor type (receiver): two 10-core Intel Xeon E5-2650v3 CPUs @ 2.30 GHz
Processor type (generator): one 6-core Intel Xeon E5-2620v3 CPU @ 2.40 GHz
RAM: 64 GB (DDR4) - 4 x 16 GB DIMM - 2133 MHz
NIC: dual port Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet
Hard disk: internal SSD drive (120 GB) for local installation of CentOS 7.1.1503
Linux kernel: 3.10.0-229.7.2.el7.x86_64

Both the generator and receiver use setsockopt() to adjust transmit and receive
buffer sizes. Sequence numbers are embedded in the user payload by the transmitter allowing the
receiver to detect packet loss and hence to calculate packet error ratios. Both the transmitting and
receiving applications were locked to a specific processor core using the taskset command and
pthread_setaffinity_np() function. The measured user payload data-rates were calculated
using a combination of fast timestamp counters and microsecond counters from the C++ chrono
class. Care was taken not to run other programs that might adversly affect performance while
performing the experiments. CPU usages were calculated from the /proc/stat pseudofile as also
used in [9].
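A minimal sketch of the receiver-side loss accounting is shown below; the payload layout (a leading 64 bit sequence number) is an illustrative assumption, not the actual EFU wire format:

#include <cstddef>
#include <cstdint>
#include <cstring>

struct LossCounter {
  uint64_t expected = 0; // next sequence number we expect to see
  uint64_t received = 0; // packets received
  uint64_t lost = 0;     // packets presumed lost (reordering also counts here)

  void account(const char *payload, std::size_t len) {
    uint64_t seq;
    if (len < sizeof(seq))
      return;                  // ignore runt payloads
    std::memcpy(&seq, payload, sizeof(seq));
    if (received != 0 && seq > expected)
      lost += seq - expected;  // a gap in the sequence means packets are missing
    expected = seq + 1;
    ++received;
  }

  double per() const { // packet error ratio
    uint64_t total = received + lost;
    return total ? static_cast<double>(lost) / total : 0.0;
  }
};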
A measurement series typically consisted of the following steps:
1. Start receiver
2. Start transmitter with specified packet size
3. Record packet error ratios (PER) and data rates
4. Stop transmitter and receiver after 400 GB
The above steps were then repeated for measurements of CPU usage using /proc/stat
averaged over 10 second intervals.
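The CPU load numbers can be derived from two snapshots of the aggregate "cpu" line in /proc/stat. The sketch below shows the principle (field layout per proc(5)); it is simplified with respect to the per-category breakdown shown in Figure 3:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Read the first eight jiffy counters of the aggregate "cpu" line:
// user nice system idle iowait irq softirq steal.
static std::vector<uint64_t> cpu_jiffies() {
  std::ifstream stat("/proc/stat");
  std::string label;
  stat >> label;                // discard the leading "cpu" label
  std::vector<uint64_t> v;
  uint64_t x;
  while (v.size() < 8 && stat >> x)
    v.push_back(x);
  return v;
}

// Non-idle fraction of CPU time between two snapshots (0.0 to 1.0).
static double cpu_load(const std::vector<uint64_t> &a, const std::vector<uint64_t> &b) {
  uint64_t total = 0, idle = 0;
  for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
    uint64_t d = b[i] - a[i];
    total += d;
    if (i == 3 || i == 4)       // idle and iowait columns
      idle += d;
  }
  return total ? 1.0 - static_cast<double>(idle) / total : 0.0;
}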
A series of measurements of speed, packet error ratios and CPU usage were made as a function
of user data size for reasons discussed in Section 3.1.
4.1 Experimental limitations
The current experiments are subject to some limitations. We do not however believe that these pose
any significant problems in the evaluation of the results. The main limitations are described below.
Multi user issues: The servers used for the tests are multi user systems in a shared integration
laboratory. Care was taken to ensure that other users were not running applications at the same
time to avoid competition for CPU, memory and network resources. However, a number of standard
daemon processes were running in the background, some of which trigger the transmission of data
and some of which are triggered by packet reception.
Measuring affects performance: Several configuration, performance and debugging tools need
access to kernel or driver data structures. Examples we encountered are netstat, ethtool and
dropwatch. However, the use of these tools can cause additional packet drops when running at
high system loads. These tools were not run while measuring packet losses.
Packet reordering: The test application is unable to detect misordered packets. Packet reordering
however is highly unlikely in the current setup, but would be falsely reported as packet loss.
Packet checksum errors: The NICs perform checksums of Ethernet and IP in hardware. Thus
packets with wrong checksums will not be delivered to the application and subsequently be falsely
reported as packet loss. For the purpose of this study this is the desired behavior.
5 Performance
The performance results cover user data speed, packet error ratios and CPU load. These topics are
covered in the following sections.
5.1 Data Speed
The result of the measurements of achievable user data speeds is shown in Figure 3(a). The figure
shows both the measured and the theoretical maximum speed. For packets with user data sizes
larger than 2000 bytes the achieved rates match the theoretical maximum. However at smaller data
sizes the performance gap increases rapidly. It is clear that either the transmitter or the receiver
is unable to cope with the increasing load. This is mainly due to the higher packet arrival rates
occurring at smaller packet sizes. The higher rates increase the per-packet overhead and also the
number of interrupts and system calls. At the maximum data size of 8972 bytes the CPU load on
the receiver was 20%.
5.2 Packet error ratios
The achieved packet error ratios in this experiment are shown in Figure 3(b), which also shows the
corresponding values obtained using the default system parameters. The raw measurements for the
achieved values are listed in Table 3. It is observed that the packet error ratio depends on the size
of the transmitted data. This dependency is mainly caused by the per-packet overhead introduced by
increasing packet rates with decreasing size. The onset of packet loss coincides with the onset of
the deviation of the observed speed from the theoretical maximum, suggesting a common cause. No
packet loss was observed for data sizes of 2200 bytes and larger. When packet loss sets in at lower data
sizes, the performance degrades rapidly: in the region from 2000 to 1700 bytes the PER increases
by more than four orders of magnitude, from 1.3·10^-6 to 7.1·10^-2.
Table 3. Packet error ratios as a function of user data size
size [B]   64         128        256        472        772        1000       1472       1700
PER        4.0·10^-1  4.0·10^-1  4.1·10^-1  3.9·10^-1  3.8·10^-1  3.8·10^-1  2.0·10^-1  7.1·10^-2
size [B]   1800       1900       2000       2200       2972       4472       5972       8972
PER        3.2·10^-3  6.1·10^-6  1.3·10^-6  0          0          0          0          0
5.3 CPU load
The CPU load as a function of user data size is shown in Figure 3(c). The observation for both
transmitter and receiver is that the CPU load increases with decreasing user data size. When the
transmitter reaches 100% the receiver is slightly less busy at 84%. There is a clear cut-off value
corresponding to packet loss and deviations from theoretical maximum speed around user data sizes
of 2000 bytes. The measured CPU loads indicate that transmission is the bottleneck at small
data sizes (high packet rates), and that most CPU cycles are spent as system load, as also reported
in [9]. However, the comparisons differ both qualitatively and quantitatively upon closer scrutiny. For
example, in this study we find the total CPU load for the receiver (system + user) to be 20% for user
data sizes of 8972 bytes, which is much lower than reported earlier. On the other hand, we observe
a sharp increase in soft IRQ CPU usage from 0% to 100% over a narrow region, which was not
observed previously. We also observe a local minimum in Tx CPU load around 2000 bytes, followed
by a rapid increase at lower data sizes.
6 Conclusion
Measurements of data rates and packet error ratios for UDP based communications at 10 Gb/s
have been presented. The data rates were achieved using standard hardware and software. No
modifications were made to the kernel network stack but some standard Linux commands were
used to optimize the behavior of the system. The main change was increasing network buffers for
UDP communications from a small default value of 212 kB to 12 MB. In addition packet error ratios
were measured. The measurements show that it is possible to achieve zero packet error ratios at
10 Gb/s, but that this requires the use of large Ethernet packets (jumbo frames), preferably as large
as 9018 bytes. Thus the experiments have shown that it is feasible to create a reliable UDP based
data acquisition system supporting readout data at 10 Gb/s.
This study supplements independent measurements done earlier [9] and reveals differences
in performance across different platforms. The observed differences are likely to be caused by
differences in CPU generations, Ethernet NIC capabilities and Linux kernel versions. These
differences were not the focus of our study and have not been investigated further. But they do
indicate that some performance numbers are difficult to compare directly across setups. They also
provide a strong hint to DAQ developers: when upgrading hardware or kernel versions in a Linux
based DAQ system, performance tests should be done to ensure that specifications are still met.
There are several ways to improve performance and achieve 10 Gb/s with smaller packet sizes,
but the complexity increases. For example, it is possible to send and receive multiple messages
using a single system call, such as sendmmsg() and recvmmsg(), which reduces the number
of system calls and should improve performance. It is also possible to use multiple cores for the
receiver instead of only one as in this test. This adds some complexity, since packets have to be
distributed across cores in case it cannot be done automatically. One method for automatic
load distribution is to use Receive Side Scaling (RSS). However, this requires the transmitter to use
several different source ports in the UDP packets instead of the single port currently used, which
may require changes to the readout system. It is also possible to move network processing
away from the kernel and into user space, avoiding context switches, and to change from
interrupt-driven reception to polling. These approaches are used in the Intel Data Plane Development Kit
(DPDK) software packet processing framework.

Figure 3. Performance measurements. a) User data speed. b) Packet Error Ratio. c) CPU Load. Note that
for the optimized values the PER is zero for user data larger than or equal to 2200 bytes (solid line).
A Source code
The software for this project is released under a BSD license and is freely available at GitHub
https://github.com/ess-dmsc/event-formation-unit.git. To build the programs used
for these experiments complete the following steps. To build and start the producer:
> git clone https://github.com/ess-dmsc/event-formation-unit
> cd event-formation-unit/udp
> make
> taskset -c coreid ./udptx -i ipaddress
To build and start the receiver:
> git clone https://github.com/ess-dmsc/event-formation-unit
> cd event-formation-unit
> mkdir build
> cd build
> cmake ..
> make
> ./efu2 -d udp -c coreid
The central source files for this paper are udp/udptx.cpp for the generator and prototype2/udp/udp.cpp
for the receiver. The programs have been demonstrated to build and run on Mac OS X, Ubuntu
16 and CentOS 7.1. However, some additional libraries need to be installed, such as librdkafka and
Google FlatBuffers.
B System configuration
The following commands were used (performed as superuser) to change the system parameters on
CentOS. The examples below modify the network interface eno49. This should be changed to match
the name of the interface on the actual system.
> sysctl -w net.core.rmem_max=12582912
> sysctl -w net.core.wmem_max=12582912
> sysctl -w net.core.netdev_max_backlog=5000
> ifconfig eno49 mtu 9000 txqueuelen 10000 up
Acknowledgments
This work is funded by the EU Horizon 2020 framework, BrightnESS project 676548.
We thank Sarah Ruepp, associate professor at DTU FOTONIK, and Irina Stefanescu, Detector
Scientist at ESS, for comments that greatly improved the manuscript.
References

[1] European Spallation Source ERIC, Retrieved from http://europeanspallationsource.se/.
[2] T. Gahl et al., Hardware Aspects, Modularity and Integration of an Event Mode Data Acquisition and Instrument Control for the European Spallation Source (ESS), arXiv:1507.01838v1.
[3] A. Khaplanov et al., Multi-Grid detector for neutron spectroscopy: results obtained on time-of-flight spectrometer CNCS, JINST 12 (2017) P04030.
[4] I. Stefanescu et al., Neutron detectors for the ESS diffractometers, JINST 12 (2017) P01019.
[5] F. Piscitelli et al., The Multi-Blade Boron-10-based Neutron Detector for high intensity Neutron Reflectometry at ESS, arXiv:1701.07623v1.
[6] J. Postel, User Datagram Protocol, IETF RFC 768, Retrieved from https://tools.ietf.org/html/rfc768.
[7] S. Martoiu, H. Muller and J. Toledo, Front-end electronics for the Scalable Readout System of RD51, IEEE Nuclear Science Symposium Conference Record (2011) 2036.
[8] R. Frazier, G. Illes, D. Newbold and A. Rose, Software and firmware for controlling CMS trigger and readout hardware via gigabit Ethernet, Physics Procedia 37 (2012) 1892-1899.
[9] M. Bencivenni et al., Performance of 10 Gigabit Ethernet Using Commodity Hardware, IEEE Trans. Nucl. Sci. 57 (2010) 630-641.
[10] D. Bortolotti et al., Comparison of UDP Transmission Performance Between IP-Over-InfiniBand and 10-Gigabit Ethernet, IEEE Trans. Nucl. Sci. 58 (2011) 1606-1612.
[11] P. Födisch, B. Lange, J. Sandmann, A. Büchner, W. Enghardt and P. Kaever, A synchronous Gigabit Ethernet protocol stack for high-throughput UDP/IP applications, JINST 11 (2016).
[12] IEEE 802 LAN/MAN Standards Committee, IEEE, Retrieved from http://www.ieee802.org/.
[13] Request For Comments, IETF, Retrieved from http://www.ietf.org/.