Prepared for submission to JINST
Achieving reliable UDP transmission at 10 Gb/s using
BSD sockets for data acquisition systems
M.J. Christensen^a,1 and T. Richter^a

^a European Spallation Source, Data Management and Software Centre,
Ole Maaløes vej 3, 2200 Copenhagen N, Denmark

E-mail: mortenchristensen@esss.se
Abstract: User Datagram Protocol (UDP) is a commonly used protocol for data transmission in
small embedded systems. UDP as such is unreliable and packet losses can occur. The achievable
data rates can suffer if optimal packet sizes are not used. The alternative, Transmission Control
Protocol (TCP), guarantees the ordered delivery of data and automatically adjusts transmission to
match the capability of the transmission link. Nevertheless, UDP is often favored over TCP due
to its simplicity and its small memory and instruction footprints. Both UDP and TCP are implemented
in all larger operating systems and commercial embedded frameworks. In addition, UDP is also
supported on a variety of small hardware platforms such as Digital Signal Processors (DSPs) and
Field Programmable Gate Arrays (FPGAs). This is not as common for TCP. This paper describes how high
speed UDP based data transmission with very low packet error ratios was achieved. The near-reliable
communications link is used in a data acquisition (DAQ) system for the next generation, extremely
intense neutron source, the European Spallation Source. This paper presents measurements of UDP
performance and reliability as achieved by employing several optimizations. The measurements
were performed on Xeon E5 based CentOS (Linux) servers. The measured data rates are very close
to the 10 Gb/s line rate, and zero packet loss was achieved. The performance was obtained utilizing
a single processor core as transmitter and a single core as receiver. The results show that support
for transmitting large data packets is a key parameter for good performance. The optimizations for
throughput are: the MTU, packet sizes, tuned Linux kernel parameters, thread affinity, core locality
and efficient timers.
Keywords: Computing (architecture, farms, GRID for recording, storage, archiving, and distribution
of data), Data acquisition concepts, Software architectures (event data models, frameworks and
databases)
^1 Corresponding author.
arXiv:1706.00333v1 [physics.ins-det] 1 Jun 2017
Contents

1 Introduction
2 TCP and UDP pros and cons
 2.1 Congestion
 2.2 Connections
 2.3 Addressing
3 Performance optimizations
 3.1 Transmission of data
 3.2 Network buffers and packet loss
 3.3 Core locality
 3.4 Timers
4 Testbed for the experiments
 4.1 Experimental limitations
5 Performance
 5.1 Data Speed
 5.2 Packet error ratios
 5.3 CPU load
6 Conclusion
A Source code
B System configuration
1 Introduction
The European Spallation Source (ESS) [1] is a next generation neutron source currently being developed
in Lund, Sweden. The facility will initially support about 16 different instruments for neutron
scattering. In addition to the instrument infrastructure, the ESS Data Management and Software
Centre (DMSC), located in Copenhagen, provides infrastructure and computational support for the
acquisition, event formation and long term storage of the experimental data. At the heart of each
instrument is a neutron detector and its associated readout system. Both detectors and readout
systems are currently in the design phase and various prototypes have already been produced [2–5].
During experiments, data is produced at high rates: detector data is read out by custom
electronics and the readings are converted into UDP packets by the readout system and sent to event
formation servers over 10 Gb/s optical Ethernet links. The event formation servers are based on
general purpose CPUs, and it is anticipated that most, if not all, data reduction at ESS will be done in
software. This includes reception of raw readout data, threshold rejection, clustering and event
formation. UDP is a simple protocol for connectionless data transmission [6] and packet loss can
occur during transmission. Nevertheless UDP is widely used, for example in the RD51 Scalable
Readout System [7], or the CMS trigger readout [8], both using 1 Gb/s Ethernet. The two central
components are the readout system and the event formation system. The readout system is a hybrid
of analog and digital electronics. The electronics convert deposited charges into electric signals
which are digitized and timestamped. In the digital domain simple data reduction such as zero
suppression and threshold based rejection can be performed. The event formation system receives
these timestamped digital readouts and performs the necessary steps to determine the position of
the neutron. These processing steps are different for each detector type. The performance of UDP
over 10G Ethernet has been the subject of previous studies [9, 10], which measured TCP and UDP
performance and CPU usage on Linux using commodity hardware. Both studies apply a certain set
of optimizations but otherwise use standard Linux. In [9] the transmitting process is found to be
a bottleneck in terms of CPU usage, whereas a comparison between Ethernet and InfiniBand [10]
reinforces the earlier results and concludes that Ethernet is a serious contender for use in a readout
system. This study is aimed at characterizing the performance of a prototype data acquisition
system based on UDP. The study is not primarily concerned with transmitter performance, as we
expect to receive data from an FPGA based platform capable of transmitting at wire speed at all
packet sizes. Instead, comparisons between the measured and theoretically possible throughput
and measurements of packet error ratios are presented. Finally, this paper presents strategies for
optimizing the performance of data transmission between the readout system and the event formation
system.
2 TCP and UDP pros and cons
Since TCP is reliable and has good performance, whereas UDP is unreliable, why not always just
use TCP? The pros and cons are discussed in the following. Both TCP and UDP
are designed to provide end-to-end communications between hosts connected over a network of
packet forwarders. Originally these forwarders were routers, but today the group of forwarders
includes firewalls, load balancers, switches, Network Address Translator (NAT) devices, etc. TCP
is connection oriented, whereas UDP is connectionless. This means that TCP requires that a
connection is setup before data can be transmitted. It also implies that TCP data can only be sent
from a single transmitter to a single receiver. In contrast UDP does not have a connection concept
and UDP data can be transmitted as either Internet Protocol (IP) broadcast or IP multicast. As
mentioned earlier the main argument for UDP is that it is often supported on smaller systems where
TCP is not. A notable example is FPGA based systems (see [11] for one such example). However, some
of the TCP features do not actually improve performance and reliability in the case of special
network topologies as explained below.
2.1 Congestion
Any forwarder is potentially subject to congestion and can drop packets when unable to cope with
the traffic load. TCP was designed to react to this congestion. Firstly TCP has a slow start algorithm
whereby the data rate is ramped up gradually in order not to contribute to the network congestion
itself. Secondly TCP will back off and reduce its transmission rate when congestion is detected.
In a readout system such as ours the network only consists of a data sender and a data receiver
with an optional switch connecting them. Thus the only places where congestion occurs are at the
sender or receiver. The readout system will typically produce data at near constant rates during
measurements so congestion at the receiver will result in reduced data rates by the transmitter when
using TCP. This first causes buffering at the transmitting application until the buffer is full, and
eventually packets are lost.
For some detector readout it is not even evident that guaranteed delivery is necessary. In one
detector prototype we discarded around 24% of the data due to threshold suppression, so spending
extra time making an occasional retransmission may not be worth the added complexity.
2.2 Connections
Since TCP requires the establishment of a connection, both the receiving and transmitting
applications must implement additional state to detect the possible loss of a connection, for example
upon reset of the readout system after a software upgrade or a parameter change. With UDP the
receiver will just ’listen’ on a specified UDP port whenever it is ready and receive data when it
arrives. Correspondingly, the transmitter can send data whenever it is ready. UDP reception supports
many-to-one communication, allowing for example two or more readout systems to feed a single
receiver. For TCP to support this would require handling multiple TCP connections.
2.3 Addressing
UDP can be transmitted over IP as multicast. This means that a single sender can reach multiple
receivers without any additional programming effort. This can be used for seamless switchovers,
redundancy, load distribution, monitoring, etc. Implementing this in TCP would add complexity
to the transmitter.
In summary, for our purposes UDP appears to have more relevant features than TCP. Thus it is
preferred, provided we can achieve the desired performance and reliability.
3 Performance optimizations
This section explains the factors that contribute to limiting performance, reproducibility or accuracy
of the measurements. Here we also discuss the optimization strategies used to achieve the results.
3.1 Transmission of data
An Ethernet frame consists of a fixed 14 byte header, the Ethernet payload, padding and a 4 byte
checksum field. Padding is applied to ensure a minimum Ethernet packet size of 64 bytes. There
is a minimum gap between Ethernet frames of 20 bytes, called the Inter Frame Gap (IFG).
Standard Ethernet supports Ethernet payloads from 1 to 1500 bytes. Ethernet frames with payload
sizes above 1500 bytes are called jumbo frames. Some Ethernet hardware supports payload sizes of
9000 bytes, corresponding to Ethernet frame sizes of 9018 bytes when including the header and checksum
fields. This is shown in Figure 1 (top). The Ethernet payload consists of IP and UDP headers
as well as user data, as illustrated in Figure 1 (bottom). For any data to be transmitted over
Ethernet, the factors influencing the packet and data rates are the link speed, the IFG and the payload
size. The largest supported Ethernet payload is called the Maximum Transmission Unit (MTU). For
further information see [12] and [13].
Figure 1. (top) Ethernet frames are separated by a 20 byte inter frame gap. (bottom) The Ethernet, IP and
UDP headers take up 46 bytes. The largest UDP user data size is 1472 bytes on most Ethernet interfaces
due to a default MTU of 1500. This can be extended on some equipment to 8972 bytes by the use of jumbo
frames.
Sending data larger than the MTU will result in the data being split in chunks of size MTU
before transmission. Given a specific link speed and packet size, the packet rate is given by
\[
\mathrm{rate}\ [\mathrm{packets\ per\ second}] = \frac{ls}{8 \cdot (ps + ifg)}
\]
where ls is the link speed in b/s, ps the packet size and ifg the inter frame gap. Thus for a 10 Gb/s
Ethernet link, the packet rate for 64 byte packets is 14.88 M packets per second (pps) as is shown
in Table 1.
Table 1. Packet rates as function of packet sizes for 10 Gb/s Ethernet
User data size [B] 1 18 82 210 466 978 1472 8972
Packet size [B] 64 64 128 256 512 1024 1518 9018
Overhead [%] 98.8 78.6 44.6 23.9 12.4 5.5 4.3 0.7
Frame rate [Mpps] 14.88 14.88 8.45 4.53 2.35 1.20 0.81 0.14
Packets arriving at a data acquisition system are subject to a nearly constant per-packet pro-
cessing overhead. This is due to interrupt handling, context switching, checksum validations and
header processing. At almost 15 M packets per second this processing alone can consume most of
the available CPU resources. In order to achieve maximum performance, data from the electronics
readout should be bundled into jumbo frames if at all possible. Using the maximum Ethernet packet
size of 9018 bytes reduces the per-packet overhead by a factor of 100. This does, however, come at
the cost of larger latency. For example the transmission time of 64 bytes + IFG is 67 ns, whereas
for 9018 bytes + IFG it is approximately 7.2 µs. For applications sensitive to latency a tradeoff must
be made between low packet rates and low latency.
Not all transmitted data is of interest to the receiver, and some of it can be considered overhead;
packet headers are one such example. The Ethernet, IP and UDP headers are always present and take
up a total of 46 bytes, as shown in Figure 1 (bottom). The utilization of an Ethernet link can be
calculated as

\[
U = \frac{d}{d + 46 + ifg + pad}
\]

where U is the link utilization, d the user data size, ifg the inter frame gap and pad the padding
mentioned earlier. For user data larger than 18 bytes no padding is applied. This means that
for small user payloads the overhead can be significant, making it impossible to achieve high
throughput. For example, transmitting a 32 bit counter over UDP will take up 84 bytes on the wire
(20 bytes IFG + 64 bytes for a minimum Ethernet frame) and the overhead will account for approximately
95% of the available bandwidth. In contrast, when sending 8972 bytes of user data the overhead is as
low as 0.73%.
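As an illustration, the following short C++ sketch (not part of the EFU software, shown purely to make the two formulas above concrete) reproduces the packet rates and overheads listed in Table 1 for a 10 Gb/s link:

#include <algorithm>
#include <cstdio>

int main() {
  const double link_speed = 10e9;       // link speed ls in b/s
  const int ifg = 20;                   // inter frame gap in bytes
  const int headers = 14 + 20 + 8 + 4;  // Ethernet + IP + UDP headers and checksum: 46 bytes
  const int user_sizes[] = {1, 18, 82, 210, 466, 978, 1472, 8972};
  for (int d : user_sizes) {
    int frame = std::max(d + headers, 64);                 // padding enforces 64 byte minimum frames
    double rate = link_speed / (8.0 * (frame + ifg));      // packets per second
    double util = static_cast<double>(d) / (frame + ifg);  // fraction of the link carrying user data
    std::printf("user %5d B  frame %5d B  %6.2f Mpps  overhead %5.1f %%\n",
                d, frame, rate / 1e6, (1.0 - util) * 100.0);
  }
}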
3.2 Network buffers and packet loss
A UDP packet can be dropped in any part of the communications chain: The sender, the receiver,
intermediate systems such as routers, firewalls, switches, load balancers, etc. This makes it difficult
in general to rely on UDP for high speed communications. However for simple network topologies
such as the ones found in detector readout systems it is possible to achieve very reliable UDP
communications. When, for example, the system comprises two hosts (sender and receiver) connected
via a switch of high quality, the packet loss is mainly caused by the Ethernet NIC transmit queue
and the socket receive buffer size. Fortunately these can be optimized. The main parameters for
controlling socket buffers are rmem_max and wmem_max. The former is the size of the UDP socket
receive buffer, whereas the latter is the size of the UDP socket transmit buffer. To change these
values from an application, use setsockopt(), for example:
int buffer = 4000000;
setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buffer, sizeof(buffer));
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buffer, sizeof(buffer));
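The kernel silently caps the requested sizes at rmem_max and wmem_max, so it is worth reading back the value that was actually applied. A minimal sketch (using the same socket s as above, not taken from the paper's code base) does this with getsockopt():
// requires <sys/socket.h> and <cstdio>
int applied = 0;
socklen_t optlen = sizeof(applied);
if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &applied, &optlen) == 0)
    printf("SO_RCVBUF in effect: %d bytes\n", applied); // on Linux typically about twice the request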
In addition there is an internal queue for packet reception whose size (in packets) is named
netdev_max_backlog, and a network interface parameter, txqueuelen, which were also adjusted.
The default values of these parameters on Linux are not optimized for high speed data links
such as 10 Gb/s Ethernet, so for this investigation the following values were used:
net.core.rmem_max=12582912
net.core.wmem_max=12582912
net.core.netdev_max_backlog=5000
txqueuelen 10000
These values have largely been determined by experimentation. We also configured the systems
with an MTU of 9000 allowing user payloads up to 8972 bytes when taking into account that IP
and UDP headers are also transmitted.
3.3 Core locality
Modern CPUs rely heavily on cache memories to achieve performance. This holds for both
instruction and data access. Xeon E5 processors have three levels of cache: some are
shared between instructions and data, some are dedicated. The L3 cache is shared across all cores
and hyperthreads, whereas the L1 cache is only shared between two hyperthreads. The way to ensure
that the transmit and receive applications always use the same caches is to 'lock' the applications
to specific cores. For this we use the Linux command taskset and the pthread API function
pthread_setaffinity_np(). This prevents the application processes from being moved to other cores,
which would interrupt the data processing, but it does not prevent other processes from being
swapped onto the same core.
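A minimal sketch of per-thread pinning with pthread_setaffinity_np() is shown below; it is a generic illustration rather than the exact EFU code, and the core number is whatever the caller passes in:

// pthread_setaffinity_np() is a GNU extension (enable _GNU_SOURCE if needed).
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core; returns 0 on success.
static int pin_to_core(int core) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}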
3.4 Timers
The transmitter and receiver applications for this investigation periodically print the measured
data speed, PER and other parameters. Initially the standard C++ chrono class timer was used
(version: libstdc++.so.6), but profiling showed that significant time was spent here, enough to
affect the measurements at high loads. Instead we decided to use the CPU's hardware based Time
Stamp Counter (TSC). The TSC is a 64 bit counter running at the CPU clock frequency. Since processor
speeds are subject to throttling, the TSC cannot be directly relied upon to measure time. In this
investigation time checking is a two-step process: first we estimate when it is time to do the periodic
update based on the inaccurate TSC value, then we use the more expensive C++ chrono functions
to calculate the elapsed time used in the rate calculations. An example of this is shown in the source
code, which is publicly available; see Section A for instructions on how to obtain it.
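The sketch below illustrates the idea; the TSC threshold is an arbitrary example value and the bookkeeping is simplified with respect to the published code:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h> // __rdtsc() on GCC/Clang, x86 only

// Roughly one second on a ~3 GHz core; only used as a coarse trigger.
constexpr uint64_t kTscInterval = 3000000000ULL;

void maybe_report(uint64_t &last_tsc, std::chrono::steady_clock::time_point &last_time,
                  uint64_t &bytes) {
  if (__rdtsc() - last_tsc < kTscInterval)
    return;                                     // cheap check, done on every packet
  auto now = std::chrono::steady_clock::now();  // accurate clock, used only here
  double secs = std::chrono::duration<double>(now - last_time).count();
  std::printf("rate: %.2f Mb/s\n", 8.0 * bytes / secs / 1e6);
  last_tsc = __rdtsc();
  last_time = now;
  bytes = 0;
}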
4 Testbed for the experiments
The experimental configuration is shown in Figure 2. It consists of two hosts, one acting as a UDP
data generator and the other as a UDP receiver. The hosts are HPE ProLiant DL360 Gen9 servers
connected to a 10 Gb/s Ethernet switch using short (2 m) single mode fiber cables. The switch is
an HP E5406 switch equipped with a J9538A 8-port SFP+ module. The server specifications are
shown in Table 2. Except for the processors, the servers are equipped with identical hardware.
Figure 2. Experimental setup.
The data generator is a small C++ program using BSD sockets, specifically the sendto() system
call for transmission of UDP data. The data receiver is based on a DAQ and event formation system
developed at ESS as a prototype. The system, named the Event Formation Unit (EFU), supports
loadable processing pipelines. A special UDP 'instrument' pipeline was created for the purpose
of these tests.

Table 2. Hardware components for the testbed
Motherboard: HPE ProLiant DL360 Gen9
Processor type (receiver): two 10-core Intel Xeon E5-2650v3 CPUs @ 2.30 GHz
Processor type (generator): one 6-core Intel Xeon E5-2620v3 CPU @ 2.40 GHz
RAM: 64 GB (DDR4) - 4 x 16 GB DIMM - 2133 MHz
NIC: dual port Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet
Hard disk: internal SSD drive (120 GB) for local installation of CentOS 7.1.1503
Linux kernel: 3.10.0-229.7.2.el7.x86_64

Both the generator and receiver use setsockopt() to adjust transmit and receive
buffer sizes. Sequence numbers are embedded in the user payload by the transmitter allowing the
receiver to detect packet loss and hence to calculate packet error ratios. Both the transmitting and
receiving applications were locked to a specific processor core using the taskset command and
pthread_setaffinity_np() function. The measured user payload data-rates were calculated
using a combination of fast timestamp counters and microsecond counters from the C++ chrono
class. Care was taken not to run other programs that might adversly affect performance while
performing the experiments. CPU usages were calculated from the /proc/stat pseudofile as also
used in [9].
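A minimal sketch of the receiver-side loss accounting is shown below; the payload layout (a leading 64 bit sequence number) is an illustrative assumption, not the actual EFU wire format:

#include <cstddef>
#include <cstdint>
#include <cstring>

struct LossCounter {
  uint64_t expected = 0; // next sequence number we expect to see
  uint64_t received = 0; // packets received
  uint64_t lost = 0;     // packets presumed lost (reordering also counts here)

  void account(const char *payload, std::size_t len) {
    uint64_t seq;
    if (len < sizeof(seq))
      return;                  // ignore runt payloads
    std::memcpy(&seq, payload, sizeof(seq));
    if (received != 0 && seq > expected)
      lost += seq - expected;  // a gap in the sequence means packets are missing
    expected = seq + 1;
    ++received;
  }

  double per() const { // packet error ratio
    uint64_t total = received + lost;
    return total ? static_cast<double>(lost) / total : 0.0;
  }
};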
A measurement series typically consisted of the following steps:
1. Start receiver
2. Start transmitter with specified packet size
3. Record packet error ratios (PER) and data rates
4. Stop transmitter and receiver after 400 GB
The above steps were then repeated for measurements of CPU usage using /proc/stat
averaged over 10 second intervals.
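The CPU load numbers can be derived from two snapshots of the aggregate "cpu" line in /proc/stat. The sketch below shows the principle (field layout per proc(5)); it is simplified with respect to the per-category breakdown shown in Figure 3:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Read the first eight jiffy counters of the aggregate "cpu" line:
// user nice system idle iowait irq softirq steal.
static std::vector<uint64_t> cpu_jiffies() {
  std::ifstream stat("/proc/stat");
  std::string label;
  stat >> label;                // discard the leading "cpu" label
  std::vector<uint64_t> v;
  uint64_t x;
  while (v.size() < 8 && stat >> x)
    v.push_back(x);
  return v;
}

// Non-idle fraction of CPU time between two snapshots (0.0 to 1.0).
static double cpu_load(const std::vector<uint64_t> &a, const std::vector<uint64_t> &b) {
  uint64_t total = 0, idle = 0;
  for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
    uint64_t d = b[i] - a[i];
    total += d;
    if (i == 3 || i == 4)       // idle and iowait columns
      idle += d;
  }
  return total ? 1.0 - static_cast<double>(idle) / total : 0.0;
}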
A series of measurements of speed, packet error ratios and CPU usage were made as a function
of user data size for reasons discussed in Section 3.1.
4.1 Experimental limitations
The current experiments are subject to some limitations. We do not however believe that these pose
any significant problems in the evaluation of the results. The main limitations are described below.
Multi user issues: The servers used for the tests are multi user systems in a shared integration
laboratory. Care was taken to ensure that other users were not running applications at the same
time to avoid competition for CPU, memory and network resources. However, a number of standard
daemon processes were running in the background, some of which trigger the transmission of data
and some of which are triggered by packet reception.
Measuring affects performance: Several configuration, performance and debugging tools need
access to kernel or driver data structures. Examples we encountered are netstat, ethtool and
dropwatch. However, the use of these tools can cause additional packet drops when running at
high system loads. These tools were not run while measuring packet losses.
Packet reordering: The test application is unable to detect misordered packets. Packet reordering
however is highly unlikely in the current setup, but would be falsely reported as packet loss.
Packet checksum errors: The NICs perform checksums of Ethernet and IP in hardware. Thus
packets with wrong checksums will not be delivered to the application and subsequently be falsely
reported as packet loss. For the purpose of this study this is the desired behavior.
5 Performance
The performance results cover user data speed, packet error ratios and CPU load. These topics are
covered in the following sections.
5.1 Data Speed
The result of the measurements of achievable user data speeds is shown in Figure 3(a). The figure
shows both the measured and the theoretical maximum speed. For packets with user data sizes
larger than 2000 bytes the achieved rates match the theoretical maximum. However at smaller data
sizes the performance gap increases rapidly. It is clear that either the transmitter or the receiver
is unable to cope with the increasing load. This is mainly due to the higher packet arrival rates
occurring at smaller packet sizes. The higher rates increase the per-packet overhead and also the
number of interrupts and system calls. At the maximum data size of 8972 bytes the CPU load on
the receiver was 20%.
5.2 Packet error ratios
The achieved packet error ratios in this experiment are shown in Figure 3(b), which also shows the
corresponding values obtained using the default system parameters. The raw measurements for the
achieved values are listed in Table 3. It is observed that the packet error ratio depends on the size
of the transmitted data. This dependency is mainly caused by the per-packet overhead introduced by
increasing packet rates with decreasing size. The onset of packet loss coincides with the onset of
the deviation of the observed speed from the theoretical maximum, suggesting a common cause. No
packet loss was observed for data sizes of 2200 bytes and larger. When packet loss sets in at lower data
sizes, the performance degrades rapidly: in the region from 2000 to 1700 bytes the PER increases
by more than four orders of magnitude, from 1.3·10^-6 to 7.1·10^-2.
Table 3. Packet error ratios as a function of user data size
size [B]   64         128        256        472        772        1000       1472       1700
PER        4.0·10^-1  4.0·10^-1  4.1·10^-1  3.9·10^-1  3.8·10^-1  3.8·10^-1  2.0·10^-1  7.1·10^-2
size [B]   1800       1900       2000       2200       2972       4472       5972       8972
PER        3.2·10^-3  6.1·10^-6  1.3·10^-6  0          0          0          0          0
5.3 CPU load
The CPU load as a function of user data size is shown in Figure 3(c). The observation for both
transmitter and receiver is that the CPU load increases with decreasing user data size. When the
transmitter reaches 100% the receiver is slightly less busy at 84%. There is a clear cut-off value
corresponding to packet loss and deviations from theoretical maximum speed around user data sizes
of 2000 bytes. The measured CPU loads indicate that transmission is the bottleneck at small
data sizes (high packet rates), and that most CPU cycles are spent as system load, as also reported
in [9]. However, the comparisons differ both qualitatively and quantitatively upon closer scrutiny. For
example, in this study we find the total CPU load for the receiver (system + user) to be 20% for user
data sizes of 8972 bytes, which is much lower than reported earlier. On the other hand, we observe
a sharp increase in soft IRQ CPU usage from 0% to 100% over a narrow region, which was not
observed previously. We also observe a local minimum in Tx CPU load around 2000 bytes, followed
by a rapid increase at lower data sizes.
6 Conclusion
Measurements of data rates and packet error ratios for UDP based communications at 10 Gb/s
have been presented. The data rates were achieved using standard hardware and software. No
modifications were made to the kernel network stack but some standard Linux commands were
used to optimize the behavior of the system. The main change was increasing network buffers for
UDP communications from a small default value of 212 kB to 12 MB. In addition packet error ratios
were measured. The measurements show that it is possible to achieve zero packet error ratios at
10 Gb/s, but that this requires the use of large Ethernet packets (jumbo frames), preferably as large
as 9018 bytes. Thus the experiments have shown that it is feasible to create a reliable UDP based
data acquisition system supporting readout data at 10 Gb/s.
This study supplements independent measurements done earlier [9] and reveals differences
in performance across different platforms. The observed differences are likely to be caused by
differences in CPU generations, Ethernet NIC capabilities and Linux kernel versions. These
differences were not the focus of our study and have not been investigated further. But they do
indicate that some performance numbers are difficult to compare directly across setups. They also
provide a strong hint to DAQ developers: when upgrading hardware or kernel versions in a Linux
based DAQ system, performance tests should be done to ensure that specifications are still met.
There are several ways to improve performance and achieve 10 Gb/s with smaller packet sizes,
but the complexity increases. For example, it is possible to send and receive multiple messages
using a single system call, such as sendmmsg() and recvmmsg(), which reduces the number
of system calls and should improve performance. It is also possible to use multiple cores for the
receiver instead of only one as in this test. This adds some complexity, since packets have to be
distributed across cores in case it cannot be done automatically. One method for automatic
load distribution is to use Receive Side Scaling (RSS). However, this requires the transmitter to use
several different source ports in the UDP packets instead of the single port currently used, which
may require changes to the readout system. It is also possible to move network processing
away from the kernel and into user space, avoiding context switches, and to change from
interrupt-driven reception to polling. These approaches are used in the Intel Data Plane Development Kit
(DPDK) software packet processing framework.

Figure 3. Performance measurements. a) User data speed. b) Packet Error Ratio. c) CPU Load. Note that
for the optimized values the PER is zero for user data larger than or equal to 2200 bytes (solid line).
A Source code
The software for this project is released under a BSD license and is freely available at GitHub
https://github.com/ess-dmsc/event-formation-unit.git. To build the programs used
for these experiments complete the following steps. To build and start the producer:
> git clone https://github.com/ess-dmsc/event-formation-unit
> cd event-formation-unit/udp
> make
> taskset -c coreid ./udptx -i ipaddress
To build and start the receiver:
> git clone https://github.com/ess-dmsc/event-formation-unit
> cd event-formation-unit
> mkdir build
> cd build
> cmake ..
> make
> ./efu2 -d udp -c coreid
The central source files for this paper are udp/udptx.cpp for the generator and prototype2/udp/udp.cpp
for the receiver. The programs have been demonstrated to build and run on Mac OS X, Ubuntu
16 and CentOS 7.1. However, some additional libraries need to be installed, such as librdkafka and
Google FlatBuffers.
B System configuration
The following commands were used (performed as superuser) to change the system parameters on
CentOS. The examples below modify the network interface eno49. This should be changed to match
the name of the interface on the actual system.
> sysctl -w net.core.rmem_max=12582912
> sysctl -w net.core.wmem_max=12582912
> sysctl -w net.core.netdev_max_backlog=5000
> ifconfig eno49 mtu 9000 txqueuelen 10000 up
Acknowledgments
This work is funded by the EU Horizon 2020 framework, BrightnESS project 676548.
We thank Sarah Ruepp, associate professor at DTU FOTONIK, and Irina Stefanescu, Detector
Scientist at ESS, for comments that greatly improved the manuscript.
References

[1] European Spallation Source ERIC, Retrieved from http://europeanspallationsource.se/.
[2] T. Gahl et al., Hardware Aspects, Modularity and Integration of an Event Mode Data Acquisition and Instrument Control for the European Spallation Source (ESS), arXiv:1507.01838v1.
[3] A. Khaplanov et al., Multi-Grid detector for neutron spectroscopy: results obtained on time-of-flight spectrometer CNCS, JINST 12 (2017) P04030.
[4] I. Stefanescu et al., Neutron detectors for the ESS diffractometers, JINST 12 (2017) P01019.
[5] F. Piscitelli et al., The Multi-Blade Boron-10-based Neutron Detector for high intensity Neutron Reflectometry at ESS, arXiv:1701.07623v1.
[6] J. Postel, User Datagram Protocol, IETF RFC 768, Retrieved from https://tools.ietf.org/html/rfc768.
[7] S. Martoiu, H. Muller and J. Toledo, Front-end electronics for the Scalable Readout System of RD51, IEEE Nuclear Science Symposium Conference Record (2011) 2036.
[8] R. Frazier, G. Illes, D. Newbold and A. Rose, Software and firmware for controlling CMS trigger and readout hardware via gigabit Ethernet, Physics Procedia 37 (2012) 1892-1899.
[9] M. Bencivenni et al., Performance of 10 Gigabit Ethernet Using Commodity Hardware, IEEE Trans. Nucl. Sci. 57 (2010) 630-641.
[10] D. Bortolotti et al., Comparison of UDP Transmission Performance Between IP-Over-InfiniBand and 10-Gigabit Ethernet, IEEE Trans. Nucl. Sci. 58 (2011) 1606-1612.
[11] P. Födisch, B. Lange, J. Sandmann, A. Büchner, W. Enghardt and P. Kaever, A synchronous Gigabit Ethernet protocol stack for high-throughput UDP/IP applications, JINST 11 (2016).
[12] IEEE 802 LAN/MAN Standards Committee, IEEE, Retrieved from http://www.ieee802.org/.
[13] Request For Comments, IETF, Retrieved from http://www.ietf.org/.