Figure 4 - uploaded by Rolf Riesen
Processing Incoming Data Packets on the NIC  

Source publication
Conference Paper
Offloading protocol processing will become an important tool in supporting our efforts to deliver increasing bandwidth to applications. In this paper we describe our experience in offloading protocol processing to a programmable gigabit Ethernet network interface card. For our experiments, we selected a simple RTS/CTS (request to send/clear to send...
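The RTS/CTS protocol the paper offloads is a rendezvous scheme: a sender announces a message, and data flows only once the receiver has reserved a buffer. A minimal sketch of that handshake follows; the message names, buffer-slot accounting, and direct method calls standing in for the wire are illustrative assumptions, not the paper's actual firmware interface.

```python
# Hypothetical sketch of an RTS/CTS (rendezvous) exchange of the kind the
# paper offloads to the NIC. Message names and buffer accounting are assumed.

class Receiver:
    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots   # receive buffers available on the NIC
        self.delivered = []

    def handle_rts(self, msg_id, length):
        """On a request-to-send, grant CTS only if a buffer can be reserved."""
        if self.free_slots > 0:
            self.free_slots -= 1
            return ("CTS", msg_id)       # clear to send
        return ("NACK", msg_id)          # no buffer free: sender retries later

    def handle_data(self, msg_id, payload):
        """Data arrives only after a CTS, so a buffer is always waiting."""
        self.delivered.append((msg_id, payload))
        self.free_slots += 1             # buffer released once data is consumed

class Sender:
    def send(self, receiver, msg_id, payload):
        """Two-phase send: transfer the payload only after CTS is granted."""
        reply = receiver.handle_rts(msg_id, len(payload))
        if reply[0] == "CTS":
            receiver.handle_data(msg_id, payload)
            return True
        return False                     # deferred; no receive-buffer overrun
```

The point of the two phases is that data never arrives at a receiver without a buffer already reserved for it, which is what makes the exchange attractive to run on the NIC without host involvement.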

Citations

... Even though VS shares the benefits from recent advances in ISP [6,9,14,23,31,33,35,40,67,75,76,82,85,89,96] and near-data processing [2,17,20,39,48,50,63,80,84], these frameworks need the mechanisms that VS offers in order to execute approximate-computing applications efficiently. And while using approximate computing in channel encoding [38,62] and the memory controller [30] can achieve an effect similar to that of VS in terms of reducing data-movement overhead, VS is independent of these projects and requires no hardware changes. ...
Conference Paper
Approximate computing that works on less precise data leads to significant performance gains and energy-cost reductions for compute kernels. However, without leveraging the full-stack design of computer systems, modern computer architectures undermine the potential of approximate computing. In this paper, we present Varifocal Storage, a dynamic multi-resolution storage system that tackles challenges in performance, quality, flexibility and cost for computer systems supporting diverse application demands. Varifocal Storage dynamically adjusts the dataset resolution within a storage device, thereby mitigating the performance bottleneck of exchanging/preparing data for approximate compute kernels. Varifocal Storage introduces Autofocus and iFilter mechanisms to provide quality control inside the storage device and make programs more adaptive to diverse datasets. Varifocal Storage also offers flexible, efficient support for approximate and exact computing without exceeding the costs of conventional storage systems by (1) saving the raw dataset in the storage device, and (2) targeting operators that complement the power of existing SSD controllers to dynamically generate lower-resolution datasets. We evaluate the performance of Varifocal Storage by running applications on a heterogeneous computer with our prototype SSD. The results show that Varifocal Storage can speed up data resolution adjustments by 2.02× or 1.74× without programmer input. Compared to conventional approximate-computing architectures, Varifocal Storage speeds up the overall execution time by 1.52×.
... Experiences with offloading protocol processing to a programmable Gigabit Ethernet NIC are presented in [3], where it is shown that network throughput almost doubles when jumbo frames are used. ...
... 3. Jitter values and UDP packet dropped values are comparable between jumbo and normal frames. ...
Article
Enhancing network performance has been studied by a number of researchers, with the need for greater throughput on network infrastructure as the key driver. The use of jumbo frames is considered one of the methodologies that can be employed to increase data throughput on networks. In this research undertaking, the authors implement jumbo frames on a test bed built on Windows Server 2003/2008 networks, and performance-related metrics are measured for both IPv4 and IPv6 implementations. The results of this empirical study show that performance-metric values differ across the various scenarios.
... A similar conclusion is drawn in [2], where details of performance tests are presented showing that jumbo frames and fast processors are necessary to get the best performance. Experiences with offloading protocol processing to a programmable Gigabit Ethernet NIC are presented in [3]. ...
Article
With the current limitations of network technology, jumbo frames provide a mechanism that allows a greater amount of data to be transferred efficiently over the wire. Implementation of jumbo frames is a viable option on Gigabit Ethernet. In this research undertaking, the authors implement jumbo frames on a test bed built with Windows routers, and performance-related metrics are measured for both IPv4 and IPv6 implementations. The results of this empirical study show that performance-metric values differ across the various scenarios.
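The throughput benefit of jumbo frames that these studies measure comes largely from amortizing fixed per-frame overhead over a larger payload. A quick back-of-the-envelope calculation, assuming standard Ethernet framing and IPv4/TCP headers without options, illustrates the wire-efficiency difference:

```python
# Wire efficiency of TCP payload over Ethernet for a given MTU.
# Fixed per-frame overheads (standard values, no header options assumed):
ETH_HEADER = 14      # Ethernet header (dest MAC, src MAC, EtherType)
FCS = 4              # frame check sequence
PREAMBLE_IFG = 20    # preamble + SFD (8) plus minimum inter-frame gap (12)
IP_HEADER = 20       # IPv4 header, no options
TCP_HEADER = 20      # TCP header, no options

def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that is actual TCP payload."""
    payload = mtu - IP_HEADER - TCP_HEADER
    on_wire = mtu + ETH_HEADER + FCS + PREAMBLE_IFG
    return payload / on_wire

print(f"1500-byte MTU: {wire_efficiency(1500):.1%}")  # ~94.9%
print(f"9000-byte MTU: {wire_efficiency(9000):.1%}")  # ~99.1%
```

Raw header amortization accounts for only a few percent; the larger gains reported in these papers come from the reduced per-packet processing cost (fewer interrupts, fewer header traversals) on the hosts.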
... A similar approach consists in offloading a part of the protocol processing onto specialized NICs [11]. Such a technique could be used to handle the rendezvous handshake at the hardware level, allowing communication and computation to overlap. ...
Conference Paper
Since the advent of multi-core processors, the makeup of typical clusters has evolved dramatically. This new massively multi-core era is a major change in architecture, causing the evolution of programming models towards hybrid MPI+threads and therefore requiring new features at the low level. Modern communication subsystems now have to deal with multi-threading: the impact of thread safety, the contention on network interfaces, and the consequences of data locality for performance have to be studied carefully. In this paper, we present PIOMan, a scalable and generic lightweight task scheduling system for communication libraries. It is designed to ensure concurrent progression of multiple tasks of a communication library (polling, offload, multi-rail) through the use of multiple cores, while preserving locality to avoid contention and allow scalability to a large number of cores and threads. We have implemented the model, evaluated its performance, and compared it to state-of-the-art solutions regarding overhead, scalability, and communication/computation overlap.
... High-performance Ethernet has motivated many research groups to bypass or offload the operating-system bottleneck and design hardware-accelerated solutions, such as a hardware multicast routing device for dedicated wired networks that was developed for two-phase multicast (TPM) communications [37] (other examples are in [38-46]). Currently, these sophisticated network processors are often designed to cover a wide range of market applications. ...
Article
In the past three decades, tremendous Ethernet-related research has been done, which has led to today's ubiquitous Ethernet technology. On the other hand, with the emergence of new network needs, a new protocol, the IEEE 1394 standard serial bus (FireWire), was introduced. FireWire is suitable for high-quality audio/video applications that do not perform well under best-effort Ethernet. However, since FireWire is a serial bus, it has harsh cable-length limitations compared to Ethernet.
... iWARP network cards conceptually include TOEs and other functionality needed to implement the higher-layer protocols. Previous research has also considered using programmable components to accelerate network processing in specific situations [9, 15]. Our goal in this work is to enable more general access to programmable components for arbitrary networking, computing or I/O tasks. ...
Conference Paper
During the last two decades, a considerable amount of academic research has been conducted in the field of distributed computing. Typically, distributed applications require frequent network communication, which becomes a dominant factor in the overall runtime overhead. The recent proliferation of programmable peripheral devices for computer systems may be utilized to improve the performance of such applications. Offloading application-specific network functions to peripheral devices can improve performance and reduce host CPU utilization. Due to the peculiarities of each particular device and the difficulty of programming an outboard CPU, the need for an abstracted offloading framework is apparent. This paper proposes a novel offloading framework, called HYDRA, that enables utilization of such devices. The framework enables an application developer to design the offloading aspects of the application by specifying an "offloading layout", which is enforced by the runtime during application deployment. The performance of a variety of distributed algorithms can be significantly improved by utilizing such a framework. We demonstrate this claim by evaluating several offloaded applications: a distributed total message ordering algorithm and a packet generator.
... The work we have done on offloading parts of the IP and TCP protocols onto a NIC [8,4,5,6] and the experiences of our colleagues [9] have taught us that the keys to increasing performance are to reduce the communication costs between the network and the application and to reduce the number of cycles that the data must travel before being delivered to the application. Because we are only concerned about the number of cycles the data must travel, we process the protocol headers normally. ...
Conference Paper
In this paper, we present a new, conceptual model that captures the benefits of protocol offload in the context of high performance computing systems. In contrast to the LAWS model, the extensible message-oriented offload model (EMO) emphasizes communication in terms of messages rather than flows. In contrast to the LogP model, EMO emphasizes the performance of the network protocol rather than the parallel algorithm. The extensible message-oriented offload model allows protocol developers to consider the tradeoffs and specifics associated with offloading protocol processing, including the reduction in message latency along with benefits associated with reduction in overhead and improvements to throughput. We give an overview of the EMO model and show how our model can be mapped to the LAWS and LogP models. We also show how it can be used to analyze individual messages within TCP flows by contrasting full offload (TCP offload engines) with other approaches, e.g., interrupt coalescing and splintered TCP.
... In contrast, a significant amount of previous work is aimed at utilizing programmable network interface hardware, such as Quadrics [20] and Myrinet [2], for optimizing MPI performance. These optimizations include offloading MPI protocol processing [24,14], using hardware capabilities for efficient data movement (especially collective operations [7,17]), and using scatter/gather functionality for handling non-contiguous data transfers [27]. ...
Conference Paper
Processing-in-Memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small-message bandwidth, particularly when long queues are present.
... Some efforts have studied protocols that offload a portion of the MPI matching semantics. For example, in [16] portions of a Portals [5] stack were offloaded. Similarly, the Quadrics network [19] offloads the MPI matching stack onto the network interface, and others have explored this approach for Myrinet [28]. ...
Conference Paper
Summary form only given. Modern cluster interconnection networks rely on processing on the network interface to deliver higher bandwidth and lower latency than what could be achieved otherwise. These processors are relatively slow, but they provide adequate capabilities to accelerate some portion of the protocol stack in a cluster computing environment. This offload capability is conceptually appealing, but the standard evaluation of NIC-based protocol implementations relies on simplistic microbenchmarks that create idealized usage scenarios. We evaluate characteristics of MPI usage scenarios using application benchmarks to help define the parameter space that protocol offload implementations should target. Specifically, we analyze characteristics that we expect to have an impact on NIC resource allocation and management strategies, including the length of the MPI posted receive and unexpected message queues, the number of entries in these queues that are examined for a typical operation, and the number of unexpected and expected messages.
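The posted-receive and unexpected-message queues whose lengths this study measures implement standard MPI matching semantics: an incoming message searches the posted receives, and a newly posted receive searches the unexpected messages. The following is a generic sketch of those semantics, not the evaluated NIC firmware; the class and field names are illustrative assumptions.

```python
# Generic sketch of MPI two-queue matching. An incoming message is matched
# against posted receives in order; if none matches, it joins the unexpected
# queue, which later posted receives must search first.
from collections import deque

MPI_ANY_SOURCE = -1
MPI_ANY_TAG = -1

def matches(recv, msg):
    """A receive matches a message if source and tag agree (or are wildcards)."""
    src_ok = recv["source"] in (MPI_ANY_SOURCE, msg["source"])
    tag_ok = recv["tag"] in (MPI_ANY_TAG, msg["tag"])
    return src_ok and tag_ok

class MatchEngine:
    def __init__(self):
        self.posted = deque()        # receives waiting for data
        self.unexpected = deque()    # messages that arrived before a receive

    def post_recv(self, source, tag):
        """Post a receive: search unexpected messages first, else enqueue."""
        recv = {"source": source, "tag": tag}
        for msg in self.unexpected:
            if matches(recv, msg):
                self.unexpected.remove(msg)
                return msg           # matched an early arrival immediately
        self.posted.append(recv)
        return None

    def incoming(self, source, tag, payload):
        """Handle an arriving message: search posted receives, else enqueue."""
        msg = {"source": source, "tag": tag, "payload": payload}
        for recv in self.posted:
            if matches(recv, msg):
                self.posted.remove(recv)
                return msg           # delivered to a waiting receive
        self.unexpected.append(msg)  # no match yet: queue as unexpected
        return None
```

The quantity the paper characterizes is precisely how many entries these linear traversals examine under real application workloads, which drives NIC resource allocation when the engine runs on a slow embedded processor.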
... In an earlier paper, we reported our success in using splintering to reduce communication overhead. Using splintering, we were able to reduce the host processor utilization for large messages by 80% while maintaining high bandwidth [7]. ...
... Using splintering, we were able to greatly reduce the host processor utilization [7]. We use host processor availability as our measurement for the host processor utilization. ...
Article
Communication overhead and latency are critical factors for application performance in cluster computing based on commodity hardware. We propose a general strategy, splintering, to improve communication performance. In the splintering strategy, previously centralized functionality is broken into pieces, and the pieces are distributed among the processors in a system in such a way that ensures system integrity and improves performance. In a previous paper we demonstrated the benefits of using splintering to reduce communication overhead. In this paper, we describe our efforts to use splintering to reduce communication latency. To date, our efforts have not resulted in the improvement that we originally anticipated. In order to identify the sources of latency, we have done a thorough instrumentation of our implementation. Based on our analysis of our measurements, we propose several modifications to the MPI library and the NIC firmware.