Fig 2
Performance Comparison of TCP vs. Unix Domain Sockets as a Function of Message Size.


Source publication
Conference Paper
Full-text available
This paper presents the design and implementation of XenSocket, a UNIX-domain-socket-like construct for high-throughput interdomain (VM-to-VM) communication on the same system. The design of XenSocket replaces the Xen page-flipping mechanism with a static circular memory buffer shared between two domains, wherein information is written by one domain...
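To make the core idea concrete, the following is a minimal user-space sketch of such a circular buffer, assuming POSIX shared memory in place of Xen grant-table pages; the structure layout, the 1 MiB ring size, and the function names are illustrative assumptions, not the XenSocket kernel implementation.

```c
/* Illustrative sketch only: a single-producer/single-consumer circular
 * buffer in POSIX shared memory. XenSocket itself is a kernel module that
 * shares pages via Xen grant tables; names and sizes here are hypothetical. */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_SIZE (1u << 20)            /* 1 MiB data area (power of two) */

struct ring {
    _Atomic uint32_t head;              /* next byte the reader consumes */
    _Atomic uint32_t tail;              /* next byte the writer fills    */
    uint8_t data[RING_SIZE];
};

/* Map (and optionally create) the shared ring under a POSIX shm name. */
static struct ring *ring_attach(const char *name, int create)
{
    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (create && ftruncate(fd, sizeof(struct ring)) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, sizeof(struct ring), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : (struct ring *)p;
}

/* Writer side: copy as many bytes as currently fit, return bytes written. */
static size_t ring_write(struct ring *r, const void *buf, size_t len)
{
    uint32_t head = atomic_load(&r->head);
    uint32_t tail = atomic_load(&r->tail);
    size_t free_space = RING_SIZE - (size_t)(tail - head);
    size_t n = len < free_space ? len : free_space;

    for (size_t i = 0; i < n; i++)
        r->data[(tail + i) & (RING_SIZE - 1)] = ((const uint8_t *)buf)[i];
    atomic_store(&r->tail, tail + (uint32_t)n);   /* publish after copying */
    return n;
}

/* Reader side: drain up to len bytes, return bytes read. */
static size_t ring_read(struct ring *r, void *buf, size_t len)
{
    uint32_t head = atomic_load(&r->head);
    uint32_t tail = atomic_load(&r->tail);
    size_t avail = (size_t)(tail - head);
    size_t n = len < avail ? len : avail;

    for (size_t i = 0; i < n; i++)
        ((uint8_t *)buf)[i] = r->data[(head + i) & (RING_SIZE - 1)];
    atomic_store(&r->head, head + (uint32_t)n);
    return n;
}
```

Keeping separate head and tail indices lets the writer and reader make progress without locking each other, which is the property that allows a static shared ring to avoid per-message page flipping.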

Contexts in source publication

Context 1
... are assuming that, because the Virtual Machine Monitor (VMM) is much smaller than a modern monolithic kernel, it is therefore much harder to break. For example, Figure 2 shows the transport throughput of two guest domains on the same machine communicating through a TCP connection. For comparison, the figure also shows the throughput of two Unix processes communicating through a UNIX domain socket stream on a native Linux system. ...
Context 2
... these optimizations the authors achieve a maximum receive throughput of 970 Mb/s and transmit throughput of 3310 Mb/s. While these improvements are noteworthy, the performance of the resulting system still falls short of that of Unix Domain Sockets (over 10,000 Mb/s, see Figure 2). ...
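The comparison in Figure 2 boils down to a simple streaming micro-benchmark. The sketch below is an assumed, minimal version of such a measurement over a UNIX domain socket pair; the 1 GiB transfer volume, the default 4 KiB message size, and the overall setup are illustrative choices rather than the paper's methodology, and a TCP loopback variant would differ only in how the sockets are created.

```c
/* Minimal throughput sketch: stream a fixed volume of data over a UNIX
 * domain socket pair and report Mb/s for a given message size.  Partial
 * writes are ignored for brevity; constants are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t msg_size = argc > 1 ? (size_t)atol(argv[1]) : 4096;
    size_t total    = 1UL << 30;                  /* move 1 GiB in total */
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {                               /* child: drain the socket */
        char *rbuf = malloc(msg_size);
        close(sv[0]);
        while (read(sv[1], rbuf, msg_size) > 0)
            ;
        _exit(0);
    }

    close(sv[1]);
    char *buf = calloc(1, msg_size);
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (size_t sent = 0; sent < total; sent += msg_size)
        if (write(sv[0], buf, msg_size) < 0) {
            perror("write");
            return 1;
        }
    close(sv[0]);                                 /* EOF lets the child exit */
    wait(NULL);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("msg=%zu B  throughput=%.0f Mb/s\n",
           msg_size, total * 8.0 / secs / 1e6);
    return 0;
}
```

Running the same loop once over a socketpair and once over a TCP connection to 127.0.0.1, while sweeping the message size, produces the kind of curves shown in Figure 2.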

Citations

... We note that these overheads are incurred regardless of the type of network interface being used, loopback or remote. Zhang et al. [27] and Redis [26] report a significant throughput improvement when switching from TCP/IP loopback to UDS. ...
Preprint
Despite the stringent requirements of a real-time system, the reliance of the Robot Operating System (ROS) on the loopback network interface imposes a considerable overhead on the transport of high bandwidth data, while the nodelet package, which is an efficient mechanism for intra-process communication, does not address the problem of efficient local inter-process communication (IPC). To remedy this, we propose a novel integration into ROS of smart pointers and synchronisation primitives stored in shared memory. These obey the same semantics and, more importantly, exhibit the same performance as their C++ standard library counterparts, making them preferable to other local IPC mechanisms. We present a series of benchmarks for our mechanism - which we call LOT (Low Overhead Transport) - and use them to assess its performance on realistic data loads based on Five's Autonomous Vehicle (AV) system, and extend our analysis to the case where multiple ROS nodes are running in Docker containers. We find that our mechanism performs up to two orders of magnitude better than the standard IPC via local loopback. Finally, we apply industry-standard profiling techniques to explore the hotspots of code running in both user and kernel space, comparing our implementation against alternatives.
... Homogeneous partitions are created on top of an inhomogeneous FPGA fabric in order to abstract from the location, size, and access to the physical hardware. The RC2F components communicate with a host hypervisor (Xen) via vChan [46]. FPGAVirt, a virtIO-based framework for communication between VMs and FPGAs presented in [47], uses the in-kernel network stack for transferring packets to FPGAs and so reduces the overhead of context switches between VMs and the host address space. ...
Conference Paper
The increase of size, capabilities, and speed of FPGAs enables the shared usage of reconfigurable resources by multiple applications and even operating systems. While research on FPGA virtualization in HPC datacenters and the cloud is already well advanced, it is a rather new concept for embedded systems. The necessity for FPGA virtualization of embedded systems results from the trend to integrate multiple environments into the same hardware platform. As multiple guest operating systems with different requirements, e.g., regarding real-time, security, safety, or reliability, share the same resources, the focus of research lies on isolation under the constraint of having minimal impact on the overall system. Drivers for this development are, e.g., computation-intensive AI-based applications in the automotive or medical field, embedded 5G edge computing systems, or the consolidation of electronic control units (ECUs) on a centralized MPSoC with the goal to increase reliability by reducing complexity. This survey outlines key concepts of hypervisor-based virtualization of embedded reconfigurable systems. Hypervisor approaches are compared and classified into FPGA-based hypervisors, MPSoC-based hypervisors, and hypervisors for distributed embedded reconfigurable systems. Strong points and limitations are pointed out, and future trends for virtualization of embedded reconfigurable systems are identified.
... Several approaches to improving performance of communication between co-located virtual machines have been described [57][58][59], all focusing on Xen. These solve similar communication inefficiencies as Slipstream, but either require application modification [58], guest kernel modification [57][58][59], are not fully automatic [57,58], or operate at the IP layer so TCP overheads are not eliminated [59]. ...
Thesis
Managing the overwhelming complexity of software is a fundamental challenge because complexity is the root cause of problems regarding software performance, size, and security. Complexity is what makes software hard to understand, and our ability to understand software in whole or in part is essential to being able to address these problems effectively. Attacking this overwhelming complexity is the fundamental challenge I seek to address by simplifying how we write, organize, and think about programs. Within this dissertation I present a system of tools and a set of solutions for improving the nature of software by focusing on the programmer's desired outcome, i.e., their intent. At the program level, the conventional focus, it is impossible to identify complexity that, at the system level, is unnecessary. This "accidental complexity" includes everything from unused features to independent implementations of common algorithmic tasks. Software techniques driving innovation simultaneously increase the distance between what is intended by humans – developers, designers, and especially the users – and what the executing code does in practice. By preserving the declarative intent of the programmer, which is lost in the traditional process of compiling, linking, and building software, it is easier to abstract away unnecessary details. The Slipstream, ALLVM, and software multiplexing methods presented here automatically reduce the complexity of programs while retaining their intended function. This results in improved performance at both startup and run time, as well as reduced disk and memory usage.
... Research on IO resource scheduling has a long history [10], [23]-[27]. Network IO optimization between VMs mainly includes XenLoop [23], XenSocket [24], and XWay [25], which are different network communication optimization schemes based on shared memory. Mandal et al. [26] proposed a data performance evaluation system for network flows to estimate the performance impact of bandwidth on IO. ...
... The RNN was trained for 15 min (steps 5-13). The time complexity of the remaining part (steps 15-29) is O(N + M). ...
Article
The use of virtualization technology in industrial control is increasing. However, virtual instances or virtual machines (VMs) are confronted with unreasonable resource allocation in the industrial control cyber range, thereby highlighting the increasing importance of resource scheduling. In the Xen open source system, the response time of IO-intensive tasks is prolonged because the system does not distinguish between CPU- and IO-intensive tasks. Therefore, this study improves task performance through the hybrid scheduling of CPU and network IO resources. This method uses Cap- and Timeslice-scheduling algorithms for CPU resource scheduling. First, the Cap-scheduling algorithm uses historical data to train a recurrent neural network model for classification and then utilizes a heuristic method to set the upper limit of the cap value for VMs. Second, the Timeslice-scheduling algorithm adjusts the timeslice based on Q-learning to shorten the execution time of the overall tasks. This study also proposes an IOb-scheduling algorithm for network bandwidth scheduling: by monitoring the bandwidth usage of each VM, the portion of a VM's allocation that does not exceed the average bandwidth is reclaimed and distributed to other VMs, thereby improving the utilization of bandwidth. Experimental results showed that the proposed CPU/IO scheduling algorithms improved the overall benchmark performance substantially.
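As a point of reference for the Q-learning component mentioned above, the sketch below shows a generic tabular Q-learning update in C; the state and action encodings, learning rate, discount factor, and reward are hypothetical placeholders, not the article's actual scheduler design.

```c
/* Generic tabular Q-learning, sketched to illustrate the kind of learning
 * loop a Q-learning-based timeslice scheduler relies on.  The state/action
 * meanings below are made up for illustration. */
#define N_STATES  8      /* e.g. coarse buckets of observed IO latency   */
#define N_ACTIONS 3      /* e.g. shrink, keep, or grow the timeslice     */

static double Q[N_STATES][N_ACTIONS];
static const double alpha  = 0.1;   /* learning rate   */
static const double gamma_ = 0.9;   /* discount factor */

/* One update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) */
static void q_update(int s, int a, double reward, int s_next)
{
    double best_next = Q[s_next][0];
    for (int a2 = 1; a2 < N_ACTIONS; a2++)
        if (Q[s_next][a2] > best_next)
            best_next = Q[s_next][a2];
    Q[s][a] += alpha * (reward + gamma_ * best_next - Q[s][a]);
}

/* Greedy action choice for a state (exploration omitted for brevity). */
static int q_choose(int s)
{
    int best = 0;
    for (int a = 1; a < N_ACTIONS; a++)
        if (Q[s][a] > Q[s][best])
            best = a;
    return best;
}
```

In a scheduler setting, the reward would typically be derived from the measured task completion time after applying the chosen timeslice adjustment.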
... The frontend FIFOs and the FPGA memories are mapped to device files inside the host hypervisor. There, the system forwards the user devices to the assigned VM using inter-domain communication based on vChan from Zhang et al. [48] in our Xen virtualized environment, similar to the FPGA device virtualization pvFPGA [25]. Figure 14. ...
Article
Full-text available
Field Programmable Gate Arrays (FPGAs) provide a promising opportunity to improve performance, security and energy efficiency of computing architectures, which are essential in modern data centers. Especially the background acceleration of complex and computationally intensive tasks is an important field of application. The flexible use of reconfigurable devices within a cloud context requires abstraction from the actual hardware through virtualization to offer these resources to service providers. In this paper, we present our Reconfigurable Common Computing Frame (RC2F) approach – inspired by system virtual machines – for the profound virtualization of reconfigurable hardware in cloud services. Using partial reconfiguration, our framework abstracts a single physical FPGA into multiple independent virtual FPGAs (vFPGAs). A user can request vFPGAs of different size for optimal resource utilization and energy efficiency of the whole cloud system. To enable such flexibility, we create homogeneous partitions on top of an inhomogeneous FPGA fabric abstracting from physical locations and static areas. The RC2FSEC extension combines this virtualization with a security system to allow for processing of sensitive data. On the host side our Reconfigurable Common Cloud Computing Environment (RC3E) offers different service models and manages the allocation of the dynamic vFPGAs. We demonstrate the possibilities and the resource trade-off of our approach in a basic scenario. Moreover, we present future perspectives for the use of FPGAs in cloud-based environments.
... Circuits are established directly between VMs or, in the case of communicating with an external host, between a VM and a proxy stack in the host OS. The less privileged VM must always provide the buffer memory to prevent denial-of-service attacks [26]. Requests to establish a circuit are forwarded by the switch operator in between VMs and between VMs and the proxy stack. ...
... Direct inter-VM communication has been proposed in different variants. XenSockets [26] deviate from the TCP stream semantics and use a different socket interface, thereby requiring modified applications. XenLoop [25] proposes direct virtual Ethernet between VMs which we included into our performance evaluation. ...
Conference Paper
Although applications are nowadays often executed in virtual machines (VMs) to isolate applications or consolidate physical machines, VM network performance is still challenging. Packetization, encapsulation, congestion control, preparations for loss, and copying of data introduce unnecessary performance degradation within a system where VMs communicate over abundant and reliable shared memory. Although protocols like TCP are therefore not well suited for the kernel network stack in VMs, preexisting applications require the kernel socket interface to keep functioning. By eliminating the unnecessary overhead for inter-VM communication and shifting it to the host operating system for communication over a physical NIC, our approach increases performance both when communicating with another VM on the same host and when communicating with external hosts. Instead of multiplexing multiple connections over a single virtual Ethernet link, we use a separate shared-memory connection for each VM application socket. Our approach improves the stream and datagram performance of existing applications over an unmodified socket interface and brings the benefits of memory-mapped zero-copy IO to modified applications without sacrificing isolation between sockets.
... The frontend FIFOs and the FPGA memories are mapped to device files inside the host hypervisor. There, our system forwards the user devices to the assigned VM using inter-domain communication based on vChan from Zhang et al. [25] in our Xen virtualized environment, similar to pvFPGA [16]. ...
Article
Full-text available
Field Programmable Gate Arrays (FPGAs) provide a promising opportunity to improve performance, security and energy efficiency of computing architectures, which are essential in modern data centers. Especially the background acceleration of complex and computationally intensive tasks is an important field of application. The flexible use of reconfigurable devices within a cloud context requires abstraction from the actual hardware through virtualization to offer these resources to service providers. In this paper, we enhance our related Reconfigurable Common Computing Frame (RC2F) approach, which is inspired by system virtual machines, for the profound virtualization of reconfigurable hardware in cloud services. Using partial reconfiguration, our hardware and software framework virtualizes physical FPGAs to provide multiple independent user designs on a single device. Essential components are the management of the virtual user-defined accelerators (vFPGAs), as well as their migration between physical FPGAs to achieve higher system-wide utilization levels. We create homogeneous partitions on top of an inhomogeneous FPGA fabric to offer an abstraction from the physical location, size, and access to the real hardware. We demonstrate the possibilities and the resource trade-off of our approach in a basic scenario. Moreover, we present future perspectives for the use of FPGAs in cloud-based environments.
... Table 1 compares the related works and provides a brief description. The studies [10,13,16,21,30,35] provide co-resident VM detection at the VM level. Study [33] supports locality detection at the container level, and the work in [23] is publicly available. ...
... To eliminate performance overhead, we propose a high-performance two-layer locality-aware and NUMA-aware MPI library, which is able to dynamically detect co-resident containers inside one VM as well as co-resident VMs inside one host at MPI runtime, as described in Table 1. The rest of the paper is organized as follows. Section 2 mainly introduces two types of virtualization solutions, the nested virtualization solution, InfiniBand technology with SR-IOV support, and intra-node communication mechanisms. ...
Conference Paper
Hypervisor-based virtualization solutions offer good security and isolation, while container-based solutions make applications and workloads more portable and distributed in an effective, standardized and repeatable way. Therefore, nested virtualization based computing environments (e.g., container over virtual machine), which inherit the capabilities from both solutions, are becoming more and more attractive in clouds (e.g., running Docker over Amazon EC2 VMs). Recent studies have shown that running applications in either VMs or containers still has significant overhead, especially for I/O intensive workloads. This motivates us to investigate whether the nested virtualization based solution can be adopted to build high-performance computing (HPC) clouds for running MPI applications efficiently and where the bottlenecks lie. To eliminate performance bottlenecks, we propose a high-performance two-layer locality- and NUMA-aware MPI library, which is able to dynamically detect co-resident containers inside one VM as well as co-resident VMs inside one host at MPI runtime. Thus the MPI processes across different containers and VMs can communicate with each other through shared memory or Cross Memory Attach (CMA) channels instead of the network channel if they are co-resident. We further propose an enhanced NUMA-aware hybrid design that utilizes an InfiniBand loopback-based channel to optimize large message transfers across containers when they are running on different sockets. Performance evaluations show that, compared with the state-of-the-art (1Layer) design, our proposed enhanced-hybrid design brings up to 184%, 81%, and 12% benefit for point-to-point operations, collective operations, and end applications, respectively. Compared with the default performance, our enhanced-hybrid design delivers up to 184%, 85%, and 16% performance improvement.
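The Cross Memory Attach channel mentioned in this abstract is exposed on Linux through the process_vm_readv/process_vm_writev system calls. Below is a minimal, assumed sketch of using process_vm_readv to pull data directly out of a co-resident peer process; in a real MPI library the peer's pid and buffer address would be exchanged at startup, and the helper name cma_read is made up for illustration.

```c
/* Illustrative only: copy bytes straight out of a co-resident peer
 * process's address space using Cross Memory Attach (Linux >= 3.2).
 * Requires the same UID as the target, or CAP_SYS_PTRACE. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: read `len` bytes from `remote_addr` in `pid` into `dst`. */
static ssize_t cma_read(pid_t pid, void *dst, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = dst,         .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    pid_t pid  = (pid_t)atoi(argv[1]);
    void *addr = (void *)(uintptr_t)strtoull(argv[2], NULL, 16);
    char buf[64];

    ssize_t n = cma_read(pid, buf, addr, sizeof buf);
    if (n < 0) {
        perror("process_vm_readv");
        return 1;
    }
    printf("copied %zd bytes from pid %d\n", n, (int)pid);
    return 0;
}
```

Because the kernel copies directly between the two address spaces, CMA needs only a single copy per transfer, which is why such libraries prefer it over loopback networking for co-resident peers.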
... Zhang et al. [30] present an approach called XenSocket. XenSocket uses a shared-memory buffer between communicating VMs to bypass the network protocol stack completely, resulting in high throughput. ...
Article
Full-text available
Virtual clusters (VCs) have exhibited various advantages over traditional cluster computing platforms by virtue of their extensibility, reconfigurability, and maintainability. As such, they have become a major execution environment for cloud-based cluster applications. However, compared with traditional clusters, their distributed-memory programming paradigm still remains largely unchanged, which implies that cluster applications cannot be efficiently deployed in VCs, especially when virtual machines (VMs) are running in different physical hosts. Recently, some efforts have been made to improve inter-VM communication, resulting in many studies on how cluster applications could take advantage of VCs. However, most of them mainly focus on the situation in which the VMs are all co-resident on the same physical machine, where the message passing mechanism is usually optimized away by exploiting the host's shared memory. In this paper, we present a design and implementation of Naplus, a kernel-based virtual machine approach to inter-VM communication across different physical hosts. Naplus is based on Nahanni, a mechanism for shared-memory communication in virtual environments. As such, it not only inherits the major merits of Nahanni with respect to flexible data structures and efficient synchronization but also achieves a shared-memory paradigm among VMs. With Naplus, we enable the size of the shared space to be maximized, as large as the sum of each machine's local memory, to accommodate cluster applications with large memory footprints. We prototype Naplus in a dual-host system where an empirical study is conducted to show the effectiveness of the Naplus approach in achieving distributed shared memory for VCs in data centers.
... Instead of going through the traditional network stack, communicating through shared memory shortens the communication path, avoids the barrier cost of the hypervisor, and improves data transmission efficiency. Although existing proposals [1], [4], [5], [6], [7], [8], [9] differ from one another in terms of concrete design and implementation decisions, most of these efforts suffer from some of the following problems: poor scalability in terms of shared memory management for different types of workloads and dynamic VM deployment [10], [11], and multiple copies of network packets between the VM kernel buffer and the shared memory. ...
... The shared memory mechanism was first introduced to pass data between programs running on the same operating system in order to avoid redundant copies. Inspired by this idea, shared memory approaches have been developed to accelerate co-located inter-VM communication [1], [4], [5], [6], [7], [8], [9]. The general idea is to transmit data from a sender VM to a co-located receiver VM through a shared memory channel, bypassing the hypervisor. ...
... Implementing shared memory mechanisms below the IP layer offers full transparency. Examples include XenSocket [8], XenLoop [1] and MMNet [7]. ...