Figure 4 - uploaded by Rolf Riesen
Processing Incoming Data Packets on the NIC  

Source publication
Conference Paper
Offloading protocol processing will become an important tool in supporting our efforts to deliver increasing bandwidth to applications. In this paper we describe our experience in offloading protocol processing to a programmable gigabit Ethernet network interface card. For our experiments, we selected a simple RTS/CTS (request to send/clear to send...
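The RTS/CTS protocol the paper offloads is a rendezvous scheme: a sender announces a message, and data flows only once the receiver has reserved a buffer. A minimal sketch of that handshake follows; the message names, buffer-slot accounting, and direct method calls standing in for the wire are illustrative assumptions, not the paper's actual firmware interface.

```python
# Hypothetical sketch of an RTS/CTS (rendezvous) exchange of the kind the
# paper offloads to the NIC. Message names and buffer accounting are assumed.

class Receiver:
    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots   # receive buffers available on the NIC
        self.delivered = []

    def handle_rts(self, msg_id, length):
        """On a request-to-send, grant CTS only if a buffer can be reserved."""
        if self.free_slots > 0:
            self.free_slots -= 1
            return ("CTS", msg_id)       # clear to send
        return ("NACK", msg_id)          # no buffer free: sender retries later

    def handle_data(self, msg_id, payload):
        """Data arrives only after a CTS, so a buffer is always waiting."""
        self.delivered.append((msg_id, payload))
        self.free_slots += 1             # buffer released once data is consumed

class Sender:
    def send(self, receiver, msg_id, payload):
        """Two-phase send: transfer the payload only after CTS is granted."""
        reply = receiver.handle_rts(msg_id, len(payload))
        if reply[0] == "CTS":
            receiver.handle_data(msg_id, payload)
            return True
        return False                     # deferred; no receive-buffer overrun
```

The point of the two phases is that data never arrives at a receiver without a buffer already reserved for it, which is what makes the exchange attractive to run on the NIC without host involvement.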

Citations

... Even though VS shares the benefits from recent advances in ISP [6,9,14,23,31,33,35,40,67,75,76,82,85,89,96] and near-data processing [2,17,20,39,48,50,63,80,84], these frameworks need the mechanisms that VS offers in order to execute approximate-computing applications efficiently. And while using approximate computing in channel encoding [38,62] and the memory controller [30] can achieve an effect similar to that of VS in terms of reducing data-movement overhead, VS is independent of these projects and requires no hardware changes. ...
Conference Paper
Approximate computing that works on less precise data leads to significant performance gains and energy-cost reductions for compute kernels. However, without leveraging the full-stack design of computer systems, modern computer architectures undermine the potential of approximate computing. In this paper, we present Varifocal Storage, a dynamic multi-resolution storage system that tackles challenges in performance, quality, flexibility and cost for computer systems supporting diverse application demands. Varifocal Storage dynamically adjusts the dataset resolution within a storage device, thereby mitigating the performance bottleneck of exchanging/preparing data for approximate compute kernels. Varifocal Storage introduces Autofocus and iFilter mechanisms to provide quality control inside the storage device and make programs more adaptive to diverse datasets. Varifocal Storage also offers flexible, efficient support for approximate and exact computing without exceeding the costs of conventional storage systems by (1) saving the raw dataset in the storage device, and (2) targeting operators that complement the power of existing SSD controllers to dynamically generate lower-resolution datasets. We evaluate the performance of Varifocal Storage by running applications on a heterogeneous computer with our prototype SSD. The results show that Varifocal Storage can speed up data resolution adjustments by 2.02× or 1.74× without programmer input. Compared to conventional approximate-computing architectures, Varifocal Storage speeds up the overall execution time by 1.52×.
... Experiences with offloading protocol processing to a programmable Gigabit Ethernet NIC are presented in [3], where it is shown that network throughput almost doubles when jumbo frames are used. ...
... 3. Jitter values and UDP packet dropped values are comparable between jumbo and normal frames. ...
Article
Enhancing network performance has been studied by a number of researchers, with the need for greater throughput on network infrastructure as the key driver. The use of jumbo frames is considered one of the methodologies that can be employed to increase data throughput on networks. In this research undertaking, the authors implement jumbo frames on a test bed built on Windows Server 2003/2008 networks, and performance-related metrics are measured for both IPv4 and IPv6 implementations. The results of this empirical study show that performance-metric values differ across the various scenarios.
... A similar conclusion is drawn in [2], where details of performance tests are presented showing that jumbo frames and fast processors are necessary to get the best performance. Experiences with offloading protocol processing to a programmable Gigabit Ethernet NIC are presented in [3]. ...
Article
With the current limitations of network technology, jumbo frames provide a mechanism that allows a greater amount of data to be transferred efficiently over the wire. Implementation of jumbo frames is a viable option on Gigabit Ethernet. In this research undertaking, the authors implement jumbo frames on a test bed built with Windows routers, and performance-related metrics are measured for both IPv4 and IPv6 implementations. The results of this empirical study show that performance-metric values differ across the various scenarios.
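The throughput benefit of jumbo frames that these studies measure comes largely from amortizing fixed per-frame overhead over a larger payload. A quick back-of-the-envelope calculation, assuming standard Ethernet framing and IPv4/TCP headers without options, illustrates the wire-efficiency difference:

```python
# Wire efficiency of TCP payload over Ethernet for a given MTU.
# Fixed per-frame overheads (standard values, no header options assumed):
ETH_HEADER = 14      # Ethernet header (dest MAC, src MAC, EtherType)
FCS = 4              # frame check sequence
PREAMBLE_IFG = 20    # preamble + SFD (8) plus minimum inter-frame gap (12)
IP_HEADER = 20       # IPv4 header, no options
TCP_HEADER = 20      # TCP header, no options

def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that is actual TCP payload."""
    payload = mtu - IP_HEADER - TCP_HEADER
    on_wire = mtu + ETH_HEADER + FCS + PREAMBLE_IFG
    return payload / on_wire

print(f"1500-byte MTU: {wire_efficiency(1500):.1%}")  # ~94.9%
print(f"9000-byte MTU: {wire_efficiency(9000):.1%}")  # ~99.1%
```

Raw header amortization accounts for only a few percent; the larger gains reported in these papers come from the reduced per-packet processing cost (fewer interrupts, fewer header traversals) on the hosts.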
... A similar approach consists in offloading a part of the protocol processing onto specialized NICs [11]. Such a technique could be used to handle the rendezvous handshake at the hardware level, allowing communication and computation to overlap. ...
Conference Paper
Since the advent of multi-core processors, the makeup of typical clusters has evolved dramatically. This new massively multi-core era is a major change in architecture, causing the evolution of programming models towards hybrid MPI+threads and therefore requiring new features at the low level. Modern communication subsystems now have to deal with multi-threading: the impact of thread safety, the contention on network interfaces, and the consequences of data locality for performance have to be studied carefully. In this paper, we present PIOMan, a scalable and generic lightweight task scheduling system for communication libraries. It is designed to ensure concurrent progression of multiple tasks of a communication library (polling, offload, multi-rail) through the use of multiple cores, while preserving locality to avoid contention and allow scalability to a large number of cores and threads. We have implemented the model, evaluated its performance, and compared it to state-of-the-art solutions regarding overhead, scalability, and communication/computation overlap.
... High-performance Ethernet has motivated many research groups to bypass or offload the operating-system bottleneck and design hardware-accelerated solutions, such as a hardware multicast routing device for dedicated wired networks that was developed for two-phase multicast (TPM) communications [37] (other examples are in [38-46]). Currently, these sophisticated network processors are often designed to cover a wide range of market applications. ...
Article
In the past three decades, tremendous Ethernet-related research has been done, which has led to today's ubiquitous Ethernet technology. On the other hand, with the emergence of new network needs, a new protocol, the IEEE 1394 standard serial bus (FireWire), was introduced. FireWire is suitable for high-quality audio/video applications that do not perform well under best-effort Ethernet. However, since FireWire is a serial bus, it has harsh cable-length limitations compared to Ethernet.
... iWARP network cards conceptually include TOEs and other functionality needed to implement the higher-layer protocols. Previous research has also considered using programmable components to accelerate network processing in specific situations [9, 15]. Our goal in this work is to enable more general access to programmable components for arbitrary networking, computing or I/O tasks. ...
Conference Paper
During the last two decades, a considerable amount of academic research has been conducted in the field of distributed computing. Typically, distributed applications require frequent network communication, which becomes a dominant factor in the overall runtime overhead. The recent proliferation of programmable peripheral devices for computer systems may be utilized to improve the performance of such applications. Offloading application-specific network functions to peripheral devices can improve performance and reduce host CPU utilization. Due to the peculiarities of each particular device and the difficulty of programming an outboard CPU, the need for an abstracted offloading framework is apparent. This paper proposes a novel offloading framework, called HYDRA, that enables utilization of such devices. The framework enables an application developer to design the offloading aspects of the application by specifying an "offloading layout", which is enforced by the runtime during application deployment. The performance of a variety of distributed algorithms can be significantly improved by utilizing such a framework. We demonstrate this claim by evaluating several offloaded applications: a distributed total message ordering algorithm and a packet generator.
... The work we have done on offloading parts of the IP and TCP protocols onto a NIC [8,4,5,6] and the experiences of our colleagues [9] have taught us that the keys to increasing performance are to reduce the communication costs between the network and the application and to reduce the number of cycles that the data must travel before being delivered to the application. Because we are only concerned about the number of cycles the data must travel, we process the protocol headers normally. ...
Conference Paper
In this paper, we present a new, conceptual model that captures the benefits of protocol offload in the context of high performance computing systems. In contrast to the LAWS model, the extensible message-oriented offload model (EMO) emphasizes communication in terms of messages rather than flows. In contrast to the LogP model, EMO emphasizes the performance of the network protocol rather than the parallel algorithm. The extensible message-oriented offload model allows protocol developers to consider the tradeoffs and specifics associated with offloading protocol processing, including the reduction in message latency along with benefits associated with reduction in overhead and improvements to throughput. We give an overview of the EMO model and show how our model can be mapped to the LAWS and LogP models. We also show how it can be used to analyze individual messages within TCP flows by contrasting full offload (TCP offload engines) with other approaches, e.g., interrupt coalescing and splintered TCP.
... In contrast, a significant amount of previous work is aimed at utilizing programmable network interface hardware, such as Quadrics [20] and Myrinet [2], for optimizing MPI performance. These optimizations include offloading MPI protocol processing [24,14], using hardware capabilities for efficient data movement (especially collective operations [7,17]), and using scatter/gather functionality for handling non-contiguous data transfers [27]. ...
Conference Paper
Processing-in-Memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small-message bandwidth, particularly when long queues are present.
... Some efforts have studied protocols that offload a portion of the MPI matching semantics. For example, in [16] portions of a Portals [5] stack were offloaded. Similarly, the Quadrics network [19] offloads the MPI matching stack onto the network interface, and others have explored this approach for Myrinet [28]. ...
Conference Paper
Summary form only given. Modern cluster interconnection networks rely on processing on the network interface to deliver higher bandwidth and lower latency than what could be achieved otherwise. These processors are relatively slow, but they provide adequate capabilities to accelerate some portion of the protocol stack in a cluster computing environment. This offload capability is conceptually appealing, but the standard evaluation of NIC-based protocol implementations relies on simplistic microbenchmarks that create idealized usage scenarios. We evaluate characteristics of MPI usage scenarios using application benchmarks to help define the parameter space that protocol offload implementations should target. Specifically, we analyze characteristics that we expect to have an impact on NIC resource allocation and management strategies, including the length of the MPI posted receive and unexpected message queues, the number of entries in these queues that are examined for a typical operation, and the number of unexpected and expected messages.
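The posted-receive and unexpected-message queues whose lengths this study measures implement standard MPI matching semantics: an incoming message searches the posted receives, and a newly posted receive searches the unexpected messages. The following is a generic sketch of those semantics, not the evaluated NIC firmware; the class and field names are illustrative assumptions.

```python
# Generic sketch of MPI two-queue matching. An incoming message is matched
# against posted receives in order; if none matches, it joins the unexpected
# queue, which later posted receives must search first.
from collections import deque

MPI_ANY_SOURCE = -1
MPI_ANY_TAG = -1

def matches(recv, msg):
    """A receive matches a message if source and tag agree (or are wildcards)."""
    src_ok = recv["source"] in (MPI_ANY_SOURCE, msg["source"])
    tag_ok = recv["tag"] in (MPI_ANY_TAG, msg["tag"])
    return src_ok and tag_ok

class MatchEngine:
    def __init__(self):
        self.posted = deque()        # receives waiting for data
        self.unexpected = deque()    # messages that arrived before a receive

    def post_recv(self, source, tag):
        """Post a receive: search unexpected messages first, else enqueue."""
        recv = {"source": source, "tag": tag}
        for msg in self.unexpected:
            if matches(recv, msg):
                self.unexpected.remove(msg)
                return msg           # matched an early arrival immediately
        self.posted.append(recv)
        return None

    def incoming(self, source, tag, payload):
        """Handle an arriving message: search posted receives, else enqueue."""
        msg = {"source": source, "tag": tag, "payload": payload}
        for recv in self.posted:
            if matches(recv, msg):
                self.posted.remove(recv)
                return msg           # delivered to a waiting receive
        self.unexpected.append(msg)  # no match yet: queue as unexpected
        return None
```

The quantity the paper characterizes is precisely how many entries these linear traversals examine under real application workloads, which drives NIC resource allocation when the engine runs on a slow embedded processor.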
... In an earlier paper, we reported our success in using splintering to reduce communication overhead. Using splintering, we were able to reduce the host processor utilization for large messages by 80% while maintaining high bandwidth [7]. ...
... Using splintering, we were able to greatly reduce the host processor utilization [7]. We use host processor availability as our measurement for the host processor utilization. ...
Article
Communication overhead and latency are critical factors for application performance in cluster computing based on commodity hardware. We propose a general strategy, splintering, to improve communication performance. In the splintering strategy, previously centralized functionality is broken into pieces, and the pieces are distributed among the processors in a system in such a way that ensures system integrity and improves performance. In a previous paper we demonstrated the benefits of using splintering to reduce communication overhead. In this paper, we describe our efforts to use splintering to reduce communication latency. To date, our efforts have not resulted in the improvement that we originally anticipated. In order to identify the sources of latency, we have done a thorough instrumentation of our implementation. Based on our analysis of our measurements, we propose several modifications to the MPI library and the NIC firmware.