Fig 1 - uploaded by Ning Weng
Router System with Network Processor. Packets are processed by one of multiple processing cores in the network processor.


Source publication
Conference Paper
Full-text available
Network processing is becoming an increasingly important paradigm as the Internet moves towards an architecture with more complex functionality inside the network. Modern routers not only forward packets, but also process headers and payloads to implement a variety of functions related to security, performance, and customization. It is important to...

Context in source publication

Context 1
... processing tasks are performed on the network processor before the packets are passed on through the router switching fabric and through the next network link. This is illustrated in Figure 1. Design space exploration of NP architectures, development of novel protocols and network processing applications, and the creation of suitable programming abstractions for such parallel embedded systems are current areas of research. ...

Similar publications

Article
With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have imped...
Conference Paper
Full-text available
Some instructions have more impact on processor performance than others. Identification of these critical instructions can be used to modify and improve instruction processing. Previous work has shown that the criticality of instructions can be dynamically predicted with high accuracy, and that this information can be leveraged to optimize the perf...

Citations

... On the one hand, packet switching is usually packet-size dependent, since most software routers handle packets using the store-and-forward paradigm. Following [61], we apply a simple linear model consisting of a variable part a per byte (dependent on the packet size S) and a constant part b, i.e., x = a · S + b. Based on real testbed measurements (cf. Table 3 of [2]), we derive the calibration values a ≈ 2.34 ns/B and b ≈ 272.47 ns for packet switching (aka. ...
... On the other hand, in the case of IP forwarding, the effort for packet processing (e.g. updating the IP header) is independent of the packet size [61]. Thus, we model the effort for IP packet processing as an additional constant per-packet overhead c. ...
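The cost model quoted above can be sketched directly in code. This is a minimal illustration using the cited calibration values (a ≈ 2.34 ns/B, b ≈ 272.47 ns); the function names and the example overhead value passed for c are my own, not from the source.

```python
# Sketch of the per-packet processing-time model x = a * S + b,
# with the testbed-derived calibration constants quoted in the text.

A_SWITCH_NS_PER_BYTE = 2.34    # variable cost a (ns per byte)
B_SWITCH_NS = 272.47           # constant cost b (ns per packet)

def switching_time_ns(packet_size_bytes: float) -> float:
    """Packet-switching cost x = a * S + b (size-dependent)."""
    return A_SWITCH_NS_PER_BYTE * packet_size_bytes + B_SWITCH_NS

def forwarding_time_ns(packet_size_bytes: float, c_ip_ns: float) -> float:
    """Switching plus the size-independent IP-processing overhead c."""
    return switching_time_ns(packet_size_bytes) + c_ip_ns

# Example: a 1500-byte packet
print(switching_time_ns(1500))  # 2.34 * 1500 + 272.47 = 3782.47 ns
```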
Article
Network devices based on commodity hardware are capable of high-speed packet processing while maintaining the programmability and extensibility of software. Thus, software-based network devices, like software routers, software-based firewalls, or monitoring systems, constitute a cost-efficient and flexible alternative to expensive, special purpose hardware. The overall packet processing performance in resource-constrained nodes can be strongly increased through parallel processing based on off-the-shelf multi-core processors. However, synchronization and coordination of parallel processing may counteract the corresponding network node performance. We describe how multi-core software routers can be optimized for real-time traffic by utilizing the technologies available in commodity hardware. Furthermore, we propose a low latency extension for the Linux NAPI. For the analysis, we use our approach for modeling resource contention in resource-constrained nodes which is also implemented as a resource-management extension module for ns-3. Based on that, we derive a QoS-aware software router model which we use to evaluate our performance optimizations. Our case study shows that the different scheduling strategies of a software router have significant influence on the performance of handling real-time traffic.
... They proposed a queuing network model for performance prediction of shared-memory multiprocessor systems in parallel protocol execution. Ramaswamy et al. [11] presented a framework for considering the packet processing cost in network simulations. In our previous work [12], we proposed a general approach for realistic modeling of resource management in resource-constrained packet processing nodes. ...
... 2) Routing: Every packet is subjected to forwarding as well as full IP routing including routing table lookup, checksum calculation, etc. The effort for updating the IP header is equal for small and large packet sizes [11]. Thus, this workload is modeled as a_R = a_F and b_R = b_F + c_R, where c_R ≈ 225 ns represents the IP routing effort. ...
... 3) IPsec: In addition to forwarding and full IP routing, each packet is encrypted using AES-128 encryption, as is common in VPNs. This workload is CPU-intensive and strongly dependent on the IP packet size [11]. Thus, we model this with a_I ≈ 36 ns/B and b_I = b_R. ...
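The three workloads quoted above (forwarding, routing, IPsec) all follow the same linear form t(S) = a · S + b. The following sketch instantiates them; a_F and b_F are taken from the forwarding calibration quoted earlier on this page, while the helper function and names are illustrative assumptions.

```python
# Sketch of the three cited workload models, each of the form
# t(S) = a * S + b (S = packet size in bytes, result in ns).
# a_F, b_F come from the forwarding calibration quoted earlier;
# c_R and a_I are the routing / AES-128 constants from the text.

A_F = 2.34      # ns per byte, forwarding (variable part)
B_F = 272.47    # ns per packet, forwarding (constant part)
C_R = 225.0     # ns per packet, extra IP-routing effort

# Routing: same per-byte cost, extra constant overhead.
A_R, B_R = A_F, B_F + C_R
# IPsec: AES-128 makes the per-byte cost dominate.
A_I, B_I = 36.0, B_R

def workload_time_ns(a: float, b: float, size_bytes: int) -> float:
    return a * size_bytes + b

for name, (a, b) in {"forward": (A_F, B_F),
                     "route": (A_R, B_R),
                     "ipsec": (A_I, B_I)}.items():
    print(name, workload_time_ns(a, b, 1500))
```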
Article
The rapid growth of link bandwidths on one hand, and the emergence of resource-constrained nodes (e.g. software routers) on the other hand, will cause network nodes to be the bottleneck in the future. Parallel processing using multi-core processors can increase the packet processing of resource-constrained nodes and alleviate the problem. However, intra-node resource contention can have a strong negative impact on the corresponding network node and, therefore, also on the overall performance of the network. Commonly used network simulators (e.g. ns-3) only offer a rather simplistic node model and do not take into account intra-node resource contention. We propose a unified and extensible approach to model intra-node resource management in resource-constrained nodes. Our model makes it possible to identify and predict performance bottlenecks in networks. We have implemented our model as an extension to the network simulator ns-3. Simulation results from different case studies show that our approach significantly outperforms the original ns-3 in terms of realistic modeling.
... Furthermore, every packet is subjected to IP routing including routing table lookup, checksum calculation, etc. The effort for updating the IP header is equal for small and large packet sizes [28]. Thus, the effort for IP routing is represented with a constant overhead of c ≈ 225 ns. ...
Conference Paper
Commodity hardware can be used to build a software router that is capable of high-speed packet processing while being programmable and extensible. Therefore, software routers provide a cost-efficient alternative to expensive, special hardware routers. The efficiency of packet processing in resource-constrained nodes (e.g. software routers) can be strongly increased through parallel processing with commodity hardware based on multi-core processors. However, intra-node resource contention can have a strong negative impact on the corresponding network node. We describe how multi-core software routers can be optimized for low latency support by utilizing the technologies available in commodity PC hardware. For the analysis we used our approach for modeling of resource contention in resource-constrained nodes, which is also implemented as the resource-management extension module for ns-3. Based on that, we derived a specific software router model which we used to optimize the performance. Our measurements show that the configuration of a software router has significant influence on the performance. The results can be used for parameter tuning in such systems.
... Interested readers may refer to [10] for a detailed explanation of why this critical section causes the decrease of line rate when adding more than two threads ([16]-[24]). ...
Conference Paper
With the ever-expanding design space and workload space of the multicore era, it is a challenge to identify optimal design points quickly, as is desirable during the early stages of multicore processor design or programming. To meet this challenge, this paper proposes a theoretical framework that can capture the general performance properties for a class of multicore processors of interest over a large design space and workload space, free of scalability issues. The idea is to model multicore processors at the thread level, overlooking instruction-level and microarchitectural details. In particular, queuing network models that describe multicore processors at the thread level are developed and solved with an iterative procedure over a large design space and workload space. This framework scales to virtually unlimited numbers of cores and threads. Testing of the procedure demonstrates that the throughput performance of many-core processors with 1000 cores can be evaluated within a few seconds on an Intel Pentium 4 computer, with results within 5% of the simulation data obtained from a thread-level simulator.
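The thread-level queuing approach described above can be illustrated with exact Mean Value Analysis (MVA) for a closed network, in which each "customer" is a hardware thread cycling between a core and memory. This is a generic MVA sketch, not the paper's actual iterative procedure, and the two service demands are illustrative assumptions.

```python
# Sketch: exact Mean Value Analysis (MVA) for a closed product-form
# queuing network, as an example of thread-level performance modeling.
# Each thread alternates between two FCFS stations (core, memory).

def mva_throughput(demands, n_threads):
    """Return system throughput for n_threads customers and the
    given per-station service demands (time units per packet)."""
    q = [0.0] * len(demands)          # mean queue lengths
    x = 0.0
    for n in range(1, n_threads + 1):
        # Residence time at each station grows with its queue length.
        r = [d * (1 + qi) for d, qi in zip(demands, q)]
        x = n / sum(r)                # throughput with n customers
        q = [x * ri for ri in r]      # Little's law per station
    return x

# Illustrative demands: core execution vs. memory access per packet.
CORE_DEMAND, MEM_DEMAND = 1.0, 0.5
for t in (1, 4, 16):
    print(t, "threads ->", round(mva_throughput([CORE_DEMAND, MEM_DEMAND], t), 3))
```

Throughput saturates at 1 / max(demands) as threads are added, which is the bottleneck behavior such models are used to predict.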
... The effectiveness of the runtime management system in utilizing all system resources decreases as more processors become available. The number of instructions executed by a service ranges from 500 to 2,000 per packet [17]. We assume an average packet size of 300 bytes. ...
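The figures in this excerpt (500-2,000 instructions per packet, 300-byte average packets) imply an aggregate instruction budget once a line rate is fixed. A back-of-the-envelope sketch, where the 10 Gb/s link speed is my own illustrative assumption:

```python
# Back-of-the-envelope instruction budget from the cited figures:
# 500-2,000 instructions per packet, 300-byte average packet size.
# LINE_RATE_GBPS is an assumed link speed, not from the text.

PACKET_SIZE_BYTES = 300
INSTR_PER_PACKET = (500, 2000)
LINE_RATE_GBPS = 10            # illustrative assumption

packets_per_sec = LINE_RATE_GBPS * 1e9 / 8 / PACKET_SIZE_BYTES
for ipp in INSTR_PER_PACKET:
    mips_needed = packets_per_sec * ipp / 1e6
    print(f"{ipp} instr/pkt -> {mips_needed:,.0f} MIPS aggregate")
```

Even at the low end this exceeds what a single core can sustain, which is why such workloads are spread over multiple processing cores.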
Conference Paper
Full-text available
Custom packet processing functionality in routers is one of the key characteristics of next-generation Internet architectures. Network services have been proposed as an abstraction to describe, compose, and deploy end-to-end connections with custom communication features. We present a novel hardware architecture for high-performance processing of such network services in the data path. The design provides simple processing units to implement services and a custom hardware infrastructure to manage packets and processing context. The design allows for simple software development, flexible network service allocation, and high scalability to handle traffic at Gigabit line rates.
... Ramaswamy presented an analysis of aspects such as memory access, unique instruction counts, data memory requests and per-packet instruction complexity across four header applications [159]. In [160] cache behaviour, instruction level parallelism and instruction sequences of the TCP/IP protocol stack were examined and compared to the SPEC benchmark, with a number of possible ISA extensions proposed, while a workload analysis of NP-based cryptographic algorithms was presented in [161], [162] and [163]. ...
Conference Paper
Full-text available
To meet future requirements of higher bandwidth while providing ever more complex functions, network processors will require a number of methods for improving processing performance. One such method involves deeper processor pipelines to obtain higher operating frequencies. The penalty costs associated with deeper pipelines have been mitigated by implementing prediction schemes, with previous execution history used to determine future decisions. In this paper we present an analysis of common branch prediction schemes when applied to network applications. Using widespread network applications, we find that, unlike in general purpose processing, hit rates in excess of 95% can be obtained in a network processor using a small 256-entry single-level predictor. While our research demonstrates the low silicon cost of implementing a branch predictor, the long run times of network applications can leave the majority of the predictor logic idle, increasing static power and reducing device utilization.
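A small single-level predictor of the kind discussed can be sketched as a 256-entry table of 2-bit saturating counters indexed by the low bits of the branch PC. The table size matches the abstract; the 2-bit bimodal organization is an assumption about the predictor's internals.

```python
# Sketch of a 256-entry single-level (bimodal) branch predictor:
# one 2-bit saturating counter per entry, indexed by the branch PC.

TABLE_SIZE = 256

class BimodalPredictor:
    def __init__(self):
        # Counters start weakly not-taken (state 1); states range 0-3.
        self.table = [1] * TABLE_SIZE

    def predict(self, pc: int) -> bool:
        # Predict taken when the counter is in its upper half.
        return self.table[pc % TABLE_SIZE] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % TABLE_SIZE
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop-exit-style branch (taken 99 times out of 100) trains quickly,
# illustrating how network workloads can exceed 95% hit rates.
p = BimodalPredictor()
hits = 0
for trip in range(1000):
    taken = (trip % 100) != 99
    hits += p.predict(0x4000) == taken
    p.update(0x4000, taken)
print(hits / 1000)
```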
... For instance, a voice-over-IP call made from a cell phone to a PSTN phone must go through a media gateway that performs audio transcoding "on the fly", as the two end points often use different audio compression standards. Examples of in-network processing services are increasingly abundant, from security and performance-enhancing proxies (PEP) to media translation [1] [2]. These services add load to the processing capacity of the network components. ...
Article
Full-text available
This paper examines congestion control issues for TCP flows that require in-network processing on the fly in network elements such as gateways, proxies, firewalls and even routers. Applications of these flows will become increasingly abundant as the Internet evolves. Since these flows require use of CPUs in network elements, both bandwidth and CPU resources can be a bottleneck, and thus congestion control must deal with "congestion" on both of these resources. In this paper, we show that conventional TCP/AQM schemes can significantly lose throughput and suffer harmful unfairness in this environment, particularly when CPU cycles become more scarce (which is likely the trend given the recent explosive growth rate of bandwidth). As a solution to this problem, we establish a notion of dual-resource proportional fairness and propose an AQM scheme, called Dual-Resource Queue (DRQ), that can closely approximate proportional fairness for TCP Reno sources with in-network processing requirements. DRQ is scalable because it does not maintain per-flow states while minimizing communication among different resource queues, and is also incrementally deployable because it requires no change to TCP stacks. The simulation study shows that DRQ approximates proportional fairness without much implementation cost and that even an incremental deployment of DRQ at the edge of the Internet improves the fairness and throughput of these TCP flows. Our work is at an early stage and might lead to an interesting development in congestion control research.
... As networks connect an increasing number of embedded devices (both as end-systems and as intermediate hops), power constraints are becoming increasingly important. Cryptographic operations require several orders of magnitude more operations than conventional packet processing [8] and thus need to be limited to the initial connection setup. An implication from the third requirement is that it is not practical to set up different credentials for each hop along the path of a packet. ...
Conference Paper
The main limitation for achieving information assurance in current data networks lies in absence of security considerations in the original Internet architecture. This shortcoming leads to the need for a new approach to achieving information assurance in networks. We propose a network architecture that uses credentials in the data path to identify, validate, monitor, and control data flows within the network. The important aspect of this approach is that credentials are tracked on the data path of the network, not just the end-systems, which implies that each and every packet can be audited. We present a credentials design that is based on Bloom filters and can achieve the desired properties to provide data path assurance.
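The Bloom-filter credential described above can be sketched as follows: each node on the path inserts its identifier, and a verifier can test membership with no false negatives (false positives remain possible). The filter size, hash count, and double-hashing scheme are illustrative assumptions, not the paper's actual parameters.

```python
# Sketch of a Bloom-filter-based credential: path nodes insert their
# identifiers; any node can later audit membership. Parameters are
# illustrative assumptions.

import hashlib

M_BITS = 256   # filter size carried in the packet (assumed)
K_HASH = 4     # number of hash functions (assumed)

def _indexes(item: bytes):
    # Double hashing: derive K_HASH indexes from two base hashes.
    d = hashlib.sha256(item).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % M_BITS for i in range(K_HASH)]

class Credential:
    def __init__(self):
        self.bits = 0  # the whole filter fits in one integer

    def add(self, node_id: bytes) -> None:
        for i in _indexes(node_id):
            self.bits |= 1 << i

    def contains(self, node_id: bytes) -> bool:
        # No false negatives: every added id always tests positive.
        return all((self.bits >> i) & 1 for i in _indexes(node_id))

cred = Credential()
cred.add(b"router-7")
print(cred.contains(b"router-7"))   # True (no false negatives)
```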
... Using PacketBench [29], we are able to obtain simulation results that yield the number of RISC instructions and memory accesses executed per packet. When analyzing applications in the networking domain, it is important to distinguish between two types of memory accesses: accesses to packet memory and accesses to non-packet memory [30]. The inherent orientation towards intensive I/O in networking applications requires this separation. ...
Conference Paper
Measurement and monitoring functionality is widely deployed in the present Internet infrastructure to gather insight into the operation of the network. It is important to obtain a detailed understanding of the system architectures and workloads associated with packet measurement. We present the results of a quantitative performance analysis of a variety of existing measurement systems under different workloads. These results give us an understanding of how much system resources are necessary to support measurement in next-generation high-performance networks.
... Techniques for improving performance of such route-caches are explored in [16]. Memory access behavior of some packet processing applications is analyzed in [13, 23, 24, 29, 36, 42]. However, none of these studies compare the relative benefits of the various techniques for addressing memory bottleneck. ...
Conference Paper
Full-text available
Overhead of memory accesses limits the performance of packet processing applications. To overcome this bottleneck, today's network processors can utilize a wide range of mechanisms, such as multi-level memory hierarchies, wide-word accesses, special-purpose result-caches, asynchronous memory, and hardware multi-threading. However, supporting all of these mechanisms complicates programmability and hardware design, and wastes system resources. In this paper, we address the following fundamental question: what minimal set of hardware mechanisms must a network processor support to achieve the twin goals of simplified programmability and high packet throughput? We show that no single mechanism suffices; the minimal set must include data-caches and multi-threading. Data-caches and multi-threading are complementary: whereas data-caches exploit locality to reduce the number of context-switches and the off-chip memory bandwidth requirement, multi-threading exploits parallelism to hide long cache-miss latencies.
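The complementarity argued in this abstract can be made concrete with the classic estimate for latency hiding: a core stays busy when the thread count T satisfies T >= 1 + L / C, where C is the compute time between off-chip accesses and L the miss latency. A cache raises C (fewer accesses go off-chip), so fewer threads are needed. All numbers below are illustrative assumptions, not measurements from the paper.

```python
# Sketch: how caching and multi-threading complement each other.
# T >= 1 + L / C threads keep a core busy, where L is the miss
# latency and C the compute cycles between off-chip misses.

import math

def threads_to_hide_latency(cycles_between_misses: float,
                            miss_latency_cycles: float) -> int:
    return math.ceil(1 + miss_latency_cycles / cycles_between_misses)

MISS_LATENCY = 300          # cycles to off-chip memory (assumed)
COMPUTE_PER_ACCESS = 20     # cycles of work between accesses (assumed)

for hit_rate in (0.0, 0.8, 0.95):
    # With a cache, only (1 - hit_rate) of accesses go off-chip,
    # stretching the effective compute time between misses.
    c = COMPUTE_PER_ACCESS / max(1e-9, 1 - hit_rate)
    t = threads_to_hide_latency(c, MISS_LATENCY)
    print(f"hit rate {hit_rate:.2f}: ~{t} threads to hide latency")
```

With no cache the sketch needs 16 threads, but a 95% hit rate brings that down to 2, matching the paper's point that the two mechanisms together form the minimal set.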