Fig 1 - uploaded by Ning Weng
Router System with Network Processor. Packets are processed by one of multiple processing cores in the network processor.


Source publication
Conference Paper
Full-text available
Network processing is becoming an increasingly important paradigm as the Internet moves towards an architecture with more complex functionality inside the network. Modern routers not only forward packets, but also process headers and payloads to implement a variety of functions related to security, performance, and customization. It is important to...

Context in source publication

Context 1
... processing tasks are performed on the network processor before the packets are passed on through the router switching fabric and through the next network link. This is illustrated in Figure 1. Design space exploration of NP architectures, development of novel protocols and network processing applications, and the creation of suitable programming abstractions for such parallel embedded systems are current areas of research. ...

Similar publications

Article
With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have imped...
Conference Paper
Full-text available
Some instructions have more impact on processor performance than others. Identification of these critical instructions can be used to modify and improve instruction processing. Previous work has shown that the criticality of instructions can be dynamically predicted with high accuracy, and that this information can be leveraged to optimize the perf...

Citations

... On the one hand, packet switching is usually packet-size dependent, since most software routers handle packets using the store-and-forward paradigm. Following [61], we apply a simple linear model consisting of a variable part a per byte (dependent on the packet size S) and a constant part b, i.e., x = a · S + b. Based on real testbed measurements (cf. Table 3 of [2]), we derive the calibration values a ≈ 2.34 ns/B and b ≈ 272.47 ns for packet switching (aka. ...
... On the other hand, in the case of IP forwarding, the effort for packet processing (e.g. updating the IP header) is independent of the packet size [61]. Thus, we model the effort for IP packet processing as an additional constant per-packet overhead c. ...
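The cost model quoted above can be sketched directly in code. This is a minimal illustration using the cited calibration values (a ≈ 2.34 ns/B, b ≈ 272.47 ns); the function names and the example overhead value passed for c are my own, not from the source.

```python
# Sketch of the per-packet processing-time model x = a * S + b,
# with the testbed-derived calibration constants quoted in the text.

A_SWITCH_NS_PER_BYTE = 2.34    # variable cost a (ns per byte)
B_SWITCH_NS = 272.47           # constant cost b (ns per packet)

def switching_time_ns(packet_size_bytes: float) -> float:
    """Packet-switching cost x = a * S + b (size-dependent)."""
    return A_SWITCH_NS_PER_BYTE * packet_size_bytes + B_SWITCH_NS

def forwarding_time_ns(packet_size_bytes: float, c_ip_ns: float) -> float:
    """Switching plus the size-independent IP-processing overhead c."""
    return switching_time_ns(packet_size_bytes) + c_ip_ns

# Example: a 1500-byte packet
print(switching_time_ns(1500))  # 2.34 * 1500 + 272.47 = 3782.47 ns
```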
Article
Network devices based on commodity hardware are capable of high-speed packet processing while maintaining the programmability and extensibility of software. Thus, software-based network devices, like software routers, software-based firewalls, or monitoring systems, constitute a cost-efficient and flexible alternative to expensive, special purpose hardware. The overall packet processing performance in resource-constrained nodes can be strongly increased through parallel processing based on off-the-shelf multi-core processors. However, synchronization and coordination of parallel processing may counteract the corresponding network node performance. We describe how multi-core software routers can be optimized for real-time traffic by utilizing the technologies available in commodity hardware. Furthermore, we propose a low latency extension for the Linux NAPI. For the analysis, we use our approach for modeling resource contention in resource-constrained nodes which is also implemented as a resource-management extension module for ns-3. Based on that, we derive a QoS-aware software router model which we use to evaluate our performance optimizations. Our case study shows that the different scheduling strategies of a software router have significant influence on the performance of handling real-time traffic.
... They proposed a queuing network model for performance prediction of shared-memory multiprocessor systems in parallel protocol execution. Ramaswamy et al. [11] presented a framework for considering the packet processing cost in network simulations. In our previous work [12], we proposed a general approach for realistic modeling of resource management in resource-constrained packet processing nodes. ...
... 2) Routing: Every packet is subjected to forwarding as well as full IP routing including routing table lookup, checksum calculation, etc. The effort for updating the IP header is equal for small and large packet sizes [11]. Thus, this workload is modeled as a_R = a_F and b_R = b_F + c_R, where c_R ≈ 225 ns represents the IP routing effort. ...
... 3) IPsec: In addition to forwarding and full IP routing, each packet is encrypted using AES-128 encryption, as is common in VPNs. This workload is CPU-intensive and strongly dependent on the IP packet size [11]. Thus, we model this with a_I ≈ 36 ns/B and b_I = b_R. ...
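The three workloads quoted above (forwarding, routing, IPsec) all follow the same linear form t(S) = a · S + b. The following sketch instantiates them; a_F and b_F are taken from the forwarding calibration quoted earlier on this page, while the helper function and names are illustrative assumptions.

```python
# Sketch of the three cited workload models, each of the form
# t(S) = a * S + b (S = packet size in bytes, result in ns).
# a_F, b_F come from the forwarding calibration quoted earlier;
# c_R and a_I are the routing / AES-128 constants from the text.

A_F = 2.34      # ns per byte, forwarding (variable part)
B_F = 272.47    # ns per packet, forwarding (constant part)
C_R = 225.0     # ns per packet, extra IP-routing effort

# Routing: same per-byte cost, extra constant overhead.
A_R, B_R = A_F, B_F + C_R
# IPsec: AES-128 makes the per-byte cost dominate.
A_I, B_I = 36.0, B_R

def workload_time_ns(a: float, b: float, size_bytes: int) -> float:
    return a * size_bytes + b

for name, (a, b) in {"forward": (A_F, B_F),
                     "route": (A_R, B_R),
                     "ipsec": (A_I, B_I)}.items():
    print(name, workload_time_ns(a, b, 1500))
```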
Article
The rapid growth of link bandwidths on one hand, and the emergence of resource-constrained nodes (e.g. software routers) on the other hand, will cause network nodes to be the bottleneck in the future. Parallel processing using multi-core processors can increase the packet processing of resource-constrained nodes and alleviate the problem. However, intra-node resource contention can have a strong negative impact on the corresponding network node and, therefore, also on the overall performance of the network. Commonly used network simulators (e.g. ns-3) only offer a rather simplistic node model and do not take into account intra-node resource contention. We propose a unified and extensible approach to model intra-node resource management in resource-constrained nodes. Our model makes it possible to identify and predict performance bottlenecks in networks. We have implemented our model as an extension to the network simulator ns-3. Simulation results from different case studies show that our approach significantly outperforms the original ns-3 in terms of realistic modeling.
... Furthermore, every packet is subjected to IP routing including routing table lookup, checksum calculation, etc. The effort for updating the IP header is equal for small and large packet sizes [28]. Thus, the effort for IP routing is represented with a constant overhead of c ≈ 225 ns. ...
Conference Paper
Commodity hardware can be used to build a software router that is capable of high-speed packet processing while being programmable and extensible. Therefore, software routers provide a cost-efficient alternative to expensive, special hardware routers. The efficiency of packet processing in resource-constrained nodes (e.g. software routers) can be strongly increased through parallel processing with commodity hardware based on multi-core processors. However, intra-node resource contention can have a strong negative impact on the corresponding network node. We describe how multi-core software routers can be optimized for low latency support by utilizing the technologies available in commodity PC hardware. For the analysis we used our approach for modeling of resource contention in resource-constrained nodes, which is also implemented as the resource-management extension module for ns-3. Based on that, we derived a specific software router model which we used to optimize the performance. Our measurements show that the configuration of a software router has significant influence on the performance. The results can be used for parameter tuning in such systems.
... Interested readers may refer to [10] for a detailed explanation of why this critical section causes the decrease of line rate when adding more than two threads ([16]-[24]). ...
Conference Paper
With the ever-expanding design space and workload space of the multicore era, it is a challenge to identify optimal design points quickly, as is desirable during the early stages of multicore processor design or programming. To meet this challenge, this paper proposes a theoretical framework that can capture the general performance properties for a class of multicore processors of interest over a large design space and workload space, free of scalability issues. The idea is to model multicore processors at the thread level, overlooking instruction-level and microarchitectural details. In particular, queuing network models that describe multicore processors at the thread level are developed and solved with an iterative procedure over a large design space and workload space. This framework scales to virtually unlimited numbers of cores and threads. Testing of the procedure demonstrates that the throughput performance of many-core processors with 1000 cores can be evaluated within a few seconds on an Intel Pentium 4 computer, with results within 5% of the simulation data obtained from a thread-level simulator.
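The thread-level queuing approach described above can be illustrated with exact Mean Value Analysis (MVA) for a closed network, in which each "customer" is a hardware thread cycling between a core and memory. This is a generic MVA sketch, not the paper's actual iterative procedure, and the two service demands are illustrative assumptions.

```python
# Sketch: exact Mean Value Analysis (MVA) for a closed product-form
# queuing network, as an example of thread-level performance modeling.
# Each thread alternates between two FCFS stations (core, memory).

def mva_throughput(demands, n_threads):
    """Return system throughput for n_threads customers and the
    given per-station service demands (time units per packet)."""
    q = [0.0] * len(demands)          # mean queue lengths
    x = 0.0
    for n in range(1, n_threads + 1):
        # Residence time at each station grows with its queue length.
        r = [d * (1 + qi) for d, qi in zip(demands, q)]
        x = n / sum(r)                # throughput with n customers
        q = [x * ri for ri in r]      # Little's law per station
    return x

# Illustrative demands: core execution vs. memory access per packet.
CORE_DEMAND, MEM_DEMAND = 1.0, 0.5
for t in (1, 4, 16):
    print(t, "threads ->", round(mva_throughput([CORE_DEMAND, MEM_DEMAND], t), 3))
```

Throughput saturates at 1 / max(demands) as threads are added, which is the bottleneck behavior such models are used to predict.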
... The effectiveness of the runtime management system in utilizing all system resources decreases as more processors become available. The number of instructions executed by a service ranges from 500 to 2,000 per packet [17]. We assume an average packet size of 300 bytes. ...
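The figures in this excerpt (500-2,000 instructions per packet, 300-byte average packets) imply an aggregate instruction budget once a line rate is fixed. A back-of-the-envelope sketch, where the 10 Gb/s link speed is my own illustrative assumption:

```python
# Back-of-the-envelope instruction budget from the cited figures:
# 500-2,000 instructions per packet, 300-byte average packet size.
# LINE_RATE_GBPS is an assumed link speed, not from the text.

PACKET_SIZE_BYTES = 300
INSTR_PER_PACKET = (500, 2000)
LINE_RATE_GBPS = 10            # illustrative assumption

packets_per_sec = LINE_RATE_GBPS * 1e9 / 8 / PACKET_SIZE_BYTES
for ipp in INSTR_PER_PACKET:
    mips_needed = packets_per_sec * ipp / 1e6
    print(f"{ipp} instr/pkt -> {mips_needed:,.0f} MIPS aggregate")
```

Even at the low end this exceeds what a single core can sustain, which is why such workloads are spread over multiple processing cores.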
Conference Paper
Full-text available
Custom packet processing functionality in routers is one of the key characteristics of next-generation Internet architectures. Network services have been proposed as an abstraction to describe, compose, and deploy end-to-end connections with custom communication features. We present a novel hardware architecture for high-performance processing of such network services in the data path. The design provides simple processing units to implement services and a custom hardware infrastructure to manage packets and processing context. The design allows for simple software development, flexible network service allocation, and high scalability to handle traffic at Gigabit line rates.
... Ramaswamy presented an analysis of aspects such as memory access, unique instruction counts, data memory requests and per-packet instruction complexity across four header applications [159]. In [160] cache behaviour, instruction level parallelism and instruction sequences of the TCP/IP protocol stack were examined and compared to the SPEC benchmark, with a number of possible ISA extensions proposed, while a workload analysis of NP-based cryptographic algorithms was presented in [161], [162] and [163]. ...
Conference Paper
Full-text available
To meet future requirements of higher bandwidth while providing ever more complex functions, network processors will require a number of methods for improving processing performance. One such method involves deeper processor pipelines to obtain higher operating frequencies. The penalty costs associated with deeper pipelines have been mitigated by implementing prediction schemes, with previous execution history used to determine future decisions. In this paper we present an analysis of common branch prediction schemes when applied to network applications. Using widespread network applications, we find that, unlike in general purpose processing, hit rates in excess of 95% can be obtained in a network processor using a small 256-entry single-level predictor. While our research demonstrates the low silicon cost of implementing a branch predictor, the long run times of network applications can leave the majority of the predictor logic idle, increasing static power and reducing device utilization.
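A small single-level predictor of the kind discussed can be sketched as a 256-entry table of 2-bit saturating counters indexed by the low bits of the branch PC. The table size matches the abstract; the 2-bit bimodal organization is an assumption about the predictor's internals.

```python
# Sketch of a 256-entry single-level (bimodal) branch predictor:
# one 2-bit saturating counter per entry, indexed by the branch PC.

TABLE_SIZE = 256

class BimodalPredictor:
    def __init__(self):
        # Counters start weakly not-taken (state 1); states range 0-3.
        self.table = [1] * TABLE_SIZE

    def predict(self, pc: int) -> bool:
        # Predict taken when the counter is in its upper half.
        return self.table[pc % TABLE_SIZE] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % TABLE_SIZE
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop-exit-style branch (taken 99 times out of 100) trains quickly,
# illustrating how network workloads can exceed 95% hit rates.
p = BimodalPredictor()
hits = 0
for trip in range(1000):
    taken = (trip % 100) != 99
    hits += p.predict(0x4000) == taken
    p.update(0x4000, taken)
print(hits / 1000)
```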
... For instance, a voice-over-IP call made from a cell phone to a PSTN phone must go through a media gateway that performs audio transcoding "on the fly", as the two end points often use different audio compression standards. Examples of in-network processing services are increasingly abundant, from security and performance-enhancing proxies (PEP) to media translation [1] [2]. These services add load to the processing capacity of the network components. ...
Article
Full-text available
This paper examines congestion control issues for TCP flows that require in-network processing on the fly in network elements such as gateways, proxies, firewalls and even routers. Applications of these flows will become increasingly abundant as the Internet evolves. Since these flows require use of CPUs in network elements, both bandwidth and CPU resources can be a bottleneck, and thus congestion control must deal with "congestion" on both of these resources. In this paper, we show that conventional TCP/AQM schemes can significantly lose throughput and suffer harmful unfairness in this environment, particularly when CPU cycles become more scarce (which is likely the trend given the recent explosive growth rate of bandwidth). As a solution to this problem, we establish a notion of dual-resource proportional fairness and propose an AQM scheme, called Dual-Resource Queue (DRQ), that can closely approximate proportional fairness for TCP Reno sources with in-network processing requirements. DRQ is scalable because it does not maintain per-flow states while minimizing communication among different resource queues, and is also incrementally deployable because it requires no change to TCP stacks. The simulation study shows that DRQ approximates proportional fairness without much implementation cost and that even an incremental deployment of DRQ at the edge of the Internet improves the fairness and throughput of these TCP flows. Our work is at an early stage and might lead to an interesting development in congestion control research.
... As networks connect an increasing number of embedded devices (both as end-systems and as intermediate hops), power constraints are becoming increasingly important. Cryptographic operations require several orders of magnitude more operations than conventional packet processing [8] and thus need to be limited to the initial connection setup. An implication from the third requirement is that it is not practical to set up different credentials for each hop along the path of a packet. ...
Conference Paper
The main limitation for achieving information assurance in current data networks lies in absence of security considerations in the original Internet architecture. This shortcoming leads to the need for a new approach to achieving information assurance in networks. We propose a network architecture that uses credentials in the data path to identify, validate, monitor, and control data flows within the network. The important aspect of this approach is that credentials are tracked on the data path of the network, not just the end-systems, which implies that each and every packet can be audited. We present a credentials design that is based on Bloom filters and can achieve the desired properties to provide data path assurance.
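The Bloom-filter credential described above can be sketched as follows: each node on the path inserts its identifier, and a verifier can test membership with no false negatives (false positives remain possible). The filter size, hash count, and double-hashing scheme are illustrative assumptions, not the paper's actual parameters.

```python
# Sketch of a Bloom-filter-based credential: path nodes insert their
# identifiers; any node can later audit membership. Parameters are
# illustrative assumptions.

import hashlib

M_BITS = 256   # filter size carried in the packet (assumed)
K_HASH = 4     # number of hash functions (assumed)

def _indexes(item: bytes):
    # Double hashing: derive K_HASH indexes from two base hashes.
    d = hashlib.sha256(item).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % M_BITS for i in range(K_HASH)]

class Credential:
    def __init__(self):
        self.bits = 0  # the whole filter fits in one integer

    def add(self, node_id: bytes) -> None:
        for i in _indexes(node_id):
            self.bits |= 1 << i

    def contains(self, node_id: bytes) -> bool:
        # No false negatives: every added id always tests positive.
        return all((self.bits >> i) & 1 for i in _indexes(node_id))

cred = Credential()
cred.add(b"router-7")
print(cred.contains(b"router-7"))   # True (no false negatives)
```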
... Using PacketBench [29], we are able to obtain simulation results that yield the number of RISC instructions and memory accesses executed per packet. When analyzing applications in the networking domain, it is important to distinguish between two types of memory accesses: accesses to packet memory and accesses to non-packet memory [30]. The inherent orientation towards intensive I/O in networking applications requires this separation. ...
Conference Paper
Measurement and monitoring functionality is widely deployed in the present Internet infrastructure to gather insight into the operation of the network. It is important to obtain a detailed understanding of the system architectures and workloads associated with packet measurement. We present the results of a quantitative performance analysis of a variety of existing measurement systems under different workloads. These results give us an understanding of how much system resources are necessary to support measurement in next-generation high-performance networks.
... Techniques for improving performance of such route-caches are explored in [16]. Memory access behavior of some packet processing applications is analyzed in [13, 23, 24, 29, 36, 42]. However, none of these studies compare the relative benefits of the various techniques for addressing memory bottleneck. ...
Conference Paper
Full-text available
Overhead of memory accesses limits the performance of packet processing applications. To overcome this bottleneck, today's network processors can utilize a wide range of mechanisms, such as multi-level memory hierarchies, wide-word accesses, special-purpose result-caches, asynchronous memory, and hardware multi-threading. However, supporting all of these mechanisms complicates programmability and hardware design, and wastes system resources. In this paper, we address the following fundamental question: what minimal set of hardware mechanisms must a network processor support to achieve the twin goals of simplified programmability and high packet throughput? We show that no single mechanism suffices; the minimal set must include data-caches and multi-threading. Data-caches and multi-threading are complementary: whereas data-caches exploit locality to reduce the number of context-switches and the off-chip memory bandwidth requirement, multi-threading exploits parallelism to hide long cache-miss latencies.
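The complementarity argued in this abstract can be made concrete with the classic estimate for latency hiding: a core stays busy when the thread count T satisfies T >= 1 + L / C, where C is the compute time between off-chip accesses and L the miss latency. A cache raises C (fewer accesses go off-chip), so fewer threads are needed. All numbers below are illustrative assumptions, not measurements from the paper.

```python
# Sketch: how caching and multi-threading complement each other.
# T >= 1 + L / C threads keep a core busy, where L is the miss
# latency and C the compute cycles between off-chip misses.

import math

def threads_to_hide_latency(cycles_between_misses: float,
                            miss_latency_cycles: float) -> int:
    return math.ceil(1 + miss_latency_cycles / cycles_between_misses)

MISS_LATENCY = 300          # cycles to off-chip memory (assumed)
COMPUTE_PER_ACCESS = 20     # cycles of work between accesses (assumed)

for hit_rate in (0.0, 0.8, 0.95):
    # With a cache, only (1 - hit_rate) of accesses go off-chip,
    # stretching the effective compute time between misses.
    c = COMPUTE_PER_ACCESS / max(1e-9, 1 - hit_rate)
    t = threads_to_hide_latency(c, MISS_LATENCY)
    print(f"hit rate {hit_rate:.2f}: ~{t} threads to hide latency")
```

With no cache the sketch needs 16 threads, but a 95% hit rate brings that down to 2, matching the paper's point that the two mechanisms together form the minimal set.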