Fig 2 - uploaded by Jonathan S Turner
IP lookup table represented as a binary trie. Stored prefixes are denoted by shaded nodes. Next hops are found by traversing the trie.  

Source publication
Article
Full-text available
Continuing growth in optical link speeds places increasing demands on the performance of Internet routers, while deployment of embedded and distributed network services imposes new demands for flexibility and programmability. IP address lookup has become a significant performance bottleneck for the highest performance routers. Amid the vast array o...

Similar publications

Conference Paper
Full-text available
In this paper we present an asynchronous VLSI neuromorphic architecture comprising an array of integrate and fire neurons and dynamic synapse circuits with programmable weights. To store synaptic weight values, we designed a novel asynchronous SRAM block, integrated it on chip and connected it to the dynamic synapse circuits, via a fast current-mod...
Article
Full-text available
Polysilicon (poly-Si) grain size control is a critical issue with scaling of MOS transistors in integrated circuit design, more so in embedded non-volatile memory (NVM) technology. This paper investigates an approach to suppress poly-Si grain growth under necessary additional thermal budget for 40 nm embedded NVM technology. Our studies reveal that...
Article
Full-text available
Encryption is an important step for secure data transmission, and a true random number generator (TRNG) is a key building block in many encryption algorithms. Static random-access memory (SRAM) chips can be easily available sources of true random numbers, benefiting from noisy SRAM cells whose start-up values flip between different power-on cycles....
Article
Full-text available
Scratch-pad memory (SPM), a small, fast, software-managed on-chip SRAM (Static Random Access Memory) is widely used in embedded systems. With the ever-widening performance gap between processors and main memory, it is very important to reduce the serious off-chip memory access overheads caused by transferring data between SPM and off-chip memory. I...
Article
Full-text available
Low-power consumption and stability in static random access memories (SRAMs) is essential for embedded applications. This study presents a novel design flow for power minimisation of nano-complementary metal-oxide semiconductor SRAMs, while maintaining stability. A 32 nm high-k/metal-gate SRAM has been used as an example circuit. The baseline circu...

Citations

... Furthermore, routing tables undergo frequent updates. It has been reported that backbone routers may experience up to a hundred thousand updates per second [2,11,12,13,14,10]. ...
... The combination of path compression with a bitmap representation, where a subtree is treated as a compound node, may reduce a tree's height [35,13]. Bitmaps indicate stored prefixes by bit position (i.e., prefixes are represented as numbers in an interval) on a tree level [36,37]. ...
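The path-compression idea quoted above can be sketched as follows; the node layout and names (`PCNode`, `lookup`) are illustrative assumptions, not code from any cited paper. Each edge stores an entire non-branching bit string, so chains of nodes without branches or stored prefixes collapse into a single node:

```python
class PCNode:
    def __init__(self, label=""):
        self.label = label        # bit string consumed on the edge into this node
        self.child = {}           # first bit of the edge label -> PCNode
        self.next_hop = None      # set only on nodes that store a prefix

def lookup(root, addr):
    """Longest-prefix match on a path-compressed trie."""
    node, pos, best = root, 0, root.next_hop
    while pos < len(addr):
        nxt = node.child.get(addr[pos])
        # the whole edge label must match; only prefix-free chains are compressed
        if nxt is None or not addr.startswith(nxt.label, pos):
            break
        pos += len(nxt.label)
        node = nxt
        if node.next_hop is not None:
            best = node.next_hop   # a longer stored prefix was matched
    return best
```

For example, storing 1011* and 101100* needs only two nodes below the root here, whereas a plain binary trie would need six; this is the height reduction the snippet refers to.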
Article
Full-text available
In this paper, we propose an IP lookup scheme, called Helix, that performs parallel prefix matching at the different prefix lengths and uses the helicoidal properties of binary trees to reduce tree height. The reduction of the tree height is achieved without performing any prefix modification. Helix minimizes the amount of memory used to store long and numerous prefixes and achieves IP lookup and route updates in a single memory access. We evaluated the performance of Helix in terms of the number of memory accesses and amount of memory required for storing large IPv4 and IPv6 routing tables with up to 512,104 IPv4 and 389,956 IPv6 prefixes, respectively. In all the tested routing tables, Helix performs lookup in a single memory access while using very small memory amounts. We also show that Helix can be implemented on a single field-programmable gate array (FPGA) chip with on-chip memory for the IPv4 and IPv6 tables considered herein, without requiring external memory. Specifically, Helix uses up to 72% of the resources of an FPGA to accommodate the most demanding routing table, without performance penalties. The implementation shows that Helix may achieve lookup speeds beyond 1.2 billion packets per second (Gpps).
... However, searching an ordinary binary tree is not fast enough, so many techniques have been proposed to improve the lookup speed. Some of these techniques are path compression, multibit trees [3], level compression [8], bitmap techniques [1] [3] [4] [5], leaf pushing [5] [9], the priority-tree technique [6], hashing [7], etc. Path compression is a technique in which paths in a tree are compressed if they have no branching. ...
Conference Paper
The pool of available IPv4 addresses is nearly depleted, with less than 10% of all IPv4 addresses remaining. At the same time, the bit rates at which packets are transmitted are increasing, and the IP lookup speed must be increased as well. Consequently, IP lookup algorithms are in the research focus again, because the existing solutions were designed for IPv4 addresses and are not sufficiently scalable. In this paper, we compare FPGA implementations of the balanced parallelized frugal lookup (BPFL) algorithm and the parallel optimized linear pipeline (POLP) lookup algorithm, which use memory efficiently and achieve the highest speeds.
I. INTRODUCTION. The Internet is still a fast-growing network. The number of hosts keeps increasing, and the IPv4 address space is almost exhausted. Moreover, the Internet of "things" is being developed to include a tremendous number of sensors that may be attached to various machines and appliances. As a result of this development, the transition to longer IPv6 addresses is inevitable. Packets generated by the increasing number of things on the Internet will be directed through routers based on their IPv6 addresses. The output port (i.e., next hop) of each packet is determined from its IP address using the information in the lookup table, according to the specified IP lookup algorithm. The lookup table contains forwarding information for the network addresses that a router has learned from other routers in the network. As the Internet grows, the lookup tables are getting larger. Classless network addresses are aggregated in these tables in order to consume a minimal amount of memory, and the longest prefix match of a given IP address must be found. The lookup table is typically split between internal and external memories. The internal (on-chip) memory is on the same chip as the lookup logic. On-chip memory has a large throughput, which allows the parallelization and pipelining that provide high lookup speeds.
However, the on-chip memory is very limited and should therefore be used carefully. As IP addresses get longer, the internal memory requirements of the lookup table can become a bottleneck, so the available internal memory should be used in a way that maximizes the lookup speed for the largest IP lookup tables. In the multibit tree technique, one node has 2^m children, as m bits are used to determine the child at the next level; this reduces the tree depth. The level compression technique replaces the parts of the binary tree that are populated above some threshold with multibit subtrees, efficiently reducing the depth of the tree. The bitmap technique uses a compact binary representation of some parts of the tree (a subtree structure is represented by a bitmap vector whose positions correspond to the nodes in the subtree). It is usually combined with a technique that reduces the number of pointers in a multibit tree: only one pointer is kept in a node, pointing to the first element of the vector of pointers to the node's children. Leaf pushing pushes the next-hop information from the internal nodes of the tree to its leaves. The priority-tree technique fills empty nodes in the early levels of a binary tree with longer prefixes, reducing the total number of nodes. Hashing is a popular technique used to reduce the number of memory accesses and increase the lookup speed. As one can see, many techniques can be used to achieve higher lookup speeds, but they usually come at a price. For example, some internal nodes with next-hop information can be masked by the leaf pushing technique or by multibit trees, so updates become more complicated than in an ordinary binary tree.
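The multibit tree technique described above can be sketched roughly as follows. The stride, the names (`MultibitNode`, `insert`, `lookup`) and the leaf-pushed per-slot next-hop array are illustrative assumptions; in this simplified version of controlled prefix expansion, shorter prefixes must be inserted before longer ones so that longer matches overwrite them:

```python
STRIDE = 4                        # m bits consumed per node, so 2**m children

class MultibitNode:
    def __init__(self):
        self.child = [None] * (1 << STRIDE)
        self.next_hop = [None] * (1 << STRIDE)  # leaf-pushed next hop per slot

def insert(root, prefix_bits, next_hop):
    """Controlled prefix expansion: pad the prefix to a stride boundary by
    enumerating all completions of the missing low bits."""
    node = root
    while len(prefix_bits) > STRIDE:            # descend full strides
        idx = int(prefix_bits[:STRIDE], 2)
        if node.child[idx] is None:
            node.child[idx] = MultibitNode()
        node = node.child[idx]
        prefix_bits = prefix_bits[STRIDE:]
    pad = STRIDE - len(prefix_bits)
    base = int(prefix_bits, 2) << pad
    for i in range(1 << pad):                   # expanded prefixes fill 2**pad slots
        node.next_hop[base + i] = next_hop

def lookup(root, addr_bits):
    """Walk the trie one stride at a time (address length a multiple of STRIDE)."""
    node, best, pos = root, None, 0
    while node is not None and pos + STRIDE <= len(addr_bits):
        idx = int(addr_bits[pos:pos + STRIDE], 2)
        if node.next_hop[idx] is not None:
            best = node.next_hop[idx]
        node = node.child[idx]
        pos += STRIDE
    return best
```

With a stride of 4, a 32-bit IPv4 lookup takes at most 8 node visits instead of 32, which is the depth reduction the snippet describes.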
... Several software and hardware approaches to the IP address lookup problem have recently been proposed [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22], and [28][29][30][31][32]. The binary trie, which is the basis of many existing schemes [4], is a binary tree that is searched starting from the root and recursively descending to the left or right child depending on the current bit of the searched address. ...
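A minimal sketch of that binary-trie search, with illustrative names (`TrieNode`, `insert`, `lookup`) that are not taken from any cited scheme:

```python
class TrieNode:
    def __init__(self):
        self.child = [None, None]   # 0-branch and 1-branch
        self.next_hop = None        # set only on nodes that store a prefix

def insert(root, prefix_bits, next_hop):
    """Store a prefix given as a bit string (e.g. '1011') with its next hop."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.child[i] is None:
            node.child[i] = TrieNode()
        node = node.child[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Walk the trie bit by bit, remembering the last stored prefix seen,
    which yields the longest matching prefix."""
    node, best = root, root.next_hop
    for b in addr_bits:
        node = node.child[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best
```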
... Our architecture and other approaches (number of prefixes; lookup rate in million lookups per second):
  Our architecture                  226,094   193
  Reconfigurable BDD [21]           33,796    157.7
  Partitioned table [15]            18,684    68
  Reconfigurable FSM [20]           38,367    22.22
  LogW-Elevators (software) [31]    33,796    20.5
  FIPL (parallel trie bitmap) [30]  16,564    9.09
(1) A frame is a column of logic resources inside the FPGA.
Table 6. Memory usage and the number of memory accesses per lookup in our proposed architecture and other approaches (memory in KB; number of prefixes; bytes per prefix; memory accesses per lookup):
  [11]                   2000      29,584   67.6     1
  DIR-24-8 [22]          264,000   --       --       1-2
  Trie bitmap [28]       20,000    41,811   478.34   1-5
  Indirect lookup [29]   3600      40,000   90       1-3
  Parallel hashing [17]  1512      37,000   40.86    1-5
  Multi-way search [7]   5600      30,000   186.67   1-9
  DTBM [32]              36,000    85,987   418.67   1-12
  EnBiT (software) [12]  2413      41,000   58.86
parameters. ...
Article
With the complexity of today's networks, routers on backbone links must be able to handle millions of packets per second on each of their ports. Determining the corresponding output interface for each incoming packet based on its destination address requires a longest-matching-prefix search on the IP address. Therefore, IP address lookup is one of the most challenging problems for backbone routers. In this paper, an IP routing lookup architecture based on a reconfigurable hardware platform is proposed. Experimental results show that a rate of 193 million lookups per second is achieved using our architecture, while prefixes can be updated at a rate of 3 million updates per second. Furthermore, it was shown that our reconfigurable architecture results in rare update failures due to resource limitations.
... Although TCAM-based engines can retrieve IP lookup results in just one clock cycle, their throughput is limited by the relatively low speed of TCAMs. They are expensive and offer little flexibility for adapting to new addressing and routing protocols [4]. As shown in Table I, SRAM outperforms TCAM with respect to speed, density and power consumption. ...
Conference Paper
Full-text available
Continuous growth in network link rates poses a strong demand for high-speed IP lookup engines. While Ternary Content Addressable Memory (TCAM) based solutions serve most of today's high-end routers, they do not scale well for the next generation. On the other hand, pipelined SRAM-based algorithmic solutions are becoming attractive. Intuitively, multiple pipelines can be utilized in parallel to have a multiplicative effect on the throughput. However, several challenges must be addressed for such solutions to realize high throughput. First, the memory distribution across the different stages of each pipeline, as well as across different pipelines, must be balanced. Second, the traffic on the various pipelines should be balanced. In this paper, we propose a parallel SRAM-based multi-pipeline architecture for terabit IP lookup. To balance the memory requirement over the stages, a two-level mapping scheme is presented. By trie partitioning and subtrie-to-pipeline mapping, we ensure that each pipeline contains an approximately equal number of trie nodes. Then, within each pipeline, a fine-grained node-to-stage mapping is used to achieve evenly distributed memory across the stages. To balance the traffic on different pipelines, both pipelined prefix caching and dynamic subtrie-to-pipeline remapping are employed. Simulation using real-life data shows that the proposed architecture with 8 pipelines can store a core routing table with over 200 K unique routing prefixes using 3.5 MB of memory. It achieves a throughput of up to 3.2 billion packets per second, i.e. 1 Tbps for minimum-size (40 byte) packets.
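The subtrie-to-pipeline mapping idea can be illustrated with a simple greedy balancer; the paper's actual two-level scheme is more involved, and the function name and largest-first rule used here are assumptions:

```python
# Sketch: split the trie into subtries, then assign each subtrie to the
# currently least-loaded pipeline so that node counts stay roughly balanced.
import heapq

def map_subtries(subtrie_sizes, num_pipelines):
    """Return {subtrie id -> pipeline id}, balancing total node counts."""
    heap = [(0, p) for p in range(num_pipelines)]   # (current load, pipeline id)
    heapq.heapify(heap)
    assignment = {}
    # assigning largest subtries first tightens the balance
    for sid, size in sorted(enumerate(subtrie_sizes), key=lambda x: -x[1]):
        load, p = heapq.heappop(heap)               # least-loaded pipeline
        assignment[sid] = p
        heapq.heappush(heap, (load + size, p))
    return assignment
```

The fine-grained node-to-stage mapping within a pipeline and the dynamic remapping for traffic balance are separate mechanisms not shown here.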
... 12 is not sufficient. Two hardware-based solutions are available: those that use a CAM and those that use a synchronous LUT pipeline (on ASIC or FPGA) [21], [22]. The CAM can itself be implemented by a LUT cascade [23]. ...
Article
Full-text available
The paper addresses software and firmware implementation of multiple-output Boolean functions based on cascades of Look-Up Tables (LUTs). A LUT cascade is described as a means of compact representation of a large class of sparse Boolean functions, evaluation of which then reduces to multiple indirect memory accesses. The method is compared to a technique of direct PLA emulation and is illustrated on examples. A specialized micro-engine is proposed for even faster evaluation than is possible with universal microprocessors. The presented method is flexible in making trade-offs between performance and memory footprint and may be useful for embedded applications where the processing speed is not critical. Evaluation may run on various CPUs and DSP cores or slightly faster on FPGA-based micro-programmed controllers.
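The "multiple indirect memory accesses" evaluation style can be illustrated with a toy cascade that computes the parity of six input bits in three stages; the table layout, one-bit rail width and names are hypothetical, not the paper's micro-engine:

```python
# Each stage is a small look-up table indexed by two fresh input bits
# concatenated with the "rail" value passed on from the previous stage,
# so evaluating the whole function is just a chain of table reads.

def make_parity_stage():
    table = []
    for idx in range(8):                 # index = (rail << 2) | two input bits
        rail, b = idx >> 2, idx & 3
        table.append(rail ^ (bin(b).count("1") & 1))  # new rail = running parity
    return table

def eval_cascade(stages, inputs):
    """inputs: one 2-bit integer per stage; returns the final rail value."""
    rail = 0
    for table, b in zip(stages, inputs):
        rail = table[(rail << 2) | b]    # one indirect memory access per stage
    return rail
```

A real cascade would carry a wider rail and decompose an arbitrary sparse function, but the access pattern (index, read, repeat) is the same.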
... Such in-transit data manipulation services have been addressed by a large body of work, starting from active networking research [58]. Research on active networks has explored the possibilities of extending and customizing the behaviour of the network infrastructure to meet application needs [4,6,52]. Furthermore, the Active System Area Networks (ASAN) work [24,38] focuses on the use of NPs close to or even within the leaf nodes of a network. ...
... There are numerous interface designs for closely coupling network devices, such as programmable network processors or line cards, to host nodes using OS-controlled mappings [52,53,8]. The motivation is to take advantage of the network-near nature of these devices vs. hosts, and to implement transaction services, synchronization functions, and service- or system-specific optimization of protocol stacks and of the data movement and buffering associated with message communication. ...
Conference Paper
Full-text available
Our research addresses "information appliances' used in modern large-scale distributed systems to: (1) virtualize their data flows by applying actions such as filtering, format translation, etc., and (2) separate such actions from enterprise applications' business logic, to make it easier for future service-oriented codes to inter-operate in diverse and dynamic environments. Our specific contribution is the enrichment of runtimes of these appliances with methods for QoS-awareness, thereby giving them the ability to deliver desired levels of QoS even under sudden requirement changes - IQ-appliances. For experimental evaluation, we prototype an IQ-appliance. Measurements demonstrate the feasibility and utility of the approach.
... Accesses to SRAM and SDRAM can take a significant amount of time which can increase the overall processing cost of TSA. We have used a multibit trie implementation [15] to reduce the number of memory accesses necessary to traverse the anonymization tree. Our current prototype runs on an Intel IXP 2400 network processor [5] and requires 4 accesses to the anonymization data structures instead of 26. ...
Article
Passive network measurement and packet header trace collection are vital tools for network operation and research. To protect a user's privacy, it is necessary to anonymize header fields, particularly IP addresses. To preserve the correlation between IP addresses, prefix-preserving anonymization has been proposed. The limitations of this approach for a high-performance measurement system are the need for complex cryptographic computations and potentially large amounts of memory. We propose a new prefix-preserving anonymization algorithm, top-hash subtree-replicated anonymization (TSA), that features three novel improvements: precomputation, replicated subtrees, and top hashing. TSA makes anonymization practical to implement on network processors or dedicated logic at Gigabit rates. The performance of TSA is compared with a conventional cryptography-based prefix-preserving anonymization scheme which utilizes caching. TSA performs better as it requires no online cryptographic computation and only a small number of memory lookups per packet. Our analytic comparison of the susceptibility to attacks between conventional anonymization and our approach shows that TSA performs better for small-scale attacks and comparably for medium-scale attacks. The processing cost for TSA is reduced by two orders of magnitude and the memory requirements are a few Megabytes. The ability to tune the memory requirements and security level makes TSA ideal for a broad range of network systems with different capabilities.
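The prefix-preserving property itself (not the TSA algorithm) can be illustrated with a toy construction: each output bit is the input bit XORed with a pseudorandom function of the preceding prefix, so addresses sharing k leading bits anonymize to addresses sharing k leading bits. The key and hash choice here are illustrative and cryptographically naive:

```python
import hashlib

def anonymize(addr_bits, key=b"demo-key"):
    """Prefix-preserving mapping of a bit-string address (toy sketch)."""
    out = []
    for i, b in enumerate(addr_bits):
        # the flip decision for bit i depends only on the first i bits,
        # so equal prefixes produce equal anonymized prefixes
        h = hashlib.sha256(key + addr_bits[:i].encode()).digest()
        out.append(str(int(b) ^ (h[0] & 1)))
    return "".join(out)
```

Computing a hash per bit is exactly the online cryptographic cost that TSA's precomputation and replicated subtrees are designed to avoid.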
... Our proposed approach is twofold. Firstly, it is grounded in the programmable router architecture [2, 3, 4, 5], wherein participating routers can collaboratively collect traffic statistics for a victim site. The statistical information is forwarded to the victim site (or any processing node in a distributed system). ...
... To illustrate the notion of correctness, Figure 3 shows a timing diagram with two routers R_i and R_j, where R_j is an immediate upstream router of R_i. The black rectangle in the figure represents the time instant at which the outgoing traffic counter of router R_i is read, and we let this time instant be t_{i,k}, where k ∈ {1, 2} represents whether the reading is taken for the first or the second time. We assume that Figure 3 illustrates the reading of the outgoing traffic counter for the k-th time. ...
... In this subsection, we illustrate the DDoS traceback algorithm through an example using the network topology shown in Figure 7, which illustrates how the algorithm works. A black rectangle represents the time instant at which a router R_i records the value of its outgoing traffic counter, and we denote this time instant as t_{i,k}, where k ∈ {1, 2} represents the k-th instance of the snapshot algorithm. In addition, the corresponding value of the outgoing traffic counter is shown beside the black rectangle. ...
Article
Distributed denial-of-service attack is one of the most pressing security problems that the Internet community needs to address. Two major requirements for effective traceback are (i) to quickly and accurately locate potential attackers and (ii) to filter attack packets so that a host can resume the normal service to legitimate clients. Most of the existing IP traceback techniques focus on tracking the location of attackers after-the-fact. In this work, we provide an efficient methodology for locating potential attackers who employ the flood-based attack. We propose a distributed algorithm so that a set of routers can correctly (in a distributed sense) gather statistics in a coordinated fashion and that a victim site can deduce the local traffic intensities of all these participating routers. We prove the correctness of our distributed algorithm, and given the collected statistics, we provide a method for the victim site to locate attackers who sent out dominating flows of packets. The proposed distributed traceback methodology can also complement and leverage on the existing ICMP traceback so that a more efficient and accurate traceback can be obtained. We carry out simulations to illustrate that the proposed methodology can locate the attackers in a short period of time. Moreover, the applications as well as the limitations of the proposed methodology are covered. We believe this work also provides the theoretical foundation on how to correctly and accurately perform distributed measurement and traffic estimation on the Internet.
... Our algorithm uses a built-in bit-manipulation instruction to calculate the number of bits set, and thus it is much more efficient than Lulea. Ternary Content Addressable Memory (TCAM) based schemes [19], the reconfigurable fast IP lookup engine [29], and Binary Decision Diagrams [24] are hardware-based IP lookup schemes. They achieve high lookup speed at the cost of high power consumption and complicated prefix updates. ...
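The bit-counting trick mentioned above can be sketched as follows: a node stores a bitmap of which of its children exist plus a single base pointer into a packed child array, and a child's offset within that array is the popcount of the bitmap bits below its position (one bit-manipulation instruction on modern CPUs or in hardware). The names here are illustrative:

```python
def popcount(x):
    """Number of set bits; stands in for the hardware popcount instruction."""
    return bin(x).count("1")

def child_offset(bitmap, slot):
    """Offset of child `slot` in the packed child array, or None if absent.
    Bit i of `bitmap` (LSB first) says whether child i exists."""
    if not (bitmap >> slot) & 1:
        return None                              # no such child stored
    return popcount(bitmap & ((1 << slot) - 1))  # set bits below this slot
```

Because absent children consume no pointers, a node with a 2^m-wide fan-out needs only one base pointer and an m-bit-indexed bitmap, which is the memory saving behind the Lulea-style schemes cited above.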
Conference Paper
IP forwarding is one of the main bottlenecks in Internet backbone routers, as it requires performing the longest-prefix match at 10Gbps speed or higher. IPv6 forwarding further exacerbates the situation because its search space is quadrupled. We propose a high-performance IPv6 forwarding algorithm TrieC, and implement it efficiently on the Intel IXP2800 network processor (NPU). Programming the multi-core and multithreaded NPU is a daunting task. We study the interaction between the parallel algorithm design and the architecture mapping to facilitate efficient algorithm implementation. We experiment with an architecture-aware design principle to guarantee the high performance of the resulting algorithm. This paper investigates the main software design issues that have dramatic performance impacts on any NPU based implementation: memory space reduction, instruction selection, data allocation, task partitioning, latency hiding, and thread synchronization. In the paper, we provide insight on how to design an NPU-aware algorithm for high-performance networking applications. Based on the detailed performance analysis of the TrieC algorithm, we provide guidance on developing high-performance networking applications for the multi-core and multithreaded architecture.
... However, in the worst case it needs approximately 15 memory accesses for an IPv6 address lookup because of its O(log_k 2N + W/M) search time. Additionally, TCAM-based schemes, CPU caching [5], the reconfigurable fast IP lookup engine [14], binary decision diagrams [20], etc. are all hardware-based IP lookup schemes. Their advantage is high lookup speed, whereas disadvantages such as the need for specific hardware support, high power consumption, complicated prefix updates and high cost limit their application to a certain degree. ...
Conference Paper
Address lookup is one of the main bottlenecks in Internet backbone routers, as it requires the router to perform a longest-prefix match when searching the routing table for a next hop. Ever-increasing Internet bandwidth, continuously growing prefix table sizes and the inevitable migration to the IPv6 address architecture further exacerbate this situation. In recent years, a variety of high-speed address lookup algorithms have been proposed; however, most of them are inappropriate for IPv6 lookup. This paper proposes a high-speed IPv6 lookup algorithm, TrieC, which achieves the goals of high-speed address lookup, fast incremental prefix updates, high scalability and reasonable memory requirements by taking great advantage of the network processor architecture. The performance of TrieC is carefully evaluated with several IPv6 routing tables of different sizes and different prefix length distributions on the Intel IXP2800 network processor (NPU). Simulation shows that TrieC can support IPv6 lookup at the OC-192 line rate. Furthermore, if TrieC is pipelined in hardware, it can achieve one IPv6 lookup per memory access.