Caesar: A Content Router for
High-Speed Forwarding on Content Names
Diego Perino, Matteo Varvello
Bell Labs, Alcatel-Lucent
first.last@alcatel-lucent.com
Leonardo Linguaglossa
INRIA
first.last@inria.fr
Rafael Laufer, Roger Boislaigue
Bell Labs, Alcatel-Lucent
first.last@alcatel-lucent.com
ABSTRACT
Internet users are interested in content regardless of its location;
however, the current client/server architecture still requires requests
to be directed to a specific server. Information-centric networking
(ICN) is a recent paradigm that relaxes this requirement through the use
of name-based forwarding, where forwarding decisions are based
on content names instead of IP addresses. Although several name-
based forwarding strategies have been proposed, almost none has
actually been implemented in a content router. To fill this gap, in this paper we
design and prototype a content router called Caesar for high-speed
forwarding on content names. Caesar introduces several innovative
features, including (i) a longest-prefix matching algorithm based on
a novel data structure called prefix Bloom filter; (ii) an incremental
design which allows for easy integration with existing protocols
and network equipment; (iii) a forwarding scheme where multiple
line cards collaborate in a distributed fashion; and (iv) support for
offloading packet processing to graphics processing units (GPUs).
We build Caesar as an enterprise router, and show that every line
card sustains up to 10 Gbps using a forwarding table with more
than 10 million content prefixes. Distributed forwarding allows the
forwarding table to grow even further, and to scale linearly with the
number of line cards at the cost of only a few microseconds in the
packet processing latency. GPU offloading, in turn, trades off a few
milliseconds of latency for a large speedup in the forwarding rate.
Categories and Subject Descriptors
C.2.1 [Network Architecture and Designs]: Network commu-
nications, Store and forward networks; C.2.6 [Internetworking]:
Routers
General Terms
Design; Implementation; Experiments.
Keywords
ICN; forwarding; router; architecture.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
ANCS’14, October 20–21, 2014, Los Angeles, CA, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2839-5/14/10 ...$15.00.
http://dx.doi.org/10.1145/2658260.2658267.
1. INTRODUCTION
Internet usage has significantly evolved over the years, and today
is mostly centered around location-independent services. However,
since the Internet architecture is host-centric, content requests still
have to be directed towards an individual server using IP addresses.
The translation from content name to IP address is realized through
different technologies, e.g., DNS and HTTP redirection, which are
implemented by several systems, such as content delivery networks
(CDN) [1] and cloud services [2].
Information-centric networking (ICN) offers a radical alternative
by advocating name-based forwarding directly at the network layer,
i.e., forwarding decisions are based on the content name carried by
each packet [3]. Names provide routers with information about the
forwarded content, which enables functionalities, such as caching
or multicasting, as network-layer primitives. In particular, the use
of hierarchical names [4] also allows efficient route aggregation
and makes mechanisms to translate content names into IP addresses
unnecessary.
At the core of the ICN architecture is a network device called
content router, responsible for name-based forwarding. Building
a content router is challenging because of two major issues [5, 6].
First, due to the ever-increasing availability of content, the size of
the forwarding tables is expected to be from one to two orders of
magnitude larger than current tables. Second, content names may
be long, having a large number of components as well as many
characters per component, which makes several previous hardware
optimizations proposed for fixed-length IP prefixes ineffective [7].
In this paper, we address these ICN challenges and introduce
Caesar, a content router compatible with existing protocols and
network equipment. Caesar’s forwarding engine features three key
optimizations to accelerate name lookups. First, its name-based
longest-prefix matching (LPM) algorithm relies on a novel data
structure called prefix Bloom filter (PBF). The PBF is introduced
to achieve high caching efficiency by exploiting the hierarchical
nature of content prefixes. Second, a fast hashing scheme is pro-
posed to reduce the PBF processing overhead by a multiplicative
factor. Finally, Caesar takes advantage of a cache-aware hash table
designed with an efficient collision resolution scheme. The goal is
to minimize the number of memory accesses required to find the
next-hop information for each packet.
Based on the proposed design, we implement the data plane of
Caesar using a µTCA chassis and multiple line cards equipped with
a network processor. In its basic design, Caesar maintains a full
copy of the forwarding information base (FIB) at each line card.
For each received packet, our name-based LPM algorithm runs
independently at the input line card, and an Ethernet switch then
moves the packet to the output line card following the forwarding
decision made. To support large FIBs, we extend Caesar with a dis-
tributed forwarding scheme where each line card stores only part of
the FIB and collaborates with the others to perform LPM. A second
extension is also implemented to further increase Caesar’s forward-
ing speed by offloading name-based LPM to a graphics processing
unit (GPU), if required.
We evaluate Caesar using our full prototype and a commercial
traffic generator that uses synthetic and real traces for the content
prefixes and requests. Our main finding is that every line card of
Caesar is able to sustain up to 10 Gbps with 188-byte packets and a
large FIB with 10 million prefixes. Distributed forwarding over line
cards sharing their FIB is shown to allow the forwarding table to
increase linearly with the number of cards at the cost of a 15% rate
reduction and a few microseconds of additional delay. The GPU
extension is also shown to outperform previously proposed designs
for the same hardware.
The remainder of this paper is organized as follows. In Section 2,
we provide some background and an overview of the work closely
related to Caesar. Section 3 presents the design of our name-based
LPM algorithm. Section 4 then introduces the implementation of
Caesar, while its extensions are presented in Section 5. We evaluate
Caesar’s performance in Section 6 and, in Section 7, we discuss the
design and implementation of additional content router features.
Finally, Section 8 concludes the paper.
2. RELATED WORK
This section summarizes the work related to Caesar. Section 2.1
provides a brief background on fundamental NDN concepts, such
as name-based longest prefix matching (LPM). Section 2.2 then
overviews the state of the art in name-based LPM, and Section 2.3
presents the related work on content router design.
2.1 Background
Caesar uses the hierarchical naming scheme proposed by NDN
to address content [4, 8, 9, 10]. In this scheme, each content has a
unique identifier composed of a sequence of strings, each separated
by a delimiting character (e.g., /ancs2014/papers/paperA). We
refer to this identifier as the content name, and to each string in
the sequence as a component. For delivery, content usually has
to be split into several different packets, which are identified by
appending an extra individual component to the original content
name (e.g., /ancs2014/papers/paperA/packet1). For scal-
ability, content routers only maintain forwarding information for
content prefixes that aggregate several content names into a single
entry (e.g., /ancs2014/papers/*).
A content router uses name-based longest prefix matching (LPM)
to determine the interface where a packet should be forwarded.
Name-based LPM consists of selecting from a local forwarding in-
formation base (FIB) the content prefix sharing the longest prefix
with a content name. Although the concept is similar to LPM for IP,
name-based LPM faces serious scalability challenges [5, 6]. An
ICN FIB is expected to be at least one order of magnitude larger
than the average FIB of current IP routers. In addition, several
hardware optimizations that take advantage of the fixed length of
IP addresses are not possible in ICN due to the variable length of
content names and prefixes.
2.2 Name-Based LPM
Motivated by these challenges, a few techniques have recently
been proposed for name-based LPM. Wang et al. [11] propose to
use name component encoding (NCE), a scheme that encodes the
components of a content name as symbols and organizes them as a
trie. Due to its goal of compacting the FIB, NCE requires several
extra data structures that add significant complexity to the lookup
process, and result in several memory accesses to find the longest
prefix match. NameFilter [12] is an alternative name-based LPM
algorithm employing one Bloom filter per prefix length, similarly
to the solutions proposed in [13, 14] for IP addresses. For lookup, a
d-component content name then requires d lookups in the different
Bloom filters. This approach has two intrinsic limitations. First,
it cannot handle false positives generated by the Bloom filters, and
thus packets can eventually be forwarded to the wrong interface.
Second, it cannot support a few important functionalities, such as
multipath routing and dynamic forwarding.
In a different approach, So et al. [6, 8] implement LPM using
successive lookups in a hash table. Instead of using the longest-
first strategy (i.e., lookups start from the longest prefix), the search
starts from the prefix length where most FIB prefixes are centered,
and restarts at a larger or shorter length, if needed. The approach
bounds the worst-case number of lookups, but cannot guarantee
constant performance bounds.
Different from previous work, we reduce the problem of name-
based LPM to two stages (cf. Section 3). The first stage finds the
length of the longest prefix that matches a content name. This stage
is accomplished by a Bloom filter variant engineered for content
prefixes, which guarantees a constant number of memory accesses.
The second stage consists of a hash table lookup to find the output
interface to which a packet should be forwarded. This last stage
only requires a single lookup with high probability, detects false
positives, and supports enhanced forwarding functionalities.
2.3 Content Router Design
To date, the work in [6, 8] is the only previous attempt to build
a content router. In this work, the content router is implemented
on a Xeon-based Integrated Service Module. Packet I/O is han-
dled by regular line cards, while name-based LPM is performed on
a separate service module connected to the line cards via a switch
fabric. Real experiments show that the module sustains a maximum
forwarding rate of 4.5 Mpps (million packets per second). Simu-
lations without packet I/O show that the proposed name-based
LPM algorithm handles up to 6.3 Mpps.
Different from [6, 8], Caesar supports name-based LPM directly
on I/O line cards in order to reduce latency, increase the overall
router throughput, and enable ICN functionalities without requir-
ing extra service modules. Real experiments show that, using a
single line card based on a cheaper technology than [6, 8], Caesar
achieves a comparable throughput (Section 6). Finally, Caesar also
allows line cards to share the content of their FIB in order to support
the massive FIB expected in ICN, and supports GPU offloading to
speed up the forwarding rate (Section 5).
3. NAME-BASED LPM
In this section, we introduce our two-stage name-based longest
prefix matching (LPM) algorithm used in the forwarding engine of
Caesar. Sections 3.1 and 3.2 describe the prefix Bloom filter (PBF)
and the concept of block expansion, respectively. Both are used in
the first stage to find the length of the longest prefix match. Then,
Section 3.3 explains the fast hashing scheme proposed to reduce
the hashing overhead of the PBF. Finally, Section 3.4 describes the
hash table used in the second stage as well as the optimizations
introduced to speed up the lookups.
3.1 Prefix Bloom Filter
For the first stage of our name-based LPM algorithm, we intro-
duce a novel data structure called prefix Bloom filter (PBF) and use
it as an oracle to identify the length of the longest prefix match.
The PBF takes advantage of the semantics in content prefixes to
find the longest prefix match using a single memory access, with
high probability.
[Figure 1: Insertion of a prefix p into a PBF with b blocks, m
bits per block, and k hash functions. The function g(p_1) selects
a block using the subprefix p_1 with the first component, and
bits h_1(p), h_2(p), ..., h_k(p) are set to 1.]
The PBF is a space-efficient data structure composed of several
blocks, where each block is a small Bloom filter with the size of
one (or a few) cache line(s). Each content prefix in the FIB is inserted
into a block chosen from the hash value of its first component. Dur-
ing the lookup of a content name, its first component identifies the
unique block, or cache line(s), that must be loaded from memory to
cache before conducting the membership query.
Figure 1 shows the insertion of a content prefix p into a PBF
composed of b blocks, m bits per block, and k hash functions.
Let p = /c_1/c_2/.../c_u be the u-component prefix to be inserted
into the PBF, and let p_i = /c_1/c_2/.../c_i be the subprefix with
the first i components of p, such that p_1 = /c_1, p_2 = /c_1/c_2,
and so on. A uniform hash function g(·) with output in the range
{0, 1, ..., b − 1} is used to determine the block where p should
be inserted. The hash value g(p_1) is computed from the subprefix
p_1 defined by the first component. This guarantees that all prefixes
starting with the same component are stored in the same block,
which enables fast lookups. Once the block is selected, the hash
values h_1(p), h_2(p), ..., h_k(p) are computed using the complete
prefix p, resulting in k indexes within the range {0, 1, ..., m − 1}.
Finally, the bits at the positions h_1(p), h_2(p), ..., h_k(p) in the se-
lected block are set to 1.
To find the length of the longest prefix in the PBF that matches
a content name x = /c_1/c_2/.../c_d, the first step is to identify
the index of the block where x or its subprefixes may be stored.
Such a block is selected using the hash of the first component of the
content name, g(x_1). Once this block is loaded, a match is first
tried using the full name x, i.e., the maximum length. The bits at the
positions h_1(x), h_2(x), ..., h_k(x) are then checked and, if all bits
are set to 1, a match is found. Otherwise, or if a false positive is
detected (cf. Section 3.4), the prefix x_{d−1} is checked using the
same procedure and, if there is no match, x_{d−2} is then checked,
and so on until a match is found or until all subprefixes of x have
been tested. At each membership query, the bits at the positions
h_1(x_i), h_2(x_i), ..., h_k(x_i) are checked, for 1 ≤ i ≤ d, account-
ing for a maximum of k × d bit checks per name lookup in the worst
case. Bit checks require only a single memory access, as the bits
to be checked reside in the same block.
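To make the two-stage lookup concrete, the following minimal Python sketch mimics PBF insertion and the longest-first membership queries described above. It is only an illustration under simplifying assumptions: plain Python lists stand in for cache-line-sized blocks, a generic keyed hash replaces the CRC32-based hardware hashing used in the prototype, and the sizes and function names are ours, not the paper's.

```python
import hashlib

B, M, K = 1024, 1024, 2            # blocks, bits per block, hash functions (illustrative values)

def _hash(data: str, seed: int) -> int:
    """Generic seeded hash; the prototype uses CRC32-based NPU instructions instead."""
    digest = hashlib.blake2b(data.encode(), salt=seed.to_bytes(16, "little")).digest()
    return int.from_bytes(digest[:8], "little")

def components(name: str):
    return [c for c in name.split("/") if c]

def subprefix(name: str, i: int) -> str:
    return "/" + "/".join(components(name)[:i])

pbf = [[0] * M for _ in range(B)]  # b blocks of m bits each

def pbf_insert(prefix: str) -> None:
    block = pbf[_hash(subprefix(prefix, 1), seed=0) % B]   # g(p_1) selects the block
    for i in range(1, K + 1):                              # set k bits from the full prefix
        block[_hash(prefix, seed=i) % M] = 1

def pbf_lpm_length(name: str) -> int:
    """First LPM stage: length of the longest candidate prefix, 0 if none (false positives possible)."""
    block = pbf[_hash(subprefix(name, 1), seed=0) % B]     # g(x_1): single block to load
    for length in range(len(components(name)), 0, -1):     # longest prefix first
        p = subprefix(name, length)
        if all(block[_hash(p, seed=i) % M] for i in range(1, K + 1)):
            return length
    return 0

pbf_insert("/ancs2014/papers")
print(pbf_lpm_length("/ancs2014/papers/paperA/packet1"))   # 2, unless a rare false positive occurs
```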
The false positive rate of the PBF is computed as follows. If n_i
is the number of prefixes inserted into the i-th block, then the false
positive rate of this block is f_i = (1 − e^{−k·n_i/m})^k. We consider
two possible cases of false positives. First, assume the worst-case
scenario where the name to be looked up and all of its subprefixes
are not in the FIB. In this case, assuming a content name with d
components, the number F_i of false positives in the i-th block fol-
lows the binomial distribution F_i ∼ B(d, f_i). The average number
of false positives in this block is then d × f_i. Since the function
g(·) is uniform, each block is chosen with probability 1/b, where
b is the number of blocks, and thus the average number of false
positives in the PBF for a content name with d components and no
matches is d × f̄, where f̄ = (1/b) Σ_{i=1}^{b} f_i is the average false
positive rate.
Consider now the case where either the content name or at least
one of its subprefixes is in the table, and let l be the length of its
longest prefix match. In this case, a false positive can only occur
for a subprefix whose length is larger than l, i.e., the l-component
subprefix is a real positive and the search stops. The number F_i of
false positives in the i-th block then follows the binomial distribu-
tion F_i ∼ B(d − l, f_i), and the average number of false positives
in this block is (d − l) × f_i. In general, for a d-component name
whose longest prefix match has length l, the average number of
false positives in the PBF is (d − l) × f̄.
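As a rough sanity check (our arithmetic, using the parameters chosen later in Section 6.2: k = 2 hash functions, blocks of m = 128 bytes = 1024 bits, and at most 75 prefixes per block), a full block yields f_i = (1 − e^{−2·75/1024})² ≈ (0.136)² ≈ 0.019. A 4-component name with no matching prefix then triggers on average about 4 × 0.019 ≈ 0.08 false positives, i.e., fewer than one unnecessary hash table lookup every ten such packets.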
3.2 Block Expansion
For fast lookups, the PBF is designed such that prefixes sharing
their first component are stored in the same block. It follows that
if many prefixes in the FIB share the same first component, then
the corresponding block may yield a high false positive rate. To
address this, we propose a technique called block expansion that
redirects some content prefixes to other blocks, allowing the false
positive rate to be reduced in exchange for loading a few additional
blocks from memory.
Block expansion is used when the number n_i of prefixes in the
i-th block exceeds the threshold t_i = −(m/k) ln(1 − f_i^{1/k}) se-
lected to guarantee a maximum false positive rate f_i. For now,
assume that prefixes are inserted in order from shorter to longer
lengths¹. Let n_{ij} be the number of j-component prefixes stored in
the i-th block. If at a given length l the number Σ_{j≤l} n_{ij} exceeds
the threshold t_i, then a block expansion occurs. In this case, each
prefix p with length l or higher is redirected to another block chosen
from the hash value g(p_l) of its first l components. To keep track
of the expansions, each block keeps a bitmap with w bits. The l-th
bit of the bitmap is set to 1 to notify that an expansion at length l
occurred in the block. If the new block indicated by g(p_l) already
has an expansion at a length e, with e > l, then any prefix p with
length e or higher is redirected again to another block indicated by
g(p_e), and so on.
Figure 2 shows the insertion of a prefix p = /c_1/c_2/.../c_u into a
PBF using block expansion. First, block i = g(p_1) is identified as
the target for p. Assuming that the threshold t_i is reached at prefix
length l, block i is expanded and the l-th bit of its bitmap is set.
Since l ≤ u, a second block j = g(p_l) is then computed from
the first l components of p and, assuming block j is not expanded,
positions h_1(p), h_2(p), ..., h_k(p) of this block are set to 1.
The lookup process works as follows. Let x be the prefix to be
looked up, and i = g(x_1) be the block where x or its LPM should
be. First, the expansion bitmap of block i is checked. If the first bit
set in the bitmap is at position l and x has l or more components,
then block j = g(x_l) is also loaded from memory. Assuming that
no bits are set in the bitmap of j, prefixes x_l and higher are checked
in block j. In case there are no matches, then prefixes x_{l−1} and
lower are checked in block i.
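The bookkeeping behind block expansion can be pictured with the short Python sketch below. It is our own illustration, not the NPU implementation: it computes the insertion threshold t_i for a target false positive rate and follows the chain of expansion bits g(p_1), g(p_l), g(p_e), ... to the block that actually holds a prefix of a given length; the helper names and numeric values are assumptions chosen for the example.

```python
import math

M_BITS, K = 1024, 2                        # filter bits per block and hash functions (illustrative)

def expansion_threshold(f_max: float) -> int:
    """t_i = -(m/k) * ln(1 - f_max^(1/k)): how many prefixes a block may hold before
    it must be expanded to keep its false positive rate below f_max."""
    return int(-(M_BITS / K) * math.log(1.0 - f_max ** (1.0 / K)))

def target_block(g, prefix_components, expansion_bits_of):
    """Follow the expansion chain: start at g(p_1); whenever the current block has an
    expansion bit set at some length l <= prefix length, re-hash on the first l components."""
    length = 1
    block = g(prefix_components[:1])
    while True:
        expanded_at = [l for l in expansion_bits_of(block)
                       if length < l <= len(prefix_components)]
        if not expanded_at:
            return block
        length = min(expanded_at)
        block = g(prefix_components[:length])

# With a 1% false positive target and k = 2, a block holds at most about 53 prefixes:
print(expansion_threshold(0.01))

# Toy redirection: block g(["ancs2014"]) was expanded at length 2, so a 3-component
# prefix starting there is re-hashed on its first two components.
g = lambda comps: hash(tuple(comps)) % 8
expansions = {g(["ancs2014"]): {2}}
print(target_block(g, ["ancs2014", "papers", "paperA"],
                   lambda blk: expansions.get(blk, set())))
```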
The false positive rate of the PBF with block expansion is sim-
ilar to the case without expansion, except for two key differences.
First, the filter size is now m − w bits, since the first w bits of the
block are used for the expansion bitmap. The range of the hash
functions h_i is thus {0, 1, ..., m − w − 1}. Second, the number
n_i of prefixes inserted in each block i is now computed from the
original insertions, minus the prefixes redirected to other blocks,
plus the prefixes coming from the expansion of other blocks.
¹ The dynamic case, where the prefixes in the FIB change over time,
is addressed by the control plane, and explained in Section 4.3.
[Figure 2: Insertion of a prefix p = /p_1/p_2/.../p_d into a PBF
using block expansion. If block i = g(p_1) reached its insertion
threshold, or if the l-th bit is set in its bitmap and l ≤ d, then p
is inserted into block j = g(p_l).]
3.3 Hashing
Hashing is a fundamental operation in our name-based LPM al-
gorithm. For the lookup of a content name p = /c_1/c_2/.../c_d
with d components and k hash functions in the PBF, a total of k × d
hash values must be generated for LPM in the worst case, i.e., the
running time is O(k × d). Longer content names thus have a higher
overall impact on the system throughput than shorter names. To re-
duce this overhead, we propose a linear O(k + d) run-time hashing
scheme that only generates k + d − 1 seed hash values, while the
other (k − 1)(d − 1) values are computed from XOR operations.
The hash values are computed as follows. Let H_{ij} be the i-th
hash value computed for the prefix p_j = /c_1/c_2/.../c_j contain-
ing the first j components of p. Then, the k × d values are computed
on demand as

    H_{ij} = h_i(p_j)              if i = 1 or j = 1
    H_{ij} = H_{i1} ⊕ H_{1j}       otherwise

where h_i(p_j) is the value computed from the i-th hash function
over the j-component prefix p_j, and ⊕ is the XOR operator. The
use of XOR operations significantly speeds up the computation
time without impacting hashing properties [14].
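The scheme can be illustrated with the short Python sketch below (our own code, with zlib.crc32 and a folded-in seed standing in for the NPU's CRC32 instructions): the full k × d table of hash values is filled from only k + d − 1 seed hashes, and the remaining (k − 1)(d − 1) values are derived with XOR.

```python
import zlib

def seed_hash(i: int, prefix: str) -> int:
    """Stand-in for the i-th seed hash h_i(prefix); the prototype uses CRC32 instructions."""
    return zlib.crc32(f"{i}:{prefix}".encode()) & 0xFFFFFFFF

def all_hashes(components, k):
    """H[i][j] = i-th hash of the j-component subprefix (1-indexed). Only k + d - 1 seed
    hashes are computed; the other (k-1)(d-1) values are derived as H_i1 XOR H_1j."""
    d = len(components)
    subprefixes = ["/" + "/".join(components[:j]) for j in range(1, d + 1)]
    H = [[0] * (d + 1) for _ in range(k + 1)]
    for j in range(1, d + 1):                 # d seeds: h_1(p_1), ..., h_1(p_d)
        H[1][j] = seed_hash(1, subprefixes[j - 1])
    for i in range(2, k + 1):                 # k - 1 seeds: h_2(p_1), ..., h_k(p_1)
        H[i][1] = seed_hash(i, subprefixes[0])
    for i in range(2, k + 1):                 # (k-1)(d-1) derived values
        for j in range(2, d + 1):
            H[i][j] = H[i][1] ^ H[1][j]
    return H

H = all_hashes(["ancs2014", "papers", "paperA"], k=2)
print(H[2][3] == H[2][1] ^ H[1][3])           # True by construction
```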
3.4 Hash Table Design
After the PBF identifies the longest prefix length, the second
stage of our name-based LPM algorithm consists of a hash table
lookup to either fetch the next hop information or to rule out false
positives.
Figure 3 shows the structure of the hash table used in our system,
which consists of several buckets where the prefixes in the FIB are
hashed to. Our first design goal is to minimize memory access la-
tency. For this purpose, each bucket is restricted to the fixed size of
one cache line such that, for well-dimensioned tables, only a single
memory access is required to find an entry. In case of collisions,
entries are stored next to each other in a contiguous fashion up to
the limit imposed by the cache line size. Bucket overflow is man-
aged by chaining with linked lists, but this is expected to be rare if
the number of buckets is large enough.
Our second design goal is to reduce the string matching over-
head required to find an entry. As a result, each entry stores the
hash value h of its content prefix in order to speed up the match-
ing process. String matching on the content prefix only occurs if
there is a match first on this 32-bit hash value. Due to the large output
range of the hash function, an error is expected only with a small
probability of 2^{−32}, assuming uniformity.
[Figure 3: The structure of the hash table used to store the FIB.
Each bucket has a fixed size of one cache line, with overflows
managed by chaining. Each entry consists of a tuple ⟨h, i, a, p⟩
that stores the next hop information.]
Finally, our last goal is to maximize the capacity of each bucket.
For this purpose, the content prefix is not stored at each entry due
to its large and variable size. Instead, only a 64-bit pointer p to the
prefix is stored. To save space, next-hop MAC addresses are also
kept in a separate table and a 16-bit index a is stored in each entry.
A 16-bit index i is also required per entry to specify the output
line card of a given content prefix. Each entry in the hash table
then consists of a 16-byte tuple ⟨h, i, a, p⟩, where h is the hash of
the content prefix, i is the output line card index, a is the index of the
next-hop MAC address, and p is the pointer to the content prefix.
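As an illustration of this 16-byte layout, the sketch below packs each entry as a 4-byte hash, two 2-byte indexes, and an 8-byte pointer, so that seven entries plus the 8-byte chaining pointer fit in a 128-byte cache line. The packing order, endianness, and helper names are our assumptions; the paper fixes only the field widths.

```python
import struct

ENTRY_FMT = "<IHHQ"                      # h: u32 hash, i: u16 line card, a: u16 MAC index, p: u64 pointer
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)  # 16 bytes

def pack_entry(h: int, i: int, a: int, p: int) -> bytes:
    return struct.pack(ENTRY_FMT, h, i, a, p)

def bucket_lookup(bucket: bytes, h: int, prefix: str, read_prefix):
    """Scan one 128-byte bucket: compare the 32-bit hash first, and only on a hash
    match dereference the prefix pointer for the (rare) full string comparison."""
    for off in range(0, 7 * ENTRY_SIZE, ENTRY_SIZE):       # 7 entries + 8-byte chain pointer
        eh, i, a, p = struct.unpack_from(ENTRY_FMT, bucket, off)
        if eh == h and read_prefix(p) == prefix:
            return (i, a, p)
    return None                          # not in this cache line (chained overflow not shown)

bucket = pack_entry(0xDEADBEEF, 2, 7, 0x1000) + bytes(112)   # one used entry, rest zeroed
print(bucket_lookup(bucket, 0xDEADBEEF, "/ancs2014/papers", lambda p: "/ancs2014/papers"))
```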
4. CAESAR
This section explains the design and implementation of Caesar,
our high-speed content router prototype. Section 4.1 overviews the
hardware setup of Caesar, while Sections 4.2 and 4.3 present its data
and control plane, respectively.
4.1 Hardware
Caesar’s hardware is chosen with three key goals in mind:
Enterprise router: Caesar is a router for an enterprise network,
i.e., a few 10 Gbps ports. This impacts the choice of router chassis
as well as the selection of the type and number of line cards used.
Easy deployment: Caesar is easily deployable in current networks,
e.g., via a simple firmware upgrade of existing networking devices.
This constrains the hardware choice to programmable components
already widely adopted by commercial network equipment. We
thus resort to network processors optimized for packet processing.
Backward compatibility: Caesar is designed to be backward com-
patible with existing networking protocols. In particular, its switch
fabric is based on regular Ethernet switching, and thus name-based
forwarding is implemented on top of existing networking protocols
(e.g., Ethernet and IP) in a transparent fashion, without requiring a
clean slate approach.
Figure 4 shows the hardware architecture of Caesar. It consists of
a micro telecommunications computing architecture (µTCA) chas-
sis with slots for advanced mezzanine cards (AMCs). Four slots
of the chassis are occupied by line cards, each equipped with a
network processor unit (NPU), a 4-GB off-chip DRAM, a SFP+
10GbE interface, and a 10 Gbps interface to the backplane. Each
NPU has a 10-core 1.1 GHz 64-bit MIPS processor with 32-KB L1
cache per core, and 2-MB L2 shared cache. Some of the remaining
slots of the chassis are occupied by an Ethernet switch with 10GbE
ports, one connected to each slot via the backplane, and the route
controller, composed of an Intel Core Duo 1.5 GHz processor, 4-GB
off-chip DRAM, and a 300-GB hard disk.
[Figure 4: The hardware architecture of Caesar.]
4.2 Data Plane
Caesar’s data plane is responsible for forwarding packets received
by the line cards. Next, we describe the path of a packet within Cae-
sar, from the moment it is received until it leaves the system.
Packet input: As a packet is received from the SFP+ 10GbE exter-
nal interface, it is stored in the off-chip DRAM of the line card. A
hardware load balancer then assigns the packet to one of the avail-
able cores for processing.
Header parsing: A standard header format for ICN is currently
under debate in the ICN research group at the IRTF [15]. In the ab-
sence of such a standard, we use our own header, which consists of
four fields. First, the 16-bit length field specifies the size of the fol-
lowing content name field. To expedite parsing, we also include an
8-bit components field, which specifies the number of components
in the content name, and several offset fields, each containing an
8-bit offset for each component in the content name. For backward
compatibility, the name-based header is placed after the IP header,
which allows network devices to operate with their standard for-
warding policy, e.g., L2 or L3 forwarding.
Once dispatched to a core, each packet is checked for the pres-
ence of a name-based header by inspecting the protocol field of the
IP header. If a name-based header is present, pointers to each field
are extracted and stored in the L1 cache. Otherwise, regular packet
processing is performed, i.e., LPM on the destination IP address.
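A possible encoding of this header is sketched below in Python. The field widths follow the description above, but the exact field order and byte order are not specified in the text, so the layout used here (length, components, per-component offsets, then the name, in network byte order) is only an assumption for illustration.

```python
import struct

def build_header(name: str) -> bytes:
    comps = [c for c in name.split("/") if c]
    offsets, pos = [], 0
    for c in comps:
        pos += 1                          # skip the '/' delimiter preceding each component
        offsets.append(pos)               # 8-bit offset of the component within the name field
        pos += len(c)
    body = name.encode()
    return struct.pack("!HB", len(body), len(comps)) + bytes(offsets) + body

def parse_header(buf: bytes):
    name_len, ncomps = struct.unpack_from("!HB", buf, 0)
    offsets = list(buf[3:3 + ncomps])
    name = buf[3 + ncomps:3 + ncomps + name_len].decode()
    return name, ncomps, offsets

hdr = build_header("/ancs2014/papers/paperA")
print(parse_header(hdr))                  # ('/ancs2014/papers/paperA', 3, [1, 10, 17])
```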
Name-based LPM: If such a header is found, our name-based
LPM algorithm is used (cf. Section 3). The size of each PBF
block is set to one cache line, which is 128 bytes in our architec-
ture (cf. Section 6). To ensure fast hashing calculations, Caesar
takes advantage of the optimized instruction sets of the NPU; the
k + d − 1 seed hash values are computed using the CRC32 op-
timized instructions, whereas the remaining (k − 1)(d − 1) hash
values are computed from XOR operations. In case of a match in
the PBF, the content prefix is looked up in the hash table stored
in the off-chip DRAM to determine its next hop information, or to
rule out false positives. Each table entry has a fixed size of 16 bytes,
which, for a bucket of 128 bytes, results in a maximum of 7 entries
per bucket in addition to the 64-bit pointer required by the linked
list (cf. Section 3.4). We dimension the hash table to contain 10
million buckets, requiring a total of 1.28 GB to store the buckets.
An additional 640 MB are required to store the content prefixes, for
a total of 1.92 GB of storage.
Switching: The LPM algorithm returns the index of the output
line card and the MAC address of the next hop for a packet. The
source MAC address of the packet is then set to the address of the
backplane interface, and its destination MAC address is set to the
address of the next hop. Finally, the packet is placed into a per-core
output queue in the backplane interface and waits for transmission.
Each NPU core has its own queue in the backplane interface to
enable lockless queue insertions and avoid contention bottlenecks.
Once transmitted over the backplane, the packet is received by the
Ethernet switch, and regular L2 switching is performed. The packet
is then sent to the output line card over the backplane once again.
Packet output: Once received by the backplane interface of the
output line card, the packet is assigned to an NPU core and the source
MAC address is overwritten with the address of the SFP+ 10GbE
interface. The packet is then sent to the interface for transmission.
4.3 Control Plane
Caesar’s control plane is responsible for periodically computing
and distributing the FIB to line cards. These operations are per-
formed by the route controller, a central authority that is assumed
to participate in a name-based routing protocol [16] to construct
its routing information base (RIB). The RIB is structured as a hash
table that contains the next hop information for each reachable con-
tent prefix.
The FIB is derived from the RIB and is composed of the PBF
and the prefix hash table (cf. Section 3). To allow prefix insertion
and removal, the route controller maintains a mirror counting PBF
(C-PBF). For each bit in the PBF, the C-PBF keeps a counter that is
incremented at insertions and decremented at removals. Only when
a counter reaches zero is the corresponding bit in the PBF set to 0.
The C-PBF enables prefix removal without keeping counters
in the original PBF, which saves precious L2 cache space.
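The counter logic can be sketched as follows (illustrative Python, not the controller code; in Caesar the counters live only at the route controller, and just the resulting bit array is installed in the line cards' PBF):

```python
class CountingPBFBlock:
    """One C-PBF block: per-bit counters at the controller, a plain bit array for the data plane."""

    def __init__(self, m_bits: int):
        self.counters = [0] * m_bits
        self.bits = [0] * m_bits          # this is what gets pushed to the line card's PBF

    def insert(self, positions):
        for pos in positions:             # positions = h_1(p), ..., h_k(p) of the inserted prefix
            self.counters[pos] += 1
            self.bits[pos] = 1

    def remove(self, positions):
        for pos in positions:
            self.counters[pos] -= 1
            if self.counters[pos] == 0:   # clear the PBF bit only when no prefix uses it anymore
                self.bits[pos] = 0

block = CountingPBFBlock(1024)
block.insert([3, 97])                     # first prefix
block.insert([3, 512])                    # second prefix sharing bit 3
block.remove([3, 97])
print(block.bits[3], block.bits[97], block.bits[512])   # 1 0 1
```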
The C-PBF is updated on two different timescales. On a long
timescale (i.e., minutes), the C-PBF is recomputed from the RIB
with the goal of improving the prefix distribution across blocks. On a
short timescale (i.e., every insertion/removal) the C-PBF is greedily
updated. When inserting a new prefix, additional expansions are
performed on blocks that exceed the false-positive threshold. When
removing a prefix, block merges are postponed until the next long-
timescale update.
The content prefixes stored in the i-th block of the PBF are hi-
erarchically organized into a prefix tree to (1) easily identify the
length at which the threshold t_i is exceeded, and (2) efficiently
move prefixes during block expansions with a single pointer up-
date operation. The prefix tree of each block is implemented as a
left-child right-sibling binary tree for space efficiency.
5. CAESAR EXTENSIONS
In this section, we introduce two Caesar extensions in order to
support (1) large FIBs (i.e., tens of gigabytes), and (2) high-speed
forwarding (i.e., tens of Mpps). Large FIBs are supported by hav-
ing each line card store only part of the entire FIB and collaborate
with the others in a distributed fashion. High-speed forwarding is
supported by offloading large packet batches to a graphics process-
ing unit (GPU). Although efficient, these solutions may introduce
additional latency during packet processing and thus are presented
here as extensions that can be activated at the operator’s discretion.
Large FIBs: In its original design, Caesar stores a full copy of the
FIB at each line card, as commonly done by commercial routers.
Although this allows each line card to independently process pack-
ets at the nominal rate, it also results in FIB replication and waste of
storage resources. For IP prefixes, this is usually not a concern, as
a typical FIB contains less than one million entries. In ICN, how-
ever, the FIB can easily grow past hundreds of millions of content
prefixes [5, 6] and memory space becomes a real concern.
To address this issue, we propose a Caesar extension in Sec-
tion 5.1 that allows line cards to share their FIB entries. FIBs at
different line cards are populated with a unique set of prefixes such
that, overall, Caesar is able to store N times more content prefixes,
where N is the number of line cards. Since the individual FIB size
at each card does not change, line cards are still able to operate at
their nominal rate. The key challenge is then how to use a shared
FIB to perform LPM on each received packet.
High-speed forwarding: The classic strategy to increase forward-
ing speeds in routers is a hardware update. However, there is an
intrinsic scalability limitation to this approach, in addition to high
costs of both hardware and reconfiguration. For instance, upgrad-
ing Caesar’s line cards from 10 to 40 Gbps requires changing the
hardware architecture, with a 10x impact on cost. While this is
an option for the deployment of edge/core routers with a large set
of networking features, such cost is prohibitive for an enterprise
router.
In Section 5.2, we propose an alternative strategy that does not
incur such a high cost. Wang et al. [9] have recently shown that
high-speed LPM on content names is possible by exploiting the par-
allelism of popular off-the-shelf GPUs. As a second Caesar exten-
sion, we propose to use GPUs to accelerate packet processing. Cur-
rently, each GPU has an average cost of 10% of the aforementioned
architecture upgrade. The challenge is then how to efficiently lever-
age a GPU to guarantee fast name-based LPM.
For this extension, we assume that a GPU is associated with each
line card and that it stores the same FIB entries as the line card.
In our platform, a GPU is installed in an external device and con-
nected to a line card via the switch for power budget reasons. In
other platforms (e.g., Advanced Telecommunications Computing
Architecture (ATCA) with an enhanced NPU), a GPU can be directly
connected to a line card using a regular PCIe bus.
5.1 Distributed Forwarding
To share a large FIB among line cards, we implement a forward-
ing scheme where LPM is performed in a distributed fashion. The
idea is for each packet to be processed at the line card where its
longest prefix match resides, i.e., not necessarily the line card that
received the packet. A fast mechanism must then be in place for
each received packet to be directed to the correct line card for LPM.
For this extension, the following modifications to Caesar’s control
and data planes are required.
Control plane: The route controller now has to compute a different
FIB per line card. Each content prefix pin the RIB is assigned
to a line card Li, such that i=g(p1) mod N, where g(p1)is
the hash of the subprefix p1defined by the first component of p.
The rationale here is the same used in the PBF for block selection
(cf. Section 3.1); by distributing prefixes to line cards based on their
first component, it is possible for an incoming packet to be quickly
forwarded to the line card where its longest prefix match resides.
In addition to distributing the FIB, the route controller also main-
tains a Line card Table (LT) containing the MAC address of the
backplane interface of each line card. The LT is distributed to each
line card along with their FIB, and serves two key purposes.
First, the LT is used by each line card to delegate LPM to another
card (see data plane). Second, the LT allows the route controller
to quickly recover from failures. With distributed forwarding, the
failure of a line card may jeopardize the reachability to the prefixes
it manages. We solve this issue by allowing redirection of traffic
Algorithm 1: Kernel description
Input: Bloom filters B, hash tables H, content names C
Output: Lengths L of the LPM for each name c ∈ C
 1: prefixLength ← blockIdx div blocksPerLength
 2: blockIdxLength ← blockIdx mod blocksPerLength
 3: namesPerBlock ← ⌈|C| / blocksPerLength⌉
 4: namesOffset ← blockIdxLength × namesPerBlock
 5: namesLast ← MIN(namesOffset + namesPerBlock, |C|)
 6: tid ← namesOffset + threadIdx
 7: while tid < namesLast do
 8:     c ← READNAME(C, tid)
 9:     p ← MAKEPREFIX(c, prefixLength)
10:     m ← BFLOOKUP(B[prefixLength], p)
11:     if m = TRUE then
12:         interface ← HTLOOKUP(H[prefixLength], p)
13:         if interface ≠ NIL then
14:             ATOMICMAX(L[tid], prefixLength)
15:     tid ← tid + threads
from a failing line card to a backup line card. Once Caesar detects
a failure at a line card L_i, the route controller sends the FIB of L_i
to one of the additional pre-installed line cards and updates the LT
to reflect the change. The updated table is then distributed to all
line cards to complete the failure recovery.
Data plane: Upon receiving a packet with content name x, an
available NPU core computes the target line card L_i to process the
packet, with i = g(x_1) mod N. If L_i corresponds to the local line
card, then the regular flow of operations occurs, i.e., header extrac-
tion, name-based LPM, switching, and forwarding (cf. Section 4).
Otherwise, the destination MAC address of the packet is overwrit-
ten with the address of the backplane interface of L_i fetched from
the LT, and the packet is transmitted over the backplane. LPM then
occurs at L_i and the packet is sent once again over the backplane to
the output line card (if different from L_i) for external transmission.
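The delegation decision itself is a single hash on the first component of each packet's name. A hedged Python sketch (our function names, with zlib.crc32 standing in for g) is:

```python
import zlib

def first_component(name: str) -> str:
    return "/" + name.lstrip("/").split("/", 1)[0]        # x_1 = /c_1

def owning_line_card(name: str, num_line_cards: int) -> int:
    """Index i = g(x_1) mod N of the line card whose FIB partition can hold the LPM for this name."""
    return zlib.crc32(first_component(name).encode()) % num_line_cards

# The packet is processed locally only if it arrived at this card; otherwise it is sent
# over the backplane to line card `owner` for LPM.
owner = owning_line_card("/ancs2014/papers/paperA", num_line_cards=4)
print(owner)
```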
Distributed forwarding imposes two constraints as tradeoffs for
supporting a larger FIB. First, it introduces a short delay caused
by packets crossing the backplane twice. Second, extra switch-
ing capacity is required. In the worst case, i.e., when a packet is
never processed by the receiving line card, the switch must oper-
ate twice as fast, at a rate 2NR instead of NR, where R is the rate
of a line card. Nonetheless, as shown in [17], it is possible to
combine multiple low-capacity switch fabrics to provide a high-
capacity fabric with no performance loss at the cost of small coordi-
nation buffers. This is a common approach in commercial routers,
e.g., the Alcatel 7950 XRS leverages 16 switching elements to sus-
tain an overall throughput of 32 Tbps [18].
5.2 GPU Offloading
We also propose a Caesar extension to accelerate packet forward-
ing using a GPU. First, a brief background on the architecture and
operation of the NVIDIA GTX 580 [19] used in our implemen-
tation is provided. Then, a discussion on our name-based LPM
solution using this GPU is presented.
GPU background: The NVIDIA GTX 580 GPU is composed
of 16 streaming multiprocessors (SMs), each with 32 stream pro-
cessors (SPs) running at 1,544 MHz. This GPU has two mem-
ory types: a large, but slow, device memory and a small, but fast,
shared memory. The device memory is an off-chip 1.5 GB GDDR5
DRAM, which is accelerated by an L2 cache used by all SMs. The
shared memory is an individual on-chip 48 KB SRAM per SM.
Each SM also has several registers and an L1 cache to accelerate
device memory accesses.
All threads in the GPU execute the same function, called kernel.
The level of parallelism of a kernel is specified by two parameters,
namely, the number of blocks and the number of threads per block.
A block is a set of concurrently executing threads that collaborate
using shared memory and barrier synchronization primitives. At
run-time, each block is assigned to a SM and divided into warps, or
sets of 32 threads, that are independently scheduled for execution.
Each thread in a warp executes the same instruction in lockstep.
Name-based LPM: We introduce a few modifications to the
LPM algorithm to achieve an efficient GPU implementation. Due to
the serial nature of the NPU, the original algorithm uses a PBF to
test for several prefix lengths in the same filter (Section 3). How-
ever, to take advantage of the high level of parallelism in GPUs, an
LPM approach that uses a Bloom filter and hash table per prefix
length is more efficient. Since large FIBs are expected, both the
Bloom filters and hash tables are stored in device memory.
For high GPU utilization, multiple warps must be assigned to
each SM such that, when a warp stalls on a memory read, other
warps are available waiting to be scheduled. The GTX 580 can
have up to 8 blocks concurrently allocated and executing per SM,
for a total of 128 blocks. Content prefixes are assumed to have 128
components or less, and thus we have one block per prefix length
in the worst case. Since such a large number of components is
rare, we allow a higher degree of parallelism with multiple blocks
working on the same prefix length. In this case, each block operates
on a different subset of content names received from a line card.
Algorithm 1 shows our GPU kernel. As input, it receives arrays
B, H, and C that contain the Bloom filters, hash tables, and con-
tent names to be looked up, respectively. The kernel identifies the
length of the longest prefix in the FIB that matches each content name
c ∈ C and stores it in the array L, which is then returned to the line
card. All these arrays are located in the device memory. In the
algorithm, we take advantage of a few CUDA variables available to
each thread at run-time: blockIdx and threadIdx, which are the
block and thread indexes, and blocks and threads, which are the
number of blocks and the number of threads per block, respectively.
At line 1, each block uses its blockIdx index to compute the pre-
fix length that it is responsible for. The parameter BlocksPerLength
is passed to the kernel in order to control how many blocks are used
per prefix length. Line 2 computes the relative index (i.e., from 0 to
BlocksPerLength − 1) among the blocks responsible for this prefix
length. Lines 3–5 show the partitioning of the content names among
these blocks. Line 3 uses the batch size |C| to compute the number
of content names that each block must look up. Line 4 com-
putes the offset of the current block in C, and line 5 computes the
index of the first name outside the block range. The index of the
first content name to be read by the thread is computed in line 6.
Lines 7–15 are the core of the LPM. In each iteration, a thread
loads a different content name c (line 8), transforms it into a prefix p
(line 9), and performs a Bloom filter lookup (line 10). If a match is
found, a hash table lookup is performed (line 12), and line 13 makes
sure that the match is not a false positive. Finally, if an entry was
found, the prefix length is written to L[tid] using the ATOMICMAX
call (line 14). The ATOMICMAX(a, v) call is provided by the GPU
to write a value v to a given address a only if v is higher than the
contents of a. The operation is atomic across all SMs, and thus
line 14 ensures LPM is realized. The thread index is then increased
in line 15 and matching is initiated on the next content name.
6. EVALUATION
This section experimentally evaluates Caesar along with its ex-
tensions and the name-based LPM algorithm. First, Section 6.1
presents the experimental setting and methodology. Section 6.2
then presents results from a series of microbenchmarks to properly
dimension the PBF, the key data structure in our name-based LPM al-
gorithm. Finally, Sections 6.3 and 6.4 evaluate both Caesar and its
extensions, namely distributed forwarding and GPU offloading.
6.1 Experimental Setting
Using optical fibers, we connect Caesar to a commercial traffic
generator equipped with 10 Gbps optical interfaces. For ease of
presentation, we assume that the four line cards in Caesar work in
half-duplex mode, two for input traffic and two for output traffic.
Nevertheless, results can be extended to full-duplex configurations,
as well as to a larger number of line cards. To support content
names, the traffic generator produces regular IP packets with our
name-based header as payload (cf. Section 4). Each experiment
then consists of three parts: (1) traffic with desired characteristics
is originated at the traffic generator and transmitted to Caesar; (2)
packets are received by Caesar’s line cards and content names are
extracted; and (3) forwarding decisions are made and packets are
sent back to the generator. For each experiment, we mainly mea-
sure the forwarding rate and packet latency. The forwarding rate
is measured as the highest input rate that Caesar can handle for
60 seconds with no losses. Packet latency is described by the
minimum, maximum, and average latency of the packets forwarded
within the selected 60-second time frame.
We call workload the combination of a set of content prefixes
stored in Caesar’s FIB, and content names requested via the traffic
generator. We derive a reference workload from the trace described
in [9]; this trace contains 10 million URLs collected by crawling
the Web. The assumption here is that the hostname extracted from
an URL is representative of a content prefix in ICN. Content names
are then generated by adding random suffixes to content prefixes
randomly selected from the trace; this is the same procedure used
in [9], and produces content names that are 42-Bytes long. Overall,
most content prefixes in the reference workload are short, with only
2 components on average, whereas content names have between 3
and 12 components, with 4 components on average. The average
distance between content names and their matching prefixes is
only equal to 2 components. To avoid the effect of congestion and
traffic management, we assume next hops associated with content
prefixes are uniformly distributed over the two output line cards.
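For reference, the name-generation step can be reproduced with a few lines of Python. This is a hedged reconstruction of the procedure just described; the suffix format and component counts are our own choices rather than those of [9].

```python
import random

def generate_names(prefixes, n_names, max_extra_components=3, seed=0):
    """Build synthetic content names by appending random suffix components
    to content prefixes drawn uniformly at random from the trace."""
    rng = random.Random(seed)
    names = []
    for _ in range(n_names):
        prefix = rng.choice(prefixes).rstrip("/")
        extra = rng.randint(1, max_extra_components)
        suffix = "/".join(f"s{rng.randrange(10**6)}" for _ in range(extra))
        names.append(f"{prefix}/{suffix}")
    return names

prefixes = ["/example.com", "/ancs2014/papers"]          # stand-ins for hostnames from the URL trace
print(generate_names(prefixes, n_names=3))
```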
Throughout the evaluation, we also use synthetic workloads to
assess the impact of system parameters and traffic characteristics on
Caesar’s performance. Synthetic workloads are generated from the
reference workload by varying the following parameters: (1) the
average distance Δ, which affects the number of potential PBF/hash
table lookups, as well as the complexity of the hashing operation;
(2) the number of content prefixes in the FIB, which affects the FIB
size and access speed; and (3) the number of content prefixes shar-
ing the first component, which affects the distribution of prefixes
among PBF blocks, and thus the false positive rate.
6.2 PBF Dimensioning
We start by motivating the choice of the number of hash func-
tions k used in the PBF. The goal is to minimize the cost of comput-
ing seed hash values, as this operation has a high computation time
(cf. Table 2). After extensive investigation, we set k = 2 since the
generation of additional seed hash values significantly hurts Cae-
sar’s forwarding rate, with only a marginal false positive reduction.
[Figure 5: PBF dimensioning and Caesar's evaluation. (a) Forwarding rate as a function of the number of prefixes per block, for
block sizes of 128 B and 256 B (single input line card). (b) False positive probability and forwarding rate as a function of the total
PBF size (single input line card). (c) Forwarding rate as a function of the average component distance Δ, for PBF-ideal, PBF,
PBF-exp, and NoPBF (two input line cards). (d) Forwarding rate as a function of the number of prefixes (two input line cards).]
We now focus on PBF dimensioning. Figure 5(a) shows Cae-
sar’s forwarding rate in millions of packets per second (Mpps) as
a function of the number of content prefixes per PBF block. The
block size is set to one and two cache lines, corresponding to 128
and 256 bytes, respectively. For simplicity, we assume a single line
card is active, and use synthetic workloads. The figure shows a key
result: the fastest forwarding rate is measured when a block fits in
a single cache line and there are less than 100 content prefixes per
block. Therefore, for the rest of the evaluation we set the block size
m equal to 128 bytes and the expansion threshold t_i for a block i
to 75 prefixes, i.e., the largest value that does not reduce the for-
warding rate (cf. Figure 5(a)).
Figure 5(a) shows another interesting result. When a block con-
tains less than 200 prefixes, increasing the block size slightly re-
duces Caesar’s forwarding rate. A larger block size is instead ben-
eficial to the forwarding rate when the block contains more than
200 prefixes. It comes as no surprise that, overall, a larger block
size provides a lower false positive rate for the same number of
content prefixes per block. Accordingly, 200 prefixes per block is
the threshold for which the additional memory accesses required to
load a larger block are amortized by the lower false positive rate.
Figure 5(b) shows Caesar's forwarding rate as a function of the
total PBF size s = m × b, where b is the number of PBF blocks, for
the reference workload, i.e., 10 million content prefixes. As above,
just one line card is active. When s < 20 MB, the forwarding rate
quickly grows from 6 to 6.7 Mpps. For s > 20 MB, the forwarding
rate is constant at 6.7 Mpps. As above, this effect is due to the fact
that the false positive rate quickly flattens out as s increases. Ac-
cordingly, we set the PBF size s to 30 MB, which is the minimum
PBF size that maximizes the forwarding rate.
Based on these parameters (k = 2, m = 128 bytes, t_i = 75), we
compute the number of expansions required per prefix in the refer-
ence workload. We find that a single expansion is enough to handle
95% of the content prefixes; however, 1% of the prefixes incur four
expansions, which is the maximum number of expansions required
to handle content prefixes from the reference workload.
6.3 Caesar
This section evaluates Caesar. For completeness, we consider
several variants of the first stage of our name-based LPM algorithm
(cf. Section 3): (1) PBF, where the PBF is used without expansion,
(2) PBF-exp, where PBF expansion is enabled, (3) NoPBF, where
no PBF is used. In the NoPBF case, all possible content prefixes
originated from a requested content name are looked up directly in
the hash table, from the longest to the shortest prefix. As an upper
bound for performance, we introduce the PBF-ideal. This consists
of using a PBF with expansion while assuming an ideal synthetic
workload where all content prefixes differ in their first component.
In this case, content prefixes are uniformly distributed among PBF
blocks.
In the remainder of this section, we first evaluate Caesar’s per-
formance assuming the reference workload. Then, we present a
sensitivity analysis that leverages several synthetic workloads to
quantify the impact of workload characteristics on Caesar.
Reference workload: We start by measuring Caesar’s forwarding
rate under each of the four variants: PBF, PBF-exp, NoPBF, and
PBF-ideal. The reference workload is used for all variants, with the
exception of PBF-ideal, which uses the ideal workload. Table 1 sum-
marizes the results from these experiments, differentiating between
the cases of one and two active line cards for the forwarding rate.
We first focus on the forwarding rate achieved under the PBF-exp
variant. Assuming a single line card, the table shows that Caesar
supports a maximum of 6.6 Mpps when a matching content prefix
is found in the FIB (Match), and up to 7.5 Mpps when no matches
are found (No Match), i.e., the corresponding packet is forwarded
to a default route. At 10 Gbps, 6.6 Mpps translates to a minimum
packet size of 188 bytes. The table also shows that doubling the
number of active line cards doubles the overall forwarding rate. Accordingly,
Caesar sustains up to 10 Gbps of input traffic per line card assum-
ing a minimum packet size of 188 bytes, and a FIB with 10 million
content prefixes. In the remainder of this paper, we focus on results
and experiments where two line cards are active.
Table 1 also shows that PBF-exp pays only a 2% reduction of
the forwarding rate compared to the ideal case (PBF-ideal). This
reduction of the forwarding rate is due to the additional memory
accesses and complexity required by PBF-exp to deal with the non-
uniform distribution of content prefixes among blocks. Compared
to the PBF-exp variant, the absence of the expansion mechanism
(PBF) costs an additional 2% reduction of the forwarding rate; this
is due to the high false positive rate in overloaded PBF blocks. Sur-
prisingly, Table 1 shows that PBF-exp gains only about 5% over a so-
lution without a PBF (NoPBF), i.e., a forwarding rate of 13.1 Mpps
                           PBF-ideal   PBF-exp    PBF        NoPBF
Fwd Rate Match (Mpps)      6.7/13.3    6.6/13.1   6.3/12.5   6.3/12.5
Fwd Rate No Match (Mpps)   7.5/14.9    7.5/14.9   6.3/12.5   5.2/10.2
Min. latency (µs)          5.4         5.6        5.8        5.8
Avg. latency (µs)          6.4         6.5        6.9        7.0
Max. latency (µs)          8.1         8.1        9.4        9.9
Table 1: Caesar's forwarding rate (Mpps) and latency (µs)
with the reference workload. For the forwarding rate, we dif-
ferentiate between experiments with one and two line cards.
            Total   I/O processing   Hashing   HT lookup   PBF lookup
PBF-ideal   1412    371              363       384         294
PBF-exp     1553    371              440       385         357
PBF         1763    371              440       656         294
NoPBF       1781    371              462       948         -
Atomic      -       371              107       251         129
Table 2: CPU cycles per operation.
versus 12.5 Mpps. Such a small gain is due to the simplicity of the
reference workload, where Δ, the average distance between a name
and its matching prefix, is low (i.e., 2 components on average). In
this case, the NoPBF variant requires, on average, only two extra
hash table lookups to perform LPM compared to both PBF and
PBF-exp. Larger gains from the PBF data structure are shown
later in the presence of adversarial workloads. Nevertheless,
Table 1 shows that PBF-exp gains about 30% over NoPBF when
none of the incoming content names matches a FIB entry. This
result suggests that the PBF-exp is robust to DoS attacks, where
an attacker generates non-existing content names to slow down a
content router.
We now focus on packet latency. Table 1 indicates that PBF-exp
provides a slightly lower latency than both PBF and NoPBF, on
average. In fact, due to the simplicity of the reference workload,
the switch is responsible for most of the packet latency, and the
impact of the name-based LPM variant on the average latency is
minimal. The maximum latency, however, shows a significant differ-
ence. PBF-exp reduces the maximum latency by more than 15%
compared to both PBF and NoPBF. The maximum latency is due
to packets whose content names have many components (e.g., 12),
and a high average distance Δ from prefixes in the FIB. In this
case, the algorithmic benefit of PBF-exp plays a role as LPM starts
contributing to the overall latency. The table also shows that the
expansion mechanism only causes a 2-4% latency increase with re-
spect to the ideal case.
We dissect Caesar’s performance bottlenecks by tracking the to-
tal number of CPU cycles per major operation, assuming the refer-
ence workload. Table 2 reports the number of CPU cycles spent, on
average, in the execution of the following operations: I/O process-
ing², hashing, hash table lookup, and PBF lookup. We differentiate
between PBF, PBF-exp, NoPBF, and PBF-ideal; we also investi-
gate the cost of each operation in isolation (Atomic in the table),
i.e., the CPU cycles for a single execution of an operation.
Overall, the results for the total CPU cycles in Table 2 confirm
the trend shown in Table 1, with PBF-ideal being the least and
NoPBF being the most CPU-hungry. In isolation (the Atomic row),
I/O processing and hash table lookup require the most CPU cycles.
However, while I/O processing is performed once per packet, hash
table lookup might be performed multiple times according to the
LPM variant adopted. Accordingly, the hash table lookup opera-
tion accounts for a minimum of 25% (PBF-exp) and a maximum of
50% of the CPU cycles (NoPBF). This result showcases the algo-
rithmic advantage of the PBF-exp in reducing the number of hash
table lookups. Conversely, PBF-exp requires some additional CPU
cycles for the PBF lookup, since occasionally more than one block
might be loaded, e.g., PBF-exp requires on average 357 CPU cycles
whereas both PBF and PBF-ideal require only 294 cycles. Finally,
although hashing per se is not CPU-intensive, on average it accounts
for 25% of the total number of cycles since several seed hash values
are computed (cf. Section 3.3).
²I/O processing consists of header parsing, MAC address lookup,
and header rewriting for packet switching.
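To make the lookup-count argument above concrete, the sketch below shows a generic Bloom-filter-assisted name LPM in Python. It illustrates the general technique only, not Caesar's actual PBF layout, block organization, or hash functions (those are defined earlier in the paper): a baseline that probes the forwarding hash table once per prefix length, longest first, versus a variant that consults a membership filter before each probe. All identifiers (BloomFilter, lpm_no_filter, lpm_with_filter, the example FIB) are made up for the example.

    # Illustrative sketch of Bloom-filter-assisted name LPM; not Caesar's
    # actual PBF. It shows why the filter cuts the hash-table probes that
    # dominate the NoPBF cycle counts in Table 2.
    import hashlib

    class BloomFilter:
        def __init__(self, size=1 << 20, k=4):
            self.bits, self.size, self.k = bytearray(size), size, k

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}/{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = 1

        def __contains__(self, key):
            return all(self.bits[pos] for pos in self._positions(key))

    def components(name):
        return [c for c in name.split("/") if c]

    def lpm_no_filter(name, fib):
        """Baseline (NoPBF-like): one hash-table probe per prefix length."""
        comps, probes = components(name), 0
        for length in range(len(comps), 0, -1):       # longest prefix first
            probes += 1
            prefix = "/" + "/".join(comps[:length])
            if prefix in fib:
                return prefix, probes
        return None, probes

    def lpm_with_filter(name, fib, bf):
        """Filter-assisted: only lengths passing the filter hit the table."""
        comps, probes = components(name), 0
        for length in range(len(comps), 0, -1):
            prefix = "/" + "/".join(comps[:length])
            if prefix in bf:                          # cheap membership test
                probes += 1                           # false positives waste a probe
                if prefix in fib:
                    return prefix, probes
        return None, probes

    fib = {"/com/provider", "/com/provider/videos"}
    bf = BloomFilter()
    for p in fib:
        bf.add(p)
    name = "/com/provider/videos/movie/seg1"          # distance of 2 components
    print(lpm_no_filter(name, fib))                   # 3 table probes (distance + 1)
    print(lpm_with_filter(name, fib, bf))             # typically 1 table probe

In this toy run, the baseline pays one hash-table lookup per candidate length, while the filter-assisted variant pays one per length that passes the membership test; overloaded or false-positive-prone filter blocks add wasted probes, which is the effect discussed for the plain PBF variant.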
Sensitivity analysis: We now analyze the impact of different work-
load characteristics on Caesar’s performance. We start by varying
the average distance ∆ between the content prefixes in the FIB and
the requested content names. The parameter ∆ is key to properly
characterize a given workload, since it defines the complexity of
the LPM operation.
Figure 5(c) shows Caesar’s forwarding rate as ∆ grows from
0 (equivalent to exact matching) to 10 (highly adversarial work-
load); as usual, we distinguish between PBF-ideal, PBF, PBF-exp,
and NoPBF. Overall, the rate decreases as ∆ increases, which is
expected since the number of seed hash values increases linearly
with ∆. As ∆ increases, the performance gap between PBF-exp
and NoPBF increases too, i.e., when ∆ = 10, PBF-exp guarantees
a forwarding rate twice as fast as the NoPBF variant. Compared to
PBF, PBF-exp adds a penalty when ∆ is small, which is absorbed
as ∆ increases. This set of results suggests that PBF-exp is
robust to adversarial workloads and variable traffic patterns.
We now investigate the impact of the number of content pre-
fixes n in the FIB. Figure 5(d) plots the evolution of the forwarding
rate as n grows from 1 content prefix up to 10 million, as in the
reference workload. Overall, the forwarding rate follows a step
function, with a large drop in the forwarding rate for n > 1000,
i.e., from 8.3 to 6.6 Mpps. This phenomenon depends on the hier-
archical memory organization of the NPU. When n = 1, the only
content prefix quickly propagates from DRAM to the L1 cache of
every core. As the number of prefixes grows, the network processor
efficiently stores the prefixes in the L2 cache; after 1,000 prefixes,
the L2 cache is exhausted and most prefixes are fetched from the
off-chip DRAM, which causes the rate drop. Beyond the 1,000-prefix
threshold, the forwarding rate is almost constant: this indi-
cates that the number of prefixes that Caesar supports is limited by
the amount of off-chip DRAM. Therefore, with additional DRAM,
Caesar could support more content prefixes with little impact on
the forwarding rate. Such additional memory is largely available
in both edge and core routers. Implementing Caesar on such plat-
forms would allow storing one to two orders of magnitude more
prefixes, while still guaranteeing name-based forwarding at wire
speed. This is part of our future work.
6.4 Distributed Forwarding
This section evaluates the distributed forwarding extension used
by Caesar to allow very large FIBs without requiring additional
DRAM (cf. Section 5.1). We populate each input line card with a
disjoint set of 10 million content prefixes, 20 million in total, gener-
ated by modifying a few characters of the 10 million prefixes in
the reference workload. Since Caesar has a switch with a capacity
of 10 Gbps per line card, and distributed forwarding requires twice
the overall switching speed in the worst case (cf. Section 5.1), we
limit the traffic to 5 Gbps per line card and halve the minimum
packet size from 188 to 94 bytes.
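For illustration, the sketch below mimics how an ingress line card might decide whether to perform LPM locally or delegate a packet, assuming a hash-based partition of the name space across line cards. The actual dispatching rule used by Caesar is the one defined in Section 5.1, which this sketch does not reproduce; the partition rule and helper names here are hypothetical.

    # Hypothetical dispatching sketch for distributed forwarding: the
    # partition rule (CRC of the first name component modulo the number
    # of line cards) is an assumption, not Caesar's actual rule.
    import zlib

    NUM_LINE_CARDS = 2

    def owner_line_card(name):
        first = [c for c in name.split("/") if c][0]
        return zlib.crc32(first.encode()) % NUM_LINE_CARDS

    def handle(name, ingress_card):
        owner = owner_line_card(name)
        if owner == ingress_card:
            return f"LPM locally on card {ingress_card}"
        # A delegated packet crosses the switch a second time, which is
        # why the worst case (all packets delegated) needs twice the
        # switching capacity.
        return f"delegate to card {owner}, then switch to the egress card"

    print(handle("/com/provider/videos/movie/seg1", ingress_card=0))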
Figure 6(a) shows Caesar’s forwarding rate as a function of ρ,
the fraction of packets that require going to another line card for
name-based LPM. Overall, the forwarding rate slowly decreases
as ρ increases. In the worst-case scenario, ρ = 100%, and these
operations account for a drop of only 15% in the rate, i.e., from 13.2
to 11.5 Mpps for PBF-exp. This reduction of the forwarding rate
is due to the additional operations required by distributed forwarding,
namely packet dispatching and MAC address rewriting.
We also estimate the impact of distributed forwarding on packet
latency in the worst case, i.e., ρ = 100%. We find that distributed
forwarding increases Caesar’s average and minimum latency by 50%.
As previously discussed, the minimum and average latency mostly
derive from the switching latency, which doubles with distributed
forwarding.
[Figure 6 plots: (a) forwarding rate (Mpps) vs. percentage of delegated packets, for PBF-ideal, PBF, PBF-exp, and NoPBF; (b) throughput (Mpps) vs. number of prefixes n, for u = 4, 8, 16, 32; (c) throughput (Mpps) of ATA, MATA, MATA-NW, and GPU-C under the reference and adversarial FIB workloads.]
Figure 6: Evaluation of Caesar’s extensions. (a) Forwarding rate as a function of ρ in distributed forwarding. (b) Throughput as a function of FIB size n and maximum prefix length u in GPU offloading. (c) GPU offloading comparison with [9].
The maximum latency grows instead
by about 30%; this happens when the LPM latency overcomes the
switching latency, i.e., in the presence of large values of ∆. In any
case, the additional latency remains on the order of microseconds
and is thus tolerable even for delay-sensitive applications.
To summarize, distributed forwarding extends Caesar to support
twice as many content prefixes with a reduction of only 15% in the
forwarding rate and an additional delay of a few microseconds.
6.5 GPU Offloading
This section quantifies the speedup that GPU offloading provides
to Caesar’s forwarding rate. We assume a line card offloads a batch
of 8K content names to a GTX 580 GPU [19]. Such a batch ensures
high GPU occupancy and a maximum buffering delay of about
1 ms, assuming 188-byte packets and 10 Gbps. Larger packet sizes
or slower speeds, which both cause higher delay in forming a batch,
can be easily handled by Caesar without the GPU (Section 6.3).
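The buffering delay quoted above can be checked with a one-line calculation: the time to accumulate a batch of 8K names at 10 Gbps with minimum-size 188-byte packets. The snippet below is only a sanity check; the exact value depends on whether 8K denotes 8000 or 8192 names and on framing overhead, which it ignores.

    # Time to fill one GPU batch at line rate with minimum-size packets.
    BATCH_NAMES = 8 * 1024
    PKT_BYTES = 188
    LINE_RATE_BPS = 10e9

    delay_ms = BATCH_NAMES * PKT_BYTES * 8 / LINE_RATE_BPS * 1e3
    print(round(delay_ms, 2), "ms")   # ~1.2 ms, i.e., on the order of 1 ms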
We first measure how many packets per second the GPU can
match as a function of the number of content prefixes n in the FIB.
Figure 6(b) reports the throughput from the kernel execution
time only, i.e., we omit the transfer time between line card and GPU,
and vice versa, to be comparable with [9] and not limited by the
PCIe bandwidth problem discussed therein. We generate several
synthetic FIBs where the number of content prefixes grows expo-
nentially from 0.5 to 16 million, the maximum number of prefixes
that fits in the GPU device memory. We also vary the maximum
length u of the prefixes in the FIB between 4 and 32 components.
For each value of u, content prefixes in the FIB are equally dis-
tributed among the possible lengths, e.g., when u = 4, a quarter of
the prefixes have a single component. Finally, we assume that all
content names have 32 components, i.e., d = 32.
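As an aid to reproducing this experiment, the snippet below sketches one way to generate FIBs with the properties just described (n prefixes split evenly across lengths 1 to u, lookup names with d = 32 components). It is an illustrative reconstruction under those stated assumptions, not the authors' trace generator; the component alphabet and per-component length are arbitrary choices.

    # Illustrative generator for the synthetic GPU workload described above.
    import random, string

    def random_component(rng, length=8):
        return "".join(rng.choices(string.ascii_lowercase, k=length))

    def synthetic_fib(n, u, seed=0):
        rng, fib = random.Random(seed), set()
        while len(fib) < n:
            length = 1 + len(fib) % u        # spread prefixes evenly over 1..u
            fib.add("/" + "/".join(random_component(rng) for _ in range(length)))
        return fib

    def synthetic_name(rng, d=32):
        return "/" + "/".join(random_component(rng) for _ in range(d))

    fib = synthetic_fib(n=1000, u=4)
    print(sum(p.count("/") == 1 for p in fib))   # ~250, i.e., a quarter with 1 component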
Figure 6(b) shows two main results. First, the throughput is
mostly independent of the number of prefixes n; overall, grow-
ing the FIB size from 0.5 to 10 million prefixes causes less than a
10% throughput decrease. Second, the throughput largely depends
on u. For example, when n = 16 M, increasing u from 4 to 32
components reduces the throughput by 5x, from 150 to 30 Mpps.
We now compare our implementation with the work in [9], which
also explores the use of GPUs for name-based forwarding. Their
GPU code is open-sourced, which allows us to perform a fair com-
parison with our implementation. The key idea of the work in [9]
is to organize the FIB as a trie as done today for IP. They thus
introduce a character trie which allows name-based LPM. Then,
they introduce three optimizations, namely the aligned transition
array (ATA), the multi-ATA (MATA) and MATA with interweaved
name storage (MATA-NW), which leverage a combination of hash-
ing and the hierarchical nature of the content names to realize effi-
cient compression and lookup.
Figure 6(c) compares the performance of our kernel (GPU-C)
with the kernels proposed in [9], namely ATA, MATA and MATA-
NW, by running their code on our GPU. For this comparison, we
use the reference workload, where u ≤ 3, as well as a more adver-
sarial workload where u = 8. We refer to this adversarial workload
as the “adversarial FIB”.
Compared to the results presented in [9], we measure less than
half the throughput for ATA, MATA and MATA-NW. This is ex-
pected, since our GPU has half as many cores as the GTX 590 GPU
used in [9]. The figure also shows that the throughput measured
for Caesar, about 95 Mpps, matches the results from the synthetic
traces when u = 8 and n = 10 M, cf. Figure 6(b). MATA-NW is
slightly faster than Caesar, 100 versus 95 Mpps, assuming the refer-
ence workload. This happens because MATA-NW exploits the fact
that most of the content prefixes in the FIB are very short, e.g., 2 or
3 components, to reduce LPM to (mostly) an exact matching oper-
ation. Instead, our algorithm does not rely on such assumption; this
design choice makes it resilient to more diverse FIBs at the expense
of a performance loss with a simplistic FIB. Such feature is visible
in the presence of the adversarial FIB, where Caesar is twice as fast
as MATA-NW.
To summarize, GPU offloading increases Caesar’s forwarding rate
by an order of magnitude, with a small penalty in packet latency,
and our GPU-based LPM algorithm is resilient to adversarial traf-
fic workloads.
7. ADDITIONAL FEATURES
Name-based forwarding is the key task of a content router. Ad-
ditional features are caching, multicasting, and dynamic multipath
forwarding. To support these features, a Pending Interest Table
(PIT) and a Content Store (CS) are required. The PIT keeps track
of pending content requests, or “Interests” in the NDN terminology,
already forwarded by the content router. The CS stores a copy of
forwarded data packets to satisfy possible future requests.
The design, implementation, and evaluation of the PIT and CS are out
of the scope of this paper and left as future work. However, we
have recently started extending Caesar with both PIT and CS based
on a set of design guidelines derived in our previous work [20, 5].
In the following, we briefly summarize such integration.
In [20], we identify two challenges in PIT design: placement
and data structure. Placement refers to where in the content router
the PIT should reside. Data structure refers to how the PIT en-
tries should be stored and organized to enable efficient operations.
The paper concludes that the best approach is a third-party place-
ment leveraging the semantics of content names to select a line card
where PIT matching is performed. This idea fits well with the distributed
forwarding scheme used by Caesar, which we plan to piggyback on
for the PIT implementation. As the data structure, we use an open-
addressed hash table (cf. Section 3.4).
The CS consists of a packet store, where data packets are phys-
ically stored, and an index table, which keeps track of data packet
memory locations in the packet store. Similar to the PIT, the index
table is implemented as an open-addressed hash table. In addition
to pointers to data packets, the index table stores data statistics,
e.g., access frequency and timestamps, to enable replacement poli-
cies like FIFO or LRU. We implement the packet store as an ex-
tension of Caesar’s packet buffer in order to allow Caesar to store
data packets after forwarding as well as serve them when needed.
An eviction mechanism was also added to support the removal
of data packets according to the replacement policy. The CS is
physically allocated on the off-chip DRAM memory. Additional
levels of storage on lower-throughput, higher-capacity technologies
(e.g., SSD) can complement the packet store design; however, this
optimization is not supported by our current hardware setup.
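To illustrate the index-table logic described above, the sketch below shows a minimal content-store index with LRU replacement. It stands in for the open-addressed hash table and DRAM packet store that Caesar actually uses (a single ordered dictionary plays both roles here), and should be read as a sketch under those simplifying assumptions rather than the router's implementation.

    # Minimal content-store index sketch: name -> (packet-store location,
    # hit count), with LRU eviction. Purely illustrative.
    from collections import OrderedDict

    class ContentStoreIndex:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()      # access order doubles as LRU order

        def insert(self, name, location):
            if name in self.entries:
                self.entries.move_to_end(name)
                return
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # evict least recently used
            self.entries[name] = [location, 0]

        def lookup(self, name):
            entry = self.entries.get(name)
            if entry is None:
                return None                        # miss: forward the request
            entry[1] += 1                          # update access statistics
            self.entries.move_to_end(name)         # refresh LRU position
            return entry[0]                        # pointer into the packet store

    cs = ContentStoreIndex(capacity=2)
    cs.insert("/com/provider/videos/movie/seg1", location=0x1000)
    cs.insert("/com/provider/videos/movie/seg2", location=0x2000)
    cs.lookup("/com/provider/videos/movie/seg1")
    cs.insert("/com/provider/videos/movie/seg3", location=0x3000)  # evicts seg2
    print(list(cs.entries))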
8. CONCLUSION
Internet usage is currently centered on content distribu-
tion rather than the original host-to-host communication. Future In-
ternet architectures are thus expected to depart from a host-centric
design to a content-centric one. Such evolution requires routers to
operate on content names instead of IP addresses. A high burden is
expected on routers due to the explosion of the address space,
both in the number of content prefixes, which are hard to aggregate
compared to IP, and in their length, expected to be on the order of
tens of bytes as opposed to 32 or 128 bits for IPv4 and IPv6, re-
spectively. Our paper investigates the design and implementation
of Caesar, a content router capable of forwarding packets based
on names at wire speed. Caesar advances the state of the art in
many ways. First, it introduces the novel prefix Bloom filter (PBF)
data structure to allow efficient longest prefix matching operations
on content names. Second, it is fully compatible with current pro-
tocols and network equipment. Third, it supports packet processing
offload to external units, such as graphics processing units (GPUs),
and distributed forwarding, a mechanism which allows line cards
to share their FIBs with each other. Our experiments show that
Caesar sustains up to 10 Gbps input traffic per line card assuming a
minimum packet size of 188 bytes, and a FIB with 10 million con-
tent prefixes. We also show that the two proposed extensions allow
Caesar to support both a larger FIB and higher forwarding speed,
with a small penalty in packet latency.
ACKNOWLEDGMENTS
This work has been partially carried out at the Laboratory of Infor-
mation, Networking, and Computer Science (LINCS), and results
have been partially produced in the framework of the common re-
search laboratory between INRIA and Bell Labs, Alcatel-Lucent.
9. REFERENCES
[1] “Akamai,” http://www.akamai.com/.
[2] “Amazon Elastic Compute Cloud (Amazon EC2),”
http://aws.amazon.com/ec2/.
[3] G. Carofiglio, G. Morabito, L. Muscariello, I. Solis, and
M. Varvello, “From Content Delivery Today to Information
Centric Networking,” Computer Networks, 2013.
[4] V. Jacobson, D. K. Smetters, J. D. Thornton, M. F. Plass,
N. H. Briggs, and R. L. Braynard, “Networking Named
Content,” in Proc. ACM CoNEXT, Rome, Italy, Dec. 2009.
[5] D. Perino and M. Varvello, “A Reality Check for Content
Centric Networking,” in Proc. ACM ICN, Toronto, Canada,
Aug. 2011.
[6] W. So, A. Narayanan, and D. Oran, “Named Data
Networking on a Router: Fast and DoS-Resistant Forwarding
with Hash Tables,” in Proc. IEEE/ACM ANCS, San Jose,
California, USA, Oct. 2013.
[7] P. Gupta, S. Lin, and N. McKeown, “Routing Lookups in
Hardware at Memory Access Speeds,” in Proc. IEEE
INFOCOM, San Francisco, CA, Mar. 1998.
[8] W. So, A. Narayanan, D. Oran, and M. Stapp, “Named Data
Networking on a Router: Forwarding at 20Gbps and
Beyond,” in Proc. ACM SIGCOMM (demo), Hong Kong,
China, Aug. 2013.
[9] Y. Wang, Y. Zu, T. Zhang, K. Peng, Q. Dong, B. Liu,
W. Meng, H. Dai, X. Tian, Z. Xu, H. Wu, and D. Yang,
“Wire Speed Name Lookup: a GPU-Based Approach,” in
Proc. NSDI, Lombard, IL, Apr. 2013.
[10] H. Yuan, T. Song, and P. Crowley, “Scalable NDN
forwarding: Concepts, Issues and Principles,” in Proc.
ICCCN, Munich, Germany, Jul. 2012.
[11] Y. Wang, K. He, H. Dai, W. Meng, J. Jiang, B. Liu, and
Y. Chen, “Scalable Name Lookup in NDN Using Effective
Name Component Encoding,” in Proc. ICDCS, Macau,
China, Jun. 2012.
[12] Y. Wang, T. Pan, Z. Mi, H. Dai, X. Guo, T. Zhang, B. Liu,
and Q. Dong, “NameFilter: Achieving Fast Name Lookup
with Low Memory Cost via Applying Two-Stage Bloom
Filters,” in Proc. IEEE INFOCOM, Turin, Italy, Apr. 2013.
[13] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor,
“Longest Prefix Matching Using Bloom Filters,” in Proc.
ACM SIGCOMM, Karlsruhe, Germany, Aug. 2003.
[14] H. Song, F. Hao, M. S. Kodialam, and T. V. Lakshman, “IPv6
Lookups using Distributed and Load Balanced Bloom Filters
for 100Gbps Core Router Line Cards,” in Proc. IEEE
INFOCOM, Rio de Janeiro, Brazil, Apr. 2009.
[15] “Information-Centric Networking Research Group (ICNRG),”
http://irtf.org/icnrg.
[16] A. K. M. M. Hoque, S. O. Amin, A. Alyyan, B. Zhang,
L. Zhang, and L. Wang, “NLSR: Named-data Link State
Routing Protocol,” in Proc. ACM ICN, Hong Kong, China,
Aug. 2013.
[17] S. Iyer and N. W. McKeown, “Analysis of the Parallel Packet
Switch Architecture,” IEEE/ACM Transactions on
Networking, vol. 11, no. 2, pp. 314–324, Apr. 2003.
[18] “Alcatel 7950,”
http://www.alcatel-lucent.com/products/
7950-extensible-routing-system.
[19] “NVIDIA GeForce GTX 580,”
http://geforce.com/hardware/desktop-gpus/geforce-gtx-580/.
[20] M. Varvello, D. Perino, and L. Linguaglossa, “On the Design
and Implementation of a Wire-Speed Pending Interest
Table,” in Proc. IEEE NOMEN, Turin, Italy, Apr. 2013.