Vectored-Bloom Filter for IP Address Lookup: Algorithm and Hardware Architectures
Hayoung Byun, Qingling Li and Hyesook Lim *
Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul 03760, Korea;
hayoung77@ewhain.net (H.B.); liqingling05@gmail.com (Q.L.)
*Correspondence: hlim@ewha.ac.kr; Tel.: +82-2-3277-3403
Received: 28 September 2019; Accepted: 25 October 2019; Published: 30 October 2019


Abstract: The Internet Protocol (IP) address lookup is one of the most challenging tasks for Internet routers, since it requires performing packet forwarding at wire speed for tens of millions of incoming packets per second. Efficient IP address lookup algorithms have been widely studied to satisfy this requirement. Among them, the Bloom filter-based approach is attractive because it provides high performance. This paper proposes a high-speed and flexible architecture based on a vectored-Bloom filter (VBF), which is a space-efficient data structure that can be stored in a fast on-chip memory. An off-chip hash table is infrequently accessed, only when the VBF fails to provide address lookup results. The proposed architecture has been evaluated through both a behavioral simulation in C and a timing simulation in Verilog. The hardware implementation result shows that the proposed architecture can achieve a throughput of 5 million packets per second in a field programmable gate array (FPGA) operated at 100 MHz.
Keywords: Bloom filter; IP address lookup; vectored-Bloom filter; FPGA; hardware accelerator
1. Introduction
The global IP traffic forecast provided by Cisco Systems reported that the annual run rate for global IP traffic was 1.5 zettabytes (ZB) per year (122 exabytes per month) in 2017 and that global IP traffic will reach 4.8 ZB per year by 2022, a nearly threefold increase over five years. The rapid growth of traffic has made packet forwarding in routers a bottleneck in constructing high-performance networks.
An IP address is composed of a network part and a host part. The network part indicates a group of hosts included in a network, while the host part indicates a specific host [1]. The network part is called a prefix, and hosts connected to the same network have the same prefix. In a class-based addressing scheme, the prefix length is fixed at 8, 16, or 24 bits, and routers perform an exact match operation for an IP address lookup. However, because of the excessive address waste caused by the inflexibility of the prefix lengths under the class-based addressing scheme, a new addressing scheme called classless inter-domain routing (CIDR) has been introduced. In the CIDR scheme, arbitrary prefix lengths are allowed, and routers identify the longest prefix among all matching prefixes as the best matching prefix (BMP) for an IP address lookup [2–4].
Various IP address lookup algorithms have been researched, including trie-based [2,3], hash table-based [5], and Bloom filter-based algorithms [6,7]. Since an access to an off-chip memory is 10–20 times slower than an access to an on-chip memory [8], reducing the number of off-chip memory accesses required for looking up an IP address is the most effective strategy [6,7]. While the trie and the hash table are generally stored in off-chip memories due to their sizes, a Bloom filter is an efficient structure that can be stored in an on-chip memory.
In building a packet forwarding engine, various hardware components, such as application-specific integrated circuits (ASIC), ternary content addressable memory (TCAM), and field-programmable gate arrays (FPGA), are used to satisfy the wire-speed packet forwarding requirement. As a flexible and programmable device, the FPGA has a matrix of configurable logic blocks connected through programmable interconnects. In particular, FPGAs have been widely used to build prototypes of packet forwarding engines [9–13] and intrusion detection systems [14–16].
The contribution of this paper is as follows. We propose a vectored-Bloom filter (VBF) architecture, which is a multi-bit vector Bloom filter designed for IP address lookup. The VBF is proposed to obtain lookup results by accessing only an on-chip memory. An off-chip hash table is rarely accessed, only when the VBF fails to provide the results. We have evaluated our proposed architecture in two steps. The construction procedures of the VBF and the hash table are implemented in C at the behavioral level, since the construction procedure does not have to be performed in real time. The performance at the behavioral level has been evaluated in terms of the on-chip memory requirement, the off-chip memory requirement, the average and worst-case numbers of memory accesses, the indeterminable rate, and the false port return rate. The search procedure, which should be performed in real time, is implemented in Verilog on a single FPGA. The performance of the FPGA has been evaluated in terms of block-RAM (BRAM) requirement, resource utilization, and throughput.
The remainder of this paper is organized as follows. Section 2 briefly explains the Bloom filter and previous IP address lookup algorithms utilizing on-chip memories. Section 3 describes our proposed IP address lookup algorithm using a vectored-Bloom filter, including a theoretical analysis of the search failure probability of the proposed algorithm. Section 4 describes the hardware architecture implemented on an FPGA. Section 5 shows behavioral simulation results comparing the performance of the proposed VBF with other structures. Section 6 shows the hardware implementation details of our proposed structure. Finally, Section 7 concludes the paper.
2. Related Works
2.1. Bloom Filter
A Bloom filter [17] is a multi-bit probabilistic data structure used for membership querying, that is, to determine whether an input is an element of a given set. Bloom filters have been applied to many network applications due to their space-efficient attributes [18–21] and hardware-friendly features [22–26].
A Bloom filter has two operations: programming and querying. In programming, the membership information of each element in a given set is stored using a number of hash indexes, which are obtained by entering each element as the input of a hash function. In querying, the membership of each input is checked using the same hash function as used in programming [27]. Even though a Bloom filter requires multiple hash indexes for programming and querying, the hash indexes can be easily obtained using a single hash function and a few simple operations [28,29].
In programming set S = {x1, x2, ..., xn}, every bit of the m-bit Bloom filter is initially 0, and k different hash indexes are used to map each element to k locations of the m-bit array. Let n be the number of elements in the programming set. The optimal number of hash indexes is defined as follows.

$$k = \frac{m}{n}\ln 2 \qquad (1)$$
In order to program an element in the set, the k bits pointed by the k hash indexes obtained from the element are all set to 1. If a specific bit location in the m-bit array is already 1, it is not changed.
A Bloom filter indicates whether or not an input is a member of the programmed set by querying. In querying input y, the k hash indexes obtained by the same procedure as in programming are used. The querying of the Bloom filter produces two types of results: negative or positive. If any of the k bit locations is 0, y is definitively not a member of set S, and the result is termed a negative. If all of the k bit locations are 1, y is considered a member of set S, and the result is termed a positive. However, a Bloom filter can produce a false positive owing to hash collisions even if y is not in S. For n elements, the false positive rate f of an m-bit Bloom filter is obtained as follows [27].

$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} \qquad (2)$$
As shown, the false positive rate can be reduced by increasing the size of the Bloom filter but cannot be completely eliminated. Larger Bloom filters require a larger k according to Equation (1), resulting in more overhead. However, since the on-chip processing overhead for a Bloom filter stored in an on-chip memory is much smaller than the off-chip processing overhead for accessing off-chip memories, this paper aims to reduce the number of off-chip memory accesses by utilizing an on-chip Bloom filter.
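
As a concrete illustration of Equations (1) and (2), the following C snippet, a minimal sketch with example parameter values of our own choosing, computes the optimal number of hash indexes and the resulting false positive rate for a filter of m bits programmed with n elements.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double m = 16384.0;  /* bits in the Bloom filter (example value) */
        double n = 1000.0;   /* number of programmed elements (example)  */

        /* Equation (1): optimal number of hash indexes, k = (m/n) ln 2 */
        double k = round((m / n) * log(2.0));

        /* Equation (2): false positive rate, f ~ (1 - e^{-kn/m})^k */
        double f = pow(1.0 - exp(-k * n / m), k);

        printf("optimal k = %.0f, false positive rate = %.3e\n", k, f);
        return 0;
    }

For these example values, k is 11 and f is on the order of 10^-4, which illustrates why enlarging the on-chip filter is an effective way to suppress off-chip accesses.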
2.2. IP Address Lookup Algorithms Utilizing On-Chip Memories
In the parallel Bloom filter (PBF) architecture proposed by Dharmapurikar et al. [6], W Bloom filters associated with each prefix length are maintained, where W is the number of distinct prefix lengths. In programming, prefixes with the same length are programmed into the same Bloom filter. In the search procedure, for an input IP address, all of the Bloom filters stored in on-chip memories are first queried in parallel. For the lengths where positive results are returned by the Bloom filters, an off-chip hash table is sequentially accessed, starting from the longest length, to find the output port of the best matching prefix.
In the Bloom filter chaining (BF-chaining) architecture proposed by Mun et al. [20], a binary
trie is primarily constructed. Every node in the binary trie is programmed into an on-chip Bloom
filter, while prefixes are only programmed in an off-chip hash table. For an input IP address, the
Bloom filter is sequentially queried from the root node until a negative result is produced. Since a
node cannot exist without an ancestor node, the trie level (prefix length) of the last positive would be the length of the longest matching node for the input (if the positive is a true positive). The off-chip hash table is accessed for this length. If the positive is a false positive or if the node is an empty node that does not store a prefix, back-tracking should occur, which accesses the hash table at a shorter length. The back-tracking caused by a false positive cannot be avoided, but the back-tracking caused by an empty node can be avoided by pre-computation, which makes each empty node store the output port of its direct ancestor prefix.
An IP address lookup is achieved by a single off-chip memory access if the Bloom filter positive is true.
However, the PBF and BF-chaining architectures require at least one off-chip memory access, because the on-chip Bloom filter does not return the output port of the longest matching prefix. Yang et al. [30] proposed a splitting approach for IP address lookup (SAIL). The SAIL architecture splits the problem along prefix lengths and focuses on speeding up IP address lookups with short lengths using on-chip bitmaps. The SAIL also requires at least one off-chip access, because the on-chip bitmap does not return the matching output port.
3. IP Address Lookup Using a Vectored-Bloom Filter
The basic idea of the proposed vectored-Bloom filter (VBF) structure for IP address lookup was briefly introduced in References [31] and [32]. The VBF consists of m multi-bit vectors, and each vector comprised of l bits contains an output port. The proposed structure completes an IP address lookup by querying only the on-chip VBF, without accessing the off-chip hash table. Depending on how output ports are programmed into the VBF, two different structures are possible [31]. The first structure involves making the l-bit vector represent up to 2^l − 2 output ports, zero, and a conflict value, which is the value 2^l − 1. The conflict indicates that the vector has been programmed by two or more different output ports.
In programming a prefix, the k vectors pointed by the k hash indexes of the prefix are written with the output port of the prefix. If any of the vectors already holds a different output port, all of the bits in the vector are set to 1 in order to represent the conflict (2^l − 1). In querying an input, if all the k vectors located by the k hash indexes of the input have the conflict value, the VBF cannot return an output port, and the result is termed an indeterminable. If any of the k vectors is 0, the input is not a prefix, and hence the result is termed a negative. If the vectors other than conflicts all have the same value, the input is a prefix, the value is the output port of the prefix, and hence the result is termed a positive.
The second structure involves making the l-bit vector represent l output ports by assigning each bit in a vector to an output port. Hence, two or more different output ports can be represented in a vector by setting the corresponding bits, and a conflict value does not need to be defined.
The motivation of the proposed VBF structure is to complete the IP address lookup by only searching the VBF implemented with an on-chip memory, without accessing a hash table implemented with an off-chip memory. The hash table is infrequently accessed, only when the VBF fails to provide the output port. The VBF structure is shown in Figure 1. Assuming that output ports are uniformly distributed, for n given prefixes, the optimal number of hash functions for the VBF is defined as follows, similarly to Equation (1),

$$k_v = \frac{ml}{n}\ln 2 \qquad (3)$$

since a VBF can be considered as l m-bit Bloom filters, each of which is programmed by n/l prefixes.
In this paper, we describe the second structure in detail, including the hardware architecture, which was not described in Reference [31]. Hence, Section 4 is a completely new section compared with Reference [31]. In Reference [32], even though the hardware architecture of the VBF implemented on the FPGA is briefly described, the proposed structure was neither evaluated through a behavioral simulation nor compared with other algorithms. In this paper, we describe our algorithm and the proposed hardware structures in detail, and the proposed architecture has been evaluated through both a behavioral simulation in C and a timing simulation in Verilog. In addition, we compare our algorithm with other algorithms at a large scale. Hence, Section 5 is a completely new section, and Section 6 is largely extended compared with Reference [32].
Figure 1. Overall structure of vectored Bloom filter (VBF) algorithm.
3.1. VBF Programming
Algorithm 1 describes the construction procedure of the VBF. All of the prefixes in a routing set are programmed into the VBF and also stored in a hash table. For prefix x in a routing set with output port x.port, kv hash indexes are first obtained. In order to program the output port into the kv vectors pointed by the kv indexes, the bit location corresponding to x.port is set to 1 in each vector.
Algorithm 1: Programming Procedure of VBF

    Function programVBF(x)
        for (i = 1 to kv) do
            BF[hi(x)][x.port − 1] = 1;
        end
    end
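
The following C sketch shows how Algorithm 1 could be realized in software; the array dimensions, the helper hash_index(), and the use of one byte per vector for l = 8 are illustrative assumptions of ours, not the paper's code.

    #include <stdint.h>

    #define M  2048   /* number of vectors in the VBF (example value)       */
    #define KV 11     /* number of hash indexes, per Equation (3) (example) */
    #define L  8      /* vector width = number of output ports              */

    static uint8_t vbf[M];   /* one l-bit vector per cell; l = 8 fits a byte */

    /* hypothetical helper: i-th hash index of a length-len prefix, e.g.,
       sliced from one CRC-64 code as described in Section 4.1 */
    extern uint32_t hash_index(uint32_t prefix, int len, int i);

    /* Algorithm 1: set the bit for 'port' in each of the kv hashed vectors */
    void program_vbf(uint32_t prefix, int len, int port /* 1..L */) {
        for (int i = 0; i < KV; i++)
            vbf[hash_index(prefix, len, i) % M] |= (uint8_t)(1u << (port - 1));
    }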
3.2. VBF Querying
Algorithm 2 describes the querying procedure of the VBF in detail. For a given input key, kv indexes are first obtained, and a bit-wise AND operation is then performed on the kv vectors read from the VBF in order to obtain an l-bit result vector, check. The querying has three possible results: negative, positive, or indeterminable. If all of the bits in check are 0, the input was definitively not programmed into the VBF, and the result is termed a negative. If check has a single set bit, the input is considered a member of the routing set, and the location of the set bit is returned as the matching output port; the result is then termed a positive. If check has plural set bits, the VBF cannot return a value, and the result is termed an indeterminable. In Algorithm 2, outPort is the matching output port, and counter is the number of set bits in check. Therefore, if counter is 1, the result is a positive, and outPort is returned as the matching port. If counter is larger than 1, the result is an indeterminable, and a matching port cannot be returned from the VBF.
Algorithm 2: Querying Procedure of VBF

    Function queryVBF(y)
        check = BF[h0(y)] & BF[h1(y)] & ... & BF[hkv−1(y)];
        counter = 0;
        for (i = 0 to l − 1) do
            if (check[i] == true) then
                outPort = i + 1;
                counter = counter + 1;
                if (counter > 1) then
                    break;
                end
            end
        end
        if (counter == 0) then
            return 0; // negative
        else if (counter == 1) then
            return outPort; // positive
        else
            return −1; // indeterminable
        end
    end
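
Continuing the sketch above (and reusing its vbf array, constants, and the hypothetical hash_index() helper), Algorithm 2 translates to the following C function; the return convention, 0 for negative, -1 for indeterminable, and the port number otherwise, mirrors the pseudocode.

    /* Algorithm 2: AND the kv hashed vectors, then count the set bits */
    int query_vbf(uint32_t key, int len) {
        uint8_t check = 0xFF;                /* neutral element for bitwise AND */
        for (int i = 0; i < KV; i++)
            check &= vbf[hash_index(key, len, i) % M];

        int out_port = 0, counter = 0;
        for (int b = 0; b < L; b++) {
            if (check & (1u << b)) {
                out_port = b + 1;
                if (++counter > 1) break;    /* a second set bit: indeterminable */
            }
        }
        if (counter == 0) return 0;          /* negative       */
        if (counter == 1) return out_port;   /* positive       */
        return -1;                           /* indeterminable */
    }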
3.3. Hash Table Implementation
In implementing the hash table, hash collisions should be carefully considered. In order to reduce the number of collisions, we use two hash indexes and a linked list to store a prefix into the hash table. In other words, there are up to three candidate hash entries for storing a prefix. The first hash index has the highest priority. If both entries indicated by the two hash indexes have already been filled by other prefixes, the prefix is stored in the entry indicated by the linked list. Each hash table entry stores a prefix, the length of the prefix, a corresponding output port, and a linked list.
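
As an illustration, one possible C layout of a hash entry under these choices is shown below; the struct and its field names are ours, the bit widths in the comments follow Table 1, and a software struct naturally pads each field to whole bytes, whereas the hardware packs them.

    #include <stdint.h>

    /* One hash table entry: prefix, prefix length, output port, and a
       linked-list index to an overflow entry (0 can mean "no successor"). */
    typedef struct {
        uint32_t prefix;   /* IPv4 prefix: 32 bits                */
        uint8_t  length;   /* prefix length: 5 bits in Table 1    */
        uint8_t  port;     /* output port: 3 bits for l = 8 ports */
        uint16_t next;     /* linked list: ceil(log2 N) + 1 bits  */
    } ht_entry_t;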
3.4. IP Address Lookup Using VBF
Algorithm 3 describes the IP address lookup procedure using the VBF in detail. For a given input IP address, the VBF is queried with gradually decreasing lengths, starting from the longest prefix length.
If the VBF returns a negative at the current length, the querying is continued at a shorter length (if
the current length is not already the shortest).
If the result of the VBF is a positive, the VBF returns a matching output port (BMPport), and the IP address lookup procedure is completed. Note that the VBF can generate a false positive for a substring of the given input even though the substring is not included in the routing set, and hence a false output port can be returned for the input. As will be shown in the simulation, the false port return rate converges to zero when the sizing factor of the VBF is larger than two.
If the VBF returns an indeterminable, the off-chip hash table should be accessed because the
matching output port at the current length of the given input cannot be determined. In searching the
hash table, two hash indexes are used for multi-hashing. If the matching entry is found from the entry
pointed by the first hash index, the search is complete. If the matching entry is not found, the procedure
continues to search the hash entry pointed by the second hash index. If the matching entry is not found,
the procedure continues to search the hash entry pointed by the linked list. The search procedure is
completed if the matching entry is found. Otherwise, the search procedure should go back to the VBF.
Algorithm 3: IP Address Lookup Procedure Using VBF

    Function Search(DstAddr)
        for (length = longestLen to shortestLen) do
            result = queryVBF(DstAddr, length);
            if (result == 0) then
                continue; // negative
            else if (result != −1) then
                BMPport = result; // positive
                break;
            else
                // indeterminable
                BMPport = searchHT(DstAddr, length);
                if (BMPport != NULL) then
                    break; // no more Bloom filter access
                end
            end
        end
        return BMPport;
    end
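
Putting the pieces together, a behavioral C version of Algorithm 3 might look as follows; it reuses query_vbf() from the earlier sketch, search_ht() is a hypothetical stand-in for the two-index multi-hash lookup of Section 3.3, and the substring is assumed to be taken from the most significant bits of the address.

    /* hypothetical off-chip lookup; returns the port, or 0 if no match */
    extern int search_ht(uint32_t prefix, int len);

    int lookup(uint32_t addr, int longest_len, int shortest_len) {
        for (int len = longest_len; len >= shortest_len; len--) {
            /* keep only the len most significant bits of the address */
            uint32_t sub = (len == 32) ? addr : (addr >> (32 - len)) << (32 - len);
            int r = query_vbf(sub, len);
            if (r == 0) continue;            /* negative: try a shorter length */
            if (r > 0)  return r;            /* positive: matching output port */
            int port = search_ht(sub, len);  /* indeterminable: go off-chip    */
            if (port != 0) return port;      /* matched in the hash table      */
            /* otherwise fall through: back to the VBF with a shorter length */
        }
        return 0;                            /* no matching prefix */
    }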
3.5. Theoretical Analysis on Search Failure Probability
In this section, we present the theoretical probability of search failure of the proposed VBF. A search failure in the VBF occurs in two cases: false port return and indeterminable. The theoretical probability of the search failure for the first architecture of the VBF (summarized in the first paragraph of Section 3) has been provided in Reference [21], and this section provides the theoretical search failure probability for the second structure of the VBF. The VBF consists of m multi-bit vectors, and each vector comprised of l bits represents l different ports. For n elements included in programming set S, assuming that the elements are equally distributed over the ports, the number of elements in each port set, n′, is equal to n/l.
In querying, while false positives can occur only for non-programmed inputs, indeterminables can occur for all inputs, including programmed inputs. If the VBF returns a single value for a non-programmed input, it is a false positive, which means that one port is returned among the l values. If the VBF returns two or more values, it is an indeterminable, which means that more than one port is returned among the l ports in one querying.

Let p represent the probability that a specific bit in a vector is set at least once by the n′ elements, after programming all elements with k hash functions. If the hash functions are assumed to be perfectly random, p can be calculated as

$$p = 1 - \left(1 - \frac{1}{m}\right)^{kn'}. \qquad (4)$$
When querying an input x not included in set S, a false positive occurs if the same bit location in each of the k vectors is 1 and the other bit locations in each of the k vectors are 0. Hence, the false port return probability, P(F), is as follows.

$$P(F) = P(S^c)\,P(F \mid S^c) = P(S^c) \cdot l \cdot p^{k} \cdot \left(1 - p^{k}\right)^{l-1} \qquad (5)$$
An indeterminable occurs if more than one port is returned, regardless of whether they are true or false ports. The indeterminable probability, P(I), is defined as follows.

$$P(I) = P(S)\,P(I \mid S) + P(S^c)\,P(I \mid S^c) \qquad (6)$$

P(I|S) is the indeterminable probability for the inputs included in S. For inputs included in S, since one true port must occur, an indeterminable occurs if the number of false ports is from 1 to l − 1. Therefore, P(I|S) is as follows.

$$P(I \mid S) = 1 - \left(1 - p^{k}\right)^{l-1} \qquad (7)$$

P(I|S^c) is the indeterminable probability for the inputs not included in S. For inputs not included in S, an indeterminable occurs if the number of false ports is from 2 to l. Therefore, P(I|S^c) is as follows.

$$P(I \mid S^c) = 1 - \left(1 - p^{k}\right)^{l} - P(F \mid S^c) \qquad (8)$$

From Equations (7) and (8), the indeterminable probability is

$$P(I) = P(S)\left[1 - \left(1 - p^{k}\right)^{l-1}\right] + P(S^c)\left[1 - \left(1 - p^{k}\right)^{l} - l \cdot p^{k} \cdot \left(1 - p^{k}\right)^{l-1}\right]. \qquad (9)$$
From Equation (4), since p is a function of the size of the Bloom filter, m, the false port return probability in Equation (5) and the indeterminable probability in Equation (9) can be controlled by the size of the Bloom filter.
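
The probabilities above are straightforward to evaluate numerically. The short C snippet below is a sketch; the parameter values and the example prior P(S), with P(S^c) = 1 − P(S), are our own choices, plugged into Equations (4), (5), and (9).

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double m = 2048, n = 1000, l = 8, k = 11;  /* example parameters         */
        double n1 = n / l;                         /* n' = n/l elements per port */
        double ps = 0.3;                           /* example P(S)               */

        double p  = 1.0 - pow(1.0 - 1.0 / m, k * n1);            /* Equation (4) */
        double pk = pow(p, k);
        double pf = (1.0 - ps) * l * pk * pow(1.0 - pk, l - 1);  /* Equation (5) */
        double pi = ps * (1.0 - pow(1.0 - pk, l - 1))            /* Equation (9) */
                  + (1.0 - ps) * (1.0 - pow(1.0 - pk, l)
                                  - l * pk * pow(1.0 - pk, l - 1));

        printf("p = %.4f, P(F) = %.3e, P(I) = %.3e\n", p, pf, pi);
        return 0;
    }

Doubling m while holding the other parameters fixed drives p, and with it both failure probabilities, down sharply, which matches the trend observed in the simulations of Section 5.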
4. Hardware Architecture
The construction procedures of the VBF and the hash table are implemented at the behavioral level in C, since they do not have to be performed in real time. The IP address lookup procedure described in Algorithms 2 and 3 is implemented in Verilog on a single FPGA. In this section, we describe the hardware architecture of the VBF implemented on the FPGA.
4.1. Basic IP Address Lookup Module
Figure 2 shows the block diagram of our proposed IP address lookup module in its basic form. The basic IP address lookup module contains a hash index generator, a VBF search block, and a hash table search block. The hash index generator is realized with a 64-bit cyclic redundancy check (CRC-64) generator. The CRC-64 is implemented with a shift register and a few exclusive-OR gates, which is much simpler than MD5 or SHA-256; hence, the CRC-64 is used in our implementation.
Figure 2. Block diagram of the basic IP address lookup module: the bold lines represent data paths and the dotted line represents a control path.
Figure 3 shows the state machine representing the IP address lookup procedure in detail. When the reset signal is given at state Start, the state machine is ready to accept an input address. When an IP address is given to the module, the lookup procedure starts from the longest prefix length and iterates by decreasing the length until a matching prefix is found. At state CRC Gen, for a substring of the input IP address, the CRC code is first obtained when the corresponding number of input bits are serially entered into the CRC generator, starting from the most significant bit. Multiple hash indexes are obtained by combining multiple bits of the CRC code. In our implementation, to obtain the CRC code within a single cycle, the operations required in the 64-bit CRC generator are implemented in parallel.
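
For instance, with m = 2048 each hash index needs 11 bits, so a 64-bit CRC code can directly supply five non-overlapping 11-bit indexes, and rotated slices can supply more. The sketch below illustrates the bit-slicing idea; the particular rotation scheme is an assumption of ours, as the paper only states that indexes are obtained by combining bits of the CRC code.

    #include <stdint.h>

    #define IDX_BITS 11   /* bits per index, for m = 2048 cells (example) */

    /* slice the i-th hash index out of a 64-bit CRC code; each i selects
       a differently rotated bit field of the code */
    uint32_t index_from_crc(uint64_t crc, int i) {
        unsigned s = (5u * (unsigned)i) & 63u;                  /* rotation amount */
        uint64_t rot = (crc >> s) | (crc << ((64u - s) & 63u));
        return (uint32_t)(rot & ((1u << IDX_BITS) - 1u));
    }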
Figure 3. State machine for Internet Protocol (IP) address lookup procedure.
The vectors of the VBF pointed by the hash indexes are bitwise-ANDed to obtain a result vector at state Query VBF. The vector can produce three different results: negative, positive, or indeterminable. The negative result means no matching output port at the current length; the search procedure goes on to state No match and then continues to query the VBF with a shorter length. The positive result means a matching output port from the VBF, and hence the search is completed for the current input; in this case, the search procedure goes to state Input ready for the next input. The indeterminable result means that an output port cannot be returned from the VBF, and hence the search procedure needs to proceed to the hash table.
At state 1st HT entry, the hash entry pointed by the first hash index is accessed. If the entry does not have a matching prefix, another entry pointed by the second hash index is accessed at state 2nd HT entry. If the entry pointed by the second hash index does not have a matching prefix, the entry pointed by the linked list of the second entry is accessed at state Linked list. If the output port is not found in any of these hash entries, a matching prefix does not exist at the current length, and the search procedure goes back to state No match and continues to query the VBF with a shorter length. If a matching prefix is found in any of these entries, the search is completed for the current input, and hence the search procedure goes to state Input ready for the next input.
In terms of time complexity, since the query of the vectored-Bloom filter performs a linear search over the number of bits in an IP address, the on-chip search performance is O(W), where W is the length of the IP address. However, the off-chip hash table access, which mainly determines the IP address lookup performance, does not occur for a sufficiently large VBF.
4.2. Parallel Architecture
The IP address lookup performance using the VBF can be effectively improved by applying parallelism. Figure 4 describes the parallel architecture using two basic IP address lookup modules. Two new blocks are implemented for this parallel architecture: a Distributer and an Output Queue. The Distributer provides inputs to each module when the module is ready for a new input, while the Output Queue provides the output ports according to the order of the arrived inputs.
Figure 4. Parallel architecture consisting of two basic IP address lookup modules: the bold lines represent data paths and the dotted lines represent control paths.
4.3. Parallel Architecture with a Single Hash Table
Since the parallel architecture has multiple copies of the basic IP address lookup module, the block-RAM (BRAM) utilization rapidly increases when the number of modules is increased. The BRAM thus becomes the bottleneck in increasing the degree of parallelism. Since the hash table is infrequently accessed in our proposed IP address lookup architecture, multiple VBF blocks can share one hash table. In other words, we can separate the hash table from the basic IP address lookup module and make multiple VBF search blocks share a single hash table.
Figure 5 shows the parallel architecture with a single hash table, in which two VBF search blocks share a single hash table. The hash table search block is now composed of a Hash Table Queue and the hash table.
Figure 5. Parallel architecture employing a single hash table: the bold lines represent data paths and the dotted lines represent control paths.
The Hash Table Queue is a waiting place before accessing the hash table, to handle the case where two or more input addresses need to access the hash table at the same time. If the output port for an input is not determined by the VBF search block, the index of the input and the two hash indexes for the hash table are stored in the Hash Table Queue. When the input arrives at the front of the queue, the stored indexes for that input are used to access the shared hash table. The input is removed from the queue after accessing the hash table.
5. Behavior Simulation
Performance evaluation was carried out using routing sets downloaded from backbone routers [33], at the behavioral level in C and at the hardware level in Verilog on an FPGA. We have created four routing sets, and the number of prefixes (N) in each set is 1000, 5000, 14,553, and 30,000 (called 1 k, 5 k, 14 k, and 30 k, respectively). Note that the number of prefixes included in actual backbone routers can be several hundred thousand. The simulation results of our proposed algorithm for large routing sets were already shown in our previous paper [31]. This paper focuses on verifying the feasibility of the hardware implementation of our proposed architecture using an FPGA. Since the number of prefixes that can be handled by the FPGA is limited by the size of the BRAM, we have used sets with a small number of prefixes in this paper.
The number of inputs to test is three times the number of prefixes in each routing set. Assuming
that the number of output ports is eight, 8 bits are allocated for each vector of the vectored-Bloom filter.
5.1. Performance of the Proposed Structure
Table 1 shows the data structures of the VBF and a hash table. Let N be the number of prefixes in each set. For the depth of the VBF, m = αN′, where N′ = 2^⌈log2 N⌉ and sizing factor α = 1, 2, and 4. The width of the VBF is the vector size (l), which is determined by the number of output ports. The depth of the hash table is B = 2N′, and the width of the hash table is the size of a hash entry. A single hash entry has four fields. For storing a prefix, 32 bits are allocated, assuming IPv4. The prefix length takes five bits, calculated by ⌈log2 32⌉. The output port uses three bits, calculated by ⌈log2 l⌉, because the number of output ports l is assumed to be 8. Since the number of entries in the hash table is 2 · 2^⌈log2 N⌉, the linked list has ⌈log2 N⌉ + 1 bits.
Table 1. Data structures.

                    VBF                          HT
    Depth           No. of vectors (m): αN′      No. of entries (B): 2N′
    Width (bits)    Vector size (l):             Entry size (E):
                      output port 8                prefix 32
                                                   length 5
                                                   output port 3
                                                   linked list ⌈log2 N⌉ + 1
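
As a numeric example of these definitions, consider the 14,553-prefix set: N′ = 2^⌈log2 14,553⌉ = 16,384, so with α = 2 the VBF has m = 32,768 vectors of l = 8 bits (32 KBytes), and the hash table has B = 2N′ = 32,768 entries, matching the corresponding rows of Table 2 below.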
Since the VBF querying procedure starts from the longest length existing in the set and stops
when a positive result is returned, the performance is affected by the distribution of the prefix lengths.
Figure 6 shows the distribution of prefixes according to their lengths in each set.
Table 2 shows the on-chip memory requirement (Mb) for a VBF and the off-chip memory requirement (Mh) for a hash table. It is shown that the VBF requires less memory than a hash table even when the sizing factor α is 4, since each vector of the VBF only represents an output port, while a hash entry stores a prefix, a length, an output port, and a linked list. The optimal number of hash functions for the VBF (kv) is shown as well.
Figure 6. Distribution of the number of prefixes according to prefix lengths.
Table 3 shows the indeterminable rate and the false port return rate according to the size of the VBF. The indeterminable rate (I) is defined as the number of inputs causing an indeterminable over the number of inputs. Since the output ports for these inputs are not identified by the VBF because of multiple set bits in the resulting vector, the search procedure should perform an access to the hash table. The false port return rate (F) is defined as the number of inputs receiving a false return over the number of inputs. False ports can be returned by the false positives of the VBF. Note that both the indeterminable rate and the false port return rate are zero when the sizing factor of the VBF is larger than two, as shown in Table 3.
Table 2. Memory requirement of the proposed structure.

    Routing Set (N)   α    kv    VBF Depth   VBF Width (bits)   Mb (KBytes)   HT Depth   HT Width (bits)   Mh (KBytes)
    1000              1     6        1024          8                 1
                      2    11        2048          8                 2           2048          51             12.75
                      4    22        4096          8                 4
    5000              1     6        8192          8                 8
                      2    11      16,384          8                16         16,384          54            108
                      4    22      32,768          8                32
    14,553            1     6      16,384          8                16
                      2    11      32,768          8                32         32,768          55            220
                      4    22      65,536          8                64
    30,000            1     6      32,768          8                32
                      2    11      65,536          8                64         65,536          56            448
                      4    22     131,072          8               128
Table 3. Indeterminable and false port return rates.

    Routing Set (N)   α     I       F
    1000              1   0.122   0.444
                      2   0.003   0.019
                      4   0.000   0.000
    5000              1   0.017   0.135
                      2   0.000   0.001
                      4   0.000   0.000
    14,553            1   0.067   0.424
                      2   0.002   0.013
                      4   0.000   0.000
    30,000            1   0.072   0.451
                      2   0.003   0.016
                      4   0.000   0.000
5.2. Performance Comparison with Other Structures
Table 4 shows the comparison of Bloom filter characteristics of the proposed VBF structure with other structures, such as BF-chaining with pre-computation [20] and the PBF [6]. For a fair comparison, the performances of the algorithms should be compared under the same amount of on-chip memory used for constructing each Bloom filter. Based on the memory amount of the VBF shown in Table 2, the sizing factor α and the number of hash indexes of the BF-chaining and the PBF are calculated for the same amount of memory. As sizing factor α increases, the number of off-chip hash table accesses decreases, because the number of search failures decreases. The VBF has a width of 8, but the number of elements for each output port is N/8, and hence the VBF is basically the same as a standard Bloom filter, as in the BF-chaining and PBF structures. Since all nodes of a binary trie are stored in the Bloom filter of the BF-chaining, the k of the BF-chaining is smaller than kv. Since the PBF requires up to W independent Bloom filters [6], where W is the number of valid prefix lengths, its space efficiency is degraded.
Table 4. Comparison of Bloom filter characteristics.

    Routing Set (N)   Mb (KB)   BF-Chaining       PBF             VBF
                                 α      k         α       k      α     kv
    1000                 1       1      1         6.30    4      1      6
                         2       2      1        12.60    9      2     11
                         4       4      3        25.20   17      4     22
    5000                 8       2      1        10.13    7      1      6
                        16       4      3        20.24   14      2     11
                        32       8      6        40.49   28      4     22
    14,553              16       1      1         5.11    4      1      6
                        32       2      1        10.22    7      2     11
                        64       4      3        20.44   14      4     22
    30,000              32       2      1         5.07    4      1      6
                        64       4      3        10.14    7      2     11
                       128       8      6        20.28   14      4     22
Table 5 shows the comparison of off-chip memory requirements. The hash table in the BF-chaining stores all of the nodes (T) of a binary trie, while the PBF and the VBF only store the prefixes (N) of each routing set. Thus, the off-chip memory requirement of the PBF and the VBF is much smaller than that of the BF-chaining.
Table 5. Comparison of off-chip memory requirement.

    Routing Set (N)   BF-Chaining               PBF        VBF
                       T          Mh (KB)    Mh (KB)    Mh (KB)
    1000               7678         108          13         13
    5000             29,534         448         108        108
    14,553           76,708        1856         220        220
    30,000          127,576        1856         448        448
Table 6 shows the comparison of the on-chip search performance in terms of the average and worst-case numbers of Bloom filter accesses, represented by Ab and Wb, respectively. Since Bloom filter querying is not performed for lengths that do not include any prefix, Wb is the number of valid prefix lengths for each set. Note that the Bloom filter querying of the BF-chaining proceeds from the shortest length, while the querying of the PBF and the VBF proceeds from the longest length. Thus, as the size of each Bloom filter (Mb) increases, the average number of Bloom filter accesses (Ab) of the BF-chaining decreases, but that of the VBF increases, because the number of negatives increases. The larger number of negatives results in a smaller number of hash table accesses in both structures. In the case of the PBF, for a fair comparison with the other structures, we assume that all of the BFs in the PBF structure are sequentially queried until the longest prefix is matched, as in the VBF. Hence, its Ab is constant regardless of the size of the Bloom filter, because the Bloom filter querying always stops at the length of the longest matching prefix. In other words, starting from the longest length, the Bloom filter querying always continues until a true positive occurs, and hence the average number of Bloom filter queries is not related to the size of the Bloom filter. The Ab of the VBF for the 5 k set is greater than that for the other sets, because the 5 k set has many short prefixes, such as length 16, as shown in Figure 6. Since the search procedure proceeds from the longest length, if the prefix match occurs at a short length, the Ab becomes large.
Table 6. Comparison of the number of on-chip Bloom filter queries.

    Routing Set (N)   Mb (KB)   BF-Chaining      PBF             VBF
                                 Ab     Wb       Ab     Wb       Ab     Wb
    1000                 1      14.6    18       6.7    18       4.4    18
                         2      13.9    18       6.7    18       6.3    18
                         4      13.5    18       6.7    18       6.7    18
    5000                 8      13.5    21      10.0    21       8.9    21
                        16      13.1    21      10.0    21       9.5    21
                        32      13.0    21      10.0    21      10.0    21
    14,553              16      16.6    22       8.2    22       5.5    22
                        32      16.2    22       8.2    22       7.5    22
                        64      15.9    22       8.2    22       8.2    22
    30,000              32      16.5    22       8.1    22       5.2    22
                        64      16.1    22       8.1    22       7.2    22
                       128      15.9    22       8.1    22       8.1    22
Table 7 shows the comparison of the off-chip search performance in terms of the average and worst-case numbers of hash table accesses, represented by Ah and Wh, respectively. The number of off-chip memory accesses is the most important performance criterion in the IP address lookup problem, and it should be minimized to improve the lookup performance. In the BF-chaining and the PBF, even if the size of the Bloom filter (Mb) increases, the off-chip hash table is accessed at least once in order to obtain the matching output port. However, hash table accesses occur infrequently in our proposed VBF architecture, since the Bloom filter stored in an on-chip memory returns the matching output port in every case except the indeterminable cases; only when the VBF produces an indeterminable result is the hash table accessed. It is shown that the average number of hash table accesses becomes zero as the size of the Bloom filter (Mb) increases.
Table 7. Comparison of the number of off-chip hash table accesses.

    Routing Set (N)   Mb (KB)   BF-Chaining        PBF               VBF
                                 Ah       Wh       Ah       Wh       Ah       Wh
    1000                 1      2.385      9      1.170      5      0.203      5
                         2      1.633      9      1.004      2      0.004      3
                         4      1.153      4      1.000      1      0.000      0
    5000                 8      1.565     10      1.052      3      0.024      4
                        16      1.140      4      1.001      2      0.000      0
                        32      1.015      3      1.000      1      0.000      0
    14,553              16      1.784     12      1.491      6      0.110      5
                        32      1.335      6      1.054      4      0.003      5
                        64      1.046      4      1.001      2      0.000      0
    30,000              32      1.619     13      1.322      6      0.124      6
                        64      1.160      6      1.037      3      0.004      6
                       128      1.019      3      1.001      2      0.000      0
6. Hardware Implementation
This section shows the hardware implementation details of our proposed structure. In particular, the parallel architecture with a single hash table shown in Figure 5 has been implemented. The hardware implementation has been carried out in Verilog using the Vivado 2017.4 development tool. Our target device is a NetFPGA CML operating at 100 MHz. The size of the VBF is m = 2N′. The VBF and the hash table implemented with BRAMs are loaded with the values obtained from the simulation at the behavioral level.
Figure 7 shows the hardware test flow. Generally, the input IP addresses would be provided by an external input generator, as shown in Figure 7a. However, in our experiment, the input IP addresses are stored in a BRAM, as shown in Figure 7b, in order to provide the inputs at the operating rate of the hardware implemented on the FPGA. The input IP addresses are stored in dual-port RAMs, and hence the parallel architecture can process two IP addresses in parallel.
Figure 7. Test flow of field programmable gate array (FPGA)-based IP address lookup: (a) general case; (b) our case.
Since the VBF search block in the basic IP address lookup module uses 11 hash indexes, obtained from Equation (3), and a VBF is implemented with a dual-port RAM, six duplicates of the VBF are required. Since the parallel architecture has two basic IP address lookup modules, the VBF duplication is doubled again, and hence 12 copies of the VBF are implemented in total.
Table 8 shows the memory requirement of the hardware implementation. The BRAM has 18-Kbit and 36-Kbit blocks, and the blocks are automatically allocated for each component. It is shown that a VBF is implemented with a single 18-Kbit block for the 1 k set, with four 36-Kbit blocks for the 5 k set, and so on. Since each parallel architecture requires 12 copies of the VBF, the total number of VBF blocks is multiplied by 12. Similarly, the numbers of blocks used to store a hash table and the input IP addresses are shown for each routing set. The values in total BRAMs represent the summation of the BRAMs in KBytes. The utilization rate of the BRAM is based on the BRAM capacity of 16,020 Kbits. Since each component is allocated in blocks, the memory requirement of a VBF or a hash table in the hardware implementation is greater than that in the behavioral simulation.
Table 9 shows the resource utilization. Capacity means the amount of available resources, and Used means the amount of resources used in our implementation. Utilization represents the ratio between Used and Capacity. It is shown that the utilization of the BRAM reaches up to 84.0% for the 30 k set.
Table 8. Memory requirement in hardware.

    Routing   Input Set (#. of blks)   VBF (#. of blks)    Total VBFs (#. of blks)   HT (#. of blks)     Total BRAMs
    Set       18-Kbit    36-Kbit       18-Kbit   36-Kbit   18-Kbit    36-Kbit        18-Kbit   36-Kbit   (KBytes)   (%)
    1000         0          3             1         0         12         0              0         3          54      2.7
    5000         4         12             0         4          0        48              0        24         387     19.3
    14,553       7         36             0         8          0        96              0        49         830     41.5
    30,000       0         80             0        16          0       192              0       102        1683     84.0
Table 9. Resource utilization.

    Resource   Capacity    1 k                5 k                14 k                30 k
                           Used   Util (%)    Used   Util (%)    Used    Util (%)    Used   Util (%)
    LUT        203,800     5560     2.7       6440     3.2       6995      3.4       7419     3.6
    FF         407,600     2824     0.7       3029     0.7       3084      0.8       3087     0.8
    BRAM           445       12     2.7         86    19.3      184.5     41.5        374    84.0
    IO             400        6     1.5          6     1.5          6      1.5          6     1.5
    BUFG            32        3     9.4          3     9.4          3      9.4          3     9.4
Table 10 shows the total on-chip power and the worst negative slack reported by the Vivado 2017.4 development tool.
Table 10. Power consumption and the worst negative slack.

                                  1 k     5 k     14 k    30 k
    Total on-chip power (W)      0.172   0.181   0.188   0.262
    Worst negative slack (ns)    88.51   82.35   84.30   79.56
Figure 8 graphically shows the resource utilization. Since the utilizations of IO and BUFG do not depend on the sizes of the routing sets, their utilizations are not shown. The utilizations of LUT and FF are almost constant, even when the size of the routing set increases. BRAM w/o IP refers to the BRAM utilization when the input IP addresses are not stored in BRAMs, as in Figure 7a. In this case, the BRAM utilization reaches up to 66.1%, which is substantially lower than 84.0%.
Figure 8. Resource utilization.
Figure 9 shows the average number of cycles needed to perform an IP address lookup and the throughput of our implementation. The average number of cycles reflects the search performance, and the search performance is affected by the prefix length distribution shown in Figure 6. The throughput is the maximum number of processed packets per second (p/s or pps), and it is inversely related to the average number of cycles. The best-case throughput is 4.92 million packets/sec for the 1 k set, while the worst-case throughput is 3.32 million packets/sec for the 5 k set, because the 5 k set has more short prefixes than the other sets, as shown in Figure 6. Note that the throughput depends more on the prefix length distribution than on the number of prefixes. The prefix length distribution is related to the role of a router, that is, whether the router is a backbone router or an edge router. Backbone routers have more short prefixes than edge routers, since edge routers mostly connect access networks, which have more specific (longer) prefixes.
Figure 9. Relationship between average number of cycles and throughput.
Considering that the number of prefixes scales to several hundred thousand in real backbone routing tables, if the proposed architecture is implemented with an ASIC, which has at least 4 MBytes of on-chip memory operating at 500 MHz [6], the expected throughput can be improved by 5 times. The proposed architecture can then provide wire-speed IP address lookup at a rate of about 15 to 25 million packets/sec, since the routing sets used in our simulation have similar characteristics to actual routing sets, being downloaded (and randomly selected) from real backbone routers.
7. Conclusions
In this paper, we have proposed a new efficient IP address lookup architecture using a Bloom filter. The proposed architecture is based on a vectored-Bloom filter, which stores an output port in each vector. We have also proposed the use of parallel architectures to improve the search performance. In the proposed parallel architectures, multiple VBF search blocks are implemented sharing a single hash table, since the hash table is infrequently accessed; hence, the memory efficiency is increased. The performance of the proposed architecture has been evaluated at the behavioral level in C and at the hardware level in Verilog. The behavioral evaluation shows that the proposed architecture can perform IP address lookups without accessing the off-chip hash table for a reasonably sized Bloom filter. The hardware evaluation shows that the proposed architecture provides a throughput of 4.92 million packets/sec on an FPGA operating at 100 MHz. The proposed hardware architecture is a promising candidate for wire-speed packet forwarding, since a higher degree of parallelism can be pursued when it is implemented with an ASIC operating at a higher frequency.
Author Contributions: Conceptualization, H.B. and H.L.; methodology, H.L.; software, H.B.; hardware, Q.L.; validation, H.B. and Q.L.; investigation, H.L.; resources, H.B. and Q.L.; data curation, H.B.; writing—original draft preparation, H.B.; writing—review and editing, H.L.; visualization, H.B.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L.
Funding: This research was funded by the National Research Foundation of Korea (NRF), NRF-2017R1A2B4011254.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
IP Internet Protocol
CIDR Classless Inter-Domain Routing
BMP Best Matching Prefix
ASIC Application-Specific Integrated Circuits
TCAM Ternary Content Addressable Memory
FPGA Field-Programmable Gate Arrays
HT Hash Table
BF Bloom Filter
VBF Vectored-Bloom Filter
PBF Parallel Bloom Filter
SAIL Splitting Approach for IP address Lookup
CRC Cyclic Redundancy Check
References
1. Chao, H. Next Generation Routers. Proc. IEEE 2002, 90, 1518–1588, doi:10.1109/jproc.2002.802001.
2. Lim, H.; Lee, N. Survey and Proposal on Binary Search Algorithms for Longest Prefix Match. IEEE Commun. Surv. Tutor. 2012, 14, 681–697, doi:10.1109/surv.2011.061411.00095.
3. Yang, T.; Xie, G.; Li, Y.; Fu, Q.; Liu, A.; Li, Q.; Mathy, L. Guarantee IP Lookup Performance with FIB Explosion. In Proceedings of the ACM SIGCOMM, Chicago, IL, USA, 17–22 August 2014; pp. 39–50, doi:10.1145/2619239.2626297.
4. Vagionas, C.; Maniotis, P.; Pitris, S.; Miliou, A.; Pleros, N. Integrated Optical Content Addressable Memories (CAM) and Optical Random Access Memories (RAM) for Ultra-Fast Address Look-Up Operations. Appl. Sci. 2017, 7, 700, doi:10.3390/app7070700.
5. Gupta, P.; Lin, S.; Mckeown, N. Routing Lookups in Hardware at Memory Access Speed. In Proceedings of the IEEE INFOCOM '98, San Francisco, CA, USA, 29 March–2 April 1998; pp. 1240–1247, doi:10.1109/infcom.1998.662938.
6. Dharmapurikar, S.; Krishnamurthy, P.; Taylor, D. Longest Prefix Matching Using Bloom Filters. IEEE/ACM Trans. Netw. 2006, 14, 397–409, doi:10.1109/tnet.2006.872576.
7. Lim, H.; Lim, K.; Lee, N.; Park, K. On Adding Bloom Filters to Longest Prefix Matching Algorithms. IEEE Trans. Comput. 2014, 63, 411–423, doi:10.1109/TC.2012.193.
8. Panda, P.; Dutt, N.; Nicolau, A. On-Chip vs. Off-Chip Memory: The Data Partitioning Problem in Embedded Processor-Based Systems. ACM Trans. Des. Autom. Electron. Syst. 2000, 5, 682–704, doi:10.1145/348019.348570.
9. Jiang, W.; Prasanna, V. Sequence-preserving parallel IP lookup using multiple SRAM-based pipelines. J. Parallel Distrib. Comput. 2009, 69, 778–789, doi:10.1016/j.jpdc.2009.04.001.
10. Erdem, O.; Bazlamacci, C. Array Design for Trie-based IP Lookup. IEEE Commun. Lett. 2010, 14, 773–775, doi:10.1109/lcomm.2010.08.100398.
11. Pérez, K.; Yang, X.; Scott-Hayward, S.; Sezer, S. Optimized packet classification for Software-Defined Networking. In Proceedings of the 2014 IEEE International Conference on Communications (ICC), Sydney, Australia, 10–14 June 2014; pp. 859–864, doi:10.1109/icc.2014.6883427.
12. Song, H.; Lockwood, J. Efficient packet classification for network intrusion detection using FPGA. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 20–22 February 2005; pp. 238–245, doi:10.1145/1046192.1046223.
13. Sanny, A.; Ganegedara, T.; Prasanna, V.K. A Comparison of Ruleset Feature Independent Packet Classification Engines on FPGA. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), Cambridge, MA, USA, 20–24 May 2013; pp. 124–133, doi:10.1109/ipdpsw.2013.249.
14. Viegas, E.; Santin, A.; França, A.; Jasinski, R.; Pedroni, V.; Oliveira, L. Towards an Energy-Efficient Anomaly-Based Intrusion Detection Engine for Embedded Systems. IEEE Trans. Comput. 2017, 66, 163–177, doi:10.1109/tc.2016.2560839.
15. Gidansky, J.; Stefan, D.; Dalal, I. FPGA-based SoC for real-time network intrusion detection using counting Bloom filters. In Proceedings of the IEEE Southeastcon, Atlanta, GA, USA, 5–8 March 2009; pp. 452–458, doi:10.1109/secon.2009.5174096.
16. Hieu, T.; Thinh, T.; Vu, T.; Tomiyama, S. Optimization of Regular Expression Processing Circuits for NIDS on FPGA. In Proceedings of the Second International Conference on Networking and Computing, Osaka, Japan, 30 November–2 December 2011; pp. 105–112, doi:10.1109/icnc.2011.23.
17. Bloom, B. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426, doi:10.1145/362686.362692.
18. Waldvogel, M.; Varghese, G.; Turner, J.; Plattner, B. Scalable High Speed IP Routing Lookups. In Proceedings of the ACM SIGCOMM '97 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, Cannes, France, 14–18 September 1997; pp. 25–35, doi:10.1145/263109.263136.
19. Mun, J.; Lim, H.; Yim, C. Binary Search on Prefix Lengths for IP Address Lookup. IEEE Commun. Lett. 2006, 10, 492–494, doi:10.1109/lcomm.2006.1638626.
20. Mun, J.; Lim, H. New Approach for Efficient IP Address Lookup Using a Bloom Filter in Trie-Based Algorithms. IEEE Trans. Comput. 2016, 65, 1558–1565, doi:10.1109/TC.2015.2444850.
21. Byun, H.; Lim, H. A New Bloom Filter Architecture for FIB Lookup in Named Data Networking. Appl. Sci. 2019, 9, 329, doi:10.3390/app9020329.
22. Moralis-Pegios, M.; Terzenidis, N.; Mourgias-Alexandris, G.; Vyrsokinos, K. Silicon Photonics towards Disaggregation of Resources in Data Centers. Appl. Sci. 2018, 8, 83, doi:10.3390/app8010083.
23. Lin, P.; Lin, Y.; Lai, Y.; Zheng, Y.; Lee, T. Realizing a Sub-Linear Time String-Matching Algorithm With a Hardware Accelerator Using Bloom Filters. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2009, 17, 1008–1020, doi:10.1109/tvlsi.2008.2012011.
24. Lai, B.; Chen, K.; Wu, P. A High-Performance Double-Layer Counting Bloom Filter for Multicore Systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 2473–2486, doi:10.1109/tvlsi.2014.2370761.
25. Reviriego, P.; Pontarelli, S.; Maestro, J.; Ottavi, M. A Synergetic Use of Bloom Filters for Error Detection and Correction. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 584–587, doi:10.1109/tvlsi.2014.2311234.
26. Chen, Y.; Schmidt, B.; Maskell, D. Reconfigurable Accelerator for the Word-Matching Stage of BLASTN. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2013, 21, 659–669, doi:10.1109/tvlsi.2012.2196060.
27. Tarkoma, S.; Rothenberg, C.E.; Lagerspetz, E. Theory and practice of Bloom filters for distributed systems. IEEE Commun. Surv. Tutor. 2012, 14, 131–155, doi:10.1109/SURV.2011.031611.00024.
28. Lu, J.; Yang, T.; Wang, Y.; Dai, H.; Jin, L.; Song, H.; Liu, B. One-Hashing Bloom Filter. In Proceedings of the IEEE 23rd International Symposium on Quality of Service (IWQoS), Portland, OR, USA, 15–16 June 2015; pp. 289–298, doi:10.1109/iwqos.2015.7404748.
29. Qiao, Y.; Li, T.; Chen, S. One Memory Access Bloom Filters and Their Generalization. In Proceedings of the IEEE INFOCOM, Shanghai, China, 10–15 April 2011; pp. 1745–1753, doi:10.1109/infcom.2011.5934972.
30. Yang, T.; Xie, G.; Liu, A.; Fu, Q.; Li, Y.; Li, X.; Mathy, L. Constant IP Lookup With FIB Explosion. IEEE/ACM Trans. Netw. 2018, 26, 1821–1836, doi:10.1109/tnet.2018.2853575.
31. Byun, H.; Lim, H. IP Address Lookup Algorithm Using a Vectored Bloom Filter. Trans. Korean Inst. Electr. Eng. 2016, 65, 2061–2068, doi:10.5370/KIEE.2016.65.12.2061.
32. Byun, H.; Li, Q.; Lim, H. Vectored-Bloom Filter Implemented on FPGA for IP Address Lookup. In Proceedings of the ICEIC 2019, Auckland, New Zealand, 22–25 January 2019; pp. 967–970, doi:10.23919/ELINFOCOM.2019.8706399.
33. Prefix Names. Available online: http://www.potaroo.net (accessed on 29 October 2019).
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
... A functional Bloom filter (FBF) can return a value corresponding to an input as well as membership information regarding whether the input is an element of a set [35], [37]. The FBF has been used as a key-value structure in various network applications owing to its space-efficient characteristics [19], [34], [38]- [40]. The FBF utilizes the fact that various combinations of hash indexes of each key can be used to distinguish one key from other keys. ...
Article
Full-text available
As a key component in implementing Named Data Networking (NDN), Pending Interest Table (PIT) requires an efficient exact-matching algorithm for a scalable and fast PIT lookup. A Bloom filter (BF) is a memory-efficient data structure for performing exact matching operations. In this paper, three different BF-based PIT architectures are proposed: PIT using functional Bloom filters (FBF-PIT), PIT using counting Bloom filters with return values (rCBF-PIT), and a refined rCBF-PIT with signatures (R-rCBF-PIT). The proposed BF-based PITs incrementally allocate a new BF for storing multiple incoming faces of Interest packets with the same content name. For a Data packet lookup, the proposed PIT architectures simultaneously access every BF structure to find matching faces and delete the faces (i.e., matching Interest packet information). The functional Bloom filter (FBF) used in an FBF-PIT is a key-value data structure that stores values only without keys. However, because the number of non-reusable conflict cells in the FBF increases as the number of stored packets increases in the FBF-PIT, the indeterminable rate increases. To decrease the indeterminable rate, we propose the rCBF-PIT, which uses counting Bloom filters with return values (rCBFs), allowing reusable conflict cells. False positives for Interest packets lead to incorrect deletions that can cause false negatives for incoming Data packets. Because most of the false positives occur in the first BF structure, we finally propose the R-rCBF-PIT, in which the first rCBF is replaced with an rCBF with a signature field. The proposed PITs also provide an aging mechanism using a valid bit and a hit bit for entry expiration. Simulation results show that rCBF-PIT and R-rCBF-PIT both reduce the indeterminable rate by more than 81% compared with FBF-PIT. The results also show that R-rCBF-PIT resolves false negatives caused by incorrect deletions by including the signature fields in the first rCBF.
... However, the process of finding an LMP is very complex and consumes a lot of cycles. Therefore, many researchers have been working on high-performance IP address lookup [1][2][3][4][5][6][7][8][9][10][11]. ...
Article
Full-text available
Prefix caching is one of the notable techniques in enhancing the IP address lookup performance which is crucial in packet forwarding. A cached prefix can match a range of IP addresses, so prefix caching leads to a higher cache hit ratio than IP address caching. However, prefix caching has an issue to be resolved. When a prefix is matched in a cache, the prefix cannot be the result without assuring that there is no longer descendant prefix of the matching prefix which is not cached yet. This is due to the aspect of the IP address lookup seeking to find the longest matching prefix. Some prefix expansion techniques avoid the problem, but the expanded prefixes occupy more entries as well as cover a smaller range of IP addresses. This paper proposes a novel prefix caching scheme in which the original prefix can be cached without expansion. In this scheme, for each prefix, a Bloom filter is constructed to be used for testing if there is any matchable descendant. The false positive ratio of a Bloom filter generally grows as the number of elements contained in the filter increases. We devise an elaborate two-level Bloom filter scheme which adjusts the filter size at each level, to reduce the false positive ratio, according to the number of contained elements. The experimental result shows that the proposed scheme achieves a very low cache miss ratio without increasing the number of prefixes. In addition, most of the filter assertions are negative, which means the proposed prefix cache effectively hits the matching prefix using the filter.
Article
As a challenging attempt to replace a traditional data structure with a learned model, this paper proposes a learned functional Bloom filter (L-FBF) for a key--value storage. The learned model in the proposed L-FBF learns the characteristics and the distribution of given data and classifies each input. It is shown through theoretical analysis that the L-FBF provides a lower search failure rate than a single FBF in the same memory size, while providing the same semantic guarantees. For model training, character-level neural networks are used with pretrained embeddings. In experiments, four types of different character-level neural networks are trained: a single gated recurrent unit (GRU), two GRUs, a single long short-term memory (LSTM), and a single one-dimensional convolutional neural network (1D CNN). Experimental results prove the validity of theoretical results, and show that the L-FBF reduces the search failures by 82.8% to 83.9% when compared with a single FBF under the same amount of memory used.
Article
Full-text available
One of the most critical router’s functions is the IP lookup. For each incoming IP packet, IP lookup determines the output port to which the packet should be forwarded. IPv6 addresses are envisioned to replace IPv4 addresses because the IPv4 address space is exhausted. Therefore, modern IP routers need to support IPv6 lookup. Most of the existing IP lookup algorithms are adjusted for the IPv4 lookup, but not for the IPv6 lookup. Scalability represents the main problem in the existing IP lookup algorithms because the IPv6 address space is much larger than the IPv4 address space due to longer IPv6 addresses. In this paper, we propose a novel IPv6 lookup algorithm that supports very large IPv6 lookup tables and achieves high IP lookup throughput.
Article
Full-text available
Network traffic has increased rapidly in recent years, mainly associated with the massive growth of various applications on mobile devices. Named data networking (NDN) technology has been proposed as a future Internet architecture for effectively handling this ever-increasing network traffic. In order to realize the NDN, high-speed lookup algorithms for a forwarding information base (FIB) are crucial. This paper proposes a level-priority trie (LPT) and a 2-phase Bloom filter architecture implementing the LPT. The proposed Bloom filters are sufficiently small to be implemented with on-chip memories (less than 3 MB) for FIB tables with up to 100,000 name prefixes. Hence, the proposed structure enables high-speed FIB lookup. The performance evaluation result shows that FIB lookups for more than 99.99% of inputs are achieved without needing to access the database stored in an off-chip memory.
Article
Full-text available
With the fast development of Internet, the forwarding tables in backbone routers have been growing fast in size. An ideal IP lookup algorithm should achieve constant, yet small, IP lookup time, and on-chip memory usage. However, no prior IP lookup algorithm achieves both requirements at the same time. In this paper, we first propose SAIL, a splitting approach to IP lookup. One splitting is along the dimension of the lookup process, namely finding the prefix length and finding the next hop, and another splitting is along the dimension of prefix length, namely IP lookup on prefixes of length ≤ 24 and that longer than 24. Second, we propose a suite of algorithms for IP lookup based on our SAIL framework. Third, we implemented our algorithms on four platforms: CPU, FPGA, GPU, and many-core. We conducted extensive experiments to evaluate our algorithms using real FIBs and real traffic from a major ISP in China. Experimental results show that our SAIL algorithms are much faster than well known IP lookup algorithms.
Article
Full-text available
In this paper, we demonstrate two subsystems based on Silicon Photonics, towards meeting the network requirements imposed by disaggregation of resources in Data Centers. The first one utilizes a 4 × 4 Silicon photonics switching matrix, employing Mach Zehnder Interferometers (MZIs) with Electro-Optical phase shifters, directly controlled by a high speed Field Programmable Gate Array (FPGA) board for the successful implementation of a Bloom-Filter (BF)-label forwarding scheme. The FPGA is responsible for extracting the BF-label from the incoming optical packets, carrying out the BF-based forwarding function, determining the appropriate switching state and generating the corresponding control signals towards conveying incoming packets to the desired output port of the matrix. The BF-label based packet forwarding scheme allows rapid reconfiguration of the optical switch, while at the same time reduces the memory requirements of the node’s lookup table. Successful operation for 10 Gb/s data packets is reported for a 1 × 4 routing layout. The second subsystem utilizes three integrated spiral waveguides, with record-high 2.6 ns/mm2, delay versus footprint efficiency, along with two Semiconductor Optical Amplifier Mach-Zehnder Interferometer (SOA-MZI) wavelength converters, to construct a variable optical buffer and a Time Slot Interchange module. Error-free on-chip variable delay buffering from 6.5 ns up to 17.2 ns and successful timeslot interchanging for 10 Gb/s optical packets are presented.
Article
Full-text available
Electronic Content Addressable Memories (CAM) implement Address Look-Up (AL) table functionalities of network routers; however, they typically operate in the MHz regime, turning AL into a critical network bottleneck. In this communication, we demonstrate the first steps towards developing optical CAM alternatives to enable a re-engineering of AL memories. Firstly, we report on the photonic integration of Semiconductor Optical Amplifier-Mach Zehnder Interferometer (SOA-MZI)-based optical Flip-Flop and Random Access Memories on a monolithic InP platform, capable of storing the binary prefix-address data-bits and the outgoing port information for next hop routing, respectively. Subsequently the first optical Binary CAM cell (B-CAM) is experimentally demonstrated, comprising an InP Flip-Flop and a SOA-MZI Exclusive OR (XOR) gate for fast search operations through an XOR-based bit comparison, yielding an error-free 10 Gb/s operation. This is later extended via physical layer simulations in an optical Ternary-CAM (T-CAM) cell and a 4-bit Matchline (ML) configuration, supporting a third state of the "logical X" value towards wildcard bits of network subnet masks. The proposed functional CAM and Random Access Memories (RAM) sub-circuits may facilitate light-based Address Look-Up tables supporting search operations at 10 Gb/s and beyond, paving the way towards minimizing the disparity with the frantic optical transmission linerates, and fast re-configurability through multiple simultaneous Wavelength Division Multiplexed (WDM) memory access requests.
Article
Full-text available
Nowadays, a significant part of all network accesses comes from embedded and battery-powered devices, which must be energy efficient. This paper demonstrates that a hardware (HW) implementation of network security algorithms can significantly reduce their energy consumption compared to an equivalent software (SW) version. The paper has four main contributions: (i) a new feature extraction algorithm, with low processing demands and suitable for hardware implementation; (ii) a feature selection method with two objectives - accuracy and energy consumption; (iii) detailed energy measurements of the feature extraction engine and three machine learning (ML) classifiers implemented in SW and HW - Decision Tree (DT), Naive-Bayes (NB), and k-Nearest Neighbors (kNN); and (iv) a detailed analysis of the tradeoffs in implementing the feature extractor and ML classifiers in SW and HW. The new feature extractor demands significantly less computational power, memory, and energy. Its SW implementation consumes only 22 percent of the energy used by a commercial product and its HW implementation only 12 percent. The dual-objective feature selection enabled an energy saving of up to 93 percent. Comparing the most energy-efficient SW implementation (new extractor and DT classifier) with an equivalent HW implementation, the HW version consumes only 5.7 percent of the energy used by the SW version.
Article
A Bloom filter is a space-efficient data structure popularly applied in many network algorithms. This paper proposes a vectored Bloom filter to provide a high-speed Internet protocol (IP) address lookup. While each hash index for a Bloom filter indicates one bit, which is used to identify the membership of the input, each index of the proposed vectored Bloom filter indicates a vector which is used to represent the membership and the output port for the input. Hence the proposed Bloom filter can complete the IP address lookup without accessing an off-chip hash table for most cases. Simulation results show that with a reasonable sized Bloom filter that can be stored using an on-chip memory, an IP address lookup can be performed with less than 0.0003 off-chip accesses on average in our proposed architecture.
Article
The snoopy-based protocol is a widely used cache coherence mechanism for a symmetric multiprocessor (SMP) system. However, this broadcast-based protocol blindly disseminates data sharing information across the system, and introduces many unnecessary data operations. This paper proposes a novel architecture of double-layer counting Bloom filter (DLCBF) to reduce the unnecessary data lookups on the local cache and redundant data transactions on the shared interconnection of an SMP system. By adding an extra filtering layer, the DLCBF effectively exploits the data locality of applications. The two-layer hierarchy reduces the storage size of DLCBF by 18.75%, and achieves 81.99% and 31.36% better filtering rates when compared with a classic Bloom filter (BF) and original counting BF, respectively. When applied on the segmented shared bus of an SMP system, the DLCBF outperforms the previous work by 58% for In-filters and 1.86x for Out-filters. This paper also comprehensively explores the key design parameters of DLCBF, including the sizes of top-layer, bottom-layer, and multilayer design. The results show that enlarging the layer filters enhance the filtering rates of DLCBF, while adding an extra filter layer only provides slight benefit.
Article
IP address lookup determines the longest matching prefix of each incoming destination address. Since the IP address lookup should be performed at wire-speed for every packet in Internet routers, search speed is the most important performance metric. Previous researches have shown that the search performance of trie-based algorithms can be improved by adding onchip Bloom filters. In these algorithms, an on-chip Bloom filter identifies the membership of a node in an off-chip trie, and the number of trie accesses is reduced, because the Bloom filter can filter out accesses to non-existing nodes in the trie. In this paper, we propose a new method of utilizing a Bloom filter for the IP address lookup problem. In the previous Bloom filter-based approach, false positiveness has to be identified by accessing the off-chip trie for every positive result, since false positives can produce wrong results. In our proposed approach, the false positiveness of a Bloom filter is not necessarily identified, since false positives do not mislead the search. Hence the number of off-chip trie accesses are significantly reduced. Simulation results show that the best matching prefix can be found with a single offchip access in average and in the worst-case with the reasonable size of a Bloom filter in our proposed method.