Shift-based Pattern Matching for Compressed Web
Traffic
Anat Bremler-Barr
Computer Science Dept.
Interdisciplinary Center, Herzliya, Israel
Email: bremler@idc.ac.il
Yaron Koral
Blavatnik School of Computer Sciences
Tel-Aviv University, Israel
Email: yaronkor@post.tau.ac.il
Victor Zigdon
Computer Science Dept.
Interdisciplinary Center, Herzliya, Israel
Email: victor.zigdon@gmail.com
Abstract—Compressing web traffic using standard GZIP is becoming both popular and challenging due to the huge increase in wireless web devices, where bandwidth is limited. Security and other content-based networking devices are required to decompress the traffic of tens of thousands of concurrent connections in order to inspect the content for different signatures. The overhead imposed by the decompression inhibits most devices from handling compressed traffic, which in turn either limits traffic compression or introduces security holes and other dysfunctionalities.
The ACCH algorithm [1] was the first to present a unified approach to pattern matching and decompression, taking advantage of information gathered in the decompression phase to accelerate the pattern matching. ACCH accelerated the DFA-based Aho-Corasick multi-pattern matching algorithm. In this paper, we present a novel algorithm, SPC (Shift-based Pattern matching for Compressed traffic), that accelerates the commonly used Wu-Manber pattern matching algorithm. SPC is simpler and has higher throughput and lower storage overhead than ACCH. Analysis of real web traffic and real security-device signatures shows that we can skip scanning up to 87.5% of the data and gain a performance boost of more than 51% as compared to ACCH. Moreover, the technique requires only 4KB of additional information per connection, as compared to ACCH's 8KB.
I. INTRODUCTION
Compressing HTTP text when transferring pages over the web is sharply increasing, motivated mostly by the expansion of web surfing on mobile cellular devices such as smartphones. Sites like Yahoo!, Google, MSN, YouTube, Facebook and others use HTTP compression to enhance the speed of their content downloads. Moreover, the iPhone API for app development added support for web traffic compression. Fig. 1 shows statistics of top sites using HTTP compression: two-thirds of the 1000 most popular sites use HTTP compression. The standard compression method used by HTTP 1.1 is GZIP.
This sharp increase in HTTP compression presents new challenges to networking devices that inspect the traffic contents for security hazards and load-balancing decisions. Those devices reside between the server and the client and perform Deep Packet Inspection (DPI). When receiving compressed traffic, the networking device first needs to decompress the message in order to inspect its payload. This process suffers from performance penalties in both time and space.
Those penalties lead most security-tool vendors to either ignore compressed traffic, which may lead to missed detection
Fig. 1. HTTP compression usage among the Alexa [2] top-site lists.
of malicious activity, or to ensure that no compression takes place by rewriting the client-to-server HTTP header to indicate that compression is not supported by the client's browser, thus decreasing the overall performance and bandwidth. A few security tools [3] handle compressed HTTP traffic by decompressing the entire page at the proxy and performing a signature scan on it before forwarding it to the client. The latter option is not applicable for security tools that operate at high speed or when introducing additional delay is not an option.
Recent work [1] presents a technique for pattern matching on compressed traffic that decompresses the traffic and then uses data from the decompression phase to accelerate the process. Specifically, the GZIP compression algorithm eliminates repetitions of strings using back-references (pointers). The key insight is to store information produced by the pattern matching algorithm for the already-scanned decompressed traffic and, in the case of pointers, to use this data to either find a match or to skip scanning that area. That work analyzed the case of using the well-known Aho-Corasick (AC) [4] algorithm as the multi-pattern matching technique. AC has good worst-case performance, since every character requires traversing exactly one Deterministic Finite Automaton (DFA) edge. However, the adaptation for compressed traffic, where some characters represented by pointers can be skipped, is complicated, since AC requires inspection of every byte.
Inspired by the insights of that work, we investigate the case of performing DPI over compressed web traffic using the shift-based multi-pattern matching technique of the modified Wu-Manber (MWM) algorithm [5]. MWM inherently does not scan every position within the traffic; in fact, it shifts over (skips) areas in which the algorithm concludes that no pattern starts.
As a preliminary step, we present an improved version of the MWM algorithm (see Section III). The modification improves both time and space aspects to fit the large number of patterns within current pattern sets such as the Snort database [6]. We then present the Shift-based Pattern matching for Compressed traffic algorithm, SPC, which accelerates MWM on compressed traffic. SPC results in a simpler algorithm, with higher throughput and lower storage overhead than the accelerated AC, since MWM's basic operation already involves shifting over (skipping) some of the traffic. Thus, it is natural to combine MWM with the idea of skipping parts of pointers.
We show in Section V that we can skip scanning up to 87.5% of the data and gain a performance boost of more than 73% as compared to the MWM algorithm on real web traffic and security-tool signatures. Furthermore, we show that the suggested algorithm also gains a normalized throughput improvement of 51% as compared to the best prior art [1]. The SPC algorithm also halves the additional space required for previous scan results, storing only 4KB per connection as compared to the 8KB of [1].
II. BACKGROUND
Compressed HTTP: HTTP 1.1 [7] supports the usage of
content-codings to allow a document to be compressed. The
RFC suggests three content-codings: GZIP, COMPRESS and
DEFLATE. In fact, GZIP uses DEFLATE with an additional
thin shell of meta-data. For the purpose of this paper, both
algorithms are considered the same. These are the common
codings supported by browsers and web servers.1
The GZIP algorithm uses a combination of the following compression techniques: first the text is compressed with the LZ77 algorithm, and then the output is compressed with Huffman coding. Let us elaborate on the two algorithms:
(1) LZ77 compression [8] reduces the string representation size by spotting repeated strings within the last 32KB of the uncompressed data. The algorithm replaces each repeated string with a (distance, length) pair, where distance is a number in [1, 32768] (32K) indicating the distance in bytes of the repeated string from the current location, and length is a number in [3, 258] indicating its length. For example, the text 'abcdefgabcde' can be compressed to 'abcdefg(7,5)'; namely, "go back 7 bytes and copy 5 bytes from that point". LZ77 refers to the above pair as a "pointer" and to uncompressed bytes as "literals".
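For intuition, the following minimal sketch decodes a list of LZ77 symbols in the (distance, length) representation described above. The function name and the list-of-symbols input format are illustrative conveniences, not GZIP's actual wire format.

def lz77_decode(symbols):
    out = bytearray()
    for sym in symbols:
        if isinstance(sym, tuple):           # pointer: (distance, length)
            distance, length = sym
            for _ in range(length):          # byte-by-byte copy also handles
                out.append(out[-distance])   # overlapping references
        else:                                # literal character
            out.append(ord(sym))
    return out.decode()

# 'abcdefg(7,5)' decodes back to 'abcdefgabcde'
assert lz77_decode(list("abcdefg") + [(7, 5)]) == "abcdefgabcde"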
(2) Huffman coding [9] - Recall that the second stage of GZIP is Huffman coding, which receives the LZ77 symbols as input. The purpose of Huffman coding is to reduce the symbol coding size by encoding frequent symbols with fewer bits. Huffman coding assigns a variable-size codeword to each symbol. Dictionaries are provided to facilitate the translation of binary codewords to bytes.
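The decoding side can be illustrated with a toy prefix code; the codebook below is hypothetical and far simpler than DEFLATE's canonical Huffman dictionaries.

code = {'0': 'e', '10': 't', '110': 'a', '111': 'x'}  # hypothetical codebook

def huffman_decode(bits, code):
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in code:            # prefix property: no codeword is a prefix of
            out.append(code[cur])  # another, so the first hit is unambiguous
            cur = ''
    return ''.join(out)

assert huffman_decode('0101100', code) == 'etae'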
¹Analyzing packets from the Internet Explorer, Firefox and Chrome browsers shows that they accept only the GZIP and DEFLATE codings.
Deep packet inspection (DPI): Essential to almost every intrusion detection system is the ability to search through packets and identify content that matches known attacks. Space- and time-efficient string matching algorithms are therefore important for inspection at line rate. The two fundamental paradigms of string matching derive from deterministic finite automaton (DFA) based algorithms and shift-based algorithms. The fundamental algorithm of the first paradigm is Aho-Corasick (AC) [4], which provides deterministic linear time in the size of the input. The most popular algorithm of the second paradigm is the modified Wu-Manber (MWM) [5]. The algorithm does not have deterministic performance and hence may be exposed to algorithmic attacks. Still, such attacks can be easily identified, and the system can switch to another engine with deterministic characteristics. Overall, the average-case performance of MWM is among the best of all multi-pattern string matching algorithms.
A. The Challenges of Performing DPI over Compressed Traffic
Recall that in the LZ77 compression algorithm each symbol is determined dynamically by the data. For instance, a string's representation depends on whether it is part of a repeated section and on the distance from that occurrence, which in turn implies that LZ77 (and hence GZIP) is an adaptive compression. Thus, encoding the pattern itself is futile for DPI, since it will not appear in the text in any specific form, implying that there is no "easy" way to do DPI without decompression. Still, decompression is a considerably light process that imposes only a slight performance penalty on the entire process: LZ77 decompression requires copying consecutive sequences of bytes and therefore benefits from the cache advantages gained by spatial and temporal locality. Huffman decoding is also a light task that requires, most of the time, a single memory access per symbol to a 200B dictionary.
The space required for managing a multiple-connection environment is also an important issue to tackle. In such an environment, the LZ77 32KB window imposes a significant space penalty, since in order to perform deep packet inspection one needs to maintain the 32KB window of every active compressed session. When dealing with thousands of concurrent sessions, this overhead becomes significant. Recent work [10] has shown techniques that circumvent this problem and drastically reduce the space requirement by over 80%, with only a slight increase in time.
III. THE MODIFIED WU-MANBER ALGORITHM
In this section, as a preliminary step to SPC, we present the basic MWM algorithm and an improved version of it. The modifications improve both time and space aspects to fit the large number of patterns within current pattern sets.
MWM can be thought of as an extension of the Boyer-Moore (BM) [11] single-pattern matching algorithm. In that algorithm, given a single pattern of length n to match, one can look ahead in the input string by n characters. If the character at this position does not occur in our pattern, we can immediately move the search pointer ahead by n characters without examining the characters in between. If the character appears in the pattern, but is not its last character, we can skip ahead by the largest number of bytes that ensures we have not missed an instance of our pattern. This technique is adapted in a straightforward manner to most implementations of shift-based multi-pattern string matching algorithms, including MWM. The technique fits fixed-length patterns, hence MWM trims all patterns to their m-byte prefix, where m is the size of the shortest pattern. In addition, determining the shift value based on a single character does not fit a multi-pattern environment, since almost every character would appear as the last byte of some pattern. Instead, MWM uses a predefined group of B bytes to determine the shift value.
Algorithm 1 outlines the main MWM scan loop and the exact pattern-match process. MWM starts by precomputing two tables: a skip shift table called ShiftTable (a.k.a. SHIFT in MWM) and a pattern hash table called Ptrns (a.k.a. PREFIX and HASH in MWM). The ShiftTable determines the shift value after each text scan. On average, MWM performs shifts larger than one, hence it skips bytes. The scan is performed using a virtual scan window of size m. The shift value is determined by indexing the ShiftTable with the B-byte suffix of the scan window (Line 3). As opposed to MWM, which implemented ShiftTable as a complete array with all possible keys (i.e., |Σ|^B entries, where |Σ| is the alphabet size), we implement ShiftTable as a hash table and store only keys with a shift value smaller than the maximal one.
Algorithm 1 The MWM Algorithm
trf_1...trf_n - the input traffic
pos - the position of the next m-byte scan window
ShiftTable - table that holds the shift value for each last-B-bytes value of the window
Ptrns - the pattern set hashed by the first m bytes of the patterns
1:  procedure ScanText(trf_1...trf_n)
2:    pos ← 1
3:    while pos + m ≤ n do   ▷ get shiftValue using last B bytes
4:      shiftValue ← ShiftTable[trf_pos+m−B...trf_pos+m]
5:      if shiftValue = 0 then   ▷ check for exact matches
6:        for all pat in Ptrns[trf_pos...trf_pos+m] do
7:          if pat = trf_pos...trf_pos+pat.len then
8:            Handle Match Found
9:          end if
10:       end for
11:       pos ← pos + 1
12:     else
13:       pos ← pos + shiftValue   ▷ shiftValue > 0
14:     end if
15:   end while
ShiftTable values determine how far we can shift the text scan forward. Let X = X_1...X_B be the B-byte suffix of the scan window. If X does not appear as a substring in any pattern, we can make the maximal shift, m − B + 1 bytes. Otherwise, we find the rightmost occurrence of X in any of the patterns: assume that [q − B + 1, q] is the rightmost occurrence of X in any of the patterns. In such a case, we skip m − q bytes. Generally, the values in the shift table are the largest possible safe skip values.
When the shift table returns a 0 value (no shift), a possible match is found. In this case, all m bytes of the scan window are used to index the Ptrns hash table to find a list of possibly matching patterns. These patterns are compared to the text to find any matches (Lines 6–11). Then the input is shifted ahead by one byte and the scan process continues.
The Ptrns hash table has a major effect on the performance of MWM. In the original MWM implementation, the pattern set is hashed with only the B-byte prefix of the patterns, resulting in an unbalanced hash with long chains of patterns that share the same hash key. For example, when B = 2, the average chain length is 4.2 for the Snort DB, slowing down the exact-matching process, in which one iterates over the list of possibly matching patterns and compares each of them to the traffic text. Since the number of patterns has grown tremendously in recent years, a longer hash key should be used; thus we take the entire scan window as the hash key. That reduces the average hash load to 1.44 for the Snort DB.
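The following runnable sketch illustrates the improved variant described above: ShiftTable is a sparse map holding only non-maximal shift values, and Ptrns is keyed by the full m-byte window. It is an illustrative Python reimplementation of Algorithm 1, not the authors' code, and it assumes all patterns are at least m bytes long.

def build_tables(patterns, m, B):
    shift_table, ptrns = {}, {}
    for pat in patterns:
        prefix = pat[:m]                       # trim to the shortest length m
        ptrns.setdefault(prefix, []).append(pat)
        for q in range(B, m + 1):              # rightmost occurrence wins,
            block = prefix[q - B:q]            # i.e. keep the smallest shift
            shift_table[block] = min(shift_table.get(block, m - B + 1), m - q)
    return shift_table, ptrns

def mwm_scan(text, patterns, m, B):
    shift_table, ptrns = build_tables(patterns, m, B)
    matches, pos = [], 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        shift = shift_table.get(window[m - B:], m - B + 1)
        if shift == 0:                          # possible match: verify against
            for pat in ptrns.get(window, []):   # the full patterns
                if text.startswith(pat, pos):
                    matches.append((pos, pat))
            pos += 1
        else:
            pos += shift
    return matches

# with m = 5, B = 2 this reports matches at positions 4 and 10
print(mwm_scan("the river shines", ["river", "shine"], 5, 2))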
Fig. 2 shows an excerpt of the MWM data structure for B = 2. All patterns are trimmed to m = 5 bytes (Fig. 2(a)). Fig. 2(b) presents the shift-table entries with shift values smaller than the maximal shift. The rest of the byte pairs, not shown in the example, are those which gain the highest shift value of m − B + 1 = 4. Byte pairs in the middle of the strings have reduced shift values, and those at the end of the strings, such as 'nb' or 'er' with shift value 0, must be checked for an exact match. Fig. 2(c) shows an MWM scan example. The scan window of length 5 starts at the beginning of the text and advances by skipping segments of the text. Note that most of the time the scan window gains a shift value larger than 1. There are two cases where the shift value is 0 and the Ptrns hash table is queried. The first case returns a match of the string 'river', while the second does not locate any matched pattern.
Fig. 2. (a) Pattern set and the m-byte prefixes. (b) Shift table of the corresponding pattern set. (c) MWM scan example. The arrows indicate shifts of the scan window larger than 1. The row below the text shows the shift value for each m-byte scan-window step. The bottom line contains the B-byte value used for the shift calculation after each step.
IV. SHIFT-BASED PATTERN MATCHING FOR COMPRESSED TRAFFIC (SPC)
In this section, we present our Shift-based Pattern matching algorithm for Compressed HTTP traffic (SPC). Recall that HTTP uses GZIP which, in turn, uses LZ77, which compresses data with pointers to past occurrences of strings. Thus, the bytes referred to by the pointers (called referred bytes or the referred area) were already scanned; hence, if we have prior knowledge that an area does not contain patterns, we can skip scanning most of it.
Observe that even if no patterns were found when the referred area was scanned, patterns may occur at the boundaries of the pointer: a prefix of the referred bytes may be a suffix of a pattern that started prior to the pointer; or a suffix of the referred bytes may be a prefix of a pattern that continues after the pointer (as shown in Fig. 3). Therefore, special care needs to be taken to handle pointer boundaries correctly and to maintain MWM characteristics while skipping data represented by LZ77 pointers.
The general method of the algorithm is to use a combined technique that scans uncompressed portions of the data using MWM and skips scanning most of the data represented by the LZ77 pointers. Note that scanning is performed on decompressed data, such that both the decompression and scanning tasks are performed on-the-fly, while using the pointer information to accelerate scanning. For simplicity and clarity of the algorithm description, the pseudocode is written as if all the uncompressed text and previous scan information were kept in memory. However, in a real-life implementation it is enough to store only the last 32KB of the uncompressed traffic.
The SPC pseudocode is given in Algorithm 2. The key idea is that we store additional information about partial matches found within previously scanned text, in a bit vector called PartialMatch. The j-th bit is set to true if at position j the m bytes of the scan window match an m-byte prefix of some pattern. Note that we store partial-match rather than exact-match information. Hence, if the body of the referred area contains no partial matches, we can skip checking the pointer's interior. However, if the referred area contains a partial match, we still need to perform an exact match. Maintaining partial-match rather than exact-match information is significantly less complicated, especially over skipped characters, due to the fact that pointers can copy parts of patterns.
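A minimal sketch of this bookkeeping is shown below, with one flag per uncompressed byte; a real implementation would pack the flags into a 4KB bit vector over a 32KB cyclic window, as discussed later. The class and method names are ours.

class PartialMatch:
    def __init__(self, n):
        self.bits = bytearray(n)   # per byte of utrf; 1 = m-byte prefix match

    def mark(self, pos):
        self.bits[pos] = 1

    def find(self, start, end):    # findPartialMatches(start...end), inclusive
        return [j for j in range(start, end + 1) if self.bits[j]]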
The algorithm integrates smoothly with MWM through the PartialMatch bit vector. In fact, as long as the scan window is not fully contained within a pointer's boundaries, a regular MWM scan is performed (Lines 22–34). The only change is that we update the PartialMatch data structure (Line 25). Note that shiftValue = 0 only implies that the B-byte suffix of the scan window matched an entry within ShiftTable. It does not imply a partial match, which is explicitly checked by querying the Ptrns table.
In the second case, where the m-byte scan window shifts into a position such that it is fully contained within pointer boundaries, SPC checks which areas of the pointer can be
Algorithm 2 The SPC Algorithm
Definitions are as in Algorithm 1, with the following additions:
utrf - the uncompressed HTTP traffic
pointer - describes the pointer parameters: dist, len and endPos (the position of the pointer's rightmost byte in the uncompressed data); data received from the decompression phase
PartialMatch - bit vector that indicates whether there is a partial match in the already-scanned traffic; each bit in PartialMatch has a corresponding byte in utrf
findPartialMatches(start...end) - returns a list of partial-match positions in the range start...end
1:  procedure ScanText(utrf_1...n)
2:    pos ← 1
3:    set PartialMatch_1...n bits to false
4:    while pos + m ≤ n do
5:      if utrf_pos...utrf_pos+m window internal to pointer then   ▷ check valid areas for skipping
6:        start ← pos − pointer.dist
7:        end ← pointer.endPos − pointer.dist − (m − 1)
8:        pMatchList ← findPartialMatches(start...end)
9:        if pMatchList is not empty then
10:         for all pm in pMatchList do
11:           pos ← pm.pos + pointer.dist
12:           PartialMatch[pos] ← true
13:           for all pat in Ptrns[utrf_pos...utrf_pos+m] do
14:             if pat = utrf_pos...utrf_pos+pat.len then
15:               Handle Match Found
16:             end if
17:           end for
18:         end for
19:       end if
20:       pos ← pointer.endPos − (m − 1)
21:     else   ▷ MWM scan with PartialMatch updating
22:       shiftValue ← ShiftTable[utrf_pos+m−B...utrf_pos+m]
23:       if shiftValue = 0 then
24:         if Ptrns[utrf_pos...utrf_pos+m] is not empty then
25:           PartialMatch[pos] ← true
26:           for all pat in Ptrns[utrf_pos...utrf_pos+m] do
27:             if pat = utrf_pos...utrf_pos+pat.len then
28:               Handle Match Found
29:             end if
30:           end for
31:         end if
32:         pos ← pos + 1
33:       else pos ← pos + shiftValue   ▷ shiftValue > 0
34:       end if
35:     end if
36:   end while
skipped (Lines 6–20). We start by checking whether any partial match occurred within the referred bytes, by calling findPartialMatches(start...end) (Line 8). In the simple case where no partial matches were found, we can safely shift the scan window to m − 1 bytes before the pointer's end (Line 20). In effect, we skip the entire pointer body, set the end of the scan window one byte past the pointer, and continue with the regular MWM scan. The correctness is due to the fact that any point prior to that position is guaranteed to be free of partial matches (otherwise there would also have been a match within the referred bytes). The SPC algorithm gains the most from shifting over the pointer body without the extra overhead of checking ShiftTable and Ptrns in the cases where there are no actual partial matches.
If findPartialMatches(start...end) returns partial matches, we are certain that those were copied entirely from the referred bytes; therefore, we start by setting the corresponding positions within the PartialMatch bit vector to true (Line 12). For each partial match, we then query the Ptrns hash table to check whether an exact match occurs, in the same way as in MWM (Lines 13–17).
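A sketch of this pointer-handling step (Lines 6–20 of Algorithm 2) is shown below, reusing the PartialMatch class and the Ptrns dictionary from the earlier sketches. Positions here are 0-based, unlike the 1-based pseudocode, and the function signature is our own simplification.

def handle_pointer(utrf, pm, ptrns, pos, dist, end_pos, m, matches):
    # The referred copy of the current window starts at pos - dist; the last
    # window fully inside the pointer starts at end_pos - dist - (m - 1).
    start, end = pos - dist, end_pos - dist - (m - 1)
    for ref_j in pm.find(start, end):
        j = ref_j + dist                          # copied partial match
        pm.mark(j)
        for pat in ptrns.get(utrf[j:j + m], []):  # exact-match verification
            if utrf.startswith(pat, j):
                matches.append((j, pat))
    return end_pos - (m - 1)                      # next scan-window position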
Fig. 3 demonstrates the SPC algorithm, using the same pattern set as Fig. 2. SPC starts with a regular MWM scan. While scanning, SPC locates the m-byte prefix 'rainb' and marks it as a partial match in the PartialMatch bit vector. Note that this m-byte prefix does not result in an exact match with any pattern in the pattern set. The algorithm continues the MWM scan until the 'shine' prefix is found, marked as a partial match, and also exactly matched to a pattern in the set. Note that at this point we are still not within a pointer; rather, we are at the pointer's left boundary. This pointer refers to an area with no partial matches; therefore, SPC scans only the pointer boundaries and skips its internal area. In this example, both boundaries are part of a pattern.
Note that the GZIP algorithm maintains the last 32KB of each session. SPC additionally maintains the PartialMatch bit vector, i.e., one bit per byte, resulting in 4KB, or 36KB altogether. Those 36KB can be stored in a cyclic buffer, thus also reusing the PartialMatch bits whenever we cycle back to the buffer start. Therefore, we cannot rely on the default initialization of those bits and must explicitly set the bits to false.
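One possible layout, sketched under our own assumptions rather than taken from the paper, keeps the window bytes and their PartialMatch flags side by side in a single ring and clears each flag as its slot is reused:

WINDOW = 32 * 1024   # LZ77 history size

class Ring:
    def __init__(self):
        self.head = 0
        self.byte = bytearray(WINDOW)     # decompressed history
        self.pmatch = bytearray(WINDOW)   # flags; a real ring would pack bits

    def push(self, b):
        self.byte[self.head] = b
        self.pmatch[self.head] = 0        # explicit reset: a stale flag from
        self.head = (self.head + 1) % WINDOW  # the previous lap is invalid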
Altogether we keep 36KB of memory per session, which may result in high memory consumption in a multi-session environment. Note that most of the memory requirement is due to GZIP and is mandatory for any pattern matching on compressed traffic. As mentioned in Section II-A, recent work [10] has shown techniques that save over 80% of the required space. Those techniques can be combined with SPC and reduce the space to around 6KB per session.
The correctness of the algorithm is captured by the following theorem.
Theorem 1: SPC detects all patterns in the decompressed traffic utrf.
Sketch of Proof: The proof is by induction on the index of the uncompressed character within the traffic. Assume the algorithm runs correctly up to position pos; namely, it finds all pattern occurrences and correctly marks the PartialMatch vector. We now show that the SPC algorithm: (1) finds whether there is a pattern at position pos; (2) if it shifts to pos + shiftValue, then no pattern starts after pos + 1 and prior to pos + shiftValue; (3) correctly updates the PartialMatch vector.
The correctness relies on the basic MWM property that if a pattern starts at position j, then MWM will set the scan window at position j and the pattern will be located. If the scan window at position pos is not contained in a pointer, then the validity follows directly from the correctness of MWM. Otherwise, we need to prove that SPC finds all partial matches and exact matches correctly. The correctness is derived from the induction hypothesis regarding the validity of the PartialMatch vector up to position pos.
V. EXPERIMENTAL RESULTS
In this section, we analyze SPC and the parameters that influence its performance. In addition, we compare its performance to both the MWM and ACCH algorithms.
All the experiments were executed on an Intel Core i5 750 processor, with 4 cores running at 2.67GHz and 4GB of RAM. Each core has 32KB L1 data and instruction caches and a 256KB dedicated L2 cache. The third-level (L3) cache is larger, at 8MB, and is shared by all four cores.
A. Data Set
The context of this paper is compressed web traffic. Therefore, we collected HTTP pages encoded with GZIP, taken from a list constructed from the Alexa website [2], which maintains web-traffic metrics and top-site lists. The data set contains 6781 files with a total uncompressed size of 335MB (66MB in compressed form). The compression ratio is 19.7%. The ratio of bytes represented by pointers is 92.1%, and the average pointer length is 16.65B.
B. Pattern Set
Our pattern sets were gathered from two different sources: ModSecurity [3], an open-source web application firewall (WAF), and Snort [6], an open-source network intrusion prevention system.
In ModSecurity, we chose the signature group that applies to HTTP responses (since only the response is compressed). Patterns containing regular expressions were normalized into several plain patterns. The total number of ModSecurity patterns is 148.
The Snort pattern set contains 10621 signatures. As opposed to ModSecurity, Snort is not of the web-application domain; therefore, it is less applicable for inspecting threats in incoming HTTP traffic. Nevertheless, since Snort is the prominent reference pattern set in multi-pattern matching papers, we used it to compare the performance of our algorithm to other pattern-matching algorithms. Since HTML pages contain only printable (Base64) characters, there is no need to search for binary patterns, leaving 6837 textual patterns. We also note that, within our data set, Snort patterns have a significantly high match rate because of patterns such as "http", "href", "ref=", etc. Our data set contains 11M matches, which account for 3.24% of the text. ModSecurity has a modest number of 93K matches, which account for 0.026% of the text.
C. SPC Characteristics Analysis
This section explores the various parameters affecting the performance of SPC over compressed HTTP and compares it to MWM running over uncompressed traffic.
Fig. 3. Pointer-scan procedure example. The patterns are as in Fig. 2. The dashed box indicates a partial match and the solid line indicates an exact match. The solid box indicates a pointer and its referred area.
Shift-based pattern matching algorithms, and specifically MWM and SPC, are sensitive to the shortest pattern length, as it defines the maximal shift value for the algorithm and influences false-positive references to the Ptrns table. It also bounds the size of B, resulting in poor average shift values, since most combinations of those B bytes are suffixes of our m-byte pattern prefixes. The Snort pattern set contains many short patterns, specifically 410 distinct patterns of length 3, 539 of length 4, and 381 of length 5. To circumvent this problem, we inspected the containing rules. We can eliminate most of the short patterns by using a longer pattern within the same rule (as in Snort, which marks such a pattern with the fast-pattern flag) or by relying on specific flow parameters (as in [12]). For instance, 74% of the rules that contain these short patterns also contain longer patterns. Eliminating short patterns is effective for patterns shorter than 5; hence we can safely choose m = 5. Still, in order to understand the effects of different m and B, we experimented with values 4 ≤ m ≤ 6.
In order to understand the impact of B and m, we examined the skip ratio, Sr, the percentage of characters the algorithm skips. Sr is a dominant performance factor of both SPC and MWM. Fig. 4 outlines the skip ratio as a function of m and B and compares the performance of SPC to MWM. As described in Section IV, the SPC skip ratio is based on two factors: the MWM shift for scans outside pointers, and the skipping of internal pointer bytes. When m = B, MWM does not skip at all; in that case, the SPC shifts are based solely on internal pointer skipping. For m = B, Sr ranges from 70% to 60% as m increases; i.e., the factor based on internal pointer skips is the dominant one for the given m values.
We note that m = 6 gains the best performance, as it provides the largest maximal shift value (equal to m − B + 1). However, using m = 6 as the shortest pattern length discards too many patterns. We chose m = 5 as a tradeoff between performance and pattern-set coverage. The skip ratio of SPC is much better than that of MWM; on Snort, for some values of m and B, we get more than twice the skip ratio. This property of SPC is a direct result of skipping pointers whose referred bytes were already scanned.
The B parameter determines the text-block size on which the shiftValue is calculated and has two dominant effects on the performance of MWM and SPC: a larger B value decreases the maximal shift, Sm = m − B + 1, which correlates directly with the average shift, but it also increases part of the shift values, as it decreases the percentage of entries that result in a shift value of 0. Overall, the maximal skip ratio for Snort is 82.7% for m = 5 and B = 3, whereas on ModSecurity Sr is 87.5% for m = 5 and B = 2.
D. SPC Run-Time Performance
This section presents the run-time performance of SPC as compared to our improved implementation of MWM (as described in Section III) and to ACCH, the only existing algorithm that handles compressed web traffic.
Note that SPC has a basic overhead of 10% over MWM when running on plain uncompressed traffic. This overhead is attributed to the additional PartialMatch bit vector, which imposes an overhead of 12.5% more memory references. However, since each bit is stored next to its corresponding traffic byte, and due to the skipping operation, the observed overhead is smaller.
The algorithms' performance is measured by their throughput, T = Work/Time (i.e., the size of the entire data set divided by the scan time). The throughput, as shown in Fig. 5, is normalized to that of ACCH (which does not depend on the m and B values). We note that ACCH's throughput is roughly three times better than Aho-Corasick's (AC is omitted from the figure for clarity). ACCH was tuned with the optimal parameters recommended in [1]. The measured throughput of SPC in our experimental environment for Snort is 1.016 Gbit/s for m = 5 and B = 4, and for ModSecurity it is 2.458 Gbit/s for m = 5 and B = 3. Those results were obtained by running 4 threads that perform pattern matching on data loaded into memory in advance. Our implementation uses the C# language and general-purpose software libraries and is not optimized for best throughput; our goal is to compare the different algorithms for a better understanding of SPC's characteristics. Better throughput can be gained by using optimized software libraries or hardware optimized for networking.
As can be seen, for m = 5, when running on Snort, SPC's throughput is better than ACCH's by up to 51.86%, whereas on ModSecurity we get a throughput improvement of 113.24%. When comparing SPC to MWM, the throughput improvement is 73.23% on Snort and 90.04% on ModSecurity. Note that for all m and B values, SPC is faster than MWM. The maximum throughput is achieved for m = 5 when B = 4, while the maximum skip ratio is achieved for B = 3. This is due to the fact that when B is larger we avoid unnecessary false-positive references to the Ptrns data structure. Furthermore, we found that for the Snort pattern set we reach a small value of 0.3 memory references per character.
E. SPC Storage Requirements
We elaborate on the data structures used by SPC and MWM, ShiftTable and Ptrns, as explained in Section III.
Fig. 4. Skipped character ratio (Sr) for (a) Snort and (b) ModSecurity: SPC versus MWM over m ∈ {4, 5, 6} and B = 2, ..., m.
Fig. 5. Throughput normalized to ACCH for (a) Snort and (b) ModSecurity: SPC, MWM and ACCH over m ∈ {4, 5, 6} and B = 2, ..., m.
ShiftTable - a hash table that holds shift values, using the last B bytes of the scan window as the hash key. If the key is not found in the hash, the maximal shift value, Sm = m − B + 1, is returned. Each hash entry contains a pointer to the shift value and the corresponding list of possible B-byte keys. Hence, the storage required for each ShiftTable entry is composed of:
1) Entry pointer - 32 bits per entry
2) Shift value - the maximal value that needs to be stored is the maximal shift value minus 1; hence there are m − B different shift values. To represent each value, we need ⌈log2(m − B)⌉ bits
3) Hash key - 8 × B bits for each B-byte key
The total size of this table is less than 58KB for Snort and less than 1.61KB for ModSecurity.
Ptrns - a hash table of pattern references (indexes), hashed using the m-byte pattern prefixes. Each Ptrns table entry holds a pointer to the list of pattern references with the same m-byte prefix, where each reference is an index into an array that contains the patterns themselves. Hence, for each entry we only need to store:
1) Entry pointer - 32 bits per entry
2) Pattern index - ⌈log2(N)⌉ bits per pattern reference, where N is the number of patterns
3) Hash key - 8 × m bits for each m-byte key
For Snort with m = 5, this data structure requires less than 152KB, whereas for ModSecurity it requires only 4.05KB.
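The per-entry accounting above can be reproduced with a short calculation; the totals also depend on the entry counts, which are not repeated here (Table I reports the resulting sizes):

from math import ceil, log2

def shift_entry_bits(m, B):
    # entry pointer + shift value + B-byte hash key (assumes B < m)
    return 32 + ceil(log2(m - B)) + 8 * B

def ptrns_entry_bits(m, N):
    # entry pointer + pattern index + m-byte hash key
    return 32 + ceil(log2(N)) + 8 * m

print(shift_entry_bits(5, 3))     # 57 bits per ShiftTable entry
print(ptrns_entry_bits(5, 6837))  # 85 bits per Ptrns entry (Snort, N = 6837)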
Note that we use a hash-table implementation in which each hash entry is a list implemented as a fixed-size array. This provides a space-efficient implementation but is expensive in the case of updates to the pattern set. We believe that for most usage scenarios this is the better tradeoff. However, in case the hash table needs to support updates, additional space is needed, as the arrays are replaced by linked lists with pointers; this roughly doubles the memory requirements of the hash tables. Overall, this is still a space-efficient implementation.
Table I summarizes the memory required by each of the listed data structures:
TABLE I
STORAGE REQUIREMENTS (KB)
Pattern set    m   B   ShiftTable   Ptrns    Total Storage
Snort          5   2   14.77        151.73   166.50
Snort          5   3   54.35        151.73   206.09
Snort          5   4   57.82        151.73   209.55
Snort          5   5   36.89        151.73   188.62
ModSecurity    5   2   1.31         4.05     5.36
ModSecurity    5   3   1.61         4.05     5.66
ModSecurity    5   4   1.40         4.05     5.45
ModSecurity    5   5   0.98         4.05     5.03
Our implementation is very space efficient: both the MWM and SPC algorithms require around 1.88 bytes per character for Snort and ModSecurity, as opposed to the 1.4KB of the original MWM algorithm [13]. This small space requirement increases the probability that the entire table resides within the cache, and is thus a key factor in the performance achieved by the algorithm.
VI. RELATED WORK
Compressed pattern matching has received attention in the context of the Lempel-Ziv compression family [14]–[17]. However, LZW/LZ78 are more attractive and simpler for pattern matching than LZ77. Recall that HTTP uses LZ77 compression; hence, all the above works are not applicable to our case. Klein and Shapira [18] suggested a modification to the LZ77 compression algorithm to ease the task of matching in files. However, their suggestion is not implemented in today's web traffic.
Farach et al. [19] is the only paper we are aware of that deals with pattern matching over LZ77. However, the algorithm in that paper is capable of matching only a single pattern and requires two passes over the compressed text (file), which does not comply with the on-the-fly processing requirement of the network domain.
The first attempt to tackle the problem of performing efficient pattern matching on compressed HTTP traffic, i.e., on the LZ77 family and in the context of networking, is presented in [1]. That paper suggests that the pattern matching task can be accelerated using the compression information. In fact, it shows that pattern matching on compressed HTTP traffic, even with the overhead of decompression, is faster than DFA-based pattern matching (such as the Aho-Corasick algorithm [4]). Our paper shows that the same approach can be applied to another important family of pattern matching algorithms, the shift-based techniques, such as Boyer-Moore [11] and the modified Wu-Manber (MWM) [5]. We show that accelerating the MWM algorithm results in a simpler algorithm, with higher throughput and lower storage overhead, than accelerating the Aho-Corasick algorithm. The algorithm can be combined with enhanced MWM-based solutions such as [20]–[25] and can also be implemented in a TCAM environment as in [12].
VII. CONCLUSION AND FUTURE WORK
With the sharp increase in cellular web surfing, HTTP compression has become common in today's web traffic. Yet, due to its performance requirements, most security devices tend to ignore or bypass compressed traffic and thus introduce either a security hole or a potential for a denial-of-service attack. This paper presents SPC, a technique that takes advantage of the information within the compressed traffic to accelerate, rather than slow down, the entire pattern matching process. The algorithm gains a performance boost of over 51% while using half the additional per-connection space compared to the previously known solution, ACCH. The algorithm presented in this paper should encourage vendors to support inspection of such traffic in their security equipment. As future work, we plan to handle regular expression matching over compressed web traffic.
REFERENCES
[1] A. Bremler-Barr and Y. Koral, "Accelerating multi-patterns matching on compressed HTTP," in INFOCOM 2009: 28th IEEE International Conference on Computer Communications, April 2009.
[2] "Top sites," July 2010. http://www.alexa.com/topsites.
[3] "ModSecurity." http://www.modsecurity.org (accessed July 2008).
[4] A. Aho and M. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, pp. 333–340, 1975.
[5] S. Wu and U. Manber, "A fast algorithm for multi-pattern searching," Tech. Rep. TR-94-17, Department of Computer Science, University of Arizona, May 1994.
[6] "Snort." http://www.snort.org (accessed October 2010).
[7] "Hypertext transfer protocol – HTTP/1.1," June 1999. RFC 2616, http://www.ietf.org/rfc/rfc2616.txt.
[8] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, pp. 337–343, May 1977.
[9] D. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, pp. 1098–1101, 1952.
[10] Y. Afek, A. Bremler-Barr, and Y. Koral, "Efficient processing of multi-connection compressed web traffic," in IFIP NETWORKING, 2011.
[11] R. Boyer and J. Moore, "A fast string searching algorithm," Communications of the ACM, pp. 762–772, October 1977.
[12] Y. Weinsberg, S. Tzur-David, D. Dolev, and T. Anker, "High performance string matching algorithm for a network intrusion prevention system (NIPS)," in HPSR, 2006.
[13] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, "Deterministic memory-efficient string matching algorithms for intrusion detection," in INFOCOM 2004, 2004.
[14] A. Amir, G. Benson, and M. Farach, "Let sleeping files lie: Pattern matching in Z-compressed files," Journal of Computer and System Sciences, pp. 299–307, 1996.
[15] T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, "Shift-And approach to pattern matching in LZW compressed text," in 10th Annual Symposium on Combinatorial Pattern Matching (CPM 99), 1999.
[16] G. Navarro and M. Raffinot, "A general practical approach to pattern matching over Ziv-Lempel compressed text," in 10th Annual Symposium on Combinatorial Pattern Matching (CPM 99), 1999.
[17] G. Navarro and J. Tarhio, "Boyer-Moore string matching over Ziv-Lempel compressed text," in Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pp. 166–180, 2000.
[18] S. Klein and D. Shapira, "A new compression method for compressed matching," in Proceedings of the Data Compression Conference (DCC 2000), Snowbird, Utah, pp. 400–409, 2000.
[19] M. Farach and M. Thorup, "String matching in Lempel-Ziv compressed strings," in 27th Annual ACM Symposium on the Theory of Computing, pp. 703–712, 1995.
[20] R. Liu, N. Huang, C. Kao, C. Chen, and C. Chou, "A fast pattern-match engine for network processor-based network intrusion detection system," in ITCC, pp. 97–101, 2004.
[21] S. Antonatos, M. Polychronakis, P. Akritidis, K. G. Anagnostakis, and E. P. Markatos, "Piranha: Fast and memory-efficient pattern matching for intrusion detection," in IFIP Advances in Information and Communication Technology, pp. 393–408, 2005.
[22] B. Zhang, X. Chen, L. Ping, and Z. Wu, "Address filtering based Wu-Manber multiple patterns matching algorithm," in WCSE, 2009.
[23] Z. Qiang, "An improved multiple patterns matching algorithm for intrusion detection," in ICIS, 2010.
[24] Y. Choi, M. Jung, and S. Seo, "L+1-MWM: A fast pattern matching algorithm for high-speed packet filtering," in INFOCOM, 2008.
[25] K. G. Anagnostakis, E. P. Markatos, S. Antonatos, and M. Polychronakis, "E²xB: A domain-specific string matching algorithm for intrusion detection," in SEC 2003, 2003.