Shift-based Pattern Matching for Compressed Web
Traffic
Anat Bremler-Barr
Computer Science Dept.
Interdisciplinary Center, Herzliya, Israel
Email: bremler@idc.ac.il
Yaron Koral
Blavatnik School of Computer Sciences
Tel-Aviv University, Israel
Email: yaronkor@post.tau.ac.il
Victor Zigdon
Computer Science Dept.
Interdisciplinary Center, Herzliya, Israel
Email: victor.zigdon@gmail.com
Abstract—Compressing web traffic using standard GZIP is becoming both popular and challenging due to the huge increase in wireless web devices, where bandwidth is limited. Security and other content-based networking devices are required to decompress the traffic of tens of thousands of concurrent connections in order to inspect the content for different signatures. The overhead imposed by the decompression inhibits most devices from handling compressed traffic, which in turn either limits traffic compression or introduces security holes and other dysfunctionalities.
The ACCH algorithm [1] was the first to present a unified approach to pattern matching and decompression, taking advantage of information gathered in the decompression phase to accelerate the pattern matching. ACCH accelerated the DFA-based Aho-Corasick multi-pattern matching algorithm. In this paper, we present a novel algorithm, SPC (Shift-based Pattern matching for Compressed traffic), that accelerates the commonly used Wu-Manber pattern matching algorithm. SPC is simpler and has higher throughput and lower storage overhead than ACCH. Analysis of real web traffic and real security-device signatures shows that we can skip scanning up to 87.5% of the data and gain a performance boost of more than 51% as compared to ACCH. Moreover, the technique requires only 4KB of additional information per connection, as compared to ACCH's 8KB.
I. INTRODUCTION
Compressing HTTP text when transferring pages over the web is sharply increasing, motivated mostly by the expansion of web surfing on mobile cellular devices such as smartphones. Sites like Yahoo!, Google, MSN, YouTube, Facebook and others use HTTP compression to enhance the speed of their content downloads. Moreover, the iPhone API for app development added support for web traffic compression. Fig. 1 shows statistics of top sites using HTTP compression: two-thirds of the 1000 most popular sites use HTTP compression. The standard compression method used by HTTP 1.1 is GZIP.
This sharp increase in HTTP compression presents new challenges to networking devices that inspect the traffic contents for security hazards and load-balancing decisions. Those devices reside between the server and the client and perform Deep Packet Inspection (DPI). When receiving compressed traffic, the networking device first needs to decompress the message in order to inspect its payload. This process suffers from performance penalties in both time and space.
Those penalties lead most security-tool vendors to either ignore compressed traffic, which may lead to missed detection
Fig. 1. HTTP compression usage among the Alexa [2] top-site lists.
of malicious activity, or to ensure that no compression takes place by rewriting the client-to-server HTTP header to indicate that compression is not supported by the client's browser, thus decreasing the overall performance and bandwidth. A few security tools [3] handle compressed HTTP traffic by decompressing the entire page at the proxy and performing a signature scan on it before forwarding it to the client. The latter option is not applicable for security tools that operate at high speed or when introducing additional delay is not an option.
Recent work [1] presents a technique for pattern matching on compressed traffic that decompresses the traffic and then uses data from the decompression phase to accelerate the process. Specifically, the GZIP compression algorithm eliminates repetitions of strings using back-references (pointers). The key insight is to store information produced by the pattern matching algorithm for the already-scanned decompressed traffic and, in the case of pointers, to use this data to either find a match or to skip scanning that area. That work analyzed the case of using the well-known Aho-Corasick (AC) [4] algorithm as the multi-pattern matching technique. AC has good worst-case performance, since every character requires traversing exactly one Deterministic Finite Automaton (DFA) edge. However, the adaptation for compressed traffic, where some characters represented by pointers can be skipped, is complicated, since AC requires inspection of every byte.
Inspired by the insights of that work, we investigate the case of performing DPI over compressed web traffic using the shift-based multi-pattern matching technique of the modified Wu-Manber (MWM) algorithm [5]. MWM inherently does not scan every position within the traffic; in fact, it shifts over (skips) areas in which the algorithm concludes that no pattern starts.
As a preliminary step, we present an improved version of the MWM algorithm (see Section III). The modification improves both time and space aspects to fit the large number of patterns within current pattern sets such as the Snort database [6]. We then present the Shift-based Pattern matching for Compressed traffic algorithm, SPC, which accelerates MWM on compressed traffic. SPC results in a simpler algorithm, with higher throughput and lower storage overhead than the accelerated AC, since MWM's basic operation already involves shifting over (skipping) some of the traffic. Thus, it is natural to combine MWM with the idea of skipping parts of pointers.
We show in Section V that we can skip scanning up to 87.5% of the data and gain a performance boost of more than 73% as compared to the MWM algorithm on real web traffic and security-tool signatures. Furthermore, we show that the suggested algorithm also gains a normalized throughput improvement of 51% as compared to the best prior art [1]. The SPC algorithm also halves the additional space required for previous scan results, storing only 4KB per connection as compared to the 8KB of [1].
II. BACKGROUND
Compressed HTTP: HTTP 1.1 [7] supports the usage of
content-codings to allow a document to be compressed. The
RFC suggests three content-codings: GZIP, COMPRESS and
DEFLATE. In fact, GZIP uses DEFLATE with an additional
thin shell of meta-data. For the purpose of this paper, both
algorithms are considered the same. These are the common
codings supported by browsers and web servers.1
The GZIP algorithm uses a combination of the following compression techniques: first the text is compressed with the LZ77 algorithm, and then the output is compressed with Huffman coding. Let us elaborate on the two algorithms:
(1) LZ77 compression [8] reduces the string representation size by spotting repeated strings within the last 32KB of the uncompressed data. The algorithm replaces each repeated string with a (distance, length) pair, where distance is a number in [1, 32768] (32K) indicating the distance in bytes of the repeated string from the current location, and length is a number in [3, 258] indicating its length. For example, the text 'abcdefgabcde' can be compressed to 'abcdefg(7,5)'; namely, "go back 7 bytes and copy 5 bytes from that point". LZ77 refers to the above pair as a "pointer" and to uncompressed bytes as "literals".
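For intuition, the following minimal sketch decodes a list of LZ77 symbols in the (distance, length) representation described above. The function name and the list-of-symbols input format are illustrative conveniences, not GZIP's actual wire format.

def lz77_decode(symbols):
    out = bytearray()
    for sym in symbols:
        if isinstance(sym, tuple):           # pointer: (distance, length)
            distance, length = sym
            for _ in range(length):          # byte-by-byte copy also handles
                out.append(out[-distance])   # overlapping references
        else:                                # literal character
            out.append(ord(sym))
    return out.decode()

# 'abcdefg(7,5)' decodes back to 'abcdefgabcde'
assert lz77_decode(list("abcdefg") + [(7, 5)]) == "abcdefgabcde"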
(2) Huffman coding [9] - Recall that the second stage of GZIP is Huffman coding, which receives the LZ77 symbols as input. The purpose of Huffman coding is to reduce the symbol coding size by encoding frequent symbols with fewer bits. Huffman coding assigns a variable-size codeword to each symbol. Dictionaries are provided to facilitate the translation of binary codewords to bytes.
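The decoding side can be illustrated with a toy prefix code; the codebook below is hypothetical and far simpler than DEFLATE's canonical Huffman dictionaries.

code = {'0': 'e', '10': 't', '110': 'a', '111': 'x'}  # hypothetical codebook

def huffman_decode(bits, code):
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in code:            # prefix property: no codeword is a prefix of
            out.append(code[cur])  # another, so the first hit is unambiguous
            cur = ''
    return ''.join(out)

assert huffman_decode('0101100', code) == 'etae'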
¹Analyzing packets from the Internet Explorer, Firefox and Chrome browsers shows that they accept only the GZIP and DEFLATE codings.
Deep packet inspection (DPI): Essential to almost every intrusion detection system is the ability to search through packets and identify content that matches known attacks. Space- and time-efficient string matching algorithms are therefore important for inspection at line rate. The two fundamental paradigms of string matching derive from deterministic finite automaton (DFA) based algorithms and shift-based algorithms. The fundamental algorithm of the first paradigm is Aho-Corasick (AC) [4], which provides deterministic linear time in the size of the input. The most popular algorithm of the second paradigm is the modified Wu-Manber (MWM) [5]. The algorithm does not have deterministic performance and hence may be exposed to algorithmic attacks. Still, such attacks can be easily identified, and the system can switch to another engine with deterministic characteristics. Overall, the average-case performance of MWM is among the best of all multi-pattern string matching algorithms.
A. The Challenges of Performing DPI over Compressed Traffic
Recall that in the LZ77 compression algorithm each symbol is determined dynamically by the data. For instance, a string's representation depends on whether it is part of a repeated section and on the distance from that occurrence, which in turn implies that LZ77 (and hence GZIP) is an adaptive compression. Thus, encoding the pattern itself is futile for DPI, since it will not appear in the text in any specific form, implying that there is no "easy" way to do DPI without decompression. Still, decompression is a considerably light process that imposes only a slight performance penalty on the entire process: LZ77 decompression requires copying consecutive sequences of bytes and therefore benefits from the cache advantages gained by spatial and temporal locality. Huffman decoding is also a light task that requires, most of the time, a single memory access per symbol to a 200B dictionary.
The space required for managing a multiple-connection environment is also an important issue to tackle. In such an environment, the LZ77 32KB window imposes a significant space penalty, since in order to perform deep packet inspection one needs to maintain the 32KB window of every active compressed session. When dealing with thousands of concurrent sessions, this overhead becomes significant. Recent work [10] has shown techniques that circumvent this problem and drastically reduce the space requirement by over 80%, with only a slight increase in time.
III. THE MODIFIED WU-MANBER ALGORITHM
In this section, as a preliminary step to SPC, we present the basic MWM algorithm and an improved version of it. The modifications improve both time and space aspects to fit the large number of patterns within current pattern sets.
MWM can be thought of as an extension of the Boyer-Moore (BM) [11] single-pattern matching algorithm. In that algorithm, given a single pattern of length n to match, one can look ahead in the input string by n characters. If the character at this position does not occur in our pattern, we can immediately move the search pointer ahead by n characters without examining the characters in between. If the character appears in the pattern, but is not its last character, we can skip ahead by the largest number of bytes that ensures we have not missed an instance of our pattern. This technique is adapted in a straightforward manner to most implementations of shift-based multi-pattern string matching algorithms, including MWM. The technique fits fixed-length patterns, hence MWM trims all patterns to their m-byte prefix, where m is the size of the shortest pattern. In addition, determining the shift value based on a single character does not fit a multi-pattern environment, since almost every character would appear as the last byte of some pattern. Instead, MWM uses a predefined group of B bytes to determine the shift value.
Algorithm 1 outlines the main MWM scan loop and the exact pattern-match process. MWM starts by precomputing two tables: a skip shift table called ShiftTable (a.k.a. SHIFT in MWM) and a pattern hash table called Ptrns (a.k.a. PREFIX and HASH in MWM). The ShiftTable determines the shift value after each text scan. On average, MWM performs shifts larger than one, hence it skips bytes. The scan is performed using a virtual scan window of size m. The shift value is determined by indexing the ShiftTable with the B-byte suffix of the scan window (Line 3). As opposed to MWM, which implemented ShiftTable as a complete array with all possible keys (i.e., |Σ|^B entries, where |Σ| is the alphabet size), we implement ShiftTable as a hash table and store only keys with a shift value smaller than the maximal one.
Algorithm 1 The MWM Algorithm
trf_1...trf_n - the input traffic
pos - the position of the next m-byte scan window
ShiftTable - table that holds the shift value for each last-B-bytes value of the window
Ptrns - the pattern set hashed by the first m bytes of the patterns
1:  procedure ScanText(trf_1...trf_n)
2:    pos ← 1
3:    while pos + m ≤ n do   ▷ get shiftValue using last B bytes
4:      shiftValue ← ShiftTable[trf_pos+m−B...trf_pos+m]
5:      if shiftValue = 0 then   ▷ check for exact matches
6:        for all pat in Ptrns[trf_pos...trf_pos+m] do
7:          if pat = trf_pos...trf_pos+pat.len then
8:            Handle Match Found
9:          end if
10:       end for
11:       pos ← pos + 1
12:     else
13:       pos ← pos + shiftValue   ▷ shiftValue > 0
14:     end if
15:   end while
ShiftTable values determine how far we can shift the text scan forward. Let X = X_1...X_B be the B-byte suffix of the scan window. If X does not appear as a substring in any pattern, we can make the maximal shift, m − B + 1 bytes. Otherwise, we find the rightmost occurrence of X in any of the patterns: assume that [q − B + 1, q] is the rightmost occurrence of X in any of the patterns. In such a case, we skip m − q bytes. Generally, the values in the shift table are the largest possible safe skip values.
When the shift table returns a 0 value (no shift), a possible match is found. In this case, all m bytes of the scan window are used to index the Ptrns hash table to find a list of possibly matching patterns. These patterns are compared to the text to find any matches (Lines 6–11). Then the input is shifted ahead by one byte and the scan process continues.
The Ptrns hash table has a major effect on the performance of MWM. In the original MWM implementation, the pattern set is hashed with only the B-byte prefix of the patterns, resulting in an unbalanced hash with long chains of patterns that share the same hash key. For example, when B = 2, the average chain length is 4.2 for the Snort DB, slowing down the exact-matching process, in which one iterates over the list of possibly matching patterns and compares each of them to the traffic text. Since the number of patterns has grown tremendously in recent years, a longer hash key should be used; thus we take the entire scan window as the hash key. That reduces the average hash load to 1.44 for the Snort DB.
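The following runnable sketch illustrates the improved variant described above: ShiftTable is a sparse map holding only non-maximal shift values, and Ptrns is keyed by the full m-byte window. It is an illustrative Python reimplementation of Algorithm 1, not the authors' code, and it assumes all patterns are at least m bytes long.

def build_tables(patterns, m, B):
    shift_table, ptrns = {}, {}
    for pat in patterns:
        prefix = pat[:m]                       # trim to the shortest length m
        ptrns.setdefault(prefix, []).append(pat)
        for q in range(B, m + 1):              # rightmost occurrence wins,
            block = prefix[q - B:q]            # i.e. keep the smallest shift
            shift_table[block] = min(shift_table.get(block, m - B + 1), m - q)
    return shift_table, ptrns

def mwm_scan(text, patterns, m, B):
    shift_table, ptrns = build_tables(patterns, m, B)
    matches, pos = [], 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        shift = shift_table.get(window[m - B:], m - B + 1)
        if shift == 0:                          # possible match: verify against
            for pat in ptrns.get(window, []):   # the full patterns
                if text.startswith(pat, pos):
                    matches.append((pos, pat))
            pos += 1
        else:
            pos += shift
    return matches

# with m = 5, B = 2 this reports matches at positions 4 and 10
print(mwm_scan("the river shines", ["river", "shine"], 5, 2))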
Fig. 2 shows an excerpt of the MWM data structure for B = 2. All patterns are trimmed to m = 5 bytes (Fig. 2(a)). Fig. 2(b) presents the shift-table entries with shift values smaller than the maximal shift. The rest of the byte pairs, not shown in the example, are those which gain the highest shift value of m − B + 1 = 4. Byte pairs in the middle of the strings have reduced shift values, and those at the end of the strings, such as 'nb' or 'er' with shift value 0, must be checked for an exact match. Fig. 2(c) shows an MWM scan example. The scan window of length 5 starts at the beginning of the text and advances by skipping segments of the text. Note that most of the time the scan window gains a shift value larger than 1. There are two cases where the shift value is 0 and the Ptrns hash table is queried. The first case returns a match of the string 'river', while the second does not locate any matched pattern.
Fig. 2. (a) Pattern set and the m-byte prefixes. (b) Shift table of the corresponding pattern set. (c) MWM scan example. The arrows indicate shifts of the scan window larger than 1. The row below the text shows the shift value for each m-byte scan-window step. The bottom line contains the B-byte value used for the shift calculation after each step.
IV. SHIFT-BASED PATTERN MATCHING FOR COMPRESSED TRAFFIC (SPC)
In this section, we present our Shift-based Pattern matching algorithm for Compressed HTTP traffic (SPC). Recall that HTTP uses GZIP which, in turn, uses LZ77, which compresses data with pointers to past occurrences of strings. Thus, the bytes referred to by the pointers (called referred bytes or the referred area) were already scanned; hence, if we have prior knowledge that an area does not contain patterns, we can skip scanning most of it.
Observe that even if no patterns were found when the referred area was scanned, patterns may occur at the boundaries of the pointer: a prefix of the referred bytes may be a suffix of a pattern that started prior to the pointer; or a suffix of the referred bytes may be a prefix of a pattern that continues after the pointer (as shown in Fig. 3). Therefore, special care needs to be taken to handle pointer boundaries correctly and to maintain MWM characteristics while skipping data represented by LZ77 pointers.
The general method of the algorithm is to use a combined technique that scans uncompressed portions of the data using MWM and skips scanning most of the data represented by the LZ77 pointers. Note that scanning is performed on decompressed data, such that both the decompression and scanning tasks are performed on-the-fly, while using the pointer information to accelerate scanning. For simplicity and clarity of the algorithm description, the pseudocode is written as if all the uncompressed text and previous scan information were kept in memory. However, in a real-life implementation it is enough to store only the last 32KB of the uncompressed traffic.
The SPC pseudocode is given in Algorithm 2. The key idea is that we store additional information about partial matches found within previously scanned text, in a bit vector called PartialMatch. The j-th bit is set to true if at position j the m bytes of the scan window match an m-byte prefix of some pattern. Note that we store partial-match rather than exact-match information. Hence, if the body of the referred area contains no partial matches, we can skip checking the pointer's interior. However, if the referred area contains a partial match, we still need to perform an exact match. Maintaining partial-match rather than exact-match information is significantly less complicated, especially over skipped characters, due to the fact that pointers can copy parts of patterns.
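A minimal sketch of this bookkeeping is shown below, with one flag per uncompressed byte; a real implementation would pack the flags into a 4KB bit vector over a 32KB cyclic window, as discussed later. The class and method names are ours.

class PartialMatch:
    def __init__(self, n):
        self.bits = bytearray(n)   # per byte of utrf; 1 = m-byte prefix match

    def mark(self, pos):
        self.bits[pos] = 1

    def find(self, start, end):    # findPartialMatches(start...end), inclusive
        return [j for j in range(start, end + 1) if self.bits[j]]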
The algorithm integrates smoothly with MWM through the PartialMatch bit vector. In fact, as long as the scan window is not fully contained within a pointer's boundaries, a regular MWM scan is performed (Lines 22–34). The only change is that we update the PartialMatch data structure (Line 25). Note that shiftValue = 0 only implies that the B-byte suffix of the scan window matched an entry within ShiftTable. It does not imply a partial match, which is explicitly checked by querying the Ptrns table.
In the second case, where the m-byte scan window shifts into a position such that it is fully contained within pointer boundaries, SPC checks which areas of the pointer can be
Algorithm 2 The SPC Algorithm
Definitions are as in Algorithm 1, with the following additions:
utrf - the uncompressed HTTP traffic
pointer - describes the pointer parameters: dist, len and endPos (the position of the pointer's rightmost byte in the uncompressed data); data received from the decompression phase
PartialMatch - bit vector that indicates whether there is a partial match in the already-scanned traffic; each bit in PartialMatch has a corresponding byte in utrf
findPartialMatches(start...end) - returns a list of partial-match positions in the range start...end
1:  procedure ScanText(utrf_1...n)
2:    pos ← 1
3:    set PartialMatch_1...n bits to false
4:    while pos + m ≤ n do
5:      if utrf_pos...utrf_pos+m window internal to pointer then   ▷ check valid areas for skipping
6:        start ← pos − pointer.dist
7:        end ← pointer.endPos − pointer.dist − (m − 1)
8:        pMatchList ← findPartialMatches(start...end)
9:        if pMatchList is not empty then
10:         for all pm in pMatchList do
11:           pos ← pm.pos + pointer.dist
12:           PartialMatch[pos] ← true
13:           for all pat in Ptrns[utrf_pos...utrf_pos+m] do
14:             if pat = utrf_pos...utrf_pos+pat.len then
15:               Handle Match Found
16:             end if
17:           end for
18:         end for
19:       end if
20:       pos ← pointer.endPos − (m − 1)
21:     else   ▷ MWM scan with PartialMatch updating
22:       shiftValue ← ShiftTable[utrf_pos+m−B...utrf_pos+m]
23:       if shiftValue = 0 then
24:         if Ptrns[utrf_pos...utrf_pos+m] is not empty then
25:           PartialMatch[pos] ← true
26:           for all pat in Ptrns[utrf_pos...utrf_pos+m] do
27:             if pat = utrf_pos...utrf_pos+pat.len then
28:               Handle Match Found
29:             end if
30:           end for
31:         end if
32:         pos ← pos + 1
33:       else pos ← pos + shiftValue   ▷ shiftValue > 0
34:       end if
35:     end if
36:   end while
skipped (Lines 6–20). We start by checking whether any partial match occurred within the referred bytes, by calling findPartialMatches(start...end) (Line 8). In the simple case where no partial matches were found, we can safely shift the scan window to m − 1 bytes before the pointer's end (Line 20). In effect, we skip the entire pointer body, set the end of the scan window one byte past the pointer, and continue with the regular MWM scan. The correctness is due to the fact that any point prior to that position is guaranteed to be free of partial matches (otherwise there would also have been a match within the referred bytes). The SPC algorithm gains the most from shifting over the pointer body without the extra overhead of checking ShiftTable and Ptrns in the cases where there are no actual partial matches.
If findPartialMatches(start...end) returns partial matches, we are certain that those were copied entirely from the referred bytes; therefore, we start by setting the corresponding positions within the PartialMatch bit vector to true (Line 12). For each partial match, we then query the Ptrns hash table to check whether an exact match occurs, in the same way as in MWM (Lines 13–17).
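A sketch of this pointer-handling step (Lines 6–20 of Algorithm 2) is shown below, reusing the PartialMatch class and the Ptrns dictionary from the earlier sketches. Positions here are 0-based, unlike the 1-based pseudocode, and the function signature is our own simplification.

def handle_pointer(utrf, pm, ptrns, pos, dist, end_pos, m, matches):
    # The referred copy of the current window starts at pos - dist; the last
    # window fully inside the pointer starts at end_pos - dist - (m - 1).
    start, end = pos - dist, end_pos - dist - (m - 1)
    for ref_j in pm.find(start, end):
        j = ref_j + dist                          # copied partial match
        pm.mark(j)
        for pat in ptrns.get(utrf[j:j + m], []):  # exact-match verification
            if utrf.startswith(pat, j):
                matches.append((j, pat))
    return end_pos - (m - 1)                      # next scan-window position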
Fig. 3 demonstrates the SPC algorithm, using the same pattern set as Fig. 2. SPC starts with a regular MWM scan. While scanning, SPC locates the m-byte prefix 'rainb' and marks it as a partial match in the PartialMatch bit vector. Note that this m-byte prefix does not result in an exact match with any pattern in the pattern set. The algorithm continues the MWM scan until the 'shine' prefix is found, marked as a partial match, and also exactly matched to a pattern in the set. Note that at this point we are still not within a pointer; rather, we are at the pointer's left boundary. This pointer refers to an area with no partial matches; therefore, SPC scans only the pointer boundaries and skips its internal area. In this example, both boundaries are part of a pattern.
Note that the GZIP algorithm maintains the last 32KB of each session. SPC additionally maintains the PartialMatch bit vector, i.e., one bit per byte, resulting in 4KB, or 36KB altogether. Those 36KB can be stored in a cyclic buffer, thus also reusing the PartialMatch bits whenever we cycle back to the buffer start. Therefore, we cannot rely on the default initialization of those bits and must explicitly set the bits to false.
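One possible layout, sketched under our own assumptions rather than taken from the paper, keeps the window bytes and their PartialMatch flags side by side in a single ring and clears each flag as its slot is reused:

WINDOW = 32 * 1024   # LZ77 history size

class Ring:
    def __init__(self):
        self.head = 0
        self.byte = bytearray(WINDOW)     # decompressed history
        self.pmatch = bytearray(WINDOW)   # flags; a real ring would pack bits

    def push(self, b):
        self.byte[self.head] = b
        self.pmatch[self.head] = 0        # explicit reset: a stale flag from
        self.head = (self.head + 1) % WINDOW  # the previous lap is invalid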
Altogether we keep 36KB of memory per session, which may result in high memory consumption in a multi-session environment. Note that most of the memory requirement is due to GZIP and is mandatory for any pattern matching on compressed traffic. As mentioned in Section II-A, recent work [10] has shown techniques that save over 80% of the required space. Those techniques can be combined with SPC and reduce the space to around 6KB per session.
The correctness of the algorithm is captured by the following theorem.
Theorem 1: SPC detects all patterns in the decompressed traffic utrf.
Sketch of Proof: The proof is by induction on the index of the uncompressed character within the traffic. Assume the algorithm runs correctly up to position pos; namely, it finds all pattern occurrences and correctly marks the PartialMatch vector. We now show that the SPC algorithm: (1) finds whether there is a pattern at position pos; (2) if it shifts to pos + shiftValue, then no pattern starts after pos + 1 and prior to pos + shiftValue; (3) correctly updates the PartialMatch vector.
The correctness relies on the basic MWM property that if a pattern starts at position j, then MWM will set the scan window at position j and the pattern will be located. If the scan window at position pos is not contained in a pointer, then the validity follows directly from the correctness of MWM. Otherwise, we need to prove that SPC finds all partial matches and exact matches correctly. The correctness is derived from the induction hypothesis regarding the validity of the PartialMatch vector up to position pos.
V. EXPERIMENTAL RESULTS
In this section, we analyze SPC and the parameters that influence its performance. In addition, we compare its performance to both the MWM and ACCH algorithms.
All the experiments were executed on an Intel Core i5 750 processor, with 4 cores running at 2.67GHz and 4GB of RAM. Each core has 32KB L1 data and instruction caches and a 256KB dedicated L2 cache. The third-level (L3) cache is larger, at 8MB, and is shared by all four cores.
A. Data Set
The context of this paper is compressed web traffic. Therefore, we collected HTTP pages encoded with GZIP, taken from a list constructed from the Alexa website [2], which maintains web-traffic metrics and top-site lists. The data set contains 6781 files with a total uncompressed size of 335MB (66MB in compressed form). The compression ratio is 19.7%. The ratio of bytes represented by pointers is 92.1%, and the average pointer length is 16.65B.
B. Pattern Set
Our pattern sets were gathered from two different sources: ModSecurity [3], an open-source web application firewall (WAF), and Snort [6], an open-source network intrusion prevention system.
In ModSecurity, we chose the signature group that applies to HTTP responses (since only the response is compressed). Patterns containing regular expressions were normalized into several plain patterns. The total number of ModSecurity patterns is 148.
The Snort pattern set contains 10621 signatures. As opposed to ModSecurity, Snort is not of the web-application domain; therefore, it is less applicable for inspecting threats in incoming HTTP traffic. Nevertheless, since Snort is the prominent reference pattern set in multi-pattern matching papers, we used it to compare the performance of our algorithm to other pattern-matching algorithms. Since HTML pages contain only printable (Base64) characters, there is no need to search for binary patterns, leaving 6837 textual patterns. We also note that, within our data set, Snort patterns have a significantly high match rate because of patterns such as "http", "href", "ref=", etc. Our data set contains 11M matches, which account for 3.24% of the text. ModSecurity has a modest number of 93K matches, which account for 0.026% of the text.
C. SPC Characteristics Analysis
This section explores the various parameters affecting the performance of SPC over compressed HTTP and compares it to MWM running over uncompressed traffic.
Fig. 3. Pointer-scan procedure example. The patterns are as in Fig. 2. The dashed box indicates a partial match and the solid line indicates an exact match. The solid box indicates a pointer and its referred area.
Shift-based pattern matching algorithms, and specifically MWM and SPC, are sensitive to the shortest pattern length, as it defines the maximal shift value for the algorithm and influences false-positive references to the Ptrns table. It also bounds the size of B, resulting in poor average shift values, since most combinations of those B bytes are suffixes of our m-byte pattern prefixes. The Snort pattern set contains many short patterns, specifically 410 distinct patterns of length 3, 539 of length 4, and 381 of length 5. To circumvent this problem, we inspected the containing rules. We can eliminate most of the short patterns by using a longer pattern within the same rule (as in Snort, which marks such a pattern with the fast-pattern flag) or by relying on specific flow parameters (as in [12]). For instance, 74% of the rules that contain these short patterns also contain longer patterns. Eliminating short patterns is effective for patterns shorter than 5; hence we can safely choose m = 5. Still, in order to understand the effects of different m and B, we experimented with values 4 ≤ m ≤ 6.
In order to understand the impact of B and m, we examined the skip ratio, Sr, the percentage of characters the algorithm skips. Sr is a dominant performance factor of both SPC and MWM. Fig. 4 outlines the skip ratio as a function of m and B and compares the performance of SPC to MWM. As described in Section IV, the SPC skip ratio is based on two factors: the MWM shift for scans outside pointers, and the skipping of internal pointer bytes. When m = B, MWM does not skip at all; in that case, the SPC shifts are based solely on internal pointer skipping. For m = B, Sr ranges from 70% to 60% as m increases; i.e., the factor based on internal pointer skips is the dominant one for the given m values.
We note that m = 6 gains the best performance, as it provides the largest maximal shift value (equal to m − B + 1). However, using m = 6 as the shortest pattern length discards too many patterns. We chose m = 5 as a tradeoff between performance and pattern-set coverage. The skip ratio of SPC is much better than that of MWM; on Snort, for some values of m and B, we get more than twice the skip ratio. This property of SPC is a direct result of skipping pointers whose referred bytes were already scanned.
The B parameter determines the text-block size on which the shiftValue is calculated and has two dominant effects on the performance of MWM and SPC: a larger B value decreases the maximal shift, Sm = m − B + 1, which correlates directly with the average shift, but it also increases part of the shift values, as it decreases the percentage of entries that result in a shift value of 0. Overall, the maximal skip ratio for Snort is 82.7% for m = 5 and B = 3, whereas on ModSecurity Sr is 87.5% for m = 5 and B = 2.
D. SPC Run-Time Performance
This section presents the run-time performance of SPC as compared to our improved implementation of MWM (as described in Section III) and to ACCH, the only existing algorithm that handles compressed web traffic.
Note that SPC has a basic overhead of 10% over MWM when running on plain uncompressed traffic. This overhead is attributed to the additional PartialMatch bit vector, which imposes an overhead of 12.5% more memory references. However, since each bit is stored next to its corresponding traffic byte, and due to the skipping operation, the observed overhead is smaller.
The algorithms' performance is measured by their throughput, T = Work/Time (i.e., the size of the entire data set divided by the scan time). The throughput, as shown in Fig. 5, is normalized to that of ACCH (which does not depend on the m and B values). We note that ACCH's throughput is roughly three times better than Aho-Corasick's (AC is omitted from the figure for clarity). ACCH was tuned with the optimal parameters recommended in [1]. The measured throughput of SPC in our experimental environment for Snort is 1.016 Gbit/s for m = 5 and B = 4, and for ModSecurity it is 2.458 Gbit/s for m = 5 and B = 3. Those results were obtained by running 4 threads that perform pattern matching on data loaded into memory in advance. Our implementation uses the C# language and general-purpose software libraries and is not optimized for best throughput; our goal is to compare the different algorithms for a better understanding of SPC's characteristics. Better throughput can be gained by using optimized software libraries or hardware optimized for networking.
As can be seen, for m = 5, when running on Snort, SPC's throughput is better than ACCH's by up to 51.86%, whereas on ModSecurity we get a throughput improvement of 113.24%. When comparing SPC to MWM, the throughput improvement is 73.23% on Snort and 90.04% on ModSecurity. Note that for all m and B values, SPC is faster than MWM. The maximum throughput is achieved for m = 5 when B = 4, while the maximum skip ratio is achieved for B = 3. This is due to the fact that when B is larger we avoid unnecessary false-positive references to the Ptrns data structure. Furthermore, we found that for the Snort pattern set we reach a small value of 0.3 memory references per character.
E. SPC Storage Requirements
We elaborate on the data structures used by SPC and MWM, ShiftTable and Ptrns, as explained in Section III.
Fig. 4. Skipped character ratio (Sr) for (a) Snort and (b) ModSecurity: SPC versus MWM over m ∈ {4, 5, 6} and B = 2, ..., m.
Fig. 5. Throughput normalized to ACCH for (a) Snort and (b) ModSecurity: SPC, MWM and ACCH over m ∈ {4, 5, 6} and B = 2, ..., m.
ShiftTable - a hash table that holds shift values, using the last B bytes of the scan window as the hash key. If the key is not found in the hash, the maximal shift value, Sm = m − B + 1, is returned. Each hash entry contains a pointer to the shift value and the corresponding list of possible B-byte keys. Hence, the storage required for each ShiftTable entry is composed of:
1) Entry pointer - 32 bits per entry
2) Shift value - the maximal value that needs to be stored is the maximal shift value minus 1; hence there are m − B different shift values. To represent each value, we need ⌈log2(m − B)⌉ bits
3) Hash key - 8 × B bits for each B-byte key
The total size of this table is less than 58KB for Snort and less than 1.61KB for ModSecurity.
Ptrns - a hash table of pattern references (indexes), hashed using the m-byte pattern prefixes. Each Ptrns table entry holds a pointer to the list of pattern references with the same m-byte prefix, where each reference is an index into an array that contains the patterns themselves. Hence, for each entry we only need to store:
1) Entry pointer - 32 bits per entry
2) Pattern index - ⌈log2(N)⌉ bits per pattern reference, where N is the number of patterns
3) Hash key - 8 × m bits for each m-byte key
For Snort with m = 5, this data structure requires less than 152KB, whereas for ModSecurity it requires only 4.05KB.
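The per-entry accounting above can be reproduced with a short calculation; the totals also depend on the entry counts, which are not repeated here (Table I reports the resulting sizes):

from math import ceil, log2

def shift_entry_bits(m, B):
    # entry pointer + shift value + B-byte hash key (assumes B < m)
    return 32 + ceil(log2(m - B)) + 8 * B

def ptrns_entry_bits(m, N):
    # entry pointer + pattern index + m-byte hash key
    return 32 + ceil(log2(N)) + 8 * m

print(shift_entry_bits(5, 3))     # 57 bits per ShiftTable entry
print(ptrns_entry_bits(5, 6837))  # 85 bits per Ptrns entry (Snort, N = 6837)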
Note that we use a hash-table implementation in which each hash entry is a list implemented as a fixed-size array. This provides a space-efficient implementation but is expensive in the case of updates to the pattern set. We believe that for most usage scenarios this is the better tradeoff. However, in case the hash table needs to support updates, additional space is needed, as the arrays are replaced by linked lists with pointers; this roughly doubles the memory requirements of the hash tables. Overall, this is still a space-efficient implementation.
Table I summarizes the memory required by each of the listed data structures:
TABLE I
STORAGE REQUIREMENTS (KB)
Pattern set    m   B   ShiftTable   Ptrns    Total Storage
Snort          5   2   14.77        151.73   166.50
Snort          5   3   54.35        151.73   206.09
Snort          5   4   57.82        151.73   209.55
Snort          5   5   36.89        151.73   188.62
ModSecurity    5   2   1.31         4.05     5.36
ModSecurity    5   3   1.61         4.05     5.66
ModSecurity    5   4   1.40         4.05     5.45
ModSecurity    5   5   0.98         4.05     5.03
Our implementation is very space efficient: both the MWM and SPC algorithms require around 1.88 bytes per character for Snort and ModSecurity, as opposed to the 1.4KB of the original MWM algorithm [13]. This small space requirement increases the probability that the entire table resides within the cache, and is thus a key factor in the performance achieved by the algorithm.
VI. RELATED WORK
Compressed pattern matching has received attention in the context of the Lempel-Ziv compression family [14]–[17]. However, LZW/LZ78 are more attractive and simpler for pattern matching than LZ77. Recall that HTTP uses LZ77 compression; hence, all the above works are not applicable to our case. Klein and Shapira [18] suggested a modification to the LZ77 compression algorithm to ease the task of matching in files. However, their suggestion is not implemented in today's web traffic.
Farach et al. [19] is the only paper we are aware of that deals with pattern matching over LZ77. However, the algorithm in that paper is capable of matching only a single pattern and requires two passes over the compressed text (file), which does not comply with the on-the-fly processing requirement of the network domain.
The first attempt to tackle the problem of performing efficient pattern matching on compressed HTTP traffic, i.e., on the LZ77 family and in the context of networking, is presented in [1]. That paper suggests that the pattern matching task can be accelerated using the compression information. In fact, it shows that pattern matching on compressed HTTP traffic, even with the overhead of decompression, is faster than DFA-based pattern matching (such as the Aho-Corasick algorithm [4]). Our paper shows that the same approach can be applied to another important family of pattern matching algorithms, the shift-based techniques, such as Boyer-Moore [11] and the modified Wu-Manber (MWM) [5]. We show that accelerating the MWM algorithm results in a simpler algorithm, with higher throughput and lower storage overhead, than accelerating the Aho-Corasick algorithm. The algorithm can be combined with enhanced MWM-based solutions such as [20]–[25] and can also be implemented in a TCAM environment as in [12].
VII. CONCLUSION AND FUTURE WORK
With the sharp increase in cellular web surfing, HTTP compression has become common in today's web traffic. Yet, due to its performance requirements, most security devices tend to ignore or bypass compressed traffic and thus introduce either a security hole or a potential for a denial-of-service attack. This paper presents SPC, a technique that takes advantage of the information within the compressed traffic to accelerate, rather than slow down, the entire pattern matching process. The algorithm gains a performance boost of over 51% while using half the additional per-connection space compared to the previously known solution, ACCH. The algorithm presented in this paper should encourage vendors to support inspection of such traffic in their security equipment. As future work, we plan to handle regular expression matching over compressed web traffic.
REFERENCES
[1] A. Bremler-Barr and Y. Koral, "Accelerating multi-patterns matching on compressed HTTP," in INFOCOM 2009: 28th IEEE International Conference on Computer Communications, April 2009.
[2] "Top sites," July 2010. http://www.alexa.com/topsites.
[3] "ModSecurity." http://www.modsecurity.org (accessed July 2008).
[4] A. Aho and M. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, pp. 333–340, 1975.
[5] S. Wu and U. Manber, "A fast algorithm for multi-pattern searching," Tech. Rep. TR-94-17, Department of Computer Science, University of Arizona, May 1994.
[6] "Snort." http://www.snort.org (accessed October 2010).
[7] "Hypertext transfer protocol – HTTP/1.1," June 1999. RFC 2616, http://www.ietf.org/rfc/rfc2616.txt.
[8] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, pp. 337–343, May 1977.
[9] D. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, pp. 1098–1101, 1952.
[10] Y. Afek, A. Bremler-Barr, and Y. Koral, "Efficient processing of multi-connection compressed web traffic," in IFIP NETWORKING, 2011.
[11] R. Boyer and J. Moore, "A fast string searching algorithm," Communications of the ACM, pp. 762–772, October 1977.
[12] Y. Weinsberg, S. Tzur-David, D. Dolev, and T. Anker, "High performance string matching algorithm for a network intrusion prevention system (NIPS)," in HPSR, 2006.
[13] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, "Deterministic memory-efficient string matching algorithms for intrusion detection," in INFOCOM 2004, 2004.
[14] A. Amir, G. Benson, and M. Farach, "Let sleeping files lie: Pattern matching in Z-compressed files," Journal of Computer and System Sciences, pp. 299–307, 1996.
[15] T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, "Shift-And approach to pattern matching in LZW compressed text," in 10th Annual Symposium on Combinatorial Pattern Matching (CPM 99), 1999.
[16] G. Navarro and M. Raffinot, "A general practical approach to pattern matching over Ziv-Lempel compressed text," in 10th Annual Symposium on Combinatorial Pattern Matching (CPM 99), 1999.
[17] G. Navarro and J. Tarhio, "Boyer-Moore string matching over Ziv-Lempel compressed text," in Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pp. 166–180, 2000.
[18] S. Klein and D. Shapira, "A new compression method for compressed matching," in Proceedings of the Data Compression Conference (DCC 2000), Snowbird, Utah, pp. 400–409, 2000.
[19] M. Farach and M. Thorup, "String matching in Lempel-Ziv compressed strings," in 27th Annual ACM Symposium on the Theory of Computing, pp. 703–712, 1995.
[20] R. Liu, N. Huang, C. Kao, C. Chen, and C. Chou, "A fast pattern-match engine for network processor-based network intrusion detection system," in ITCC, pp. 97–101, 2004.
[21] S. Antonatos, M. Polychronakis, P. Akritidis, K. G. Anagnostakis, and E. P. Markatos, "Piranha: Fast and memory-efficient pattern matching for intrusion detection," in IFIP Advances in Information and Communication Technology, pp. 393–408, 2005.
[22] B. Zhang, X. Chen, L. Ping, and Z. Wu, "Address filtering based Wu-Manber multiple patterns matching algorithm," in WCSE, 2009.
[23] Z. Qiang, "An improved multiple patterns matching algorithm for intrusion detection," in ICIS, 2010.
[24] Y. Choi, M. Jung, and S. Seo, "L+1-MWM: A fast pattern matching algorithm for high-speed packet filtering," in INFOCOM, 2008.
[25] K. G. Anagnostakis, E. P. Markatos, S. Antonatos, and M. Polychronakis, "E²xB: A domain-specific string matching algorithm for intrusion detection," in SEC 2003, 2003.