S. Qing, H. Imai, and G. Wang (Eds.): ICICS 2007, LNCS 4861, pp. 201–215, 2007.
© Springer-Verlag Berlin Heidelberg 2007
MDH: A High Speed Multi-phase Dynamic Hash String
Matching Algorithm for Large-Scale Pattern Set
Zongwei Zhou1,2, Yibo Xue2,3, Junda Liu1,2, Wei Zhang1,2, and Jun Li2,3
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Research Institute of Information Technology, Tsinghua University, Beijing, China
3 Tsinghua National Laboratory for Information Science and Technology, Beijing, China
zhou-zw02@mails.tsinghua.edu.cn
Abstract. String matching is one of the key technologies in numerous network security applications and systems. Increasing network bandwidth and growing pattern set sizes both call for high-speed string matching algorithms for large-scale pattern sets. This paper proposes a novel algorithm called Multi-phase Dynamic Hash (MDH), which cuts down the memory requirement with multi-phase hash and exploits valuable pattern set information to speed up the searching procedure with dynamic-cut heuristics. Experimental results demonstrate that MDH improves matching performance by 100% to 300% compared with other popular algorithms, while its memory requirement stays at a comparatively low level.
Keywords: Network Security, String Matching Algorithm, Multi-phase Hash, Dynamic-Cut Heuristics.
1 Introduction
With the rapid development of modern network technology, the demand for attack prevention and security protection is increasing drastically in almost all network applications and systems. String matching is one of their key technologies. For example, widely deployed network intrusion detection and prevention systems (NIDS/IPS) often use signature-based methods to detect possible malicious attacks, so string matching is their basic operation. It has been demonstrated that string matching takes about 31% of the total processing time in Snort[1][5], the most famous open source NIDS[8]. Another notable instance is content inspection in network security systems. More and more such applications, including but not limited to anti-virus, anti-spam, instant message filtering, and information leakage prevention, require payload inspection as a critical functionality, and string matching is the most widely used technology in payload scanning.
However, string matching now faces new challenges arising from two important facts, both of which indicate that more efficient and practical high-speed string matching algorithms for large-scale pattern sets are urgently needed.
The first challenge is that large-scale pattern sets are becoming increasingly pervasive. In this paper, we define a pattern set with more than 10,000 patterns as a large-scale pattern set, in contrast to the small or medium-sized pattern sets in typical network security systems. As more types of viruses, worms, trojans and malware spread on the Internet, the pattern set size in anti-virus applications keeps increasing. For example, the famous open source anti-virus software Clam AntiVirus[2] now has more than 100,000 patterns, and daily updates are still quickly enlarging it. From February 14th to March 18th, 2007, its pattern set grew by about 10,000 patterns. However, most existing string matching algorithms are designed and tested with small and moderate pattern sets. They cannot be used efficiently in large-scale scenarios.
Secondly, network edge bandwidth is increasing from 100Mbps to 1Gbps or even more. This development demands high throughput from current inline network security applications. In newly emerging UTM (Unified Threat Management) systems, turning on real-time security functionalities such as intrusion prevention, anti-virus, and content filtering greatly reduces the overall system throughput, because all of these functionalities need extensive string matching. However, current string matching algorithms are still far from efficient enough to meet the needs driven by bandwidth upgrades.
This paper proposes a novel high-speed string matching algorithm, Multi-Phase Dynamic Hash (MDH), for large-scale pattern sets. We introduce multi-phase hash to cut down the memory requirement and to deal with the high hash collision rate under large-scale pattern sets. We also propose a novel idea, dynamic-cut heuristics, which exploits the independence and discriminability of the patterns to speed up the string matching procedure. Experimental results on both random pattern sets and real-life pattern sets show that MDH increases the matching throughput by about 100% to 300% compared with other popular string matching algorithms, while keeping its memory requirement at a low level.
The rest of this paper is structured as follows. Section 2 reviews previous work on string matching algorithms. Section 3 describes our MDH algorithm in detail. Section 4 presents experimental results that demonstrate the high matching performance and low memory requirement of our algorithm. Conclusions and future work are given in the last section.
2 Related Work
There are basically two categories of string matching algorithms: forward algorithms and backward algorithms. Both use a window in the text that has the same length as the pattern (the shortest pattern if there are multiple patterns). The window slides from the leftmost position of the text to the rightmost. A forward algorithm examines the characters in the text window from left to right, while a backward algorithm starts at the rightmost position of the window and reads the characters backward.
Among the forward algorithms, the Aho-Corasick (AC) algorithm[6] is the most famous one. It preprocesses multiple patterns into a deterministic finite state automaton. AC examines the text one character at a time, so its searching time complexity is $O(n)$, where n is the total length of the text. This means that the searching time of AC is theoretically independent of the number of patterns. In practice, however, the automaton size increases quickly as the pattern set grows, which requires too much memory. This limits the scalability of AC to large-scale string matching.
It has been demonstrated that backward algorithms have a higher average search speed than forward algorithms in practice, because they can skip unnecessary character comparisons in the text by certain heuristics[3]. The Boyer-Moore (BM) algorithm[7] is the most well-known backward algorithm for single pattern matching. BM relies on two important heuristics, bad character and good suffix, which are shown in Fig. 1. BM calculates a shift value for each of the two heuristics and then shifts the window by the larger one.
Fig. 1. Bad character (left) and good suffix (right) heuristics. y denotes the text and x the pattern; u is the matched suffix of the text window.
The Wu-Manber (WM) algorithm[4] extends BM to search for multiple strings concurrently. Instead of using the bad character heuristic to compute the shift value, WM uses a character block of 2 or 3 characters. WM stores the shift values of these blocks in a SHIFT table and builds a HASH table to link the blocks to the related patterns. The SHIFT table and the HASH table are both hash tables, which enables efficient search. Moreover, to further speed up the algorithm, WM builds another hash table, the PREFIX table, from the two-byte prefixes of the patterns. This algorithm has excellent average time performance in practice. However, its performance is limited by the minimum pattern length m, since the maximum shift value in the SHIFT table equals m-1.
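The SHIFT table idea can be illustrated with a minimal sketch. It assumes a plain Python dictionary keyed by the block itself instead of a hashed array, and the function name `build_wm_shift` is hypothetical; it is not taken from the WM paper or from Snort. Blocks that occur in no pattern get the default shift m-B+1, which equals m-1 for B=2.

```python
def build_wm_shift(patterns, B=2):
    """Return (m, default_shift, shift) for a Wu-Manber style SHIFT table."""
    m = min(len(p) for p in patterns)      # only the first m characters are used
    default = m - B + 1                    # shift for blocks that occur in no pattern
    shift = {}
    for p in patterns:
        for j in range(m - B + 1):         # j = offset of the block in the window
            block = p[j:j + B]
            dist = m - B - j               # distance from the block to the window end
            shift[block] = min(shift.get(block, default), dist)
    return m, default, shift


m, default, shift = build_wm_shift(["opionrate", "torrential"])
print(m, default, shift["op"], shift["on"], shift["te"])
```

A block that ends a window gets shift 0, which tells the scanner to verify candidate patterns; every additional pattern can only lower existing entries, which is why the average shift drops as the pattern set grows.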
However, when the pattern set is comparatively large, the average shift value in WM decreases and the searching performance is compromised. B. Xu and J. Li proposed the Recursive Shift Indexing (RSI) algorithm[10] to address this problem. RSI uses a heuristic that combines the two neighboring suffix character blocks in the window. It also uses bitmaps and recursive tables to enhance matching efficiency. These ideas are instructive for large-scale string matching algorithms.
J. Kytojoki, L. Salmela, and J. Tarhio also presented a q-gram based Boyer-Moore-Horspool algorithm[11]. This algorithm cuts a pattern into several blocks of length q and builds q-gram tables to calculate the shift value of the text window. It shows excellent performance on moderately sized pattern sets. For large-scale pattern sets, however, it is not good enough in either searching time or memory requirement.
C. Allauzen and M. Raffinot introduced the Set Backward Oracle Matching (SBOM) algorithm[12]. Its basic idea is to construct a more lightweight data structure called a factor oracle, which is built only on the reverse suffixes of the window of minimum pattern length m in every pattern. It consumes a reasonable amount of memory when the pattern set is comparatively large.
There are also some other popular backward algorithms that combine the BM heuristics with the AC automaton. C. Coit, S. Staniford, and J. McAlerney proposed the AC_BM algorithm[8]. It constructs a prefix tree of all patterns in the preprocessing stage and then uses both the BM bad character and good suffix heuristics to compute the shift value. A similar algorithm called Setwise Boyer-Moore-Horspool (SBMH)[9] was proposed by M. Fisk and G. Varghese. It uses a trie built from the suffixes of all patterns and computes the shift value using only the bad character heuristic. However, these two algorithms are also limited by their memory consumption when the pattern set is large.
3 MDH Algorithm
We have reviewed some popular multiple string matching algorithms. Each is among the best under particular circumstances. For large-scale pattern sets, however, all of them suffer a drastic decline in matching performance. Some of them, such as AC, AC_BM and SBMH, also face memory explosion. To our knowledge, few existing algorithms solve the large-scale pattern set problem well. In this context, MDH is designed to both improve the matching performance and maintain moderate memory consumption. Based on the WM algorithm, our new algorithm has two main improvements:
First, when pattern sets become larger, the WM algorithm has to increase the size of the SHIFT table and the HASH table to maintain matching performance, which consumes a lot of memory. MDH introduces multi-phase hash to cut down this high memory requirement.
Second, the WM algorithm considers only the first m characters of each pattern. This is simple and efficient, but it overlooks helpful information in the other characters. MDH therefore introduces dynamic-cut heuristics to select the optimum m consecutive characters of each pattern for preprocessing, which yields higher matching performance.
3.1 Key Ideas of MDH
In the following description, we let B be the block size used in WM and MDH, m be the minimum pattern length, $\Sigma$ be the alphabet of both patterns and text, $|\Sigma|$ be the alphabet size, k be the total number of patterns, and l be the average length of all the patterns.
3.1.1 Multi-phase Hash
In the WM algorithm, a SHIFT table entry stores the minimum shift value of all the character blocks hashed to it. As the number of patterns increases, frequent hash collisions reduce the average shift value $E(\mathit{shift})$ in the SHIFT table and thus compromise the matching performance.
Therefore, algorithms for large-scale pattern sets usually increase the character block size B to deal with the high hash collision rate. However, a larger B results in bigger SHIFT and HASH tables and thereby greatly increases the memory requirement. Given the limited cache in modern computers, high memory consumption lowers the cache hit rate and increases the average memory access time, which in turn decreases the matching performance. Moreover, it is difficult to load such large data structures into SRAM when the algorithm is implemented on current high-speed platforms such as network processors, multi-threaded processing chips and FPGAs. This limits its scalability to hardware implementations.
Based on these observations, we propose a novel technique called multi-phase hash. In the WM algorithm, a general hash function is used to build the SHIFT table and the HASH table, and the character blocks and the hash table entries are in one-to-one correspondence. In MDH, we replace them with two compressed hash tables, the SHIFT table and the PMT table. They have similar functionality but consume less memory. MDH first chooses a compressed hash function $h_1$ to reduce the SHIFT table from $|\Sigma|^B$ entries to $|\Sigma|^{a/8}$ entries ($a < 8B$), which means that $h_1$ uses only a bits of the B-length character block. However, compressing the SHIFT table entries together also reduces the average shift value, much as increasing the pattern set size does. Some blocks with a non-zero shift value are hashed into an entry whose shift value is zero, which causes more character comparisons in the matching procedure. We therefore introduce another compressed hash table, the PMT table, to separate the blocks with non-zero shift values from the zero shift value entries. When a character block with a non-zero shift value is hashed into a zero shift value entry, MDH uses another hash function $h_2$ to rehash it and stores its shift value as the skip value in the PMT table. The PMT table has $|\Sigma|^{b/8}$ entries ($b < a < 8B$). Moreover, each PMT table entry is also linked to its possible matching patterns, similar to the HASH table in WM. The number of patterns linked to a PMT table entry is recorded as its num value.
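As a rough check of the savings, the entry counts can be computed for the example parameters used later in Section 3.2.1 (B=4, a=20, b=17). The sketch assumes a byte alphabet of 256 symbols, so that a table indexed by a hash bits has 2^a entries; it counts entries only, because the size of an entry in bytes is not stated in the text.

```python
SIGMA = 256          # assumed alphabet size for byte-oriented text
B, a, b = 4, 20, 17  # example parameters from Section 3.2.1

full_entries = SIGMA ** B    # one entry per possible B-byte block
shift_entries = 2 ** a       # SHIFT table indexed by a bits
pmt_entries = 2 ** b         # PMT table indexed by b bits

print(full_entries, shift_entries, pmt_entries)
print(full_entries // shift_entries)  # reduction factor for the SHIFT table
```

With these parameters the SHIFT table has 4096 times fewer entries than a one-to-one table over all 4-byte blocks, and the PMT table is eight times smaller again.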
3.1.2 Dynamic-Cut Heuristics
Following the common practice of previous work[3], we use the average number of character comparisons $E(\mathit{comparison})$ as the key measure of the matching performance of the WM algorithm. A large-scale pattern set can increase $E(\mathit{comparison})$ and compromise the matching performance. We handle this by introducing dynamic-cut heuristics. A mathematical analysis of $E(\mathit{comparison})$ determines the detailed mechanisms used in the heuristics.
Let ZR be the ratio of the number of zero entries (entries with zero shift value) in the SHIFT table to the total number of SHIFT table entries. Let $T_0$ be the number of non-zero entries (entries linked with possible matching patterns) in the PMT table, so that $k/T_0$ is the average number of possible matching patterns (APM) in the PMT table.
In the searching stage, MDH first checks the shift value in the SHIFT table. If it is zero, the algorithm then checks the skip value in the PMT table. Only if the skip value is also zero does the algorithm verify the possible matching patterns. The probability that the number of comparisons equals x, $\Pr(\mathit{comparison} = x)$, is therefore calculated as follows:
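$$
\begin{aligned}
\Pr(\mathit{comparison} = 1) &= 1 - ZR \\
\Pr(\mathit{comparison} = 2) &= ZR \cdot \left(1 - T_0 / |\Sigma|^{b/8}\right) \\
\Pr(\mathit{comparison} > 2) &= ZR \cdot T_0 / |\Sigma|^{b/8}
\end{aligned}
\tag{1}
$$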
Thus, under average conditions, $E(\mathit{comparison})$ can be estimated as follows:
$$
E(\mathit{comparison}) = 1 \cdot \Pr(\mathit{comparison} = 1) + 2 \cdot \Pr(\mathit{comparison} = 2) + (2 + l \cdot APM) \cdot \Pr(\mathit{comparison} > 2) \tag{2}
$$
From (1) and (2), we get:
$$
E(\mathit{comparison}) = 1 + ZR + l \cdot k \cdot ZR / |\Sigma|^{b/8} \tag{3}
$$
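The step from (1) and (2) to (3) can be checked by direct substitution. The derivation below reads (1) as giving probabilities $1 - ZR$, $ZR(1 - p)$ and $ZR \cdot p$ for one, two and more than two comparisons, where $p = T_0 / |\Sigma|^{b/8}$ is a shorthand introduced here only for brevity, and it uses $APM = k/T_0$ from the definitions above.

```latex
\begin{aligned}
E(\mathit{comparison})
  &= 1\cdot(1 - ZR) + 2\cdot ZR\,(1 - p) + (2 + l\cdot APM)\cdot ZR\,p \\
  &= 1 - ZR + 2\,ZR - 2\,ZR\,p + 2\,ZR\,p + l\cdot APM\cdot ZR\,p \\
  &= 1 + ZR + l\cdot\frac{k}{T_0}\cdot ZR\cdot\frac{T_0}{|\Sigma|^{b/8}} \\
  &= 1 + ZR + \frac{l\,k\,ZR}{|\Sigma|^{b/8}}
\end{aligned}
```

The terms in $2\,ZR\,p$ cancel and $T_0$ drops out, so under normal conditions the estimate depends on the pattern set only through ZR, l and k.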
The above analysis holds only under the normal condition of network security applications, in which pattern matches in the text are comparatively sparse. However, new denial-of-service attacks, such as sending text with an extremely high number of matches to jam the pattern matching modules, have emerged to compromise network security applications that use BM-family string matching algorithms. It is therefore necessary to consider the heavy-load case or even the worst case, in which there are many matches in the text. Under such circumstances, the expected number of comparisons, denoted $E_w(\mathit{comparison})$, is calculated as follows:
$$
E_w(\mathit{comparison}) \approx 2 + l \cdot APM \tag{4}
$$
Therefore, once the SHIFT table size and the PMT table size are fixed by multi-phase hash, two possibilities remain for improving the searching performance. First, from equation (3), a smaller ZR results in a smaller $E(\mathit{comparison})$ under normal conditions and thereby yields higher average searching performance. Second, from equation (4), a smaller APM results in a smaller $E_w(\mathit{comparison})$ and thus ensures high searching performance in the worst case.
According to the above analysis, MDH uses dynamic-cut heuristics to cut every pattern down to its optimum m consecutive characters, so as to reduce ZR in the SHIFT table and APM in the PMT table. Theoretically, MDH could compute the ZR and APM values for all possible cuts and then choose the optimum one. However, such a mechanism demands too much time and memory in preprocessing when the number of patterns k and the average pattern length l are large. Note that in most network security applications and systems with large-scale string matching, such as anti-virus and content inspection, pattern sets change very fast, so such a complex preprocessing mechanism is inappropriate. We therefore implement the heuristics in a comparatively simple way, which is described in detail in the following section.
3.2 Algorithmic Details of MDH
3.2.1 Preprocessing Stage
In the following description, we let the block length B=4, the SHIFT table size a=20, and the PMT table size b=17. The pattern set is {opionrate, torrential, extension, cooperation}, so the minimum pattern length is m=9. << denotes the left-shift bit operator. The hash functions $h_1$ and $h_2$ are as follows:
$h_1$(block) = (*(block)) & 0x000FFFFF    (5)
$h_2$(block) = ((*(block)<<12) + (*(block+1)<<8) + (*(block+2)<<4) + *(block+3)) & 0x0001FFFF    (6)
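The two hash functions can be sketched in Python as follows. Equation (5) does not state how the 4-byte block is read as an integer; this sketch assumes a 32-bit big-endian read, because that choice makes "tion", "sion" and "pion" share one SHIFT entry as in the example of this section. The function names `h1` and `h2` simply mirror the two hash functions.

```python
import struct


def h1(block: bytes) -> int:
    """Keep the low 20 bits (a = 20) of the block read as one 32-bit word.

    Assumption: the word is read big-endian, so the last characters of
    the block land in the low bits.
    """
    (word,) = struct.unpack(">I", block[:4])
    return word & 0x000FFFFF


def h2(block: bytes) -> int:
    """Byte-wise shift-and-add hash of the 4-byte block, kept to 17 bits (b = 17)."""
    c0, c1, c2, c3 = block[:4]
    return ((c0 << 12) + (c1 << 8) + (c2 << 4) + c3) & 0x0001FFFF


for s in (b"tion", b"sion", b"pion", b"rate"):
    print(s.decode(), hex(h1(s)), hex(h2(s)))
```

Under this assumption the three blocks ending in "ion" collide in the first phase, while the second phase separates them again.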
There are three steps in the preprocessing stage:
Step1: Initialize the SHIFT table and the PMT table. Set all shift values and skip values to m-B+1 and all num values to zero. Each pattern has an offset value, that is, the offset of its optimum m window within the pattern. All offset values are initialized to zero.
Step2: Process the patterns one by one. Set the optimum m window position of each pattern according to the dynamic-cut heuristics and record its offset value. Meanwhile, add the suffix character blocks of these windows to the SHIFT table and the PMT table, and set the related shift and num values.
Step3: Process the patterns one by one again. Add the other blocks (all except the suffix block) of every optimum m window to the SHIFT table and the PMT table, and set the related shift and skip values.
3.2.1.1 Step2: Optimum m Window Position Setting. In this step, the algorithm processes the patterns one by one and calculates their optimum m window positions.
“opionrate” is one of the shortest patterns in the pattern set, so its optimum m window is “opionrate” itself. Its suffix block “rate” is added to the SHIFT table and the PMT table. The algorithm sets the shift value in the $h_1$(rate) SHIFT entry to 0, sets the num value in the $h_2$(rate) PMT entry to 1, and links the pattern to the $h_2$(rate) PMT entry.
The pattern “torrential” has two possible m window positions, “torrentia” and “orrential”. The algorithm checks the $h_1$(ntia) SHIFT entry, whose shift value is 4. It then checks the $h_1$(tial) SHIFT entry, whose shift value is also 4. Since no optimum m window is found, the algorithm falls back to “torrentia” as the optimum m window and sets the related shift and num values. The procedure for adding the pattern “extension” is similar to that for “opionrate”, because both are shortest patterns.
The last pattern, “cooperation”, reveals the effect of the dynamic-cut heuristics. It has three possible m window positions: “cooperati”, “ooperatio” and “operation”. The algorithm first checks the $h_1$(rati) SHIFT entry and finds that its shift value is 4, then checks the $h_1$(atio) SHIFT entry and gets the same result. The algorithm therefore moves the window again and checks the $h_1$(tion) SHIFT entry. Since $h_1$(tion) = $h_1$(sion), its shift value is zero. Note that the $h_2$(tion) PMT entry has a zero num value. According to our heuristics, “operation” is the optimum m window of the pattern “cooperation” and the related offset value is 2.
Fig. 2. SHIFT table and PMT table before and after setting the optimum m window position for pattern set {opionrate, torrential, extension, cooperation}
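The window selection of Step2 can be sketched as below. The text does not spell out the exact selection rule, so this sketch assumes one plausible reading: take the first m window whose suffix block falls on an existing zero SHIFT entry and on a PMT entry with no linked pattern, and otherwise fall back to offset 0. The name `choose_windows`, the dictionary-based tables and the big-endian read in `h1` are assumptions made for illustration.

```python
import struct

B = 4  # block length used in the example


def h1(block: bytes) -> int:
    # Assumption: 32-bit big-endian read, low 20 bits kept (a = 20).
    return struct.unpack(">I", block)[0] & 0x000FFFFF


def h2(block: bytes) -> int:
    c0, c1, c2, c3 = block
    return ((c0 << 12) + (c1 << 8) + (c2 << 4) + c3) & 0x0001FFFF


def choose_windows(patterns):
    """Pick one m window per pattern and register its suffix block.

    Returns (m, offsets, zero_entries, pmt). zero_entries is the set of
    SHIFT indices with shift value 0; pmt maps a PMT index to the list of
    patterns linked to it (its length plays the role of the num value).
    """
    m = min(len(p) for p in patterns)
    offsets, zero_entries, pmt = {}, set(), {}
    for p in patterns:
        chosen = 0  # fallback: the first m characters, as in WM
        for off in range(len(p) - m + 1):
            suffix = p[off + m - B: off + m]
            # Prefer a window that adds no new zero SHIFT entry and
            # lands on a PMT entry with no linked pattern yet.
            if h1(suffix) in zero_entries and not pmt.get(h2(suffix)):
                chosen = off
                break
        suffix = p[chosen + m - B: chosen + m]
        offsets[p] = chosen
        zero_entries.add(h1(suffix))
        pmt.setdefault(h2(suffix), []).append(p)
    return m, offsets, zero_entries, pmt


patterns = [b"opionrate", b"torrential", b"extension", b"cooperation"]
m, offsets, zero_entries, pmt = choose_windows(patterns)
print(offsets[b"cooperation"], len(zero_entries))
```

On the example pattern set this selects offset 2 for “cooperation” and leaves three zero SHIFT entries, one fewer than cutting every pattern at offset 0 would produce.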
Figure 2 illustrates that, without dynamic-cut heuristics, the shift value of the $h_1$(rati) SHIFT entry would be zero, so there would be four SHIFT entries with zero shift value and ZR would become bigger. In addition, “cooperation” and “opionrate” would both be linked to the $h_2$(rate) PMT entry and APM would become larger, since $h_2$(rate) = $h_2$(rati). Thus, dynamic-cut heuristics help to make both ZR and APM smaller, which contributes to higher searching performance. Comparison experiments between MDH without dynamic-cut heuristics and the full MDH implementation are presented in Section 4 to further demonstrate this effect.
3.2.1.2 Step3: Adding Character Blocks in the Optimum m Windows. In this step, we take the pattern “opionrate” as an example. The algorithm puts a B-length block window (B window) at the leftmost position of the pattern and slides it to the right. Let j be the offset of the B window; the shift value of the character block in the B window is then m-B-j. The algorithm first computes the hash value of “opio” with hash function $h_1$. The shift value in the $h_1$(opio) SHIFT entry is 6, and the shift value of “opio” is 5, so the algorithm records the smaller value 5 as the new shift value of this entry. It then computes the hash value of “pion”, which is the same as $h_1$(tion/sion). The shift value of the $h_1$(tion/sion) entry is zero. In this case, the algorithm computes $h_2$(pion) and indexes the related PMT entry. The skip value of the $h_2$(pion) PMT entry is 6 and the shift value of “pion” is 4, so the algorithm records the smaller value 4 as the new skip value of this entry. In the same way, the algorithm then processes the character blocks “ionr”, “onra” and “nrat”.
Fig. 3. SHIFT table and PMT table before and after filling in the shift and skip values of all B-length character blocks in pattern “opionrate”
Figure 3 shows the SHIFT table and the PMT table before and after the whole procedure above. Without multi-phase hash, the character block “pion” would be hashed into the $h_1$(tion/sion) SHIFT entry with zero shift value, which would cause unnecessary verification of the pattern “cooperation”, whose suffix block is “tion”. With multi-phase hash, the algorithm gets the real shift value of “pion” by checking the skip value of the $h_2$(pion) PMT entry, and the unnecessary character comparisons are avoided.
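The update rule of Step3 can be sketched as follows for the pattern “opionrate”. The sketch starts from the three zero SHIFT entries left by Step2 in the running example and uses dictionaries in place of the fixed-size tables; `add_blocks` is a hypothetical name, and the big-endian read in `h1` is an assumption.

```python
import struct

B, m = 4, 9
DEFAULT = m - B + 1  # initial shift and skip value (6 in the example)


def h1(block: bytes) -> int:
    # Assumption: 32-bit big-endian read, low 20 bits kept.
    return struct.unpack(">I", block)[0] & 0x000FFFFF


def h2(block: bytes) -> int:
    c0, c1, c2, c3 = block
    return ((c0 << 12) + (c1 << 8) + (c2 << 4) + c3) & 0x0001FFFF


def add_blocks(window: bytes, shift: dict, skip: dict) -> None:
    """Register every non-suffix block of one m window."""
    for j in range(m - B):                 # the suffix block (j = m-B) is skipped
        block = window[j:j + B]
        value = m - B - j                  # shift value of the block at offset j
        i1 = h1(block)
        if shift.get(i1, DEFAULT) != 0:    # ordinary entry: keep the minimum shift
            shift[i1] = min(shift.get(i1, DEFAULT), value)
        else:                              # zero entry: rehash into the PMT table
            i2 = h2(block)
            skip[i2] = min(skip.get(i2, DEFAULT), value)


# State after Step2 of the running example: three zero SHIFT entries.
shift = {h1(b"rate"): 0, h1(b"ntia"): 0, h1(b"sion"): 0}
skip = {}
add_blocks(b"opionrate", shift, skip)
print(shift[h1(b"opio")], skip[h2(b"pion")], shift[h1(b"nrat")])
```

The block “pion” is the only one that lands on a zero SHIFT entry, so its value 4 is stored as a skip value in the PMT table while the other blocks update the SHIFT table directly.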
3.2.2 Scanning Stage
The scanning procedure is comparatively simple. A B-length text window slides from the leftmost position of the text to the right. Each time, the algorithm examines the B characters in the text window, calculates their hash value with hash function $h_1$, and checks the relevant SHIFT table entry. If the shift value in this entry is not zero, it moves the text window rightwards by the shift value and restarts the procedure. Otherwise, it hashes the text block again with hash function $h_2$ and uses the new hash value to index the corresponding PMT table entry. It verifies every possible matching pattern linked to this entry using naïve comparison. After that, it moves the text window rightwards by the skip value of this entry and restarts the whole procedure.
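The scanning loop can be sketched end to end as below. To stay short, the sketch makes several simplifying assumptions that are not part of the description above: tables are Python dictionaries rather than fixed arrays, every pattern uses the m window at offset 0 (no dynamic-cut heuristics), patterns are at least B bytes long, and `h1` reads the block big-endian. The names `preprocess` and `scan` are hypothetical.

```python
import struct

B = 4


def h1(block):
    return struct.unpack(">I", block)[0] & 0x000FFFFF   # assumed big-endian read


def h2(block):
    c0, c1, c2, c3 = block
    return ((c0 << 12) + (c1 << 8) + (c2 << 4) + c3) & 0x0001FFFF


def preprocess(patterns):
    """Build dict-based SHIFT/PMT tables; every window starts at offset 0."""
    m = min(len(p) for p in patterns)
    default = m - B + 1
    shift, skip, linked = {}, {}, {}
    for p in patterns:                                  # Step2: suffix blocks
        suffix = p[m - B:m]
        shift[h1(suffix)] = 0
        linked.setdefault(h2(suffix), []).append(p)
    for p in patterns:                                  # Step3: other blocks
        for j in range(m - B):
            block, value = p[j:j + B], m - B - j
            if shift.get(h1(block), default) != 0:
                shift[h1(block)] = min(shift.get(h1(block), default), value)
            else:
                skip[h2(block)] = min(skip.get(h2(block), default), value)
    return m, default, shift, skip, linked


def scan(text, patterns):
    m, default, shift, skip, linked = preprocess(patterns)
    matches, pos = [], m - B          # pos = start of the B-length text block
    while pos + B <= len(text):
        block = text[pos:pos + B]
        s = shift.get(h1(block), default)
        if s != 0:                    # non-zero shift: just move the window
            pos += s
            continue
        start = pos + B - m           # start of the window that ends at this block
        for p in linked.get(h2(block), []):
            if text.startswith(p, start):               # naive verification
                matches.append((start, p))
        pos += skip.get(h2(block), default)
    return matches


patterns = [b"opionrate", b"torrential", b"extension", b"cooperation"]
text = b"xxopionratexxcooperationxextension"
print(scan(text, patterns))
```

Because skip values are never zero in this sketch, the window always advances after a verification, and a block that also occurs inside some window limits the jump to its recorded distance, so no alignment that could match is skipped.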
4 Experimental Results
This section presents a series of experiments to demonstrate the performance of the MDH algorithm. The test platform is a personal computer with one dual-core Intel Centrino Duo™ 1.83GHz processor and 1.5GB of DDR2 667MHz memory. The CPU has a 32KB L1 instruction cache and a 32KB L1 data cache. The shared L2 cache is 2048KB.
The text and the patterns are both randomly generated over an alphabet of size $|\Sigma| = 256$. We then insert every pattern into the text at random positions three times to guarantee a number of matches between the random text and the patterns. In the first experiment, the searching time comparison, we also use a recent antivirus pattern set from Clam AntiVirus to demonstrate the practical performance of the MDH algorithm. The text size in the following tests is 32MB. The pattern lengths in our large-scale pattern sets range from 4 to 100, and 80% of the patterns have a length between 8 and 16, which is comparatively close to the pattern sets of content inspection based network security applications such as instant message filtering, as recommended by CNCERT/CC[13].
4.1 Searching Time and Memory Requirement Comparison
To better evaluate the performance of MDH, we choose five typical multiple string matching algorithms that are widely deployed in recent practical applications. The source code of the AC, AC_BM and WM algorithms is adopted from Snort. Unnecessary code for case-sensitivity handling is removed to eliminate extra time and memory consumption. In the WM algorithm, we set the block length B=2. The source code of SBMH and SBOM is from [14].
Fig. 4. The upper graph compares the searching time of MDH and some typical algorithms. For pattern sets larger than 30k, MDH is much better than all the other algorithms in this experiment. The scalability of MDH to even larger pattern sets of more than 100k patterns is promising, since its performance declines less rapidly than that of the other algorithms when the pattern set size increases from 10k to 100k. The lower graph compares memory consumption. Table-based algorithms such as MDH and WM consume much less memory than the other algorithms in the experiment.
Figure 4 illustrates that the performance of all five typical algorithms declines drastically when the pattern set size exceeds 30k. Their matching throughput is lower than 96Mbps with 50k patterns. Algorithms such as AC, AC_BM and SBMH cannot support pattern sets larger than 60k under our test conditions because of their high memory consumption. With 100k patterns, the matching throughput of the MDH algorithm is still more than 100Mbps, which exceeds SBOM by 169% and WM by 231%. In addition, the MDH algorithm is highly stable as the pattern set size increases and also scales well to small and moderate pattern set sizes. This stable performance also indicates that it scales better to super-large-scale pattern sets. In a further test, the matching throughput of MDH is about 48.8 Mbps when the pattern set size is 200k, still better than that of WM and SBOM with a 100k pattern set.
The MDH algorithm is also superior in memory requirement. When the pattern set size increases to 50k, the memory requirement of every algorithm except WM and MDH is more than 200MB. Table-based algorithms such as WM and our solution consume less than 20 MB of memory even with 100k pattern sets.
4.2 Experiments on Real-Life Pattern Set
To demonstrate the practical performance of the MDH algorithm, we use the real-life pattern set of Clam AntiVirus in this experiment. The current virus database contains 102,540 patterns. We removed all patterns that are either represented by regular expressions or shorter than 4 characters, which leaves 77,607 patterns. We also form three subsets of size 20k, 40k and 60k. The minimum pattern length of all four pattern sets is 4. SBOM and WM are chosen for comparison with MDH, because these two algorithms also showed reasonable searching time and memory consumption in Section 4.1.
Table 1. Mem is the total memory consumption and Thr is the matching throughput, that is, the amount of text processed per second. With the large-scale Clam AntiVirus pattern sets, MDH achieves both higher searching performance and lower memory consumption than the WM and SBOM algorithms.
           20k              40k              60k              77k
Algorithm  Thr      Mem     Thr      Mem     Thr      Mem     Thr      Mem
           (Mbps)   (MB)    (Mbps)   (MB)    (Mbps)   (MB)    (Mbps)   (MB)
MDH        250.56   3.82    203.28   5.2     174.24   8.08    150.16   10.41
WM         329.52   3.33    126      5.2     66.88    8.53    43.36    11.27
SBOM       69.68    81.87   56.16    162.5   43.76    244.7   36.48    316.84
Table 1 shows that, from 20k to 77k patterns, the searching throughput of the MDH algorithm does not decline as drastically as that of the WM and SBOM algorithms. This stable performance indicates that MDH scales better to even super-large-scale pattern sets in real-life applications. With 77k patterns, the matching throughput of the MDH algorithm is more than 150Mbps, which exceeds SBOM by 311% and WM by 246%. Meanwhile, MDH consumes only about 3 to 11 MB of memory to process these pattern sets, which is comparable to the WM algorithm and far less than the SBOM algorithm. It is fair to say that the MDH algorithm delivers excellent time and space performance on large-scale pattern sets from real-life security applications.
4.3 Experiments on Multi-phase Hash
Table 2 shows the results of a comparison test between the WM algorithm (B=2), the WM algorithm (B=3) and MDH with multi-phase hash. In this table, MEM stands for the total memory used by the WM or MDH algorithm. When the number of patterns is more than 10k, ZR becomes very high in the WM algorithm (B=2). According to equation (3) in Section 3.1.2, a higher ZR leads to a bigger $E(\mathit{comparison})$ and greatly compromises the searching performance. If B=3, ZR becomes low enough to ensure good searching performance. However, the SHIFT table and the HASH table then become much bigger, since both tables have $|\Sigma|^B$ entries, so MEM in WM (B=3) increases to more than 80MB. With multi-phase hash, MDH is able to maintain a moderate ZR. Its MEM is at nearly the same level as WM (B=2) and only about 3% to 7% of that of WM (B=3).
Table 2. Comparison of ZR and MEM between the WM algorithm (B=2), the WM algorithm (B=3) and the MDH algorithm with multi-phase hash. ZR is high in the WM algorithm (B=2) under large-scale pattern sets. With B=3, the WM algorithm has a low ZR but consumes too much memory. MDH maintains both a low ZR and a reasonable MEM.
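| Pattern number | WM(B=2) ZR (%) | WM(B=2) MEM (MB) | WM(B=3) ZR (%) | WM(B=3) MEM (MB) | MDH ZR (%) | MDH MEM (MB) |
|---|---|---|---|---|---|---|
| 10k | 14.2 | 0.95 | 0.059 | 80.64 | 0.85 | 2.42 |
| 25k | 31.7 | 1.91 | 0.149 | 81.59 | 1.91 | 2.98 |
| 50k | 53.3 | 3.5 | 0.297 | 83.19 | 3.46 | 3.93 |
| 75k | 68.0 | 5.09 | 0.446 | 84.78 | 4.32 | 4.87 |
| 100k | 78.3 | 6.69 | 0.594 | 86.38 | 6.25 | 5.81 |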
4.4 Experiments on Dynamic-Cut Heuristics
From Table 3, we can see that ZR declines drastically when the dynamic-cut heuristics
are applied. For the 10k pattern set, the dynamic-cut heuristics reduce the number of
zero entries by about 10%, and for the 100k pattern set this reduction rises to nearly
30%. The influence of the heuristics on ZR becomes more significant as the pattern set
grows. Table 3 also shows that the APM value becomes comparatively smaller owing
to the dynamic-cut heuristics.
As for time performance, the dynamic-cut heuristics save about 7.6% to 14% of the
searching time when the pattern number ranges from 10k to 100k. Notably, the bigger
the pattern set is, the more significant the time saving becomes. This indicates that
the dynamic-cut heuristics scale well to even larger pattern sets. The heuristics do
increase the preprocessing time (for example, from 68.3 ms to 105.9 ms for 100k
patterns). However, this overhead is still reasonable, since most network security
applications do not change their pattern sets frequently and more attention is focused
on improving the searching time.
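The percentages discussed above follow from the first and last rows of Table 3. The sketch below recomputes them; the variable and function names are illustrative only, and a negative result means that the value increased.

```python
# (MP, MDH) pairs copied from the 10k and 100k rows of Table 3.
zero_entries = {"10k": (9940, 8878), "100k": (93842, 65494)}
search_ms = {"10k": (1112, 1028), "100k": (3117, 2680)}
preprocess_ms = {"10k": (18.8, 20.1), "100k": (68.3, 105.9)}


def reduction_percent(without, with_heuristics):
    """Percentage saved by the heuristics (negative means an increase)."""
    return (without - with_heuristics) / without * 100.0


for size in ("10k", "100k"):
    print(size,
          f"zero entries: {reduction_percent(*zero_entries[size]):.1f}%",
          f"search time: {reduction_percent(*search_ms[size]):.1f}%",
          f"preprocessing: {reduction_percent(*preprocess_ms[size]):.1f}%")
```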
Table 3. ZR is the zero SHIFT entry ratio, the same as in Table 2. APM indicates the average
number of possible matching patterns in the PMT table. MP denotes the MDH implementation
without dynamic-cut heuristics. We can see that the dynamic-cut heuristics greatly reduce the
number of zero SHIFT entries and the APM in the PMT table, which contributes to the decrease
in searching time.
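| Pattern Number | ZR (MP) | ZR (MDH) | APM (MP) | APM (MDH) | Preprocessing Time (ms) (MP) | Preprocessing Time (ms) (MDH) | Searching Time (ms) (MP) | Searching Time (ms) (MDH) |
|---|---|---|---|---|---|---|---|---|
| 10k | 9940 | 8878 | 1.04 | 1.03 | 18.8 | 20.1 | 1112 | 1028 |
| 30k | 29458 | 23391 | 1.12 | 1.08 | 26.4 | 37.4 | 1459 | 1312 |
| 50k | 48461 | 36262 | 1.2 | 1.12 | 36 | 62.8 | 1668 | 1512 |
| 70k | 67005 | 48198 | 1.29 | 1.16 | 49.9 | 84.4 | 2118 | 1877 |
| 100k | 93842 | 65494 | 1.43 | 1.22 | 68.3 | 105.9 | 3117 | 2680 |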
Fig. 5. In the upper graph, the SHIFT table size is set to $2^{20}$ (a=20) and the PMT table size
ranges from $2^{15}$ (b=15) to $2^{19}$ (b=19). MDH has a shorter run time (that is, better
performance) when using a larger PMT table. The experiment for the lower graph uses a fixed
PMT table size of $2^{17}$ (b=17), while the SHIFT table size ranges from $2^{18}$ (a=18) to
$2^{22}$ (a=22). The optimum SHIFT table size differs for different pattern sets.
4.5 SHIFT and PMT Table Size Selection
The selection of the SHIFT and PMT table sizes is a critical part of the MDH
implementation. From the upper graph of Fig. 5, we can conclude that a bigger PMT
table is more helpful in improving searching performance. This matches our previous
analysis. When the PMT table is larger, we are able to partition all character blocks
with zero SHIFT value into more entries, so the APM value becomes smaller. This
greatly reduces unnecessary verification time and benefits the final performance. Thus,
within the memory limitation, it is better to choose as large a PMT table as possible. In
the MDH algorithm, we choose a moderate and acceptable PMT table size of $2^{17}$ (b=17).
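The effect of the PMT table size on APM can be approximated with a simple occupancy model. The sketch below is an assumption for illustration and not the authors' analysis: it supposes that n items are hashed uniformly and independently into $2^b$ PMT entries, and it estimates the average number of items per non-empty entry as $\lambda / (1 - e^{-\lambda})$ with $\lambda = n / 2^b$. The function name `expected_apm` is hypothetical.

```python
import math


def expected_apm(num_items, pmt_bits):
    """Estimate the average number of items per non-empty PMT entry,
    assuming uniform hashing into 2**pmt_bits entries (Poisson model)."""
    load = num_items / 2 ** pmt_bits
    # 1 - exp(-load) approximates the fraction of entries that are non-empty.
    non_empty_fraction = -math.expm1(-load)
    return load / non_empty_fraction


# Estimate for 100,000 items as the PMT table grows from b=15 to b=19.
for bits in range(15, 20):
    print(bits, round(expected_apm(100_000, bits), 2))
```

Under this model the estimate falls steadily as b grows. For 100,000 items and b=17 it gives about 1.43, which is close to the MP column of Table 3.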
In the lower graph of Fig. 5, we test the selection of the SHIFT table size under the
same PMT table size of $2^{17}$ (b=17). The optimum SHIFT table size is related to the
pattern set size. From 10k to about 110k patterns, MDH with SHIFT table sizes of
$2^{19}$ and $2^{20}$ achieves higher searching speed than with the other sizes. For pattern
sets between 110k and 190k, a=21 becomes the best choice. When the pattern number
increases to 200k or more, a=22 performs better than the others. Moreover, we can also
conclude that the run-time curve for a larger SHIFT table always has a smaller
average slope. The reason is that in a large SHIFT table, ZR is comparatively small,
so growth of the pattern set cannot significantly raise this ratio and compromise the
matching performance.
Thus, we may conclude that the selection of the SHIFT table size depends on the
pattern set size. The algorithm should choose a larger SHIFT table to meet the
needs of a larger pattern set. In this paper, we focus on pattern sets ranging from 10k to
100k and thus set the SHIFT table size to $2^{20}$ (a=20).
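The selection rule described in this section can be summarized as a small lookup. The sketch below is only an illustration: the function name `suggest_shift_bits` is hypothetical, the choice of a=20 rather than a=19 for small sets follows the setting used in this paper, and treating the unreported range between 190k and 200k patterns as a=21 is an assumption.

```python
def suggest_shift_bits(num_patterns):
    """Suggest the SHIFT table exponent a (table size 2**a) for a pattern set,
    following the thresholds reported for a PMT table size of 2**17."""
    if num_patterns <= 110_000:
        return 20  # a=19 performed similarly well in this range
    if num_patterns < 200_000:
        return 21  # assumption: also used for the unreported 190k-200k range
    return 22


for n in (50_000, 150_000, 250_000):
    a = suggest_shift_bits(n)
    print(f"{n} patterns -> a={a}, SHIFT table entries={2 ** a}")
```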
5 Conclusion and Future Works
This paper proposes a novel string matching algorithm named the Multi-phase Dynamic
Hash (MDH) algorithm for large-scale pattern sets. Owing to multi-phase hash and
dynamic-cut heuristics, MDH can improve matching performance under large-scale
pattern sets by about 100% to 300% compared with other typical algorithms, whereas
the memory requirement remains at a comparatively low level. A low memory
requirement helps to raise the cache hit rate in practical usage and thereby
improves the matching performance. It would also help to support hardware
acceleration architectures based on MDH, such as FPGAs and new multi-core chips.
Several directions remain for future work. We are in the process of
finding the relationships among the character block size B, the SHIFT table size a, the
PMT table size b and the pattern set size k through more experimental and mathematical
analysis. We can also study more complex and efficient alternatives for the dynamic-cut
heuristics. In addition, the architecture design of network content filtering systems
based on MDH and multi-thread models will also be within our scope.
Acknowledgement. The authors thank CNCERT/CC for their support of this work.
CNCERT/CC is the abbreviation of National Computer Network Emergency
Response Technical Team/Coordination Center of China. The authors would also like
to thank Mr. Kai Li, Mrs. Xue Li, Mr. Bo Xu, Mr. Xin Zhou and Mr. Yaxuan Qi for
enlightening suggestions and help. Last but not least, the authors would like to thank
numerous volunteers who contributed to the open source projects like Snort and Clam
AntiVirus.
References
1. Roesch, M.: Snort: lightweight intrusion detection for networks. In: Proc. of the 1999
USENIX LISA Systems Administration Conference (1999)
2. Clam AntiVirus, http://www.clamav.net/
3. Navarro, G., Raffinot, M.: Flexible pattern matching in strings. Cambridge University
Press, Cambridge (2002)
4. Wu, S., Manber, U.: A fast algorithm for multi-pattern searching, Technical Report TR-
94-17, Department of Computer Science, University of Arizona (1994)
5. Snort, http://www.snort.org/
6. Aho, A., Corasick, M.: Efficient string matching: an aid to bibliographic search.
Communications of the ACM 18(6), 333–340 (1975)
7. Boyer, R., Moore, J.: A fast string searching algorithm. Communications of the
ACM 20(10), 762–772 (1977)
8. Coit, C., Staniford, S., McAlerney, J.: Towards faster string matching for intrusion
detection or exceeding the speed of snort, DARPA Information Survivability Conference
and Exposition, pp. 367–373 (2001)
9. Fisk, M., Varghese, G.: An analysis of fast string matching applied to content-based
forwarding and intrusion detection. Technical Report CS2001-0607 (updated version),
University of California-San Diego (2002)
10. Xu, B., Zhou, X., Li, J.: Recursive shift indexing: a fast multi-pattern string matching
Algorithm. In: Zhou, J., Yung, M., Bao, F. (eds.) ACNS 2006. LNCS, vol. 3989, Springer,
Heidelberg (2006)
11. Kytöjoki, J., Salmela, L., Tarhio, J.: Tuning string matching for huge pattern sets. In:
Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp.
211–224. Springer, Heidelberg (2003)
12. Allauzen, C., Raffinot, M.: Factor oracle of a set of words, Technical report 99-11, Institut
Gaspard-Monge, Université de Marne-la-Vallée (1999)
13. National Computer Network Emergency Response Technical Team/Coordination Center
of China, http://www.cert.org.cn/
14. Network Security Lab: Research Institute of Information Technology, Tsinghua University,
Beijing, http://security.riit.tsinghua.edu.cn/share/pattern.html