Compressing Regular Expressions’ DFA Table by Matrix Decomposition
Yanbing Liu (1,2,3), Li Guo (1,3), Ping Liu (1,3), and Jianlong Tan (1,3)
(1) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190
(2) Graduate School of Chinese Academy of Sciences, Beijing, 100049
(3) National Engineering Laboratory for Information Security Technologies, 100190
[email protected], {guoli,liuping,tjl}@ict.ac.cn
Abstract. Recently regular expression matching has become a research
focus as a result of the urgent demand for Deep Packet Inspection (DPI) in
many network security systems. Deterministic Finite Automaton (DFA),
which recognizes a set of regular expressions, is usually adopted to cater
to the need for real-time processing of network traffic. However, the huge
memory usage of DFA prevents it from being applied even on a medium-sized pattern set. In this article, we propose a matrix decomposition method
for DFA table compression. The basic idea of the method is to decompose
a DFA table into the sum of a row vector, a column vector and a sparse
matrix, all of which cost very little space. Experiments on typical rule sets
show that the proposed method significantly reduces the memory usage
and still runs at fast searching speed.
1 Introduction
In recent years, regular expression matching has become a research focus in the network security community. This interest is motivated by the demand for Deep
Packet Inspection (DPI), which inspects not only the headers of network packets
but also the payloads. In network security systems, signatures are represented
as either exact strings or complicated regular expressions, and the number of
signatures is quite large. Considering the requirement on real-time processing
of network traffic in such systems, Deterministic Finite Automaton (DFA) is
usually adopted to recognize a set of regular expressions. However, the combined DFA for regular expressions might suffer from the problem of exponential
blow-up, and the huge memory usage prevents it from being applied even on a
medium-sized pattern set. Therefore it’s necessary to devise compression methods to reduce DFA’s space so that it can reside in memory or high speed CPU
caches.
In this article, we propose a matrix decomposition method for DFA table
compression. Matrix decomposition has been widely studied and used in many
fields, but it has not yet been considered for DFA table compression. We treat
the state transition table of DFA as a matrix, and formulate a scheme for DFA
table compression from the angle of matrix decomposition. The basic idea of our
method is to decompose a DFA table into three parts: a row vector, a column
vector and a residual sparse matrix, all of which cost very little space. We test
our method on typical regular expression rule sets and the results show that the
proposed method significantly reduces the memory usage and still runs at fast
searching speed comparable to that of the original DFA.
M. Domaratzki and K. Salomaa (Eds.): CIAA 2010, LNCS 6482, pp. 282–289, 2011.
© Springer-Verlag Berlin Heidelberg 2011
The rest of this paper is organized as follows. Section 2 summarizes related work
on DFA compression. Section 3.1 formulates a matrix decomposition problem for
DFA compression, and section 3.2 proposes an iterative algorithm for matrix
decomposition and DFA compression. Section 4 reports experiments with the
proposed method, and section 5 concludes the paper.
2 Related Work
Many theoretical and algorithmic results on regular expression matching have
been achieved since the 1960s [1–6]. To bridge the gap between theory and practice,
there has been great interest in recent years in implementing fast regular expression
matching in real-life systems [7–14]. The large memory usage and potential state
explosion of DFA are common concerns of many researchers.
Yu et al.[7] exploit rewrite rules to simplify regular expressions, and develop
a grouping method that divides a set of regular expressions into several groups
so that they can be compiled into medium-sized DFAs. The rewrite rules work
only if the non-overlapping condition is satisfied.
Kumar et al. [8] propose Delayed Input DFA which uses default transition
to eliminate redundant transitions, but the time of state switching per text
character increases proportionally.
Becchi et al. [9] propose a compression method that results in at most 2N
state traversals when processing an input text of length N . It takes advantage of
state distance to achieve high compressibility comparable to that of the Delayed
Input DFA method.
Ficara et al. [10] devise the method δFA to eliminate redundant transitions in
DFA. The idea is based on the observation that adjacent states in DFA traversing share the majority of the next-hop states associated with the same input
characters, therefore the transitions of current state can be retrieved from its
predecessor’s transition table dynamically. However, the update of a local state
transition table is time-consuming.
Smith et al. [11] introduce XFA to handle two special classes of regular expressions that suffer from the exponential explosion problem. XFA augments
DFA by attaching counters to DFA states to memorize additional information.
This method needs to update a set of counters associated with each state during
traversing, and therefore it is not practical for software implementation.
In short, most of the above methods make a space-time tradeoff: they reduce
space usage at the cost of increased running time. Though these methods are
efficient in particular environments, better space-time tradeoff techniques
still need to be studied.
3 A Matrix Decomposition Method for DFA Compression
In this section, we first formulate a matrix decomposition problem: Additive
Matrix Decomposition. We then propose an iterative algorithm to solve the
stated problem. A DFA compression scheme follows naturally from the matrix
decomposition.
3.1 Problem Formulation: Additive Matrix Decomposition
The state transition table of a DFA can be treated as an m × n matrix A, where
m is the number of states and n is the cardinality of alphabet Σ. Matrix element
A[i, j] (or Ai,j ) defines the state switching from current state i to the next state
through character label j.
The basic idea of our method is to approximate the DFA matrix A by a special
matrix D (one that can be stored in little space) so that the residual matrix
R = A − D is as sparse as possible. Replacing the original DFA matrix with the
special matrix D and the sparse residual matrix R then yields a space compression
scheme. We formulate our idea as the following matrix decomposition
problem:
Additive Matrix Decomposition. Let X be a column vector of size m and
Y be a row vector of size n. Let D be the m × n matrix induced by X and Y
with D[i, j] = X[i] + Y[j] (1 ≤ i ≤ m, 1 ≤ j ≤ n). Now given an m × n matrix
A, find X and Y such that the number of zero elements in the residual matrix
R = A − D = [A[i, j] − X[i] − Y[j]] is maximized.
According to the above matrix decomposition, the DFA matrix A can be represented
with a column vector X, a row vector Y , and a residual matrix R. Since A[i, j] =
X[i] + Y [j] + R[i, j], state switching in DFA is still O(1) as long as accessing an
element in the residual sparse matrix R is accomplished in O(1) time.
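As a small illustration of the identity A[i, j] = X[i] + Y[j] + R[i, j], the following Python sketch decomposes a toy 3 × 4 transition table. The matrix A and the vectors X and Y are made-up values chosen by hand for illustration, not taken from any rule set, and indices are 0-based here, unlike the 1-based notation in the text.

```python
# Toy transition table: 3 states (rows), 4 input symbols (columns).
# All values below are hypothetical and hand-picked for illustration.
A = [
    [0, 1, 0, 0],
    [0, 1, 2, 0],
    [1, 2, 1, 1],
]
X = [0, 0, 1]       # one offset per state (the column vector)
Y = [0, 1, 0, 0]    # one offset per input symbol (the row vector)


def residual(A, X, Y):
    """Return R with R[i][j] = A[i][j] - X[i] - Y[j]."""
    return [[a - X[i] - Y[j] for j, a in enumerate(row)]
            for i, row in enumerate(A)]


def lookup(X, Y, R, i, j):
    """Recover A[i][j] from the three parts."""
    return X[i] + Y[j] + R[i][j]


R = residual(A, X, Y)
nonzero = sum(1 for row in R for r in row if r != 0)
```

With these hand-picked vectors only one residual entry is nonzero, so the toy table is described by 3 + 4 + 1 stored values instead of 12.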
For the purpose of high compressibility and fast access time, the residual
matrix R should be as sparse as possible. Space usage of the proposed scheme
consists of the size of X, the size of Y, and the number of nonzero elements in the
residual matrix R = A − D, resulting in the compression ratio
$\frac{m + n + \mathrm{nonzero}(R)}{mn}$.
This metric is used to evaluate our method’s compression efficiency in section 4.
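The ratio can be computed directly from a residual matrix. The sketch below assumes R is given as a list of rows; the function name and the sample matrix are illustrative only. Note that the metric counts stored entries, not bytes, so any index overhead of a concrete sparse format is outside it.

```python
def compression_ratio(R):
    """(m + n + nonzero(R)) / (m * n) for an m x n residual matrix R."""
    m, n = len(R), len(R[0])
    nonzero = sum(1 for row in R for r in row if r != 0)
    return (m + n + nonzero) / (m * n)


# Hypothetical 3 x 4 residual with a single nonzero entry.
R = [
    [0, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
]
ratio = compression_ratio(R)  # (3 + 4 + 1) / 12
```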
3.2 Iterative Algorithm for Additive Matrix Decomposition
We present here an iterative algorithm to find the vectors X and Y that maximize
the number of zero elements in the residual matrix R.
We start with the observation that if vectors X and Y are the optimal vectors to the additive matrix decomposition problem, then the following necessary
constraints must be satisfied:
1. For any 1 ≤ i ≤ m, X[i] is the most frequent element in the multiset
D_{i·} = {A[i, j] − Y[j] | 1 ≤ j ≤ n}.
2. For any 1 ≤ j ≤ n, Y[j] is the most frequent element in the multiset
D_{·j} = {A[i, j] − X[i] | 1 ≤ i ≤ m}.
The above constraints are easy to justify. For fixed Y, if X[i] is not the most
frequent element in D_{i·}, then we can increase the number of zero elements in R
by simply replacing X[i] with the most frequent element in D_{i·}. The same
argument holds for Y.
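The row constraint amounts to a frequency count over one row. A minimal sketch, assuming the row and Y are plain Python lists (the function name and sample values are illustrative only):

```python
from collections import Counter


def best_row_offset(row, Y):
    """Most frequent value of row[j] - Y[j], and how often it occurs."""
    counts = Counter(a - y for a, y in zip(row, Y))
    value, freq = counts.most_common(1)[0]
    return value, freq


# Hypothetical row of A and current Y.
row = [5, 7, 5, 5, 9]
Y = [0, 2, 0, 1, 0]
value, freq = best_row_offset(row, Y)
```

Here the differences are 5, 5, 5, 4, 9, so choosing X[i] = 5 turns three of the five entries in row i of R into zeros.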
We devise an iterative algorithm based on the above constraints to compute
X and Y. The elements of X and Y are first initialized to random values.
Then we iteratively compute X from the current Y and compute Y from the current
X until the above constraints are all satisfied. Every update strictly increases
the number of zero elements in R, which is bounded by mn, and therefore the
algorithm terminates in a finite number of steps. In practice this algorithm
usually terminates in 2 or 3 iterations.
Since the constraints are necessary but not sufficient conditions, our iterative
algorithm might not converge to a globally optimal solution. Fortunately, the
algorithm usually produces fairly good results.
The iterative procedure for computing X and Y is described in Algorithm 1.
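The iterative procedure can be sketched in Python as follows. This is an illustrative rendering under stated assumptions, not the authors' implementation: the matrix is a list of rows with 0-based indices, the random initial values are drawn from the state range, and the `seed` parameter is an addition made here for reproducibility.

```python
import random
from collections import Counter


def matrix_decomposition(A, seed=0):
    """Iterative additive decomposition; returns (X, Y, R)."""
    m, n = len(A), len(A[0])
    rng = random.Random(seed)            # seed is an assumption of this sketch
    X = [rng.randrange(m) for _ in range(m)]
    Y = [rng.randrange(m) for _ in range(n)]
    changed = True
    while changed:
        changed = False
        # Update X from the current Y, one row at a time.
        for i in range(m):
            counts = Counter(A[i][j] - Y[j] for j in range(n))
            x, freq = counts.most_common(1)[0]
            if freq > counts[X[i]]:      # strictly more zeros in row i
                X[i] = x
                changed = True
        # Update Y from the current X, one column at a time.
        for j in range(n):
            counts = Counter(A[i][j] - X[i] for i in range(m))
            y, freq = counts.most_common(1)[0]
            if freq > counts[Y[j]]:      # strictly more zeros in column j
                Y[j] = y
                changed = True
    R = [[A[i][j] - X[i] - Y[j] for j in range(n)] for i in range(m)]
    return X, Y, R


# Hypothetical 3-state, 4-symbol transition table.
A = [
    [0, 1, 0, 0],
    [0, 1, 2, 0],
    [1, 2, 1, 1],
]
X, Y, R = matrix_decomposition(A)
```

Because an update is applied only when it strictly increases the zero count, the loop cannot cycle. The result is a local optimum that depends on the initial values, so different seeds may give different sparsity.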
4 Experiment and Evaluation
We carry out experiments on several regular expression rule sets and compare
our method (CRD, Column-Row Decomposition) with the original DFA as well
as the δFA method[10] in terms of compression efficiency and searching time. The
CHAR-STATE technique in δFA is not implemented because it is not practical
for software implementation.
The experiments are carried out on regular expression signatures obtained
from several open-source systems: L7-filter [16], Snort [17], and BRO [18].
We also generate 6 groups of synthetic rules according to the classification proposed by Yu et al. [7], who categorize regular expressions into several classes
with different space complexity.
Since the DFA for a set of regular expressions usually suffers from the state
blow-up problem, it is usually hard to generate a combined DFA for a whole
large rule set. We use the regex-tool [19] to partition a large rule set into several
parts and to generate a moderate-sized DFA for each subset. In experiments the
L7-filter rule set is divided into 8 subsets, and the Snort rule set is divided into
3 parts. Details of the rule sets are given in Table 1.
4.1 Compression Efficiency Comparison
This section compares the space usage of our method CRD with that of the
original DFA and the δFA. We also compare our method CRD with its two simplified versions: DefaultROW and DefaultCOL. DefaultROW (or DefaultCOL)
corresponds to setting the vector Y (or X) in CRD to zero and extracting the most
frequent element in each row (or column) as a default state.
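The two simplified versions can be sketched as special cases of the decomposition. The code below is an illustration on a made-up 3 × 4 table; the function names and the hand-picked combined vectors are assumptions of this sketch, not part of the evaluated implementation.

```python
from collections import Counter


def default_row(A):
    """DefaultROW: Y is all zeros, X[i] is the most frequent entry of row i."""
    X = [Counter(row).most_common(1)[0][0] for row in A]
    Y = [0] * len(A[0])
    return X, Y


def default_col(A):
    """DefaultCOL: X is all zeros, Y[j] is the most frequent entry of column j."""
    X = [0] * len(A)
    Y = [Counter(col).most_common(1)[0][0] for col in zip(*A)]
    return X, Y


def nonzero_residuals(A, X, Y):
    """Number of positions where A[i][j] differs from X[i] + Y[j]."""
    return sum(1 for i, row in enumerate(A)
               for j, a in enumerate(row) if a != X[i] + Y[j])


# Hypothetical 3-state, 4-symbol transition table.
A = [
    [0, 1, 0, 0],
    [0, 1, 2, 0],
    [1, 2, 1, 1],
]
row_count = nonzero_residuals(A, *default_row(A))
col_count = nonzero_residuals(A, *default_col(A))
# Hand-picked combined vectors, using both a row and a column offset.
both_count = nonzero_residuals(A, [0, 0, 1], [0, 1, 0, 0])
```

On this toy table the row-only and column-only defaults leave 4 and 5 nonzero residuals respectively, while the combined vectors leave 1.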
We use the term compression ratio to evaluate the methods’ compression
efficiency. For the original DFA, its compression ratio is always 1.0. For δFA, its
compression ratio is the fraction of nonzero elements in the final residual sparse
matrix. For our method CRD, its compression ratio is defined as
$\frac{m + n + \mathrm{nonzero}(R)}{mn}$.
Algorithm 1. Decompose an m × n matrix A into a column vector X with
size m, a row vector Y with size n, and an m × n sparse matrix R, such that
A[i, j] = X[i] + Y[j] + R[i, j]. Let n(x, S) denote the number of occurrences
of x in a multiset S.

procedure MatrixDecomposition(A, m, n)
    for i ← 1, m do
        X[i] ← rand()
    end for
    for j ← 1, n do
        Y[j] ← rand()
    end for
    repeat
        changed ← FALSE
        for i ← 1, m do
            x ← the most frequent element in multiset D_{i·} = {A[i, j] − Y[j] | 1 ≤ j ≤ n}
            if n(x, D_{i·}) > n(X[i], D_{i·}) then
                X[i] ← x
                changed ← TRUE
            end if
        end for
        for j ← 1, n do
            y ← the most frequent element in multiset D_{·j} = {A[i, j] − X[i] | 1 ≤ i ≤ m}
            if n(y, D_{·j}) > n(Y[j], D_{·j}) then
                Y[j] ← y
                changed ← TRUE
            end if
        end for
    until changed = FALSE
    R ← [A[i, j] − X[i] − Y[j]] (an m × n matrix)
    return (X, Y, R)
end procedure
Algorithm 2. State switching in our DFA table compression scheme

procedure NextState(s, c)
    t ← X[s] + Y[c]
    if BloomFilter.test(s, c) = 1 then
        t ← t + SparseMatrix.get(s, c)
    end if
    return t
end procedure
Table 1 presents the compression ratio of the algorithms on typical rule sets.
We can see that our method achieves better compressibility on L7-filter, BRO
and synthetic rules, whereas δFA performs better on Snort rules. Of all the 18
groups of rules, CRD outperforms δFA on 14 rule sets. We can also see that
CRD combines the merits of both DefaultROW and DefaultCOL, and performs
better than these two simplified versions except on the rule set Synthetic 1.
Table 1. Compression ratio of the algorithms on L7-filter, Snort, BRO and synthetic
rule sets

Rule set      # of rules  # of states  δFA       CRD       DefaultROW  DefaultCOL
L7 1          26          3172         0.634964  0.226984  0.232905    0.817074
L7 2          7           42711        0.918592  0.240451  0.243461    0.968942
L7 3          6           30135        0.960985  0.356182  0.356860    0.968619
L7 4          13          22608        0.097177  0.379325  0.381078    0.832390
L7 5          13          8344         0.820768  0.198944  0.203315    0.961631
L7 6          13          12896        0.827021  0.053005  0.055044    0.974603
L7 7          13          3473         0.912125  0.054519  0.059149    0.928100
L7 8          13          28476        0.804303  0.231228  0.231309    0.985363
Snort24       24          13882        0.037515  0.103243  0.108468    0.957364
Snort31       31          19522        0.053581  0.058584  0.061309    0.915806
Snort34       34          13834        0.032259  0.058067  0.060473    0.947866
BRO217        217         6533         0.061814  0.035062  0.224820    0.514522
Synthetic 1   50          248          0.111281  0.011656  0.186697    0.007749
Synthetic 2   10          78337        0.099659  0.026233  0.030254    0.998601
Synthetic 3   50          8338         0.948123  0.014934  0.018575    0.335646
Synthetic 4   10          5290         0.990808  0.042752  0.046357    0.958690
Synthetic 5   50          7828         0.947048  0.016112  0.019762    0.326956
Synthetic 6   50          14496        0.973929  0.048839  0.173284    0.478337

4.2 Searching Time Comparison
This section compares the searching time of our method CRD with that of the
original DFA and the δFA. We generate a random text of size 10 MB and search
it against the synthetic rule sets.
Both the δFA and our method need to store a residual sparse matrix in a
compact data structure. To represent the sparse matrix, we store the nonempty
elements of each row in a sorted array, and an element is accessed
by binary search on that array. To avoid unnecessary probes into the sparse
table, we use the Bloom filter [15] technique to indicate whether a position in the
sparse matrix is empty or not (see Algorithm 2). This simple but efficient trick
eliminates most of the probes into the sparse matrix.
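The storage scheme just described can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the class and function names, the Bloom filter size (1024 bits), the number of hash functions (3), and the use of `hashlib.blake2b` for hashing are all choices made here for clarity. A speed-oriented implementation would use much cheaper hash functions and a packed bit array.

```python
import bisect
import hashlib


class SparseRows:
    """Nonzero entries of each row, sorted by column; lookup by binary search."""

    def __init__(self, R):
        self.cols = [[j for j, r in enumerate(row) if r != 0] for row in R]
        self.vals = [[r for r in row if r != 0] for row in R]

    def get(self, s, c):
        cols = self.cols[s]
        k = bisect.bisect_left(cols, c)
        if k < len(cols) and cols[k] == c:
            return self.vals[s][k]
        return 0  # empty position; also covers Bloom filter false positives


class BloomFilter:
    """Minimal Bloom filter over (state, symbol) pairs."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, s, c):
        for k in range(self.num_hashes):
            key = f"{k}:{s}:{c}".encode()
            digest = hashlib.blake2b(key, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, s, c):
        for p in self._positions(s, c):
            self.bits[p] = 1

    def test(self, s, c):
        return all(self.bits[p] for p in self._positions(s, c))


def build(R):
    """Build the sparse rows and a Bloom filter of the nonempty positions."""
    sparse = SparseRows(R)
    bloom = BloomFilter()
    for s, cols in enumerate(sparse.cols):
        for c in cols:
            bloom.add(s, c)
    return sparse, bloom


def next_state(X, Y, sparse, bloom, s, c):
    """State switching: probe the sparse matrix only if the filter says so."""
    t = X[s] + Y[c]
    if bloom.test(s, c):
        t += sparse.get(s, c)
    return t


# Hypothetical decomposition of a 3-state, 4-symbol table (0-based indices).
A = [[0, 1, 0, 0], [0, 1, 2, 0], [1, 2, 1, 1]]
X, Y = [0, 0, 1], [0, 1, 0, 0]
R = [[A[i][j] - X[i] - Y[j] for j in range(4)] for i in range(3)]
sparse, bloom = build(R)
```

A Bloom filter never reports a stored position as empty, and a false positive only costs one extra binary search that returns 0, so `next_state` reproduces the original table exactly.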
Searching time of the algorithms on synthetic rule sets is listed in Table 2.
Despite its high compressibility, our method CRD still runs at a speed comparable to that of the original DFA. The searching time increase of our method
is limited to 20% to 25%. The δFA method, which is designed for hardware
implementation, is significantly slower than the original DFA and our method.
State switching in δFA costs O(|Σ|), resulting in poor performance.

Table 2. Searching time (in seconds) of the algorithms on synthetic rule sets

Rule set      Original DFA  CRD     δFA
Synthetic 1   0.1250        0.1516  103.813
Synthetic 2   0.1218        0.1485  48.094
Synthetic 3   0.1204        0.1500  211.734
Synthetic 4   0.1204        0.1500  224.672
Synthetic 5   0.1203        0.1484  188.937
Synthetic 6   0.1250        0.1500  200.735
5 Conclusion
The huge memory usage of regular expressions’ DFA prevents it from being
applied on large rule sets. To deal with this problem, we proposed a matrix
decomposition-based method for DFA table compression. The basic idea of our
method is to decompose a DFA table into the sum of a row vector, a column
vector and a sparse matrix, all of which cost very little space. Experiments on
typical rule sets show that the proposed method significantly reduces the memory
usage and still runs at fast searching speed.
Acknowledgment
This work is supported by the National Basic Research Program of China (973)
under grant No. 2007CB311100 and the National Information Security Research
Program of China (242) under grant No. 2009A43. We would like to thank
Michela Becchi for providing her useful regex-tool to us for evaluation. We are
also grateful to the anonymous referees for their insightful comments.
References
1. Thompson, K.: Programming Techniques: Regular expression search algorithm.
Communications of the ACM 11(6), 419–422 (1968)
2. Myers, E.W.: A four Russians algorithm for regular expression pattern matching.
Journal of the ACM 39(2), 430–448 (1992)
3. Baeza-Yates, R.A., Gonnet, G.H.: Fast text searching for regular expressions or
automaton searching on tries. Journal of the ACM 43(6), 915–936 (1996)
4. Navarro, G., Raffinot, M.: Compact DFA representation for fast regular expression
search. In: Brodal, G.S., Frigioni, D., Marchetti-Spaccamela, A. (eds.) WAE 2001.
LNCS, vol. 2141, pp. 1–12. Springer, Heidelberg (2001)
5. Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps
pattern matching, with application to protein searching. In: Proceedings of the
5th Annual International Conference on Computational Molecular Biology, pp.
231–240 (2001)
6. Champarnaud, J.-M., Coulon, F., Paranthoen, T.: Compact and Fast Algorithms
for Regular Expression Search. Intern. J. of Computer. Math. 81(4) (2004)
7. Yu, F., Chen, Z., Diao, Y.: Fast and memory-efficient regular expression matching
for deep packet inspection. In: Proceedings of the 2006 ACM/IEEE symposium on
Architecture for Networking and Communications Systems, pp. 93–102 (2006)
8. Kumar, S., Dharmapurikar, S., Yu, F., Crowley, P., Turner, J.: Algorithms to
accelerate multiple regular expressions matching for deep packet inspection. ACM
SIGCOMM Computer Communication Review 36(4), 339–350 (2006)
9. Becchi, M., Crowley, P.: An improved algorithm to accelerate regular expression
evaluation. In: Proceedings of the 3rd ACM/IEEE Symposium on Architecture for
Networking and Communications Systems, pp. 145–154 (2007)
10. Ficara, D., Giordano, S., Procissi, G., Vitucci, F., Antichi, G., Pietro, A.D.: An
improved DFA for fast regular expression matching. ACM SIGCOMM Computer
Communication Review 38(5), 29–40 (2008)
11. Smith, R., Estan, C., Jha, S.: XFA: Faster signature matching with extended automata. In: IEEE Symposium on Security and Privacy, Oakland, pp. 187–201 (May
2008)
12. Kumar, S., Chandrasekaran, B., Turner, J., Varghese, G.: Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia. In: Proceedings
of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pp. 155–164 (2007)
13. Becchi, M., Cadambi, S.: Memory-efficient regular expression search using state
merging. In: 26th IEEE International Conference on Computer Communications,
pp. 1064–1072 (2007)
14. Majumder, A., Rastogi, R., Vanama, S.: Scalable regular expression matching on
data streams. In: Proceedings of the 2008 ACM SIGMOD International Conference
on Management of Data, Vancouver, Canada, pp. 161–172 (2008)
15. Bloom, B.H.: Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM 13(7), 422–426 (1970)
16. http://l7-filter.sourceforge.net/
17. http://www.snort.org/
18. http://www.bro-ids.org/
19. http://regex.wustl.edu/