Compressing Regular Expressions’ DFA Table by Matrix Decomposition

Yanbing Liu1,2,3, Li Guo1,3, Ping Liu1,3, and Jianlong Tan1,3

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190
2 Graduate School of Chinese Academy of Sciences, Beijing, 100049
3 National Engineering Laboratory for Information Security Technologies, 100190
[email protected], {guoli,liuping,tjl}@ict.ac.cn

Abstract. Regular expression matching has recently become a research focus because of the urgent demand for Deep Packet Inspection (DPI) in many network security systems. A Deterministic Finite Automaton (DFA) that recognizes a set of regular expressions is usually adopted to meet the need for real-time processing of network traffic. However, the huge memory usage of a DFA prevents it from being applied even to a medium-sized pattern set. In this article, we propose a matrix decomposition method for DFA table compression. The basic idea of the method is to decompose a DFA table into the sum of a row vector, a column vector and a sparse matrix, all of which cost very little space. Experiments on typical rule sets show that the proposed method significantly reduces memory usage while retaining a fast searching speed.

1 Introduction

In recent years, regular expression matching has become a research focus in the network security community. This interest is motivated by the demand for Deep Packet Inspection (DPI), which inspects not only the headers of network packets but also their payloads. In network security systems, signatures are represented as either exact strings or complicated regular expressions, and the number of signatures is quite large. Given the requirement for real-time processing of network traffic in such systems, a Deterministic Finite Automaton (DFA) is usually adopted to recognize a set of regular expressions.
However, the combined DFA for a set of regular expressions may suffer from exponential state blow-up, and the resulting memory usage prevents it from being applied even to a medium-sized pattern set. It is therefore necessary to devise compression methods that reduce the DFA's space so that it can reside in memory or in high-speed CPU caches.

In this article, we propose a matrix decomposition method for DFA table compression. Matrix decomposition has been widely studied and used in many fields, but it has not yet been considered for DFA table compression. We treat the state transition table of a DFA as a matrix, and formulate a scheme for DFA table compression from the angle of matrix decomposition. The basic idea of our method is to decompose a DFA table into three parts: a row vector, a column vector and a residual sparse matrix, all of which cost very little space. We test our method on typical regular expression rule sets, and the results show that the proposed method significantly reduces memory usage while running at a searching speed comparable to that of the original DFA.

M. Domaratzki and K. Salomaa (Eds.): CIAA 2010, LNCS 6482, pp. 282–289, 2011. © Springer-Verlag Berlin Heidelberg 2011

The rest of this paper is organized as follows. We first summarize related work on DFA compression in Section 2. We then formulate a matrix decomposition problem for DFA compression in Section 3.1. After that, we propose an iterative algorithm for matrix decomposition and DFA compression in Section 3.2. Finally, we carry out experiments with the proposed method and report the results in Section 4. Section 5 concludes the paper.

2 Related Work

Many theoretical and algorithmic results on regular expression matching have been obtained since the 1960s [1–6].
To bridge the gap between theory and practice, there has been great interest in recent years in implementing fast regular expression matching in real-life systems [7–14]. The large memory usage and potential state explosion of DFAs are common concerns of many researchers.

Yu et al. [7] exploit rewrite rules to simplify regular expressions, and develop a grouping method that divides a set of regular expressions into several groups so that they can be compiled into medium-sized DFAs. The rewrite rules work only if the non-overlapping condition is satisfied.

Kumar et al. [8] propose the Delayed Input DFA, which uses default transitions to eliminate redundant transitions, but the state-switching time per text character increases proportionally.

Becchi et al. [9] propose a compression method that results in at most 2N state traversals when processing an input text of length N. It takes advantage of state distance to achieve compressibility comparable to that of the Delayed Input DFA method.

Ficara et al. [10] devise the δFA method to eliminate redundant transitions in a DFA. The idea is based on the observation that states adjacent in a DFA traversal share the majority of their next-hop states for the same input characters, so the transitions of the current state can be retrieved dynamically from its predecessor's transition table. However, updating a local state transition table is time-consuming.

Smith et al. [11] introduce XFA to handle two special classes of regular expressions that suffer from the exponential explosion problem. XFA augments a DFA by attaching counters to DFA states to memorize additional information. This method needs to update a set of counters associated with each state during traversal, and therefore it is not practical for software implementation.

In short, most of the above methods rely on a space-time tradeoff: they reduce space usage at the cost of increased running time.
Though these methods are efficient in particular environments, better space-time tradeoff techniques still need to be studied.

3 A Matrix Decomposition Method for DFA Compression

In this section, we first formulate a matrix decomposition problem: Additive Matrix Decomposition. We then propose an iterative algorithm to solve the stated problem. A DFA compression scheme follows naturally from this matrix decomposition.

3.1 Problem Formulation: Additive Matrix Decomposition

The state transition table of a DFA can be treated as an m × n matrix A, where m is the number of states and n is the cardinality of the alphabet Σ. Matrix element A[i, j] (or Ai,j) gives the next state reached from the current state i on character label j.

The basic idea of our method is to approximate the DFA matrix A by a special matrix D (one that can be stored in little space) so that the residual matrix R = A − D is as sparse as possible. By replacing the original DFA matrix with the special matrix D and the residual sparse matrix R, a space compression scheme becomes plausible. We formulate our idea as the following matrix decomposition problem:

Additive Matrix Decomposition. Let X be a column vector of size m and Y be a row vector of size n. Let D be the m × n matrix induced by X and Y with D[i, j] = X[i] + Y[j] (1 ≤ i ≤ m, 1 ≤ j ≤ n). Now given an m × n matrix A, find X and Y such that the number of zero elements in the residual matrix R = A − D = [A[i, j] − X[i] − Y[j]] is maximized.

According to the above matrix decomposition, the DFA matrix A can be represented by a column vector X, a row vector Y, and a residual matrix R. Since A[i, j] = X[i] + Y[j] + R[i, j], state switching in the DFA is still O(1) as long as accessing an element of the residual sparse matrix R takes O(1) time. For the sake of high compressibility and fast access time, the residual matrix R should be as sparse as possible.
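As a small illustration of the identity A[i, j] = X[i] + Y[j] + R[i, j], the sketch below decomposes a toy transition table by hand. The table, the vectors X and Y, and the helper name next_state are all made up for illustration (they are not taken from the paper's rule sets), and indices are 0-based rather than 1-based as in the text.

```python
# Toy transition table: 3 states (rows) x 4 input symbols (columns).
# All values here are illustrative assumptions, not data from the paper.
A = [
    [1, 2, 1, 3],
    [1, 2, 1, 3],
    [2, 3, 2, 0],
]
m, n = len(A), len(A[0])

# Hand-picked vectors (not produced by the iterative algorithm).
X = [0, 0, 1]        # one offset per state (row)
Y = [1, 2, 1, 3]     # one offset per input symbol (column)

# Residual R = A - D, where D[i][j] = X[i] + Y[j].
R = [[A[i][j] - X[i] - Y[j] for j in range(n)] for i in range(m)]
nonzero = sum(1 for row in R for v in row if v != 0)


def next_state(s, c):
    """Recover A[s][c] from the three parts of the decomposition."""
    return X[s] + Y[c] + R[s][c]


print(R)        # [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, -4]]
print(nonzero)  # 1
```

Here only one of the twelve residual entries is nonzero, so the table can be reproduced from the seven vector entries plus a single stored exception.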
Space usage of the proposed scheme consists of the size of X, the size of Y, and the number of nonzero elements in the residual matrix R = A − D, resulting in the compression ratio $\frac{m + n + \mathrm{nonzero}(R)}{mn}$. This metric is used to evaluate our method's compression efficiency in Section 4.

3.2 Iterative Algorithm for Additive Matrix Decomposition

We present here an iterative algorithm to find vectors X and Y that maximize the number of zero elements in the residual matrix R. We start with the observation that if vectors X and Y are optimal for the additive matrix decomposition problem, then the following necessary constraints must be satisfied:

1. For any 1 ≤ i ≤ m, X[i] is the most frequent element in the multiset D_{i·} = {A[i, j] − Y[j] | 1 ≤ j ≤ n}.
2. For any 1 ≤ j ≤ n, Y[j] is the most frequent element in the multiset D_{·j} = {A[i, j] − X[i] | 1 ≤ i ≤ m}.

The above constraints are easy to derive. For fixed Y, if X[i] is not the most frequent element in D_{i·}, then we can increase the number of zero elements in R simply by replacing X[i] with the most frequent element in D_{i·}. The constraint on Y holds likewise.

We devise an iterative algorithm based on the above constraints to compute X and Y. The elements of X and Y are first initialized to random seeds. Then we iteratively compute X from the current Y and compute Y from the current X until the above constraints are all satisfied. Every update strictly increases the number of zero elements in R, which is bounded by mn, and therefore the algorithm terminates in a finite number of steps. In practice the algorithm usually terminates in 2 or 3 iterations.

Since the constraints are necessary but not sufficient conditions, our iterative algorithm might not converge to a globally optimal solution. Fortunately, the algorithm usually produces fairly good results.

The iterative procedure for computing X and Y is described in Algorithm 1.
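The iterative procedure and the compression ratio can be sketched in Python roughly as follows. This is a minimal sketch rather than the authors' implementation: the function names, the seed parameter, and the choice to draw the random initial values from the range 0..m−1 are assumptions made for illustration, and indices are 0-based.

```python
import random
from collections import Counter


def matrix_decomposition(A, seed=0):
    """Alternately refit X and Y until neither can gain more zeros in R."""
    m, n = len(A), len(A[0])
    rng = random.Random(seed)
    # Random initial values; the range 0..m-1 is an assumption of this sketch.
    X = [rng.randrange(m) for _ in range(m)]
    Y = [rng.randrange(m) for _ in range(n)]

    changed = True
    while changed:
        changed = False
        for i in range(m):
            # Multiset {A[i][j] - Y[j]}: candidate values for X[i].
            counts = Counter(A[i][j] - Y[j] for j in range(n))
            x, freq = counts.most_common(1)[0]
            if freq > counts[X[i]]:      # strict gain only, so the loop ends
                X[i] = x
                changed = True
        for j in range(n):
            # Multiset {A[i][j] - X[i]}: candidate values for Y[j].
            counts = Counter(A[i][j] - X[i] for i in range(m))
            y, freq = counts.most_common(1)[0]
            if freq > counts[Y[j]]:
                Y[j] = y
                changed = True

    R = [[A[i][j] - X[i] - Y[j] for j in range(n)] for i in range(m)]
    return X, Y, R


def compression_ratio(X, Y, R):
    """(m + n + nonzero(R)) / (m * n), as defined in Section 3.1."""
    nonzero = sum(1 for row in R for v in row if v != 0)
    return (len(X) + len(Y) + nonzero) / (len(X) * len(Y))


# Illustrative toy table (3 states x 4 symbols), not data from the paper.
A = [[1, 2, 1, 3], [1, 2, 1, 3], [2, 3, 2, 0]]
X, Y, R = matrix_decomposition(A)
print(compression_ratio(X, Y, R))
```

Because each update is accepted only when it strictly increases the number of zeros in the affected row or column, the loop cannot cycle. The result depends on the random initialization, so different seeds may reach different local optima.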
4 Experiment and Evaluation

We carry out experiments on several regular expression rule sets and compare our method (CRD, Column-Row Decomposition) with the original DFA as well as the δFA method [10] in terms of compression efficiency and searching time. The CHAR-STATE technique in δFA is not implemented because it is not practical for software implementation.

The experiments are carried out on regular expression signatures obtained from several open-source systems, including L7-filter [16], Snort [17], and BRO [18]. We also generate 6 groups of synthetic rules according to the classification proposed by Yu et al. [7], who categorize regular expressions into several classes with different space complexity.

Since the DFA for a set of regular expressions usually suffers from the state blow-up problem, it is usually hard to generate a combined DFA for a whole large rule set. We use the regex-tool [19] to partition a large rule set into several parts and to generate a moderate-sized DFA for each subset. In the experiments the L7-filter rule set is divided into 8 subsets, and the Snort rule set is divided into 3 parts. Details of the rule sets are given in Table 1.

4.1 Compression Efficiency Comparison

This section compares the space usage of our method CRD with that of the original DFA and the δFA. We also compare our method CRD with its two simplified versions: DefaultROW and DefaultCOL. DefaultROW (or DefaultCOL) corresponds to setting the vector Y (or X) in CRD to zero, and extracting the most frequent element in each row (or column) as a default state.

We use the term compression ratio to evaluate the methods' compression efficiency. For the original DFA, the compression ratio is always 1.0. For δFA, the compression ratio is the fraction of nonzero elements in the final residual sparse matrix. For our method CRD, the compression ratio is defined as $\frac{m + n + \mathrm{nonzero}(R)}{mn}$.

Algorithm 1.
Decompose an m × n matrix A into a column vector X of size m, a row vector Y of size n, and an m × n sparse matrix R, such that A[i, j] = X[i] + Y[j] + R[i, j]. Let n(x, S) denote the number of occurrences of x in a multiset S.

    procedure MatrixDecomposition(A, m, n)
        for i ← 1, m do
            X[i] ← rand()
        end for
        for j ← 1, n do
            Y[j] ← rand()
        end for
        repeat
            changed ← FALSE
            for i ← 1, m do
                x ← the most frequent element in multiset D_{i·} = {A[i, j] − Y[j] | 1 ≤ j ≤ n}
                if n(x, D_{i·}) > n(X[i], D_{i·}) then
                    X[i] ← x
                    changed ← TRUE
                end if
            end for
            for j ← 1, n do
                y ← the most frequent element in multiset D_{·j} = {A[i, j] − X[i] | 1 ≤ i ≤ m}
                if n(y, D_{·j}) > n(Y[j], D_{·j}) then
                    Y[j] ← y
                    changed ← TRUE
                end if
            end for
        until changed = FALSE
        R ← [A[i, j] − X[i] − Y[j]]m×n
        return (X, Y, R)
    end procedure

Algorithm 2. State switching in our DFA table compression scheme

    procedure NextState(s, c)
        t ← X[s] + Y[c]
        if BloomFilter.test(s, c) = 1 then
            t ← t + SparseMatrix.get(s, c)
        end if
        return t
    end procedure

Table 1 presents the compression ratio of the algorithms on typical rule sets. We can see that our method achieves better compressibility on the L7-filter (except L7 4), BRO and synthetic rules, whereas δFA performs better on the Snort rules. Of the 18 groups of rules, CRD outperforms δFA on 14. We can also see that CRD combines the merits of both DefaultROW and DefaultCOL, and performs better than these two simplified versions except on the rule set Synthetic 1.

Table 1.
Compression ratio of the algorithms on L7-filter, Snort, BRO and synthetic rule sets

Rule set      # of rules  # of states  δFA       CRD       DefaultROW  DefaultCOL
L7 1          26          3172         0.634964  0.226984  0.232905    0.817074
L7 2          7           42711        0.918592  0.240451  0.243461    0.968942
L7 3          6           30135        0.960985  0.356182  0.356860    0.968619
L7 4          13          22608        0.097177  0.379325  0.381078    0.832390
L7 5          13          8344         0.820768  0.198944  0.203315    0.961631
L7 6          13          12896        0.827021  0.053005  0.055044    0.974603
L7 7          13          3473         0.912125  0.054519  0.059149    0.928100
L7 8          13          28476        0.804303  0.231228  0.231309    0.985363
Snort24       24          13882        0.037515  0.103243  0.108468    0.957364
Snort31       31          19522        0.053581  0.058584  0.061309    0.915806
Snort34       34          13834        0.032259  0.058067  0.060473    0.947866
BRO217        217         6533         0.061814  0.035062  0.224820    0.514522
Synthetic 1   50          248          0.111281  0.011656  0.186697    0.007749
Synthetic 2   10          78337        0.099659  0.026233  0.030254    0.998601
Synthetic 3   50          8338         0.948123  0.014934  0.018575    0.335646
Synthetic 4   10          5290         0.990808  0.042752  0.046357    0.958690
Synthetic 5   50          7828         0.947048  0.016112  0.019762    0.326956
Synthetic 6   50          14496        0.973929  0.048839  0.173284    0.478337

4.2 Searching Time Comparison

This section compares the searching time of our method CRD with that of the original DFA and the δFA. We generate a random text of size 10 MB and search it against the synthetic rule sets.

Both the δFA and our method need to store a residual sparse matrix using a compact data structure. To represent the sparse matrix, we store the nonempty elements of each row in a sorted array, and an element is accessed by binary search on that array. To avoid unnecessary probes into the sparse table, we use the Bloom filter [15] technique to indicate whether a position in the sparse matrix is empty or not (see Algorithm 2). This simple but efficient trick eliminates most of the probes into the sparse matrix.

The searching time of the algorithms on the synthetic rule sets is listed in Table 2.
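The lookup path described above (sorted per-row arrays, binary search, and a Bloom filter guard) might be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class names, the filter size, the number of hash functions, and the use of Python's built-in tuple hash in place of real hash functions are all choices made here for brevity.

```python
from bisect import bisect_left


class BloomFilter:
    """Tiny Bloom filter over (state, symbol) pairs; sizes are illustrative."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, s, c):
        # Salting the built-in tuple hash with k stands in for k hash functions.
        for k in range(self.num_hashes):
            yield hash((k, s, c)) % self.num_bits

    def add(self, s, c):
        for p in self._positions(s, c):
            self.bits[p // 8] |= 1 << (p % 8)

    def test(self, s, c):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(s, c))


class CompressedDFA:
    """Stores X, Y and the nonzero residuals of R (hypothetical layout)."""

    def __init__(self, X, Y, R):
        self.X, self.Y = X, Y
        self.bloom = BloomFilter()
        # Per row: sorted column indices of nonzero residuals, and their values.
        self.cols, self.vals = [], []
        for s, row in enumerate(R):
            entries = [(c, v) for c, v in enumerate(row) if v != 0]
            self.cols.append([c for c, _ in entries])
            self.vals.append([v for _, v in entries])
            for c, _ in entries:
                self.bloom.add(s, c)

    def _residual(self, s, c):
        cols = self.cols[s]
        k = bisect_left(cols, c)          # binary search in the sorted row
        if k < len(cols) and cols[k] == c:
            return self.vals[s][k]
        return 0                          # Bloom false positive: nothing stored

    def next_state(self, s, c):
        t = self.X[s] + self.Y[c]
        if self.bloom.test(s, c):         # probe the sparse rows only if needed
            t += self._residual(s, c)
        return t


# Illustrative toy table and hand-picked vectors, not data from the paper.
A = [[1, 2, 1, 3], [1, 2, 1, 3], [2, 3, 2, 0]]
X, Y = [0, 0, 1], [1, 2, 1, 3]
R = [[A[s][c] - X[s] - Y[c] for c in range(4)] for s in range(3)]
dfa = CompressedDFA(X, Y, R)
print(dfa.next_state(2, 3))  # 0
```

A Bloom filter can report false positives but never false negatives, so the sparse lookup falls back to a residual of 0 when the probed position turns out to be empty; correctness does not depend on the filter's parameters, only the number of wasted probes does.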
Despite its high compressibility, our method CRD still runs at a speed comparable to that of the original DFA. The increase in searching time of our method is limited to roughly 20%–25%. The δFA method, which is designed for hardware implementation, is significantly slower than the original DFA and our method. State switching in δFA costs O(|Σ|), resulting in poor performance.

Table 2. Searching time (in seconds) of the algorithms on synthetic rule sets

Rule set      Original DFA  CRD     δFA
Synthetic 1   0.1250        0.1516  103.813
Synthetic 2   0.1218        0.1485  48.094
Synthetic 3   0.1204        0.1500  211.734
Synthetic 4   0.1204        0.1500  224.672
Synthetic 5   0.1203        0.1484  188.937
Synthetic 6   0.1250        0.1500  200.735

5 Conclusion

The huge memory usage of regular expressions' DFA prevents it from being applied to large rule sets. To deal with this problem, we proposed a matrix decomposition-based method for DFA table compression. The basic idea of our method is to decompose a DFA table into the sum of a row vector, a column vector and a sparse matrix, all of which cost very little space. Experiments on typical rule sets show that the proposed method significantly reduces memory usage while retaining a fast searching speed.

Acknowledgment

This work is supported by the National Basic Research Program of China (973) under grant No. 2007CB311100 and the National Information Security Research Program of China (242) under grant No. 2009A43. We would like to thank Michela Becchi for providing her useful regex-tool to us for evaluation. We are also grateful to the anonymous referees for their insightful comments.

References

1. Thompson, K.: Programming Techniques: Regular expression search algorithm. Communications of the ACM 11(6), 419–422 (1968)
2. Myers, E.W.: A four Russians algorithm for regular expression pattern matching. Journal of the ACM 39(2), 430–448 (1992)
3. Baeza-Yates, R.A., Gonnet, G.H.: Fast text searching for regular expressions or automaton searching on tries.
Journal of the ACM 43(6), 915–936 (1996)
4. Navarro, G., Raffinot, M.: Compact DFA representation for fast regular expression search. In: Brodal, G.S., Frigioni, D., Marchetti-Spaccamela, A. (eds.) WAE 2001. LNCS, vol. 2141, pp. 1–12. Springer, Heidelberg (2001)
5. Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with application to protein searching. In: Proceedings of the 5th Annual International Conference on Computational Molecular Biology, pp. 231–240 (2001)
6. Champarnaud, J.-M., Coulon, F., Paranthoen, T.: Compact and Fast Algorithms for Regular Expression Search. Intern. J. of Computer. Math. 81(4) (2004)
7. Yu, F., Chen, Z., Diao, Y.: Fast and memory-efficient regular expression matching for deep packet inspection. In: Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pp. 93–102 (2006)
8. Kumar, S., Dharmapurikar, S., Yu, F., Crowley, P., Turner, J.: Algorithms to accelerate multiple regular expressions matching for deep packet inspection. ACM SIGCOMM Computer Communication Review 36(4), 339–350 (2006)
9. Becchi, M., Crowley, P.: An improved algorithm to accelerate regular expression evaluation. In: Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pp. 145–154 (2007)
10. Ficara, D., Giordano, S., Procissi, G., Vitucci, F., Antichi, G., Pietro, A.D.: An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review 38(5), 29–40 (2008)
11. Smith, R., Estan, C., Jha, S.: XFA: Faster signature matching with extended automata. In: IEEE Symposium on Security and Privacy, Oakland, pp. 187–201 (May 2008)
12. Kumar, S., Chandrasekaran, B., Turner, J., Varghese, G.: Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia.
In: Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pp. 155–164 (2007)
13. Becchi, M., Cadambi, S.: Memory-efficient regular expression search using state merging. In: 26th IEEE International Conference on Computer Communications, pp. 1064–1072 (2007)
14. Majumder, A., Rastogi, R., Vanama, S.: Scalable regular expression matching on data streams. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, pp. 161–172 (2008)
15. Bloom, B.H.: Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM 13(7), 422–426 (1970)
16. http://l7-filter.sourceforge.net/
17. http://www.snort.org/
18. http://www.bro-ids.org/
19. http://regex.wustl.edu/