Implementation of Regular Expressions with Constraint Repetition

Karel Heyse, Karel Bruneel, and Dirk Stroobandt
ELIS department
Ghent University
B-9000 Ghent, Belgium
19th June 2012
1 Introduction
This document is an accompanying technical report to [1].
It gives additional information on the implementation of
regular expressions using Nondeterministic Finite Automatons (NFAs) (Section 2)
and on the optimised implementation of constraint repetitions using counters (Section 3). It describes the simple counter implementation, similar to the one presented in [2, 3],
that was used for the experiments in [1]. The final section lists
the regular expressions found in the Snort ruleset that could not be
implemented using the simple counter implementation.
2 Background
2.1 Regular Expressions
The regular expression syntax is a special purpose language to describe complex patterns of characters. Regular expression matching is the search for
substrings of the input string that satisfy the specified pattern.
The basic operators of the regular expression syntax are empty string consumption (ε), character consumption (c), concatenation (R1 R2), union (R1|R2)
and at-least-once repetition (R+).
A summary of the regular expression syntax can be found in Table 1.
2.2 NFA Implementation of Regular Expressions
Regular expression matching can be implemented efficiently using finite state
automatons. The automaton steps through the input string one character at a
time and returns true when a match of the regular expression is found [4].
Table 1: Summary of the regular expression syntax

Syntax           Description
ε                Empty string consumption: Accepts the string containing no characters.
c                Character consumption: Accepts a string with a single character ‘c’.
R1 R2            Concatenation: Accepts string if composed of substring accepted by R1 followed by substring accepted by R2.
R1|R2            Union: Accepts string if accepted by R1 or R2.
R+               At-least-once repetition: Accepts string if concatenation of one or more strings accepted by R.

Constraint repetitions:
R{N}             Exact: Equivalent to ‘RR...R’ (N repetitions of R).
R{N,}            At-least: Equivalent to ‘RR...R’ (N−1 repetitions of R) followed by ‘R+’.
R{M,N}           Between: Equivalent to ‘RR...R’ (M repetitions of R) followed by ‘(ε|R|RR|...|RR...R)’, where the longest alternative has N−M repetitions of R.

R?               At-most-once repetition: Equivalent to ‘ε|R’.
R*               Kleene star: Equivalent to ‘ε|R+’.
(...)            Grouping: Used to clarify to which subexpression an operator is applied or to improve readability.
[abc]            Character class: Union of all characters in class.
[^abc]           Can be negated,
[a-z]            or contain ranges.
\w, \d, \s, .    Shorthand character classes respectively for word, numeric, whitespace characters and any character except newline.
\xFF             Characters can be defined by their hexadecimal ASCII value.
\n, \t           Shorthand resp. for newline, tab.
^R               The pattern specified in R must occur at the beginning of the input string. If the modifier flag ‘m’ is defined, the pattern may also occur after a newline character.
R$               The pattern specified in R must occur at the end of the input string. If the modifier flag ‘m’ is defined, the pattern may also occur before a newline character.
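The constraint repetition equivalences in Table 1 can be checked mechanically. The following is a minimal sketch, assuming Python's `re` module as a reference matcher; the helper name `expand_repetition` and the use of `None` for an unbounded upper limit are illustrative choices, not part of the report.

```python
import re

def expand_repetition(r, m, n=None):
    """Rewrite r{m,n} using only concatenation, union, '+' and the empty string.

    n=None stands for an unbounded upper limit (the 'at-least' form r{m,}).
    """
    unit = f"(?:{r})"                      # group r so operators apply to all of it
    if n is None:
        if m == 0:
            return f"(?:|{unit}+)"         # r{0,} is the Kleene star: empty | r+
        return unit * (m - 1) + unit + "+"  # m-1 copies of r followed by r+
    if m > n:
        raise ValueError("lower bound must not exceed upper bound")
    # m copies of r followed by (empty | r | rr | ... | r^(n-m))
    tail = "|".join(unit * k for k in range(n - m + 1))
    return unit * m + f"(?:{tail})"

print(expand_repetition("ab", 2, 4))
```

Note that the expanded expression grows linearly with the bounds, which is the cost the counter-based implementation of Section 3 is meant to avoid.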
2.2.1 Nondeterministic Finite Automaton for Regular Expressions
Nondeterministic Finite Automatons (Figure 1) have a finite number of states,
each of which can be active or inactive. Multiple states can be active simultaneously.
Figure 1: Example of an NFA
Figure 2: Method to create the NFA for basic regular expression operators: (a) empty string consumption ‘ε’, (b) character consumption ‘c’, (c) concatenation ‘R1 R2’, (d) union ‘R1|R2’, (e) at-least-once repetition ‘R+’
One of the states is called the initial state (q0 ). It is activated when the
automaton is initialised.
During each execution cycle one character of the input string is consumed
and the states that will be active in the next cycle are selected. A state will be
activated if a transition exists from a currently active state to that state and
the accompanying condition (c) is satisfied by the current input character.
An active state immediately activates another state if a nonconsuming transition (ε) exists from the former to the latter.
If a state labeled as final (q1) is active, the input string read
up to that cycle matches the regular expression.
To make working with NFAs easier we assume, without loss of generality,
that the initial state does not have incoming transitions.
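The execution cycle described above can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the class `NFA`, the marker `EPS` for nonconsuming transitions, and the example automaton for ‘cd+’ are all assumptions made for this sketch.

```python
from collections import defaultdict

EPS = None  # label used in this sketch for nonconsuming transitions

class NFA:
    def __init__(self, initial, finals):
        self.initial = initial
        self.finals = set(finals)
        self.edges = defaultdict(list)   # state -> list of (label, target)

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))

    def closure(self, states):
        """Activate every state reachable through nonconsuming transitions."""
        active, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for label, dst in self.edges[s]:
                if label is EPS and dst not in active:
                    active.add(dst)
                    stack.append(dst)
        return active

    def step(self, active, ch):
        """One execution cycle: consume ch and select the next active states."""
        nxt = {dst for s in active for label, dst in self.edges[s] if label == ch}
        return self.closure(nxt)

    def matches_prefix(self, text):
        """True as soon as a final state is active after some prefix of text."""
        active = self.closure({self.initial})
        if active & self.finals:
            return True
        for ch in text:
            active = self.step(active, ch)
            if active & self.finals:
                return True
        return False

# Hypothetical example automaton for 'cd+' (state names are illustrative).
nfa = NFA(initial="q0", finals={"q2"})
nfa.add("q0", "c", "q1")
nfa.add("q1", "d", "q2")
nfa.add("q2", EPS, "q1")   # loop back so that further 'd' characters are accepted

print(nfa.matches_prefix("cdd"))
```

Because the set of active states is tracked as a whole, several states can be active in the same cycle, which is what distinguishes this from a deterministic automaton.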
2.2.2 Generation of NFAs
Using a variant of the McNaughton-Yamada [5] method a regular expression
can be converted into an NFA by decomposing it into a syntax tree of basic
operations and recursively performing the corresponding actions to create the
NFA of each operation.
The method to create the NFA for the basic operators is presented graphically in Figure 2.
We describe in detail the method to create the NFA of the concatenation
R1 R2. Concatenation is implemented by adding a nonconsuming transition (ε)
from the final state of R1 to the initial state of R2 (Figure 2(c)). The final state
of R2 is the final state of the concatenation.
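The recursive construction can be sketched as follows. Only the concatenation rule is taken directly from the text; the union and at-least-once repetition rules below follow the usual construction with fresh initial and final states, which is an assumption, since the exact shapes of Figure 2(d) and 2(e) are not reproduced here. The names `Frag`, `char`, `concat`, `union`, `plus` and `accepts` are illustrative.

```python
import itertools

EPS = None                      # marker for nonconsuming transitions
_ids = itertools.count()        # source of fresh state identifiers

class Frag:
    """NFA fragment with one initial state, one final state and an edge list."""
    def __init__(self, initial, final, edges):
        self.initial, self.final, self.edges = initial, final, edges

def _fresh_pair():
    return next(_ids), next(_ids)

def empty():
    q0, q1 = _fresh_pair()
    return Frag(q0, q1, [(q0, EPS, q1)])

def char(c):
    q0, q1 = _fresh_pair()
    return Frag(q0, q1, [(q0, c, q1)])

def concat(r1, r2):
    # As in the text: epsilon from the final state of R1 to the initial state of R2.
    return Frag(r1.initial, r2.final,
                r1.edges + r2.edges + [(r1.final, EPS, r2.initial)])

def union(r1, r2):
    # Assumed construction: fresh initial/final states joined by epsilon edges.
    q0, q1 = _fresh_pair()
    extra = [(q0, EPS, r1.initial), (q0, EPS, r2.initial),
             (r1.final, EPS, q1), (r2.final, EPS, q1)]
    return Frag(q0, q1, r1.edges + r2.edges + extra)

def plus(r):
    # Assumed construction: epsilon loop from the final state of R back to its
    # initial state; the fresh q0 keeps the new initial state free of incoming edges.
    q0, q1 = _fresh_pair()
    extra = [(q0, EPS, r.initial), (r.final, EPS, q1), (r.final, EPS, r.initial)]
    return Frag(q0, q1, r.edges + extra)

def accepts(frag, text):
    """Check whether the whole text is accepted by the fragment."""
    def closure(states):
        active, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, label, dst in frag.edges:
                if src == s and label is EPS and dst not in active:
                    active.add(dst)
                    stack.append(dst)
        return active
    active = closure({frag.initial})
    for ch in text:
        active = closure({dst for src, label, dst in frag.edges
                          if src in active and label == ch})
    return frag.final in active

# Syntax tree for 'a(b|c)+', built bottom-up from the basic operators.
expr = concat(char("a"), plus(union(char("b"), char("c"))))
print(accepts(expr, "abcb"))
```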
An expression which does not start with the ‘^’ metacharacter is prepended
with the starting NFA depicted in Figure 3(a). This ensures that every substring of the input is evaluated. An expression which does start with the ‘^’
metacharacter is prepended with the starting NFA depicted in Figure 3(b) if the
modifier flag ‘m’ is defined. This ensures that matching is started again after
every newline character.
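The effect of the two starting NFAs can be illustrated with a deliberately simplified matcher. The sketch below assumes a literal, non-empty pattern (a chain of character-consuming states) instead of a general NFA; the function name `find_match_ends` and its parameters are hypothetical. What it shows is only when the initial state of the expression is (re)activated.

```python
def find_match_ends(pattern, text, anchored=False, multiline=False):
    """Return the end positions of matches of a literal, non-empty pattern.

    anchored=False  : initial state re-activated in every cycle (Figure 3(a)).
    anchored=True   : initial state activated on the first character only,
                      and also after every newline if multiline is set
                      (Figure 3(b)).
    """
    active = set()   # each element: number of pattern characters matched so far
    ends = []
    for i, ch in enumerate(text):
        at_start = i == 0
        after_newline = i > 0 and text[i - 1] == "\n"
        if not anchored or at_start or (multiline and after_newline):
            active.add(0)                 # activate the expression's initial state
        # Consume ch: advance every attempt whose next character matches.
        active = {k + 1 for k in active if pattern[k] == ch}
        if len(pattern) in active:        # final state reached
            ends.append(i + 1)
            active.discard(len(pattern))
    return ends

print(find_match_ends("ab", "xabab"))
```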
The operator ‘$’ is implemented by simply ignoring the output of the NFA
until the final character of the input has been processed or, if the modifier flag ‘m’ is
defined, until just before a newline character is read.
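Figure 3: Starting NFAs: (a) for expressions that do not start with ‘^’, (b) for expressions that start with ‘^’ when the modifier flag ‘m’ is defined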
3 A Simple Counting NFA With Single Counter
Given the NFA of a regular expression, R, we now describe the proposed implementation of its constraint repetition, ‘R{M,N}’ with M ∈ ℕ, N ∈ ℕ₀ ∪ {+∞},
M ≤ N, using a counting NFA (Figure 4).
Without loss of generality we assume that R does not match the empty string
and that all nonconsuming outgoing transitions of the initial state have been
resolved. Constraint repetitions with a lower bound equal to zero, ‘R{0,N}’,
are implemented as ‘ε|R{1,N}’.
To create the counting NFA we first introduce loop-back transitions by
adding the outgoing transitions of the initial state to the final state of the
NFA.
The counter is incremented when the final state of the NFA is activated and
is reset to 0 whenever there is no character consuming transition active between
states in A, with A the set containing all states of R except its initial state.
Finally we add a nonconsuming transition from the final state of R to the
new final state of the constraint repetition. It has the condition: M ≤ cntr ≤ N .
This nonconsuming transition is executed at the end of the execution cycle of
the automaton, after the counters have been updated.
Figure 4: Example of a counting NFA implementation of a constraint repetition: (a) subexpression R with initial state with one outgoing transition (c1), (b) constraint repetition of subexpression R, with the counter update cntr_next = cntr + 1 and the guard ε ∧ M ≤ cntr ≤ N on the transition to the new final state
The counting NFA can be concatenated, unioned and even repeated just like
any subexpression. It is important however that every constraint repetition uses
a separate counter.
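The counter rules above can be traced in software. The following is a minimal sketch for the expression ‘^a+b{M,N}c’ with M > 0, where the repeated subexpression R is the single character ‘b’; the function name and the state names `q1` and `q2` are assumptions made for this illustration, and the code models behaviour only, not the hardware structure.

```python
def match_counting(text, m, n):
    """Simulate a single-counter NFA for '^a+b{m,n}c' (requires 1 <= m <= n).

    Returns True as soon as some prefix of text matches. Use float('inf')
    for an unbounded upper limit.
    """
    if not 1 <= m <= n:
        raise ValueError("expects 1 <= m <= n")
    init = True        # high only while the first character is processed
    q1 = False         # active after one or more 'a'
    q2 = False         # final state of the repeated subexpression 'b'
    cntr = 0
    for ch in text:
        # The guard M <= cntr <= N sees the counter as updated in the previous cycle.
        if q2 and ch == "c" and m <= cntr <= n:
            return True
        enter = q1 and ch == "b"        # first 'b': transition into q2
        loop_back = q2 and ch == "b"    # loop-back transition inside A = {q2}
        next_q1 = (init or q1) and ch == "a"
        next_q2 = enter or loop_back
        # Reset when no consuming transition between states in A is active,
        # then increment if the final state of R is activated.
        cntr = (cntr if loop_back else 0) + (1 if next_q2 else 0)
        init, q1, q2 = False, next_q1, next_q2
    return False

print(match_counting("aabbc", 2, 3))
```

A single integer is sufficient here because at most one repetition of ‘b’ can be in progress at any time; expressions in which several overlapping repetitions must be tracked simultaneously are the ones this simple scheme cannot handle.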
4 Hardware Implementation
Before the NFA is implemented in hardware, the nonconsuming transitions are
resolved using standard techniques.
The states of the NFA are implemented using one flip-flop per state, except
for the initial state which is implemented using a special signal that is high on
the first character of the input string.
A transition is implemented by AND-ing its condition with the output of
the flip-flop of its departure state. The input of a flip-flop is the OR of all of
the incoming transitions of the state (Figure 5).
The character conditions are implemented using one-hot encoding, while the
counter condition, M ≤ cntr ≤ N, is implemented using a custom parameterised
module, which also has an incr and a reset input (Figure 6).
The resulting logic network can then be optimised, for instance by applying the
distributive property in reverse to factor out common terms.
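The flip-flop equations can be mimicked in software to check the logic before synthesis. The sketch below is a behavioural model for the expression ‘^(ac*a|bb)d’ after the nonconsuming transitions have been resolved; the state names `q1` to `q4` and their assignment to subexpressions are assumptions for this illustration and need not match the labels in Figure 5.

```python
def run_hardware_model(text):
    """Evaluate the AND/OR next-state equations for '^(ac*a|bb)d' cycle by cycle."""
    q1 = q2 = q3 = q4 = False            # one flip-flop per non-initial state
    for i, ch in enumerate(text):
        init = (i == 0)                  # special signal: high on the first character
        # One-hot decoded character conditions.
        a, b, c, d = ch == "a", ch == "b", ch == "c", ch == "d"
        # Each transition is (source flip-flop AND condition); each flip-flop
        # input is the OR of its incoming transitions. The tuple assignment
        # models all flip-flops updating on the same clock edge.
        q1, q2, q3, q4 = (
            (init and a) or (q1 and c),  # q1: seen 'a' followed by any number of 'c'
            init and b,                  # q2: seen the first 'b'
            (q1 and a) or (q2 and b),    # q3: seen 'ac*a' or 'bb'
            q3 and d,                    # q4: final state, drives the match output
        )
        if q4:
            return True
    return False

print(run_hardware_model("accad"))
```

As an example of the factoring mentioned above, two transitions with the same character condition into the same state, such as (x AND c) OR (y AND c), can be rewritten as (x OR y) AND c, which saves one AND gate.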
Figure 5: Hardware implementation of NFA for regular expression: ‘^(ac*a|bb)d’: (a) Counting NFA, (b) Hardware implementation
5 Analysis of the Snort Ruleset
This section contains the sid and revision numbers of the rules of Snort, an open source
Network Intrusion Detection and Prevention System [6], that cannot be
proven implementable using the previously presented counting NFA. The verification method used to do this and further analysis of the Snort ruleset are
presented in [1].
Table 2 lists the rules that were proven unimplementable using the presented counting NFA. For example, a number of rules start with
the expression: ‘\x00\x00\x00 [\x00\x01] .{4}\x00\x01\x00 . . . ’ of which the
constraint repetition ‘.{4}’ cannot be implemented using a single counter.
Table 3 contains the rules that could not be verified because they reached
more than 10 million DFA states, at which point we terminate our algorithm.
Figure 6: Hardware implementation of counting NFA for regular expression: ‘^a+b{M,N}c’, M > 0: (a) Counting NFA, (b) Hardware implementation. Labels in (a): cntr_next = cntr + 1 on the ‘b’ transitions, cntr_next = 0 on ‘not b’, and c ∧ M ≤ cntr ≤ N on the guarded transition.

References
[1] K. Heyse, K. Bruneel, and D. Stroobandt, “Proving correctness of regular
expression matchers with constraint repetition.” (Unpublished), 2012.
[2] M. Faezipour and M. Nourani, “Constraint repetition inspection for regular
expression on FPGA,” in Proceedings of the 16th IEEE Symposium on High
Performance Interconnects, (Washington, DC, USA), pp. 111–118, 2008.
[3] S. Yun and K. Lee, “Regular expression pattern matching supporting constrained repetitions,” in Proceedings of the 5th International Workshop on
Reconfigurable Computing: Architectures, Tools and Applications, (Berlin,
Heidelberg), pp. 300–305, Springer-Verlag, 2009.
[4] R. Sidhu and V. K. Prasanna, “Fast regular expression matching using FPGAs,” in Proceedings of the 9th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, FCCM ’01, (Washington, DC,
USA), pp. 227–238, IEEE Computer Society, 2001.
[5] R. McNaughton and H. Yamada, “Regular expressions and state graphs for
automata,” Electronic Computers, IRE Transactions on, vol. EC-9, pp. 39–47,
Mar. 1960.
[6] Sourcefire, “Snort.” http://www.snort.org.
Table 2: Snort rules unimplementable using simple counting NFA

sid     rev     sid     rev     sid     rev
15105   10      18706   5       20194   1
16501   2       19926   2       20195   1
16502   2       20185   1       20196   1
16514   2       20186   1       20197   1
16679   2       20187   1       20198   1
17390   1       20189   1       20199   1
17535   2       20190   1       21044   1
18537   3       20191   1       21045   1
18704   5       20192   1
18705   4       20193   1
Table 3: Verification algorithm terminated (more than 10 million DFA states)

sid     rev
15105   10
16345   2
16346   2
