Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'15 |
Proposal and Implementation of Mixed Finite Automata
Optimization by Balancing Active States and Transitions
Kosuke Nishimura+, Kenichi Takagiwa+, Hiroaki Nishi*
Graduate School of Science and Technology, Keio University, Japan
+{nishimura, takagiwa}@west.sd.keio.ac.jp, *[email protected]
Abstract – A string matching program for a regular expression
is conventionally implemented using one of two finite
automata: a nondeterministic finite automaton (NFA) or a
deterministic finite automaton (DFA). Each of these automata
has its own pros and cons. An NFA allows multiple states to be
active at once. The resulting frequent accesses to the state
memory increase the access latency and thereby degrade the
processing throughput. In return, its memory footprint is
small. A DFA has a larger memory footprint than an NFA.
However, because a DFA always has exactly one active state, it
reduces the number of memory accesses and improves the
matching throughput. In this paper, a mixed FA (MFA), a new
automaton combining an NFA and a DFA, is proposed. An MFA
combines an NFA and a DFA at an adjustable mixing ratio, which
enables the memory footprint and the maximum number of active
states to be adjusted.
Keywords: Regular expressions, finite automata, string
matching.
1 Introduction
A network intrusion detection system (NIDS) and an
intrusion prevention system (IPS) are significant applications
for countering cyberattacks. An NIDS acquires packet data from
the communication flows of a network, and provides intrusion
detection or attack prevention by analyzing the flows in real
time [1]. When an NIDS detects malformed data, it notifies an
administrator of the arrival of the data. A well-known NIDS
software package is Snort [2] [3], which detects attacks by
matching the data flowing in the network against a dedicated
database. The database consists of signatures and rules generated
according to the characteristics of various attack methods.
An early NIDS used only single strings as signatures to
describe the characteristics of viruses. In recent years,
NIDSs have been developed to use regular expressions and
extended regular expressions (Perl Compatible Regular
Expressions, PCRE [4]) instead of single strings. Using special
characters, a regular expression can describe a wide variety of
patterns in a single string, which helps detect the diversified
and sophisticated attacks that are evolving into threats to newer
computer systems. It is therefore natural for an NIDS to use
regular expressions to describe the complex strings of viruses
and malformed attacks, as signatures or attack procedures, in
order to detect them. Owing to their flexibility in describing
such signatures and patterns, regular expressions are also
applied in various other applications, such as content-based
spam e-mail filters [5].
Because of these advantages, an NIDS with a regular expression
matching function plays an important role in computer security,
and the demand for it is increasing. However, implementing
regular expression matching in software has several problems,
such as the large memory footprint needed to describe complex
and numerous expressions, and the reduction in processing
throughput caused by frequent memory accesses. This paper
focuses on a string matching program that can detect complex
strings using regular expressions.
A string matching program based on regular expressions
is conventionally implemented using one of two finite
automata: a nondeterministic finite automaton (NFA) or a
deterministic finite automaton (DFA) [6]. Each of these
automata has its own pros and cons. Because an NFA permits
multiple state transitions for each input character, several
states can be active at once. However, this characteristic
degrades the processing throughput because multiple memory
accesses are required to advance the matching process from one
character to the next. In a DFA, the number of state transitions
is always limited to one for each processing step of a
character. This feature accelerates processing because one
memory access is sufficient to advance the matching process.
Therefore, a DFA achieves a higher matching throughput than an
NFA. However, a DFA increases the memory footprint needed to
describe regular expressions because the number of states can
grow exponentially with the size and complexity of the given
regular expressions. In short, an NFA takes longer to process a
match than a DFA [5], whereas a DFA requires more memory.
As mentioned, there is a trade-off between an NFA and a
DFA in terms of memory footprint and processing throughput.
To break this trade-off, techniques for compressing the state
memory of a DFA have been proposed [5] [7] [8]. A dual FA [5]
isolates the parts of the automaton that process the repetition
special characters, and reduces the total memory usage by
preventing the exponential growth caused by repetition. This
straightforward technique can reduce the memory usage. However,
it does not consider the trade-off between memory usage and
processing throughput. If a given automaton is composed only of
repeats, the generated automaton is equivalent to an NFA, and
the processing throughput of the dual FA is drastically
degraded. As another approach, Google developed the RE2 regular
expression algorithm [9], which solves the memory exhaustion
problem by switching from a DFA to an NFA when the memory
footprint exceeds a certain amount. However, RE2 does not
consider the optimization of memory usage. The available memory
footprint varies with the application or the system on which it
is implemented. Thus, a new method is required that
utilizes all of the available memory space efficiently to
maximize the processing throughput of regular expression
matching.
In this paper, a mixed FA (MFA) is proposed, which
takes this trade-off into account. It flexibly combines an NFA
and a DFA according to the available memory footprint. An MFA
adjusts the mixing ratio of the NFA and the DFA automatically
to find the best ratio. As its novel feature, an MFA balances
the memory footprint and the maximum number of active states so
as to maximize the total performance under the limited memory
space, which varies with the target application. When a DFA
would require too much memory, an MFA can reduce the required
memory footprint. Although this reduces the processing
throughput, an MFA can still provide better performance than an
NFA.
Fig. 1: The NFA for “c(a|b)*a”
2 Research Background
2.1 Regular Expression
A regular expression describes a set of strings. For
example, the regular expression “[bc]ook” matches “book” and
“cook.” A regular expression enables complex patterns to be
searched efficiently. Well-known applications such as the grep
text filter, the vi text editor, and many script-based
programming languages support regular expressions.
The character class “[A-Q]” in a regular expression
matches any single character between A and Q. The wildcard
character “.” matches any one character. Simple character
repetitions can be described using the expressions “a?,”
meaning that “a” occurs zero times or once; “a*,” meaning
that “a” is repeated zero or more times; and “a+,”
meaning that “a” is repeated at least once. If the repetition
special characters “?,” “+,” and “*” are used in a DFA, the
number of states and transitions increases. A technique
addressing this problem is described later in this paper.
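The syntax described above can be checked with a short sketch. Python's standard re module is used here purely for illustration; it is not the matcher studied in this paper, and the test strings are assumptions chosen for the example.

```python
import re

# Each entry: (pattern, strings expected to match in full, strings expected not to).
examples = [
    (r"[bc]ook", ["book", "cook"], ["look"]),
    (r"[A-Q]",   ["A", "K", "Q"],  ["R", "a"]),
    (r".",       ["x", "7"],       [""]),
    (r"a?",      ["", "a"],        ["aa"]),
    (r"a*",      ["", "a", "aaa"], ["ab"]),
    (r"a+",      ["a", "aaa"],     [""]),
]

def check(pattern, accepted, rejected):
    """True if the pattern fully matches every accepted string and no rejected one."""
    ok_accept = all(re.fullmatch(pattern, s) is not None for s in accepted)
    ok_reject = all(re.fullmatch(pattern, s) is None for s in rejected)
    return ok_accept and ok_reject

results = {pattern: check(pattern, acc, rej) for pattern, acc, rej in examples}
for pattern, ok in results.items():
    print(f"{pattern!r}: {'as described' if ok else 'unexpected'}")
```

re.fullmatch is used so that each pattern must cover the whole test string, which makes the difference between “a?,” “a*,” and “a+” visible.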
2.2 Finite automaton
A finite automaton is a mathematical model with discrete
inputs and outputs. The destination state of each transition (in
some cases the same as the current state) is always unique in
all states. One of the states is the initial state, and some of
the states are final states. The input characters of a string
are processed starting from the initial state. The automaton
moves from one state to another, one character at a time,
according to the input characters. This state transition process
repeats until a final state is reached or the input string is
exhausted.
2.2.1 Nondeterministic Finite Automaton
An NFA and a DFA are pattern matching automata that
handle a set of regular expressions. An NFA permits multiple
state transitions. Namely, multiple states can be activated in an
NFA when an input character is received. Fig. 1 shows the state
transition graph of an NFA generated for the regular
expression “c(a|b)*a.” The circles and arrows in Fig. 1
represent states and transitions, respectively. In this figure,
state q0 is the initial state, and state q2 is the final state.
In Fig. 1, two arrows labeled “a” leave q1. Therefore, if the
character “a” is received as an input, both q1 and q2 become
active.
The details of the activation process are as follows. When
the NFA of “c(a|b)*a” in Fig. 1 processes the input text “cba,”
it executes the state transitions
(q0) → (q1) → (q0, q1) → (q0, q1, q2)
In this case, the maximum number of active states is three.
Theoretically, an NFA requires the widest bandwidth in
accessing memory. This characteristic degrades the processing
throughput because multiple memory accesses are required to
advance the matching process from one character to the next,
especially when processing complex regular expressions. This is
a disadvantage of an NFA. However, an NFA has the advantage
that its memory footprint is smaller because it has fewer
states and transitions than a DFA. This difference is evident
when comparing the state transition graphs of the NFA in Fig. 1
and the DFA in Fig. 2.
2.2.2 Deterministic Finite Automaton
The DFA for the regular expression “c(a|b)*a,” which is
the same as the example for an NFA, is shown in Fig. 2. Unlike
an NFA, every state in a DFA has only one outgoing transition
for each character. Therefore, the numbers of states and
transitions of a DFA are larger than those of an NFA. In
particular, a DFA consumes a larger memory footprint for
processing the repetition special characters because all
possible states and transitions generated by the special
characters must be stored in memory. This is a disadvantage of
a DFA. However, a DFA always has only one active state because
each state has one outgoing transition for each character. This
feature is an advantage of a DFA in maximizing the matching
throughput and minimizing the processing time.
Fig. 2: The DFA for “c(a|b)*a”
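The difference in active states between the two automata can be sketched as follows. This is an illustrative reading, not the authors' implementation: the transition table encodes the NFA of Fig. 1 as described in the text (q0 to q1 on “c”; q1 to itself on “a” and “b”; q1 to q2 on “a”), the keep_start flag is an assumption that models restarting the match at every input position, and the DFA is obtained with the standard subset construction. All function names are hypothetical.

```python
# NFA for "c(a|b)*a" as described in the text for Fig. 1.
NFA = {
    ("q0", "c"): {"q1"},
    ("q1", "a"): {"q1", "q2"},
    ("q1", "b"): {"q1"},
}

def step(delta, active, ch):
    """Follow every transition labelled ch from every active state."""
    nxt = set()
    for s in active:
        nxt |= delta.get((s, ch), set())
    return nxt

def run_nfa(delta, start, text, keep_start=False):
    """Return the active-state sets: the initial one, then one per character."""
    active = {start}
    trace = [set(active)]
    for ch in text:
        active = step(delta, active, ch)
        if keep_start:  # assumption: a new match may begin at any position
            active |= {start}
        trace.append(set(active))
    return trace

def subset_construction(delta, start, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    first = frozenset({start})
    dfa, todo, seen = {}, [first], {first}
    while todo:
        cur = todo.pop()
        for ch in alphabet:
            nxt = frozenset(step(delta, cur, ch))
            dfa[(cur, ch)] = nxt  # exactly one target per (state, character)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return dfa, seen

anchored = run_nfa(NFA, "q0", "cba")
unanchored = run_nfa(NFA, "q0", "cba", keep_start=True)
dfa, dfa_states = subset_construction(NFA, "q0", "abc")

print("NFA max active (anchored):  ", max(len(s) for s in anchored))
print("NFA max active (unanchored):", max(len(s) for s in unanchored))
print("DFA states / transitions:   ", len(dfa_states), "/", len(dfa))
```

With keep_start=True the largest active set on “cba” contains three states, the same maximum the text reports, although the intermediate sets depend on this assumption. The DFA built here includes a dead state (the empty set), so its counts need not match Fig. 2; in every step it has exactly one active state.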
2.2.3 Nondeterministic Finite Automaton with ε transitions
An NFA can be extended to allow state transitions on the
null character ε. An ε transition permits a transition without
consuming an input character. Fig. 3 shows the ε-NFA transitions
generated from regular expressions. These state transition
graphs correspond to atomic regular expressions. The filled-in
circles in Fig. 3 represent accepting states, which mean that
the input pattern matches the target regular expression. In the
proposed MFA, this translation is performed first.
Fig. 3: Conversion from the regular expression to an NFA
with ε transitions
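A conversion of this kind can be sketched with Thompson-style fragments. The exact fragments of Fig. 3 are not reproduced here; the construction below is the textbook one, and every name in it (literal, concat, union, star, eps_closure, accepts) is a hypothetical helper introduced for illustration.

```python
import itertools

EPS = None                 # label used for ε transitions
_ids = itertools.count()   # generator of fresh state numbers

def literal(ch):
    """Fragment for one character: start --ch--> accept."""
    s, t = next(_ids), next(_ids)
    return s, t, [(s, ch, t)]

def concat(f, g):
    """Fragment for fg: f's accept state is linked to g's start state by ε."""
    return f[0], g[1], f[2] + g[2] + [(f[1], EPS, g[0])]

def union(f, g):
    """Fragment for f|g: new start and accept states joined by ε transitions."""
    s, t = next(_ids), next(_ids)
    edges = f[2] + g[2] + [(s, EPS, f[0]), (s, EPS, g[0]),
                           (f[1], EPS, t), (g[1], EPS, t)]
    return s, t, edges

def star(f):
    """Fragment for f*: ε edges allow skipping f or repeating it."""
    s, t = next(_ids), next(_ids)
    edges = f[2] + [(s, EPS, f[0]), (s, EPS, t),
                    (f[1], EPS, f[0]), (f[1], EPS, t)]
    return s, t, edges

def eps_closure(states, edges):
    """All states reachable from `states` using only ε transitions."""
    closure, stack = set(states), list(states)
    while stack:
        u = stack.pop()
        for a, label, b in edges:
            if a == u and label is EPS and b not in closure:
                closure.add(b)
                stack.append(b)
    return closure

def accepts(fragment, text):
    """Simulate the ε-NFA on the whole of `text`."""
    start, accept, edges = fragment
    active = eps_closure({start}, edges)
    for ch in text:
        moved = {b for a, label, b in edges if a in active and label == ch}
        active = eps_closure(moved, edges)
    return accept in active

# "(a|b)*a" assembled from the atomic fragments above.
frag = concat(star(union(literal("a"), literal("b"))), literal("a"))
print(accepts(frag, "bba"), accepts(frag, "ab"))
```

Each operator adds at most two states and a handful of ε edges, which keeps the ε-NFA small; the cost is moved to matching time, where the ε closure must be followed after every character.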
3 Mixed Finite Automaton
3.1 Purpose
As mentioned above, string matching algorithms for a
regular expression are generally implemented using either an
NFA or a DFA. However, these finite automata both have their
own pros and cons. The characteristics of an NFA and a DFA
can be summarized in Table 1. It is necessary to consider the
trade-off between the memory footprint and the maximum
number of active states, namely between the size and
processing throughput.
Table 1: Features of an NFA and a DFA
                    NFA      DFA
Space complexity    Small    Large
Time complexity     Large    Small
The proposed MFA is adaptable to a variety of network
applications. An MFA combines an NFA and a DFA flexibly
according to the available memory footprint, and enables a
change in the mixing ratio of both. It is possible to balance the
memory footprint and the maximum number of active states by
mixing an NFA and a DFA. An MFA observes this strategy,
and by using the following dedicated algorithm, can maximize
the total performance under a limited and varying memory
space based on the target applications.
3.2 Algorithm
This section describes the method for combining an NFA
and a DFA in an MFA. Initially, a regular expression is divided
into two groups of atomic regular expressions. The first group
is converted into a DFA, and the other group is converted into
an NFA. This conversion order is effective from the viewpoint
of memory access throughput, as a simple example shows.
Consider a regular expression consisting of two atomic regular
expressions. If the basic syntax of these two expressions is the
same, there is no difference in the resulting size between the
case converted into DFA+NFA and the case converted into NFA+DFA.
However, the number of active states differs between them. In
the DFA+NFA case, the first DFA part has one active state, and
the NFA part increases the number of active states. The total
number of active states therefore depends only on the last NFA
group. In the NFA+DFA case, the first NFA part increases the
number of active states, and the last DFA part maintains that
number. Namely, the total number of active states is about twice
that of the first NFA part. This is an approximate estimation:
in fact, there is one active state in the initial state, and
this number gradually increases. Even so, the total number of
active states is larger than in the DFA+NFA case. This is proved
using a simple birth process under a multidimensional Markov
diffusion process.
The regular expression “(a|b)*a(a|b)*a” is used as an
example of MFA generation. To generate an MFA, this regular
expression is separated into atomic regular expression parts.
The atomic regular expressions are shown in Fig. 3. Each
atomic regular expression can be converted into an NFA or a
DFA. The given regular expression is divided into the atomic
parts r1 = “(a|b)*,” r2 = “a,” r3 = “(a|b)*,” and r4 = “a.”
These four atomic regular expressions are then grouped into a
DFA group and an NFA group.
An MFA can adjust the mixing ratio of an NFA and a DFA
flexibly by controlling the boundary between the NFA and DFA
groups, as shown in Fig. 4. Shifting the share of the expression
handled by the NFA group and by the DFA group provides the
flexibility of the mixing ratio in an MFA. An MFA can
maximize the processing throughput by considering the
memory consumption under the given memory footprint. An
application can thus use matching automata optimized for the
given regular expression by using an MFA.
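The effect of moving the boundary can be illustrated with a small sketch. It is not the authors' construction: it assumes an ε-free NFA, determinizes only the states assigned to the DFA group (the first group) with the standard subset construction, and leaves the remaining states as NFA states. The expression “(a|b)*a(a|b)(a|b)” and all function names are assumptions for illustration; this expression was chosen because its DFA grows noticeably.

```python
from itertools import chain

def step(delta, states, ch):
    """Union of the targets of all transitions labelled ch leaving `states`."""
    return set(chain.from_iterable(delta.get((s, ch), ()) for s in states))

def mix(delta, start, alphabet, dfa_group):
    """Determinize only the NFA states in dfa_group; keep the rest as NFA states.

    Assumes dfa_group is empty or contains `start`, and that no transition
    leads from the NFA group back into dfa_group.
    """
    dfa_group = set(dfa_group)
    if start not in dfa_group:          # pure NFA: nothing to determinize
        return dict(delta), start
    mixed = {}
    first = frozenset({start})
    todo, seen = [first], {first}
    while todo:
        cur = todo.pop()
        for ch in alphabet:
            targets = step(delta, cur, ch)
            inside = frozenset(targets & dfa_group)
            out = set(targets - dfa_group)   # hand-over into the NFA group
            if inside:
                out.add(inside)
                if inside not in seen:
                    seen.add(inside)
                    todo.append(inside)
            if out:
                mixed[(cur, ch)] = out
    for (s, ch), targets in delta.items():   # NFA-group transitions are copied
        if s not in dfa_group:
            mixed[(s, ch)] = set(targets)
    return mixed, first

def measure(mixed, start, text):
    """Return (states, transitions, maximum number of active states on text)."""
    states = {start} | {s for s, _ in mixed} | set(chain.from_iterable(mixed.values()))
    transitions = sum(len(t) for t in mixed.values())
    active, peak = {start}, 1
    for ch in text:
        active = step(mixed, active, ch)
        peak = max(peak, len(active))
    return len(states), transitions, peak

# Hypothetical ε-free NFA for "(a|b)*a(a|b)(a|b)".
NFA = {
    ("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"}, ("q1", "b"): {"q2"},
    ("q2", "a"): {"q3"}, ("q2", "b"): {"q3"},
}
ORDER = ["q0", "q1", "q2", "q3"]

results = []
for k in range(len(ORDER) + 1):          # k = number of states in the DFA group
    mixed, start = mix(NFA, "q0", "ab", ORDER[:k])
    results.append((k,) + measure(mixed, start, "aaab"))
for k, n_states, n_trans, peak in results:
    print(f"boundary {k}: states={n_states} transitions={n_trans} max active={peak}")
```

In this sketch, moving the boundary from the pure NFA (k = 0) to the pure DFA (k = 4) raises the number of states from 4 to 8 and the number of transitions from 7 to 16, while the maximum number of active states on the sample input falls from 4 to 1. The intermediate boundaries fall between these extremes, which is the kind of adjustment the mixing ratio is meant to provide.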
Figs. 5 and 6 show automata of an NFA and a DFA
generated from the given regular expression. Through Figs. 5
and 6, the maximum number of active states of an NFA and a
DFA can be counted. For an evaluation, the memory footprint
is defined as the number of states and transitions. The NFA of
Fig. 5 has three states and six transitions, whereas the DFA of
Fig. 6 has five states and ten transitions. Hence, the memory
footprint of a DFA is larger than that of an NFA in terms of the
numbers of both the states and transitions.
Fig. 4: Characteristics of MFAs with different boundaries
between NFA and DFA groups
3.3 Examples of Mixed FA
This section describes an example of a generated MFA.
As in Section 3.2, “(a|b)*a(a|b)*a” is used as the input
regular expression. This regular expression is divided into four
atomic regular expression parts, i.e., r1 through r6. As an
example of an MFA, “r1r2r3” is converted into an NFA, and
“r4r5r6” is converted into a DFA.
Fig. 7 shows an example of an MFA composed by
combining the states and transitions of Figs. 5 and 6. In mixing
the DFA and NFA, the composed MFA selects and mixes the
dotted parts of the NFA of Fig. 5 and the DFA of Fig. 6. The MFA
of Fig. 7 supports the regular expression “(a|b)*a(a|b)*a.”
Fig. 5: NFA for “(a|b)*a(a|b)*a”
Fig. 7: MFA for “(a|b)*a(a|b)*a”
In the composed MFA shown in Fig. 7, states q1 and q3
become active states when character “a” is input twice. Namely,
two states become active. In contrast, three states become
active in the NFA of Fig. 5, and the number of active states is
always one in a DFA. Hence, the size of the memory footprint
and the number of active states of an MFA are smaller than
those for a DFA and an NFA, respectively. An MFA can
flexibly adjust the mixing ratio of an NFA and a DFA by
controlling the boundary between NFA and DFA groups. The
number of states, number of transitions and maximum number
of active states are compared in Table 2.
Fig. 6: DFA for “(a|b)*a(a|b)*a”
Table 2: Performance comparison of each automaton
       Number of    Number of      Maximum number
       states       transitions    of active states
NFA    3            6              3
MFA    4            7              2
DFA    5            10             1
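Choosing an automaton under a memory limit can be sketched using the values of Table 2. The selection rule below is an assumption for illustration, not the authors' algorithm: the footprint is taken as the sum of the numbers of states and transitions (the text defines the footprint through these two counts, and adding them is an assumption), and the candidate with the fewest active states that fits the budget is selected. The names footprint and choose are hypothetical.

```python
# Values copied from Table 2.
TABLE2 = {
    "NFA": {"states": 3, "transitions": 6,  "max_active": 3},
    "MFA": {"states": 4, "transitions": 7,  "max_active": 2},
    "DFA": {"states": 5, "transitions": 10, "max_active": 1},
}

def footprint(row):
    """Assumed footprint measure: number of states plus number of transitions."""
    return row["states"] + row["transitions"]

def choose(candidates, budget):
    """Pick the candidate with the fewest active states whose footprint fits."""
    fitting = {name: row for name, row in candidates.items()
               if footprint(row) <= budget}
    if not fitting:
        return None  # nothing fits the budget
    return min(fitting,
               key=lambda name: (fitting[name]["max_active"],
                                 footprint(fitting[name])))

for budget in (8, 9, 12, 15):
    print(budget, choose(TABLE2, budget))
```

With these numbers a budget of 12 selects the MFA, because the DFA (footprint 15) no longer fits while the MFA (footprint 11) still has fewer active states than the NFA.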
4 Evaluation
4.1 Experimental environment
In this evaluation, an NIDS application is used to
compare an MFA with an NFA and a DFA.
Snort is a well-known NIDS application, as described in
Section 1. Snort rule sets consist of regular expressions for
detecting the signatures of malformed messages in a network
so as to prevent security attacks. Hence, the regular
expressions used in Snort are an appropriate benchmark rule set
for this evaluation. We extracted the benchmark rule sets from
Snortrules-snapshot-2970 of Snort ver. 2.9 [2]. The bounded
repetition “a{n},” which means n iterations of “a,” is only
partially supported. When more than ten iterations are
specified, the repetition is replaced by the regular expression
character “+.” When fewer than ten iterations are specified, the
repetition is handled as specified.
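The handling of counted repetitions described above can be sketched as a preprocessing step. This is a hypothetical helper, not the authors' code: it only recognizes the plain form “{n},” and it assumes that a count of exactly ten is kept unchanged, a case the text does not specify.

```python
import re

def relax_counts(pattern, limit=10):
    """Replace {n} by + when n exceeds `limit`; keep smaller counts unchanged.

    Assumptions: only the plain form {n} is handled (not {n,m}), and a count
    equal to `limit` is kept as written.
    """
    def repl(match):
        return "+" if int(match.group(1)) > limit else match.group(0)
    return re.sub(r"\{(\d+)\}", repl, pattern)

print(relax_counts(r"\d{32}"))
print(relax_counts(r"malware(\w|\s)*\d{10}"))
```

Replacing a large count by “+” accepts more strings than the original rule, so the rewritten pattern is an over-approximation that trades precision for a smaller automaton.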
The regular expression processor is implemented in C++ and
compiled with g++ version 4.4.7. The regular expression
processor used in this experiment can generate not only an MFA
but also an NFA and a DFA as special cases of an MFA. The number
of states and transitions, the maximum number of active states,
the configuration time, and the computation time of each
automaton were evaluated using the regular expression
processor. The configuration time of the automata denotes the
time required to generate an MFA for a target regular expression.
Table 3 shows the regular expression patterns extracted as a
benchmark rule set from the Snort rule sets. These patterns were
randomly selected from Snortrules-snapshot-2970 and were input
to the regular expression processor one after another. The
results are described in the next section.
Table 3: Regular expression patterns extracted from
Snortrules-snapshot-2970 [2]
(\s*|\s*\r?\n\s+)
malware(\w|\s)*\d{10}
.*Root\x2User-cgi\x2f.*\x2ecgi[a-z0-9]+
\s\w+\s\d+\r?\n[^\n]*
(no|up|\d+\x2e\d+\x2e|d+\x2e\d+)
.PHP[a-z]+[a-f0-9]+[a-z]+=.*[a-z]+=.*[a-z]
\w+\x3b.*\x3b.*\x3b
.*aspn\x2fvgi-bin\x2f.*\x2ecgi[a-z0-9]+
User-Agent[^\n]*\x2eDIAN
Server\x3a[^\r\n]*Root{^\r\n]*Kit[^\r\n]*Scaner
4.2 Experimental results
Fig. 8 shows the number of states, the number of
transitions, and the maximum number of active states of an
NFA, a DFA, and an MFA.
Fig. 8: The number of states and transitions, and the
maximum number of active states of each automaton
(series: NFA, MFA, DFA)
As shown in Fig. 8, even in the case of the Snort rules, the
trends in the number of states, the number of transitions, and
the maximum number of active states are the same as in the
simple examples shown in Figs. 5 and 6. Namely, the
numbers of states and transitions of a DFA are larger than those
of an NFA. In addition, the maximum number of active states
of an NFA is larger than that of a DFA. In both cases, an MFA
ranks between an NFA and a DFA. This result shows that the
proposed MFA can provide the benefits of both an NFA and a
DFA by considering the trade-off between the size and the
throughput.
Next, we show that an MFA can adjust the memory
footprint and the maximum number of active states. In this
experiment, 100 KB of random text data was used. The regular
expression was divided into ten parts and then input to the
regular expression processor of the MFA. The configuration and
computation times of the MFA are shown in Fig. 9. In the
figure, the x axis indicates the mixing ratio of the DFA and
NFA. The number of states, the number of transitions, and the
maximum number of active states for varying mixing ratios of the
NFA and DFA are compared in Fig. 10.
Fig. 9: Configuration and computation times of mixed FA
when changing the ratio
(axes: configuration time (μs), computation time (μs))
Fig. 10: The number of states and transitions, and the
maximum number of active states of a mixed FA when
changing the ratio
(series: number of states, number of transitions, maximum
number of active states)
As shown in Fig. 9, the configuration time of the
automaton increased as the DFA ratio increased. In contrast,
the computation time was reduced. Thus, an MFA can vary the
computation time. Fig. 10 demonstrates that the numbers of
states and transitions gradually increased as the DFA ratio
increased. Namely, the memory footprint becomes larger in this
case. In contrast, the maximum number of active states was
reduced. Therefore, the computation time in Fig. 9 was improved.
Based on the above description, an MFA enables the
memory footprint and the maximum number of active states to
be adjusted for optimizing the total performance under the
limited memory space available in a target application.
5 Conclusion
In this paper, an MFA, a new automaton combining an NFA
and a DFA, was proposed, implemented, and evaluated using
the Snort rule set. An MFA combines the existing string
matching programs of an NFA and a DFA by changing their
mixing ratio. The results indicate that the proposed MFA can
optimize the matching throughput under the required memory
footprint size by combining an NFA and a DFA. An MFA
enables a trade-off between the memory footprint and
processing throughput, and varies both. Therefore, it can
maximize the processing throughput while fulfilling the
conditions of the memory size. An MFA has the potential to be
applied to various applications owing to its flexibility.
ACKNOWLEDGEMENT
This work was partially supported by the funds of SECOM
Science and Technology Foundation, and by MEXT/JSPS
KAKENHI Grant (B) Number 24360230 and 25280033.
6 References
[1] S. Kumar, B. Chandrasekaran, and J. Turner, “Curing
regular expressions matching algorithms from Insomnia,
Amnesia, and Acalculia,” In Proc. of ANCS’07, pp. 155-164,
ACM.
[2] Snort. http://www.snort.org
[3] T.T. Hieu, T.N. Thin, T.H. Vu, S. Tomiyama,
“Optimization of Regular Expression processing circuits for
NIDS on FPGA,” Second International Conference on
Networking and Computing, 2011
[4] J. Shangjie, L. Mejian, “Research and Design of
Preprocessor plugin based on PCRE under Snort Platform,”
Control, Automation and Systems Engineering (CASE), 2011
International Conference, IEEE, 30-31 July 2011
[5] C. Liu, J. Wu, “Fast Deep Packet Inspection with a Dual
Finite Automata,” IEEE Transactions on Computers, vol. 62,
no. 2, Feb. 2013
[6] J.E. Hopcroft, J.D. Ullman, “Introduction to Automata
Theory,” Addison Wesley, 1979
[7] M. Becchi, P. Crowley, “A Hybrid Finite Automata for
Practical Deep Packet Inspection,” CoNEXT 2007.
[8] J. Zhang, D. Zhang, K. Huang, “A Regular Expression
Matching Algorithm Using Transition Merging,” 2009 15th
IEEE Pacific Rim International Symposium on Dependable
Computing.
[9] RE2. https://github.com/google/re2