
Doctoral Dissertation Proposal: Acceleration of Network
Processing Algorithms
Sailesh Kumar
Washington University
Computer Science and Engineering
St. Louis, MO 63130-4899
+1-314-935-4306
[email protected]
Research advisor: Jonathan S. Turner
ABSTRACT
Modern networks need to process and forward an increasingly
large volume of traffic and the growth in the number of packets
often outpaces the improvements in the processor, memory and
software technology. Consequently, there is a persistent interest in
novel network algorithms which can implement the network
features more efficiently. In this proposal, we present several new
algorithms to accelerate and optimize the implementation of
three core network functionalities, namely: i) packet buffering and
forwarding, ii) packet header processing, and iii) packet payload
inspection.
1. INTRODUCTION
Modern networking devices perform an array of operations
upon receiving a packet. These operations have to be
finished within a limited time budget in order to maintain a
high packet throughput and low processing latency. There
are two trends which put additional pressure on the
performance: i) new features are regularly added in today’s
networks, many of which are employed on a per packet
basis, and ii) the rate of increase in the packet arrival rates
generally outpaces the rate at which hardware and memory
technology advances. Due to these performance pressures,
it becomes critical to implement various network features
efficiently. Clearly, it is crucial to efficiently implement and
optimize any new feature; additionally, the existing features
also need to be improved and updated with the advances in
the technology. A good implementation of any network
feature requires a good understanding of both the classical
algorithmic methods and the hardware and system
technology, thereby making it an interesting research
problem. Additionally, due to their importance, these
implementation methods have received enormous
attention in the networking research community.
Two core network features which have remained the focus
of the researchers are: i) packet buffering and scheduling,
which generally involves fast packet buffering mechanisms
coupled with a queuing and scheduling system, and ii) packet
header processing, which includes header lookup operations
in order to determine the next hop for the packet and packet
classification in order to prioritize the packets based upon
the source and destination addresses and the protocol. A
third class of network feature which has recently seen a
wide adoption is deep packet inspection, in which every
byte of the packet payload is examined in order to search
for a set of pre-defined patterns. Deep packet inspection is
often used in the emerging application layer packet
forwarding applications and intrusion detection systems.
Due to the importance and broad deployment of these three
network features, a collection of novel methods have been
proposed to implement them efficiently. These methods
often consider the constraints and capabilities of the current
hardware platforms and involve a complex mix of ideas
drawn from theoretical computer science (algorithms and
data structures), and system and hardware technology.
Since the hardware and system technology evolves rapidly,
there is a constant need to upgrade these implementations,
nevertheless there is also room to improve them in a more
abstract and theoretical sense. In this proposal, we intend to
undertake these tasks for the three network features
mentioned above. More specifically, we intend to evaluate
the existing methods to implement them, and propose novel
algorithms and mechanisms to improve their performance
on the next generation memory subsystems and hardware
platforms. Our aim is to split the efforts evenly between the
combination of the packet buffering and header lookup
features and the deep packet inspection feature.
The first two network features we are focusing on have
already been comprehensively studied, and it appears that
there is little room for any fundamental improvement.
However, the evolution of new implementation platforms
like network processors has opened up several opportunities
for novel ideas and methods of implementation. Network
processors are software-programmable devices and their
feature sets are specifically targeted for the networking
applications. They sport a collection of memory banks,
running at different operating frequencies thereby creating
memories of different bandwidth and access latency. The
storage capacity of these memories is also diverse; a
general trend is that larger memories have relatively lower
bandwidth and higher access latency.
The presence of such a diverse collection of memories
presents new levels of challenges and opportunities in
developing the memory sub-system. For example, if the
data-structures used in the memory intensive features like
packet buffering and header lookup, are spread out across
various memories and the fast but small memories are
prudently used, then the performance can be dramatically
enhanced. Consequently, one of the objectives of the
research is to develop innovative ways of distributing the
data-structure across different memories such that both the
total available bandwidth and space are uniformly utilized.
The distribution mechanism can either be static, in which
the memories are pre-allocated to different data-structure
segments, or dynamic, in which portions of the data-structure
are allowed to migrate from one memory
to the other. Traditional caches, which often improve the
average-case performance, are one form of such dynamic
mechanism.
Current network processor devices also contain specialized
engines, like hash accelerators and content addressable
memory (CAM), which offer further opportunities. Popular
hash based techniques like hash tables and Bloom filters
can now be cost-effectively employed. The presence of
hashing eases the use of the randomized methods which can
provide strong probabilistic performance guarantees at a
reduced cost. CAM, on the other hand, opens up the
opportunity to easily employ an associative caching
scheme, which can greatly improve the average-case
performance of the queuing and buffering features. In this
research proposal, we also aim to explore these possibilities
of utilizing various specialized hardware capabilities in
improving the performance of the network features.
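As an illustration of how such hash-based structures can be used, the sketch below implements a minimal Bloom filter in software; it is not tied to any particular network processor, and a general-purpose hash (SHA-256) stands in for a hardware hash accelerator. All names and parameters are illustrative.

import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash functions set/test k bits in an
    m-bit vector. False positives are possible; false negatives are not."""
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("10.0.0.0/8")
print("10.0.0.0/8" in bf)      # True
print("192.168.0.0/16" in bf)  # almost surely False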
The third network feature, deep packet inspection, which is
one of our primary research focuses, has recently gained
widespread adoption. The key reason is that many emerging
network services now handle packets based on payload
content, in addition to the structured information found in
packet headers. Forwarding packets based on content
requires new levels of support in networking equipment,
wherein every byte of the packet payload is inspected in
addition to examining the packet headers. Traditionally, this
deep packet inspection has been limited to comparing
packet content to sets of strings. However, newly emerging
systems are replacing string sets with regular expressions,
due to their increased flexibility and expressiveness.
Several content inspection engines have recently migrated
to regular expressions, including: Snort [5], Bro [4],
3Com’s TippingPoint X505 [20], and various network
security appliances from Cisco Systems [21]. Additionally,
layer 7 filters based on regular expressions [30] are
available for the Linux operating system.
While flexible and expressive, regular expressions
traditionally require substantial amounts of memory, and
the state-of-the art algorithms to perform regular expression
matching are unable to keep up with the ever increasing link
speeds. To see why, we must consider how regular
expressions are implemented. A regular expression is
typically represented by a finite automaton (FA). FA can be
of two basic types: non-deterministic finite automaton
(NFA) and deterministic finite automaton (DFA). The
distinction between a NFA and a DFA is that a NFA can
potentially make multiple moves on an input symbol, while
a DFA makes a single move on any given input symbol.
A DFA therefore delivers deterministic, high
performance, as there is a single active state at any point in
time. However, the state space blowup problem in DFAs
appears to be a serious issue, and limits their practical
applicability.
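To make the operational difference concrete, the following sketch (ours, with hypothetical transition tables) simulates both machine types: the DFA follows exactly one transition per input character, while the NFA must carry a set of active states forward.

def run_dfa(delta, start, text):
    """delta[state][ch] -> next state: a single active state at all times."""
    state = start
    for ch in text:
        state = delta[state][ch]            # exactly one move per symbol
    return state

def run_nfa(delta, start, text):
    """delta[(state, ch)] -> iterable of next states: several moves possible."""
    active = {start}
    for ch in text:
        nxt = set()
        for s in active:
            nxt.update(delta.get((s, ch), ()))   # possibly many next states
        active = nxt
    return active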
The proposed research will focus mostly on the algorithmic
solutions to the problem, aiming at developing innovative
architectures which can efficiently implement the current
and future regular expressions. We propose to begin the
research by systematically studying the trade-offs involved
in the traditional regular expressions implementation and
the hardware capabilities needed to execute any given finite
automaton. A preliminary analysis suggests that the current
hardware technologies are capable of executing machines
which are much more complex than finite automata. Such
machines can trade-off space and performance much more
effectively than the finite automaton and fully utilize the
capabilities of the current hardware platforms. These
machines can also employ probabilistic methods to improve
the performance because the likelihood of completely
matching the patterns which are used in networking
applications is remote. One of our main objectives in this
proposal is to develop such novel machine architectures.
Another important objective of the research is to develop
algorithms to efficiently store a given machine (e.g. a finite
automaton, a push-down automaton or any new machine)
into the memory. The prime concerns are memory usage
and bandwidth required to execute the machine. Traditional
table compression algorithms are known to be inefficient in
packing the finite automatons which are used in the
networking systems. Recently developed CD2FA approach
appears promising, however its applicability to more
complex machines is not yet known, therefore, we intend to
extend these schemes so that they can be applied to more
general machines.
An orthogonal research objective is to investigate the
possibilities to further reduce the memory by eliminating
the overheads involved in explicitly storing the transitions.
Traditionally, a transition of an automaton requires log₂ n
bits, where n is the total number of states. It appears that,
with the aid of hashing techniques, each node can be
represented with fewer bits, thus the number of bits needed
to store a transition can be reduced. Our preliminary
analysis suggests that where conventional methods require
20 bits to represent each transition in a one-million-node
machine, this technique will require only 4 bits.
To summarize, in this proposal we plan to work on three
important network functions. For each function, we
plan to evaluate the existing implementation methods and
the challenges that arise with the introduction of new
platforms. Specifically, for each feature, we plan to undertake
the following tasks:
1. Packet buffering and queuing (25% of the total effort)
1.1. How to use randomization techniques to improve the buffering performance
1.2. How to build an efficient buffering sub-system with a collection of memories of different size, bandwidth and access latency
1.3. Hashing and Bloom filter based buffering and caching sub-systems
2. Header lookup (25% of the total effort)
2.1. Architecture of header lookup engines capable of supporting terabit data throughput
2.2. Algorithms to compress the lookup data-structure, and the implications of caching on lookup performance
3. Deep packet inspection (50% of the total effort)
3.1. Evaluation of the patterns used in current deep packet inspection systems
3.2. Trade-off between using NFAs and DFAs; analysis of the worst- and average-case performance
3.3. Evaluation of intermediate approaches like lazy DFA
3.4. Introduction of novel machines, potentially different from finite automata, which are capable of performing regular expression matching
3.5. Memory compression schemes, e.g. delayed input DFAs (D2FA) and content addressed delayed input DFAs (CD2FA)
The remainder of this proposal is organized as follows.
Section 2 presents background and related work for packet
buffering and queuing, packet header lookup, and deep packet
inspection. Section 3 describes new research directions for
packet header lookup, and Section 4 examines the limitations
of the traditional approaches to regular expression matching
that motivate our proposed work on deep packet inspection.
2. BACKGROUND AND RELATED WORK
We split this section into three subsections. Each subsection
will cover some of the relevant related work for the “packet
buffering and queuing”, “packet header lookup” and “deep
packet inspection”, respectively.
2.1 Packet buffering and queuing
Packet buffers in routers require substantial amounts of
memory to store packets awaiting transmission. Router
vendors typically dimension packet storage subsystems to
have a capacity at least equal to the product of the link
bandwidth and the typical network round-trip delay. While
a recent paper [x] has questioned
the necessity of such large amounts of storage,
current practice continues to rely on the bandwidth-delay
product rule. The amount of storage used by routers is large
enough to require the use of high density memory
components. Since high density memories like DRAM have
limited random access bandwidth and short packets are
common in networks, it has become challenging for them to
keep up with the continuously increasing link bandwidths.
A number of architectures have been proposed to buffer
packets at such high link rates.
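As a back-of-envelope illustration of the bandwidth-delay product rule, the snippet below computes the buffer size for an assumed 40 Gb/s link and a 250 ms round-trip time; the figures are examples rather than measurements.

# Buffer sizing under the bandwidth-delay product rule (assumed figures).
link_rate_bps = 40e9          # 40 Gb/s link
rtt_seconds   = 0.25          # 250 ms nominal round-trip delay

buffer_bits  = link_rate_bps * rtt_seconds
buffer_bytes = buffer_bits / 8
print(f"Required buffer: {buffer_bytes / 1e9:.2f} GB")   # ~1.25 GB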
In reference [x], the authors propose the ping-pong buffer,
which can double the random access bandwidth of a memory-based
packet buffer. Such a buffer has been shown to
exhibit good utilization properties; the memory utilization
remains as high as 95% in practice. In references [x][x],
Iyer et al. have shown that a hybrid approach combining
multiple off-chip memory channels with an on-chip SRAM
can deliver high performance even in the presence of worst-case
access patterns. The on-chip SRAM is used to provide
a moderate amount of fast, per-queue storage, while the off-chip
memory channels provide bulk storage. Unfortunately,
the amount of on-chip SRAM needed grows as the product
of the number of memory modules and the number of
queues, making it practical only when the number of
individual queues is limited.
More recently, multichannel packet storage systems [x][x]
have been proposed that use randomization to enable high
performance in the presence of arbitrary packet retrieval
patterns. Such an architecture requires an on-chip SRAM,
whose size is proportional only to the number of memory
modules and doesn’t grow as the product of the number of
memory modules and the number of queues, making it
practical for any system irrespective of the number of
queues. It has been shown that, even for systems which use
DRAM memories with a large number of banks, the overall
on-chip buffering requirement depends mostly on the
number of channels and not on the product of the number of
channels and the number of banks, thereby making such an
approach highly scalable.
While packet storage is important, modern routers often
employ multiple queues to store packets. These queues are
used to implement various packet scheduling policies, QoS,
and other types of differentiated services applied to packet
aggregates. The problem of scheduling real-time messages
in packet switched networks has been studied extensively.
Practical algorithms can be broadly classified as either
timestamp or round-robin. Timestamp-based algorithms [x]
try to emulate GPS [x] by sending packets in approximately
the same order as a reference GPS server would. They
involve computing timestamps for the various queues and
sorting them in increasing order. Round-robin schedulers
[x] avoid the sorting bottleneck by assigning time slots to
the queues and transmitting, from the queue in the current
slot, multiple packets with cumulative size up to a
maximum-sized packet.
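The sketch below illustrates the round-robin idea with a deficit-style scheduler: each backlogged queue accumulates a quantum of credit per round and transmits packets whose cumulative size fits within that credit, so no timestamp sorting is needed. The queue contents and quantum are made-up example values.

from collections import deque

def drr_schedule(queues, quantum, rounds):
    """Deficit round-robin sketch: queues is a list of deques of packet sizes
    (bytes); each round a queue may send packets whose cumulative size fits
    within its accumulated deficit counter."""
    deficits = [0] * len(queues)
    sent = []                               # (queue index, packet size) in send order
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0             # idle queues do not hoard credit
                continue
            deficits[i] += quantum
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()
                deficits[i] -= pkt
                sent.append((i, pkt))
    return sent

# Example: three flows with different packet sizes, 1500-byte quantum.
flows = [deque([1500, 1500]), deque([300, 300, 300]), deque([9000])]
print(drr_schedule(flows, quantum=1500, rounds=4))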
Many routers use multiple hierarchies of queues in order to
implement sophisticated scheduling policies, e.g. the first
set of queues may represent physical ports, the second
classes of traffic and the third may consist of virtual
(output) queues. When there are a large number of queues
then off-chip memory is required to store the queuing and
scheduling data-structure which complicates the design of
the buffering and queuing subsystem. In fact, the off-chip
memory can be a significant contributor to the cost of the
queuing subsystem and can place serious limits on its
performance, particularly as link speeds scale beyond 10
Gb/s. A recent series of papers has demonstrated queuing
architectures which alleviate these problems and maintain
a high throughput.
In [x], the authors show how queuing subsystems using a
combination of implicit buffer pointers, multi-buffer list
nodes and coarse grained scheduling can dramatically
improve the worst-case performance, while also reducing
the SRAM bandwidth and capacity needed to achieve high
performance. In [x], the authors propose a cache-based queue
engine, which consists of a hardware cache and a closely
coupled queuing engine to perform queue operations. Such
cache based engines are also available on modern network
processors like Intel IXP series. While such NP based
queuing assists have limited capability, a number of more
sophisticated commercial packet queuing subsystems are
available, which often target QoS and traffic management
applications. They often use custom logic and memory
interfaces to achieve high performance; however, there is a
need for more general and programmable queuing solutions,
which use commodity memory and processing components.
2.2 Packet header lookup
An Internet router processes and forwards incoming packets
based upon the structured information in the packet header.
The next hop for the packet is determined after examining
the destination IP address; this operation is often called IP
lookup. An array of advanced services which determines
the treatment a packet receives at a router examines the
combination of the source and destination IP addresses and
ports; this operation is called packet classification. The
distinction between IP Lookup and Packet classification is
simply that IP Lookup classifies a packet based on a single
field in the header while Packet Classification classifies a
packet based on multiple fields. The core of both functions
consists of determining the longest prefix matching the
header fields within a database of variable length prefixes.
Longest prefix match algorithms have been widely studied.
Well-known mechanisms range from TCAM
[9][10] to Bloom filter [6] and hash table [1] based
schemes. While these hardware-based approaches,
especially TCAM, have been widely adopted, they
generally consume a lot of power. Consequently,
algorithmic solutions have remained one of the interests of
the researchers. Algorithmic solutions often employ a trie to
perform the longest prefix lookup. A trie can be built by
traversing the bits in each prefix from left to right, and
inserting appropriate nodes in the trie. This trie can later
be traversed to perform the lookup operations. A substantial
number of papers have been written in this space to
efficiently implement these tries so that the total memory
consumption can be reduced and the lookup and update
rates can be improved [xxx].
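The sketch below illustrates such a trie-based longest prefix match; it reuses the small prefix set that appears later in Figure 1 (1*, 00*, 11*, 011*, 0100*) and is meant only to show the traversal, not to be an optimized implementation.

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = {}        # '0' or '1' -> TrieNode
        self.next_hop = None      # set if this node ends a valid prefix

def insert(root, prefix_bits, next_hop):
    """Walk the prefix bits left to right, creating nodes as needed."""
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, TrieNode())
    node.next_hop = next_hop

def longest_prefix_match(root, addr_bits):
    """Traverse the trie along the address, remembering the last valid prefix."""
    node, best = root, None
    for b in addr_bits:
        node = node.children.get(b)
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
for p, hop in [("1", "P1"), ("00", "P2"), ("11", "P3"), ("011", "P4"), ("0100", "P5")]:
    insert(root, p, hop)
print(longest_prefix_match(root, "0110"))   # P4 (longest match is 011*)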
With the current memory technology, trie based
implementations of header lookup can easily support a data
throughput of 10 Gbps. However, at 40 Gbps data rates, a
minimum sized 40-byte packet may arrive every 8 ns, and it
may become challenging to perform lookup operations with
a single memory. A number of researchers have proposed a
pipelined trie. Such tries enable high throughput because
when there are enough memory stages in the pipeline, no
stage is accessed more than once for a search and during
each cycle, each stage can service a memory request for a
different lookup.
Recently, Baboescu et al. [21] have proposed a circular
pipelined trie, which is different from the previous ones in
that the memory stages are configured in a circular, multi-point access pipeline so that lookups can be initiated at any
stage. At a high-level, this multi-access and circular
structure enables more flexibility in mapping trie nodes to
pipeline stages, which in turn maintains uniform memory
occupancy. A refined version of circular pipeline called
CAMP has been introduced in [x], which employs relatively
simple method to map the trie nodes to the pipeline stages,
thereby improving the rate at which the trie can be updated.
CAMP also presents relatively simple but effective methods
to maintain a high memory utilization and scalability in the
number of pipeline stages.
Such circular pipeline based lookup implementations can
not only provide a high lookup rate but also improve the
memory utilization and reduce the power consumption. It
will be extremely valuable to evaluate the feasibility of
incorporating such specialized engines in modern network
processors. A number of proposals have also been made in
the context of header lookup implementations on a network
processor.
2.3 Deep packet inspection
Deep packet inspection has recently gained widespread
popularity as it provides the capability to accurately classify
and control traffic in terms of content, applications, and
individual subscribers. Cisco and others today see deep
packet inspection happening in the network and they argue
that “Deep packet inspection will happen in the ASICs, and
that ASICs need to be modified” [19]. Some important
applications requiring deep packet inspection are listed
below:
• Network intrusion detection and prevention systems
(NIDS/NIPS) generally scan the packet header and
payload in order to identify a given set of signatures of
well known security threats.
• Layer 7 switches and firewalls provide content-based
filtering, load-balancing, authentication and monitoring.
Application-aware web switches, for example, provide
scalable and transparent load balancing in data centers.
• Content-based traffic management and routing can be
used to differentiate traffic classes based on the type of
data in packets.
Deep packet inspection often involves scanning every byte
of the packet payload and identifying a set of matching
predefined patterns. Traditionally, rules have been
represented as exact match strings consisting of known
patterns of interest. Naturally, due to their wide adoption
and importance, several high speed and efficient string
matching algorithms have been proposed recently. Some of
the standard string matching algorithms such as Aho-Corasick [7], Commentz-Walter [8], and Wu-Manber [9],
use a preprocessed data-structure to perform high-performance matching. A large body of research literature
has concentrated on enhancing these algorithms for use in
networking. In [11], Tuck et al. present techniques to
enhance the worst-case performance of Aho-Corasick
algorithm. Their algorithm was guided by the analogy
between IP lookup and string matching and applies bitmap
and path compression to Aho-Corasick. Their scheme has
been shown to reduce the memory required for the string
sets used in NIDS by up to a factor of 50 while improving
performance by more than 30%.
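For concreteness, the sketch below builds a classical Aho-Corasick automaton (goto, failure and output functions) and scans a text with it; the pattern set is a textbook example, and none of the bitmap or path compression techniques discussed above are applied.

from collections import deque

def build_aho_corasick(patterns):
    """Build goto/fail/output tables for a set of exact-match strings."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                     # phase 1: trie of the patterns
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({}); fail.append(0); out.append(set())
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].add(pat)
    queue = deque(goto[0].values())          # phase 2: BFS to fill failure links
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches ending at the suffix
    return goto, fail, out

def scan(text, tables):
    goto, fail, out = tables
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]              # follow failure links on a mismatch
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

tables = build_aho_corasick(["he", "she", "his", "hers"])
print(scan("ushers", tables))   # finds 'she' at 1, 'he' at 2, 'hers' at 2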
Many researchers have proposed high-speed pattern
matching hardware architectures. In [12] Tan et al. propose
an efficient algorithm that converts an Aho-Corasick
automaton into multiple binary state machines, thereby
reducing the space requirements. In [13], the authors
present an FPGA-based design which uses character pre-decoding coupled with CAM-based pattern matching. In
[14], Yusuf et al. use hardware sharing at the bit level to
exploit logic design optimizations, thereby reducing the
area by a further 30%. Other work [25, 26, 27, 28, 29]
presents several efficient string matching architectures; their
performance and space efficiency are well summarized in
[14].
In [1], Sommer and Paxson note that regular expressions
might prove to be fundamentally more efficient and flexible
as compared to exact-match strings when specifying attack
signatures. The flexibility is due to the high degree of
expressiveness achieved by using character classes, union,
optional elements, and closures, while the efficiency is due
to the effective schemes to perform pattern matching. Open
source NIDS systems, such as Snort and Bro, use regular
expressions to specify rules. Regular expressions are also
the language of choice in several commercial security
products, such as TippingPoint X505 [20] from 3Com and
a family of security appliances from Cisco Systems [21].
Although some specialized engines such as RegEx from
Tarari [22] report packet scan rates up to 4 Gbps, the
throughput of most such devices remains limited to sub-gigabit rates. There is great interest in and incentive for
enabling multi-gigabit performance on regular expressions
based rules.
Consequently, several researchers have recently proposed
specialized hardware-based architectures which implement
finite automata using fast on-chip logic. Sindhu et al. [15]
and Clark et al. [16] have implemented nondeterministic
finite automata (NFAs) on FPGA devices to perform
regular expression matching and were able to achieve very
good space efficiency. Implementing regular expressions in
custom hardware was first explored by Floyd and Ullman
[18], who showed that an NFA can be efficiently
implemented using a programmable logic array. Moscola et
al. [17] have used DFAs instead of NFAs and demonstrated
significant improvement in throughput although their
datasets were limited in terms of the number of expressions.
These approaches all exploit a high degree of parallelism by
encoding automata in the parallel logic resources available
in FPGA devices. Such a design choice is guided partly by
the abundance of logic cells on FPGA and partly by the
desire to achieve high throughput as such levels of
throughput might be difficult to achieve in systems that
store automata in memory. While such a choice seems
promising for FPGA devices, it might not be acceptable in
systems where the expression sets need to be updated
frequently. More importantly, for systems which are already
in deployment, it might prove difficult to quickly re-synthesize and update the regular expression circuitry.
Therefore, regular expression engines which use memory
rather than logic, are often more desirable as they provide
higher degree of flexibility and programmability.
Commercial content inspection engines like Tarari’s RegEx
already emphasize the ease of programmability provided by
a dense multiprocessor architecture coupled to a memory.
Content inspection engines from other vendors [33, 34],
also use memory-based architectures. In this context, Yu et
al. [10] have proposed an efficient algorithm to partition a
large set of regular expressions into multiple groups, such
that overall space needed by the automata is reduced
dramatically. They also propose architectures to implement
the grouped regular expressions on both general-purpose
processor and multi-core processor systems, and
demonstrate an improvement in throughput of up to 4 times.
Emphasizing the importance of memory based designs, a
recently proposed representation of regular expressions
called delayed input DFA (D2FA) [xx] attempts to reduce
the number of transitions while keeping the number of
states the same. D2FAs use default transitions to reduce the
number of labeled transitions in a DFA. A default transition
is followed whenever the current input character does not
match any labeled transition leaving the current state. If two
states have a large number of “next states” in common, we
can replace the common transitions leaving one of the states
with a default transition to the other. No state can have
more than one default transition, but if the default
transitions are chosen appropriately, the amount of memory
needed to represent the parsing automaton can be
dramatically reduced.
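A minimal sketch of this traversal rule is shown below; the state tables are invented for illustration. State 0 keeps a complete transition table, while the other states keep only the transitions that differ from it and default to state 0, so the default-following loop always terminates.

def d2fa_step(labeled, default, state, ch):
    """Return (next state, memory accesses) for input ch, following default
    transitions (which consume no input) until a labeled match is found."""
    accesses = 0
    while ch not in labeled[state]:
        state = default[state]       # extra memory access, no input consumed
        accesses += 1
    return labeled[state][ch], accesses + 1

# Toy tables over the alphabet {a, b, c}; states 1 and 2 default to state 0.
labeled = {
    0: {"a": 1, "b": 0, "c": 0},     # complete table at the default-tree root
    1: {"a": 1, "b": 2},
    2: {"c": 0},
}
default = {1: 0, 2: 0}

state, total_accesses = 0, 0
for ch in "aac":
    state, cost = d2fa_step(labeled, default, state, ch)
    total_accesses += cost
print(state, total_accesses)         # 0 4: the 'c' step needed a default hop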
Unfortunately, the use of default transitions also reduces the
throughput, since no input is consumed when a default
transition is followed, but memory must be accessed to
retrieve the next state. In [xx], authors develop an alternate
representation for D2FAs called the Content addressed
D2FA (CD2FA) that allows them to be both fast and
compact. A CD2FA is built upon a D2FA, whose state
numbers are replaced with content labels. The content
labels compactly encode information that is sufficient
for the CD2FA to avoid any default traversals, thus avoiding
unnecessary memory accesses and achieving higher
throughput. The authors argue that while a CD2FA requires
a number of memory accesses equal to that required by a
DFA, in systems with a small data cache a CD2FA surpasses
a DFA in throughput, due to its smaller memory footprint
and higher cache hit rate.
3. Packet header lookup – new directions
The header lookup operation in IP networks generally involves
determining the longest prefix matching the packet header
fields within a database of variable length prefixes. In this
research proposal, we focus on two novel methods to
improve the efficiency of longest prefix match operations.
The first method, called HEXA, is directly applicable to trie-based algorithms and can reduce the memory required to
store a trie by up to an order of magnitude. Such a memory
reduction will in turn improve the lookup rate primarily due
to two reasons. First, the compressed trie can support higher
strides thereby reducing the number of memory accesses,
and second, the memory, being much smaller, can run
at much higher clock speeds. Our first-order analysis
suggests that HEXA-based tries also preserve fast
incremental update properties.
Our second method attempts to improve the performance of
header lookup in a more general sense. A series of recent
papers have advocated the use of hash tables in order to
perform header lookup operations. Since the performance
of a hash table can deteriorate considerably in the worst case,
it has been coupled with Bloom filter based techniques.
We extend these techniques by introducing a novel hash
table implementation called Peacock hashing. Our
preliminary analysis suggests that Peacock hash tables have
several desirable properties, which can lead to a more
efficient implementation of header lookup operations. We
now elaborate on these two research directions.
3.1 HEXA
We propose a new approach to represent the directed
graph structures used in lookup and pattern matching (such
as tries and finite automata), which requires much less memory. The
approach, called History based Encoding, eXecution, and
Addressing (HEXA), challenges the well-accepted
assumption that we need log₂ n bits to identify each node
in a directed graph containing n nodes. More specifically,
we show that HEXA can identify a node with fewer than
log₂ log₂ n bits, thus dramatically reducing the memory
requirement of the fast path, which mostly consists of
transitions (identifiers of next states). The total memory is also
reduced significantly, because auxiliary information
often represents a small fraction of the total memory.
The key idea behind HEXA is that, in any directed graph,
where nodes are not accessed in a random, ad-hoc order but
in an order defined by the graph's transitions, nodes can,
to some extent, be uniquely identified by the way the parsing proceeds
in the graph. For instance, in a trie, if we begin parsing at
the root node, we can reach any given node only for a
unique stream of input symbols. In a state minimized finite
automaton, which recognizes regular expressions or strings,
each state again corresponds to a unique input pattern, and
the state can be reached only if the window of few previous
symbols corresponds to that unique pattern. Thus, in such
graphs, as the parsing proceeds, we can remember the last few
symbols, which can be used to uniquely identify the nodes.
We consider a simple example, before we formally
introduce the key concepts behind HEXA.
3.2 A Simple Example
Let us consider a simple directed graph, an IP lookup trie.
A set of 5 prefixes and the corresponding binary trie,
containing 9 nodes, is shown in Figure 1. Each node stores
the identifier of its left and right child and a bit indicating if
the node corresponds to a valid prefix. Since there are 9
nodes, identifiers are 4 bits long, and a node requires a total
of 9 bits in the fast path.

Figure 1: a) routing table (P1 = 1*, P2 = 00*, P3 = 11*, P4 = 011*, P5 = 0100*), b) corresponding binary trie with nodes numbered 1 to 9.

The fast path trie representation is
shown below, where nodes are listed as a 3-tuple: valid
prefix bit, left child and right child (NULL indicates no
child):

1. 0, 2, 3        4. 1, NULL, NULL    7. 0, 9, NULL
2. 0, 4, 5        5. 0, 6, 9          8. 1, NULL, NULL
3. 1, NULL, 7     6. 1, NULL, NULL    9. 1, NULL, NULL
Here, we assume that the next hops associated with a
matching node are stored in a shadow trie, which is stored
in a relatively slow memory. Note that, if the next hop trie
has a structure identical to the fast path trie, then the fast
path trie need not contain any additional information. Once
the fast path trie is traversed and the longest matching node
is found, we read the next hop trie once, at the location
corresponding to the longest matching node. We now
consider storing the fast path of the trie using HEXA.
In HEXA, a node is identified by the input stream over
which it is reached. Thus, the HEXA identifiers of the
nodes will be:
1. –          4. 00         7. 010
2. 0          5. 01         8. 011
3. 1          6. 11         9. 0100

Since these identifiers are unique, a node will require a total
of only 3 bits, provided we have a function which maps each
identifier to a unique number in [1, 9]. This unique number will be
the memory address where the node's 3 bits are stored. The
3 bits make up the 3-tuple for the node: the first bit is set if the
node corresponds to a valid prefix, and the second and third
bits are set if the node has a left and a right child, respectively.
Thus, if there are n nodes in the trie, ni is the HEXA
identifier of the i-th node and f is a one-to-one function
mapping the ni's to [1, n], then an array containing n 3-bit
entries is sufficient to represent the entire trie. For node i, the array
is indexed with f(ni), and traversal of the trie is
straightforward. We start at the first trie node, whose 3-bit
tuple is read from the array at index f(-). If the
match bit is set, we make a note of the match, and fetch
the next symbol from the input stream to proceed to the
next trie node. If the symbol is 0 (1) and the left (right)
child bit of the previous node was set, we compute
f(ni) (ni now contains the first bit of the input stream) and
read its 3 bits. We continue in this manner until we
reach a node with no child. The most recent node with the
match bit set corresponds to the longest matching prefix.
Continuing with the earlier trie of 9 nodes, let the mapping
function f have the following values for the nine HEXA
identifiers listed above:

1. f(-) = 4       4. f(00) = 2      7. f(010) = 5
2. f(0) = 7       5. f(01) = 8      8. f(011) = 3
3. f(1) = 9       6. f(11) = 1      9. f(0100) = 6

With this, the array of 3-bit tuples will be programmed as
follows; we also show the corresponding next hops:

            1      2      3      4      5      6      7      8      9
Fast path   0,1,1  0,1,1  1,0,1  1,0,0  0,1,1  1,0,0  0,1,0  1,0,0  1,0,0
Next hop    P1, P2, P3, P4, P5 (stored at the locations of the valid-prefix nodes)
This array and the above mapping function are sufficient to
traverse through the trie for any given input stream.
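The sketch below mirrors this worked example: only the 3-bit tuples are stored, at the addresses given by the mapping f, and the lookup walks them to find the longest matching prefix. Here f is written out as an explicit table purely for illustration; in HEXA it would be computed by hashing the identifier (plus a discriminator) rather than stored.

# HEXA identifier -> (valid prefix, has left child, has right child, next hop)
nodes = {
    "":     (0, 1, 1, None),
    "0":    (0, 1, 1, None),
    "1":    (1, 0, 1, "P1"),
    "00":   (1, 0, 0, "P2"),
    "01":   (0, 1, 1, None),
    "11":   (1, 0, 0, "P3"),
    "010":  (0, 1, 0, None),
    "011":  (1, 0, 0, "P4"),
    "0100": (1, 0, 0, "P5"),
}
f = {"": 4, "0": 7, "1": 9, "00": 2, "01": 8, "11": 1, "010": 5, "011": 3, "0100": 6}

fast_path = {f[i]: n[:3] for i, n in nodes.items()}            # address -> 3-bit tuple
next_hop  = {f[i]: n[3] for i, n in nodes.items() if n[0]}     # shadow next-hop table

def hexa_lookup(addr_bits):
    ident, best = "", None
    while True:
        match, left, right = fast_path[f[ident]]
        if match:
            best = f[ident]                  # remember location of longest match so far
        if not addr_bits:
            break
        b, addr_bits = addr_bits[0], addr_bits[1:]
        if (b == "0" and not left) or (b == "1" and not right):
            break                            # no child on this symbol
        ident += b                           # extend the HEXA identifier
    return next_hop.get(best)

print(hexa_lookup("0110"))   # P4: longest matching prefix is 011*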
This example suggests that we can dramatically reduce the
memory required to represent a trie by practically
eliminating all overheads associated with storing the node
identifiers, which normally require log₂ n bits each. However, this
requires a one-to-one function mapping each HEXA identifier
to a unique memory location, and devising such a function is not trivial.
In fact, when the trie is frequently updated, maintaining the
one-to-one mapping may become extremely difficult. We
will soon show that we can enable such a one-to-one
mapping at very low cost. We also ensure that our
approach maintains very fast incremental updates; i.e. when
nodes are added or deleted, a new one-to-one mapping is
computed quickly and with very few changes to the fast
path array.
3.3 Devising One-to-one Mapping
We have seen that we can compactly represent a directed
graph, if we have a function to map each HEXA identifier
to a unique number between 1 and n, where n is the total
number of nodes in the graph. We can generalize the
problem slightly by allowing the memory array to have
space for m nodes, where m ≥ n. Thus, we have to map n
HEXA identifiers to unique locations in a memory array
containing m cells in total. This is essentially similar to
computing a perfect hash function. For large n, finding such
a perfect hash becomes extremely compute-intensive and
impractical.
We can simplify the problem dramatically by exploiting
the fact that the HEXA identifiers of these nodes can be
changed without changing their meaning. For instance, we
can allow a node identifier to contain a few additional (say c)
bits, which we can alter at our convenience. We call these
c bits the node's discriminator. Thus, the HEXA identifier of a
node will be the history of labels over which we reach the
node, plus its c-bit discriminator. (In directed graphs more
complex than a trie, multiple nodes may be reached over the
same set of input labels, e.g. in an NFA; in such cases the
discriminators are essential to assign unique HEXA identifiers
to these nodes.) Having these
discriminators and the ability to alter them provides us with
potentially multiple choices of memory locations for a
node. We can use any sufficiently random hash function,
and each node will have 2^c choices of HEXA identifiers and
hence up to 2^c candidate memory locations, of which we have to pick
one.
This problem can now be reduced to a bipartite graph
matching problem. The bipartite graph G = (V1 ∪ V2, E)
consists of the nodes of the original directed graph as the
left set of vertices and the memory locations as the right set
of vertices. The edges connecting the left vertices to
the right vertices are the result of the hash function:
since discriminators are c bits long, each left vertex will
have up to 2^c edges connected to effectively random right vertices.
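A small sketch of this reduction is shown below: each node's candidate memory locations come from hashing its identifier together with every possible discriminator value, and a maximum matching is then found with Kuhn's augmenting-path algorithm. The hash function and the parameters c and m are illustrative choices; depending on them, a perfect matching may or may not exist, in which case the sketch returns None.

import hashlib

def candidate_locations(ident, c_bits, m):
    """Up to 2^c candidate locations for a node: one per discriminator value.
    A software hash stands in for the hardware hash function."""
    locs = []
    for d in range(1 << c_bits):
        h = hashlib.sha256(f"{ident}|{d}".encode()).digest()
        locs.append(int.from_bytes(h[:8], "big") % m)
    return locs

def perfect_matching(idents, c_bits, m):
    """Kuhn's augmenting-path algorithm on the memory mapping graph."""
    edges = {i: candidate_locations(i, c_bits, m) for i in idents}
    owner = {}                               # memory location -> node occupying it

    def try_assign(node, visited):
        for loc in edges[node]:
            if loc in visited:
                continue
            visited.add(loc)
            if loc not in owner or try_assign(owner[loc], visited):
                owner[loc] = node            # (re)route along an augmenting path
                return True
        return False

    for node in idents:
        if not try_assign(node, set()):
            return None                      # no perfect matching for this c and hash
    return {node: loc for loc, node in owner.items()}

idents = ["-", "0", "1", "00", "01", "11", "010", "011", "0100"]
print(perfect_matching(idents, c_bits=2, m=9))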
Figure 2: Memory mapping graph and bipartite matching. (For each trie node, the figure lists its input labels, its four choices of HEXA identifiers obtained with 2-bit discriminators, the resulting choices of memory locations under the hash function, and a perfect matching in the bipartite graph.)
We refer to G as the memory mapping graph. Clearly, we
intend to find a perfect matching in the memory mapping
graph G, i.e. to match each node identifier to a unique memory
location. It is possible that no perfect matching exists: a
maximum matching M in G, which is the largest set of
pairwise non-adjacent edges, may not contain n edges, in
which case some nodes will not be assigned any memory
location. However, using theoretical analysis, we show that
when c is O(log log n), a perfect matching exists with
high probability, even if m = n. When m is slightly greater
than n, the probability of finding a perfect matching grows
very quickly.
Continuing with the previous trie shown in Figure 1, we
now seek to devise a one-to-one mapping using this method.
We consider m = n and assume that c is 2; thus a node can
have 4 possible HEXA identifiers, which give it up to 4
choices of memory locations. A complication in computing
the hash values may arise because the HEXA identifiers are
not of equal length. We can resolve it by first appending to
each HEXA identifier its length and then padding the short
identifiers with zeros. Finally, we append the discriminators.
The resulting choices of identifiers and the memory mapping
graph are shown in Figure 2, where we assume that the hash
function is simply the numerical value of the identifier
modulo 9. In the same figure, we also show a perfect
matching, with the matching edges drawn in bold. With this
perfect matching, a node requires only 2 bits to be uniquely
represented (as c = 2).
3.4 Updating a Perfect Matching
We now briefly describe how our approach guarantees fast
incremental updates when a node is removed and another is
added to the graph. In several applications (e.g. IP lookup),
fast incremental updates are critically important. Thus, HEXA
representations are practical only if a new perfect matching
can be found quickly after some nodes are removed and new
ones are added to the graph. We show that HEXA ensures
that the asymptotic complexity of an update is
O(log n / log log n); thus, when a node is removed and a new
one is added, the new node is mapped to a memory location
with fewer than O(log n / log log n) existing nodes remapped.
From a practical standpoint, an update affects fewer than 5
nodes in a one-million-node graph, so updates can proceed
very rapidly.
We now briefly discuss how updates are handled in HEXA;
an update involves an existing node u being removed and a
new node v being added. Suppose node u was mapped to
memory location x; location x is now free and the current
matching falls short of a perfect matching by a single edge.
We will try to match the newly added node v to a memory
location by finding an alternating path between node v and
memory location x. We assume that a perfect matching
exists in the graph, meaning that alternating paths between v
and x exist. We claim that the shortest such alternating path
contains O(log n / log log n) nodes; thus, re-establishing the
perfect matching involves remapping only
O(log n / log log n) nodes.
Lemma 1. An alternating path between the node v and the
memory location x exists and contains at most
log n / log(log n − 1) left nodes.
Proof: The proof is trivial. If we start exploring the
alternating paths in a breadth-first order starting at the node
v, we are guaranteed to reach memory location x before we
cover all n nodes of the graph. Since the left nodes have
log n − 1 non-matching edges incident upon them, and the
right ones have exactly one matching edge incident on
them, each pass of the alternating-path breadth-first
traversal will multiply the number of nodes covered by
log n − 1. Thus we will cover all n nodes after roughly
log n / log(log n − 1) passes. Therefore, the shortest
alternating path between node v and memory location x will
contain at most log n / log(log n − 1) left nodes. ■
We now present an extension of HEXA, and show how we
can further reduce the memory requirements. We also
discuss how this extension of HEXA efficiently handles
complex graphs, when our present definition of HEXA
becomes less effective.
3.5 Peacock Hashing
While flexible and expressive, regular expressions
traditionally require substantial amounts of memory, and
the state-of-the art algorithms to perform regular expression
matching are unable to keep up with the ever increasing link
speeds. To see why, we must consider how regular
expressions are implemented. A regular expression is
typically represented by a finite automaton (FA). FA can be
of two basic types: non-deterministic finite automaton
(NFA) and deterministic finite automaton (DFA). The
distinction between a NFA and a DFA is that a NFA can
potentially make multiple moves on an input symbol, while
a DFA makes a single move on any given input symbol.
A DFA therefore delivers deterministic, high
performance, as there is a single active state at any point in
time. On the other hand, in order to simulate an NFA, one
has to keep track of all possible moves that it can make;
therefore the worst-case processing complexity of an NFA is
O(n²), when all n states are active at the same time.
Clearly, in order to achieve high performance with a NFA,
the underlying hardware must provide a high degree of
parallelism, so that all NFA moves can be simulated
simultaneously. Thus, NFAs are usually implemented on an
ASIC/FPGA using flip-flops, logic gates and wires. Each
state can be represented using a single flip-flop and the
transitions between the states can be realized by appropriate
interconnection between these flip-flops. If the NFA is in
any given set of states, the corresponding flip-flops are set,
and the subsequent transitions sets the flip-flops
representing the set of next-states. Another motivating
factor behind such a design choice is that, ASIC/FPGA
generally have limited on-chip memory resource, therefore
NFAs makes a natural choice, since for any regular
expression whose ASCII length is n, the resulting NFA has
only O(n) states.
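In software, the same flip-flop analogy can be expressed with a bit vector of active states, as in the sketch below; the toy NFA (roughly for the pattern ".*ab") and its transition masks are invented for illustration.

def nfa_step(active, symbol, succ_mask):
    """active: bitmask of currently active states; succ_mask[(s, symbol)] is the
    bitmask of states reachable from s on symbol (the 'wiring' between
    flip-flops). Returns the bitmask of next active states."""
    nxt, s, bits = 0, 0, active
    while bits:
        if bits & 1:
            nxt |= succ_mask.get((s, symbol), 0)
        bits >>= 1
        s += 1
    return nxt

# Toy NFA for ".*ab": state 0 = start (self-loop), state 1 = seen 'a', state 2 = accept.
succ = {(0, "a"): 0b011, (0, "b"): 0b001, (1, "b"): 0b100}

active = 0b001                       # only the start state is active
for ch in "aab":
    active = nfa_step(active, ch, succ)
print(bool(active & 0b100))          # True: the accept state (bit 2) is set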
Such circuit-based approaches to regular expression
matching have received much attention in the
FPGA community; however, these design choices might not
be acceptable in systems where the rule set needs to be
updated frequently. More importantly, for systems which are
already in deployment, it might prove difficult to quickly
re-synthesize and update the regular expression circuitry.
Therefore, regular expression engines which use memory
rather than logic, are often more desirable as they provide
higher degree of flexibility and programmability.
In memory based regular expressions implementations,
bandwidth is a precious resource and it is important to
minimize the number of memory accesses, otherwise it may
become the performance bottleneck. Consequently, DFAs
are often the preferred method in such settings, because they
ensure that only one state traversal is needed for every
input character. However, the state space blowup problem
in DFAs appears to be a serious issue, and limits their
practical applicability.
To understand the nature of state space blowup, we must
examine the regular expressions which are commonly used
in networking systems. There are two primary reasons that
regular-expression-based rules lead to state space blowup,
which incidentally are also the reasons that regular
expressions are much more expressive than traditional string
sets: i) regular expressions allow rules to contain unions of
characters (e.g. case-insensitive strings), and ii) they allow
rules to contain closures (zero or more repetitions) of sub-expressions. When rules contain several closures over a
union of a large number of characters, the parsing can get
stuck at one of the closures, and the subsequent characters
can be consumed without proceeding further from the
closure. These characters can partially or completely match
other rules therefore the DFA has to create separate states
in order to remember all such matches. Clearly, if there are
n simple expressions each of length k, and each contains a
single closure, then the resulting DFA can have as many as
nk states. Rules used in networking systems often contain
closures, so state space blowup is quite common. In fact,
for the current networking rule sets, a naïve DFA based
approach often leads to many more states than what the
current memory technology can cost-effectively store.
Therefore, a large part of recent research has concentrated
on avoiding the state space blowup.
One natural approach to avoid the state space blowup is to
split a set of n rules into m subsets, and construct m DFAs,
one for each subset. Clearly, this approach will require
execution of m DFAs, thus each input character will require
m state traversals, which will lead to an m-fold increase in
memory bandwidth. However, it has been shown that, if we
carefully group the rules into the subsets, then modest
values of m can lead to dramatic reduction in the total
number of states. An alternative approach to reduce the
number of states is called lazy DFA, which is a middle
ground between NFA and DFA. The idea is to construct a
DFA for those portions of the rules which are matched
more often, and leaving the other portions in the form of
NFA.
It appears that the first approach of constructing multiple
DFAs trades both the worst- and average-case performance
with the total memory consumption, while the second
approach of constructing DFA only for small portions of
the rules attempts to trade memory consumption with only
the worst-case performance; the average-case performance
under normal input traffic is expected to remain good.
While these approaches appear effective in reducing the
memory, their quantitative effectiveness is not understood
well and moreover it is not clear if trading-off either or both
the worst- and average-case performance is acceptable. For
instance, if an approach ensures good average-case
performance and also ensures that the likelihood of worst-case
conditions is negligible, then it may be preferable
over an alternative approach which reduces the average-case
performance in an attempt to improve the worst-case.
On the other hand, it is also important to ensure that the
architecture is not susceptible to denial of service (DoS)
attacks. More specifically, it is important to ensure that
when an attacker attempts to create the worst-case
condition, the normal traffic remains unaffected.
An orthogonal research direction in improving the
performance of DFA based approach is to develop
algorithms to efficiently implement a DFA. Early research
has shown that for any given expression, a DFA exists,
which has the minimum number of states [2, 3]. The
memory needed to represent a DFA is, in turn, determined
by the product of the number of states and the number of
transitions from each state. For an ASCII alphabet, each
state will have 256 outgoing transitions. Thus, typical sets
of regular expressions for use in networking, containing
hundreds of patterns, yield DFAs with hundreds of
thousands of states and require hundreds of megabytes of
memory. Table compression techniques are not effective for
these DFAs due to the relatively high number of unique
‘next-states’ from any given state.
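A quick back-of-envelope calculation with assumed figures (500,000 states, 256 transitions per state, roughly 20-bit next-state identifiers) illustrates the scale:

# DFA memory footprint, back-of-envelope (assumed example figures).
states          = 500_000       # hundreds of thousands of states
transitions     = 256           # one per ASCII input character
bytes_per_entry = 2.5           # ~20-bit next-state identifiers

total_bytes = states * transitions * bytes_per_entry
print(f"{total_bytes / 1e6:.0f} MB")   # ~320 MB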
A recently proposed representation of regular expressions
called delayed input DFA (D2FA) [xx] attempts to reduce
the number of transitions while keeping the number of
states the same. D2FAs use default transitions to reduce the
number of labeled transitions in a DFA. A default transition
is followed whenever the current input character does not
match any labeled transition leaving the current state. If two
states have a large number of “next states” in common, we
can replace the common transitions leaving one of the states
with a default transition to the other. No state can have
more than one default transition, but if the default
transitions are chosen appropriately, the amount of memory
needed to represent the parsing automaton can be
dramatically reduced.
Unfortunately, the use of default transitions also reduces the
throughput, since no input is consumed when a default
transition is followed, but memory must be accessed to
retrieve the next state. In [xx], authors develop an alternate
representation for D2FAs called the Content addressed
D2FA (CD2FA) that allows them to be both fast and
compact. A CD2FA is built upon a D2FA, whose state
numbers are replaced with content labels. The content
labels compactly encode information that is sufficient
for the CD2FA to avoid any default traversals, thus avoiding
unnecessary memory accesses and achieving higher
throughput. The authors argue that while a CD2FA requires
a number of memory accesses equal to that required by a
DFA, in systems with a small data cache a CD2FA surpasses
a DFA in throughput, due to its smaller memory footprint
and higher cache hit rate.
It appears that the tradeoffs between memory consumption,
average-case performance and the worst-case performance
are not yet understood very clearly. At the same time, it is
not clear if the current solutions and architectures are
adequate and cover all points along the trade-off curve. It is
also not clear if they align well with the current hardware
and silicon technologies. Our first order evaluation suggests
that traditional methods to implement regular expressions
are unable to keep up with the ever increasing link rates.
Many current commercial regular-expression-based
systems, which use the traditional approach, run at sub-gigabit-per-second
rates, and there is substantial evidence that their
performance can be enhanced while also reducing their
implementation cost. Consequently, it is important to
investigate current solutions and come up with novel
architectures which best exploit the current hardware
capabilities.
The proposed research will focus primarily on algorithmic
solutions to the problem of deep packet inspection, aiming
to develop innovative architectures which can efficiently
implement the current and future deep packet inspection
rules. We propose to begin the research by systematically
studying the trade-offs involved in the traditional regular
expressions implementation. First we note that there are
ample avenues to improve over the popular DFA
representation, which suffers from both Amnesia and
Insomnia. In Amnesia, the DFA only remembers its current
state and ignores everything about the previous stream of
data. With this tendency, they usually require a large number
of states in order to correctly recognize every possible input
pattern. In Insomnia, the automaton tends to stay awake and
unnecessarily match all portions of the expression, even
though the suffix portions can be turned off for normal
traffic which rarely leads to a match. Consequently, the
automaton again tends to unnecessarily require a large
number of states.
Our main objective in this proposal is to develop novel
machine architectures (possibly different from both NFA
and DFA) which can efficiently implement thousands of
complex regular expressions. At the same time, it is
important to ensure that the machine exploits current
hardware and memory technology in order to achieve high
throughput at low cost.
Another important objective of the research is to develop
algorithms to store a given machine (e.g. a finite automaton,
a push-down automaton or any new machine) into the
memory, so that i) the memory consumption can be
reduced, and ii) the memory bandwidth required to execute
the machine remains small. Traditional table compression
algorithms are known to be inefficient in packing the finite
automatons which are used in the networking systems.
Recently developed CD2FA approach appears promising,
however its applicability to more complex machines is not
yet known, therefore, we intend to extend these schemes so
that they can be applied to more general machines.
An orthogonal research objective of this proposal aims at
investigating the possibilities to further reduce the memory
by eliminating the overheads associated with explicitly
storing the transitions. Traditionally, a transition of an
automaton requires log₂ n bits, where n is the total number
of states. It appears that, with the aid of hashing techniques,
each node can be represented with fewer bits, thus the
number of bits needed to store a transition can be
dramatically reduced. Our preliminary analysis suggests
that where conventional methods require 20 bits to represent
each transition in a one-million-node machine, this
technique will require only 4 bits.
To summarize, we intend to undertake the following tasks:
4. Detailed evaluation and analysis of the traditional approaches to implement regular expressions
4.1. Evaluation of the characteristics of regular expressions which are typically used in current networking systems, and a detailed study of the trends in the evolution of deep packet inspection rules
4.2. Implications of the characteristics (e.g. closures, unions, etc.) of the rules for the resulting finite automaton
4.3. Trade-off between using NFAs and DFAs, and analysis of their worst- and average-case performance
4.4. Evaluation of intermediate approaches like lazy DFA
5. Introduction of novel regular expression representations
5.1. Delayed input DFAs (D2FA)
5.2. Content addressed delayed input DFAs (CD2FA)
5.3. New machines which are capable of performing regular expression matching, but are different from finite automata
6. Investigation into new approaches to better implement a machine
6.1. Table compression algorithms
6.2. Algorithms to reduce the number of bits needed to represent the states of a machine
The remainder of the proposal is organized as follows.
Background on regular expressions and related work are
presented in Section 2. Section 3 describes the D2FA
representation. Details of our construction algorithm and
the compression results are presented in Section 4. Section
5 presents the system architecture, load balancing
algorithms and throughput results. The paper ends with
concluding remarks in Section 6.
4. LIMITATIONS OF THE TRADITIONAL
APPROACH
DFAs are the fastest known representation of regular
expressions; hence they are often seen as the best candidate
for networking applications, where high throughput and
real-time performance guarantees are desirable. DFAs are
fast, however, they suffer from the state space blowup
problem. Typical sets of regular expressions containing
hundreds of patterns for use in networking yield DFAs with
hundreds of thousands of states, limiting their practical use.
For more complex rules, which are used in current intrusion
detection systems (e.g. Snort), even the construction of a
DFA becomes impractical. Therefore, it is important to
develop new methods to implement regular expressions
which are fast as well as compact. Before we attempt to
develop these methods, we summarize some key properties
of the regular expressions which are used in the current
networking systems.
4.1 Current Regular Expressions
We have collected the regular expressions based rules
which are used in the Cisco Systems intrusion prevention
systems (IPS), Snort and Bro intrusion detection systems
(IDS), Linux layer-7 application protocol classifier, and
Extensible Markup Language (XML) filtering applications.
Our findings show that while the XML applications use
simple regular expressions rules (without many closures
and character classes), the remaining systems use
moderately complex regular expressions with several
closures and unions. Some of the properties of these rules
are listed in Table xxx. In this table we highlight the
number of bad sub-expressions found in the patterns, as these
bad sub-expressions often lead to the state space blowup of a
DFA. We also highlight the number of “.*” terms found in the
patterns, because a “.*” in a pattern often leads to the
maximum state space blowup and also makes the pattern
ambiguous. We will later see that “.*” also plays a decisive
role in tuning the performance of any NFA based approach.
Below, we summarize the key differences in the regular
expressions used in these systems.
• In contrast to the patterns used in Snort/Bro, the patterns used in Cisco IPS contain a large number of character classes. This is mostly because most of the Cisco IPS patterns are case-insensitive. Character classes do not contribute to state space blowup; they only increase the number of transitions.
• Snort/Bro patterns contain several length restrictions on the character classes. These length restrictions not only lead to the state space blowup of a DFA, they also lead to a large number of states in an NFA. In contrast, the XML and Cisco IPS patterns contain very few length restrictions.
• A large fraction of the patterns in Snort/Bro, the Linux L7 classifier and the XML filter begin with “^”, as compared to the Cisco IPS patterns. Patterns which do not begin with a “^” implicitly contain a “.*” at the beginning, and these patterns, when merged with other patterns, contribute to the state space blowup.
4.2 Deficiencies of the Current Solutions
The traditional implementation of regular expressions has
to deal with several complications, which can be classified
into two broad categories.
The first complication arises from packet multiplexing at the
network links; the architecture of any pattern matching
engine has to be designed to cope with it. At any network
link, the data stream associated with each connection usually
needs to be individually matched against the given set of
regular expression rules. Since packets belonging to
different connections can arrive interspersed with each
other, the pattern matcher needs to remember the current
state of every connection, so that it can continue matching
from that point when the next packet of the connection
arrives. Consequently, upon a switch from connection x to
connection y, it first needs to store the state of the
machine for the current connection x. Afterwards, the state
of the machine when the last packet of connection y was
processed is loaded, and then the payload of the newly
arrived packet is parsed. Thus the machine state needs to be
stored on a per connection basis.
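The following minimal sketch illustrates this per-connection state management; the dictionary-based DFA layout, the StreamMatcher name and the fallback-to-start behaviour are illustrative assumptions rather than the exact design.

class StreamMatcher:
    def __init__(self, dfa, start, accepting):
        self.dfa = dfa                # transition table: state -> {symbol: next state}
        self.start = start            # initial DFA state
        self.accepting = accepting    # set of accepting states
        self.conn_state = {}          # per-connection machine state

    def scan(self, conn_id, payload):
        state = self.conn_state.get(conn_id, self.start)    # load the saved state
        matched = False
        for sym in payload:
            # falling back to the start state on a missing entry is a toy
            # simplification; a real DFA has a complete transition table
            state = self.dfa[state].get(sym, self.start)
            matched |= state in self.accepting
        self.conn_state[conn_id] = state                     # store the state back
        return matched

# Hypothetical toy DFA for the pattern "ab":
dfa = {0: {"a": 1}, 1: {"a": 1, "b": 2}, 2: {}}
m = StreamMatcher(dfa, start=0, accepting={2})
print(m.scan(("10.0.0.1", 1234), "xxa"))   # False: partial match, state saved
print(m.scan(("10.0.0.1", 1234), "bzz"))   # True: the match completes across packets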
At high speed backbone links, the total number of active
connections can reach up to a million, therefore it is
important to limit the amount of state associated with any
machine. For instance, even though NFAs are compact, it is
possible that a large fraction of the states of an NFA is
active at the same time. Thus, for such machines, the
amount of space needed to store the “per connection
machine state”, and the bandwidth needed to load and store
these states may become the performance bottleneck. On
the other hand, in a DFA based machine, the amount of
state remains very small, since only one state is active at
any point in time. Additionally, DFAs are also much faster
in parsing the packet payload; they require only one state
traversal per input character, thus DFAs are often preferred
over NFAs. DFA based solutions, however, suffer from the
state space blowup problem, i.e. a DFA may have an
exponential number of states. For the regular expressions
used in current networking systems, the causes of the state
space blowup in a DFA can be classified into three types:
i) Traditionally, all finite automaton based approaches
(including DFAs) match the complete regular expression
against the packet payload. However, in IDS/IPS applications,
the likelihood that a normal packet completely matches a
regular expression is very small; in fact, any normal data
stream is expected to match only the first few symbols of a
regular expression. Thus, in practice, a machine may perform
parsing only against the prefix portion of the regular
expression, avoiding parsing against the complete expression;
we refer to this deficiency of the traditional approach as
Insomnia. With Insomnia, the machine tends to remain awake
over the entire rule, including both the prefix and suffix
portions, even though the suffix portions can be turned off
for normal traffic. Without a cure, a machine suffering from
insomnia tends to unnecessarily require a large number of
states. The problem appears to have an easy cure, because the
suffix portions of the regular expressions usually contain
the constructs which lead to the state space blowup
(closures, unions, and length restrictions).
ii) The second deficiency of a DFA based machine can be
classified as Amnesia. Due to Amnesia, the DFA has very
limited memory, and it only remembers its current state and
forgets everything about the previous stream of data and the
associated partial matches. Due to this tendency, DFAs
usually require a large number of states in order to correctly
recognize every possible input pattern. It appears that if one
equips a DFA based machine with slightly more memory,
then the state space blowup can be avoided to a large
extent.
iii) The third deficiency of the finite automata can be
classified as xxxxx, due to which finite automata are unable
to efficiently count the occurrences of certain symbols in
the input stream. Thus, whenever a regular expression
contains a length restriction of k on certain symbols, the finite
automaton requires at least k additional states. When
several expressions contain length restrictions, and they are
merged into a single DFA, the number of states can increase
exponentially. It appears that if we equip the automata with
a few counters, then the state space blowup can be avoided.
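As a hedged illustration of this idea (not the machine we will eventually propose), the sketch below recognizes a pattern of the form a.{k}b by keeping small counters instead of the k extra states a pure automaton would need; the function name and representation are assumptions made only for this example.

def match_length_restricted(data, k):
    # Recognize 'a', then exactly k arbitrary symbols, then 'b', using
    # counters instead of k extra automaton states.  One counter is kept for
    # every pending 'a'; a match is reported when a 'b' arrives while some
    # counter has reached exactly k.
    counters = []
    for c in data:
        if c == "b" and k in counters:
            return True
        counters = [n + 1 for n in counters if n < k]   # every symbol advances the counters
        if c == "a":
            counters.append(0)
    return False

print(match_length_restricted("xxaqqqqb", 4))   # True: 'a', four symbols, then 'b'
print(match_length_restricted("aqqqb", 4))      # False: only three symbols in between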
Our first solution, which we describe below, attempts to
cure a DFA from insomnia.
4.3 Curing DFA from Insomnia
The traditional approach of regular expressions matching
constructs a finite automaton for the complete regular
expression. Thus, the finite automaton considers all
portions of the regular expression in the parsing process,
while, in practice, only the prefix portions of the
expressions need to be considered, because normal data
streams rarely match more than the first few characters of
any regular expression. We refer to this deficiency as
insomnia. In practice, a large fraction of the regular
expression rules contain complicated constructs near the end;
thus insomnia unnecessarily leads to the state space blowup
of a DFA. An effective cure for insomnia can significantly
reduce the DFA size by considering only the prefix portions
of the regular expressions. In the rare case that a prefix
portion matches, the suffix portion is handled separately.
Such an approach appears attractive for enhancing the parsing
performance of normal data streams. With the capability of
independently parsing the prefix and suffix portions of
regular expressions, one can bifurcate the packet processing
path into two portions: a fast path and a slow path. In the
fast path, only the prefix portions of the regular
expressions will be matched. If the prefix portions are
sufficiently simple, then a composite DFA can be constructed
for all prefix regular expressions, thereby enabling high
performance. (At this point in the discussion, we assume that
such a composite DFA can be constructed; we will later
develop methods to ensure that a DFA based approach is
feasible even when the prefix portions are relatively
complex.) All packets by default will be processed by the
fast path, and the expectation is that only a small fraction
of the normal data streams will lead to a match in the fast
path (of course, anomalous data streams will match). All data
streams or connections which announce a match in the fast
path will henceforth be processed by the slow path. In the
slow path, the suffix portions will be matched. Consequently,
there is no need to construct a composite DFA for the
suffixes of all regular expressions, as only those suffixes
whose prefixes have been matched need to be matched. Thus,
the state space blowup problem can be avoided in the slow
path.
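A minimal sketch of this dispatch logic is shown below; the matcher interface (a scan() method), the function name, and the re-scanning of the whole payload in the slow path are simplifying assumptions made for illustration only.

# Sketch of the bifurcated dispatch, assuming fast_path.scan() returns the
# indices of the prefix patterns matched in this packet and each
# slow_matchers[i].scan() continues matching suffix i for this flow.  For
# simplicity the slow path re-scans the whole payload instead of resuming
# exactly after the prefix match.
def process_packet(conn_id, payload, fast_path, slow_matchers, active_suffixes):
    alerts = []
    # every byte of every flow is parsed by the fast path
    for i in fast_path.scan(conn_id, payload):
        active_suffixes.setdefault(conn_id, set()).add(i)
    # only flows flagged by the fast path ever reach the slow path
    for i in list(active_suffixes.get(conn_id, ())):
        if slow_matchers[i].scan(conn_id, payload):
            alerts.append(i)                      # rule i matched end to end
            active_suffixes[conn_id].discard(i)   # stop slow path work for it
    return alerts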
In order to make such a bifurcated architecture practical, we
have to tackle several challenges. The first challenge lies
in determining the appropriate boundary between the prefix
and suffix portions of a regular expression. A general
objective is to keep the prefix portions as small as
possible, so that the fast path DFA remains compact and fast.
At the same time, if one picks too small a prefix, then the
data streams of normal connections may match it quite often,
thus triggering the slow path very frequently. The second
challenge lies in properly handling the control and handoff
of a connection to the slow path, i.e. i) after the slow path
is triggered, when should it stop processing the connection,
and ii) how to avoid triggering the slow path multiple times
for the same connection.
In this proposal, we develop a systematic approach to split a
given set of regular expressions into prefix and suffix sets.
Afterwards, we propose a fast path and slow path based
pattern matching architecture which is capable of
maintaining a high throughput because, i) the fast path is
compact and can parse the input data stream at high rates,
and ii) the slow path operates at relatively low parsing rate,
however it is guaranteed to handle only a small fraction of
the overall data. We begin with the discussion of our
splitting technique.
4.3.1 Splitting the regular expressions
The dual objective of the splitting procedure is that the
prefixes remain as small as possible, and at the same time,
the likelihood that normal data streams match these prefixes
remains low. The probability of matching a prefix depends
upon its length and the input data stream. In this context it
may not be acceptable to assume a uniformly random
distribution of the input symbols (i.e. every symbol appears
with a probability of 1/256). Clearly, in any data stream,
some symbols appear much more often than the others.
Thus, one needs to consider a trace driven probability
distribution of matching prefixes of different lengths under
the normal data stream as well as under those of the
anomalous data stream. Additionally, one also needs to pay
attention to the probabilities of making transitions from one
state to another; these probabilities are likely to be
diverse, because there is a strong correlation between the
occurrences of different symbols, i.e. when and where they
occur with respect to each other.
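As a small illustration of this trace driven estimation, the sketch below simulates an NFA over a trace and counts how often each state is active; the dictionary-based NFA representation and the toy pattern are assumptions made only for this example.

from collections import Counter

def activation_probabilities(nfa, start, trace):
    # Simulate an NFA (transition dict: state -> {symbol: set of states}) over
    # a traffic trace and estimate, for every state, the fraction of input
    # symbols for which that state is active.  These estimates drive the
    # prefix/suffix cut described below.
    active = {start}
    counts = Counter()
    for sym in trace:
        nxt = {start}                        # the start state is always active (implicit .*)
        for s in active:
            nxt |= nfa.get(s, {}).get(sym, set())
        active = nxt
        counts.update(active)
    return {s: counts[s] / max(len(trace), 1) for s in counts}

# Toy NFA for the pattern "ab": 0 --a--> 1 --b--> 2.
nfa = {0: {"a": {1}}, 1: {"b": {2}}}
print(activation_probabilities(nfa, 0, "xaxabxx"))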
More systematically, given the NFA of each regular
expression, we need to determine the probability with
which each state of the NFA becomes active and the
probability that the NFA takes its different transitions. An
obvious trend in these NFAs is that, as we move away from
the start state, the probability of subsequent states being
active reduces very quickly. Thus small prefixes may
appear to be sufficient for normal traffic. However, in order
to capture more extreme cases, we intend to generate
synthetic traffic in addition to the real traffic traces. We
plan to generate these synthetic traffic traces by first
constructing a NFA of all regular expressions, and then
traversing the NFA beginning at the start state. The
traversal will occur with a bias probability; high bias values
will force the NFA to make moves which will lead it to
those states which are further away from the start state. The
operation of the synthetic traffic generator is described
below:
nfa M(Q, q0, n, A, Σ);
set state current;
map level: state → int;
current = q0;
assign-level(q0, 1);
do (true) →
   char c = generate-traffic(bias);    /* next synthetically generated character */
   current = current ∪ n(current, c);
od

char function generate-traffic(float bias);
   int max = 0; char maxc;
   for char c ∈ Σ →
      set state next = ∅;
      for state s ∈ current → next = next ∪ n(s, c); rof
      if (sum(next) > max) → max = sum(next); maxc = c; fi
      if (random( ) > bias) → break; fi
   rof
   return maxc;
end;

int function sum(set state states);
   int total = 0;
   for state s ∈ states → total += level[s]; rof
   return total;
end;

procedure assign-level(state s, int l, modifies map mark: state → bit);
   mark[s] = true; level[s] = l;
   for char c ∈ Σ →
      for state t ∈ n(s, c) →
         if not mark[t] → assign-level(t, l+1); fi
      rof
   rof
end;
With the real and synthetic traffic traces in hand, we
compute the probability with which the various NFA states
become active and the probability with which each of its
transitions is taken. Once these probabilities are computed,
we need to determine a cut in the NFA graph, so that i) there
are as few nodes as possible on the left hand side of the
cut, and ii) the probability that the states on the right
hand side of the cut are active is sufficiently small. This
will ensure that the fast path remains compact and the slow
path is triggered only occasionally. While determining the
cut, we also need to ensure that the probability of those
transitions which leave some NFA node on the right hand side
and enter some other node on the same side of the cut remains
small. This will ensure that, once the slow path is
triggered, it will stop after processing a few input symbols.
Clearly, the cuts computed from the real traffic traces and
from the synthetic traffic traces are likely to be different,
and therefore the corresponding prefixes will also be
different. We adopt the general policy of taking the longer
prefix. Below, we formally describe the procedure to
determine a cut in the NFA graph.
Let ps : Q → [0, 1] denote the probability with which each
NFA state is active. Let the cut divide the NFA states into a
fast and a slow region. Since we want to minimize the number
of states in the fast region, we would like to keep only
those states in the fast region whose probability of
remaining active is high. Initially, we keep all states in
the slow region; thus the slow path probability is Σs ps(s).
Afterwards, we begin moving states from the slow region to
the fast region. The movements are performed in a breadth
first order beginning at the start state of the NFA, and
those states are moved first whose probabilities of being
active are higher. After a state s is moved to the fast
region, ps(s) is subtracted from the slow path probability.
We continue this movement of states until the slow path
probability becomes sufficiently small, say ε. This method
gives us a first order estimate of the cut between the fast
and the slow path. Such a cut will ensure that the slow path
processes only an ε fraction of the total bytes in the input
stream. The procedure is formally described below:
procedure find-cut(nfa M(Q, q0, n, A, Σ), map ps : state → [0,1]);
   heap h;
   map mark: state → bit;
   set state fast;
   float p = Σs ps(s);
   h.insert(q0, ps(q0));
   do h ≠ [ ] and p > ε →
      state s := h.findmax(); h.remove(s);
      mark[s] = 1; fast = fast ∪ s;
      p = p – ps(s);
      for char c ∈ Σ →
         for state t ∈ n(s, c) →
            if not mark[t] → h.insert(t, ps(t)); fi
         rof
      rof
   od
end;
Figure 2: Fast path and slow path processing in a bifurcated packet processing architecture.
For a large majority of the regular expressions used in the
current systems, the above method will cleanly split them
into prefix and suffix portions. However, for certain types
of regular expressions, the above method will not result in a
clean split. For instance, consider the expression
ab(cd|ef)gh. The resulting NFA may be cut at the states which
correspond to the matches abc and abe. For such cuts, there
is no way the regular expression can be cleanly split into
prefix and suffix parts. One way to split this expression
along this cut is to treat it as two separate expressions
abcdgh and abefgh and split them individually. We rather
propose to split such types of expressions by extending the
cut of the NFA until a clean split of the expression is
possible. Thus, in the above example, we will extend the cut
to the states which correspond to the matches abcd and abef;
thus the prefix portion will become ab(cd|ef) and the suffix
will be gh.
4.3.2 The bifurcated packet processing
Having described the mechanism to split regular
expressions into prefix and suffix portions, we are now
ready to proceed with the description of our bifurcated
pattern matching architecture. The architecture (shown in
Figure xxx) consists of two components: fast path and slow
path. The fast path parses every byte of each input data
stream and matches against the prefix portions of all regular
expressions. The slow path parses only those data streams
which have found a match in the fast path, and matches
them only against those suffix portions, whose
corresponding prefix portions are matched. As mentioned
earlier, since the fast path parses every input byte, it
consists of a single composite DFA of all prefix patterns.
Having a composite DFA for the fast path has an important
advantage: the state associated with every connection, which
needs to be stored and loaded upon a connection switch, is
very small. Thus, if there are C connections in total, we
only need C × state_f bits of memory, where state_f is the
number of bits needed to represent a DFA state.
The slow path, on the other hand, handles only an ε fraction
of all input bytes, therefore it only needs to store the
state of εC connections. However, since the slow path
consists of a separate DFA (or NFA) for each individual
suffix, and since the slow path can be triggered multiple
times for the same connection, the state associated with a
connection can be relatively large, say state_s. The
expectation is that state_s will be at most 1/ε times larger
than state_f, thus the slow path and fast path connection
state memories will remain comparable in size.
In order to better understand how the slow path and fast path
pattern matchers are constructed and how the slow path is
triggered, let us consider a simple example. Let there be
three regular expression patterns:
r1 = [gh][^ij]*[ij]def;  r2 = fag[^i]*i[^j]*j;  r3 = a[gh]i[^l]*[ae]c
The NFAs for these three patterns are shown below (a
composite DFA for these three patterns would have 144
states). In the figure, the probabilities with which various
NFA states become active are highlighted. A cut between the
fast path and the slow path is also shown; this cut divides
the states such that the cumulative probability of the slow
path states is less than 5%.
Figure: NFAs of r1, r2 and r3, annotated with the probability that each state is active, and the cut separating the fast path (prefix) states from the slow path (suffix) states.
With this cut, the prefix portions of the regular expressions
will be p1 = [gh][^ij]*[ij]d; p2 = f; and p3 = a[gh]. The
corresponding suffix portions of the regular expressions will
be s1 = ef; s2 = ag[^i]*i[^j]*j; and s3 = i[^l]*[ae]c.
The fast path pattern matcher will consist of a composite
DFA of the prefix patterns p1, p2, and p3, which will have
only 14 states. The slow path will consist of three individual
DFA (or NFA) of the suffix patterns s1, s2, and s3, which
will have 3, 15 and 6 states, respectively. Once a data
stream matches a prefix pattern, say pi, in the fast path,
matching of the corresponding suffix pattern si is triggered
in the slow path. Even though the slow path consists of
multiple automata, it is not likely to become the performance
bottleneck, because the expectation is that only 5% of the
total input bytes will be processed by the slow path. The
fast path state memory will only store a single DFA state for
each connection. On the other hand, the slow path state
memory may need to store several states for every connection.
For instance, if the input data stream of a connection is
”fagide”, then the slow path will be triggered thrice, and
will have three active states which will have to be stored in
the state memory upon a connection switch.
Figure: the fast path DFA states traversed while parsing the input ”fagide”, and the slow path suffix automata (s2, then s3, then s1) that are triggered along the way.
While this example suggests that the number of active states
for a single connection in the slow path can be many more
than one, the general expectation is that this number will
not be high. Even though a handful of connections will have
many active states, the average number of active states for
connections in the slow path will only be slightly higher
than one.
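The sketch below mimics this bifurcated matching for the example rules, using Python's re engine as a stand-in for the composite prefix DFA and the per-suffix automata; buffering the payload and re-running the suffix from the prefix match point are simplifications, not the streaming design described above.

import re

# Prefix (fast path) and suffix (slow path) patterns from the example above.
prefixes = {1: r"[gh][^ij]*[ij]d", 2: r"f", 3: r"a[gh]"}
suffixes = {1: r"ef", 2: r"ag[^i]*i[^j]*j", 3: r"i[^l]*[ae]c"}

def scan(data):
    triggered = []
    for i, p in prefixes.items():
        for m in re.finditer(p, data):                 # fast path: find prefix matches
            if re.match(suffixes[i], data[m.end():]):  # slow path: continue with the suffix
                triggered.append((i, m.start()))
    return triggered

print(scan("fagide"))        # prefixes of r1, r2 and r3 match, but no suffix completes
print(scan("gkkidefxx"))     # rule r1 = [gh][^ij]*[ij]def matches end to end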
Such a cure from Insomnia appears attractive, because it
ensures high average parsing rate, and also guarantees that
the anomalous connections will be diverted to the slow
path; thus they cannot affect the performance received by
well behaving connections. At the same time, splitting
regular expressions into suffix and prefix portions avoids
the state space blowup to a large extent. However, since the
prefix portions are compiled into a composite DFA, if a
large number of prefixes contain Kleene closures, then
there may still be a state explosion. As a matter of fact, a
few tens of Kleene closures are sufficient to make a
composite DFA construction impractical. These state
explosions occur due to Amnesia; therefore we now present
an effective cure to Amnesia.
4.4 Curing DFAs from Amnesia
Before proceeding with the formal description of H-FA,
which is our cure to Amnesia, let us re-examine why
amnesia leads to state blowup of a DFA. The primary
reason behind the state space blowup is the existence of
Kleene closures over a union of multiple characters (e.g. .*
or [a-z]*). With such patterns, when the parsing gets stuck
at the closure, subsequent characters consumed from the input
stream without proceeding past the closure can form new
prefixes, and the DFA has to remember all such partially
matched prefixes. Since a DFA cannot remember anything except
its current state (amnesia), each of these partially matched
prefixes requires a new state. The problem is exacerbated
when there are multiple regular expressions, each containing
such Kleene closures. A collection of such regular
expressions often leads to an exponential blowup in the
number of states.
An intuitive solution to the problem is to construct a
machine which can remember some additional information other
than just its current state. For instance, one can simulate
an NFA in such a way that it remembers all possible sets of
current states for any given input stream, and thus alleviate
the problem of state space blowup. However, NFAs are slow in
practice, and in the worst case they require O(n^2) memory
accesses to consume a single character. Therefore, it is
important that the new machine requires O(1) memory accesses
per input character. Intuitively, it
appears that, if a finite automaton is equipped with a small
amount of cache which it will use to remember key pieces
of information, then the problem of state space blowup can
be alleviated. However, it is important to keep the cache
compact so that it can be stored in on-chip memory or
cache, and therefore can be accessed or loaded quickly. In
essence, the objective is to partition a finite automaton into
two portions, a graph representing the transitions which are
stored as tables in the memory, and a fast but compact
cache, which is used to avoid redundant accesses to the
graph and avoids the exponential blowup in number of
states.
We introduce a novel machine called History based Finite
Automaton (H-FA), which cures traditional DFAs from
Amnesia. H-FA augments a DFA with the capability to
remember a few but critical pieces of information. With this
capability, H-FA leads to orders of magnitude reduction in
the number of states.
The next obvious questions are i) how to partition a finite
automaton? ii) Is such a partitioning always possible? iii)
Can the cache size always be bounded? We now present
History based finite automaton (H-FA), which effectively
addresses each of these concerns.
4.4.1 Introducing H-FA
If we focus on the construction of a finite automaton from a
non-deterministic finite automaton, then we find that each
state of a FA represents a finite subset of states of the
non-deterministic finite automaton. Clearly there can be up
to 2^n possible subsets of n states, thus the number of
states in a FA can grow exponentially. However, in practice,
FAs do not have that many states, else they would become
impractical for regular expressions containing more than a
few tens of symbols. As suggested previously, the increase in
the number of states occurs due to the presence of Kleene
closures (*). For instance, if there are n simple
expressions, each of length k and each containing one Kleene
closure, then the number of states in the resulting finite
automaton can be as many as k^n.
Consider two patterns listed below (one of these patterns
contains a Kleene closure; note that the Kleene closure is
over [^a] because this will keep the resulting FA of the first
expression, r1 simple, and thus the effect of the Kleene
closure on multiple expressions will be prominent):
r1 = ab[^a]*c;
r2 = def;
These patterns create a NFA with 7 states as shown below:
Figure: NFA for the patterns ab[^a]*c and def; state 0 is the start state with a self-loop on every symbol, states 1, 2 and 3 track ab[^a]*c (state 2 carries the [^a] self-loop), and states 4, 5 and 6 track def.
Let us now look at the corresponding FA, shown below, which
is constructed by applying the subset construction to the
above NFA.
Figure: FA obtained by applying the subset construction to the above NFA; each FA state is a subset of NFA states, and several of these subsets (e.g. {0,2}, {0,2,4}, {0,2,5}, {0,2,6}) contain the NFA state 2.
Clearly the blowup in the number of states is due to the
presence of the Kleene closure [^a]* in the expression r1.
When the NFA is in state 2, the subsequent stream of input
symbols can partially or completely match r2, therefore the
FA requires additional states in order to keep track of the
partially matched expressions r1 and r2. The subsets of NFA
states which form the FA states also illuminate this
phenomenon: 5 FA states contain the NFA state 2 in their
subsets. In general, it turns out that the NFA states which
represent the Kleene closures arise in multiple subsets and
lead to an exponential blowup in the number of FA states. For
instance, if there are only k Kleene closures and there are a
total of n symbols in the expressions, then there can be up
to n^k subsets of NFA states containing one or more
occurrences of states representing the Kleene closures. Thus,
even if k is small, say 10, and n is 10,000, then there can
be up to 10^40 FA states. Thus, even a handful of Kleene
closures can render a FA impractical.
A simple way to avoid the state blowup in such regular
expressions is to enable the FA to remember whether it has
reached the Kleene closure or not. For instance in the
previous example, if the FA can remember whether it has reached
the NFA state 2 or not, then the number of states can be
reduced. H-FA efficiently and optimally achieves this
objective. We now present the formal description of H-FA
and algorithms to construct it from a traditional FA or
directly from a NFA.
4.4.2 Formal Description of H-FA
History based Finite Automata (H-FA) differ from traditional
finite automata in that, in an H-FA, the moves made by the
machine depend both upon the transitions and the contents of
a set called the history; moreover, as the machine makes
moves, the contents of this set are updated. Thus, along with
every transition of the H-FA, there is an accompanying
condition which becomes either true or false
depending upon the contents of the history. Moreover, with
some transitions there are associated actions, which insert
into the set, remove from the set, or both. An H-FA can thus
be formally represented as a 6-tuple, which we will describe
later.
Continuing with the previous example, let us describe how
to build the corresponding H-FA. In one variant of the
construction, we begin from a FA, and identify those NFA
states which appear in most of the FA states. Thus, in the
previous FA, the NFA state 2 appears in 4 FA states, which
are highlighted below; we refer to the states in the
highlighted region as fading states.
Figure: the same FA with the fading states (the states whose subsets contain the NFA state 2) highlighted.
If we remove the NFA state 2 from the fading states, then
they can be eliminated (each of them will overlap with some
FA state in the non-fading region). However, in order to
remove the NFA state 2 from any FA state subset, we will need
to make a note that state 2 has been reached. Thus, all
transitions from a non-fading state to a fading state
(containing 2) will have an associated action to insert 2
into the history. Additionally, all transitions from fading
states which lead to a non-fading state will have an
associated action to remove 2 from the history. Furthermore,
all transitions that remain in the fading region will have an
associated condition that they are taken only if 2 is present
in the history. Let us now remove the NFA state 2 from the
fading FA state (0,2). After removal, this state will overlap
with the FA state (0), therefore we need to add conditional
transitions at the FA state (0). The resulting H-FA with
state (0,2) removed is shown below:
Figure: the H-FA after removing the fading state (0,2); the transitions out of state (0) now carry history annotations such as a,|2,-2, c,|2,-2 and d,|2.
Here a transition with “|s” implies that the transition is
taken when history contains s; “+s” implies that, when this
transition is taken, s is inserted into the history, and “-s”
implies that, when this transition is taken, s is removed from
the history. Once we remove all states in the fading region
from the FA, we will have the following H-FA:
Figure: the final H-FA obtained after removing all states in the fading region.
Note that this H-FA has additional conditional transitions at
4 states, and the history will hold at most one entry. In
general, if we remove k Kleene closures, the history will
have up to k entries; on the other hand, in the worst case
there can be up to 2^k additional conditional transitions. We
argue that such a worst case rarely appears, and moreover,
whenever it does appear, a slightly sub-optimal H-FA
construction can alleviate the problem.
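To make the mechanics concrete, the following sketch interprets a hand-built toy H-FA for the single pattern ab[^a]*c, where the history flag 2 records that the Kleene closure state has been reached; the transition encoding and the machine itself are illustrative assumptions and do not reproduce the exact H-FA of the figures above.

def step(trans, state, history, sym):
    # Pick the first transition whose history condition is satisfied, apply
    # its insert/remove actions, and return (next_state, new_history, match).
    for cond, nxt, insert, remove, match in trans[state].get(sym, trans[state]["*"]):
        if cond <= history:                  # condition: required history flags are present
            history = (history - remove) | insert
            return nxt, history, match
    raise RuntimeError("no transition applies")

F = frozenset
trans = {
    0: {"a": [(F(), 1, F(), F("2"), False)],
        "c": [(F("2"), 0, F(), F(), True),     # closure already reached: report a match
              (F(), 0, F(), F(), False)],
        "*": [(F(), 0, F(), F(), False)]},     # default transition
    1: {"b": [(F(), 0, F("2"), F(), False)],   # "ab" seen: record the closure in the history
        "a": [(F(), 1, F(), F("2"), False)],   # a new 'a' kills the closure state
        "c": [(F("2"), 0, F(), F(), True),
              (F(), 0, F(), F(), False)],
        "*": [(F(), 0, F(), F(), False)]},
}

state, history = 0, frozenset()
for sym in "abxxc":
    state, history, match = step(trans, state, history, sym)
    if match:
        print("ab[^a]*c matched at symbol", sym)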
We now present a brief analysis of the increase in the number
of conditional transitions.
4.4.3 Analysis of the Transition Blowup
Let us consider a set of n regular expressions, and let there
be a total of k Kleene closures. Let the i-th expression
containing a Kleene closure be denoted by r1_i [c1_i]* c2_i
r2_i, where r1_i and r2_i are the prefix and suffix parts of
the expression, the Kleene closure is over the set of
characters c1_i, and c2_i denotes the set of characters which
immediately follows the Kleene closure. For such expressions,
if c1_i contains a large number of characters, then there is
likely to be a state space blowup in the FA.
First, the blowup in the number of conditional transitions in
the resulting H-FA depends directly upon c2_i. For instance,
if none of the c2_i's overlap with each other, then there
will only be up to k conditional transitions in the H-FA. In
situations where there are overlaps between the c2_i's, there
can be an exponential blowup in the number of conditional
transitions. For instance, let us say each c2_i is a; in this
case, there can be up to 2^k conditional transitions over the
character a, and the conditions will be the presence of each
possible combination of the k Kleene closure NFA states in
the history.
Second, the number of actions (insert/remove from history)
associated with the conditional transitions will depend upon
the characteristics of c1. For instance if c1 contains all
symbols in the alphabet, then there will not be any remove
action. On the other hand, if c1 contains only a handful of
symbols then a large number of transitions will have the
associated remove action (note that in this case, this Kleene
closure will not be the right candidate for the H-FA). In
general, the number of conditional transitions with an
associated insert operation will be small.
If we look at the regular expression sets used in practice,
it appears that there will be a minimal blowup in the number
of conditional transitions. This is probably because the
Kleene closures are usually over [.] or a set containing a
large number of symbols, and the sub-expression immediately
following the closure usually contains only a handful of
characters.
4.4.4 Implementing history
If we are removing k Kleene closures, then there can be up to
k symbols in the history. Thus, each symbol in the history
will require log2(k) bits, leading to a total of k·log2(k)
bits. Clearly, even if k is 64, the size of the history will
only be 48 bytes. On the other hand, notice that with 64
entries in the history, the number of states in a traditional
FA suffering from amnesia (hence state blowup) can be reduced
by several orders of magnitude.
Since the history is likely to remain very compact, it will
not have a significant impact on performance when packets
arrive for different connections. If an average packet
contains 200 bytes of data, then parsing this packet requires
200 memory accesses; therefore fetching 48 bytes of history
will not add significant overhead.
4.4.5 Constructing an H-FA from a NFA
H-FAs can also be synthesized directly from a NFA; the
pseudo-code for this construction will be developed later in
this research.
4.4.6 Comparing H-FA to NFA
In one way, H-FA appears to be similar to a NFA, in that
the total complexity of the machine is actually O(k), where
k is the maximum number of entries in the history.
However, in NFAs, there is no straightforward way to
partition the problem into two components such that the
processing of one component requires O(1) time but
requires a moderately large amount of space (hence stored
in memory), while the other component has a processing
complexity of O(k) but can be represented much more
compactly (hence stored in cache/on-chip). H-FA achieves
this objective and efficiently partitions the problem into two
such components. I am convinced that for any given set of
regular expressions, there is no machine which can attain an
O(1) worst-case processing time, and still require less space
than a traditional state minimized FA.
5. ACKNOWLEDGMENTS
I am grateful to Will Eatherton and John Williams for
providing the regular expression rule set used in Cisco
security appliances. I am also grateful to Prof. Michael
Mitzenmacher whose collaboration has helped in
developing the HEXA architecture. I am also thankful to
Prof. George Varghese who helped me in developing the H-FA and HP-FA machines. I am also thankful to Prof.
Patrick Crowley, whose continued support, collaboration
and motivation has helped me with every step in my
research. Finally, I am thankful to Dr. Jonathan S. Turner,
who has always been a xxx. This work was supported in
part by the NSF Grant CNS-0325298 and a URP grant from
Cisco Systems.
6. REFERENCES
[1] R. Sommer, V. Paxson, “Enhancing Byte-Level
Network Intrusion Detection Signatures with Context,”
ACM conf. on Computer and Communication Security,
2003, pp. 262--271.
[2] J. E. Hopcroft and J. D. Ullman, “Introduction to
Automata Theory, Languages, and Computation,”
Addison Wesley, 1979.
[3] J. Hopcroft, “An nlogn algorithm for minimizing states
in a finite automaton,” in Theory of Machines and
Computation, J. Kohavi, Ed. New York: Academic,
1971, pp. 189--196.
[4] Bro: A System for Detecting Network Intruders in
Real-Time. http://www.icir.org/vern/bro-info.html
[5] M. Roesch, “Snort: Lightweight intrusion detection for
networks,” In Proc. 13th Systems Administration
Conference (LISA), USENIX Association, November
1999, pp 229–238.
[6] S. Antonatos, et. al, “Generating realistic workloads for
network intrusion detection systems,” In ACM
Workshop on Software and Performance, 2004.
[7] A. V. Aho and M. J. Corasick, “Efficient string
matching: An aid to bibliographic search,” Comm. of
the ACM, 18(6):333–340, 1975.
[8] B. Commentz-Walter, “A string matching algorithm fast on the average,” Proc. of ICALP, pages 118-132, July 1979.
[9] S. Wu, U. Manber, “A fast algorithm for multi-pattern searching,” Tech. Rep. TR-94-17, Dept. of Computer Science, Univ. of Arizona, 1994.
[10] Fang Yu, et al., “Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection,” UCB tech. report, EECS-2005-8.
[11] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, “Deterministic memory-efficient string matching algorithms for intrusion detection,” IEEE Infocom 2004, pp. 333-340.
[12] L. Tan and T. Sherwood, “A High Throughput String Matching Architecture for Intrusion Detection and Prevention,” ISCA 2005.
[13] I. Sourdis and D. Pnevmatikatos, “Pre-decoded CAMs for Efficient and High-Speed NIDS Pattern Matching,” Proc. IEEE Symp. on Field-Prog. Custom Computing Machines, Apr. 2004, pp. 258-267.
[14] S. Yusuf and W. Luk, “Bitwise Optimised CAM for Network Intrusion Detection Systems,” IEEE FPL 2005.
[15] R. Sidhu and V. K. Prasanna, “Fast regular expression matching using FPGAs,” in IEEE Symposium on Field-Programmable Custom Computing Machines, Rohnert Park, CA, USA, April 2001.
[16] C. R. Clark and D. E. Schimmel, “Efficient reconfigurable logic circuit for matching complex network intrusion detection patterns,” in Proceedings of the 13th International Conference on Field Programmable Logic and Applications.
[17] J. Moscola, et al., “Implementation of a content-scanning module for an internet firewall,” IEEE Workshop on FPGAs for Custom Computing Machines, Napa, USA, April 2003.
[18] R. W. Floyd and J. D. Ullman, “The Compilation of Regular Expressions into Integrated Circuits,” Journal of the ACM, vol. 29, no. 3, pp. 603-622, July 1982.
[19] Scott Tyler Shafer, Mark Jones, “Network edge courts apps,” http://infoworld.com/article/02/05/27/020527newebdev_1.html
[20] TippingPoint X505, www.tippingpoint.com/products_ips.html
[21] Cisco IOS IPS Deployment Guide, www.cisco.com
[22] Tarari RegEx, www.tarari.com/PDF/RegEx_FACT_SHEET.pdf
[23] Cu-11 standard cell/gate array ASIC, IBM. www.ibm.com
[24] Virtex-4 FPGA, Xilinx. www.xilinx.com
[25] N. J. Larsson, “Structures of string matching and data compression,” PhD thesis, Dept. of Computer Science, Lund University, 1999.
[26] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood, “Deep Packet Inspection using Parallel Bloom Filters,” IEEE Hot Interconnects 12, August 2003. IEEE Computer Society Press.
[27] Z. K. Baker, V. K. Prasanna, “Automatic Synthesis of
Efficient Intrusion Detection Systems on FPGAs,” in
Field Prog. Logic and Applications, Aug. 2004, pp.
311–321.
[28] Y. H. Cho, W. H. Mangione-Smith, “Deep Packet
Filter with Dedicated Logic and Read Only
Memories,” Field Prog. Logic and Applications, Aug.
2004, pp. 125–134.
[29] M. Gokhale, et al., “Granidt: Towards Gigabit Rate
Network Intrusion Detection Technology,” Field
Programmable Logic and Applications, Sept. 2002, pp.
404–413.
[30] J. Levandoski, E. Sommer, and M. Strait, “Application
Layer Packet Classifier for Linux”. http://l7filter.sourceforge.net/.
[31] “MIT DARPA Intrusion Detection Data Sets,”
http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html
[32] Vern Paxson et al., “Flex: A fast scanner generator,”
http://www.gnu.org/software/flex/
[33] SafeXcel Content Inspection Engine, hardware regex
acceleration IP.
[34] Network Services Processor, OCTEON CN31XX,
CN30XX Family.
[35] R. Prim, “Shortest connection networks and some
generalizations,” Bell System Technical Journal,
36:1389-1401, 1957.
[36] J. B. Kruskal, “On the shortest spanning subtree of a
graph and the traveling salesman problem,” Proc. of
the American Mathematical Society, 7:48-50, 1956.
[37] F. M. Liang, “A lower bound for on-line bin packing,” Information Processing Letters, pages 76-79, 1980.
[38] Will Eatherton, John Williams, “An encoded version of
reg-ex database from cisco systems provided for
research purposes”.
[39] M. R. Garey and D. S. Johnson, “Bounded Component Spanning Forest,” p. 208, Computers and Intractability: A Guide to the Theory of NP-Completeness, 1979.