Doctoral Dissertation Proposal: Acceleration of Network Processing Algorithms

Sailesh Kumar
Washington University Computer Science and Engineering
St. Louis, MO 63130-4899
+1-314-935-4306
[email protected]
Research advisor: Jonathan S. Turner

ABSTRACT

Modern networks need to process and forward an increasingly large volume of traffic, and the growth in the number of packets often outpaces the improvements in processor, memory and software technology. Consequently, there is a persistent interest in novel network algorithms which can implement network features more efficiently. In this proposal, we propose several new algorithms to accelerate and optimize the implementations of three core network functionalities, namely: i) packet buffering and forwarding, ii) packet header processing, and iii) packet payload inspection.

1. INTRODUCTION

Modern networking devices perform an array of operations upon receiving a packet. These operations have to be finished within a limited time budget in order to maintain a high packet throughput and low processing latency. There are two trends which put additional pressure on the performance: i) new features are regularly added in today's networks, many of which are employed on a per packet basis, and ii) the rate of increase in packet arrival rates generally outpaces the rate at which hardware and memory technology advances. Due to these performance pressures, it becomes critical to implement various network features efficiently. Clearly, it is crucial to efficiently implement and optimize any new feature; additionally, the existing features also need to be improved and updated with the advances in technology. A good implementation of any network feature requires a good understanding of both the classical algorithmic methods and the hardware and system technology, thereby making it an interesting research problem. Additionally, due to their importance, these implementation methods have received enormous attention in the networking research community.

Two core network features which have remained the focus of researchers are: i) packet buffering and scheduling, which generally involves fast packet buffering mechanisms coupled with a queuing and scheduling system, and ii) packet header processing, which includes header lookup operations in order to determine the next hop for the packet, and packet classification in order to prioritize packets based upon the source and destination addresses and the protocol. A third class of network feature which has recently seen wide adoption is deep packet inspection, in which every byte of the packet payload is examined in order to search for a set of pre-defined patterns. Deep packet inspection is often used in emerging application layer packet forwarding applications and intrusion detection systems.

Due to the importance and broad deployment of these three network features, a collection of novel methods has been proposed to implement them efficiently. These methods often consider the constraints and capabilities of the current hardware platforms and involve a complex mix of ideas drawn from theoretical computer science (algorithms and data structures), and system and hardware technology. Since the hardware and system technology evolves rapidly, there is a constant need to upgrade these implementations; nevertheless, there is also room to improve them in a more abstract and theoretical sense. In this proposal, we intend to undertake these tasks for the three network features mentioned above.
More specifically, we intend to evaluate the existing methods to implement them, and propose novel algorithms and mechanisms to improve their performance on the next generation memory subsystems and hardware platforms. Our aim is to split the effort evenly between the combination of the packet buffering and header lookup features on one hand, and the deep packet inspection feature on the other.

The first two network features we are focusing on have already been comprehensively studied, and it appears that there is little room for any fundamental improvement. However, the evolution of new implementation platforms like network processors has opened up several opportunities for novel ideas and methods of implementation. Network processors are software-programmable devices and their feature sets are specifically targeted at networking applications. They sport a collection of memory banks, running at different operating frequencies, thereby creating memories of different bandwidth and access latency. The storage capacity of these memories is also diverse; a general trend is that larger memories have relatively lower bandwidth and higher access latency. The presence of such a diverse collection of memories presents new levels of challenges and opportunities in developing the memory sub-system. For example, if the data-structures used in memory intensive features like packet buffering and header lookup are spread out across various memories, and the fast but small memories are prudently used, then the performance can be dramatically enhanced. Consequently, one of the objectives of the research is to develop innovative ways of distributing the data-structure across different memories such that both the total available bandwidth and space are uniformly utilized. The distribution mechanism can either be static, in which the memories will be pre-allocated to different data-structure segments, or dynamic, in which portions of the data-structure will be allowed to migrate from one memory to another. Traditional caches, which often improve the average-case performance, are one form of such a dynamic mechanism.

Current network processor devices also contain specialized engines, like hash accelerators and content addressable memory (CAM), which offer further opportunities. Popular hash based techniques like hash tables and Bloom filters can now be cost-effectively employed. The presence of hashing eases the use of randomized methods which can provide strong probabilistic performance guarantees at a reduced cost. CAM, on the other hand, opens up the opportunity to easily employ an associative caching scheme, which can greatly improve the average-case performance of the queuing and buffering features. In this research proposal, we also aim to explore these possibilities of utilizing various specialized hardware capabilities to improve the performance of the network features.

The third network feature, deep packet inspection, which is one of our primary research focuses, has recently gained widespread adoption. The key reason is that many emerging network services now handle packets based on payload content, in addition to the structured information found in packet headers. Forwarding packets based on content requires new levels of support in networking equipment, wherein every byte of the packet payload is inspected in addition to examining the packet headers. Traditionally, this deep packet inspection has been limited to comparing packet content to sets of strings.
However, newly emerging systems are replacing string sets with regular expressions, due to their increased flexibility and expressiveness. Several content inspection engines have recently migrated to regular expressions, including: Snort [5], Bro [4], 3Com's TippingPoint X505 [20], and various network security appliances from Cisco Systems [21]. Additionally, layer 7 filters based on regular expressions [30] are available for the Linux operating system.

While flexible and expressive, regular expressions traditionally require substantial amounts of memory, and the state-of-the-art algorithms to perform regular expression matching are unable to keep up with the ever increasing link speeds. To see why, we must consider how regular expressions are implemented. A regular expression is typically represented by a finite automaton (FA). FAs can be of two basic types: non-deterministic finite automata (NFA) and deterministic finite automata (DFA). The distinction between an NFA and a DFA is that an NFA can potentially make multiple moves on an input symbol, while a DFA makes a single move on any given input symbol. A DFA therefore results in deterministic and high performance, as there is a single active state at any point in time. However, the state space blowup problem in DFAs appears to be a serious issue, and limits their practical applicability.

The proposed research will focus mostly on algorithmic solutions to the problem, aiming at developing innovative architectures which can efficiently implement the current and future regular expressions. We propose to begin the research by systematically studying the trade-offs involved in the traditional regular expression implementations and the hardware capabilities needed to execute any given finite automaton. A preliminary analysis suggests that current hardware technologies are capable of executing machines which are much more complex than a finite automaton. Such machines can trade off space and performance much more effectively than a finite automaton and fully utilize the capabilities of the current hardware platforms. These machines can also employ probabilistic methods to improve the performance, because the likelihood of completely matching the patterns which are used in networking applications is remote. One of our main objectives in this proposal is to develop such novel machine architectures.

Another important objective of the research is to develop algorithms to efficiently store a given machine (e.g. a finite automaton, a push-down automaton or any new machine) into memory. The prime concerns are the memory usage and the bandwidth required to execute the machine. Traditional table compression algorithms are known to be inefficient in packing the finite automata which are used in networking systems. The recently developed CD2FA approach appears promising; however, its applicability to more complex machines is not yet known, therefore we intend to extend these schemes so that they can be applied to more general machines. An orthogonal research objective is to investigate the possibilities to further reduce the memory by eliminating the overheads involved in explicitly storing the transitions. Traditionally, a transition of an automaton requires log2n bits, where n is the total number of states. It appears that, with the aid of hashing techniques, each node can be represented with fewer bits, thus the number of bits needed to store a transition can be reduced.
Our preliminary analysis suggests that where conventional methods require 20 bits to represent each transition in a one million node machine, this technique will only require 4 bits.

To summarize, in this proposal we plan to work on three important network functions. For each of the functions, we plan to evaluate the existing implementation methods and the challenges that arise with the introduction of new platforms. Specifically, for each feature, we plan to undertake the following tasks:

1. Packet buffering and queuing (25% of the total effort)
1.1. How to use randomization techniques to improve the buffering performance
1.2. How to build an efficient buffering sub-system with a collection of memories of different size, bandwidth and access latency
1.3. Hashing and Bloom filter based buffering and caching sub-systems

2. Header lookup (25% of the total effort)
2.1. Architecture of header lookup engines capable of supporting tera-bit data throughput
2.2. Algorithms to compress the lookup data-structure, and the implication of caching on lookup performance

3. Deep packet inspection (50% of the total effort)
3.1. Evaluation of the patterns used in current deep packet inspection systems
3.2. Trade-off between using NFAs and DFAs; analysis of the worst- and average-case performance
3.3. Evaluation of intermediate approaches like lazy DFA
3.4. Introduction of novel machines, potentially different from finite automata, which are capable of performing regular expression matching
3.5. Memory compression schemes (e.g. Delayed input DFAs (D2FA) and Content addressed delayed input DFAs (CD2FA))

The remainder of the proposal is organized as follows. Section 2 presents background and related work for the three network features. Section 3 describes new directions in packet header lookup, including HEXA and Peacock hashing. Section 4 examines the limitations of the traditional approaches to deep packet inspection and the directions we propose to pursue.

2. BACKGROUND AND RELATED WORK

We split this section into three subsections. Each subsection covers some of the relevant related work for "packet buffering and queuing", "packet header lookup" and "deep packet inspection", respectively.

2.1 Packet buffering and queuing

Packet buffers in routers require substantial amounts of memory to store packets awaiting transmission. Router vendors typically dimension packet storage subsystems to have a capacity at least equal to the product of the link bandwidth and the typical network round-trip delay; for example, a 40 Gb/s link with a 250 ms round-trip time implies a buffer of 10 Gb, or about 1.25 GB. While a recent paper [x] has questioned the necessity of such large amounts of storage, current practice continues to rely on the bandwidth-delay product rule. The amount of storage used by routers is large enough to require the use of high density memory components. Since high density memories like DRAM have limited random access bandwidth and short packets are common in networks, it has become challenging for them to keep up with the continuously increasing link bandwidths.

A number of architectures have been proposed to buffer packets at such high link rates. In reference [x], the authors propose the ping-pong buffer, which can double the random access bandwidth of a memory based packet buffer. Such a buffer has been shown to exhibit good utilization properties; the memory utilization remains as high as 95% in practice. In references [x][x], Iyer et al.
have shown that a hybrid approach combining multiple off-chip memory channels with an on-chip SRAM can deliver high performance even in the presence of worst-case access patterns. The on-chip SRAM is used to provide a moderate amount of fast, per-queue storage, while the off-chip memory channels provide bulk storage. Unfortunately, the amount of on-chip SRAM needed grows as the product of the number of memory modules and the number of queues, making it practical only when the number of individual queues is limited. More recently, multichannel packet storage systems [x][x] have been proposed that use randomization to enable high performance in the presence of arbitrary packet retrieval patterns. Such an architecture requires an on-chip SRAM whose size is proportional only to the number of memory modules and does not grow as the product of the number of memory modules and the number of queues, making it practical for any system irrespective of the number of queues. It has been shown that, even for systems which use DRAM memories with a large number of banks, the overall on-chip buffering requirement depends mostly on the number of channels and not on the product of the number of channels and the number of banks, thereby making such an approach highly scalable.

While packet storage is important, modern routers often employ multiple queues to store packets. These queues are used to implement various packet scheduling policies, QoS, and other types of differentiated services applied to packet aggregates. The problem of scheduling real-time messages in packet switched networks has been studied extensively. Practical algorithms can be broadly classified as either timestamp based or round-robin based. Timestamp based algorithms [x] try to emulate GPS [x] by sending packets, approximately, in the same order as sent by a reference GPS server. This involves computing timestamps for the various queues and sorting them in increasing order. Round-robin schedulers [x] avoid the sorting bottleneck by assigning time slots to the queues and transmitting, from the queue in the current slot, multiple packets with cumulative size up to a maximum sized packet. Many routers use multiple hierarchies of queues in order to implement sophisticated scheduling policies; e.g. the first set of queues may represent physical ports, the second classes of traffic, and the third may consist of virtual (output) queues.

When there are a large number of queues, off-chip memory is required to store the queuing and scheduling data-structure, which complicates the design of the buffering and queuing subsystem. In fact, the off-chip memory can be a significant contributor to the cost of the queuing subsystem and can place serious limits on its performance, particularly as link speeds scale beyond 10 Gb/s. A recent series of papers has demonstrated queuing architectures which alleviate these problems and maintain a high throughput. In [x], the authors show how queuing subsystems using a combination of implicit buffer pointers, multi-buffer list nodes and coarse grained scheduling can dramatically improve the worst-case performance, while also reducing the SRAM bandwidth and capacity needed to achieve high performance. In [x], the authors propose a cache based queue engine, which consists of a hardware cache and a closely coupled queuing engine to perform queue operations. Such cache based engines are also available on modern network processors like the Intel IXP series.
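To make the queuing structures discussed above concrete, the sketch below (Python, purely illustrative; the queue names, sizes and fields are invented, and it does not reproduce any of the cited designs) shows the shape of a linked-list multi-queue packet buffer. The per-queue descriptor (head, tail, length) is the small, frequently touched state that the schemes above keep in on-chip SRAM or a hardware queue cache, while the packet buffers and their link fields live in bulk DRAM-like storage.

```python
# Toy multi-queue packet buffer: small per-queue descriptors, bulk buffer storage.
from collections import namedtuple

NUM_BUFFERS = 8                       # bulk packet store, e.g. DRAM
payload = [None] * NUM_BUFFERS        # packet contents
next_ptr = [None] * NUM_BUFFERS       # per-buffer link field
free_list = list(range(NUM_BUFFERS))  # indices of free buffers

QueueDesc = namedtuple("QueueDesc", "head tail length")
queues = {}                           # queue id -> descriptor (the "hot" per-queue state)

def enqueue(qid, packet):
    buf = free_list.pop()             # allocate a buffer
    payload[buf], next_ptr[buf] = packet, None
    q = queues.get(qid, QueueDesc(None, None, 0))
    if q.tail is not None:
        next_ptr[q.tail] = buf        # link behind the current tail
        queues[qid] = QueueDesc(q.head, buf, q.length + 1)
    else:
        queues[qid] = QueueDesc(buf, buf, 1)

def dequeue(qid):
    q = queues[qid]
    buf = q.head
    packet = payload[buf]
    new_head = next_ptr[buf]
    queues[qid] = QueueDesc(new_head, q.tail if new_head is not None else None,
                            q.length - 1)
    free_list.append(buf)             # return the buffer
    return packet

enqueue("port0/class1", "pkt-A")
enqueue("port0/class1", "pkt-B")
enqueue("port1/class0", "pkt-C")
print(dequeue("port0/class1"))        # -> pkt-A
print(dequeue("port0/class1"))        # -> pkt-B
```

Every enqueue or dequeue touches one descriptor plus one or two link fields, which is why caching the descriptors (and packing several packets per list node, as in the schemes above) pays off at high link rates.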
While such NP based queuing assists have limited capability, a number of more sophisticated commercial packet queuing subsystems are available, which often target QoS and traffic management applications. They often use custom logic and memory interfaces to achieve high performance; however, there is a need for more general and programmable queuing solutions, which use commodity memory and processing components.

2.2 Packet header lookup

An Internet router processes and forwards incoming packets based upon the structured information in the packet header. The next hop for a packet is determined after examining the destination IP address; this operation is often called IP lookup. An array of advanced services, which determine the treatment a packet receives at a router, examines the combination of the source and destination IP addresses and ports; this operation is called packet classification. The distinction between IP lookup and packet classification is simply that IP lookup classifies a packet based on a single field in the header while packet classification classifies a packet based on multiple fields. The core of both functions consists of determining the longest prefix matching the header fields within a database of variable length prefixes.

Longest prefix match algorithms have been widely studied. Well known mechanisms range from TCAM [9][10] to Bloom filter [6] and hash table [1] based schemes. While these hardware based approaches, especially TCAM, have been widely adopted, they generally consume a lot of power. Consequently, algorithmic solutions have remained of interest to researchers. Algorithmic solutions often employ a trie to perform the longest prefix lookup. A trie can be built by traversing the bits in each prefix from left to right, and inserting appropriate nodes in the trie. This trie can later be traversed to perform the lookup operations. A substantial number of papers have been written in this space on efficiently implementing these tries, so that the total memory consumption can be reduced and the lookup and update rates can be improved [xxx].

With the current memory technology, trie based implementations of header lookup can easily support a data throughput of 10 Gbps. However, at 40 Gbps data rates, a minimum sized 40-byte packet may arrive every 8 ns, and it may become challenging to perform lookup operations with a single memory. A number of researchers have therefore proposed pipelined tries. Such tries enable high throughput because, when there are enough memory stages in the pipeline, no stage is accessed more than once for a search, and during each cycle each stage can service a memory request for a different lookup. Recently, Baboescu et al. [21] have proposed a circular pipelined trie, which is different from the previous ones in that the memory stages are configured in a circular, multipoint access pipeline so that lookups can be initiated at any stage. At a high level, this multi-access and circular structure enables more flexibility in mapping trie nodes to pipeline stages, which in turn maintains uniform memory occupancy. A refined version of the circular pipeline called CAMP has been introduced in [x], which employs a relatively simple method to map the trie nodes to the pipeline stages, thereby improving the rate at which the trie can be updated. CAMP also presents relatively simple but effective methods to maintain a high memory utilization and scalability in the number of pipeline stages.
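As a concrete illustration of the trie lookup just described, and of the level-per-stage organization that the pipelined and circular-pipeline schemes build on, here is a minimal Python sketch. The prefix table and next-hop labels are made up, and each "stage" is just a dictionary standing in for a separate per-stage memory.

```python
# Binary trie longest prefix match, organized one level ("stage") at a time.
ROUTES = {"1": "A", "00": "B", "11": "C", "011": "D"}   # hypothetical prefixes -> next hops

depth = max(len(p) for p in ROUTES)
stages = [dict() for _ in range(depth + 1)]
stages[0][""] = None                                     # root node, no next hop
for prefix, nhop in ROUTES.items():
    for i in range(1, len(prefix) + 1):
        stages[i].setdefault(prefix[:i], None)           # create interior nodes
    stages[len(prefix)][prefix] = nhop                   # mark valid prefix

def lookup(addr_bits: str):
    """Walk one level per step; each step touches exactly one stage, so with one
    physical memory per stage, several lookups can be in flight at once."""
    best, path = None, ""
    for level in range(depth + 1):
        if path not in stages[level]:
            break
        if stages[level][path] is not None:
            best = stages[level][path]                   # longest match so far
        if level < depth and level < len(addr_bits):
            path += addr_bits[level]                     # descend one level
        else:
            break
    return best

print(lookup("0110"))   # -> D
print(lookup("1000"))   # -> A
print(lookup("0101"))   # -> None (no matching prefix)
```

In a hardware pipeline each level (or group of levels) would sit in its own memory, so a lookup touches every stage at most once and a new lookup can enter the pipeline every cycle; the circular variants discussed above additionally allow lookups to start at any stage, which helps balance memory occupancy across stages.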
Such circular pipeline based lookup implementations can not only provide a high lookup rate but also improve the memory utilization and reduce the power consumption. It will be extremely valuable to evaluate the feasibility of incorporating such specialized engines in modern network processors. A number of proposals have also been made in the context of header lookup implementations on a network processor.

2.3 Deep packet inspection

Deep packet inspection has recently gained widespread popularity as it provides the capability to accurately classify and control traffic in terms of content, applications, and individual subscribers. Cisco and others today see deep packet inspection happening in the network, and they argue that "Deep packet inspection will happen in the ASICs, and that ASICs need to be modified" [19]. Some important applications requiring deep packet inspection are listed below:

Network intrusion detection and prevention systems (NIDS/NIPS) generally scan the packet header and payload in order to identify a given set of signatures of well known security threats.

Layer 7 switches and firewalls provide content-based filtering, load-balancing, authentication and monitoring. Application-aware web switches, for example, provide scalable and transparent load balancing in data centers.

Content-based traffic management and routing can be used to differentiate traffic classes based on the type of data in packets.

Deep packet inspection often involves scanning every byte of the packet payload and identifying a set of matching predefined patterns. Traditionally, rules have been represented as exact match strings consisting of known patterns of interest. Naturally, due to their wide adoption and importance, several high speed and efficient string matching algorithms have been proposed recently. Some of the standard string matching algorithms, such as Aho-Corasick [7], Commentz-Walter [8], and Wu-Manber [9], use a preprocessed data-structure to perform high-performance matching. A large body of research literature has concentrated on enhancing these algorithms for use in networking. In [11], Tuck et al. present techniques to enhance the worst-case performance of the Aho-Corasick algorithm. Their algorithm was guided by the analogy between IP lookup and string matching, and applies bitmap and path compression to Aho-Corasick. Their scheme has been shown to reduce the memory required for the string sets used in NIDS by up to a factor of 50 while improving performance by more than 30%.

Many researchers have proposed high-speed pattern matching hardware architectures. In [12], Tan et al. propose an efficient algorithm that converts an Aho-Corasick automaton into multiple binary state machines, thereby reducing the space requirements. In [13], the authors present an FPGA-based design which uses character pre-decoding coupled with CAM-based pattern matching. In [14], Yusuf et al. use hardware sharing at the bit level to exploit logic design optimizations, thereby reducing the area by a further 30%. Other work [25, 26, 27, 28, 29] presents several efficient string matching architectures; their performance and space efficiency are well summarized in [14]. In [1], Sommer and Paxson note that regular expressions might prove to be fundamentally more efficient and flexible as compared to exact-match strings when specifying attack signatures.
The flexibility is due to the high degree of expressiveness achieved by using character classes, union, optional elements, and closures, while the efficiency is due to the effective schemes to perform pattern matching. Open source NIDS systems, such as Snort and Bro, use regular expressions to specify rules. Regular expressions are also the language of choice in several commercial security products, such as TippingPoint X505 [20] from 3Com and a family of security appliances from Cisco Systems [21]. Although some specialized engines such as RegEx from Tarari [22] report packet scan rates up to 4 Gbps, the throughput of most such devices remains limited to sub-gigabit rates. There is great interest in and incentive for enabling multi-gigabit performance on regular expression based rules.

Consequently, several researchers have recently proposed specialized hardware-based architectures which implement finite automata using fast on-chip logic. Sindhu et al. [15] and Clark et al. [16] have implemented non-deterministic finite automata (NFAs) on FPGA devices to perform regular expression matching and were able to achieve very good space efficiency. Implementing regular expressions in custom hardware was first explored by Floyd and Ullman [18], who showed that an NFA can be efficiently implemented using a programmable logic array. Moscola et al. [17] have used DFAs instead of NFAs and demonstrated significant improvement in throughput, although their datasets were limited in terms of the number of expressions. These approaches all exploit a high degree of parallelism by encoding automata in the parallel logic resources available in FPGA devices. Such a design choice is guided partly by the abundance of logic cells on FPGAs and partly by the desire to achieve high throughput, as such levels of throughput might be difficult to achieve in systems that store automata in memory. While such a choice seems promising for FPGA devices, it might not be acceptable in systems where the expression sets need to be updated frequently. More importantly, for systems which are already in deployment, it might prove difficult to quickly re-synthesize and update the regular expression circuitry. Therefore, regular expression engines which use memory rather than logic are often more desirable, as they provide a higher degree of flexibility and programmability.

Commercial content inspection engines like Tarari's RegEx already emphasize the ease of programmability provided by a dense multiprocessor architecture coupled to a memory. Content inspection engines from other vendors [33, 34] also use memory-based architectures. In this context, Yu et al. [10] have proposed an efficient algorithm to partition a large set of regular expressions into multiple groups, such that the overall space needed by the automata is reduced dramatically. They also propose architectures to implement the grouped regular expressions on both general-purpose processor and multi-core processor systems, and demonstrate an improvement in throughput of up to 4 times. Emphasizing the importance of memory based designs, a recently proposed representation of regular expressions called delayed input DFA (D2FA) [xx] attempts to reduce the number of transitions while keeping the number of states the same. D2FAs use default transitions to reduce the number of labeled transitions in a DFA. A default transition is followed whenever the current input character does not match any labeled transition leaving the current state.
If two states have a large number of "next states" in common, we can replace the common transitions leaving one of the states with a default transition to the other. No state can have more than one default transition, but if the default transitions are chosen appropriately, the amount of memory needed to represent the parsing automaton can be dramatically reduced. Unfortunately, the use of default transitions also reduces the throughput, since no input is consumed when a default transition is followed, but memory must be accessed to retrieve the next state. In [xx], the authors develop an alternate representation for D2FAs called the Content addressed D2FA (CD2FA) that allows them to be both fast and compact. A CD2FA is built upon a D2FA, whose state numbers are replaced with content labels. The content labels compactly contain information which is sufficient for the CD2FA to avoid any default traversal, thus avoiding unnecessary memory accesses and hence achieving higher throughput. The authors argue that while a CD2FA requires a number of memory accesses equal to those required by a DFA, in systems with a small data cache a CD2FA surpasses a DFA in throughput, due to its small memory footprint and higher cache hit rate.

3. Packet header lookup – new directions

Header lookup in IP networks generally involves determining the longest prefix matching the packet header fields within a database of variable length prefixes. In this research proposal, we focus on two novel methods to improve the efficiency of longest prefix match operations. The first method, called HEXA, is directly applicable to trie based algorithms and can reduce the memory required to store a trie by up to an order of magnitude. Such a memory reduction will in turn improve the lookup rate, primarily for two reasons. First, the compressed trie can support higher strides, thereby reducing the number of memory accesses, and second, the memory, being much smaller in size, will run at much higher clock speeds. Our first order analysis suggests that HEXA based tries also preserve fast incremental update properties.

Our second method attempts to improve the performance of header lookup in a more general sense. A series of recent papers have advocated the use of hash tables in order to perform header lookup operations. Since the performance of a hash table can deteriorate considerably in the worst case, they have been coupled with Bloom filter based techniques. We extend these techniques by introducing a novel hash table implementation called Peacock hashing. Our preliminary analysis suggests that Peacock hash tables have several desirable properties, which can lead to a more efficient implementation of header lookup operations. We now elaborate on these two research directions.

3.1 HEXA

We propose a new approach to represent these directed graph structures, which requires much less memory. The approach, called History based Encoding, eXecution, and Addressing (HEXA), challenges the well accepted assumption that we need log2n bits to identify each node in a directed graph containing n nodes. More specifically, we show that HEXA identifies a node with fewer than log log2n bits, thus dramatically reducing the memory requirement of the fast path, which mostly consists of transitions (identifiers of next states). The total memory also gets reduced significantly, because auxiliary information often represents a small fraction of the total memory.
The key idea behind HEXA is that, in any directed graph where nodes are not accessed in a random, ad-hoc order but in an order defined by its transitions, nodes can, to some extent, be uniquely identified by the way the parsing proceeds in the graph. For instance, in a trie, if we begin parsing at the root node, we can reach any given node only for a unique stream of input symbols. In a state minimized finite automaton, which recognizes regular expressions or strings, each state again corresponds to a unique input pattern, and the state can be reached only if the window of a few previous symbols corresponds to that unique pattern. Thus, in such graphs, as the parsing proceeds, we can remember the last few symbols, which can be used to uniquely identify the nodes. We consider a simple example before we formally introduce the key concepts behind HEXA.

3.2 A Simple Example

Let us consider a simple directed graph, an IP lookup trie. A set of 5 prefixes and the corresponding binary trie, containing 9 nodes, is shown in Figure 1 (prefixes 1*, 00*, 11*, 011* and 0100*, with next hops P1 through P5). Each node stores the identifier of its left and right child and a bit indicating if the node corresponds to a valid prefix. Since there are 9 nodes, identifiers are 4 bits long, and a node requires a total of 9 bits in the fast path.

Figure 1: a) routing table, b) corresponding binary trie.

The fast path trie representation is shown below, where nodes are shown as a 3-tuple: valid prefix bit, left child and right child (NULL indicates no child):

1. 0, 2, 3       4. 1, NULL, NULL    7. 0, 9, NULL
2. 0, 4, 5       5. 0, 7, 8          8. 1, NULL, NULL
3. 1, NULL, 6    6. 1, NULL, NULL    9. 1, NULL, NULL

Here, we assume that the next hops associated with a matching node are stored in a shadow trie, which is stored in a relatively slow memory. Note that, if the next hop trie has a structure identical to the fast path trie, then the fast path trie need not contain any additional information. Once the fast path trie is traversed and the longest matching node is found, we will read the next hop trie once, at the location corresponding to the longest matching node. We now consider storing the fast path of the trie using HEXA.

In HEXA, a node is identified by the input stream over which it is reached. Thus, the HEXA identifiers of the nodes will be:

1. -     4. 00    7. 010
2. 0     5. 01    8. 011
3. 1     6. 11    9. 0100

Since these identifiers are unique, a node will require a total of 3 bits, if we have a function which maps each identifier to a unique number in [1, 9]. This unique number will be the memory address where the node's 3 bits are stored. The 3 bits make up the 3-tuple for the node; the first bit is set if the node corresponds to a valid prefix, and the second and third bits are set if the node has a left and a right child, respectively. Thus, if there are n nodes in the trie, ni is the HEXA identifier of the ith node and f is a one-to-one function mapping the ni's to [1, n], then an array containing n 3-bit entries is sufficient to represent the entire trie. For node i, the array will be indexed with f(ni), and traversal of the trie will be straightforward. We will start at the first trie node, whose 3-bit tuple will be read from the array at index f(-). If the match bit is set, we will make a note of the match, and fetch the next symbol from the input stream to proceed to the next trie node.
If the symbol is 0 (1) and the left (right) child bit of the previous node was set, then we will compute f(ni) (ni will now contain the first bit of the input stream) and read its 3 bits. We will continue in this manner until we reach a node with no child. The most recent node with the match bit set will correspond to the longest matching prefix.

Continuing with the earlier trie of 9 nodes, let the mapping function f have the following values for the nine HEXA identifiers listed above:

1. f(-) = 4     4. f(00) = 2    7. f(010) = 5
2. f(0) = 7     5. f(01) = 8    8. f(011) = 3
3. f(1) = 9     6. f(11) = 1    9. f(0100) = 6

With this, the array of 3-bit tuples will be programmed as follows; we also show the corresponding next hops:

Location:  1      2      3      4      5      6      7      8      9
Fast path: 1,0,0  1,0,0  1,0,0  0,1,1  0,1,0  1,0,0  0,1,1  0,1,1  1,0,1
Next hop:  P3     P2     P4     -      -      P5     -      -      P1

This array and the above mapping function are sufficient to traverse through the trie for any given input stream. This example suggests that we can dramatically reduce the memory requirements to represent a trie, by practically eliminating all overheads associated with storing the node identifiers, which require log2n bits. However, this requires a one-to-one function to map each HEXA identifier to a unique memory location, devising which is not trivial. In fact, when the trie is frequently updated, maintaining the one-to-one mapping may become extremely difficult. We will soon show that we can enable such a one-to-one mapping at very low cost. We also ensure that our approach maintains very fast incremental updates; i.e. when nodes are added or deleted, a new one-to-one mapping is computed quickly and with very few changes in the fast path array.

3.3 Devising One-to-one Mapping

We have seen that we can compactly represent a directed graph if we have a function to map each HEXA identifier to a unique number between 1 and n, where n is the total number of nodes in the graph. We can generalize the problem slightly if we allow the memory array to have space for m nodes, where m >= n. Thus, we have to map n HEXA identifiers to unique locations in a memory array containing a total of m cells. This essentially is similar to computing a perfect hash function. For large n, finding such a perfect hashing becomes extremely compute intensive and impractical.

We can simplify the problem dramatically by considering the fact that the HEXA identifiers of these nodes can be changed without changing their meaning. For instance, we can allow a node identifier to contain a few additional (say c) bits, which we can alter at our convenience. We call these c bits the node's discriminator. Thus, the HEXA identifier of a node will be the history of labels on which we reach the node, plus its c-bit discriminator. (Note that, in directed graphs which are more complex than a trie, multiple nodes may be reached with the same set of input labels, e.g. in an NFA; thus discriminators will be essential in these situations to assign unique HEXA identifiers to such nodes.) Having these discriminators and the ability to alter them provides us with potentially multiple choices of memory locations for a node. We can use any sufficiently random hash function, and each node will have 2^c choices of HEXA identifiers and hence up to 2^c memory locations, of which we have to pick one.

This problem can now be reduced to a bipartite graph matching problem. The bipartite graph G = (V1+V2, E) consists of the nodes of the original directed graph as the left set of vertices, and the memory locations as the right set of vertices. The edges connecting the left set of vertices to the right set are essentially the result of the hash function. Since discriminators are c bits long, each left vertex will have up to 2^c edges connected to random right vertices.
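The following Python sketch walks through the example above end to end. The mapping f is copied from the text and stored in an explicit dictionary only for illustration; in HEXA proper no such table exists, since the location is computed by hashing the identifier (plus discriminator bits), and choosing those discriminators is exactly the matching problem on the graph G just described.

```python
# Minimal sketch of the HEXA fast-path array for the 9-node example trie above.
PREFIXES = {"1": "P1", "00": "P2", "11": "P3", "011": "P4", "0100": "P5"}

# HEXA identifier of a node = the string of input bits on which it is reached.
# Enumerate all trie nodes: the root "" plus every prefix of every stored prefix.
nodes = {""}
for p in PREFIXES:
    for i in range(1, len(p) + 1):
        nodes.add(p[:i])

f = {"": 4, "0": 7, "1": 9, "00": 2, "01": 8,
     "11": 1, "010": 5, "011": 3, "0100": 6}      # one-to-one map into [1, 9], from the text

# Each memory cell holds 3 bits: (valid prefix, has left child, has right child).
memory = [None] * (len(nodes) + 1)                 # cell 0 unused; cells 1..9
for ident in nodes:
    memory[f[ident]] = (ident in PREFIXES, ident + "0" in nodes, ident + "1" in nodes)

def longest_prefix_match(addr_bits: str):
    """Walk the trie without explicit child pointers: the next node's address is
    recomputed from the history of consumed bits (its HEXA identifier)."""
    ident, best = "", None
    while True:
        valid, has_left, has_right = memory[f[ident]]
        if valid:
            best = ident                           # remember longest match so far
        if not addr_bits:
            break
        bit, addr_bits = addr_bits[0], addr_bits[1:]
        if (bit == "0" and has_left) or (bit == "1" and has_right):
            ident += bit                           # implicit "pointer": just extend the history
        else:
            break
    return PREFIXES.get(best)

print(longest_prefix_match("0100111"))   # -> P5
print(longest_prefix_match("1011"))      # -> P1
print(longest_prefix_match("0111"))      # -> P4
```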
We refer to G as the memory mapping graph. Clearly, we intend to find a perfect matching in the memory mapping graph G, i.e. match each node identifier to a unique memory location. It is likely that no perfect matching exists. A maximum matching M in G, which is the largest set of pairwise non-adjacent edges, may not contain n edges, in which case some nodes will not be assigned any memory location. However, using theoretical analysis, we show that when c is O(log log n), a perfect matching will exist with high probability, even if m = n. When m is slightly greater than n, the probability of finding a perfect matching grows very quickly.

Continuing with the previous trie shown in Figure 1, we now seek to devise a one-to-one mapping using this method. We consider m = n and assume that c is 2; thus a node can have 4 possible HEXA identifiers, which will enable it to have up to 4 choices of memory locations. A complication in computing the hash values may arise because the HEXA identifiers are not of equal length. We can resolve it by first appending to a HEXA identifier its length and then padding the short identifiers with zeros. Finally, we append the discriminators to them. The resulting choices of identifiers and the memory mapping graph are shown in Figure 2, where we assume that the hash function is simply the numerical value of the identifier modulo 9. In the same figure, we also show a perfect matching, with the matching edges drawn in bold. With this perfect matching, a node will require only 2 bits to be uniquely represented (as c = 2).

Figure 2: Memory mapping graph and bipartite matching (for each node: its input label history, its four choices of HEXA identifiers, and the corresponding choices of memory locations).

3.4 Updating a Perfect Matching

We now briefly describe how our approach guarantees fast incremental updates when a node is removed and another is added to the graph. In several applications (e.g. in IP lookup), fast incremental updates are critically important. Thus, HEXA representations are practical only if a new perfect matching can be found quickly after some nodes are removed and new ones are added to the graph. We show that HEXA ensures that the asymptotic complexity of an update is O(log n / log log n); thus, when a node is removed and a new one is added, the new node is mapped to a memory location with fewer than O(log n / log log n) nodes remapped. From a practical standpoint, an update affects fewer than 5 nodes in a one million node graph, so updates can proceed very rapidly.

We now briefly discuss how updates are handled in HEXA; an update involves an existing node u being removed and a new node v being added. Suppose node u was mapped to memory location x; location x is now free, and the current matching is short of a perfect matching by a single edge.
We will try to match the newly added node v to a memory location by finding an alternating path between the node v and the memory location x. We assume that a perfect matching exists in the graph, meaning that alternating paths between v and x exist in the graph. We claim that the shortest such alternating path contains O(log n / log log n) nodes; thus, when we accomplish a perfect matching, it involves remapping O(log n / log log n) nodes.

Lemma 1. An alternating path between the node v and the memory location x exists and contains at most log n / log(log n - 1) left nodes.

Proof: The proof is trivial. If we start exploring the alternating paths in a breadth first order starting at the node v, we are guaranteed to reach memory location x before we cover all n nodes of the graph. Since the left nodes have log n - 1 non-matching edges incident upon them, and the right ones have exactly one matching edge incident on them, each pass of the alternating path breadth first traversal will multiply the number of nodes covered by log n - 1. Thus we will cover all n nodes after roughly log n / log(log n - 1) passes. Therefore, the shortest alternating path between node v and memory location x will contain at most log n / log(log n - 1) left nodes. ■

We now present an extension of HEXA, and show how we can further reduce the memory requirements. We also discuss how this extension of HEXA efficiently handles complex graphs, when our present definition of HEXA becomes less effective.

3.5 Peacock Hashing

While flexible and expressive, regular expressions traditionally require substantial amounts of memory, and the state-of-the-art algorithms to perform regular expression matching are unable to keep up with the ever increasing link speeds. To see why, we must consider how regular expressions are implemented. A regular expression is typically represented by a finite automaton (FA). FAs can be of two basic types: non-deterministic finite automata (NFA) and deterministic finite automata (DFA). The distinction between an NFA and a DFA is that an NFA can potentially make multiple moves on an input symbol, while a DFA makes a single move on any given input symbol. A DFA therefore results in deterministic and high performance, as there is a single active state at any point in time.
On the other hand, in order to simulate an NFA, one has to keep track of all possible moves that it can make, therefore the worst-case processing complexity of an NFA is O(n^2), when all n states are active at the same time. Clearly, in order to achieve high performance with an NFA, the underlying hardware must provide a high degree of parallelism, so that all NFA moves can be simulated simultaneously. Thus, NFAs are usually implemented on an ASIC/FPGA using flip-flops, logic gates and wires. Each state can be represented using a single flip-flop and the transitions between the states can be realized by appropriate interconnection between these flip-flops. If the NFA is in any given set of states, the corresponding flip-flops are set, and the subsequent transitions set the flip-flops representing the set of next states.

Another motivating factor behind such a design choice is that ASICs/FPGAs generally have limited on-chip memory resources, therefore NFAs make a natural choice, since for any regular expression whose ASCII length is n, the resulting NFA has only O(n) states. Such circuit based approaches to regular expression matching have received much attention in the FPGA community; however, these design choices might not be acceptable in systems where the rule set needs to be updated frequently. More importantly, for systems which are already in deployment, it might prove difficult to quickly re-synthesize and update the regular expression circuitry. Therefore, regular expression engines which use memory rather than logic are often more desirable, as they provide a higher degree of flexibility and programmability.

In memory based regular expression implementations, bandwidth is a precious resource and it is important to minimize the number of memory accesses, otherwise it may become the performance bottleneck. Consequently, DFAs are often the preferred method in such settings, because a DFA ensures that only one state traversal is needed for every input character. However, the state space blowup problem in DFAs appears to be a serious issue, and limits their practical applicability. To understand the nature of the state space blowup, we must examine the regular expressions which are commonly used in networking systems. There are two primary reasons that regular expression based rules lead to state space blowup, which incidentally are also the reasons that regular expressions are much more expressive than traditional string sets: i) regular expressions allow rules to contain unions of characters (e.g. case insensitive strings), and ii) they allow rules to contain closures (zero or more repetitions) of sub-expressions. When rules contain several closures over a union of a large number of characters, the parsing can get stuck at one of the closures, and the subsequent characters can be consumed without proceeding further from the closure. These characters can partially or completely match other rules, therefore the DFA has to create separate states in order to remember all such matches. Clearly, if there are n simple expressions each of length k, and each contains a single closure, then the resulting DFA can have as many as k^n states. Rules used in networking systems often contain closures, so state space blowup is quite common. In fact, for the current networking rule sets, a naïve DFA based approach often leads to many more states than what the current memory technology can cost-effectively store. Therefore, a large part of recent research has concentrated on avoiding the state space blowup.

One natural approach to avoid the state space blowup is to split a set of n rules into m subsets, and construct m DFAs, one for each subset. Clearly, this approach will require execution of m DFAs, thus each input character will require m state traversals, which will lead to an m-fold increase in memory bandwidth. However, it has been shown that, if we carefully group the rules into the subsets, then modest values of m can lead to a dramatic reduction in the total number of states. An alternative approach to reduce the number of states is called lazy DFA, which is a middle ground between NFA and DFA. The idea is to construct a DFA for those portions of the rules which are matched more often, and leave the other portions in the form of an NFA.
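The following small, self-contained Python experiment illustrates both the multiplicative state growth and the effect of grouping, using toy rules of the form "x.*y" (a literal character, a closure, a literal character) over disjoint characters; the rules and alphabet are invented for illustration. Because the combined DFA must remember, for every rule, how far that rule has progressed, the number of reachable states multiplies across rules, whereas per-group automata stay small at the cost of one state traversal per group per input character.

```python
# Count reachable states of a combined DFA versus separate DFAs for "x.*y"-style rules.
def combined_dfa_states(rules, alphabet):
    """Enumerate reachable states of the product automaton. A state is a tuple with one
    entry per rule: 0 = nothing seen, 1 = first char seen, 2 = rule matched (absorbing)."""
    def step(state, ch):
        out = []
        for (a, b), s in zip(rules, state):
            if s == 2:
                out.append(2)
            elif s == 1:
                out.append(2 if ch == b else 1)
            else:
                out.append(1 if ch == a else 0)
        return tuple(out)

    start = tuple(0 for _ in rules)
    seen, frontier = {start}, [start]
    while frontier:
        st = frontier.pop()
        for ch in alphabet:
            nxt = step(st, ch)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

alphabet = "abcdefgh"
rules = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]   # four independent "x.*y" rules

print("one combined DFA:", combined_dfa_states(rules, alphabet), "states")      # -> 81 (3^4)
print("four separate DFAs:",
      sum(combined_dfa_states([r], alphabet) for r in rules),
      "states in total, but 4 traversals per input character")                  # -> 12 (3*4)
```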
It appears that the first approach, of constructing multiple DFAs, trades both the worst- and average-case performance against the total memory consumption, while the second approach, of constructing a DFA only for small portions of the rules, attempts to trade memory consumption against only the worst-case performance; the average-case performance under normal input traffic is expected to remain good. While these approaches appear effective in reducing the memory, their quantitative effectiveness is not well understood, and moreover it is not clear if trading off either or both of the worst- and average-case performance is acceptable. For instance, if an approach ensures good average-case performance and also ensures that the likelihood of worst-case conditions is negligible, then it may be preferable over an alternative approach which reduces the average-case performance in an attempt to improve the worst case. On the other hand, it is also important to ensure that the architecture is not susceptible to denial of service (DoS) attacks. More specifically, it is important to ensure that when an attacker attempts to create the worst-case condition, the normal traffic remains unaffected.

An orthogonal research direction in improving the performance of DFA based approaches is to develop algorithms to efficiently implement a DFA. Early research has shown that for any given expression, a DFA exists which has the minimum number of states [2, 3]. The memory needed to represent a DFA is, in turn, determined by the product of the number of states and the number of transitions from each state. For an ASCII alphabet, each state will have 256 outgoing transitions. Thus, typical sets of regular expressions containing hundreds of patterns for use in networking, which yield DFAs with hundreds of thousands of states, require hundreds of megabytes of memory. Table compression techniques are not effective for these DFAs due to the relatively high number of unique 'next-states' from any given state.

A recently proposed representation of regular expressions called delayed input DFA (D2FA) [xx] attempts to reduce the number of transitions while keeping the number of states the same. D2FAs use default transitions to reduce the number of labeled transitions in a DFA. A default transition is followed whenever the current input character does not match any labeled transition leaving the current state. If two states have a large number of "next states" in common, we can replace the common transitions leaving one of the states with a default transition to the other. No state can have more than one default transition, but if the default transitions are chosen appropriately, the amount of memory needed to represent the parsing automaton can be dramatically reduced. Unfortunately, the use of default transitions also reduces the throughput, since no input is consumed when a default transition is followed, but memory must be accessed to retrieve the next state. In [xx], the authors develop an alternate representation for D2FAs called the Content addressed D2FA (CD2FA) that allows them to be both fast and compact. A CD2FA is built upon a D2FA, whose state numbers are replaced with content labels. The content labels compactly contain information which is sufficient for the CD2FA to avoid any default traversal, thus avoiding unnecessary memory accesses and hence achieving higher throughput.
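To make the default-transition mechanism concrete, the toy sketch below contrasts a full DFA transition table with an equivalent D2FA-style table in which two states keep only the transitions that differ from state 0, plus a default transition to it. The tables are invented for illustration (they are not derived from any rule set); the point is only the traversal rule: a labeled transition consumes the input character, while a default transition consumes nothing and costs an extra memory access.

```python
# Full DFA: every state stores one next-state per character.
dfa = {
    0: {"a": 1, "b": 0, "c": 0, "d": 0},
    1: {"a": 1, "b": 2, "c": 0, "d": 0},
    2: {"a": 1, "b": 0, "c": 3, "d": 0},
    3: {"a": 3, "b": 3, "c": 3, "d": 3},   # "match" state, absorbing
}

# Equivalent D2FA: states 1 and 2 differ from state 0 on a single character each,
# so they keep just that labeled transition plus a default transition to state 0.
d2fa = {
    0: ({"a": 1, "b": 0, "c": 0, "d": 0}, None),   # (labeled transitions, default)
    1: ({"b": 2}, 0),
    2: ({"c": 3}, 0),
    3: ({"a": 3, "b": 3, "c": 3, "d": 3}, None),
}

def run_dfa(inp):
    state = 0
    for ch in inp:
        state = dfa[state][ch]
    return state, len(inp)                    # exactly one access per character

def run_d2fa(inp):
    state, accesses = 0, 0
    for ch in inp:
        while True:
            labeled, default = d2fa[state]
            accesses += 1                     # one memory access per state visited
            if ch in labeled:
                state = labeled[ch]           # labeled transition: consume the character
                break
            state = default                   # default transition: character NOT consumed

    return state, accesses

text = "aabcabc"
print("DFA :", run_dfa(text))                 # same final state, 7 accesses
print("D2FA:", run_d2fa(text))                # same final state, 8 accesses
```

Here two of the four states shrink from four stored transitions to one plus a default pointer, at the price of occasional extra accesses; this is exactly the space/throughput trade-off that the CD2FA representation then attacks.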
The authors argue that while a CD2FA requires a number of memory accesses equal to those required by a DFA, in systems with a small data cache a CD2FA surpasses a DFA in throughput, due to its small memory footprint and higher cache hit rate.

It appears that the trade-offs between memory consumption, average-case performance and worst-case performance are not yet understood very clearly. At the same time, it is not clear if the current solutions and architectures are adequate and cover all points along the trade-off curve. It is also not clear if they align well with the current hardware and silicon technologies. Our first order evaluation suggests that traditional methods to implement regular expressions are unable to keep up with the ever increasing link rates. Many current commercial regular expression based systems, which use the traditional approach, run at sub-gigabit per second rates, and there is substantial evidence that their performance can be enhanced while also reducing their implementation cost. Consequently, it is important to investigate current solutions and come up with novel architectures which best exploit the current hardware capabilities.

The proposed research will focus primarily on algorithmic solutions to the problem of deep packet inspection, aiming to develop innovative architectures which can efficiently implement the current and future deep packet inspection rules. We propose to begin the research by systematically studying the trade-offs involved in the traditional regular expression implementations. First, we note that there are ample avenues to improve over the popular DFA representation, which suffers from both Amnesia and Insomnia. With Amnesia, the DFA only remembers its current state and ignores everything about the previous stream of data. With this tendency, it usually requires a large number of states in order to correctly recognize every possible input pattern. With Insomnia, the automaton tends to stay awake and unnecessarily match all portions of the expression, even though the suffix portions can be turned off for normal traffic, which rarely leads to a match. Consequently, the automaton again tends to unnecessarily require a large number of states.

Our main objective in this proposal is to develop novel machine architectures (possibly different from both NFA and DFA) which can efficiently implement thousands of complex regular expressions. At the same time, it is important to ensure that the machine exploits current hardware and memory technology in order to achieve high throughput at low cost. Another important objective of the research is to develop algorithms to store a given machine (e.g. a finite automaton, a push-down automaton or any new machine) into memory, so that i) the memory consumption can be reduced, and ii) the memory bandwidth required to execute the machine remains small. Traditional table compression algorithms are known to be inefficient in packing the finite automata which are used in networking systems. The recently developed CD2FA approach appears promising; however, its applicability to more complex machines is not yet known, therefore we intend to extend these schemes so that they can be applied to more general machines. An orthogonal research objective of this proposal aims at investigating the possibilities to further reduce the memory by eliminating the overheads associated with explicitly storing the transitions. Traditionally, a transition of an automaton requires log2n bits, where n is the total number of states.
It appears that, with the aid of hashing techniques, each node can be represented with fewer bits, thus the number of bits needed to store a transition can be dramatically reduced. Our preliminary analysis suggests that where conventional methods require 20 bits to represent each transition in a one million node machine, this technique will only require 4 bits.

To summarize, we intend to undertake the following tasks:

4. Detailed evaluation and analysis of the traditional approach to implement regular expressions
4.1. Evaluation of the characteristics of the regular expressions which are typically used in networking systems, and a detailed study of the trends in the evolution of the current deep packet inspection rules
4.2. Implications of the characteristics (e.g. closures, unions, etc.) of the rules for the resulting finite automaton
4.3. Trade-off between using NFAs and DFAs, and analysis of their worst- and average-case performance
4.4. Evaluation of intermediate approaches like lazy DFA

5. Introduction of novel regular expression representations
5.1. Delayed input DFAs (D2FA)
5.2. Content addressed delayed input DFAs (CD2FA)
5.3. New machines, which are capable of performing regular expression matching, but are different from finite automata

6. Investigation into new approaches to better implement a machine
6.1. Table compression algorithms
6.2. Algorithms to reduce the number of bits needed to represent the states of a machine

4. LIMITATIONS OF THE TRADITIONAL APPROACH

DFAs are the fastest known representation of regular expressions, hence they are often seen as the best candidate for networking applications where high throughput and real-time performance guarantees are desirable. DFAs are fast; however, they suffer from the state space blowup problem. Typical sets of regular expressions containing hundreds of patterns for use in networking yield DFAs with hundreds of thousands of states, limiting their practical use. For more complex rules, which are used in current intrusion detection systems (e.g. Snort), even the construction of a DFA becomes impractical. Therefore, it is important to develop new methods to implement regular expressions which are fast as well as compact. Before we attempt to develop these methods, we summarize some key properties of the regular expressions which are used in the current networking systems.

4.1 Current Regular Expressions

We have collected the regular expression based rules which are used in the Cisco Systems intrusion prevention systems (IPS), the Snort and Bro intrusion detection systems (IDS), the Linux layer-7 application protocol classifier, and Extensible Markup Language (XML) filtering applications. Our findings show that while the XML applications use simple regular expression rules (without many closures and character classes), the remaining systems use moderately complex regular expressions with several closures and unions. Some of the properties of these rules are listed in Table xxx.
In this table we highlight the number of bad sub-expressions found in the patterns, as these bad sub-expressions often lead to the state space blowup of a DFA. We also highlight the number of ".*" constructs found in the patterns, because a ".*" in a pattern often leads to the maximum state space blowup and also makes the pattern ambiguous. We will later see that ".*" also plays a decisive role in tuning the performance of any NFA based approach. Below, we summarize the key differences in the regular expressions used in these systems. In contrast to the patterns used in Snort/Bro, the patterns used in the Cisco IPS contain a large number of character classes, mostly because most of the Cisco IPS patterns are case-insensitive. Character classes do not contribute to state space blowup; they only increase the number of transitions. Snort/Bro patterns contain several length restrictions on the character classes. These length restrictions not only lead to the state space blowup of a DFA, they also lead to a large number of states in an NFA. In contrast, the XML and Cisco IPS patterns contain very few length restrictions. A large fraction of the patterns in Snort/Bro, the Linux L7 classifier and the XML filter begin with "^", as compared to the Cisco IPS patterns. Patterns which do not begin with a "^" implicitly contain a ".*" at the beginning, and such patterns, when merged with other patterns, contribute to the state space blowup.

4.2 Deficiencies of the Current Solutions
The traditional implementation of regular expressions has to deal with several complications, which can be classified into two broad categories. The first complication arises from packet multiplexing at the network links; the architecture of any pattern matching engine has to be designed to cope with it. At any network link, the data stream associated with each connection usually needs to be individually matched against the given set of regular expression rules. Since packets belonging to different connections can arrive interspersed with each other, the pattern matcher needs to remember the current state of every connection, so that it can continue matching from that point when the next packet of the connection arrives. Consequently, upon a switch from connection x to connection y, it first needs to store the state of the machine for the current connection x. Afterwards, the state of the machine at the time the last packet of connection y was processed is loaded, and then the payload of the newly arrived packet is parsed. Thus the machine state needs to be stored on a per connection basis. At high speed backbone links the total number of active connections can reach up to a million, therefore it is important to limit the amount of state associated with any machine. For instance, even though NFAs are compact, it is possible that a large fraction of the states of an NFA are active at the same time. Thus, for such machines, the amount of space needed to store the "per connection machine state", and the bandwidth needed to load and store these states, may become the performance bottleneck. On the other hand, in a DFA based machine the amount of state remains very small, since only one state is active at any point in time. Additionally, DFAs are also much faster at parsing the packet payload; they require only one state traversal per input character, thus DFAs are often preferred over NFAs. DFA based solutions, however, suffer from the state space blowup problem, i.e. a DFA may have an exponential number of states.
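To make the per connection bookkeeping described above concrete, the following minimal sketch (Python; names such as Dfa, on_packet and conn_state are illustrative, not the proposed implementation) shows a table driven DFA whose single current state is saved and restored across interleaved packets of different connections.

    class Dfa:
        def __init__(self, transitions, start, accepting):
            self.transitions = transitions   # transitions[state][byte] -> next state
            self.start = start
            self.accepting = accepting

        def scan(self, state, payload):
            # Consume one packet payload; one table lookup per input byte.
            matched = False
            for byte in payload:
                state = self.transitions[state][byte]
                matched = matched or (state in self.accepting)
            return state, matched

    conn_state = {}   # connection id -> the single DFA state kept across packets

    def on_packet(dfa, conn_id, payload):
        state = conn_state.get(conn_id, dfa.start)   # load the state of this connection
        state, matched = dfa.scan(state, payload)
        conn_state[conn_id] = state                  # store it back for the next packet
        return matched

An NFA based matcher would instead have to save and restore the entire set of active states for every connection, which is precisely the state size and bandwidth concern raised above.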
For the regular expressions used in current networking systems, the causes of the state space blowup in a DFA can be classified into three types:

i) Traditionally, all finite automaton based approaches (including DFAs) match the complete regular expression against the packet payload. However, in IDS/IPS applications, the likelihood that a normal packet completely matches a regular expression is very small; in fact, any normal data stream is expected to match only the first few symbols of a regular expression. Thus, in practice a machine may perform parsing only against the prefix portion of the regular expression, thereby avoiding parsing against the complete regular expression; we refer to this deficiency of the traditional approach as Insomnia. Due to Insomnia, the machine tends to remain awake over the entire rule, including both the prefix and the suffix portions, even though the suffix portions can be turned off for normal traffic. Without a cure, a machine suffering from Insomnia tends to unnecessarily require a large number of states. The problem appears to have an easy cure, because the suffix portions of the regular expressions usually contain the constructs which lead to the state space blowup (closures, unions, and length restrictions).

ii) The second deficiency of DFA based machines can be classified as Amnesia. Due to Amnesia, the DFA has very limited memory: it remembers only its current state and forgets everything about the previous stream of data and the associated partial matches. Due to this tendency, DFAs usually require a large number of states in order to correctly recognize every possible input pattern. It appears that if one equips a DFA based machine with slightly more memory, then the state space blowup can be avoided to a large extent.

iii) The third deficiency of finite automata can be classified as xxxxx, due to which finite automata are unable to efficiently count the occurrences of certain symbols in the input stream. Thus, whenever a regular expression contains a length restriction of k on certain symbols, the finite automaton requires at least k additional states. When several expressions contain length restrictions and they are merged into a single DFA, the number of states can increase exponentially. It appears that if we equip the automata with a few counters, then this state space blowup can be avoided.

Our first solution, described below, attempts to cure a DFA of Insomnia.

4.3 Curing DFA from Insomnia
The traditional approach to regular expression matching constructs a finite automaton for the complete regular expression. Thus, the finite automaton considers all portions of the regular expression in the parsing process, while, in practice, only the prefix portions of the expression need to be considered, because normal data streams rarely match more than the first few characters of any regular expression. We refer to this deficiency as Insomnia. In practice, a large fraction of the regular expression rules contain complicated constructs near the end, thus Insomnia unnecessarily leads to the state space blowup of a DFA. An effective cure for Insomnia can significantly reduce the DFA size by considering only the prefix portions of the regular expressions. In the rare case that a prefix portion matches, the suffix portion is handled separately. Such an approach appears attractive for enhancing the parsing performance on normal data streams.
With the capability of independently parsing the prefix and suffix portions of regular expressions, one can bifurcate the packet processing path into two portions: a fast path and a slow path. In the fast path, only the prefix portions of the regular expressions will be matched. If the prefix portions are sufficiently simple, then a composite DFA can be constructed for all prefix regular expressions, thereby enabling high performance. (At this point in the discussion, we assume that such a composite DFA can be constructed; we will later develop methods to ensure that a DFA based approach is feasible even when the prefix portions are relatively complex.) All packets are by default processed by the fast path, and the expectation is that only a small fraction of the normal data streams will lead to a match in the fast path (of course, anomalous data streams will match). All data streams or connections which announce a match in the fast path will henceforth be processed by the slow path. In the slow path, the suffix portions will be matched. Consequently, there is no need to construct a composite DFA for the suffixes of all regular expressions, as only those suffixes need to be matched whose prefixes have been matched. Thus, the state space blowup problem can be avoided in the slow path. In order to make such a bifurcated architecture practical, we have to tackle several challenges. The first challenge lies in determining the appropriate boundary between the prefix and suffix portions of a regular expression. A general objective is to keep the prefix portions as small as possible, so that the fast path DFA remains compact and fast. At the same time, if one picks prefixes that are too small, then the data streams of normal connections may match them quite often, thus triggering the slow path very frequently. The second challenge lies in properly handling the control and handoff of a connection to the slow path, i.e. i) after the slow path is triggered, when will it stop processing the connection, and ii) how to avoid triggering the slow path multiple times for the same connection. In this proposal, we develop a systematic approach to split a given set of regular expressions into prefix and suffix sets. Afterwards, we propose a fast path and slow path based pattern matching architecture which is capable of maintaining a high throughput because i) the fast path is compact and can parse the input data stream at high rates, and ii) the slow path operates at a relatively low parsing rate but is guaranteed to handle only a small fraction of the overall data. We begin with the discussion of our splitting technique.

4.3.1 Splitting the regular expressions
The dual objective of the splitting procedure is that the prefixes remain as small as possible while, at the same time, the likelihood that normal data streams match these prefixes remains low. The probability of matching a prefix depends upon its length and the input data stream. In this context it may not be acceptable to assume a uniformly random distribution of the input symbols (i.e. every symbol appearing with probability 1/256); clearly, in any data stream, some symbols appear much more often than others. Thus, one needs to consider a trace driven probability distribution of matching prefixes of different lengths, under normal data streams as well as under anomalous data streams.
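As a rough illustration of such a trace driven estimate, the following sketch (Python; the NFA interface nfa_start and nfa_step is hypothetical) simulates an NFA over a recorded trace and records how often each state is active; the resulting frequencies serve as estimates of the per state activation probabilities used below.

    from collections import defaultdict

    def estimate_state_probabilities(nfa_start, nfa_step, trace):
        # nfa_start: set of initial states; nfa_step(states, symbol) -> set of next states.
        active = set(nfa_start)
        counts = defaultdict(int)
        for symbol in trace:
            # the start states stay live, modelling the implicit ".*" of unanchored rules
            active = nfa_step(active, symbol) | set(nfa_start)
            for s in active:
                counts[s] += 1
        total = max(len(trace), 1)
        return {s: c / total for s, c in counts.items()}   # estimates of ps(s)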
Additionally, one also needs to pay attention to the probabilities of making transitions from one state to another; these probabilities are likely to be diverse, because there is a strong correlation between the occurrences of different symbols, i.e. when and where they occur with respect to each other. More systematically, given the NFA of each regular expression, we need to determine the probability with which each state of the NFA becomes active and the probability with which the NFA takes each of its transitions. An obvious trend in these NFAs is that, as we move away from the start state, the probability of the subsequent states being active drops very quickly. Thus small prefixes may appear to be sufficient for normal traffic. However, in order to capture more extreme cases, we intend to generate synthetic traffic in addition to using real traffic traces. We plan to generate these synthetic traffic traces by first constructing an NFA of all regular expressions, and then traversing the NFA beginning at the start state. The traversal will occur with a bias probability; high bias values will force the NFA to make moves which lead it to states that are farther away from the start state. The operation of the synthetic traffic generator is described below:

nfa M(Q, q0, n, A, Σ)
set state current;  map level: state → int;
current = q0;  assign-level(q0, 1);
do (true)
    char c = generate-traffic(bias);      // next synthetically generated character
    current = current ∪ n(current, c);
od

char function generate-traffic(float bias);
    for char c ∈ Σ
        set state next = ∅;
        for state s ∈ current  next = next ∪ n(s, c);  rof
        if (sum(next) > max)  max = sum(next);  maxc = c;  fi
        if (random() > bias)  break;  fi
    rof
    return maxc;
end;

int function sum(set state states);
    int total = 0;
    for state s ∈ states  total += level[s];  rof
    return total;
end;

procedure assign-level(state s, int l, modifies map mark: state → bit);
    mark[s] = true;  level[s] = l;
    for char c ∈ Σ
        for state t ∈ n(s, c)
            if not mark[t]  assign-level(t, l+1);  fi
        rof
    rof
end;

With the real and synthetic traffic traces in hand, we compute the probability with which the various NFA states become active and the probability with which its transitions are taken. Once these probabilities are computed, we need to determine a cut in the NFA graph, such that i) there are as few nodes as possible on the left hand side of the cut, and ii) the probability that the states on the right hand side of the cut are active is sufficiently small. This will ensure that the fast path remains compact and the slow path is triggered only occasionally. While determining the cut, we also need to ensure that the probability of those transitions which leave some NFA node on the right hand side and enter some other node on the same side of the cut remains small. This will ensure that, once the slow path is triggered, it will stop after processing a few input symbols. Clearly, the cuts computed from the real traffic traces and from the synthetic traffic traces are likely to be different, therefore the corresponding prefixes will also be different; we adopt the general policy of taking the longer prefix. Below, we formally describe the procedure to determine a cut in the NFA graph. Let ps : Q → [0, 1] denote the probability with which the NFA states are active. Let the cut divide the NFA states into a fast and a slow region.
Since we want to minimize the number of states in the fast region, we would like to keep in the fast region only those states whose probability of being active is high. Initially, we keep all states in the slow region; thus the slow path probability is the sum of ps over all states. Afterwards, we begin moving states from the slow region to the fast region. The movements are performed in a breadth first order beginning at the start state of the NFA, and those states are moved first whose probabilities of being active are higher. After a state s is moved to the fast region, ps[s] is subtracted from the slow path probability. We continue this movement of states until the slow path probability becomes sufficiently small. This method gives us a first order estimate of the cut between the fast and the slow path. Such a cut will ensure that the slow path processes only a small fraction of the total bytes in the input stream. The procedure is formally described below:

procedure find-cut(nfa M(Q, q0, n, A, Σ), map ps : state → [0,1]);
    heap h;
    map mark: state → bit;
    set state fast;
    float p = Σ over s ∈ Q of ps(s);
    h.insert(q0, ps(q0));
    do (h ≠ ∅ and p > target slow path probability)
        state s := h.findmax();  h.remove(s);
        mark[s] = 1;  fast = fast ∪ {s};  p = p − ps(s);
        for char c ∈ Σ
            for state t ∈ n(s, c)
                if not mark[t]  h.insert(t, ps(t));  fi
            rof
        rof
    od
end;

For a large majority of the regular expressions used in current systems, the above method will cleanly split them into prefix and suffix portions. However, for certain types of regular expressions, the above method will not result in a clean split. For instance, consider the expression ab(cd|ef)gh. The resulting NFA may be cut at the states which correspond to the matches abc and abe. For such cuts, there is no way the regular expression can be cleanly split into prefix and suffix parts. One way to split this expression along this cut is to treat it as two separate expressions, abcdgh and abefgh, and split them individually. We rather propose to split such expressions by extending the cut of the NFA until a clean split of the expression is possible. Thus, in the above example, we will extend the cut to the states which correspond to the matches abcd and abef; the prefix portion will then become ab(cd|ef) and the suffix will be gh.
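For concreteness, the find-cut procedure above can be transcribed into runnable form roughly as follows; this is only a sketch, assuming the NFA transition function is available as a nested dictionary delta[state][symbol] -> set of states and that ps holds the estimated activation probabilities (all identifiers are illustrative):

    import heapq

    def find_cut(delta, start, ps, slow_target):
        # delta[state][symbol] -> set of next states; ps[state] -> activation probability.
        fast, seen = set(), {start}
        p = sum(ps.values())                   # initially every state sits in the slow region
        heap = [(-ps.get(start, 0.0), start)]  # max-heap emulated with negated priorities
        while heap and p > slow_target:
            _, s = heapq.heappop(heap)         # state with the highest activation probability
            fast.add(s)
            p -= ps.get(s, 0.0)
            for targets in delta.get(s, {}).values():
                for t in targets:
                    if t not in seen:          # enqueue each reachable state once
                        seen.add(t)
                        heapq.heappush(heap, (-ps.get(t, 0.0), t))
        return fast                            # states kept on the fast (prefix) side of the cut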
4.3.2 The bifurcated packet processing
Having described the mechanism to split regular expressions into prefix and suffix portions, we are now ready to proceed with the description of our bifurcated pattern matching architecture. The architecture (shown in Figure xxx) consists of two components: the fast path and the slow path. The fast path parses every byte of each input data stream and matches it against the prefix portions of all regular expressions. The slow path parses only those data streams which have found a match in the fast path, and matches them only against those suffix portions whose corresponding prefix portions have been matched. As mentioned earlier, since the fast path parses each input byte, it consists of a single composite DFA of all prefix patterns. Having a composite DFA for the fast path has the important advantage that the state associated with every connection, which needs to be stored and loaded upon a connection switch, is very small. Thus, if there are C connections in total, we only need C times statef bits of memory, where statef is the number of bits needed to represent a DFA state. The slow path, on the other hand, handles only a small fraction of all input bytes, and therefore only needs to store state for that fraction of the C connections. However, since the slow path consists of a separate DFA (or NFA) for each individual suffix, and since the slow path can be triggered multiple times for the same connection, the state associated with a connection in the slow path can be relatively large. The expectation is that it will exceed statef by at most the inverse of the slow path fraction, thus the slow path and fast path connection state memories will remain comparable in size.

(Figure 2: Fast path and slow path processing in a bifurcated packet processing architecture, showing the fast path automaton with its per connection state memory and the slow path automata with their state memory.)

In order to better understand how the slow path and fast path pattern matchers are constructed and how the slow path is triggered, let us consider a simple example. Let there be three regular expression patterns:
r1 = [gh][^ij]*[ij]def
r2 = fag[^i]*i[^j]*j
r3 = a[gh]i[^l]*[ae]c
The NFAs for these three patterns are shown below (a composite DFA for these three patterns would have 144 states). In the figure, the probabilities with which the various NFA states become active are highlighted. A cut between the fast and slow path is also shown; this cut divides the states such that the cumulative probability of the slow path states is less than 5%.

(Figure: NFAs of r1, r2 and r3, annotated with the per state activation probabilities; the cut separating the fast path states from the slow path states is marked.)

With this cut, the prefix portions of the regular expressions will be p1 = [gh][^ij]*[ij]d, p2 = f, and p3 = a[gh]. The corresponding suffix portions of the regular expressions will be s1 = ef, s2 = ag[^i]*i[^j]*j, and s3 = i[^l]*[ae]c. The fast path pattern matcher will consist of a composite DFA of the prefix patterns p1, p2 and p3, which will have only 14 states. The slow path will consist of three individual DFAs (or NFAs) for the suffix patterns s1, s2 and s3, which will have 3, 15 and 6 states respectively. Once a data stream matches a prefix pattern, say pi, in the fast path, matching of the corresponding suffix pattern si will be triggered in the slow path. Even though the slow path consists of multiple automata, it is not likely to become the performance bottleneck, because the expectation is that only 5% of the total input bytes will be processed by the slow path. The fast path state memory will only store a single DFA state for each connection. On the other hand, the slow path state memory may need to store several states for every connection. For instance, if the input data stream of a connection is "fagide", then the slow path will be triggered thrice, and will have three active states which will have to be stored in the state memory upon a connection switch:

input symbol:             -    f      a        g          i          d          e
fast path state subset:   0    0,6    0,7,11   0,1,8,12   0,2,9,13   0,3,9,13   0,4,9,13
active slow path:              s2     s2       s2,s3      s2,s3      s1,s2,s3   s1,s2,s3

While this example suggests that the number of active states for a single connection in the slow path can be many more than one, the general expectation is that this number will not be high. Even though a handful of connections will have many active states, the average number of active states for connections in the slow path will only be slightly higher than one. Such a cure for Insomnia appears attractive, because it ensures a high average parsing rate, and also guarantees that anomalous connections will be diverted to the slow path; thus they cannot affect the performance received by well behaving connections.
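The triggering behaviour in this example can be illustrated with a much simplified, non streaming sketch that uses Python's re module in place of the composite fast path DFA and the per suffix slow path automata (the real design operates incrementally on a byte stream and across packet boundaries; the code below only shows when the slow path is consulted):

    import re

    PREFIXES = {          # fast path patterns (compiled into one composite DFA in the real design)
        "r1": re.compile(r"[gh][^ij]*[ij]d"),
        "r2": re.compile(r"f"),
        "r3": re.compile(r"a[gh]"),
    }
    SUFFIXES = {          # slow path patterns, one automaton per suffix
        "r1": re.compile(r"ef"),
        "r2": re.compile(r"ag[^i]*i[^j]*j"),
        "r3": re.compile(r"i[^l]*[ae]c"),
    }

    def match_stream(data):
        # Return the set of rules whose prefix and suffix both match.
        matches = set()
        for rule, prefix in PREFIXES.items():
            m = prefix.search(data)                 # fast path: cheap, runs on every byte
            if m is None:
                continue                            # the common case for normal traffic
            # slow path: consulted only for the rare prefix matches, and only on the
            # bytes that follow the matched prefix
            if SUFFIXES[rule].match(data, m.end()):
                matches.add(rule)
        return matches

    print(match_stream("xxfagidefyy"))   # prefixes of all three rules match; only r1 completes

On the stream "xxfagidefyy" the prefixes of all three rules are found, the slow path is consulted for each, and only r1 is reported as a complete match.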
At the same time, splitting regular expressions into prefix and suffix portions avoids the state space blowup to a large extent. However, since the prefix portions are compiled into a composite DFA, if a large number of prefixes contain Kleene closures, then there may still be a state explosion; as a matter of fact, a few tens of Kleene closures are sufficient to make the construction of a composite DFA impractical. These state explosions occur due to Amnesia; therefore we now present an effective cure for Amnesia.

4.4 Curing DFAs from Amnesia
Before proceeding with the formal description of H-FA, which is our cure for Amnesia, let us re-examine why Amnesia leads to the state blowup of a DFA. The primary reason behind the state space blowup is the existence of Kleene closures over unions of multiple characters (e.g. .* or [a-z]*). With such patterns, when the parsing gets stuck at the closure (.*), the subsequent characters consumed from the input stream without proceeding further from the closure can form new partially matched prefixes, and the DFA has to remember all such partially matched prefixes. Since a DFA cannot remember anything except its current state (Amnesia), each of these partially matched prefixes requires a new state. The problem is exacerbated when there are multiple regular expressions, each containing such Kleene closures; a collection of such regular expressions often leads to an exponential blowup in the number of states. An intuitive solution to the problem is to construct a machine which can remember some additional information other than just its current state. For instance, one can simulate an NFA in such a way that it remembers all possible sets of current states for any given input stream, and thus alleviates the problem of state space blowup. However, NFAs are slow in practice, and in the worst case an NFA simulation requires O(n^2) memory accesses to consume a single character. Therefore, it is important that the new machine requires O(1) memory accesses per input character. Intuitively it appears that, if a finite automaton is equipped with a small amount of cache which it uses to remember a few key pieces of information, then the problem of state space blowup can be alleviated. However, it is important to keep this cache compact so that it can be stored in on-chip memory or cache, and therefore can be accessed or loaded quickly. In essence, the objective is to partition a finite automaton into two portions: a graph representing the transitions, which is stored as tables in memory, and a fast but compact cache, which is used to avoid redundant accesses to the graph and to avoid the exponential blowup in the number of states. We introduce a novel machine called the History based Finite Automaton (H-FA), which cures traditional DFAs of Amnesia. H-FA augments a DFA with the capability to remember a few critical pieces of information. With this capability, H-FA leads to orders of magnitude reduction in the number of states. The next obvious questions are: i) how do we partition a finite automaton? ii) is such a partitioning always possible? and iii) can the cache size always be bounded? We now present the History based Finite Automaton (H-FA), which effectively addresses each of these concerns.

4.4.1 Introducing H-FA
If we focus on the construction of a finite automaton from a non-deterministic finite automaton, then we find that each state of the FA represents a finite subset of the states of the non-deterministic finite automaton.
Clearly there can be up to 2^n possible subsets of n states, thus the number of states in an FA can grow exponentially. In practice, however, FAs do not have this many states, else they would become impractical for regular expressions containing more than a few tens of symbols. As suggested previously, the increase in the number of states occurs due to the presence of Kleene closures (*). For instance, if there are n simple expressions, each of length k and each containing one Kleene closure, then the number of states in the resulting finite automaton can be as many as k^n. Consider the two patterns listed below (one of these patterns contains a Kleene closure; note that the Kleene closure is over [^a] because this keeps the resulting FA of the first expression, r1, simple, and thus the effect of the Kleene closure on multiple expressions will be prominent):
r1 = ab[^a]*c;  r2 = def;
These patterns create an NFA with 7 states, as shown below:

(Figure: the 7 state NFA for ab[^a]*c and def, with states 0 through 6; state 2 carries the [^a]* self loop.)

Shown below is the corresponding FA constructed by the subset construction over the above NFA.

(Figure: the subset construction FA, with states 0; 0,1; 0,3; 0,4; 0,5; 0,6; 0,2; 0,2,4; 0,2,5; and 0,2,6.)

Clearly the blowup in the number of states is due to the presence of the Kleene closure [^a]* in the expression r1. When the NFA is in state 2, the subsequent stream of input symbols can partially or completely match r2, therefore the FA requires additional states in order to keep track of the partially matched expressions r1 and r2. The subsets of NFA states which form the FA states also illuminate this phenomenon: 5 FA states contain the NFA state 2 in their subsets. In general, it turns out that the NFA states which represent the Kleene closures arise in multiple subsets and lead to the exponential blowup in the number of FA states. For instance, if there are only k Kleene closures and there are in total n symbols in the expressions, then there can be up to n^k subsets of NFA states containing one or multiple occurrences of the states representing the Kleene closures. Thus, even if k is small, say 10, and n is 10,000, then there can be up to 10^40 FA states. Thus, even a handful of Kleene closures can render an FA impractical. A simple way to avoid the state blowup in such regular expressions is to enable the FA to remember whether it has reached the Kleene closure or not. For instance, in the previous example, if the FA can remember whether it has reached the NFA state 2 or not, then the number of states can be reduced. H-FA efficiently and optimally achieves this objective. We now present the formal description of H-FA and algorithms to construct it from a traditional FA or directly from an NFA.

4.4.2 Formal Description of H-FA
History based Finite Automata (H-FA) differ from traditional finite automata in that, in an H-FA, the moves made by the machine depend both upon the transitions and upon the contents of a set called the history; moreover, as the machine makes moves, the contents of the set are updated. Thus, along with every transition of the H-FA, there is an accompanying condition which becomes either true or false depending upon the contents of the history. In addition, some transitions have associated actions, which are inserts into the set, removes from the set, or both. An H-FA can thus be formally represented as a 6-tuple, which we will describe later. Continuing with the previous example, let us describe how to build the corresponding H-FA.
In one variant of the construction, we begin from the FA and identify those NFA states which appear in most of the FA states. Thus, in the previous FA, the NFA state 2 appears in 4 FA states, which are highlighted below; we refer to the states in the highlighted region as fading states.

(Figure: the subset construction FA with the fading states 0,2; 0,2,4; 0,2,5; and 0,2,6 highlighted.)

If we remove the NFA state 2 from the fading states, then they can be eliminated (they will overlap with some FA state in the non-fading region). However, in order to remove NFA state 2 from any FA state subset, we will need to make a note that state 2 has been reached. Thus, all transitions from a non-fading state to a fading state (containing 2) will have an associated action to insert 2 into the history. Additionally, all transitions from fading states which lead to a non-fading state will have an associated action to remove 2 from the history. Furthermore, all transitions that remain within the fading region will have an associated condition that they are taken only if 2 is present in the history. Let us now remove NFA state 2 from the fading FA state (0,2). After the removal, this state will overlap with the FA state (0), therefore we need to add conditional transitions at the FA state (0). The resulting H-FA with state (0,2) removed is shown below:

(Figure: the intermediate H-FA after removing state (0,2), with conditional transitions such as a,|2,-2; c,|2,-2; and d,|2 added at state (0).)

Here a transition labeled "|s" implies that the transition is taken only when the history contains s; "+s" implies that, when this transition is taken, s is inserted into the history; and "-s" implies that, when this transition is taken, s is removed from the history. Once we remove all states in the fading region from the FA, we will have the following H-FA:

(Figure: the final H-FA, consisting of the non-fading states 0; 0,1; 0,3; 0,4; 0,5; and 0,6, with the conditional transitions and history actions.)

Note that this H-FA has additional conditional transitions at 4 states, and the history will have at most one entry in it. In general, if we remove k Kleene closures, the history will have up to k entries; on the other hand, in the worst case there can be up to 2^k additional conditional transitions. We argue that such a worst case rarely appears, and moreover, whenever it appears, a slightly sub-optimal H-FA construction can alleviate the problem. We now present a brief analysis of the increase in the number of conditional transitions.

4.4.3 Analysis of the Transition Blowup
Let us consider a set of n regular expressions, and let there be a total of k Kleene closures. Let the i-th expression containing a Kleene closure be denoted by r1^i [c1^i]* c2^i r2^i, where r1^i and r2^i are the prefix and suffix parts of the expression, the Kleene closure is over the set of characters denoted by c1^i, and c2^i denotes the set of characters which immediately follows the Kleene closure. For such expressions, if c1^i contains a large number of characters, then there is likely to be a state space blowup in the FA. The blowup in the number of conditional transitions in the resulting H-FA depends directly upon c2^i: if none of the c2^i's overlap with each other, then there will only be up to k conditional transitions in the H-FA, whereas when there are overlaps between the c2^i's, there can be an exponential blowup in the number of conditional transitions.
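To keep the condition/action notation concrete before continuing the analysis, the following minimal sketch shows how a machine of this kind consumes input (Python; the two state transition table is made up purely for illustration and is not the H-FA constructed above):

    # Each transition is (required_history, next_state, insert_set, remove_set); a
    # transition is eligible only when required_history is a subset of the history.

    def hfa_step(transitions, state, history, symbol):
        for required, nxt, insert, remove in transitions[(state, symbol)]:
            if required <= history:          # condition "|s": s must be in the history
                history -= remove            # action "-s"
                history |= insert            # action "+s"
                return nxt, history
        raise KeyError("no eligible transition")

    # Illustrative two state machine: seeing 'b' records the fact "2" in the history,
    # and 'c' is accepted only while "2" is remembered.
    T = {
        ("q0", "b"): [(set(), "q0", {"2"}, set())],          # b, +2
        ("q0", "c"): [({"2"}, "accept", set(), {"2"}),       # c, |2, -2
                      (set(), "q0", set(), set())],          # otherwise stay in q0
    }

    state, history = "q0", set()
    for ch in "bc":
        state, history = hfa_step(T, state, history, ch)
    print(state)    # -> accept

In a real implementation the history would be a small structure kept on chip, as discussed next.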
To see how such overlaps cause a blowup, suppose each c2^i is the same character, say a; in this case, there can be up to 2^k conditional transitions over the character a, and the conditions will be the presence of each possible combination of the k Kleene closure NFA states in the history. Second, the number of actions (inserts into or removes from the history) associated with the conditional transitions will depend upon the characteristics of c1^i. For instance, if c1^i contains all symbols in the alphabet, then there will not be any remove action. On the other hand, if c1^i contains only a handful of symbols, then a large number of transitions will have an associated remove action (note that in this case, this Kleene closure will not be the right candidate for the H-FA treatment). In general, the number of conditional transitions with an associated insert operation will not be large. If we look at the regular expression sets used in practice, it appears that there will be minimal blowup in the number of conditional transitions. This is probably because the Kleene closures are usually over [.] or a set containing a large number of symbols, while the sub-expression immediately following the closure usually contains only a handful of characters.

4.4.4 Implementing history
If we are removing k Kleene closures, then there can be up to k symbols in the history. Thus, each symbol in the history array will require log2(k) bits, leading to a total of k*log2(k) bits. Clearly, even if k is 64, the size of the history will only be 48 bytes. On the other hand, notice that with 64 entries in the history, the number of states in a traditional FA suffering from Amnesia (hence state blowup) can be reduced by several orders of magnitude. Since the history is likely to remain very compact, it will not have a significant impact on performance when packets arrive for different connections: if an average packet contains 200 bytes of data, then parsing the packet requires 200 memory accesses, so fetching 48 bytes of history will not add significant overhead.

4.4.5 Constructing H-FA from an NFA
H-FAs can also be synthesized directly from an NFA; the construction pseudo-code will be developed as part of the proposed research.

4.4.6 Comparing H-FA to NFA
In one way, an H-FA appears similar to an NFA, in that the total complexity of the machine is actually O(k), where k is the maximum number of entries in the history. However, in NFAs there is no straightforward way to partition the problem into two components such that the processing of one component requires O(1) time but a moderately large amount of space (hence it is stored in memory), while the other component has a processing complexity of O(k) but can be represented much more compactly (hence it is stored in cache or on-chip). H-FA achieves this objective and efficiently partitions the problem into two such components. I am convinced that, for any given set of regular expressions, there is no machine which can attain an O(1) worst-case processing time and still require less space than a traditional state minimized FA.

5. ACKNOWLEDGMENTS
I am grateful to Will Eatherton and John Williams for providing the regular expression rule set used in Cisco security appliances. I am also grateful to Prof. Michael Mitzenmacher, whose collaboration has helped in developing the HEXA architecture. I am also thankful to Prof. George Varghese, who helped me in developing the H-FA and HP-FA machines. I am also thankful to Prof. Patrick Crowley, whose continued support, collaboration and motivation have helped me with every step of my research. Finally, I am thankful to Dr. Jonathan S. Turner, who has always been a xxx. This work was supported in part by NSF Grant CNS-0325298 and a URP grant from Cisco Systems.