Directed Acyclic Word Graphs (DAWGs) in IDS Search
Slobodan Petrović
03.09.2013.
Contents
• Introduction
• DAWG definition
• The on-line construction algorithm for DAWG
2/61
Introduction
• Intrusion detection systems (IDS)
– Automatically detect attacks against
hosts/networks
– Classified as
• Host-based and Network-based
• Misuse detection systems and Anomaly detection
systems
– Misuse detection
• Search for indicators (signatures) of attacks in traffic
3/61
Introduction
• Search in misuse detection systems
– Must be efficient
• The IDS has to operate in real time – gigabit networks
– The search problem matters in many fields of
computer science, so it has been studied
extensively over the last 50 years
• Several hundred (good) search algorithms exist
• Despite that, a single algorithm (Aho-Corasick) is the
default in most IDS (at least the open-source ones)
4/61
Introduction
• The Aho-Corasick algorithm (1)
– Multi-pattern
– 2 phases
• The first phase builds a finite automaton that
corresponds to the given set of search patterns
• In the second phase, the symbols of the search string are
fed to the automaton, one at a time
– The input symbol determines the next state of the machine
– The “failure” function gives the state to fall back to when
the next input symbol has no legal transition from the current state
5/61
Introduction
• The Aho-Corasick algorithm (2)
– 3 elements of the finite state machine
• The goto function
• The failure function
• The output function
6/61
Introduction
• The Aho-Corasick algorithm (3)
– Example – Search patterns: snow, snort, or
• The goto function
7/61
Introduction
• The Aho-Corasick algorithm (4)
– Example – Search patterns: snow, snort, or
• The failure function
i      1   2   3   4   5   6   7   8
f(i)   0   0   7   0   8   0   0   0
• The output function
i      1    2    3    4        5      6         7    8
o(i)   {}   {}   {}   {snow}   {or}   {snort}   {}   {or}
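The two phases and the three functions can be sketched in Python for the patterns snow, snort and or. This is a minimal illustration, not code from any IDS; the names build_automaton and search are assumptions. States are numbered in creation order, which for this pattern order gives the same numbering as the tables above.

```python
from collections import deque

def build_automaton(patterns):
    """Phase 1: build the goto, failure and output functions."""
    goto = [{}]      # goto[state][symbol] -> next state; state 0 is the start
    out = [set()]    # out[state] -> patterns reported on entering the state
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fall back to state 0
    while queue:                      # breadth-first over the trie
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]    # inherit patterns that end here as suffixes
    return goto, fail, out

def search(text, automaton):
    """Phase 2: feed the text one symbol at a time, collect (start, pattern)."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits

automaton = build_automaton(["snow", "snort", "or"])
print(automaton[1][1:])                    # failure values for states 1..8
print(sorted(search("snort", automaton)))  # 0-based start positions
```

Each input symbol is consumed once, and the failure links only ever move towards the start state, which is why the running time does not depend much on the input content.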
8/61
Introduction
• The Aho-Corasick algorithm (5)
– Advantages over other search algorithms
• Multi-pattern search by design
• So-called algorithmic attacks are not possible
– The average-case and the worst-case complexities are
approximately the same
– Disadvantage
• Slow in the average case, compared to other algorithms
9/61
Introduction
• Can we do it better?
– There are IDS that implement other search
algorithms
• Example
– Suricata implements BNDMq – Backward Non-Deterministic
DAWG Matching tuned with q-grams
• Still, Snort and Suricata use the Aho-Corasick algorithm
as the default search algorithm – why?
10/61
DAWG definition
• DAWG – Directed Acyclic Word Graph
– A special directed graph assigned to any given
string
– Any graph is defined by defining its set of vertices
and edges
• In the case of a DAWG, to define its set of vertices we
need a special function EndPos, whose argument is any
substring of the given string w and the output is the set
of all ending positions of the argument in w
11/61
DAWG definition
• Example
– Let the given string be w = aabbab and take one of
its substrings, s = ab
– Then
• EndPos(s)={3,6}
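A direct Python rendering of EndPos, with positions counted from 1 as in the example. The function name end_pos is an illustrative choice.

```python
def end_pos(w, s):
    """Return the set of 1-based positions in w where an occurrence of s ends."""
    return {i + len(s) for i in range(len(w) - len(s) + 1) if w.startswith(s, i)}

print(sorted(end_pos("aabbab", "ab")))  # [3, 6]
```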
12/61
DAWG definition
• All the substrings of the given string w with
the same value of the function EndPos are
considered equivalent
• Such a definition of equivalence determines
equivalence classes
– The sets of all substrings of w ending at the same
positions in w
13/61
DAWG definition
• Every distinct value of the function EndPos
defines a distinct equivalence class
• Every distinct equivalence class represents a
vertex in the DAWG
• In addition to these vertices, any DAWG
contains an additional special vertex - the root
– This vertex does not have an equivalence class
assigned
14/61
DAWG definition
• Example – w = aabbab (1)
– The set of all distinct substrings of w is
Substr(w) = {aabbab, abbab, bbab, bab, ab, b,
aabba, abba, bba, ba, a,
aabb, abb, bb,
aab,
aa}
15/61
DAWG definition
• Example – w = aabbab (2)
– Then
EndPos(aabbab)={6}, EndPos(aabba)={5},
EndPos(aabb)={4}, EndPos(aab)={3}, EndPos(aa)={2},
EndPos(a)={1,2,5}, EndPos(abbab)= {6},
EndPos(abba)= {5}, EndPos(abb)={4}, EndPos(ab)= {3,6},
EndPos(bbab)={6}, EndPos(bba)={5}, EndPos(bb)= {4},
EndPos(b)={3,4,6}, EndPos(bab)={6}, EndPos(ba)= {5}
16/61
DAWG definition
• Example – w = aabbab (3)
– The equivalence classes are
{1,2,5}: a
{2}: aa
{3}: aab
{3,4,6}: b
{3,6}: ab
{4}: aabb, abb, bb
{5}: aabba, abba, bba, ba
{6}: aabbab, abbab, bbab, bab
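The classes listed above can be reproduced by grouping all substrings by their EndPos value. A small sketch, assuming the helper names end_pos and equivalence_classes:

```python
from collections import defaultdict

def end_pos(w, s):
    """1-based ending positions of s in w, as a hashable set."""
    return frozenset(i + len(s) for i in range(len(w) - len(s) + 1)
                     if w.startswith(s, i))

def equivalence_classes(w):
    """Map each distinct EndPos value to the set of substrings that share it."""
    substrings = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    classes = defaultdict(set)
    for s in substrings:
        classes[end_pos(w, s)].add(s)
    return dict(classes)

classes = equivalence_classes("aabbab")
for positions in sorted(classes, key=sorted):
    print(sorted(positions), sorted(classes[positions], key=len, reverse=True))
```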
17/61
DAWG definition
• Example – w = aabbab (4)
– Each equivalence class defines a vertex in the
DAWG assigned to the given string w
– Thus, the DAWG has 8+1 vertices
• The additional vertex is the root
18/61
DAWG definition
• The vertices in a DAWG are usually not labeled
with the corresponding equivalence classes,
since such labels would be impractically
long
• Instead, the vertices are numbered in the usual
way: vertex 1, vertex 2, and so on
• This does not influence the DAWG
construction algorithm
19/61
DAWG definition
• The edges of the DAWG are usually not given
explicitly
• Instead, paths from the root are given
– Contiguous sequences of edges starting from the
root and ending at vertices corresponding to the
equivalence classes
• Each element of an equivalence class defines
a distinct path
20/61
DAWG definition
• Example – w = aabbab
– If we observe the vertex {6} and the elements of
the corresponding equivalence class (aabbab,
abbab, bbab, and bab), each of these elements
defines a path from the root to the observed
vertex
– Each edge on any path is labeled with a symbol
from the given string w
21/61
DAWG definition
• By giving the set of vertices and paths in the
DAWG, we have completely defined it
• DAWGs are usually presented in a graphical
way, as any other graph
22/61
DAWG definition
• Example – w = aabbab – complete labeling
23/61
DAWG definition
• Example – w = aabbab – conventional labeling
24/61
DAWG definition
• In complete labeling, the length of a vertex label
depends on the number of paths leading from the
root to that vertex, and that number grows
non-linearly with the length of the given string w
• For this reason, DAWGs are normally presented
with conventional labeling instead of complete
labeling
25/61
DAWG definition
• The lengths of the labels in conventional
labeling are fixed and it can be shown that
using conventional labeling does not affect
construction of DAWGs
• The order of the labels is not relevant
• The only requirement for the labels is that
each label is unique
26/61
DAWG definition
• Directed in DAWG means that the graph only
contains one-way edges
• Acyclic means that no path following the edge
directions returns to its starting vertex (no cycles)
27/61
DAWG definition
• A DAWG representation of a string is very
convenient for application in search
algorithms
• A DAWG is the most efficient representation
of a string in the sense that a DAWG is the
minimal DFA (Deterministic Finite Automaton)
that recognizes all the suffixes of the given
string
28/61
DAWG definition
• If DAWG(w)=(V,E), where V is the set of
vertices and E is the set of edges, then for |w| ≥ 2
|V| ≤ 2|w|-1
|E| ≤ 3|w|-3
• This means that the size of the DAWG
corresponding to a given string w grows
linearly with the length (i.e. the size) of w
29/61
DAWG definition
• It is obvious that a DAWG of a given string w
can be constructed by following the definition
– For such a construction we need enumeration of
all the substrings of w
– The number of these substrings grows
quadratically with the length of w
• Because of that, we need more efficient
methods for DAWG construction
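For comparison with the efficient methods, construction by definition can be sketched as follows. The function name naive_dawg and the use of the string "root" for the extra vertex are assumptions made for this illustration. Every substring x followed by a symbol c contributes an edge from the class of x to the class of xc.

```python
def naive_dawg(w):
    """Build DAWG(w) directly from the definition; enumerates all substrings."""
    def end_pos(s):
        return frozenset(i + len(s) for i in range(len(w) - len(s) + 1)
                         if w.startswith(s, i))

    substrings = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    vertex_of = {s: end_pos(s) for s in substrings}  # substring -> its class
    vertex_of[""] = "root"                           # the empty path ends at the root
    vertices = set(vertex_of.values())
    # Duplicate edges from different members of one class collapse in the set.
    edges = {(vertex_of[s[:-1]], s[-1], vertex_of[s]) for s in substrings}
    return vertices, edges

vertices, edges = naive_dawg("aabbab")
print(len(vertices), len(edges))
```

For w = aabbab this gives 9 vertices and 11 edges, within the bounds 2|w|-1 = 11 and 3|w|-3 = 15.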
30/61
DAWG definition
• Two groups of DAWG construction algorithms
– Off-line construction
• Efficient, but the whole string w must be known in
advance
– On-line construction
• Updates the current DAWG after receiving each symbol
of the string w
• More often used in practice
31/61
On-line DAWG construction
• Suffix link (1)
– Consider the equivalence class corresponding to a
vertex in the DAWG assigned to the given string w
– Identify the shortest path from the root belonging
to that equivalence class
– Split it into two parts
• The first edge on that path
• The rest of the path
32/61
On-line DAWG construction
• Suffix link (2)
– Identify a vertex in the DAWG, whose equivalence
class contains the path equal to the second part of
the path identified before
– If we now link these two vertices of the DAWG,
such a link is called a suffix link
– Each vertex of the DAWG, except the root, has a
suffix link
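The definition of the suffix link can be applied literally to the equivalence classes. The sketch below is by-definition and quadratic, not how the on-line algorithm obtains the links; the name suffix_links and the "root" marker are illustrative assumptions.

```python
def suffix_links(w):
    """Map each class (an EndPos set) to the class its suffix link points to."""
    def end_pos(s):
        return frozenset(i + len(s) for i in range(len(w) - len(s) + 1)
                         if w.startswith(s, i))

    substrings = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    classes = {}
    for s in substrings:
        classes.setdefault(end_pos(s), []).append(s)

    links = {}
    for positions, members in classes.items():
        shortest = min(members, key=len)   # shortest path in the class
        rest = shortest[1:]                # drop the first edge
        links[positions] = end_pos(rest) if rest else "root"
    return links

links = suffix_links("aabbab")
for positions in sorted(links, key=sorted):
    target = links[positions]
    print(sorted(positions), "->", target if target == "root" else sorted(target))
```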
33/61
On-line DAWG construction
• Example
34/61
On-line DAWG construction
• Solid edge (1)
– Consider an edge (x,y) of a DAWG, labeled with
the symbol s
– Let the length of the longest path from the root
leading to the vertex x be lx and let the length of
the longest path from the root leading to the
vertex y be ly
– If ly=lx+1 then the edge (x,y) is called a solid edge
35/61
On-line DAWG construction
• Solid edge (2)
– Other edges (i.e. those for which ly>lx+1) are called
non-solid
• They are shortcuts in a DAWG
– No input symbol in any stage of the on-line DAWG
construction algorithm can change any solid edge
– But non-solid edges can be changed, i.e.
redirected or removed
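Solid and non-solid edges can be told apart by comparing the longest member of each class, since the longest path to a vertex has the length of the longest substring in its class. A by-definition sketch, with classify_edges as an assumed name, applied to the string aabba:

```python
def classify_edges(w):
    """Return {(x, symbol, y): True if the edge is solid, False otherwise}."""
    def end_pos(s):
        return frozenset(i + len(s) for i in range(len(w) - len(s) + 1)
                         if w.startswith(s, i))

    substrings = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    vertex_of = {s: end_pos(s) for s in substrings}
    vertex_of[""] = "root"

    longest = {}                      # vertex -> length of its longest path
    for s, v in vertex_of.items():
        longest[v] = max(longest.get(v, 0), len(s))

    edges = {(vertex_of[s[:-1]], s[-1], vertex_of[s]) for s in substrings}
    return {(x, c, y): longest[y] == longest[x] + 1 for (x, c, y) in edges}

for (x, c, y), solid in classify_edges("aabba").items():
    if not solid:
        print(sorted(x), c, sorted(y), "non-solid")
```

In this example the edge labeled b that leaves the vertex of a is non-solid: the longest path to its source has length 1, while the longest path to its target (aab) has length 3.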
36/61
On-line DAWG construction
• Example
37/61
On-line DAWG construction
• The on-line DAWG construction algorithm
updates the current DAWG after receiving the
next symbol of the string w
• Let wi be a prefix of the string w of length i
• Let DAWGi be the DAWG associated with wi
• It is also supposed that the suffix links of all
the vertices of DAWGi are known
38/61
On-line DAWG construction
• Suppose now that the next symbol of the
string w is s
• Then wi+1 = wi + s, where wi+1 is the prefix of length i+1
and the + sign before s denotes concatenation
• What changes in the DAWG does that imply?
39/61
On-line DAWG construction
• Adding a new symbol s at the end of the prefix
wi produces new substrings
– All are suffixes of wi+1
• All the new substrings end in the new symbol s
– None of the substrings of wi disappears
40/61
On-line DAWG construction
• Example (1)
– Let wi=aabba, (i=5) and s=b
– Then wi+1=wi+s=aabbab
– The set of all the distinct substrings of wi is
Substr(wi)={aabba, abba, bba, ba, a,
aabb, abb, bb, b,
aab, ab,
aa}
41/61
On-line DAWG construction
• Example (2)
– The set of all the distinct substrings of wi+1 is
Substr(wi+1)={aabbab, abbab, bbab, bab, ab, b,
aabba, abba, bba, ba, a,
aabb, abb, bb,
aab,
aa}
42/61
On-line DAWG construction
• Example (3)
– We can observe that
• All the substrings of wi are retained in the set of
substrings of wi+1
• The set of substrings of wi+1 contains new substrings
that are all suffixes of wi+1
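Both observations are easy to check mechanically. A short sketch, with substrings as an assumed helper name:

```python
def substrings(w):
    """Set of all distinct non-empty substrings of w."""
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

wi, s = "aabba", "b"
w_next = wi + s
new = substrings(w_next) - substrings(wi)

print(sorted(new, key=len, reverse=True))
print(substrings(wi) <= substrings(w_next))   # every old substring is retained
print(all(w_next.endswith(x) for x in new))   # every new substring is a suffix
```

Note that the suffixes ab and b of aabbab are not new, because they already occur in aabba.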
43/61
On-line DAWG construction
• Example (4)
– The equivalence classes of DAWGi are
{1,2,5}: a
{2}: aa
{3}: aab, ab
{3,4}: b
{4}: aabb, abb, bb
{5}: aabba, abba, bba, ba
44/61
On-line DAWG construction
• Example (5)
– The equivalence classes of DAWGi+1 are
{1,2,5}: a
{2}: aa
{3}: aab
{3,4,6}: b
{3,6}: ab
{4}: aabb, abb, bb
{5}: aabba, abba, bba, ba
{6}: aabbab, abbab, bbab, bab
45/61
On-line DAWG construction
• Example (6)
– We first observe that the equivalence class
corresponding to the sink vertex of the DAWG
contains paths from the root that also appear
in the first row of the listed sets of distinct
substrings
• The remaining substrings in the first row belong to the
vertices reached from the sink by following suffix
links, one after another
46/61
On-line DAWG construction
• Example (7)
– Then we notice that if we add a symbol s at the
end of wi, all the vertices of DAWGi on this chain
of suffix link vertices must get an outgoing edge
labeled with s
• (Compare the first row of substrings of wi and wi+1)
47/61
On-line DAWG construction
• Example (8)
– If such an edge does not exist in DAWGi, it must
be added in DAWGi+1
– If such an edge exists, then we have to compare
the equivalence classes of DAWGi and DAWGi+1
48/61
On-line DAWG construction
• Example (9)
– Specifically, we observe the edge (x,y) in DAWGi,
where x=suf(sink) and y=son(x,s)
• sink is the final vertex of DAWGi
• suf(sink) is its suffix link
• son(x,s) is the vertex in DAWGi pointed to by the edge
going out from x and labeled with the symbol s
49/61
On-line DAWG construction
• Example (10)
– Let us first observe the graph representations of
DAWGi and DAWGi+1
– The sink node of DAWGi is v7 and its suffix link
suf(sink) is the node v2
50/61
On-line DAWG construction
• Example (11)
51/61
On-line DAWG construction
• Example (12)
52/61
On-line DAWG construction
• Example (13)
– The edge (x,y)=(suf(sink),son(suf(sink),s)) exists in
DAWGi
• It is the edge (v2,v4) labeled with the symbol s=b
• This edge is non-solid, which means that this edge can
be changed (erased or redirected)
53/61
On-line DAWG construction
• Example (14)
– Consider now the equivalence classes
corresponding to the vertex v4 in DAWGi and
DAWGi+1
– We observe that in DAWGi the equivalence class
corresponding to v4 is {3}: aab,ab, whereas in
DAWGi+1 this equivalence class is {3}: aab
54/61
On-line DAWG construction
• Example (15)
– The substring ab in DAWGi+1 belongs to a new
equivalence class {3,6}: ab that corresponds to the
vertex v9
– Thus, in the algorithm for on-line construction of
DAWG, we split the vertex v4 into two vertices, v4
and v9
55/61
On-line DAWG construction
• Example (16)
– All the outgoing edges and the suffix link of the
new node v9 are the same as for v4 in DAWGi
– A new solid edge is introduced from the
vertex v2 to the new (“clone”) vertex v9
56/61
On-line DAWG construction
• Suppose a vertex is split at some step of the
on-line DAWG construction algorithm
• If some in-going paths to the split vertex
correspond to members of its equivalence class
that moved to its “clone”, the final edges of
those paths must be redirected to the “clone”
• This is a separate step in the vertex-splitting
part of the DAWG on-line construction
algorithm
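The steps described in this section (adding edges along the chain of suffix links, testing whether the edge (x,y) is solid, splitting y, and redirecting edges to the clone) can be put together as follows. This sketch follows one common formulation of the on-line construction, often presented as suffix automaton construction; it is an assumption that it matches the lecture's own pseudocode. The names build_dawg, length, link and nxt are illustrative, and vertices are numbered from 0 (the root) in creation order, so the numbers differ from the v-labels used in the example.

```python
def build_dawg(w):
    """On-line DAWG construction, one symbol of w at a time."""
    length = [0]    # length[v]: length of the longest path from the root to v
    link = [-1]     # link[v]: suffix link of v (-1 for the root)
    nxt = [{}]      # nxt[v][symbol]: target of the edge leaving v
    sink = 0
    for s in w:
        new = len(length)                      # the new sink
        length.append(length[sink] + 1)
        link.append(0)
        nxt.append({})
        x = sink
        while x != -1 and s not in nxt[x]:     # add missing edges along the chain
            nxt[x][s] = new
            x = link[x]
        if x != -1:                            # an edge (x, y) labeled s exists
            y = nxt[x][s]
            if length[y] != length[x] + 1:     # non-solid edge: split y
                clone = len(length)
                length.append(length[x] + 1)
                link.append(link[y])           # clone copies the suffix link
                nxt.append(dict(nxt[y]))       # and the outgoing edges of y
                while x != -1 and nxt[x].get(s) == y:
                    nxt[x][s] = clone          # redirect edges to the clone
                    x = link[x]
                link[y] = clone
                y = clone
            link[new] = y
        sink = new
    return length, link, nxt

def contains(dawg, s):
    """True if s labels a path from the root, i.e. s is a substring of w."""
    nxt = dawg[2]
    v = 0
    for ch in s:
        if ch not in nxt[v]:
            return False
        v = nxt[v][ch]
    return True

dawg = build_dawg("aabbab")
print(len(dawg[0]), sum(len(e) for e in dawg[2]))   # vertices, edges
print(contains(dawg, "bba"), contains(dawg, "bbb"))
```

Each symbol is processed as it arrives, and the suffix links of the current DAWG are all that is needed to update it.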
57/61
On-line DAWG construction
• Example (1)
– Consider the DAWG corresponding to the string
wi=aabbab and let s=b, i.e. wi+1=aabbabb
58/61
On-line DAWG construction
• Example (2)
– By using the definition of a DAWG, we get the set
Substr(wi+1), the equivalence classes, and the
DAWGi+1
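The effect of the new symbol on the equivalence classes can be computed by definition. The sketch below reports which classes of wi fall apart in wi+1; the helper names end_pos, classes and split_classes are assumptions for this illustration.

```python
def end_pos(w, s):
    """1-based ending positions of s in w, as a hashable set."""
    return frozenset(i + len(s) for i in range(len(w) - len(s) + 1)
                     if w.startswith(s, i))

def classes(w):
    """Map each EndPos value to the set of substrings of w that share it."""
    result = {}
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            result.setdefault(end_pos(w, w[i:j]), set()).add(w[i:j])
    return result

def split_classes(wi, s):
    """Classes of wi whose members no longer share one EndPos value in wi+s."""
    w_next = wi + s
    splits = []
    for members in classes(wi).values():
        parts = {}
        for m in members:
            parts.setdefault(end_pos(w_next, m), set()).add(m)
        if len(parts) > 1:
            splits.append(parts)
    return splits

for parts in split_classes("aabbab", "b"):
    for positions in sorted(parts, key=sorted):
        print(sorted(positions), sorted(parts[positions]))
print(len(classes("aabbabb")) + 1)   # number of vertices, root included
```

For wi = aabbab and s = b, the class {4} is the one that splits: aabb keeps EndPos {4}, while abb and bb move to the new class {4,7}.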
59/61
On-line DAWG construction
• Example (3)
60/61
On-line DAWG construction
61/61