Affix Trees

Affix Trees
Moritz G. Maaß, June 2000
http://www.informatik.tu-muenchen.de/∼maass
[email protected]
Agenda
1. Introduction
2. Construction of Suffix Trees
3. Construction Affix Trees
4. Complexity
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
1
Important Suffix Tree Properties
• Representation of repeated substrings
• Right branching substrings are
• Each tree position represents a
• Moving down in the tree extends
y
w
represented by branching nodes
unique string
x v
w
w
x
y
the string, moving towards the
v’
v
v’
root shortens the string.
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
2
Representation of Tree Positions with Reference Pairs
1
c
c
3
root
a
b
2
a
b
c
4
root + a
b
5
+ ε
3
+ ε
2
+ ab
5
6
c a
b
c
9
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
3
Suffix Links
root
a
b
c
• Two ways of “shortening” the
represented substring abab
at the front to come to the position of bab
• Suffix links operate at the
front of the represented tree,
while edges operate at the
end
b
c
1
b
2
c
3
a
a
b
c
4
a
5
c a
b
c
6
b
a
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
b
9
4
Reverse Tree
root
a
b
c
b
c
1
b
2
c
3
a
a
b
c
4
a
b
5
c a
b
c
6
b
a
Linear Bidirectional On-Line Construction of Affix Trees
−1
(.)
root
c
bb
2
b c
a
a
3
a b 5
6
bc c
4 a
c
1
a
b
a
b
c
9
9
Moritz G. Maaß, June 2000
5
Affix Trees
CST(ababc)
CAT(ababc)
root
c ba
1
3
c
c
2
1
b
c
a
5
b c a
6
root
c b ba
b
3
a
c
b
4
c
4
9
Linear Bidirectional On-Line Construction of Affix Trees
b
c
a
a a
5
b c ba
a
b
c
a
9
root
root
8
c
1
b
6
CST(cbaba)
c
a
2
CAT(cbaba)
b
b
c
b
a
2
3
5
a
7
c
b
4
c
b
b
a
6
b
a
b
b
b
a
b
a
a
c
a
a
7
b
8
c
9
Moritz G. Maaß, June 2000
a
a
a
a
b
a
b
a
b
a
b
a
7
b
8
c
9
6
Definition 1 (right branching and left branching). A substring
w of t is right branching (left-branching), if there w occurs at two
different positions in t with distinct succeding (preceding) letters (w
is r.b. in t, if ∃x, y ∈ Σ, u, v, u0 , v 0 ∈ Σ∗ .t = uwxv ∧ t =
u0 wyv 0 ∧ x 6= y).
Definition 2 (Σ+ -tree). A Σ+ -tree T is a rooted, directed tree with
w
x
v’
v
root
a b
edge labels from Σ+ . For each a
∈ Σ, every node in T has at most
one outgoing edge whose label starts with a.
y
b
a
a
Definition 3 (path(n)). If n is a node in Σ+ -tree T , then path(n) is the string built
by concatenating all edge labels from the root to n. It is a unique identifier for the
tree position.
Definition 4 (words(T )). A string u is in words(T ), if there is a node n in T s.t.
∃v ∈ Σ∗ .uv = path(n).
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
7
Suffix Trees and Suffix Links
root
a
b
c
Definition
5
(Suffix
tree).
A
t is a
Σ+ -tree with words(T )
=
{u|u is a substring of t}.
suffix
tree
of
string
b
c
1
b
2
c
3
a
a
b
c
4
a
b
5
c a
b
c
6
b
a
9
Definition 6 (Suffix Link). A suffix link is an auxiliary edge from node n to node m
where m is the node s.t.
path(m) is the longest proper suffix of path(n)
represented by a node in T .
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
8
Reverse Tree and Affix Trees
Definition 7 (Reverse tree
T
−1
). The reverse tree
1
T −1 of a Σ+ -tree T augmented with suffix links is
defined as the tree that is formed by the suffix links of
T , where the direction of each link is reversed, but the
label is kept.
3
root
c b
2
b
a
6
a
5
b
4
a
9
Definition 8 (Affix tree). An affix tree T of a string t is a Σ+ -tree s.t.
words(T ) = {u|u is a substring of t} and
words(T −1 ) = {u|u is a substring of t−1 }.
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
9
Affix Trees
root
ab
root
c
c
1
c
b
3
a
a
a
b
b
a
a
5
b
c
4
6
b
c
2
a
ba
7
8
a
7
c
a
c
6
a
b
4
8
a
c
9
9
Linear Bidirectional On-Line Construction of Affix Trees
b
3
b
b
b
c
abc
5
a
a
1
a
b
aba
b
2
b
b
c
c
Moritz G. Maaß, June 2000
10
Previous Work
• Weiner, McCreight: linear suffix tree construction
• Ukkonen: linear on-line suffix tree construction, reference pairs, open edges
• Giegerich and Kurtz: relationship between suffix tree and its reverse tree
through suffix links
• Stoye: affix tree data structure
• Blumer et al.: DAWG, c-DAWG with suffix links invariant under reversal
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
11
1. Introduction
2. Construction of Suffix Trees
3. Construction Affix Trees
4. Complexity
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
12
On-Line Suffix Tree Construction
b
root
b
a
a
a
b
b ba
a
a
a
a
b
b
b aa ba
a
a b
a
b
a
a
CST(bbabaaba)
Linear Bidirectional On-Line Construction of Affix Trees
b
a
b
a
a
b
a
b
root
a
a
b
a
b
a
ab
b ab
a
b
a b ab
a
a
b
b
a
b
CST(bbabaabab)
Moritz G. Maaß, June 2000
13
Anti-On-Line Suffix Tree Construction
b
b
n
root
a
n’’
a +"b"b
ab ab
b
b
b
a
a
b b
n’
a
b
a
b
b
b
b
a
b
b
b
CST(abaababb)
Linear Bidirectional On-Line Construction of Affix Trees
b
root
a
b a
b
b aa a
b bb
a
b
b
a
b
ab a b
b
b a
b ab
b a
b
b
CST(babaababb)
Moritz G. Maaß, June 2000
14
Complexity of Suffix Tree Construction
Lemma 1. Ukkonen’s algorithm constructs CST(t) on-line in time O(|t|).
Lemma 2. With the additional information of knowing the length of the
s of t before inserting it, it takes O(|t|) time
to construct CST(t) in an anti-onactive prefix for any suffix
line manner.
Linear Bidirectional On-Line Construction of Affix Trees
root
a
b
b b a
a
b
root
a
b
b b a
a
b
b
b
Moritz G. Maaß, June 2000
15
1. Introduction
2. Construction of Suffix Trees
3. Construction Affix Trees
4. Complexity
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
16
The Problem in Constructing Affix Trees
a
root
a
a
a
b
a
a
a
Linear Bidirectional On-Line Construction of Affix Trees
ba
2.
a
baa
a
a
b
a
a
b
ba
a
b
b
b
ba
a
a
b
1. b
a
b
a
a
ba
b
a
Moritz G. Maaß, June 2000
17
The Problem in Constructing Affix Trees (continued)
a
a
b
a
a
a
Linear Bidirectional On-Line Construction of Affix Trees
a
baa
baa
ba
b
b
a
ab
a
ba
a
a
b
a
b
a
a
b
a
root
ba
b
a
a
a
b
a a
b
a
a
b
a
b
a
a
b
a
a
b
a
a
a
Moritz G. Maaß, June 2000
18
The Solution: Paths
2. a
a
1.a
a
b
a
a
a
Linear Bidirectional On-Line Construction of Affix Trees
b
a
b
a
a
b
ba
b
a
b
a
baa
a
ba
a
b
b
b
ba
a
a
root
a
ba
b
a
a
Moritz G. Maaß, June 2000
19
The Solution: Paths (continued)
a
a
a
b
a
a
a
a
Linear Bidirectional On-Line Construction of Affix Trees
a
baa
baa
ba
b
b
a
ab
a
ba
a
a
b
b
b
a
a
root
ba
b
a
a
a
b
a a
b
a
b
a
b
a
a
b
a
a
a
b
a
a
a
Moritz G. Maaß, June 2000
20
Additional Steps in Affix Tree Construction
• Updating Paths:
Lemma 3. The prefix parent of the active suffix leaf is (also) a prefix node.
• Keeping track of the active suffix point, the active prefix point, the active suffix
leaf, and the active prefix leaf:
Lemma 4. The active prefix will grow in the iteration from t to ta, iff the new
active suffix of ta is represented by a prefix leaf in CAT(t).
• Deleting Nodes
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
21
Summary of all Steps
1. Remove the suffix link from the active suffix link to s.
2. Lengthen the text, thereby lengthening all open edges.
3. Insert the prefix node for t as suffix parent of ta and link it to s.
4. Insert relevant suffixes and update suffix links.
5. Make the location of the new active suffix α(ta) explicit and add a suffix link
from the new active suffix link to it.
6. Update the active prefix, possibly deleting a node.
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
22
Example of a Single Iteration
root
a
b
a
a
3
b
b
s
asp
4
b
asl
a
a
b
b
b
s
asp
a
a
b
4
a
3
asl
b
b
(b)
Linear Bidirectional On-Line Construction of Affix Trees
a
7
a
ab
b
a
b
asl
b
b
a
asp
b
7
a
b
5
a
b
3
b
4
ab
Moritz G. Maaß, June 2000
a
6
asl
b
1
(d)
asp
a
b
a
6
1
(c)
b
ba
4
ab
a
a
a
b
a
3
b
2
5
6
b
1
asl
root
endpoint
b
a
4
3
1
(a)
s
asp
b
2
5
root
a
b
a
2
5
a
root
a
b
a
2
5
a
root
a
b
1
(e)
23
1. Introduction
2. Construction of Suffix Trees
3. Construction Affix Trees
4. Complexity
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
24
Complexity of Affix Tree Construction
Theorem 1. CAT(t) can be constructed in an on-line manner from left to right or
from right to left in time O(|t|).
Theorem 2. Bidirectional construction of affix trees has linear time complexity.
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
25
Bidirectional Construction - Changes to the Active Prefix Point
root
Linear Bidirectional On-Line Construction of Affix Trees
root
Moritz G. Maaß, June 2000
26
Bidirectional Construction - Changes to the Active Suffix Point
• Growth of the active suffix in a reverse iteration
adds
node-accounted
part.
• Insertion of suffix nodes
in a reverse iteration is
node −accounted
only relevant to nodeaccounted part.
• node-accounted
can not be nested.
Linear Bidirectional On-Line Construction of Affix Trees
part
string−accounted
string part of
active suffix
Moritz G. Maaß, June 2000
27
Conclusion
• Affix trees are a natural extension of suffix trees.
• Construction can be done in linear time, on-line and bidirectional.
• Affix tree augmented by paths behave like suffix trees.
• The view can be switched from the suffix to the prefix tree at any time.
• Right branching and left branching substrings represented in one structure.
Linear Bidirectional On-Line Construction of Affix Trees
Moritz G. Maaß, June 2000
28