Affix Trees Moritz G. Maaß, June 2000 http://www.informatik.tu-muenchen.de/∼maass [email protected] Agenda 1. Introduction 2. Construction of Suffix Trees 3. Construction Affix Trees 4. Complexity Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 1 Important Suffix Tree Properties • Representation of repeated substrings • Right branching substrings are • Each tree position represents a • Moving down in the tree extends y w represented by branching nodes unique string x v w w x y the string, moving towards the v’ v v’ root shortens the string. Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 2 Representation of Tree Positions with Reference Pairs 1 c c 3 root a b 2 a b c 4 root + a b 5 + ε 3 + ε 2 + ab 5 6 c a b c 9 Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 3 Suffix Links root a b c • Two ways of “shortening” the represented substring abab at the front to come to the position of bab • Suffix links operate at the front of the represented tree, while edges operate at the end b c 1 b 2 c 3 a a b c 4 a 5 c a b c 6 b a Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 b 9 4 Reverse Tree root a b c b c 1 b 2 c 3 a a b c 4 a b 5 c a b c 6 b a Linear Bidirectional On-Line Construction of Affix Trees −1 (.) root c bb 2 b c a a 3 a b 5 6 bc c 4 a c 1 a b a b c 9 9 Moritz G. Maaß, June 2000 5 Affix Trees CST(ababc) CAT(ababc) root c ba 1 3 c c 2 1 b c a 5 b c a 6 root c b ba b 3 a c b 4 c 4 9 Linear Bidirectional On-Line Construction of Affix Trees b c a a a 5 b c ba a b c a 9 root root 8 c 1 b 6 CST(cbaba) c a 2 CAT(cbaba) b b c b a 2 3 5 a 7 c b 4 c b b a 6 b a b b b a b a a c a a 7 b 8 c 9 Moritz G. Maaß, June 2000 a a a a b a b a b a b a 7 b 8 c 9 6 Definition 1 (right branching and left branching). A substring w of t is right branching (left-branching), if there w occurs at two different positions in t with distinct succeding (preceding) letters (w is r.b. in t, if ∃x, y ∈ Σ, u, v, u0 , v 0 ∈ Σ∗ .t = uwxv ∧ t = u0 wyv 0 ∧ x 6= y). Definition 2 (Σ+ -tree). A Σ+ -tree T is a rooted, directed tree with w x v’ v root a b edge labels from Σ+ . For each a ∈ Σ, every node in T has at most one outgoing edge whose label starts with a. y b a a Definition 3 (path(n)). If n is a node in Σ+ -tree T , then path(n) is the string built by concatenating all edge labels from the root to n. It is a unique identifier for the tree position. Definition 4 (words(T )). A string u is in words(T ), if there is a node n in T s.t. ∃v ∈ Σ∗ .uv = path(n). Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 7 Suffix Trees and Suffix Links root a b c Definition 5 (Suffix tree). A t is a Σ+ -tree with words(T ) = {u|u is a substring of t}. suffix tree of string b c 1 b 2 c 3 a a b c 4 a b 5 c a b c 6 b a 9 Definition 6 (Suffix Link). A suffix link is an auxiliary edge from node n to node m where m is the node s.t. path(m) is the longest proper suffix of path(n) represented by a node in T . Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 8 Reverse Tree and Affix Trees Definition 7 (Reverse tree T −1 ). The reverse tree 1 T −1 of a Σ+ -tree T augmented with suffix links is defined as the tree that is formed by the suffix links of T , where the direction of each link is reversed, but the label is kept. 3 root c b 2 b a 6 a 5 b 4 a 9 Definition 8 (Affix tree). An affix tree T of a string t is a Σ+ -tree s.t. words(T ) = {u|u is a substring of t} and words(T −1 ) = {u|u is a substring of t−1 }. Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 9 Affix Trees root ab root c c 1 c b 3 a a a b b a a 5 b c 4 6 b c 2 a ba 7 8 a 7 c a c 6 a b 4 8 a c 9 9 Linear Bidirectional On-Line Construction of Affix Trees b 3 b b b c abc 5 a a 1 a b aba b 2 b b c c Moritz G. Maaß, June 2000 10 Previous Work • Weiner, McCreight: linear suffix tree construction • Ukkonen: linear on-line suffix tree construction, reference pairs, open edges • Giegerich and Kurtz: relationship between suffix tree and its reverse tree through suffix links • Stoye: affix tree data structure • Blumer et al.: DAWG, c-DAWG with suffix links invariant under reversal Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 11 1. Introduction 2. Construction of Suffix Trees 3. Construction Affix Trees 4. Complexity Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 12 On-Line Suffix Tree Construction b root b a a a b b ba a a a a b b b aa ba a a b a b a a CST(bbabaaba) Linear Bidirectional On-Line Construction of Affix Trees b a b a a b a b root a a b a b a ab b ab a b a b ab a a b b a b CST(bbabaabab) Moritz G. Maaß, June 2000 13 Anti-On-Line Suffix Tree Construction b b n root a n’’ a +"b"b ab ab b b b a a b b n’ a b a b b b b a b b b CST(abaababb) Linear Bidirectional On-Line Construction of Affix Trees b root a b a b b aa a b bb a b b a b ab a b b b a b ab b a b b CST(babaababb) Moritz G. Maaß, June 2000 14 Complexity of Suffix Tree Construction Lemma 1. Ukkonen’s algorithm constructs CST(t) on-line in time O(|t|). Lemma 2. With the additional information of knowing the length of the s of t before inserting it, it takes O(|t|) time to construct CST(t) in an anti-onactive prefix for any suffix line manner. Linear Bidirectional On-Line Construction of Affix Trees root a b b b a a b root a b b b a a b b b Moritz G. Maaß, June 2000 15 1. Introduction 2. Construction of Suffix Trees 3. Construction Affix Trees 4. Complexity Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 16 The Problem in Constructing Affix Trees a root a a a b a a a Linear Bidirectional On-Line Construction of Affix Trees ba 2. a baa a a b a a b ba a b b b ba a a b 1. b a b a a ba b a Moritz G. Maaß, June 2000 17 The Problem in Constructing Affix Trees (continued) a a b a a a Linear Bidirectional On-Line Construction of Affix Trees a baa baa ba b b a ab a ba a a b a b a a b a root ba b a a a b a a b a a b a b a a b a a b a a a Moritz G. Maaß, June 2000 18 The Solution: Paths 2. a a 1.a a b a a a Linear Bidirectional On-Line Construction of Affix Trees b a b a a b ba b a b a baa a ba a b b b ba a a root a ba b a a Moritz G. Maaß, June 2000 19 The Solution: Paths (continued) a a a b a a a a Linear Bidirectional On-Line Construction of Affix Trees a baa baa ba b b a ab a ba a a b b b a a root ba b a a a b a a b a b a b a a b a a a b a a a Moritz G. Maaß, June 2000 20 Additional Steps in Affix Tree Construction • Updating Paths: Lemma 3. The prefix parent of the active suffix leaf is (also) a prefix node. • Keeping track of the active suffix point, the active prefix point, the active suffix leaf, and the active prefix leaf: Lemma 4. The active prefix will grow in the iteration from t to ta, iff the new active suffix of ta is represented by a prefix leaf in CAT(t). • Deleting Nodes Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 21 Summary of all Steps 1. Remove the suffix link from the active suffix link to s. 2. Lengthen the text, thereby lengthening all open edges. 3. Insert the prefix node for t as suffix parent of ta and link it to s. 4. Insert relevant suffixes and update suffix links. 5. Make the location of the new active suffix α(ta) explicit and add a suffix link from the new active suffix link to it. 6. Update the active prefix, possibly deleting a node. Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 22 Example of a Single Iteration root a b a a 3 b b s asp 4 b asl a a b b b s asp a a b 4 a 3 asl b b (b) Linear Bidirectional On-Line Construction of Affix Trees a 7 a ab b a b asl b b a asp b 7 a b 5 a b 3 b 4 ab Moritz G. Maaß, June 2000 a 6 asl b 1 (d) asp a b a 6 1 (c) b ba 4 ab a a a b a 3 b 2 5 6 b 1 asl root endpoint b a 4 3 1 (a) s asp b 2 5 root a b a 2 5 a root a b a 2 5 a root a b 1 (e) 23 1. Introduction 2. Construction of Suffix Trees 3. Construction Affix Trees 4. Complexity Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 24 Complexity of Affix Tree Construction Theorem 1. CAT(t) can be constructed in an on-line manner from left to right or from right to left in time O(|t|). Theorem 2. Bidirectional construction of affix trees has linear time complexity. Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 25 Bidirectional Construction - Changes to the Active Prefix Point root Linear Bidirectional On-Line Construction of Affix Trees root Moritz G. Maaß, June 2000 26 Bidirectional Construction - Changes to the Active Suffix Point • Growth of the active suffix in a reverse iteration adds node-accounted part. • Insertion of suffix nodes in a reverse iteration is node −accounted only relevant to nodeaccounted part. • node-accounted can not be nested. Linear Bidirectional On-Line Construction of Affix Trees part string−accounted string part of active suffix Moritz G. Maaß, June 2000 27 Conclusion • Affix trees are a natural extension of suffix trees. • Construction can be done in linear time, on-line and bidirectional. • Affix tree augmented by paths behave like suffix trees. • The view can be switched from the suffix to the prefix tree at any time. • Right branching and left branching substrings represented in one structure. Linear Bidirectional On-Line Construction of Affix Trees Moritz G. Maaß, June 2000 28
© Copyright 2026 Paperzz