Suffix Trees: Construction and Applications
João Carreira, 2008

Outline
● Why Suffix Trees?
● Definition
● Ukkonen's Algorithm (construction)
● Applications

Why Suffix Trees?
● Asymptotically fast.
● The basis of state-of-the-art data structures.
● You don't need a PhD to use them.
● Challenging.
● Expose interesting algorithmic ideas.

Definition
Suffix tree for an m-character string S:
● m leaves numbered 1 to m
● edge-label (the string on an edge) vs node-label (the concatenated edge-labels on the path from the root to the node)
● each internal node other than the root has at least two children
● the label of leaf j is S[ j..m ]
● no two edges out of the same node can have edge-labels beginning with the same character

Definition: Example
String: xabxac
Length (m): 6 characters
Number of leaves: 6
Node 5 label: ac

Implicit vs Explicit
● What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree has fewer than m leaves. Appending a terminal character $ that occurs nowhere else in the string fixes this.
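The definition and the two example strings above can be checked with a small sketch. This is not Ukkonen's algorithm: it is a quadratic-time construction that inserts the suffixes one at a time, and the names `Node`, `naive_suffix_tree` and `leaf_labels` are illustrative assumptions, not taken from the slides.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge-label -> (edge-label, child Node)
        self.leaf = None     # 1-based suffix number on leaves, None elsewhere


def common_prefix_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def naive_suffix_tree(s):
    """Insert the suffixes S[j..m] one by one, longest first (quadratic time)."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while rest:
            entry = node.children.get(rest[0])
            if entry is None:              # no edge starts with this character: new leaf
                leaf = Node()
                leaf.leaf = j + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = common_prefix_len(label, rest)
            if k == len(rest):             # suffix is a prefix of an earlier one: stays implicit
                break
            if k < len(label):             # mismatch inside the edge: split it at offset k
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to its label (edge-labels concatenated from the root)."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, prefix + label))
    return out


labels = leaf_labels(naive_suffix_tree("xabxac"))
print(len(labels), labels[5])                          # 6 ac
print(len(leaf_labels(naive_suffix_tree("axabx"))))    # 4 (suffix x stays implicit)
print(len(leaf_labels(naive_suffix_tree("axabx$"))))   # 6 (terminator makes every suffix explicit)
```

Edge-labels are stored as plain strings here to keep the sketch short, which costs quadratic space in the worst case.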
Ukkonen's Algorithm
suffix tree construction

Text: S[ 1..m ]
● m phases
● phase j is divided into j extensions
● In extension j of phase i + 1:
  ● find the end of the path from the root labeled with substring β = S[ j..i ]
  ● extend the substring by adding the character S(i + 1) to its end

Extension Rules
● Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
● Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β, together with a new internal node if β ends inside an edge.
● Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.

Ukkonen's Algorithm
suffix tree construction
Complexity:
● m phases
● phase j -> j extensions
● find the end of the path of substring β: O(|β|) = O(m)
● each extension, once the end of β is found: O(1)
Total: O(m^3)

“First make it run, then make it run fast.”
Brian Kernighan

Suffix Links
● Definition: For an internal node v with path-label xα (x a single character, α a possibly empty string), if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
● Lemma: If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of string α will be created in the next extension of the same phase.
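The suffix-link definition can be illustrated by brute force on a finished tree. The sketch below does not build the tree. It relies on the fact that, when the last character of the string occurs nowhere else, a string is the path-label of a branching node exactly when it is followed by at least two different characters somewhere in the text. The function names are illustrative assumptions, and the cubic running time is acceptable only for toy inputs.

```python
from collections import defaultdict


def internal_node_labels(s):
    """Path-labels of the branching nodes (root included as "") of the suffix tree of s.

    Assumes the last character of s occurs nowhere else in s.
    """
    followers = defaultdict(set)
    for i in range(len(s)):
        for k in range(i, len(s)):
            followers[s[i:k]].add(s[k])   # substring s[i:k] is followed by s[k]
    return {label for label, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal path-label x + alpha to alpha whenever alpha is also a node."""
    labels = internal_node_labels(s)
    return {v: v[1:] for v in labels if v and v[1:] in labels}


print(sorted(suffix_links("xabxac").items()))   # [('a', ''), ('xa', 'a')]
```

For xabxac the only internal nodes are xa and a, and the links go from xa to a and from a to the root (the empty label).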
Suffix Links
Why the lemma holds:
● If Rule 2 applies in extension j:
  ● S[ j..i ] continues with some character c ≠ S(i + 1)
  ● so S[ j + 1..i ] continues with c as well.
● If S[ j + 1..i ] also continues with another character, an internal node already exists at its end. If it continues only with c, Rule 2 creates that internal node in extension j + 1.

Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[ j - 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Otherwise traverse the suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.
4. If a new internal node w was created in extension j - 1 (by rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).

Node Depth
The node-depth of v is at most one greater than the node-depth of s(v).
[Figure: two examples of a path through nodes labeled xβ, xα, xλ next to the corresponding path through β, α, λ. In one, both node-depths are 3. In the other, the node-depth of v is 4 and that of s(v) is 3.]

Skip/count Trick
● |γ| = number of characters on an edge
● “Directly implemented” edge traversal: O(|γ|)
● Instead, “jump” from node to node, looking only at the first character and the length of each edge.
● K = number of nodes in a path
● Time to traverse a path: O(K)
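The skip/count trick can be sketched as follows. The tree for xabxac is wired up by hand, edge-labels are stored as index pairs into the text, and the names `Node` and `skip_count` are assumptions made for this illustration only. The walk relies on the precondition that the string being followed is already in the tree, which is what allows it to skip the character comparisons.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = {}   # first character -> (start, end, child); edge-label is text[start:end]


def skip_count(text, node, lo, hi):
    """Walk down from node along text[lo:hi], which must already be in the tree.

    Only the first character of each edge is inspected. Returns a tuple
    (node, first_char, offset, hops): the walk ends offset characters into the
    edge starting with first_char below node (first_char is None if it ends at
    a node), after jumping over hops whole edges.
    """
    hops = 0
    while lo < hi:
        start, end, child = node.edges[text[lo]]
        span = end - start
        if hi - lo < span:        # the walk ends inside this edge
            return node, text[lo], hi - lo, hops
        lo += span                # jump over the whole edge without comparing it
        node = child
        hops += 1
    return node, None, 0, hops


# Hand-built suffix tree of "xabxac" (0-based, end-exclusive indices).
S = "xabxac"
root, xa, a = Node("root"), Node("xa"), Node("a")
leaf = {j: Node(f"leaf {j}") for j in range(1, 7)}
root.edges = {"x": (0, 2, xa), "a": (1, 2, a), "b": (2, 6, leaf[3]), "c": (5, 6, leaf[6])}
xa.edges = {"b": (2, 6, leaf[1]), "c": (5, 6, leaf[4])}
a.edges = {"b": (2, 6, leaf[2]), "c": (5, 6, leaf[5])}

end_node, first, offset, hops = skip_count(S, root, 0, 5)   # follow "xabxa"
print(end_node.name, first, offset, hops)                   # xa b 3 1
```

Following the five characters of xabxa touches two edges and inspects two characters, instead of comparing all five.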
Ukkonen's Algorithm
Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.
Proof:
● There are i + 1 ≤ m extensions in phase i + 1.
● In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
● The up-walk decreases the current node-depth by at most one.
● Each suffix link traversal decreases the node-depth by at most another one.
● Each down-walk moves to a node of greater depth.
● Over the entire phase the node-depth is decremented at most 2m times.
● No node can have depth greater than m, so the total increment to the current node-depth (down-walks) is bounded by 3m over the entire phase.

Ukkonen's Algorithm
● m phases
● 1 phase: O(m)
Total: O(m^2)

“First make it run fast, then make it run faster.”
João Carreira

Edge-Label Compression
● A string with m characters has m suffixes.
● If edge labels are represented with characters, O(m^2) space is needed.
● To achieve O(m) space, each edge-label is stored as a pair of indices (p, q) denoting the substring S[ p..q ].

Two more tricks...

Rule 3 is a show stopper
● If rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase.
● Why? When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j + 1, ..., i + 1.
● End any phase i + 1 the first time rule 3 applies.
● The remaining extensions are said to be done implicitly.
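The space argument behind edge-label compression can be made concrete with a small worked example. It assumes the worst case of a string whose characters are all distinct, where the suffix tree is just a root with one leaf edge per suffix. `EdgeLabel` and `storage_for_distinct_chars` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeLabel:
    p: int   # 1-based index of the first character of the label
    q: int   # 1-based index of the last character (inclusive), as in S[p..q]

    def text(self, s):
        return s[self.p - 1:self.q]


def storage_for_distinct_chars(s):
    """Return (characters stored explicitly, integers stored with (p, q) pairs)."""
    assert len(set(s)) == len(s), "sketch only covers strings with distinct characters"
    m = len(s)
    edges = [EdgeLabel(j, m) for j in range(1, m + 1)]   # one leaf edge per suffix S[j..m]
    explicit = sum(len(e.text(s)) for e in edges)        # m + (m - 1) + ... + 1
    compressed = 2 * len(edges)                          # two integers per edge
    return explicit, compressed


print(EdgeLabel(3, 6).text("xabxac"))          # bxac
print(storage_for_distinct_chars("abcdef"))    # (21, 12)
```

The explicit count grows as m(m + 1)/2 while the index pairs grow linearly, which is the gap between O(m^2) and O(m) stated in the slides.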
Once a leaf, always a leaf
● A leaf, once created, remains a leaf in all successive trees.
● There is no mechanism for extending a leaf edge beyond its current leaf.
● Once there is a leaf labeled j, extension rule 1 will always apply to extension j in any successive phase.
● Leaf edge label: (p, e), where e is a global index denoting the current end of the text.

Single Phase Algorithm
In each phase i:

Single Phase Algorithm
During construction:

Implicit to Explicit
One last phase to add character $: O(m)

Suffix Trees are a Swiss Army Knife

Applications
Exact String Matching:
● Preprocessing: O(m) for a text of length m
● Search: O(n + k) for a pattern of length n with k occurrences
[Figure: three occurrences of string aw.]

Applications
And much more...
● Longest common substring: O(n)
● Longest repeated substring: O(n)
● Longest palindrome: O(n)
● Most frequently occurring substrings of a minimum length: O(n)
● Shortest substrings occurring only once: O(n)
● Lempel-Ziv decomposition: O(n)
● .....

“Biology easily has 500 years of exciting problems to work on.”
Donald Knuth

web.ist.utl.pt/joao.carreira

Questions?
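To tie construction and exact string matching together, here is a compact sketch of a linear-time build followed by a search. It uses the widely taught "active point" bookkeeping, which is a different presentation from the phase and extension walk-through in these slides, although the same ingredients appear: index-pair labels, an open end on leaf edges, suffix links, skip/count, and stopping a phase at the first rule 3. All names are assumptions, and the text is assumed to end with a character that occurs nowhere else.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # edge into this node is labeled text[start:end]
        self.end = end        # None on leaves: the shared "current end" e
        self.children = {}    # first character of edge-label -> child Node
        self.link = None      # suffix link (None is treated as the root)


def build_suffix_tree(text):
    """Build the suffix tree of text, whose last character must be unique."""
    root = Node(0, 0)
    node, edge, length = root, 0, 0      # active point
    remaining = 0                        # suffixes not yet made explicit
    for i, c in enumerate(text):         # one phase per character
        remaining += 1
        last_new = None                  # internal node still waiting for its suffix link
        while remaining:
            if length == 0:
                edge = i
            first = text[edge]
            child = node.children.get(first)
            if child is None:            # rule 2 at a node: hang a new leaf
                node.children[first] = Node(i, None)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                child_end = i + 1 if child.end is None else child.end
                span = child_end - child.start
                if length >= span:       # skip/count: jump over the whole edge
                    node, edge, length = child, edge + span, length - span
                    continue
                if text[child.start + length] == c:   # rule 3: end the phase
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                split = Node(child.start, child.start + length)   # rule 2 inside an edge
                node.children[first] = split
                split.children[c] = Node(i, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remaining -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remaining + 1
            elif node is not root:
                node = node.link or root
    return root


def find_all(root, text, pattern):
    """Return the sorted 0-based start positions of pattern in text."""
    node, matched, depth = root, 0, 0
    while matched < len(pattern):
        child = node.children.get(pattern[matched])
        if child is None:
            return []
        label = text[child.start:len(text) if child.end is None else child.end]
        take = min(len(label), len(pattern) - matched)
        if label[:take] != pattern[matched:matched + take]:
            return []
        matched, depth, node = matched + take, depth + len(label), child
    hits, stack = [], [(node, depth)]
    while stack:                          # every leaf below the match is one occurrence
        n, d = stack.pop()
        if not n.children:
            hits.append(len(text) - d)
        for ch in n.children.values():
            ch_end = len(text) if ch.end is None else ch.end
            stack.append((ch, d + ch_end - ch.start))
    return sorted(hits)


text = "xabxac$"
tree = build_suffix_tree(text)
print(find_all(tree, text, "xa"))   # [0, 3]
print(find_all(tree, text, "xb"))   # []
```

The search walks down at most the length of the pattern and then visits only the subtree below the match, which is where the O(n + k) bound on the Applications slide comes from. Sorting the hits is a convenience of this sketch and is not counted in that bound.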