Suffix Trees

Construction and Applications
João Carreira
2008
Outline
● Why Suffix Trees?
● Definition
● Ukkonen's Algorithm (construction)
● Applications
Why Suffix Trees?
● Asymptotically fast.
● The basis of state-of-the-art data structures.
● You don't need a PhD to use them.
● Challenging.
● Expose interesting algorithmic ideas.
Definition
Suffix Tree for an m-character string S:
● m leaves numbered 1 to m
● edge-label (the substring on an edge) vs node-label (the concatenation of the edge-labels on the path from the root to the node)
● each internal node, other than the root, has at least two children
● the label of leaf j is S[ j..m ]
● no two edges out of the same node can have edge-labels beginning with the same character
Definition Example
String: xabxac
Length (m): 6 characters
Number of Leaves: 6
Leaf 5 label: S[ 5..6 ] = ac
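Implicit vs Explicit
● What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree is only implicit. Appending a terminal character $ that occurs nowhere else in the string makes every suffix end at its own leaf.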
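A small check of which suffixes cannot get their own leaf may help here. The sketch below does not build a tree; it only tests whether a suffix is a proper prefix of another suffix. The function name `suffixes_without_leaf` is an assumption made for this illustration.

```python
def suffixes_without_leaf(s):
    """Return (start, suffix) pairs, 1-based, for suffixes that are a proper
    prefix of another suffix of s and so cannot end at a leaf."""
    all_suffixes = [s[j:] for j in range(len(s))]
    hidden = []
    for j, suf in enumerate(all_suffixes):
        if any(other != suf and other.startswith(suf) for other in all_suffixes):
            hidden.append((j + 1, suf))
    return hidden


print(suffixes_without_leaf("xabxac"))   # []
print(suffixes_without_leaf("axabx"))    # [(5, 'x')]
print(suffixes_without_leaf("axabx$"))   # []
```

With the terminal $ appended, no suffix is a prefix of another, which is why the explicit tree always has one leaf per suffix.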
Ukkonen's Algorithm
suffix tree construction
Text: S[ 1..m ]
● m phases
● phase i + 1 is divided into i + 1 extensions, one for each suffix of S[ 1..i + 1 ]
● In extension j of phase i + 1:
  ● find the end of the path from the root labeled with substring S[ j..i ]
  ● extend the substring by adding the character S(i + 1) to its end
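Extension Rules
Let β = S[ j..i ] be the string whose path is extended in extension j of phase i + 1.
● Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
● Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β; if β ends inside an edge, a new internal node is created there first.
● Rule 3: Some path from the end of β starts with S(i + 1). The string βS(i + 1) is already in the tree, so we do nothing.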
Ukkonen's Algorithm
suffix tree construction
Complexity:
● m phases
● phase i + 1 -> i + 1 extensions
● find the end of the path of substring β: O(|β|) = O(m)
● each extension, once the end of β is found: O(1)
● Total: O(m^3)
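The phase/extension structure behind the cubic bound can be sketched on an uncompressed trie (one character per edge), which is a simplification of the suffix tree: there, rules 1 and 2 both reduce to adding one character and rule 3 to doing nothing. The names `naive_implicit_trie` and `has_path` are assumptions made for this illustration, and indices are 0-based.

```python
def naive_implicit_trie(s):
    root = {}                      # nested dicts: one character per edge
    steps = 0                      # characters traversed while locating beta
    for i in range(len(s)):        # phase i + 1 appends S(i + 1) = s[i]
        for j in range(i + 1):     # extension j + 1 handles beta = s[j:i]
            node = root
            for ch in s[j:i]:      # walk from the root to the end of beta
                node = node[ch]
                steps += 1
            # rules 1 and 2 add the character, rule 3 finds it already there
            node.setdefault(s[i], {})
    return root, steps


def has_path(root, t):
    """True if t labels a path from the root, i.e. t is a substring."""
    node = root
    for ch in t:
        if ch not in node:
            return False
        node = node[ch]
    return True


root, steps = naive_implicit_trie("xabxac")
print(steps)                                          # 35
print(has_path(root, "bxa"), has_path(root, "xc"))    # True False
```

The three nested loops are where the cubic growth comes from; the rest of the deck is about removing the innermost walk.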
“First make it run, then make it run fast.”
Brian Kernighan
Suffix Links
Definition:
For an internal node v with path-label xα, where x is a single character and α a (possibly empty) string, if there is another node s(v) with
path-label α, then a pointer from v to s(v) is called a suffix link. If α is empty, s(v) is the root.
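As a tiny worked example of the definition, the sketch below derives suffix links purely from path-labels. It assumes the two internal nodes of the tree for xabxac have path-labels xa and a, and uses the empty string to stand for the root; the helper name `suffix_links` is hypothetical.

```python
def suffix_links(internal_labels):
    """Map each internal path-label x+alpha to alpha when a node with
    path-label alpha exists ("" stands for the root)."""
    labels = set(internal_labels) | {""}
    return {lab: lab[1:] for lab in internal_labels if lab and lab[1:] in labels}


# Internal nodes of the tree for "xabxac": path-labels "xa" and "a".
print(suffix_links(["xa", "a"]))   # {'xa': 'a', 'a': ''}
```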
Suffix Links
Lemma:
If a new internal node v with path-label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node,
or an internal node at the end of the string α will be created in the next extension
(j + 1) of the same phase.
If Rule 2 applies:
● S[ j..i ] continues with some character c ≠ S(i + 1)
● so S[ j + 1..i ] also continues with c.
● Hence in extension j + 1 either an internal node already ends the path α = S[ j + 1..i ], or Rule 2 creates one there.
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j - 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.
4. If a new internal node w was created in extension j - 1 (by rule 2), then string α must
end at node s(w), the end node for the suffix link from w. Create the suffix link
(w, s(w)) from w to s(w).
Node Depth
The node-depth of v is at most one greater than the node depth of s(v).
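[Figure: node labels xβ, xα, xλ on the path to v and β, α, λ on the path to s(v). One panel shows equal node-depth (3); the other shows node depth 4 for v and node depth 3 for s(v).]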
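Skip/count Trick
● γ = label of an edge, |γ| = its number of characters
● “Directly implemented” edge traversal: O(|γ|)
● “Jump” from node to node: the path is known to exist, so only the first character of each edge is checked and an edge no longer than the remaining string is skipped whole.
● K = number of nodes on a path
● Time to traverse a path: O(K)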
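The jump from node to node can be sketched on a hand-built tree for xabxac. The representation is an assumption made for this illustration: a node is a dict from the first character of an edge to a tuple (start, end, child), with the edge-label being TEXT[start:end] (0-based, end exclusive). `skip_count_walk` is a hypothetical helper name.

```python
TEXT = "xabxac"


def leaf_edges():
    # Both internal nodes of this tree branch into "bxac" and "c".
    return {"b": (2, 6, {}), "c": (5, 6, {})}


ROOT = {
    "x": (0, 2, leaf_edges()),   # edge "xa" to the internal node with path-label "xa"
    "a": (1, 2, leaf_edges()),   # edge "a" to the internal node with path-label "a"
    "b": (2, 6, {}),             # leaf edge "bxac"
    "c": (5, 6, {}),             # leaf edge "c"
}


def skip_count_walk(node, text, lo, hi):
    """Walk down from node along text[lo:hi], assumed to be in the tree.
    Only the first character of each edge is inspected.
    Returns (node, first char of the edge the walk ends inside or None,
    offset into that edge, number of nodes jumped to)."""
    hops = 0
    while lo < hi:
        start, end, child = node[text[lo]]
        edge_len = end - start
        if hi - lo < edge_len:          # the walk ends inside this edge
            return node, text[lo], hi - lo, hops
        lo += edge_len                  # skip the whole edge in O(1)
        node = child
        hops += 1
    return node, None, 0, hops


_, first, offset, hops = skip_count_walk(ROOT, TEXT, 0, 4)   # walk "xabx"
print(first, offset, hops)   # b 2 1
```

The work done is proportional to the number of nodes visited, not to the number of characters on the path.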
Ukkonen's Algorithm
Using the skip/count trick:
● any phase of Ukkonen's algorithm takes O(m) time.
Proof:
● There are i + 1 ≤ m extensions in phase i + 1.
● In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
● The up-walk decreases the current node-depth by at most one.
● Each suffix link traversal decreases the node-depth by at most another one.
● Each down-walk moves to a node of greater depth.
● Over the entire phase the node-depth is decremented at most 2m times.
● No node can have depth greater than m, so the total increment to current node-depth
(down walks) is bounded by 3m over the entire phase.
Ukkonen's Algorithm
● m phases
● 1 phase: O(m)
● Total: O(m^2)
“First make it run fast, then make it run faster.”
João Carreira
Edge-Label Compression
● A string with m characters has m suffixes.
● If edge labels are represented with characters, O(m^2) space is needed.
● To achieve O(m) space, each edge-label is stored as a pair of indices (p, q), denoting the substring S[ p..q ].
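A minimal sketch of the index-pair representation, assuming 1-based inclusive indices as in S[ p..q ]; the helper name `label` and the sample edges are chosen only for illustration.

```python
S = "xabxac"


def label(edge, text=S):
    """Recover the edge-label from its (p, q) pair (1-based, inclusive)."""
    p, q = edge
    return text[p - 1:q]


edges = [(1, 2), (3, 6), (6, 6)]
print([label(e) for e in edges])   # ['xa', 'bxac', 'c']
```

Each edge costs two integers regardless of how long its label is, which is what brings the total space down to O(m).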
Two more tricks...
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further
extensions until the end of the phase.
Why?
● When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and
so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j + 1, ..., i + 1.
● End any phase i + 1 the first time rule 3 applies.
● The remaining extensions are said to be done implicitly.
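Once a leaf always a leaf
● Leaf created => always a leaf in all successive trees.
● No mechanism for extending a leaf edge beyond its current leaf.
● Once there is a leaf labeled j, extension rule 1 will always apply to extension j
in any successive phase.
● Leaf edge label: (p, e), where e is a single global index for the current end of the string. Incrementing e once per phase performs all rule 1 extensions implicitly.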
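The shared end index can be sketched as one mutable object referenced by every leaf edge. The class names `End` and `LeafEdge` are assumptions made for this illustration; indices are 1-based and inclusive, as in (p, e).

```python
class End:
    """Shared, mutable end index e."""
    def __init__(self, value=0):
        self.value = value


class LeafEdge:
    def __init__(self, p, end):
        self.p = p          # 1-based start of the label
        self.end = end      # every leaf edge refers to the same End object

    def label(self, text):
        return text[self.p - 1:self.end.value]


S = "xabxac"
e = End(3)                                   # after phase 3 the leaves spell xab, ab, b
leaves = [LeafEdge(p, e) for p in (1, 2, 3)]
print([leaf.label(S) for leaf in leaves])    # ['xab', 'ab', 'b']
e.value += 1                                 # phase 4: one increment extends every leaf
print([leaf.label(S) for leaf in leaves])    # ['xabx', 'abx', 'bx']
```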
Single Phase Algorithm
In each phase i:
Single Phase Algorithm
During construction:
Implicit to Explicit
One last phase to add character $:
O(m)
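Putting the tricks together, a compact construction might look like the sketch below. It uses the widely used "active point" bookkeeping (active node, active edge, active length, and a count of pending suffixes) rather than the explicit up-walk described in these slides, so it should be read as one possible arrangement of the same ideas, not as the presenter's code. All names are assumptions; indices are 0-based with exclusive ends, and `end = None` plays the role of the global end e.

```python
class Node:
    def __init__(self, start, end):
        self.start = start      # edge into this node is text[start:end]
        self.end = end          # None marks a leaf edge ending at the global end
        self.children = {}      # first character of edge-label -> child node
        self.link = None        # suffix link (internal nodes only)


def build_suffix_tree(text):
    """Build the suffix tree of text, which should end with a unique terminal."""
    root = Node(-1, -1)
    node, edge, length = root, 0, 0   # active point
    remainder = 0                     # suffixes still waiting for an explicit extension
    for i, c in enumerate(text):      # phase i + 1; leaf edges now end at i + 1
        remainder += 1
        last_new = None               # internal node created in the previous extension
        while remainder > 0:
            if length == 0:
                edge = i
            first = text[edge]
            child = node.children.get(first)
            if child is None:
                node.children[first] = Node(i, None)        # rule 2: new leaf at a node
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                edge_len = (i + 1 if child.end is None else child.end) - child.start
                if length >= edge_len:                      # skip/count: jump over the edge
                    node, edge, length = child, edge + edge_len, length - edge_len
                    continue
                if text[child.start + length] == c:         # rule 3: end the phase
                    length += 1
                    if last_new is not None:
                        last_new.link = node
                    break
                split = Node(child.start, child.start + length)   # rule 2: split the edge
                node.children[first] = split
                split.children[c] = Node(i, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link or root                    # follow the suffix link
    return root


def leaf_labels(node, text, prefix=""):
    """Path-labels of all leaves below node."""
    if not node.children:
        return [prefix]
    labels = []
    for child in node.children.values():
        end = len(text) if child.end is None else child.end
        labels += leaf_labels(child, text, prefix + text[child.start:end])
    return labels


text = "xabxac$"
tree = build_suffix_tree(text)
# Every suffix should end at exactly one leaf.
print(sorted(leaf_labels(tree, text)) == sorted(text[j:] for j in range(len(text))))
```

The terminal $ is included in the input from the start here, which has the same effect as running one last phase to add it.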
Suffix Trees are a Swiss Army Knife
Applications
Exact String Matching (text of length m, pattern of length n, k occurrences):
● Preprocessing: O(m)
● Search: O(n + k)
Three occurrences of string aw.
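The search itself is short: walk the pattern down from the root, then report the leaf numbers below the point reached. The sketch below uses a naively built, uncompressed suffix trie as a stand-in for the suffix tree, so it does not meet the stated bounds; it only illustrates the query. The text awyawxawxz is an assumed example chosen so that aw occurs three times, and the names `build_suffix_trie` and `occurrences` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.leaf = None     # 1-based suffix number if a suffix ends here


def build_suffix_trie(text):
    text += "$"              # assumes "$" does not occur in text
    root = Node()
    for j in range(len(text)):
        node = root
        for ch in text[j:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = j + 1
    return root


def occurrences(root, pattern):
    node = root
    for ch in pattern:                   # match the pattern from the root
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:                         # collect leaf numbers in the subtree
        cur = stack.pop()
        if cur.leaf is not None:
            found.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(found)


trie = build_suffix_trie("awyawxawxz")
print(occurrences(trie, "aw"))    # [1, 4, 7]
print(occurrences(trie, "wz"))    # []
```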
Applications
And much more...
● Longest common substring: O(n)
● Longest repeated substring: O(n)
● Longest palindrome: O(n)
● Most frequently occurring substrings of a minimum length: O(n)
● Shortest substrings occurring only once: O(n)
● Lempel-Ziv decomposition: O(n)
● .....
“Biology easily has 500 years of exciting problems to work on.”
Donald Knuth
web.ist.utl.pt/joao.carreira
Questions?