Suffix Trees - Bioinformatics Group Freiburg

Suffix Trees
Rolf Backofen
Lehrstuhl für Bioinformatik
Institut für Informatik
Course Bioinformatics II — WS 11/12
String Matching
find efficiently all occurrences of a pattern P of length m in a
text T of length n (m << n)
Counting query: reports the number of occurrences of P in T
Reporting query: reports all occurrences of P in T
string matching can be solved with a suffix tree
advantage over other string-matching algorithms:
if T is static, the suffix tree is constructed once in a preprocessing
step
the subsequent string matchings are then "fast"
Definitions
T = t_1 t_2 ... t_n
Definition
The substring t_1 ... t_i is called the i-th prefix of T (1 ≤ i ≤ n).
Example: T = ACCTTCCT
first prefix: A
fourth prefix: ACCT
Definition
The substring t_i ... t_n is called the i-th suffix of T (1 ≤ i ≤ n).
Example: T = ACCTTCCT
first suffix: ACCTTCCT
fourth suffix: TTCCT
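The two definitions can be checked with a few lines of Python. This is only a sketch: the helper names prefix and suffix are made up for illustration, and positions are 1-based as on the slides.

```python
def prefix(text: str, i: int) -> str:
    """Return the i-th prefix t_1 ... t_i of text (1 <= i <= n)."""
    if not 1 <= i <= len(text):
        raise ValueError("i must satisfy 1 <= i <= n")
    return text[:i]


def suffix(text: str, i: int) -> str:
    """Return the i-th suffix t_i ... t_n of text (1 <= i <= n)."""
    if not 1 <= i <= len(text):
        raise ValueError("i must satisfy 1 <= i <= n")
    return text[i - 1:]


T = "ACCTTCCT"
print(prefix(T, 4))  # ACCT
print(suffix(T, 4))  # TTCCT
```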
Definition
Suffix Tree
A suffix tree for a text T of length n over the alphabet Σ is a rooted
directed tree with n leaves. Apart from the root node, all internal
nodes have at least two children. All edges are labeled with a
non-empty substring of T and all outgoing edges from a node start
with a different character. Each leaf in the suffix tree is labeled with
an integer i ∈ {1, ..., n} such that the concatenation of the edge labels
on the path from the root to the leaf node spells out the suffix of T
that starts at position i. The suffix tree can be constructed in O(n)
time and requires O(n) space.
Remark: In order to have a one-to-one correspondence between the
suffixes of T and the leaves of the suffix tree, we add a new character
$ ∉ Σ to the end of T. This ensures that no suffix is a prefix of
another suffix.
Example for Suffix Tree
Suffixes: ACCTTCCT$
CCTTCCT$
CTTCCT$
TTCCT$
TCCT$
CCT$
CT$
T$
$
T = ACCTTCCT$
[Figure: suffix tree for T = ACCTTCCT$; the leaves are labeled with the suffix start positions 1 to 9]
Notations
for a node v in the suffix tree, v̄ denotes the concatenation of all
edge labels on the path from the root to v (the path label of v)
|v̄| denotes the string depth of a node v
in order to identify a node v in the suffix tree with v̄ = x, we
write x̄
a suffix link sl(v) of an internal node v with v̄ = cb, where c is a
character and b is a string, is the node w with w̄ = b
Searching in a Suffix Tree
Task: find pattern P = p_1 ... p_m of length m in the suffix tree for text
T of length n
1. set cur_node = root and cur_char = p_1
2. locate the outgoing edge of cur_node whose label starts
   with cur_char
3. match the subsequent characters of the pattern against the label of
   the edge located in step 2, character by character, until the whole
   pattern has been matched (go to step 4a) or the edge ends at a node
   v. Assume p_1 ... p_i have been matched so far: set cur_node = v and
   cur_char = p_{i+1}
4. repeat steps 2 and 3 until:
   a) the whole pattern has been matched
   b) there is no outgoing edge that starts with cur_char (step 2) or the
      subsequent characters of P cannot be matched (step 3)
Searching in a Suffix Tree (cont.)
step 4a):
the whole pattern was matched
suppose the search procedure ended at node w or on the incoming
edge of node w
⇒ the occurrences of P in T can be found in the subtree rooted at w
step 4b):
there is no outgoing edge that starts with cur_char, or
the subsequent characters of P cannot be matched
⇒ P does not occur in T
Searching in a Suffix Tree (cont.)
Counting query: reports the number of occurrences of P in T
step 4a): occurrences of the pattern found
⇒ return the number of leaves in the subtree rooted at w
(assuming that all nodes in the suffix tree are labeled with their
subtree sizes, this can be done in constant time)
step 4b): no occurrence of the pattern found
⇒ return 0 (constant time)
Runtime for counting query: O(m), i.e. the final node is found in m steps
Reporting query: reports all occurrences of P in T
step 4a): occurrences of the pattern found
⇒ output the labels of all leaves in the subtree rooted at w in
O(Occ_P^T) time, where Occ_P^T is the number of occurrences of P
in T
step 4b): no occurrence of the pattern found
⇒ output the empty set (constant time)
Runtime for reporting query: O(m + Occ_P^T)
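The search procedure and both queries can be sketched in Python as follows. The tree is built here by naive insertion of all suffixes, which takes O(n^2) time and is not the O(n) construction mentioned in the definition; it is chosen only to keep the example short. The names Node, build, set_sizes, locate, count and report are illustrative assumptions, not part of the lecture.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (edge label, child node)
        self.leaf = leaf    # 1-based suffix start for a leaf, else None
        self.size = 0       # number of leaves in the subtree


def build(text):
    """Naive O(n^2) construction: insert the suffixes one after another."""
    assert text.count(text[-1]) == 1, "text must end with a unique terminator"
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while rest[0] in node.children:
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k < len(label):  # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
        node.children[rest[0]] = (rest, Node(leaf=i + 1))
    set_sizes(root)
    return root


def set_sizes(node):
    """Label every node with the number of leaves below it."""
    if node.leaf is not None:
        node.size = 1
    else:
        node.size = sum(set_sizes(c) for _, c in node.children.values())
    return node.size


def locate(root, pattern):
    """Steps 1 to 4: return the node w at (or below) which P ends, else None."""
    node, i = root, 0
    while i < len(pattern):
        if pattern[i] not in node.children:  # step 2 fails (case 4b)
            return None
        label, child = node.children[pattern[i]]
        for ch in label:                     # step 3
            if i == len(pattern):
                break                        # P ended inside the edge (case 4a)
            if ch != pattern[i]:
                return None                  # mismatch (case 4b)
            i += 1
        node = child
    return node


def count(root, pattern):
    """Counting query: subtree size of w, or 0 if P does not occur."""
    w = locate(root, pattern)
    return 0 if w is None else w.size


def report(root, pattern):
    """Reporting query: leaf labels below w (sorted only for readability)."""
    w = locate(root, pattern)
    stack, found = ([] if w is None else [w]), []
    while stack:
        node = stack.pop()
        if node.leaf is not None:
            found.append(node.leaf)
        stack.extend(child for _, child in node.children.values())
    return sorted(found)


tree = build("ACCTTCCT$")
print(count(tree, "CCT"), report(tree, "CCT"))  # 2 [2, 6]
print(count(tree, "CG"), report(tree, "CG"))    # 0 []
```

The final sort in report is a convenience for reading the output and adds a logarithmic factor; dropping it keeps the reporting step linear in the number of occurrences.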
Example for Searching
P = CCT
P = CG
[Figure: suffix tree for T = ACCTTCCT$, used to search for P = CCT and P = CG]
Summary
Task
Find pattern P of length m in a text T of length n.
Suffix Tree
The suffix tree for T can be constructed in O(n) time and space.
With the suffix tree, the counting query can be solved in O(m) time
and the reporting query in O(m + Occ_P^T) time, where Occ_P^T is the
number of occurrences of P in T.
Applications
1. searching for exact patterns (already discussed)
2. finding Maximal Unique Matches
3. finding all maximal pairs
2. Maximal Unique Matches
We have as an input two sequences A and B.
Definition
an occurrence of the same substring in A and B is called a match
a match in A and B is left (right) maximal if the match cannot
be extended to the left (right), i.e. the characters to the
immediate left (right) differ
a Maximal Unique Match (MUM) is a substring that occurs
exactly once in both A and B and is left and right maximal
Example: MUMs for A=ATGAC and B=AGAGGAC
GAC is a Maximal Unique Match as it occurs only once in A and
B and cannot be extended
GA is not a Maximal Unique Match as it occurs twice in B
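The definition can be turned directly into a brute-force check. This is a sketch for small inputs only and not the suffix-tree method of the lecture. The function names are illustrative, and a match that touches the start or end of a sequence is assumed to be non-extendable on that side.

```python
def occurrences(pattern, text):
    """0-based start positions of all (possibly overlapping) occurrences."""
    starts, pos = [], text.find(pattern)
    while pos != -1:
        starts.append(pos)
        pos = text.find(pattern, pos + 1)
    return starts


def mums_brute_force(a, b):
    """Return sorted (substring, start in A, start in B), positions 1-based."""
    result = []
    candidates = {a[i:j] for i in range(len(a)) for j in range(i + 1, len(a) + 1)}
    for s in candidates:
        in_a, in_b = occurrences(s, a), occurrences(s, b)
        if len(in_a) != 1 or len(in_b) != 1:
            continue  # not unique in both sequences
        i, j = in_a[0], in_b[0]
        # left maximal: a sequence boundary or differing left characters
        left_max = i == 0 or j == 0 or a[i - 1] != b[j - 1]
        ia, jb = i + len(s), j + len(s)
        # right maximal: a sequence boundary or differing right characters
        right_max = ia == len(a) or jb == len(b) or a[ia] != b[jb]
        if left_max and right_max:
            result.append((s, i + 1, j + 1))
    return sorted(result)


print(mums_brute_force("ATGAC", "AGAGGAC"))  # [('GAC', 3, 5)]
```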
2. Maximal Unique Matches (cont.)
being unique means no ambiguity
Why do we need MUMs?
⇒ for global alignments of large sequences
a significantly long MUM is almost certain to be part of a global
alignment of the sequences A and B
to get the full alignment we only need to align the sequences in
the gaps between the MUMs
How to find all MUMs efficiently?
⇒ with a generalized suffix tree for the string A#B$
2. Maximal Unique Matches (cont.)
example for A = CGAA and B = CGA: generalized suffix tree for CGAA#CGA$
leaf labels: the first entry identifies the string and the second one the
starting position
observation: on leaf edges, the part of the edge label after the # can be
deleted
blue arrows = suffix links
sl(w) = v <==> the path label of w is a character x followed by the path label of v
[Figure: generalized suffix tree for CGAA#CGA$ with internal nodes v and w and leaves labeled A,1 to A,5 and B,1 to B,4]
2. Maximal Unique Matches (cont.)
1. create the generalized suffix tree T for A#B$
2. mark each internal node v of T that has exactly two child nodes,
   one a leaf from A and the other a leaf from B
   (the substring has to occur exactly once in each sequence)
3. for each internal node v, unmark sl(v)
   (the substring of sl(v) can be extended to the left, so it is not left maximal)
4. report all marked nodes as Maximal Unique Matches
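The four steps can be sketched in Python as follows. Two simplifications are assumed: the generalized suffix tree is built by naive O(n^2) suffix insertion rather than in linear time, and suffix links are not stored but emulated by looking up the path label without its first character. The names Node, build and mums are illustrative.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (edge label, child node)
        self.leaf = leaf    # 0-based suffix start for a leaf, else None


def build(text):
    """Naive suffix tree construction; text must end with a unique character."""
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while rest[0] in node.children:
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k < len(label):  # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
        node.children[rest[0]] = (rest, Node(leaf=i))
    return root


def mums(a, b):
    """Return the sorted Maximal Unique Matches of a and b as strings."""
    assert not set("#$") & set(a + b), "# and $ are reserved separators"
    text = a + "#" + b + "$"
    root = build(text)                            # step 1

    internal, stack = {}, [("", root)]            # path label -> internal node
    while stack:
        path, node = stack.pop()
        if node.leaf is None:
            internal[path] = node
            for label, child in node.children.values():
                stack.append((path + label, child))

    def source(node):
        """'A' or 'B' for a leaf whose suffix starts in that string, else None."""
        if node.leaf is None:
            return None
        if node.leaf < len(a):
            return "A"
        if len(a) < node.leaf < len(text) - 1:
            return "B"
        return None                               # suffix starts at # or $

    marked = set()
    for path, node in internal.items():           # step 2
        kids = [child for _, child in node.children.values()]
        if path and len(kids) == 2 and {source(k) for k in kids} == {"A", "B"}:
            marked.add(path)
    for path in internal:                         # step 3: unmark sl(v)
        if path:
            marked.discard(path[1:])
    return sorted(marked)                         # step 4


print(mums("CGAA", "CGA"))       # ['CGA']
print(mums("ATGAC", "AGAGGAC"))  # ['GAC']
```

Storing path labels as dictionary keys costs quadratic space in the worst case; a linear-time construction would provide the suffix links directly.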
2. Maximal Unique Matches (cont.)
[Figure: generalized suffix tree for CGAA#CGA$ with internal nodes v and w]
1. create the generalized suffix tree for CGAA#CGA$
2. mark nodes v and w
3. unmark node v, as v = sl(w) (blue arrows)
4. report node w = CGA as a Maximal Unique Match
2. Maximal Unique Matches (cont.)
example for A = CGAA and B = CGA, CGAA#CGA$
CGA is a MUM as node w = CGA has exactly one leaf child from A and one
from B and it cannot be extended to the left (no suffix link points to w)
GA is not a MUM as GA can be extended to the left:
v = sl(w)
[Figure: generalized suffix tree for CGAA#CGA$ with nodes v = GA and w = CGA]
3. All maximal pairs
Definition
A maximal pair in a sequence A is a pair of occurrences of the
substring α in A such that the characters to the immediate left (right)
of the two occurrences differ (the pair is left and right maximal). A
maximal pair is represented by (i, j, |α|), where i and j are the starting
positions of the two occurrences of α and i < j.
Example:
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
A = A G A C C A G A C A  T  A  G  A  C  A
maximal pair AGAC: (1,6,4) and (1,12,4)
maximal pair AGACA: (6,12,5)
(6,12,4) is not a maximal pair: both occurrences of AGAC are followed by
A, so the pair is not right maximal and extends to (6,12,5)
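The definition can be checked by brute force. This is a sketch for short sequences, not the suffix-tree algorithm. It assumes that a sequence boundary counts as a differing character, and it uses the fact that for fixed i < j only one length can be right maximal, namely the longest common extension of the two positions. The function name is illustrative.

```python
def maximal_pairs_brute_force(seq):
    """Return all maximal pairs (i, j, length) with 1-based i < j."""
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue  # same left character: not left maximal
            k = 0         # extend to the right while the characters agree
            while j + k < n and seq[i + k] == seq[j + k]:
                k += 1
            if k > 0:     # right maximal by construction
                pairs.append((i + 1, j + 1, k))
    return pairs


pairs = maximal_pairs_brute_force("AGACCAGACATAGACA")
print((1, 6, 4) in pairs, (1, 12, 4) in pairs)   # True True
print((6, 12, 5) in pairs, (6, 12, 4) in pairs)  # True False
```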
3. All maximal pairs (cont.)
build suffix tree for sequence A
leaf annotation: in addition to the position i of the suffix, we
store the character A_{i-1} that occurs immediately before the suffix
ACCTTCCT$
123456789
[Figure: suffix tree for ACCTTCCT$; each leaf i is annotated with its left character: _1, A2, C3, C4, T5, T6, C7, C8, T9]
3. All maximal pairs (cont.)
observation: a substring α can only form a maximal pair if the
corresponding node α has at least two children (⇒ right
maximal) with different characters in their annotation (⇒ left
maximal)
How to find all maximal pairs of a node v?
1. how to create the lists
Linking: to create the list for character x at node v, we link the
lists for character x that exist for each of v's children
2. how to report the maximal pairs for a node
Reporting: for each character x and each child v' of v, the
cartesian product of the list for x at v' with the union of every
list for a character x' ≠ x at a child w ≠ v' is formed; each pair
in this list together with the string depth of v is a maximal pair
3. how to find all nodes to report
do a post-order traversal of the nodes in the suffix tree (children
first, then the node itself) to get all maximal pairs; deeper nodes,
i.e. longer substrings, are handled before shallower ones
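Linking, reporting and the post-order traversal can be sketched together in Python. The assumptions: the tree is built naively in O(n^2) time, the lists are plain Python lists (so linking is not a constant-time operation as it would be with linked lists), and the traversal is recursive, which limits it to short sequences. All names are illustrative.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (edge label, child node)
        self.leaf = leaf    # 0-based suffix start for a leaf, else None


def build(text):
    """Naive suffix tree construction; text must end with a unique character."""
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while rest[0] in node.children:
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k < len(label):  # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
        node.children[rest[0]] = (rest, Node(leaf=i))
    return root


def maximal_pairs(seq):
    """Return the sorted maximal pairs (i, j, length) of seq, 1-based i < j."""
    assert "$" not in seq, "$ is reserved as the terminator"
    text = seq + "$"
    pairs = []

    def visit(node, depth):
        """Post-order step: return {left character: [positions]} for the subtree."""
        if node.leaf is not None:
            left = text[node.leaf - 1] if node.leaf > 0 else None
            return {left: [node.leaf + 1]}
        merged = {}
        for label, child in node.children.values():
            lists = visit(child, depth + len(label))
            if depth > 0:  # the root stands for the empty string
                # reporting: combine with lists of other children, x != y
                for x, xs in lists.items():
                    for y, ys in merged.items():
                        if x != y:
                            pairs.extend((min(i, j), max(i, j), depth)
                                         for i in xs for j in ys)
            # linking: append the child's lists to the lists of this node
            for x, xs in lists.items():
                merged.setdefault(x, []).extend(xs)
        return merged

    visit(build(text), 0)
    return sorted(pairs)


print(maximal_pairs("ACCTTCCT"))
# [(2, 3, 1), (2, 6, 3), (2, 7, 1), (3, 6, 1), (4, 5, 1), (5, 8, 1), (6, 7, 1)]
```

Processing the children one after another and comparing each new child only with the lists linked so far reports every pair of leaves exactly once, at their lowest common ancestor.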
3. All maximal pairs (cont.)
[Figure: annotated suffix tree for ACCTTCCT$ with nodes v and w; node v = CCT carries the annotations T6 and A2]
for node v = CCT, we report the maximal pair CCT as (2,6,3)
we build the annotation for node v by combining the two leaf
annotations of the children of v
3. All maximal pairs (cont.)
[Figure: annotated suffix tree for ACCTTCCT$ with nodes v and w; node w = CT carries the annotation C: 7, 3]
for node w = CT, we report no maximal pair
we build the annotation for node w by combining the two leaf
annotations of the children of w
3. All maximal pairs (cont.)
[Figure: annotated suffix tree for ACCTTCCT$ with nodes v, w and z; node z carries the annotations T6, A2 and C: 7, 3]
repeat steps for all internal nodes
report the following maximal pairs:
1. CCT as (2,6,3)
2. C as (6,7,1), (3,6,1), (2,3,1), (2,7,1)
3. T as (5,8,1), (4,5,1)
3. All maximal pairs (cont.)
Runtime analysis
creation of the suffix tree, the post-order traversal, and all the list
linking take O(n) time
each operation of the cartesian product produces a unique
maximal pair
⇒ O(k) time, where k is the number of maximal pairs
in total the algorithm takes O(n + k) time
in many applications we are only interested in maximal pairs of a
certain length m
⇒ runtime is reduced to O(n + k_m), where k_m is the number of
maximal pairs with length m
Recommended Reading
Dan Gusfield:
Algorithms on Strings, Trees, and Sequences.
Cambridge University Press (1997)
A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O.
White, and S. L. Salzberg:
Alignment of Whole Genomes
Nucleic Acids Research, 27:2369-2376, 1999