A#B

Data Structures and Algorithms (II)
呂學一 (Hsueh-I Lu)
http://www.csie.ntu.edu.tw/~hil/
2010/5/28
Today

Applications of suffix trees
– Substring problem (warm-up)
– “Exact string matching” revisited
– Linearization of circular string (shifting heaven and earth)
– Longest common substring (seeking common ground amid differences)
Application One
Substring Problem (recap as a warm-up)
Substring Problem

Input: two strings P and S,
– where S is allowed to be preprocessed in O(|S|) time.
Output: an occurrence of P in S, if there is one.
Objective: answer each query in O(|P|) time.
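The query step can be sketched in Python as follows. This is only an illustration under a simplifying assumption: it builds an uncompressed suffix trie by inserting every suffix, which takes quadratic time and space, whereas the slide assumes a suffix tree built in O(|S|) time. The names build_suffix_trie and find_occurrence are hypothetical. The O(|P|) walk from the root is the part that carries over.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts.

    Each node stores in node["pos"] the 1-based start of one suffix that
    passes through it. Quadratic construction, for illustration only.
    """
    root = {"pos": None, "next": {}}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["next"].setdefault(ch, {"pos": start + 1, "next": {}})
    return root


def find_occurrence(trie, p):
    """Return a 1-based position where p occurs, or None; O(|P|) steps."""
    node = trie
    for ch in p:
        node = node["next"].get(ch)
        if node is None:
            return None
    return node["pos"]


trie = build_suffix_trie("bbabbaab")
print(find_occurrence(trie, "abba"),
      find_occurrence(trie, "baa"),
      find_occurrence(trie, "bb"),
      find_occurrence(trie, "aaa"))   # 3 5 1 None
```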
S = bbabbaab (positions 1–8)
[Figure: suffix tree of S. Each edge is labeled by an index interval of S, such as [1,1], [2,3], [3,3], [4,–] and [7,–], where “–” stands for the end of S.]
S = bbabbaab (positions 1–8)
[Figure: the same suffix tree of S.]
Q: Where are abba, baa, bb?
S = bbabbaab (positions 1–8)
[Figure: the same suffix tree of S, with starting positions marked.]
Q: Where are abba, baa, bb?
– abba occurs at position 3, baa at position 5, and bb at positions 1 and 4.
Application Two
Exact String Matching
Exact String Matching

Input: two strings P and S,
– where S is allowed to be preprocessed in O(|S|) time.
Output: all occurrences of P in S.
Challenge: solve this in O(|P| + k) time,
– where k is the number of occurrences of P in S.
Idea
Each internal node keeps the labels of
all its descendant leaves.
S = bbabbaab (positions 1–8)
[Figure: suffix tree of S in which the internal nodes store the leaf lists 5,2,4,1 and 6,3 and 4,1 and 5,2.]
Q: Something’s missing?
Q: How do we fix this problem?
S = bbabbaab$ (positions 1–9)
[Figure: suffix tree of S with the terminator $ appended. The internal nodes now store the leaf lists 5,2,4,1,8 and 6,3,7 and 4,1 and 5,2.]
Q: Obtainable in O(|S|) time?
Perhaps not…

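S = aaaaa$
[Figure: suffix tree of S. Its internal nodes form a path and store the leaf lists 1,2,3,4,5 then 1,2,3,4 then 1,2,3 then 1,2.]
The stored lists can have total length Θ(|S|^2), so they cannot always be written down in O(|S|) time.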
An observation

Consider the sequence L of leaves from left to right. The descendant leaves of each internal node have to be consecutive in L.
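A small Python sketch of this observation, assuming a naive quadratic suffix trie in place of a linear-time suffix tree. Each node stores only an interval of L, so reporting k occurrences after the O(|P|) walk costs O(k). The function names and the child order (a before b before $, chosen to reproduce the L used in these slides) are assumptions made for illustration.

```python
def build_with_intervals(s):
    """Suffix trie of s, where s ends with a unique terminator '$'.

    Returns (root, L): L lists the leaf labels from left to right, and
    every node stores the half-open interval [lo, hi) of L that its
    descendant leaves occupy.
    """
    root = {"next": {}}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["next"].setdefault(ch, {"next": {}})
        node["leaf"] = start + 1            # 1-based suffix label
    L = []

    def dfs(node):
        node["lo"] = len(L)
        if "leaf" in node:
            L.append(node["leaf"])
        # Visit children left to right, with '$' ordered last.
        for ch in sorted(node["next"], key=lambda c: (c == "$", c)):
            dfs(node["next"][ch])
        node["hi"] = len(L)

    dfs(root)
    return root, L


def all_occurrences(root, L, p):
    """Walk down p in O(|P|) steps, then read off one slice of L."""
    node = root
    for ch in p:
        node = node["next"].get(ch)
        if node is None:
            return []
    return L[node["lo"]:node["hi"]]


root, L = build_with_intervals("bbabbaab$")
print(L)                              # [6, 3, 7, 5, 2, 4, 1, 8, 9]
print(all_occurrences(root, L, "b"))  # [5, 2, 4, 1, 8]
```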
S = bbabbaab$ (positions 1–9), L = 637524189 (positions 1–9)
[Figure: suffix tree of S in which each internal node stores an interval of positions in L instead of an explicit list: 4,8 for the leaves 5,2,4,1,8; 1,3 for the leaves 6,3,7; 4,5 for the leaves 5,2; 6,7 for the leaves 4,1.]
Application Three
Circular String Linearization (shifting heaven and earth)
Notation

Let 挪(S, i) denote the string S[i…|S|] S[1…i – 1], i.e., the rotation of S that starts at index i.
The problem
Input
– a string S.
Output
– an index i such that 挪(S, i) is alphabetically (lexicographically) largest.
Example: S = bbabbaab, drawn as a circular string.
          12345678
挪(S,1) = b b a b b a a b
挪(S,2) = b a b b a a b b
挪(S,3) = a b b a a b b b
挪(S,4) = b b a a b b b a
挪(S,5) = b a a b b b a b
挪(S,6) = a a b b b a b b
挪(S,7) = a b b b a b b a
挪(S,8) = b b b a b b a a
Naïve algorithm
let j = 1;
for i = 2 to |S| do {
  if (挪(S,i) > 挪(S,j)) {
    let j = i;
  }
}
output j;
Q: Time complexity?
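The loop above translates directly into Python. The helper names rotation and naive_best_rotation are hypothetical. Each comparison of two rotations may inspect O(|S|) characters, which is where the quadratic bound comes from.

```python
def rotation(s, i):
    """Return the rotation of s starting at 1-based index i."""
    return s[i - 1:] + s[:i - 1]


def naive_best_rotation(s):
    """Return the smallest 1-based index of a lexicographically largest rotation."""
    j = 1
    for i in range(2, len(s) + 1):
        if rotation(s, i) > rotation(s, j):   # O(|S|) work per comparison
            j = i
    return j


print(naive_best_rotation("bbabbaab"))   # 8
```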
Q: Can we beat O(|S|^2)?
Linear-Time Algorithm via Suffix Tree
First attempt – going right
[Figure: the suffixes S[i…8] of S = bbabbaab for i = 1, …, 8, aligned under positions 1–8.]
Q: How to fix the problem?
Second Attempt
Suffix tree for SS
Key observation


Each length-|S| substring of SS is a 挪(S, j) for some index j with 1 ≤ j ≤ |S|.
Each 挪(S, j) with 1 ≤ j ≤ |S| is a length-|S| substring of SS.
SS = bbabbaabbbabbaab (positions 1–16)
[Figure: suffix tree of SS, with edges labeled by index intervals of SS.]
SS = bbabbaabbbabbaab (positions 1–16)
[Figure: the same suffix tree of SS.]
Q: How to use this suffix tree?
Equivalently, …

Output the index i such that SS[i…|SS|] corresponds to the rightmost leaf of the suffix tree for SS.
– Building the tree and walking down to its rightmost leaf takes O(|S|) time.
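The rightmost leaf corresponds to the lexicographically largest suffix of SS. The sketch below checks that idea in Python without building a suffix tree: it simply takes the largest suffix of SS by direct comparison. That is quadratic, not linear, so it illustrates the reduction only and is not the O(|S|) algorithm itself. The function name is hypothetical.

```python
def best_rotation_via_doubling(s):
    """Return the 1-based start of the largest suffix of SS.

    A suffix starting after position |S| is a proper prefix of the suffix
    starting |S| positions earlier, so the winner always starts within
    the first copy of S.
    """
    ss = s + s
    # max over suffixes stands in for "rightmost leaf of the suffix tree".
    return max(range(1, len(ss) + 1), key=lambda i: ss[i - 1:])


i = best_rotation_via_doubling("bbabbaab")
print(i, ("bbabbaab" * 2)[i - 1:i - 1 + 8])   # 8 bbbabbaa
```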
Application Four
Longest common substring (seeking common ground amid differences)
The problem


Input: two strings A and B.
Output: a longest string C that occurs in both A and B.
Example: A = bbbabbaab, B = baabbabbab.
– Common substrings include C = bb, C = baab and C = abba.
– The longest one is C = bbabba.
Naïve algorithm
build suffix tree for B;
for L = |A| downto 1 do {
  for i = 1 to |A|-L+1 do {
    if A[i…i+L-1] occurs in B {
      output A[i…i+L-1] and halt;
    }
  }
}
output “no common substring”;
Q: Time complexity?
O(|A|^3+|B|)
The for-loop takes time
\[
\sum_{L=1}^{|A|} (|A|-L+1)\cdot O(L)
= O\Bigl(|A|\sum_{L=1}^{|A|} L \;-\; \sum_{L=1}^{|A|} L^2 \;+\; \sum_{L=1}^{|A|} L\Bigr)
= O(|A|^3).
\]
Can we do better than this?
A faster algorithm
build suffix tree for B;
for i = 1 to |A| do {
  find the largest integer L(i) such that A[i…i+L(i)-1] occurs in B by binary search;
}
output A[i…i+L(i)-1] for the i with the largest L(i);
Q: Time complexity?
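A Python sketch of this binary-search variant. Binary search is valid because if A[i…i+L-1] occurs in B, then so does every shorter prefix of it. As an assumption for brevity, Python's substring test `in` stands in for the suffix-tree query, so the running time is not the one analyzed on the slides. The function name is hypothetical.

```python
def longest_common_substring_bsearch(a, b):
    """Return a longest common substring of a and b."""
    best_i, best_len = 0, 0
    for i in range(len(a)):
        lo, hi = 0, len(a) - i          # invariant: a[i:i+lo] occurs in b
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if a[i:i + mid] in b:       # stand-in for a suffix-tree lookup
                lo = mid
            else:
                hi = mid - 1
        if lo > best_len:
            best_i, best_len = i, lo
    return a[best_i:best_i + best_len]


print(longest_common_substring_bsearch("bbbabbaab", "baabbabbab"))  # bbabba
```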
O(|A|^2 log|A|+|B|)
The for-loop takes O(|A|^2 log|A|) time.
– Each binary search takes time O(|A| log |A|): it makes O(log |A|) probes, and each probe is an O(|A|)-time lookup in the suffix tree.
– There are overall O(|A|) binary searches.
Can we do better than this?
Donald E. Knuth conjectured in 1970 that it is impossible to solve this longest common substring problem in O(|A|+|B|) time.
Longest Common Substring in O(|A|+|B|) time via suffix tree
Idea
Construct a suffix tree T for A#B$, where # and $ are two distinct characters not in A and B.
There are exactly |A|+|B|+2 leaves in T; each leaf corresponds to a suffix of A#B$.
– A-leaf: a leaf with label in {1, 2, …, |A|}; it corresponds to an A-suffix.
– B-leaf: a leaf with label in {|A|+2, …, |A|+|B|+1}; it corresponds to a B-suffix.
[Figure: the string A#B$, where an A-suffix starts inside A and a B-suffix starts inside B.]
Observation

Let v be an arbitrary position of T (i.e., v is not necessarily a node of T).
– v has a descendant A-leaf if and only if v corresponds to a prefix of an A-suffix of A#B$.
– v has a descendant B-leaf if and only if v corresponds to a prefix of a B-suffix of A#B$.
[Figure: a position v on a path from the root of T.]
Lemma
Let v be a position of T.
– v has both a descendant A-leaf and a descendant B-leaf if and only if v corresponds to a common substring of A and B.
[Figure: a position v below the root of T, and the string A#B$ with an A-suffix and a B-suffix marked.]
Question
Do we really need ‘#’ to separate A and B in the concatenated string A#B$?
[Figure: the string AB$ without the separator, with an A-suffix and a B-suffix marked.]
The algorithm


Construct the suffix tree T of A#B$.
Output the string corresponding to a deepest internal node v (deepest by the length of its string) such that the subtree of T rooted at v contains both an A-leaf and a B-leaf.
Q: Why not check leaves?
Q: Why not check positions of T that are not internal nodes of T?
It suffices to check internal nodes…
If the position v has both kinds of descendant leaves, then so does the closest internal node below v.
[Figure: a position v on an edge below the root.]
Time = O(|A|+|B|)




O(|A|+|B|) time for constructing T.
O(|A|+|B|) time for marking the colors of each node, including each leaf and each internal node.
O(|A|+|B|) time for computing the depths of all nodes.
O(|A|+|B|) time to find a deepest internal node with both colors.
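The four steps can be sketched in Python under one loud assumption: the code builds an uncompressed suffix trie of A#B$ by inserting every suffix, which is quadratic, so it illustrates the coloring idea and not the O(|A|+|B|) bound. Colors are marked during insertion instead of in a separate bottom-up pass, and the function name is hypothetical.

```python
def lcs_via_colored_trie(a, b):
    """Longest common substring of a and b via a colored trie of a#b$."""
    s = a + "#" + b + "$"       # assumes '#' and '$' occur in neither string
    root = {"next": {}, "colors": set()}
    for start in range(len(s)):
        if start < len(a):
            color = "A"                      # A-suffix: label in 1..|A|
        elif len(a) < start < len(s) - 1:
            color = "B"                      # B-suffix: label in |A|+2..|A|+|B|+1
        else:
            color = None                     # the suffixes "#B$" and "$"
        node = root
        for ch in s[start:]:
            node = node["next"].setdefault(ch, {"next": {}, "colors": set()})
            if color:
                node["colors"].add(color)    # node lies above a leaf of this color
    # Search for a deepest node having both colors. A node with both
    # colors has only such ancestors, so other branches can be pruned.
    best = ""
    stack = [(root, "")]
    while stack:
        node, text = stack.pop()
        if len(text) > len(best):
            best = text
        for ch, child in node["next"].items():
            if child["colors"] == {"A", "B"}:
                stack.append((child, text + ch))
    return best


print(lcs_via_colored_trie("bbbabbaab", "baabbabbab"))   # bbabba
```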
Space Complexity is also O(|A|+|B|).
Q: Can we further improve the time and space complexity?
“No” for the time complexity.
“Yes” for the space complexity.
Reducing the space to O(|A|)
Longest Common Substring



Input: two strings A and B.
Output: a longest string C that occurs in both A and B.
Objective:
– O(|A|+|B|) time
– O(|A|) space
Idea:
– Construct the suffix tree of A only.
The algorithm


Construct the suffix tree T of A, keeping all the suffix links.
For i = 1 to |B| do
– Find the largest integer 深(i) (the depth of the match) such that B[i…i+深(i)–1] occurs in A.
Output B[i…i+深(i)–1] where i is the index with maximum 深(i).
Naïvely, …


Finding 深(i) for each i takes O(深(i) + 1) time by traversing T from the root.
But all these |B| iterations would take O(|A||B|) time in total.
[Figure: a path of T from the root down to depth 深(i).]
Observation

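深(i+1) ≥ 深(i) – 1, because if B[i…i+深(i)–1] occurs in A, then so does B[i+1…i+深(i)–1].
[Figure: a suffix link leads from the position at depth 深(i) to a position at depth 深(i) – 1.]
Q: What if this suffix link does not exist?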
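The effect of this inequality can be illustrated in Python. As an assumption, the sketch keeps the current match length and restarts each iteration from the previous length minus one instead of following a suffix link, and Python's `in` test stands in for walking the suffix tree of A. It therefore shows only why the right end of the match never moves left, and does not achieve the linear time bound. The function name is hypothetical.

```python
def match_depths(a, b):
    """Return the list of depths: entry i-1 is the length of the longest
    prefix of b[i-1:] that occurs in a (1-based i as on the slides)."""
    depths = []
    k = 0
    for i in range(len(b)):
        k = max(k - 1, 0)       # by the observation, b[i:i+k] still occurs in a
        while i + k < len(b) and b[i:i + k + 1] in a:
            k += 1              # the right end i + k only moves to the right
        depths.append(k)
    return depths


d = match_depths("bbabbaab", "babaabbaaba")
print(d)                        # [3, 2, 4, 3, 6, 5, 4, 3, 2, 2, 1]
i = d.index(max(d))
print("babaabbaaba"[i:i + d[i]])   # abbaab
```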
A = bbabbaab (positions 1–8), B = babaabbaaba (positions 1–11)
[Figure: the suffix tree of A, traversed while scanning B.]
Record: 深(1) = 3
Record: 深(3) = 4
Record: 深(5) = 6
Time and space


Clearly, the space complexity is O(|A|).
The time complexity is still O(|A|+|B|).
– We first show that the time is O(|A|+|B|) without considering that of suffix link traversal.
– We then show that the time for suffix link traversal is also O(|A|+|B|).
Without considering suffix link traversal
The for-loop has exactly |B| iterations.
Suppose the i-th iteration moves the ↑ arrow to the right by d(i) units.
d(1)+d(2)+…+d(|B|) = |B|,
– because the ↑ arrow never goes left.
The i-th iteration takes time O(d(i)+1).
So, the overall time complexity is O(|B|).
The time complexity for suffix link traversal
Let φ(i) denote the distance between the position of ○ at the end of the i-th iteration and the closest internal node above ○.
Let t(i) be the number of internal nodes touched in the downward suffix link traversal of the i-th iteration.
φ(i) ≤ φ(i – 1) + d(i) – t(i), or equivalently, t(i) ≤ φ(i – 1) + d(i) – φ(i).
Therefore, t(1)+t(2)+…+t(|B|) is at most d(1)+d(2)+…+d(|B|) + φ(0) – φ(|B|), which is clearly O(|B|+|A|).
– φ(0) ≤ |A|, and
– d(1)+d(2)+…+d(|B|) = |B|.