le05t6

String Matching
The problem:
• Input: a text T (very long string) and a
pattern P (short string).
• Output: the index in T where a copy of P
begins.
1
Some Notations and Terminologies
• |P| and |T|: the lengths of P and T.
• P[i]: the i-th letter of P.
• Prefix of P: a substring of P starting with
P[1].
• P[1..i]: the prefix containing the first i
letters of P.
• suffix of P[1..i]: a substring of P[1..i]
ending at P[i], e.g. P[3..i], P[5..i] (i>4).
2
Straightforward method
• Basic idea:
1. i=1;
2. Start with T[i] and match P with
T[i],T[i+1], ... T[i+|P|-1]
3. whenever a mismatch is found,
i=i+1 and goto 2 until i+|P|-1>|T|.
• Example 1: T=ABABABCCA and P=ABABC
P: ABABC
ABABC
ABABC
|||||
|
|||||
T: ABABABCCA ABABABCCA ABABABCCA
3
Analysis
• Step 2 takes O(|P|) comparisons in the worst
case.
• Step 2 could be repeated O(|T|) times.
• Total running time is O(|T||P|).
4
Knuth-Morris-Pratt Method
(linear time algorithm)
Story:
• The algorithm was independently invented by
Knuth and Pratt and by Morris.
• The work was published jointly
Reference:
Donald E. Knuth, James H. Morris and Vaughan R.
Pratt, Fast pattern matching in strings, SIAM
Journal on Computing, 6(2): 323-350, 1977.
5
Knuth-Morris-Pratt Method
(linear time algorithm)
A better idea
• In step 3, when there is a mismatch we move
forward one position (i=i+1) in the naïve algorithm
• We may move more than one position at a time
when a mismatch occurs if we carefully study the
pattern P.
For example:
P: ABABC
ABA
T: ABABABCCA ABABABCCA
6
Questions:
• How to decide how many positions we should
jump when a mismatch occurs?
• How much we can benefit? O(|T|+|P|).
Example 2:
P: abcabcabcaa
|
T: abcabcabcabcaa
|
abcabcab
back here
7
• We can move forward more than one position. Reason?
• Study of Pattern P
P[1..7] abcabca
P[1..10] abcabcabca
P[1..7]
abcabca
P[1..4]
abca
• P[1..7] is the longest prefix that is also a suffix of
P[1..10].
• P[1..4] is a prefix that is a suffix of P[1..10], but
not the longest.
• Hint: When mismatch occurs at P[i+1], we want to
find the longest prefix of P[1..i] which is also a
suffix of P[1..i].
• Suffix of P is a substring of P ending at the last position of P.
8
Failure function(moderate)
• f(i) is the length of the longest prefix that is a suffix of P[1..i].
• That is, P[1,f(i)] is the longest prefix that is a suffix of P[1..i].
• f(i) is the largest r with (r<i) such that
P[1] P[2] ...P[r] = P[i-r+1]P[i-r+2], ..., P[i].
Prefix of length r
Suffix of P[1]P[2]…P[i] of length r
• Example 3: P=ababaccc and i=5.
P[1] P[2] P[3]
a
a b a
b
b
a
a
P[3] P[4] P[5]
(r=3) f(5)=3.
9
A native algorithm
to compute the failure function
We can test if f(i)=k:as follows:
P[1]
P[2]
… P[k]
|
|
|
P[i-k+1] P[I-k+2]… P[i]
Thus, to get f(i) we can
Test if f(i)=i-1? Test if f(i)=i-2? Test if f(i)=i-3?
…
Test if f(i)=0?
10
• Example 4:
P=abcabbabcabbaa
It is easy to verify that
f(1)=0, f(2)=0, f(3)=0, f(4)=1, f(5)=2,
f(6)=0, f(7)=1, f(8)=2, f(9)=3, f(10)=4,
f(11)=5, f(12)=6, f(13)=7, f(14)=1.
11
The Scan Algorithm
(Do not have to remember, but should understand)
•
•
1.
2.
3.
i: indicates that T[i] is the next character in T to be
compared with the head of the pattern.
q: indicates that P[q+1] is the next character in P to be
compared with T[i].
i=1 and q=0;
Compare T[i] with P[q+1]
case 1: T[i]==P[q+1]
i=i+1;q=q+1;
if q+1==|P| then print "P occurs at i+1-|P|"
case 2: T[i]≠P[q+1] and q≠0
q=f(q);
case 3: T[i]≠P[q+1] and q==0
i=i+1;
Repeat step2 until i==|T|.
12
• Example 5: P=abcabbabcabbaa
i
1 2 3 4 5 6 7 8 9 10 11 12 13 14
f(i) 0 0 0 1 2 0 1 2 3 4 5 6 7 1
T=abcabcabbabbabcabbabcabbaa
abcabb
|
|
|
abcabbabc
|
abc
|
a(i=i+1)
abcabbabcabbaa(q+1=|p|)
13
Running time complexity(hard)
• The running time of the scan algorithm is O(|T|).
• Proof:
– There are two pointers i and p.
– i: the next character in T to be compared.
– p: the position of P[1]. (See figure below)
p
i
P:abcabcabcaa
|
T:abcabcabcabcaa
|
P:
abcabcaa
p
i
14
Facts:
1. When a match is found, move i forward.(case 1)
2. When a mismatch is found and pi, move p
forward. (case 2)
3. When p=i and a mismatch occurs, move both i
and p forward. (Case 3)
From 1, 2 and 3, it is easy to see that the total
number of comparisons is at most 2|T|.
(Reason: i changes from 1, 2, … |T|. The pointer p
points to T[1], T[2], …T[|T|].)
Thus, the time complexity is O(|T|).
(See my movie for this proof.)
15
Another version of scan algorithm (code)
n=|T|
m=|P|
q=0
for i=1 to n
{
while q>0 and P[q+1]≠T[i] do
{
q=f(q)
}
if P[q+1]==T[i] then
q=q+1
if q==m then
{
print "pattern occurs at i-m+1"
q=f(q)
}
}
16
Failure Function Construction
Basic idea:
Case 1: f(1) is always 0.
Case 2: if P[q]==P[f(q-1)+1] then f(q)=f(q-1)+1.
P[1]P[2]…P[f(q-1)] P[f(q-1)+1]
|
|
|
|
|
P[ ]P[ ]…P[q-1] P[q]
17
Case 3: if P[q]P[f(q-1)+1] and f(q-1)0 then
consider P[q] ?= P[f(f(q-1))+1]
(Do it recursively)
abcabcabcd q=10. P[10]=d
abcabca f(9)=6. P[f(9)+1]=aP[10]
abca f(f(9))=f(6)=3. P[f(f(9))+1]=P[4]=aP[10]
a f(f(f(9)))=f(3)=0.
Case 4: if P[q]  P[f(q-1)+1] and f(q-1)==0
then f[q]=0. (f(10)=0 for this example.)
18
The algorithm (code) to compute failure function
1.
2.
3.
4.
5.
6.
7.
8.
m=|P|;
f(1)=0;
(case 1)
k=0;
for q=2 to |P| do
{
k=f(q-1);
if(k>0 and P[k+1]!=P[q])
(case 3)
{
k=f(k);
goto 6; }
if(k>0 and P[k+1]==P[q])
(case 2)
{
f[q]=k+1;
}
if(k==0)
{
if(P[k+1]==P[q]) f[q]=1;
(case 2)
else f[q]=0;
(case 4)
}
}
19
Another version
1.
2.
3.
4.
5.
6.
7.
8.
9.
m=|P|;
f(1)=0;
k=0;
for q=2 to |P| do
{
k=f(q-1);
while(k>0 and P[k+1]!=P[q]) do
{
k=f(k);
}
if(P[k+1]==P[q]) then k=k+1;
f[q]=k;
}
20
• Example 3:
1 2 3 4 5 6 7 8 9 10 11 12
P=a b c a b c a b c a a c
f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2;
f(6)=3; f(7)=4; f(8)=5; f(9)=6; f(10)=7;
f(11)=1.
(The computation of f(11) is very interesting.)
Question: Do we need to compute f(12)?
Yes, if you want to find ALL occurrences of P.
No, if you just want to find the first occurrence of P.
21
Computation of f(11):
P=a b c a b c a b c a a c
a b c a b c a b
a b c a b
a b
a
f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=4;
f(8)=5; f(9)=6; f(10)=7; f(11)=1.
22
Example:
P=abaabc
i
123456
T=abcabcabc
f(i) 0 0 0 1 2 3
abcabc
abcabc
When a match is found at the end of P, call f(|p|).
Running time complexity (Fun Part, not required)
The running time of failure function construction algorithm is
O(|P|). (The proof is similar to that for scan algorithm.)
Pattern abababb is an good example.
Total running time complexity
The total complexity for failure function construction and
scan algorithm is O(|P|+|T|).
23
Summary of String Matching
1.Definition of f(i).
2. Scan algorithm for one pattern
3. Proof the O(|T|) time for the scan
algorithm (for one pattern only)
4. Understand the algorithm for construction
of f(i)(proof is not required. No need to
remember the code)
5. Constructing the failure function for
Multiple pattern
24
Linear Time Algorithm for Multiple
patterns (Fun Part)
• Input: a string T (very long) and a set of
patterns P1,P2,...,Pk.
• Output: all the occurrences of Pi's in T.
Let us consider the set of patterns { he, she,
his, hers }. We can construct an automata
as follows:
25
Depth 1
Depth 3
Depth 2
Depth 4
e,i,r
h
0
e
r
1
s
2
8
s
i
6
s
3
s:
9
h
4
7
e
5
1, 2, 3, 4, 5, 6, 7, 8, 9
f(s) 0, 0, 0, 1, 2, 0, 3, 0, 3
26
• g(s,a)=s' means that at state s if the next
input letter is a then the next state is s'.
• The states of the automata is organized
column by column.
• Each state corresponds to a prefix of some
pattern Pi.
• F: the set of final states (dark circled)
corresponding to the ends of patterns.
• For the starting state 0, add g(0,a)=0, if
g(0,a) is originally fail.
27
• Exercise: write down the g() function for the
above automata.
• Failure function
f(s) = the state for the longest prefix of some pattern
Pi that is a suffix of the string in the path from
state 0 (starting state) to s.
• Example:
he is the longest prefix in hers that is a suffix of the
string she.
shers
she
hers
28
The scan algorithm
Text: T[1]T[2]...T[n]
s=0;
for i:=1 to n do
{
while g(s,T[i])=fail do s=f(s);
s:=g(s,T[i]);
if s is in F then
{s=f(s);return "yes";}
}
return "no“
why?
29
Theorem: The scan algorithm takes O(|T|)
time. (Fun part, not required.)
Proof: Again, the two pointer argument.
• When a match is found, move the first
pointer forward. (s:=g(s,T[i]);)
• When a mismatch is found (g(s,T[i])==fail),
move the second pointer forward. (s=f(s);)
• When a final state is meet, declare the
finding of a pattern. (if s is in F then return
"yes";)
30
• Example:
i=1 2 3 4 5 6 7 8
s h e r s h i i
3 4 5
2 8 9
3 4
1 6
0
0
s
123456789
f(s) 0 0 0 1 2 0 3 0 3
31
Failure function construction
• Basic idea: similar to that for one pattern.
for each state s of depth 1 do
f(s)=0
for each depth d>=1 do
for each state sd of depth d and character a
such that g(sd,a)=s' do
{
s=f(sd)
while g(s,a)=fail do
{
s=f(s)
}
f(s')=g(s,a)
a
}
Sd
S’
g(sd,a)=s'
32
• g(0,c)fail for any possible character c.
The failure function for {he, she, his, hers} is
s
123456789
f(s) 0 0 0 1 2 0 3 0 3
Time complexity: O(|P1|+|P2|+...+|Pk|).
Proof: Two pointer argument.
33

Download Report

le05t6

Paperzz.com

Your Paperzz