
Average Case Analysis of an
Exact String Matching Algorithm
Advisor: Professor R. C. T. Lee
Speaker: S. C. Chen
1
Problem Definition
We are given a text $T = t_1 t_2 \ldots t_n$ with length n and
a pattern $P = p_1 p_2 \ldots p_m$ with length m, and we
are asked to find all occurrences of P in T.
Example:
T = AGCCTAAGCTCCTAAGTC
P = CCTA
There are two occurrences of P in T, starting at positions 3 and 11,
as marked below:
AG[CCTA]AGCT[CCTA]AGTC
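As a baseline for the problem definition, a minimal brute-force sketch is given here. It is not the algorithm analyzed in these slides; the function name `find_all` is hypothetical, and positions are reported 1-based to match the slide notation.

```python
def find_all(text, pattern):
    """Return the 1-based start positions of every occurrence of pattern in text."""
    m = len(pattern)
    # Try every alignment of the pattern against the text and compare directly.
    return [s + 1 for s in range(len(text) - m + 1) if text[s:s + m] == pattern]


print(find_all("AGCCTAAGCTCCTAAGTC", "CCTA"))  # [3, 11]
```

This costs O(nm) time in the worst case, which is the behavior the rules on the following slides try to improve on.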
2
Exact string matching algorithms rely on many rules,
for example the Suffix to Prefix Rule,
the Substring Matching Rule, ….
3
This algorithm uses the idea behind the Substring Matching Rule.
4
The Substring Matching Rule
For any substring S in T, find the nearest occurrence of S in P
that lies to the left of it. If such an S exists in P, move P
so that the two S's match; otherwise, we may define a
new partial window.
[Figure: two panels, each showing a window of T containing S above the pattern P containing S; one panel is labeled "Exactly matched".]
5
[Figure: two panels showing a window T[i-m+1…i] of T with its suffix S = T[i-r+1…i], and the pattern P aligned beneath it.]
6
In this algorithm, we first check whether S = T[i-r+1…i] is a
substring of P.
If S does not occur in P, we shift P to the right by
m-r steps.
[Figure: two panels showing the first window T[1…m] with its suffix S = T[m-r+1…m], and the pattern P beneath it.]
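The single check-and-shift step described above can be sketched as follows. The function name `window_step` is hypothetical, i is the 1-based end of the current window as in the slides, and Python's `in` operator stands in for whatever substring test is actually used.

```python
def window_step(T, P, i, r):
    """Inspect the window of T that ends at 1-based position i.

    Returns (s_occurs_in_p, next_i), where S = T[i-r+1..i] in slide notation.
    """
    m = len(P)
    S = T[i - r:i]          # 0-based slice equivalent to T[i-r+1..i]
    return (S in P), i + (m - r)


T, P = "AGCCTAAGCTCCTAAGTC", "CCTA"
print(window_step(T, P, 8, 2))  # S = "AG" does not occur in P
print(window_step(T, P, 4, 2))  # S = "CC" occurs in P
```

When the first component is False, no occurrence of P can cover T[i-r+1…i], so moving on by m-r positions loses nothing.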
7
• If S occurs in P, according to the Substring
Matching Rule, we should slide P so that the
two substrings S match as shown below.
[Figure: the window of T containing S, with P slid so that its S aligns with the S in T.]
8
• But our algorithm is not that smart. Instead of
sliding P so that the two substrings S match, we
simply examine the entire region from
i-m+1 to i+m-r, which has length 2m-r, to see whether
P occurs in it, as shown below.
[Figure: T with positions i-m+1, i-r+1, i and i+m-r marked; S = T[i-r+1…i]; the examined region has length 2m-r; P is shown beneath.]
9
• Note that our not-so-smart algorithm covers
the case of sliding P to match the two
substrings S.
[Figure: the region T[i-m+1…i+m-r] of length 2m-r, with P slid so that its S aligns with S = T[i-r+1…i].]
10
Algorithm
Algorithm fast-on-average;
  i = m;
  while i ≤ n do begin
    if T[i-r+1…i] is a substring of P then
      compute all occurrences of P whose starting
      positions are in T[i-m+1…i-r+1], applying the KMP algorithm
    else { P does not start in T[i-m+1…i-r+1] }
    i = i+m-r
  end
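The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `choose_r`, `kmp_find` and `fast_on_average` are hypothetical, indices are 0-based, a set of the length-r substrings of P stands in for the suffix tree, r is clamped to the range 1…m so the window always advances, and a start position shared by two consecutive windows is reported only once.

```python
def choose_r(m, alpha):
    """Smallest integer r with alpha**r >= m*m, i.e. r = ceil(2 * log_alpha(m))."""
    r = 0
    while alpha ** r < m * m:
        r += 1
    return r


def kmp_find(text, pat):
    """Return the 0-based starts of all occurrences of pat in text (KMP)."""
    m = len(pat)
    fail = [0] * m                      # fail[j]: longest proper border of pat[:j+1]
    k = 0
    for j in range(1, m):
        while k and pat[j] != pat[k]:
            k = fail[k - 1]
        if pat[j] == pat[k]:
            k += 1
        fail[j] = k
    hits, k = [], 0
    for idx, ch in enumerate(text):
        while k and ch != pat[k]:
            k = fail[k - 1]
        if ch == pat[k]:
            k += 1
        if k == m:                      # full match ending at idx
            hits.append(idx - m + 1)
            k = fail[k - 1]
    return hits


def fast_on_average(T, P, alpha):
    """Return the 0-based starts of all occurrences of P in T."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    r = min(m, max(1, choose_r(m, alpha)))          # clamp (assumption, see lead-in)
    grams = {P[j:j + r] for j in range(m - r + 1)}  # stand-in for the suffix tree
    shift = max(1, m - r)
    out = []
    end = m - 1                                     # 0-based end of the window
    while end < n:
        if T[end - r + 1:end + 1] in grams:
            lo = end - m + 1
            region = T[lo:end + m - r + 1]          # length at most 2m - r
            for s in kmp_find(region, P):
                pos = lo + s
                # keep only starts in this window's range, skipping a repeat
                if pos <= end - r + 1 and (not out or out[-1] < pos):
                    out.append(pos)
        end += shift
    return out


print(fast_on_average("AGCCTAAGCTCCTAAGTC", "CCTA", 4))  # [2, 10]
```

The printed positions are 0-based, so they correspond to positions 3 and 11 in the 1-based notation of the slides.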
11
Analysis
First of all, note that the above algorithm has to
determine whether the suffix S occurs in P. This is
itself an exact string matching problem. Assume that
a preprocessing step constructs a
suffix tree of P. Whether S occurs in P
can then be determined by feeding S into the suffix
tree of P. Because the length of S is r, we can
determine whether S occurs in P in O(r) time.
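The idea of feeding S into a suffix structure of P can be illustrated with an uncompressed suffix trie. This is a simplification, not a true suffix tree: building it as below takes O(m²) time and space rather than O(m), but a query still follows at most r edges. The names `build_suffix_trie` and `occurs_in` are hypothetical.

```python
def build_suffix_trie(P):
    """Insert every suffix of P into a trie made of nested dictionaries."""
    root = {}
    for start in range(len(P)):
        node = root
        for ch in P[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs_in(trie, S):
    """Walk the trie along S; S is a substring of P iff the walk never falls off."""
    node = trie
    for ch in S:
        node = node.get(ch)
        if node is None:
            return False
    return True


trie = build_suffix_trie("CCTA")
print(occurs_in(trie, "CT"))  # True
print(occurs_in(trie, "AG"))  # False
```

Every substring of P is a prefix of some suffix of P, which is why a walk from the root decides membership.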
12
For reasons which will become clear later, we
assume that $r = 2\log_\alpha m$.
13
We assume that the text is a random string
and that the size of the alphabet is α.
14
There are $\alpha^r$ possible strings of length
r over an alphabet of α distinct characters.
P, whose length is m, contains at most m-r+1 distinct substrings of length r.
Thus, the probability that S is a substring of P
is no greater than
$$(m-r+1)\cdot\frac{1}{\alpha^{r}} = (m-r+1)\cdot\frac{1}{\alpha^{2\log_\alpha m}} = \frac{m-r+1}{m^{2}} \le \frac{1}{m}.$$
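A quick numeric check of this bound is sketched below, counting at most m-r+1 length-r substrings of P (one per start position 1 through m-r+1). The function name `occurrence_bound` is hypothetical, and r is taken as the smallest integer with α^r ≥ m², which is the integer form of r = 2 log_α m.

```python
from fractions import Fraction


def occurrence_bound(m, alpha):
    """Return (r, bound), where bound = (m - r + 1) / alpha**r as an exact fraction."""
    r = 0
    while alpha ** r < m * m:   # smallest r with alpha**r >= m**2
        r += 1
    return r, Fraction(m - r + 1, alpha ** r)


r, bound = occurrence_bound(16, 4)
print(r, bound, bound <= Fraction(1, 16))  # 4 13/256 True
```

For m = 16 over a 4-letter alphabet, r = 4 and there are 256 possible strings of length 4, of which P can contain at most 13.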
15
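If S is a substring of P, we find all occurrences
of P in T[i-m+1…i+m-r] using the KMP algorithm.
[Figure: T with positions i-m+1, i-r+1, i and i+m-r marked; S = T[i-r+1…i]; the searched region has length 2m-r; P is shown beneath.]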
16
Because the length of T[i-m+1…i+m-r] is 2m-r,
the time complexity of Step i using the KMP algorithm
is O(m).
17
(1) The probability that S occurs in P is at most $\frac{1}{m}$.
(2) When S occurs in P, the time complexity of using
the KMP algorithm to find all occurrences of
P in T[i-m+1…i+m-r] is O(m).
Combining (1) and (2), the average time complexity
of applying the KMP algorithm in one step is $O(m)\cdot\frac{1}{m} = O(1)$.
In addition, the time complexity of checking
whether S occurs in P is O(r).
Thus, the average time complexity of one step,
including the KMP search, is O(r) + O(1) = O(r).
18
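Thus, if S does not occur in P, the time
complexity of Step i is only the checking time, which is O(r).
If it does, Step i additionally runs the KMP algorithm in O(m) time, but this
happens with probability at most 1/m, so the time complexity of Step i is still O(r) on average.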
19
Because there are $O\left(\frac{n}{m}\right)$ windows with length
m in T, the time complexity of this algorithm
on average is $O\left(\frac{n}{m}\, r\right) = O\left(\frac{n}{m}\log_\alpha m\right)$.
20
Reference
• [KMP77] D. E. Knuth, J. H. Morris and V. R. Pratt, Fast Pattern Matching in Strings,
SIAM Journal on Computing 6 (2), 1977,
pp. 323–350.
• [CR2002] M. Crochemore and W. Rytter, Section 2.2: Boyer-Moore algorithm
and its variations, Jewels of Stringology, 2002,
pp. 30–31.
21
Thank you
22
