String Matching

Algorithm Design and Analysis
(Week 7)
Battle Plan
• String matching problem
• Notation and terminology
• Four different algorithms
Algorithm            Preprocessing Time   Matching Time
Naïve                0                    𝑂((𝑛 − 𝑚 + 1)𝑚)
Rabin-Karp           Θ(𝑚)                 𝑂((𝑛 − 𝑚 + 1)𝑚)
Finite automaton     𝑂(𝑚|Σ|)              Θ(𝑛)
Knuth-Morris-Pratt   Θ(𝑚)                 Θ(𝑛)
String-Matching Problem
• “Where’s the hotel in idahotelescope?”
• Formalization of the string-matching problem
– Text is an array 𝑇[1..𝑛] of length 𝑛
– Pattern is an array 𝑃[1..𝑚] of length 𝑚 ≤ 𝑛
– 𝑇 and 𝑃 are drawn from a finite alphabet Σ; they are often called strings of characters
– 𝑃 occurs with shift 𝒔 in 𝑇 if 0 ≤ 𝑠 ≤ 𝑛 − 𝑚 and 𝑇[𝑠 + 1..𝑠 + 𝑚] = 𝑃[1..𝑚]
i d a h o t e l e s c o p e      Text 𝑇
      h o t e l                  Pattern 𝑃 at shift 𝑠 = 3
Notation and Terminology
• Strings
– Σ∗: set of all finite-length strings with characters from Σ
– 𝜖: zero-length empty string, which also belongs to Σ∗
– |𝑥|: length of a string 𝑥
– 𝑥𝑦: concatenation of strings 𝑥 and 𝑦, which has length |𝑥| + |𝑦|
• Prefix and suffix
– string 𝑤 is a prefix of a string 𝑥, denoted as 𝑤 ⊏ 𝑥, if
𝑥 = 𝑤𝑦 for some string 𝑦 ∈ Σ ∗
– string 𝑤 is a suffix of a string 𝑥, denoted as 𝑤 ⊐ 𝑥, if
𝑥 = 𝑦𝑤 for some string 𝑦 ∈ Σ ∗
– 𝑆_𝑘 denotes the 𝑘-character prefix 𝑆[1..𝑘] of the string 𝑆[1..𝑛], and thus 𝑆_0 = 𝜖 and 𝑆_𝑛 = 𝑆 = 𝑆[1..𝑛]
Observations
• Strings
– |𝜖| = 0
• Prefix and suffix
– for any string 𝑥, 𝜖 ⊏ 𝑥 and 𝜖 ⊐ 𝑥
– if 𝑤 ⊏ 𝑥 or 𝑤 ⊐ 𝑥, then |𝑤| ≤ |𝑥|
– for any two strings 𝑥 and 𝑦 and any character 𝑎,
𝑥 ⊏ 𝑦 → 𝑎𝑥 ⊏ 𝑎𝑦 and 𝑥 ⊐ 𝑦 → 𝑥𝑎 ⊐ 𝑦𝑎
– both ⊏ and ⊐ are transitive relations
• Reformulated string-matching problem
– finding all shifts 𝑠 in the range 0 ≤ 𝑠 ≤ 𝑛 − 𝑚 such that
𝑃 ⊐ 𝑇_{𝑠+𝑚}
Examples
• Assume Σ = {a, b, c}
– Σ∗ = {𝜖, a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, …}
– 𝑥 = ab and 𝑦 = ba
– |𝑥| = |𝑦| = 2
– 𝑥𝑦 = abba
– |𝑥𝑦| = |𝑥| + |𝑦| = 4
– 𝜖 ⊏ abba and 𝜖 ⊐ abba
– a ⊏ abba and a ⊐ abba
– ab ⊏ abba and ba ⊐ abba
– abb ⊏ abba and bba ⊐ abba
– abba ⊏ abba and abba ⊐ abba
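The prefix and suffix relations map directly onto built-in string operations. The following Python sketch is only an illustration: the helper names is_prefix, is_suffix, and valid_shifts are assumptions introduced here, and shifts are 0-based offsets into a Python string, which coincide with the shift values 𝑠 used in the slides.

```python
from typing import List


def is_prefix(w: str, x: str) -> bool:
    """Return True if w is a prefix of x, i.e. x = wy for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """Return True if w is a suffix of x, i.e. x = yw for some string y."""
    return x.endswith(w)


def valid_shifts(text: str, pattern: str) -> List[int]:
    """Reformulated problem: all s such that pattern is a suffix of the prefix T_{s+m}."""
    n, m = len(text), len(pattern)
    return [s for s in range(n - m + 1) if is_suffix(pattern, text[:s + m])]


x, y = "ab", "ba"
xy = x + y  # concatenation of x and y
```

For instance, valid_shifts("idahotelescope", "hotel") yields the single shift 3 from the hotel example.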
Overlapping-Suffix Lemma
• Assume 𝑥, 𝑦, and 𝑧 are strings such that 𝑥 ⊐ 𝑧 and
𝑦⊐𝑧
– if |𝑥| ≤ |𝑦|, then 𝑥 ⊐ 𝑦
– if |𝑥| ≥ |𝑦|, then 𝑦 ⊐ 𝑥
– if |𝑥| = |𝑦|, then 𝑥 = 𝑦
• Proof
– [Figure: graphical proof with three panels, one per case, each showing 𝑥 and 𝑦 as suffixes of 𝑧]
Naïve String-Matching Algorithm
NAÏVE-STRING-MATCHER(𝑇, 𝑃)
1  𝑛 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑇]
2  𝑚 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑃]
3  for 𝑠 ← 0 to 𝑛 − 𝑚
4      do if 𝑃[1..𝑚] = 𝑇[𝑠 + 1..𝑠 + 𝑚]
5            then print “Pattern occurs with shift” 𝑠
• Comparing two strings (line 4) takes time Θ(𝑡 + 1)
– 𝑡 denotes the number of matching characters
– the “+1” covers the case 𝑡 = 0: detecting an immediate mismatch still costs one comparison, so the time is never 𝑂(0)
• Naïve algorithm takes time 𝑂((𝑛 − 𝑚 + 1)𝑚)
– the bound is tight in the worst case: Θ((𝑛 − 𝑚 + 1)𝑚)
– consider matching the text a^𝑛 against the pattern a^𝑚
– if 𝑚 = 𝑛/2, the worst-case running time is Θ(𝑛^2)
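The naïve matcher translates almost line for line into Python. This is a minimal sketch, not a tuned implementation; the function name naive_string_matcher is an assumption, and it returns the list of valid shifts instead of printing them.

```python
from typing import List


def naive_string_matcher(text: str, pattern: str) -> List[int]:
    """Return every shift s with text[s:s+m] == pattern, trying all n - m + 1 shifts."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Compare character by character and stop at the first mismatch,
        # so a shift with t matching characters costs t + 1 comparisons.
        t = 0
        while t < m and text[s + t] == pattern[t]:
            t += 1
        if t == m:
            shifts.append(s)
    return shifts
```

On the worst-case input from the slide, for example text "a" * 8 and pattern "a" * 4, every one of the 5 shifts runs the inner loop to completion.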
Example
• Graphical interpretation
– sliding pattern over text in steps of length 1
– noting for which shifts all of pattern characters equal the
corresponding text characters
a c a a b c      𝑠 = 0
a a b
a c a a b c      𝑠 = 1
  a a b
a c a a b c      𝑠 = 2 (all pattern characters match)
    a a b
a c a a b c      𝑠 = 3
      a a b
Rabin-Karp Algorithm
• Motivation
– comparing numbers is “cheaper” than matching strings
– represent text and pattern as numbers
– use number-theoretic notions to match strings
• Assumptions and notation
– Σ_10 = {0, 1, 2, …, 9}, but in the general case each character will be a digit in radix-𝑑 notation where 𝑑 = |Σ|
– 𝑝 denotes the value corresponding to 𝑃[1..𝑚]
– given 𝑇[1..𝑛], 𝑡_𝑠 denotes the value of the length-𝑚 substring 𝑇[𝑠 + 1..𝑠 + 𝑚], for 𝑠 = 0, 1, …, 𝑛 − 𝑚
– 𝑡_𝑠 = 𝑝 ⇔ 𝑇[𝑠 + 1..𝑠 + 𝑚] = 𝑃[1..𝑚]
Rabin-Karp Algorithm
• Goal
– compute 𝑝 in time Θ(𝑚)
– compute all 𝑡_𝑠 values in a total time Θ(𝑛 − 𝑚 + 1)
– get all valid shifts in time Θ(𝑚) + Θ(𝑛 − 𝑚 + 1) = Θ(𝑛)
• Computing 𝑝 from 𝑃[1..𝑚]
– can be done in Θ(𝑚) using Horner’s rule
– 𝑝 = 𝑃[𝑚] + 𝑑(𝑃[𝑚 − 1] + 𝑑(𝑃[𝑚 − 2] + ⋯ + 𝑑(𝑃[2] + 𝑑𝑃[1]) ⋯ ))
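Horner’s rule evaluates the nested expression from the innermost term outward, which in code becomes a single left-to-right pass over the digits. The sketch below assumes the pattern is given as a list of integer digits with the most significant digit first; the name horner_value is introduced here for illustration.

```python
from typing import Sequence


def horner_value(digits: Sequence[int], d: int = 10) -> int:
    """Interpret digits (most significant first) as a radix-d number using m multiplications."""
    value = 0
    for digit in digits:
        # After processing k digits, value equals the number formed by those k digits.
        value = d * value + digit
    return value
```

With 𝑃 = [1, 4, 1] and 𝑑 = 10 this gives 𝑝 = 141.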
Rabin-Karp Algorithm
• Computing 𝑡_0 from 𝑇[1..𝑛]
– use Horner’s rule to compute 𝑡_0 in time Θ(𝑚)
• Computing 𝑡_1, 𝑡_2, …, 𝑡_{𝑛−𝑚} from 𝑇[1..𝑛]
– can be done in time Θ(𝑛 − 𝑚) since 𝑡_{𝑠+1} can be computed from 𝑡_𝑠 in constant time
– 𝑡_{𝑠+1} = 𝑑(𝑡_𝑠 − 𝑑^{𝑚−1} 𝑇[𝑠 + 1]) + 𝑇[𝑠 + 𝑚 + 1]
• Example
– assume Σ_10, 𝑇 = [3, 1, 4, 1, 5, 9, 2, 6], 𝑃 = [1, 4, 1], and 𝑚 = 3
– 𝑝 = 141, 𝑡_0 = 314
– 𝑡_1 = 10(𝑡_0 − 10^2 𝑇[1]) + 𝑇[4] = 10(314 − 300) + 1 = 141
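The constant-time update can be checked on the example text. This sketch assumes exact (unbounded) integer arithmetic and 0-based Python indexing, so the digit leaving the window is T[s] and the digit entering it is T[s + m]; the name window_values is an assumption made for this illustration.

```python
from typing import List, Sequence


def window_values(T: Sequence[int], m: int, d: int = 10) -> List[int]:
    """Return [t_0, t_1, ..., t_{n-m}], the values of all length-m windows of T."""
    n = len(T)
    if m < 1 or m > n:
        return []
    high = d ** (m - 1)          # weight of the most significant digit, d^(m-1)
    t = 0
    for i in range(m):           # Horner's rule for t_0
        t = d * t + T[i]
    values = [t]
    for s in range(n - m):
        # Drop the leading digit, shift left by one radix-d position, append the new digit.
        t = d * (t - high * T[s]) + T[s + m]
        values.append(t)
    return values
```

For 𝑇 = [3, 1, 4, 1, 5, 9, 2, 6] and 𝑚 = 3 the second value is 141, which equals 𝑝 and marks the valid shift 𝑠 = 1.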
All’s Well That Ends Well
• Yes, Bill! But we’re not done yet...
– 𝑝 and 𝑡_𝑠 may be too large to work with conveniently
– assuming arithmetic operations on these numbers take “constant time” is unreasonable
• Simple solution
– compute 𝑝 and all 𝑡_𝑠 modulo a suitable modulus 𝑞
– the extra mod operation per step does not change the asymptotic running time
– 𝑞 is typically chosen as a prime such that 𝑑𝑞 fits within one computer word
– 𝑡_{𝑠+1} = (𝑑(𝑡_𝑠 − 𝑇[𝑠 + 1]ℎ) + 𝑇[𝑠 + 𝑚 + 1]) mod 𝑞, where ℎ ≡ 𝑑^{𝑚−1} (mod 𝑞)
Make It As Simple As Possible But Not Simpler
• Okay, Al! Maybe we went too far this time...
– 𝑡_𝑠 ≡ 𝑝 (mod 𝑞) does not imply 𝑡_𝑠 = 𝑝
– 𝑡_𝑠 ≢ 𝑝 (mod 𝑞) does imply 𝑡_𝑠 ≠ 𝑝
• Example
– assume Σ_10, 𝑝 = 31415, and 𝑞 = 13
– 31415 ≡ 7 (mod 13)
– 67399 ≡ 7 (mod 13), although 67399 ≠ 31415
• Solution
– use negative test as a fast heuristic to rule out invalid shifts
– positive test must be validated to sort out spurious hits
– if 𝑞 is large, spurious hits are likely to occur less frequently
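The example can be reproduced in a few lines. The helper name is_spurious_hit is an assumption introduced for this sketch; it flags a window whose residue agrees with that of the pattern although the values differ.

```python
def is_spurious_hit(p: int, t: int, q: int) -> bool:
    """True if t and p agree modulo q but are different numbers."""
    return p % q == t % q and p != t


q = 13
p = 31415        # pattern value from the example
window = 67399   # a different window value with the same residue
residues = (p % q, window % q)
```

Here both residues are 7, so the fast test alone would report a match and only the explicit comparison exposes the hit as spurious.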
Rabin-Karp Algorithm
RABIN-KARP-MATCHER(𝑇, 𝑃, 𝑑, 𝑞)
𝑛 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑇]
𝑚 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑃]
ℎ ← 𝑑^{𝑚−1} mod 𝑞
𝑝 ← 0
𝑡_0 ← 0
for 𝑖 ← 1 to 𝑚                                ▹ Preprocessing
    do 𝑝 ← (𝑑𝑝 + 𝑃[𝑖]) mod 𝑞
       𝑡_0 ← (𝑑𝑡_0 + 𝑇[𝑖]) mod 𝑞
for 𝑠 ← 0 to 𝑛 − 𝑚                            ▹ Matching
    do if 𝑝 = 𝑡_𝑠
          then if 𝑃[1..𝑚] = 𝑇[𝑠 + 1..𝑠 + 𝑚]
                  then print “Pattern occurs with shift” 𝑠
       if 𝑠 < 𝑛 − 𝑚
          then 𝑡_{𝑠+1} ← (𝑑(𝑡_𝑠 − 𝑇[𝑠 + 1]ℎ) + 𝑇[𝑠 + 𝑚 + 1]) mod 𝑞
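A runnable version of the matcher might look as follows. It is a sketch under stated assumptions: characters are mapped to digits with ord(), and the defaults d = 256 and q = 101 are arbitrary illustrative choices that do not come from the slides. Shifts are collected in a list instead of being printed.

```python
from typing import List


def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 101) -> List[int]:
    """Return all valid shifts, using residues modulo q as a fast filter."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    p = 0
    t = 0
    for i in range(m):                    # preprocessing with Horner's rule, modulo q
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):            # matching
        # Equal residues are only a hint; verify to sort out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Python's % always yields a non-negative result for q > 0.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts
```

Choosing q = 1 makes every shift a hash hit, which is a convenient way to confirm that the verification step alone keeps the result correct.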
Run-Time Analysis
• Worst case
– Θ(𝑚) to preprocess and Θ((𝑛 − 𝑚 + 1)𝑚) to match
• Heuristic analysis of average case
– “modulo 𝑞” acts as a random mapping from Σ∗ to ℤ_𝑞
– number of spurious hits expected to be 𝑂(𝑛/𝑞) since the probability of 𝑡_𝑠 ≡ 𝑝 (mod 𝑞) can be estimated as 1/𝑞
– expected matching time of Rabin-Karp algorithm is 𝑂(𝑛) + 𝑂(𝑚(𝑣 + 𝑛/𝑞)), where 𝑣 is the number of valid shifts
  • 𝑂(𝑛): comparing 𝑝 with 𝑡_𝑠 at every shift (no match)
  • 𝑂(𝑚(𝑣 + 𝑛/𝑞)): verifying each valid shift and each spurious hit character by character (match)
– if 𝑣 = 𝑂(1) and 𝑞 ≥ 𝑚, then 𝑚 · 𝑛/𝑞 ≤ 𝑛, so the running time is 𝑂(𝑛 + 𝑚); since 𝑚 ≤ 𝑛, it is even expected to be 𝑂(𝑛)
String Matching with Finite Automata
• Idea
– build a finite automaton to scan 𝑇 for all occurrences of 𝑃
– examine each character exactly once and in constant time
– matching time Θ(𝑛), but preprocessing time can be large
• A finite automaton 𝑀 is a 5-tuple (𝑄, 𝑞_0, 𝐴, Σ, 𝛿)
– 𝑄 is a finite set of states
– 𝑞_0 ∈ 𝑄 is the start state
– 𝐴 ⊆ 𝑄 is a distinguished set of accepting states
– Σ is a finite input alphabet
– 𝛿 is a function from 𝑄 × Σ into 𝑄, called the transition function of 𝑀
String Matching with Finite Automata
• Finite automaton
– begins in state 𝑞_0 and reads one input character 𝑎 at a time
– transitions from state 𝑞 into state 𝛿(𝑞, 𝑎)
– accepts the string read so far if current state 𝑞 ∈ 𝐴
– rejects the string read so far if current state 𝑞 ∉ 𝐴
• A finite automaton induces a final-state function 𝜙
– 𝜙: Σ∗ → 𝑄, such that 𝑞 = 𝜙(𝑤) is the state 𝑀 is in after scanning the string 𝑤
– 𝑀 accepts a string 𝑤 if and only if 𝜙(𝑤) ∈ 𝐴
– recursive definition of 𝜙
  𝜙(𝜖) = 𝑞_0
  𝜙(𝑤𝑎) = 𝛿(𝜙(𝑤), 𝑎) for 𝑤 ∈ Σ∗, 𝑎 ∈ Σ
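The recursive definition of the final-state function unrolls into a simple loop over the input. The two-state automaton below is a hypothetical example chosen for this sketch (it accepts strings over {a, b} with an odd number of a’s); it does not appear in the slides, and the names final_state, delta, and accepts are assumptions.

```python
from typing import Dict, Tuple


def final_state(delta: Dict[Tuple[int, str], int], q0: int, w: str) -> int:
    """Compute phi(w): start in q0 and apply the transition function once per character."""
    q = q0                      # phi(empty string) = q0
    for a in w:
        q = delta[(q, a)]       # phi(wa) = delta(phi(w), a)
    return q


# Hypothetical automaton: state 1 means "an odd number of a's has been read".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
accepting = {1}


def accepts(w: str) -> bool:
    """M accepts w if and only if phi(w) is an accepting state."""
    return final_state(delta, 0, w) in accepting
```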
String-Matching Automata
• For every pattern 𝑃[1..𝑚], we need to construct a string-matching automaton in preprocessing
– the state set 𝑄 is {0, 1, …, 𝑚}, where start state 𝑞_0 is state 0 and state 𝑚 is the only accepting state
– the transition function is defined as 𝛿(𝑞, 𝑎) = 𝜎(𝑃_𝑞 𝑎) for any state 𝑞 and character 𝑎
• Suffix function 𝜎 for a given pattern 𝑃[1..𝑚]
– 𝜎: Σ∗ → {0, 1, …, 𝑚} such that 𝜎(𝑥) = max{𝑘 : 𝑃_𝑘 ⊐ 𝑥} is the length of the longest prefix of 𝑃 that is a suffix of 𝑥
– for a pattern 𝑃 of length 𝑚, 𝜎(𝑥) = 𝑚 if and only if 𝑃 ⊐ 𝑥
– if 𝑥 ⊐ 𝑦, then 𝜎(𝑥) ≤ 𝜎(𝑦)
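The suffix function can be computed directly from its definition by trying candidate lengths from the largest downward. This is a deliberately naive sketch for illustration; the name suffix_function is an assumption.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Return sigma(x): the length of the longest prefix of pattern that is a suffix of x."""
    k = min(len(pattern), len(x))
    # The empty prefix (k = 0) is a suffix of every string, so the loop always terminates.
    while not x.endswith(pattern[:k]):
        k -= 1
    return k
```

For the pattern ababaca, suffix_function("ababaca", "ababab") is 4, because abab is the longest prefix of the pattern that ends the string ababab.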
Example
• Assume pattern 𝑃 = ababaca
[Figure: state-transition diagram with states 0 to 7; the forward transitions 0 → 1 → … → 7 are labeled a, b, a, b, a, c, a, and the remaining edges, labeled a and b, lead back to earlier states]
– 8 states and a “spine” of forward transitions
– 𝛿(1, a) = 1, since 𝑃_1 a = aa and 𝜎(𝑃_1 a) = 1
– 𝛿(3, a) = 1, since 𝑃_3 a = abaa and 𝜎(𝑃_3 a) = 1
– 𝛿(5, a) = 1, since 𝑃_5 a = ababaa and 𝜎(𝑃_5 a) = 1
– 𝛿(5, b) = 4, since 𝑃_5 b = ababab and 𝜎(𝑃_5 b) = 4
– 𝛿(7, a) = 1, since 𝑃_7 a = ababacaa and 𝜎(𝑃_7 a) = 1
– 𝛿(7, b) = 2, since 𝑃_7 b = ababacab and 𝜎(𝑃_7 b) = 2
String-Matching Automata
FINITE-AUTOMATON-MATCHER(𝑇, 𝑃, Σ, 𝑚)
𝑛 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑇]
𝛿 ← COMPUTE-TRANSITION-FUNCTION(𝑃, Σ)
𝑞 ← 0
for 𝑖 ← 1 to 𝑛
    do 𝑞 ← 𝛿(𝑞, 𝑇[𝑖])
       if 𝑞 = 𝑚
          then print “Pattern occurs with shift” 𝑖 − 𝑚
• Matching time on a text of length 𝑛 is Θ(𝑛)
– simple loop structure with 𝑛 iterations
– does not account for the time required to compute the transition function 𝛿
Computing the Transition Function δ
COMPUTE-TRANSITION-FUNCTION(𝑃, Σ)
𝑚 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑃]
for 𝑞 ← 0 to 𝑚
    do for each character 𝑎 ∈ Σ
          do 𝑘 ← min(𝑚 + 1, 𝑞 + 2)
             repeat 𝑘 ← 𝑘 − 1
             until 𝑃_𝑘 ⊐ 𝑃_𝑞 𝑎
             𝛿(𝑞, 𝑎) ← 𝑘
return 𝛿
• Computing transition function takes time 𝑂(𝑚^3 |Σ|)
– outer two for loops contribute a factor of 𝑚|Σ|
– inner repeat loop can run at most 𝑚 + 1 times
– test 𝑃_𝑘 ⊐ 𝑃_𝑞 𝑎 can require up to 𝑚 comparisons
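Both procedures can be sketched together in Python. The function names and the dictionary representation of 𝛿 are assumptions made for this illustration; the construction follows the straightforward 𝑂(𝑚^3 |Σ|) procedure rather than an optimized one. A text character outside the given alphabet is sent to state 0, which is valid as long as the pattern only uses characters from that alphabet.

```python
from typing import Dict, List, Tuple


def compute_transition_function(pattern: str, alphabet: str) -> Dict[Tuple[int, str], int]:
    """Build delta(q, a) = sigma(P_q a) for every state q and character a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)                  # largest candidate prefix length
            # Decrease k until P_k is a suffix of P_q a (k = 0 always qualifies).
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def finite_automaton_matcher(text: str, pattern: str, alphabet: str) -> List[int]:
    """Scan text once and return every shift at which the automaton reaches state m."""
    m = len(pattern)
    delta = compute_transition_function(pattern, alphabet)
    q = 0
    shifts = []
    for i, ch in enumerate(text, start=1):     # i is the 1-based text position
        q = delta.get((q, ch), 0)              # unknown characters reset to state 0
        if q == m:
            shifts.append(i - m)
    return shifts
```

For the pattern ababaca over the alphabet "abc", the computed table reproduces the transitions listed in the example, such as 𝛿(5, b) = 4 and 𝛿(7, b) = 2.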
Knuth-Morris-Pratt Algorithm
• Idea
– avoid both computing transition function 𝛿 in time 𝑂(𝑚|Σ|) and testing useless shifts as in naïve algorithm
– use auxiliary function 𝜋[1..𝑚] that can be pre-computed from the pattern in time Θ(𝑚)
– array 𝜋 allows 𝛿 to be computed efficiently “on the fly” as needed, in the amortized sense
• Prefix function 𝜋 for a pattern 𝑃[1..𝑚]
– 𝜋: {1, 2, …, 𝑚} → {0, 1, …, 𝑚 − 1} such that 𝜋[𝑞] = max{𝑘 : 𝑘 < 𝑞 and 𝑃_𝑘 ⊐ 𝑃_𝑞}
– 𝜋[𝑞] is the length of the longest prefix of 𝑃 that is a proper suffix of 𝑃_𝑞
Example
• What’s the next possible shift that should be tested?
b a c b a b a b a a b c b a b      text 𝑇
        a b a b a c a              pattern 𝑃 at shift 𝑠, with 𝑞 = 5 characters matched
            a b a b a c a          pattern 𝑃 at shift 𝑠 + (𝑞 − 𝜋[𝑞]) = 𝑠 + 2
[Figure: a further alignment of 𝑃 is labeled “Bad Idea!”, and the figure carries the annotation “Knowledge Horizon”]
𝑖      1 2 3 4 5 6 7
𝑃[𝑖]   a b a b a c a
𝜋[𝑖]   0 0 1 2 3 0 1
Knuth-Morris-Pratt Algorithm
KMP-MATCHER(𝑇, 𝑃)
𝑛 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑇]
𝑚 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑃]
𝜋 ← COMPUTE-PREFIX-FUNCTION(𝑃)
𝑞 ← 0                                         ▹ Number of characters matched
for 𝑖 ← 1 to 𝑛                                ▹ Scan the text from left to right
    do while 𝑞 > 0 and 𝑃[𝑞 + 1] ≠ 𝑇[𝑖]
          do 𝑞 ← 𝜋[𝑞]                         ▹ Next character does not match
       if 𝑃[𝑞 + 1] = 𝑇[𝑖]
          then 𝑞 ← 𝑞 + 1                      ▹ Next character matches
       if 𝑞 = 𝑚                               ▹ Is all of 𝑃 matched?
          then print “Pattern occurs with shift” 𝑖 − 𝑚
               𝑞 ← 𝜋[𝑞]                       ▹ Look for the next match
Computing the Prefix Function 𝜋
COMPUTE-PREFIX-FUNCTION(𝑃)
𝑚 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑃]
𝜋[1] ← 0
𝑘 ← 0
for 𝑞 ← 2 to 𝑚
    do while 𝑘 > 0 and 𝑃[𝑘 + 1] ≠ 𝑃[𝑞]
          do 𝑘 ← 𝜋[𝑘]
       if 𝑃[𝑘 + 1] = 𝑃[𝑞]
          then 𝑘 ← 𝑘 + 1
       𝜋[𝑞] ← 𝑘
return 𝜋
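The prefix function and the KMP matcher can be sketched in Python as follows. The function names are assumptions, and because Python lists are 0-based, pi[q - 1] stores the value that the pseudocode calls 𝜋[𝑞]; shifts are returned as a list instead of being printed.

```python
from typing import List


def compute_prefix_function(pattern: str) -> List[int]:
    """Return pi, where pi[q - 1] is the longest proper prefix-suffix length of P_q."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the currently matched prefix
    for q in range(1, m):                   # 0-based index q is 1-based position q + 1
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to the next shorter candidate
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> List[int]:
    """Return all valid shifts of pattern in text in a single left-to-right scan."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    q = 0                                   # number of characters matched
    shifts = []
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                   # next character does not match
        if pattern[q] == text[i]:
            q += 1                          # next character matches
        if q == m:                          # all of the pattern is matched
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # look for the next match
    return shifts
```

For the pattern ababaca the computed list is [0, 0, 1, 2, 3, 0, 1], which agrees with the 𝜋 values in the example table.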
Run-Time Analysis
• Computing the prefix function takes time Θ(𝑚)
– outer for loop takes time Θ(𝑚)
– amortized cost of for loop body is 𝑂(1)
  • amortized analysis with a potential of 𝑘, corresponding to the current state of 𝑘 in the algorithm
  • in each iteration of the for loop, 𝑘 increases at most by 1
  • since 𝜋[𝑘] < 𝑘, every iteration of the while loop decreases 𝑘, and 𝑘 never drops below 0, so the total number of decreases is at most the total number of increases
• String-matching takes time Θ(𝑛)
– with 𝑞 as the potential function, the same amortized argument as above can be made for the matching time
