String Matching

Algorithm Design and Analysis
(Week 7)

Battle Plan
• String matching problem
• Notation and terminology
• Four different algorithms

Algorithm            Preprocessing Time   Matching Time
Naïve                0                    $O((n-m+1)m)$
Rabin-Karp           $\Theta(m)$          $O((n-m+1)m)$
Finite automaton     $O(m|\Sigma|)$       $\Theta(n)$
Knuth-Morris-Pratt   $\Theta(m)$          $\Theta(n)$

String-Matching Problem
• “Where’s the hotel in idahotelescope?”
• Formalization of the string-matching problem
– Text is an array $T[1..n]$ of length $n$
– Pattern is an array $P[1..m]$ of length $m \le n$
– $T$ and $P$ are drawn from a finite alphabet $\Sigma$; they are often called strings of characters
– $P$ occurs with shift $s$ in $T$ if $0 \le s \le n-m$ and $T[s+1..s+m] = P[1..m]$

Text T:     i d a h o t e l e s c o p e
Pattern P:        h o t e l               (s = 3)

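The definition of "occurs with shift $s$" can be checked directly in code. The following is a minimal Python sketch, not part of the original slides: the helper name `occurs_with_shift` is hypothetical, and Python's 0-based slice `T[s:s + m]` stands in for the 1-based $T[s+1..s+m]$.

```python
def occurs_with_shift(T: str, P: str, s: int) -> bool:
    """Return True if pattern P occurs with shift s in text T."""
    n, m = len(T), len(P)
    # A shift is only admissible if 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    # T[s:s + m] is the 0-based equivalent of T[s+1..s+m].
    return T[s:s + m] == P


print(occurs_with_shift("idahotelescope", "hotel", 3))  # True
print(occurs_with_shift("idahotelescope", "hotel", 4))  # False
```
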
Notation and Terminology
• Strings
– $\Sigma^*$: set of all finite-length strings with characters from $\Sigma$
– $\epsilon$: zero-length empty string, which also belongs to $\Sigma^*$
– $|x|$: length of a string $x$
– $xy$: concatenation of strings $x$ and $y$, which has length $|x| + |y|$
• Prefix and suffix
– string $w$ is a prefix of a string $x$, denoted as $w \sqsubset x$, if $x = wy$ for some string $y \in \Sigma^*$
– string $w$ is a suffix of a string $x$, denoted as $w \sqsupset x$, if $x = yw$ for some string $y \in \Sigma^*$
– $S_k$ denotes the $k$-character prefix $S[1..k]$ of the string $S[1..n]$, and thus $S_0 = \epsilon$ and $S_n = S = S[1..n]$

Observations
• Strings
– $|\epsilon| = 0$
• Prefix and suffix
– for any string $x$, $\epsilon \sqsubset x$ and $\epsilon \sqsupset x$
– if $w \sqsubset x$ or $w \sqsupset x$, then $|w| \le |x|$
– for any two strings $x$ and $y$ and any character $a$, $x \sqsubset y \Rightarrow ax \sqsubset ay$ and $x \sqsupset y \Rightarrow xa \sqsupset ya$
– both $\sqsubset$ and $\sqsupset$ are transitive relations
• Reformulated string-matching problem
– find all shifts $s$ in the range $0 \le s \le n-m$ such that $P \sqsupset T_{s+m}$

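Examples
• Assume $\Sigma = \{\mathtt{a}, \mathtt{b}, \mathtt{c}\}$
– $\Sigma^* = \{\epsilon, \mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{aa}, \mathtt{ab}, \mathtt{ac}, \mathtt{ba}, \mathtt{bb}, \mathtt{bc}, \mathtt{ca}, \mathtt{cb}, \mathtt{cc}, \ldots\}$
– $x = \mathtt{ab}$ and $y = \mathtt{ba}$
– $|x| = |y| = 2$
– $xy = \mathtt{abba}$
– $|xy| = |x| + |y| = 4$
– $\epsilon \sqsubset \mathtt{abba}$ and $\epsilon \sqsupset \mathtt{abba}$
– $\mathtt{a} \sqsubset \mathtt{abba}$ and $\mathtt{a} \sqsupset \mathtt{abba}$
– $\mathtt{ab} \sqsubset \mathtt{abba}$ and $\mathtt{ba} \sqsupset \mathtt{abba}$
– $\mathtt{abb} \sqsubset \mathtt{abba}$ and $\mathtt{bba} \sqsupset \mathtt{abba}$
– $\mathtt{abba} \sqsubset \mathtt{abba}$ and $\mathtt{abba} \sqsupset \mathtt{abba}$
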
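The prefix and suffix relations, and the reformulated matching problem $P \sqsupset T_{s+m}$, map closely onto Python's built-in string methods. The sketch below is illustrative only; the function names `is_prefix`, `is_suffix`, and `valid_shifts` are hypothetical and do not appear in the slides.

```python
def is_prefix(w: str, x: str) -> bool:
    """w is a prefix of x: x = wy for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """w is a suffix of x: x = yw for some string y."""
    return x.endswith(w)


def valid_shifts(T: str, P: str) -> list[int]:
    """All shifts s such that P is a suffix of the (s + m)-character prefix of T."""
    n, m = len(T), len(P)
    return [s for s in range(n - m + 1) if is_suffix(P, T[:s + m])]


x, y = "ab", "ba"
print(len(x + y) == len(x) + len(y))                       # True
print(is_prefix("abb", "abba"), is_suffix("bba", "abba"))  # True True
print(is_prefix("", "abba"), is_suffix("", "abba"))        # True True
print(valid_shifts("idahotelescope", "hotel"))             # [3]
```
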
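Overlapping-Suffix Lemma
• Assume $x$, $y$, and $z$ are strings such that $x \sqsupset z$ and $y \sqsupset z$
– if $|x| \le |y|$, then $x \sqsupset y$
– if $|x| \ge |y|$, then $y \sqsupset x$
– if $|x| = |y|$, then $x = y$
• Proof
– $x$ and $y$ are both aligned with the right end of $z$, so the shorter of the two coincides with the tail of the longer one
[Figure: the three cases $|x| \le |y|$, $|x| \ge |y|$, and $|x| = |y|$, each showing $x$ and $y$ aligned at the right end of $z$]
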
Naïve String-Matching Algorithm
NAÏVE-STRING-MATCHER(T, P)
1  n ← length[T]
2  m ← length[P]
3  for s ← 0 to n − m
4      do if P[1..m] = T[s+1..s+m]
5            then print “Pattern occurs with shift” s
• Comparing two strings (line 4) takes time $\Theta(t+1)$
– $t$ denotes the number of matching characters
– the “+1” accounts for the comparison that is still needed when the strings do not match, so the cost is never zero
• Naïve algorithm takes time $O((n-m+1)m)$
– this bound is tight in the worst case: $\Theta((n-m+1)m)$
– consider matching the text $\mathtt{a}^n$ and the pattern $\mathtt{a}^m$
– if $m = \lfloor n/2 \rfloor$, the worst-case running time is $\Theta(n^2)$

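A direct Python transcription of the naïve matcher is sketched below. It is an illustration rather than the slides' own code: shifts are collected in a list instead of printed, and the character-by-character comparison is written out explicitly so that the $\Theta(t+1)$ cost per shift is visible.

```python
def naive_string_matcher(T: str, P: str) -> list[int]:
    """Return all valid shifts of P in T by trying every shift in turn."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):
        # Compare character by character and stop at the first mismatch;
        # t counts the matching characters, so this costs Theta(t + 1).
        t = 0
        while t < m and P[t] == T[s + t]:
            t += 1
        if t == m:
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))  # [2]
print(naive_string_matcher("aaaa", "aa"))     # [0, 1, 2]
```

The second call is a small instance of the worst case: with text $\mathtt{a}^n$ and pattern $\mathtt{a}^m$, every shift is valid and every comparison runs to the full length $m$.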
Example
• Graphical interpretation
– slide the pattern over the text in steps of length 1
– note the shifts for which all pattern characters equal the corresponding text characters

s = 0:  a c a a b c
        a a b
s = 1:  a c a a b c
          a a b
s = 2:  a c a a b c
            a a b        (match)
s = 3:  a c a a b c
              a a b

Rabin-Karp Algorithm
• Motivation
– comparing numbers is “cheaper” than matching strings
– represent text and pattern as numbers
– use number-theoretic notions to match strings
• Assumptions and notation
– $\Sigma_{10} = \{0, 1, 2, \ldots, 9\}$, but in the general case each character will be a digit in radix-$d$ notation, where $d = |\Sigma|$
– $p$ denotes the value corresponding to $P[1..m]$
– given $T[1..n]$, $t_s$ denotes the value of the length-$m$ substring $T[s+1..s+m]$, for $s = 0, 1, \ldots, n-m$
– $t_s = p \Leftrightarrow T[s+1..s+m] = P[1..m]$

Rabin-Karp Algorithm
• Goal
– compute $p$ in time $\Theta(m)$
– compute all $t_s$ values in a total time of $\Theta(n-m+1)$
– get all valid shifts in time $\Theta(m) + \Theta(n-m+1) = \Theta(n)$
• Computing $p$ from $P[1..m]$
– can be done in $\Theta(m)$ using Horner’s rule
– $p = P[m] + d\,(P[m-1] + d\,(P[m-2] + \cdots + d\,(P[2] + d\,P[1]) \cdots))$

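Horner's rule amounts to a single left-to-right pass over the digits. The following Python sketch shows the idea; the function name `horner_value` and the representation of the pattern as a list of integer digits are assumptions made for illustration.

```python
def horner_value(digits: list[int], d: int = 10) -> int:
    """Evaluate a digit sequence in radix d using Horner's rule.

    Processing P[1] first and multiplying by d at every step yields
    p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + d P[1]) ...)).
    """
    value = 0
    for digit in digits:
        value = d * value + digit  # one multiplication and one addition per digit
    return value


print(horner_value([3, 1, 4, 1, 5]))  # 31415
print(horner_value([1, 0, 1], d=2))   # 5
```
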
Rabin-Karp Algorithm
• Computing $t_0$ from $T[1..n]$
– use Horner’s rule to compute $t_0$ in time $\Theta(m)$
• Computing $t_1, t_2, \ldots, t_{n-m}$ from $T[1..n]$
– can be done in time $\Theta(n-m)$, since $t_{s+1}$ can be computed from $t_s$ in constant time
– $t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
• Example
– assume $\Sigma_{10}$, $T = [3,1,4,1,5,9,2,6]$, $P = [1,4,1]$, and $m = 3$
– $p = 141$, $t_0 = 314$
– $t_1 = 10\,(t_0 - 10^2\,T[1]) + T[4] = 10\,(314 - 300) + 1 = 141$

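The constant-time update can be replayed on the example above. This is a small Python sketch with a hypothetical helper `next_value`; indices are 0-based, so the slide's $T[s+1]$ and $T[s+m+1]$ become `T[s]` and `T[s + m]`.

```python
def next_value(t_s: int, old_digit: int, new_digit: int, m: int, d: int = 10) -> int:
    """Compute t_{s+1} from t_s: drop the high-order digit, shift, append the new digit."""
    return d * (t_s - d ** (m - 1) * old_digit) + new_digit


T = [3, 1, 4, 1, 5, 9, 2, 6]
m = 3
t = 314  # t_0, the value of the first window
values = [t]
for s in range(len(T) - m):
    t = next_value(t, T[s], T[s + m], m)
    values.append(t)

print(values)  # [314, 141, 415, 159, 592, 926]
```

The second entry, 141, equals $p$, which corresponds to the valid shift $s = 1$.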
All’s Well That Ends Well
• Yes, Bill! But we’re not done yet...
– $p$ and $t_s$ may be too large to work with conveniently
– assuming that arithmetic operations on these numbers take “constant time” is unreasonable
• Simple solution
– compute $p$ and all $t_s$ modulo a suitable modulus $q$
– the extra modulo operation per step does not change the asymptotic running time
– $q$ is typically chosen as a prime such that $dq$ fits within one computer word
– $t_{s+1} = (d\,(t_s - T[s+1]\,h) + T[s+m+1]) \bmod q$, where $h \equiv d^{m-1} \pmod q$

Make It As Simple As Possible But Not Simpler
• Okay, Al! Maybe we went too far this time...
– $t_s \equiv p \pmod q$ does not imply $t_s = p$
– $t_s \not\equiv p \pmod q$ does imply $t_s \ne p$
• Example
– assume $\Sigma_{10}$, $p = 31415$, and $q = 13$
– $31415 \equiv 7 \pmod{13}$
– $67399 \equiv 7 \pmod{13}$
• Solution
– use the negative test as a fast heuristic to rule out invalid shifts
– a positive test must be validated to sort out spurious hits
– if $q$ is large, spurious hits are likely to occur less frequently

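The example residues are easy to verify. The short Python check below (the helper name `same_residue` is hypothetical) shows why a positive modular test still needs explicit validation, while a negative test is conclusive.

```python
def same_residue(a: int, b: int, q: int) -> bool:
    """Fast test: do a and b agree modulo q?"""
    return a % q == b % q


q, p = 13, 31415
print(p % q, 67399 % q)                         # 7 7
print(same_residue(p, 67399, q), p == 67399)    # True False  (spurious hit)
print(same_residue(p, 14152, q))                # False       (certainly no match)
```
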
Rabin-Karp Algorithm
RABIN-KARP-MATCHER(T, P, d, q)
n ← length[T]
m ← length[P]
h ← d^(m−1) mod q
p ← 0
t_0 ← 0
for i ← 1 to m                                  ▷ Preprocessing
    do p ← (d·p + P[i]) mod q
       t_0 ← (d·t_0 + T[i]) mod q
for s ← 0 to n − m                              ▷ Matching
    do if p = t_s
          then if P[1..m] = T[s+1..s+m]
                  then print “Pattern occurs with shift” s
       if s < n − m
          then t_(s+1) ← (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

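The pseudocode above can be sketched in Python roughly as follows. Several choices here are assumptions rather than part of the slides: characters are mapped to numbers with `ord`, the defaults `d = 256` and `q = 101` are arbitrary illustrative values, an empty pattern is rejected, and shifts are returned as a list instead of printed.

```python
def rabin_karp_matcher(T: str, P: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all valid shifts of P in T using a rolling hash modulo q."""
    n, m = len(T), len(P)
    if m == 0:
        raise ValueError("pattern must be non-empty")  # assumption of this sketch
    if m > n:
        return []
    h = pow(d, m - 1, q)  # h = d^(m-1) mod q
    p = 0
    t = 0
    # Preprocessing: Horner's rule modulo q for the pattern and the first window.
    for i in range(m):
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    # Matching: a hit on the hash values must be validated explicitly.
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:
            shifts.append(s)
        if s < n - m:
            # Python's % always yields a value in 0..q-1, even if the
            # intermediate difference is negative.
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts


print(rabin_karp_matcher("idahotelescope", "hotel"))  # [3]
print(rabin_karp_matcher("aaaa", "aa", q=13))         # [0, 1, 2]
```

Because every hash hit is verified against the text, the result does not depend on the choice of `q`; a small modulus only increases the number of spurious hits.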
Run-Time Analysis
• Worst case
– $\Theta(m)$ to preprocess and $\Theta((n-m+1)m)$ to match
• Heuristic analysis of average case
– “modulo $q$” acts as a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$
– the number of spurious hits is expected to be $O(n/q)$, since the probability of $t_s \equiv p \pmod q$ can be estimated as $1/q$
– expected matching time of the Rabin-Karp algorithm: $O(n) + O(m(v + n/q))$, where the first term covers the shifts without a hit (“no match”), the second term covers the hits that must be verified (“match”), and $v$ is the number of valid shifts
– if $v = O(1)$ and $q \ge m$, the running time is $O(n + m)$, and since $m \le n$ it is expected to be $O(n)$

String Matching with Finite Automata
• Idea
– build a finite automaton to scan $T$ for all occurrences of $P$
– examine each character exactly once and in constant time
– matching time $\Theta(n)$, but preprocessing time can be large
• A finite automaton $M$ is a 5-tuple $(Q, q_0, A, \Sigma, \delta)$
– $Q$ is a finite set of states
– $q_0 \in Q$ is the start state
– $A \subseteq Q$ is a distinguished set of accepting states
– $\Sigma$ is a finite input alphabet
– $\delta$ is a function from $Q \times \Sigma$ into $Q$, called the transition function of $M$

String Matching with Finite Automata
• Finite automaton
– begins in state $q_0$ and reads one input character $a$ at a time
– transitions from state $q$ into state $\delta(q, a)$
– accepts the string read so far if the current state $q \in A$
– rejects the string read so far if the current state $q \notin A$
• A finite automaton induces a final-state function $\phi$
– $\phi: \Sigma^* \to Q$, such that $q = \phi(w)$ is the state $M$ is in after scanning the string $w$
– $M$ accepts a string $w$ if and only if $\phi(w) \in A$
– recursive definition of $\phi$:
  $\phi(\epsilon) = q_0$
  $\phi(wa) = \delta(\phi(w), a)$ for $w \in \Sigma^*$, $a \in \Sigma$

String-Matching Automata
• For every pattern $P[1..m]$, we need to construct a string-matching automaton in preprocessing
– the state set $Q$ is $\{0, 1, \ldots, m\}$, where the start state $q_0$ is state 0 and state $m$ is the only accepting state
– the transition function is defined as $\delta(q, a) = \sigma(P_q a)$ for any state $q$ and character $a$
• Suffix function $\sigma$ for a given pattern $P[1..m]$
– $\sigma: \Sigma^* \to \{0, 1, \ldots, m\}$ such that $\sigma(x) = \max\{k : P_k \sqsupset x\}$ is the length of the longest prefix of $P$ that is a suffix of $x$
– for a pattern $P$ of length $m$, $\sigma(x) = m$ if and only if $P \sqsupset x$
– if $x \sqsupset y$, then $\sigma(x) \le \sigma(y)$

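Example
• Assume pattern $P = \mathtt{ababaca}$
[Figure: state-transition diagram with states 0 to 7; the spine of forward transitions is 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -c-> 6 -a-> 7, and the remaining labelled edges are the backward transitions listed below]
– 8 states and a “spine” of forward transitions
– $\delta(1, \mathtt{a}) = 1$, since $P_1\mathtt{a} = \mathtt{aa}$ and $\sigma(P_1\mathtt{a}) = 1$
– $\delta(3, \mathtt{a}) = 1$, since $P_3\mathtt{a} = \mathtt{abaa}$ and $\sigma(P_3\mathtt{a}) = 1$
– $\delta(5, \mathtt{a}) = 1$, since $P_5\mathtt{a} = \mathtt{ababaa}$ and $\sigma(P_5\mathtt{a}) = 1$
– $\delta(5, \mathtt{b}) = 4$, since $P_5\mathtt{b} = \mathtt{ababab}$ and $\sigma(P_5\mathtt{b}) = 4$
– $\delta(7, \mathtt{a}) = 1$, since $P_7\mathtt{a} = \mathtt{ababacaa}$ and $\sigma(P_7\mathtt{a}) = 1$
– $\delta(7, \mathtt{b}) = 2$, since $P_7\mathtt{b} = \mathtt{ababacab}$ and $\sigma(P_7\mathtt{b}) = 2$
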
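The transition values listed above follow directly from the definitions of $\sigma$ and $\delta$. The Python sketch below evaluates both by brute force; the function names `sigma` and `delta` mirror the slide notation but are otherwise an assumption, and no attempt is made at efficiency.

```python
def sigma(P: str, x: str) -> int:
    """Suffix function: length of the longest prefix of P that is a suffix of x."""
    for k in range(min(len(P), len(x)), -1, -1):
        if x.endswith(P[:k]):
            return k
    return 0  # not reached: k = 0 (the empty prefix) always succeeds


def delta(P: str, q: int, a: str) -> int:
    """Transition function of the string-matching automaton: sigma(P_q a)."""
    return sigma(P, P[:q] + a)


P = "ababaca"
print([delta(P, q, "a") for q in (1, 3, 5, 7)])  # [1, 1, 1, 1]
print(delta(P, 5, "b"))                          # 4
print(delta(P, 7, "b"))                          # 2
print([delta(P, q, P[q]) for q in range(7)])     # spine: [1, 2, 3, 4, 5, 6, 7]
```
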
String-Matching Automata
FINITE-AUTOMATON-MATCHER(T, P, Σ, m)
n ← length[T]
δ ← COMPUTE-TRANSITION-FUNCTION(P, Σ)
q ← 0
for i ← 1 to n
    do q ← δ(q, T[i])
       if q = m
          then print “Pattern occurs with shift” i − m
• Matching time on a text of length $n$ is $\Theta(n)$
– simple loop structure with $n$ iterations
– does not account for the time required to compute the transition function $\delta$

Computing the Transition Function $\delta$
COMPUTE-TRANSITION-FUNCTION(P, Σ)
m ← length[P]
for q ← 0 to m
    do for each character a ∈ Σ
          do k ← min(m + 1, q + 2)
             repeat k ← k − 1
             until P_k ⊐ P_q a
             δ(q, a) ← k
return δ
• Computing the transition function takes time $O(m^3 |\Sigma|)$
– the two outer for loops contribute a factor of $m|\Sigma|$
– the inner repeat loop can run at most $m + 1$ times
– the test $P_k \sqsupset P_q a$ can require up to $m$ comparisons

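The transition-function procedure and the automaton-based matcher can be combined into one runnable Python sketch. The data layout is an assumption made for illustration: $\delta$ is stored as a list of dictionaries indexed by state and character, the alphabet is passed as a string, and the text is assumed to contain only characters from that alphabet.

```python
def compute_transition_function(P: str, alphabet: str) -> list[dict[str, int]]:
    """Build delta[q][a] for all states q = 0..m and all characters a."""
    m = len(P)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            k = min(m + 1, q + 2)
            # repeat k <- k - 1 until P_k is a suffix of P_q a
            while True:
                k -= 1
                if (P[:q] + a).endswith(P[:k]):
                    break
            delta[q][a] = k
    return delta


def finite_automaton_matcher(T: str, P: str, alphabet: str) -> list[int]:
    """Return all valid shifts; assumes every character of T is in alphabet."""
    m = len(P)
    delta = compute_transition_function(P, alphabet)
    q = 0
    shifts = []
    for i, c in enumerate(T, start=1):  # i is the 1-based text position
        q = delta[q][c]
        if q == m:
            shifts.append(i - m)
    return shifts


print(finite_automaton_matcher("abababacaba", "ababaca", "abc"))  # [2]
```

The matching loop does constant work per text character, which is where the $\Theta(n)$ matching time comes from; all of the cost is pushed into building `delta`.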
Knuth-Morris-Pratt Algorithm
• Idea
– avoid both computing the transition function $\delta$ in time $O(m|\Sigma|)$ and testing useless shifts as in the naïve algorithm
– use an auxiliary function $\pi[1..m]$ that can be pre-computed from the pattern in time $\Theta(m)$
– the array $\pi$ allows $\delta$ to be computed efficiently “on the fly” as needed, in the amortized sense
• Prefix function $\pi$ for a pattern $P[1..m]$
– $\pi: \{1, 2, \ldots, m\} \to \{0, 1, \ldots, m-1\}$ such that $\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
– $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$

Example
• What’s the next possible shift that should be tested?

Shift $s$, with the first $q$ characters of $P$ matching:
  T:  b a c b a b a b a a b c b a b
  P:          a b a b a c a
Shift $s + 1$ (bad idea!):
  T:  b a c b a b a b a a b c b a b
  P:            a b a b a c a
Shift $s + (q - \pi[q])$:
  T:  b a c b a b a b a a b c b a b
  P:              a b a b a c a
[Figure label: “Knowledge Horizon”]

$i$       1 2 3 4 5 6 7
$P[i]$    a b a b a c a
$\pi[i]$  0 0 1 2 3 0 1

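The $\pi$ table above can be reproduced straight from the definition $\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$. The Python sketch below is a brute-force check, not the linear-time procedure; the function name `prefix_function_by_definition` is hypothetical, and the returned list holds $\pi[1], \ldots, \pi[m]$ at indices $0, \ldots, m-1$.

```python
def prefix_function_by_definition(P: str) -> list[int]:
    """pi[q] = longest proper prefix length k < q with P_k a suffix of P_q."""
    pi = []
    for q in range(1, len(P) + 1):
        k = q - 1
        # Try k = q-1, q-2, ..., 0; k = 0 always succeeds (empty prefix).
        while not P[:q].endswith(P[:k]):
            k -= 1
        pi.append(k)
    return pi


pi = prefix_function_by_definition("ababaca")
print(pi)  # [0, 0, 1, 2, 3, 0, 1]

# With q = 5 matched characters, the next shift worth testing is
# s + (q - pi[q]); the increment is:
q = 5
print(q - pi[q - 1])  # 2
```
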
Knuth-Morris-Pratt Algorithm
KMP-MATCHER(T, P)
n ← length[T]
m ← length[P]
π ← COMPUTE-PREFIX-FUNCTION(P)
q ← 0                                      ▷ Number of characters matched
for i ← 1 to n                             ▷ Scan the text from left to right
    do while q > 0 and P[q + 1] ≠ T[i]
          do q ← π[q]                      ▷ Next character does not match
       if P[q + 1] = T[i]
          then q ← q + 1                   ▷ Next character matches
       if q = m                            ▷ Is all of P matched?
          then print “Pattern occurs with shift” i − m
               q ← π[q]                    ▷ Look for the next match

Computing the Prefix Function $\pi$
COMPUTE-PREFIX-FUNCTION(P)
m ← length[P]
π[1] ← 0
k ← 0
for q ← 2 to m
    do while k > 0 and P[k + 1] ≠ P[q]
          do k ← π[k]
       if P[k + 1] = P[q]
          then k ← k + 1
       π[q] ← k
return π

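Both KMP procedures translate almost line by line into Python. The sketch below makes a few assumptions of its own: `pi` is a list of length $m + 1$ whose index 0 is unused so that it lines up with the 1-based pseudocode, the pattern strings are 0-based (so $P[k+1]$ becomes `P[k]`), an empty pattern is rejected, and shifts are returned rather than printed.

```python
def compute_prefix_function(P: str) -> list[int]:
    """Return pi with pi[q] for q = 1..m (pi[0] is an unused placeholder)."""
    m = len(P)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:  # P[k+1] != P[q] in 1-based terms
            k = pi[k]
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(T: str, P: str) -> list[int]:
    """Return all valid shifts of P in T."""
    n, m = len(T), len(P)
    if m == 0:
        raise ValueError("pattern must be non-empty")  # assumption of this sketch
    pi = compute_prefix_function(P)
    q = 0  # number of characters matched
    shifts = []
    for i in range(1, n + 1):  # scan the text from left to right
        while q > 0 and P[q] != T[i - 1]:
            q = pi[q]          # next character does not match
        if P[q] == T[i - 1]:
            q += 1             # next character matches
        if q == m:             # all of P matched
            shifts.append(i - m)
            q = pi[q]          # look for the next match
    return shifts


print(compute_prefix_function("ababaca")[1:])   # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("abababacaba", "ababaca"))    # [2]
print(kmp_matcher("aaaa", "aa"))                # [0, 1, 2]
```
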
Run-Time Analysis
• Computing the prefix function takes time $\Theta(m)$
– the outer for loop runs $\Theta(m)$ times
– the amortized cost of the for loop body is $O(1)$
  • amortized analysis with a potential of $k$, corresponding to the current value of $k$ in the algorithm
  • in each iteration of the for loop, $k$ increases by at most 1
  • since $\pi[k] < k$, every iteration of the while loop decreases $k$; because $k$ never becomes negative, the total number of decreases is bounded by the total number of increases
• String matching takes time $\Theta(n)$
– with $q$ as the potential function, the same amortized argument as above can be made for the matching time