String Matching
Algorithm Design and Analysis (Week 7)

Battle Plan
• String matching problem
• Notation and terminology
• Four different algorithms

    Algorithm            Preprocessing time    Matching time
    Naïve                0                     $O((n-m+1)m)$
    Rabin-Karp           $\Theta(m)$           $O((n-m+1)m)$
    Finite automaton     $O(m|\Sigma|)$        $\Theta(n)$
    Knuth-Morris-Pratt   $\Theta(m)$           $\Theta(n)$

String-Matching Problem
• "Where's the hotel in idahotelescope?"
• Formalization of the string-matching problem
  – Text is an array $T[1..n]$ of length $n$
  – Pattern is an array $P[1..m]$ of length $m \le n$
  – $P$ and $T$ are drawn from a finite alphabet $\Sigma$; they are often called strings of characters
  – $P$ occurs with shift $s$ in $T$ if $0 \le s \le n-m$ and $T[s+1..s+m] = P[1..m]$

    Text T     i d a h o t e l e s c o p e
    Pattern P        h o t e l              (s = 3)

Notation and Terminology
• Strings
  – $\Sigma^*$: set of all finite-length strings with characters from $\Sigma$
  – $\varepsilon$: zero-length empty string, which also belongs to $\Sigma^*$
  – $|x|$: length of a string $x$
  – $xy$: concatenation of strings $x$ and $y$, which has length $|x| + |y|$
• Prefix and suffix
  – string $w$ is a prefix of a string $x$, denoted $w \sqsubset x$, if $x = wy$ for some string $y \in \Sigma^*$
  – string $w$ is a suffix of a string $x$, denoted $w \sqsupset x$, if $x = yw$ for some string $y \in \Sigma^*$
  – $P_k$ denotes the $k$-character prefix $P[1..k]$ of the string $P[1..m]$, and thus $P_0 = \varepsilon$ and $P_m = P = P[1..m]$
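
The definitions above translate almost directly into code. The following is a minimal Python sketch, not part of the original slides; the function names `is_prefix`, `is_suffix`, and `occurs_with_shift` are illustrative assumptions. Python strings are 0-indexed, so the slide's $T[s+1..s+m]$ corresponds to `text[s:s + m]`.

```python
def is_prefix(w: str, x: str) -> bool:
    """True if w is a prefix of x, i.e. x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """True if w is a suffix of x, i.e. x = y + w for some string y."""
    return x.endswith(w)


def occurs_with_shift(pattern: str, text: str, s: int) -> bool:
    """True if pattern occurs in text with shift s, where 0 <= s <= n - m."""
    n, m = len(text), len(pattern)
    return 0 <= s <= n - m and text[s:s + m] == pattern


if __name__ == "__main__":
    # The slide's example: "hotel" occurs in "idahotelescope" with shift 3.
    print(occurs_with_shift("hotel", "idahotelescope", 3))
    # The empty string is both a prefix and a suffix of every string.
    print(is_prefix("", "abba"), is_suffix("", "abba"))
```

With these helpers, the valid shifts are exactly the values of `s` in `range(len(text) - len(pattern) + 1)` for which `occurs_with_shift` returns `True`.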
Observations
• Strings
  – $|\varepsilon| = 0$
• Prefix and suffix
  – for any string $x$, $\varepsilon \sqsubset x$ and $\varepsilon \sqsupset x$
  – if $w \sqsubset x$ or $w \sqsupset x$, then $|w| \le |x|$
  – for any two strings $x$ and $y$ and any character $a$, $x \sqsubset y \Leftrightarrow ax \sqsubset ay$ and $x \sqsupset y \Leftrightarrow xa \sqsupset ya$
  – both $\sqsubset$ and $\sqsupset$ are transitive relations
• Reformulated string-matching problem
  – find all shifts $s$ in the range $0 \le s \le n-m$ such that $P \sqsupset T_{s+m}$

Examples
• Assume $\Sigma = \{a, b, c\}$
  – $\Sigma^* = \{\varepsilon, a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, \ldots\}$
  – $x = ab$ and $y = ba$
    – $|x| = |y| = 2$
    – $xy = abba$
    – $|xy| = |x| + |y| = 4$
    – $\varepsilon \sqsubset abba$ and $\varepsilon \sqsupset abba$
    – $a \sqsubset abba$ and $a \sqsupset abba$
    – $ab \sqsubset abba$ and $ba \sqsupset abba$
    – $abb \sqsubset abba$ and $bba \sqsupset abba$
    – $abba \sqsubset abba$ and $abba \sqsupset abba$

Overlapping-Suffix Lemma
• Assume $x$, $y$, and $z$ are strings such that $x \sqsupset z$ and $y \sqsupset z$
  – if $|x| \le |y|$, then $x \sqsupset y$
  – if $|x| \ge |y|$, then $y \sqsupset x$
  – if $|x| = |y|$, then $x = y$
• Proof
  [Figure: the three cases, with $x$ and $y$ drawn as suffixes of $z$]

Naïve String-Matching Algorithm

    NAÏVE-STRING-MATCHER(T, P)
    1  n ← length[T]
    2  m ← length[P]
    3  for s ← 0 to n − m
    4      do if P[1..m] = T[s+1..s+m]
    5            then print "Pattern occurs with shift" s

• Comparing two strings (line 4) takes time $\Theta(t+1)$
  – $t$ denotes the number of matching characters
  – the "+1" caters for strings that mismatch on the first character (the comparison still costs time, so the bound is not $\Theta(0)$)
• Naïve algorithm takes time $O((n-m+1)m)$
  – the bound is tight in the worst case: $\Theta((n-m+1)m)$
  – consider matching the text $a^n$ against the pattern $a^m$
  – if $m = \lfloor n/2 \rfloor$, the worst-case running time is $\Theta(n^2)$

Example
• Graphical interpretation
  – slide the pattern over the text in steps of length 1
  – note for which shifts all pattern characters equal the corresponding text characters

    s = 0    a c a a b c
             a a b
    s = 1    a c a a b c
               a a b
    s = 2    a c a a b c      (match)
                 a a b
    s = 3    a c a a b c
                   a a b

Rabin-Karp Algorithm
• Motivation
  – comparing numbers is "cheaper" than matching strings
  – represent text and pattern as numbers
  – use number-theoretic notions to match strings
• Assumptions and notation
  – $\Sigma_{10} = \{0, 1, 2, \ldots, 9\}$, but in the general case each character is a digit in radix-$d$ notation, where $d = |\Sigma|$
  – $p$ denotes the value corresponding to $P[1..m]$
  – given $T[1..n]$, $t_s$ denotes the value of the length-$m$ substring $T[s+1..s+m]$, for $s = 0, 1, \ldots, n-m$
  – $t_s = p \Leftrightarrow T[s+1..s+m] = P[1..m]$

Rabin-Karp Algorithm
• Goal
  – compute $p$ in time $\Theta(m)$
  – compute all $t_s$ values in a total time of $\Theta(n-m+1)$
  – get all valid shifts in time $\Theta(m) + \Theta(n-m+1) = \Theta(n)$
• Computing $p$ from $P[1..m]$
  – can be done in $\Theta(m)$ using Horner's rule
  – $p = P[m] + d\,(P[m-1] + d\,(P[m-2] + \cdots + d\,(P[2] + d\,P[1]) \cdots))$

Rabin-Karp Algorithm
• Computing $t_0$ from $T[1..m]$
  – use Horner's rule to compute $t_0$ in time $\Theta(m)$
• Computing $t_1, t_2, \ldots, t_{n-m}$ from $T[1..n]$
  – can be done in time $\Theta(n-m)$, since $t_{s+1}$ can be computed from $t_s$ in constant time
  – $t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
• Example
  – assume $\Sigma_{10}$, $T = [3,1,4,1,5,9,2,6]$, $P = [1,4,1]$, and $m = 3$
  – $p = 141$, $t_0 = 314$
  – $t_1 = 10\,(t_0 - 10^2\,T[1]) + T[4] = 10\,(314 - 300) + 1 = 141$

All's Well That Ends Well
• Yes, Bill! But we're not done yet...
  – $p$ and $t_s$ may be too large to work with conveniently
  – assuming that arithmetic operations on these numbers take "constant time" is unreasonable
• Simple solution
  – compute $p$ and all $t_s$ modulo a suitable modulus $q$
  – the one extra operation per step does not change the asymptotic compute time
  – $q$ is typically chosen as a prime such that $dq$ fits within one computer word
  – $t_{s+1} = (d\,(t_s - T[s+1]\,h) + T[s+m+1]) \bmod q$, where $h \equiv d^{m-1} \pmod q$

Make It As Simple As Possible But Not Simpler
• Okay, Al! Maybe we went too far this time...
  – $t_s \equiv p \pmod q$ does not imply $t_s = p$
  – $t_s \not\equiv p \pmod q$ does imply $t_s \ne p$
• Example
  – assume $\Sigma_{10}$, $P = 31415$, and $q = 13$
  – $31415 \equiv 7 \pmod{13}$
  – $67399 \equiv 7 \pmod{13}$
• Solution
  – use the negative test as a fast heuristic to rule out invalid shifts
  – a positive test must be validated by an explicit comparison to sort out spurious hits
  – if $q$ is large, spurious hits are likely to occur less frequently

Rabin-Karp Algorithm

    RABIN-KARP-MATCHER(T, P, d, q)
    n ← length[T]
    m ← length[P]
    h ← d^(m−1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m                                  ▷ Preprocessing
        do p ← (d·p + P[i]) mod q
           t_0 ← (d·t_0 + T[i]) mod q
    for s ← 0 to n − m                              ▷ Matching
        do if p = t_s
              then if P[1..m] = T[s+1..s+m]
                      then print "Pattern occurs with shift" s
           if s < n − m
              then t_(s+1) ← (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

Run-Time Analysis
• Worst case
  – $\Theta(m)$ to preprocess and $\Theta((n-m+1)m)$ to match
• Heuristic analysis of the average case
  – "modulo $q$" acts as a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$
  – the number of spurious hits is expected to be $O(n/q)$, since the probability of $t_s \equiv p \pmod q$ can be estimated as $1/q$
  – the expected matching time of the Rabin-Karp algorithm is $O(n) + O(m\,(v + n/q))$, where the first term covers the shifts with no match, the second term covers the matches (valid shifts and spurious hits), and $v$ is the number of valid shifts
  – if $v = O(1)$ and $q \ge m$, the running time is $O(n+m)$, and since $m \le n$ it is even expected to be $O(n)$!
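
The Rabin-Karp matcher described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the slides' own code: characters are mapped to numbers with `ord`, the defaults `d = 256` and `q = 2**31 - 1` (a Mersenne prime) are choices made here, the function name `rabin_karp` is hypothetical, and an empty pattern simply yields no shifts. Indices are 0-based, so a reported shift `s` means `text[s:s + m] == pattern`.

```python
from typing import List


def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 2**31 - 1) -> List[int]:
    """Return all shifts s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []  # simplification chosen for this sketch

    h = pow(d, m - 1, q)  # h = d^(m-1) mod q
    p = 0                 # value of the pattern, modulo q
    t = 0                 # value of the current text window, modulo q

    # Preprocessing: Horner's rule, reduced modulo q at every step.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    shifts = []
    for s in range(n - m + 1):
        # Equal values are only a candidate; compare explicitly to
        # rule out a spurious hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop the leading character, shift, and append the next one.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp("idahotelescope", "hotel"))
    # A tiny modulus makes spurious hits more likely; the explicit
    # comparison still keeps the result correct.
    print(rabin_karp("acaabc", "aab", q=13))
```

Python's `%` operator returns a non-negative result for a positive modulus, so the subtraction in the rolling update needs no extra correction here; in languages where the remainder can be negative, adding `q` before reducing is a common adjustment.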
String Matching with Finite Automata
• Idea
  – build a finite automaton to scan $T$ for all occurrences of $P$
  – examine each text character exactly once and in constant time
  – matching time is $\Theta(n)$, but preprocessing time can be large
• A finite automaton $M$ is a 5-tuple $(Q, q_0, A, \Sigma, \delta)$
  – $Q$ is a finite set of states
  – $q_0 \in Q$ is the start state
  – $A \subseteq Q$ is a distinguished set of accepting states
  – $\Sigma$ is a finite input alphabet
  – $\delta$ is a function from $Q \times \Sigma$ into $Q$, called the transition function of $M$

String Matching with Finite Automata
• Finite automaton
  – begins in state $q_0$ and reads one input character $a$ at a time
  – transitions from state $q$ into state $\delta(q, a)$
  – accepts the string read so far if the current state $q \in A$
  – rejects the string read so far if the current state $q \notin A$
• A finite automaton induces a final-state function $\phi$
  – $\phi: \Sigma^* \to Q$, such that $q = \phi(w)$ is the state $M$ is in after scanning the string $w$
  – $M$ accepts a string $w$ if and only if $\phi(w) \in A$
  – recursive definition of $\phi$:
    $\phi(\varepsilon) = q_0$
    $\phi(wa) = \delta(\phi(w), a)$ for $w \in \Sigma^*$, $a \in \Sigma$

String-Matching Automata
• For every pattern $P[1..m]$, we need to construct a string-matching automaton in preprocessing
  – the state set $Q$ is $\{0, 1, \ldots, m\}$, where the start state $q_0$ is state 0 and state $m$ is the only accepting state
  – the transition function is defined as $\delta(q, a) = \sigma(P_q a)$ for any state $q$ and character $a$
• Suffix function $\sigma$ for a given pattern $P[1..m]$
  – $\sigma: \Sigma^* \to \{0, 1, \ldots, m\}$ such that $\sigma(x) = \max\{k : P_k \sqsupset x\}$ is the length of the longest prefix of $P$ that is a suffix of $x$
  – for a pattern $P$ of length $m$, $\sigma(x) = m$ if and only if $P \sqsupset x$
  – if $x \sqsupset y$, then $\sigma(x) \le \sigma(y)$

Example
• Assume pattern $P = ababaca$
  [Figure: state-transition diagram with states 0 to 7; the forward transitions are labeled a, b, a, b, a, c, a]
  – 8 states and a "spine" of forward transitions
  – $\delta(1, a) = 1$, since $P_1 a = aa$ and $\sigma(P_1 a) = 1$
  – $\delta(3, a) = 1$, since $P_3 a = abaa$ and $\sigma(P_3 a) = 1$
  – $\delta(5, a) = 1$, since $P_5 a = ababaa$ and $\sigma(P_5 a) = 1$
  – $\delta(5, b) = 4$, since $P_5 b = ababab$ and $\sigma(P_5 b) = 4$
  – $\delta(7, a) = 1$, since $P_7 a = ababacaa$ and $\sigma(P_7 a) = 1$
  – $\delta(7, b) = 2$, since $P_7 b = ababacab$ and $\sigma(P_7 b) = 2$

String-Matching Automata

    FINITE-AUTOMATON-MATCHER(T, P, Σ, m)
    n ← length[T]
    δ ← COMPUTE-TRANSITION-FUNCTION(P, Σ)
    q ← 0
    for i ← 1 to n
        do q ← δ(q, T[i])
           if q = m
              then print "Pattern occurs with shift" i − m

• Matching time on a text of length $n$ is $\Theta(n)$
  – simple loop structure with $n$ iterations
  – does not account for the time required to compute the transition function $\delta$

Computing the Transition Function δ

    COMPUTE-TRANSITION-FUNCTION(P, Σ)
    m ← length[P]
    for q ← 0 to m
        do for each character a ∈ Σ
              do k ← min(m + 1, q + 2)
                 repeat k ← k − 1
                 until P_k ⊐ P_q a
                 δ(q, a) ← k
    return δ

• Computing the transition function takes time $O(m^3 |\Sigma|)$
  – the outer two for loops contribute a factor of $m|\Sigma|$
  – the inner repeat loop can run at most $m+1$ times
  – the test $P_k \sqsupset P_q a$ can require up to $m$ comparisons

Knuth-Morris-Pratt Algorithm
• Idea
  – avoid both computing the transition function $\delta$ in time $O(m|\Sigma|)$ and testing useless shifts as in the naïve algorithm
  – use an auxiliary function $\pi[1..m]$ that can be pre-computed from the pattern in time $\Theta(m)$
  – the array $\pi$ allows $\delta$ to be computed efficiently "on the fly" as needed, in the amortized sense
• Prefix function $\pi$ for a pattern $P[1..m]$
  – $\pi: \{1, 2, \ldots, m\} \to \{0, 1, \ldots, m-1\}$ such that $\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
  – $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$

Example
• What's the next possible shift that should be tested?
  [Figure: text $T$ = b a c b a b a b a a b c b a b and pattern $P$ = a b a b a c a in three panels: the current shift $s$ with $q$ matched characters; a shift marked "Bad Idea!"; and the shift $s + (q - \pi[q])$. The matched prefix $P_q$ is the "knowledge horizon", and $P_k$ is its longest prefix that is also a proper suffix of it.]

    q      1  2  3  4  5  6  7
    P[q]   a  b  a  b  a  c  a
    π[q]   0  0  1  2  3  0  1

Knuth-Morris-Pratt Algorithm

    KMP-MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    π ← COMPUTE-PREFIX-FUNCTION(P)
    q ← 0                                           ▷ Number of characters matched
    for i ← 1 to n                                  ▷ Scan the text from left to right
        do while q > 0 and P[q+1] ≠ T[i]
              do q ← π[q]                           ▷ Next character does not match
           if P[q+1] = T[i]
              then q ← q + 1                        ▷ Next character matches
           if q = m                                 ▷ Is all of P matched?
              then print "Pattern occurs with shift" i − m
                   q ← π[q]                         ▷ Look for the next match

Computing the Prefix Function π

    COMPUTE-PREFIX-FUNCTION(P)
    m ← length[P]
    π[1] ← 0
    k ← 0
    for q ← 2 to m
        do while k > 0 and P[k+1] ≠ P[q]
              do k ← π[k]
           if P[k+1] = P[q]
              then k ← k + 1
           π[q] ← k
    return π

Run-Time Analysis
• Computing the prefix function takes time $\Theta(m)$
  – the outer for loop runs $\Theta(m)$ times
  – the amortized cost of the for loop body is $O(1)$
    • amortized analysis with a potential of $k$, corresponding to the current state of $k$ in the algorithm
    • in each iteration of the for loop, $k$ increases by at most 1
    • since $\pi[k] < k$, every iteration of the while loop decreases $k$, so each decrease is paid for by an earlier increase of $k$
• String matching takes time $\Theta(n)$
  – with $q$ as the potential function, the same amortized argument as above can be made for the matching time
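
The two KMP procedures above can be sketched in Python as follows. This is an illustrative translation, not code from the slides: the function names are assumptions, an empty pattern yields no shifts by choice, and all indices are 0-based, so `pi[q - 1]` here holds the slide's $\pi[q]$ and a reported shift `s` means `text[s:s + m] == pattern`.

```python
from typing import List


def compute_prefix_function(pattern: str) -> List[int]:
    """pi[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of pattern[:i + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # fall back to the next shorter candidate
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> List[int]:
    """Return all shifts s at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []  # simplification chosen for this sketch
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]  # next character does not match
        if pattern[q] == ch:
            q += 1         # next character matches
        if q == m:         # all of the pattern is matched
            shifts.append(i - m + 1)
            q = pi[q - 1]  # look for the next match
    return shifts


if __name__ == "__main__":
    # Reproduces the prefix-function table for P = ababaca.
    print(compute_prefix_function("ababaca"))
    print(kmp_matcher("abababacaba", "ababaca"))
```

The text index `i` never moves backwards; only `q` falls back through `pi`, which is where the amortized $\Theta(n)$ matching bound comes from.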