Joint Advanced Student School 2004
Complexity Analysis of String Algorithms

Sequential Pattern Matching: Analysis of Knuth-Morris-Pratt type algorithms using the Subadditive Ergodic Theorem

14 July 2017
Tobias Reichl
JASS04 - Sequential Pattern Matching

Overview
1. Pattern Matching
   • Sequential Algorithms
   • Knuth-Morris-Pratt-Algorithm
2. Probabilistic tools
   • Subadditive Ergodic Theorem
   • Martingales and Azuma's Inequality
3. Analysis of KMP-Algorithms
   • Properties of KMP
   • Establishing subadditivity
   • Analysis

Pattern Matching
(Figure: pattern p = abcde aligned against text t = xxxxxabxxxabcxxxabcde; one pattern-text comparison with M(l,k) = 1 and the alignment position AP are marked.)
• Text $t = t_1^n$, pattern $p = p_1^m$.
• Comparison: $M(l,k) = 1$ if $t_l$ is compared to $p_k$, and $M(l,k) = 0$ otherwise.
• Alignment position: a position AP with $M(AP + (k-1), k) = 1$ for some $k$.

Sequential Algorithms - Definition
i. Semi-sequential: the APs are non-decreasing.
ii. Strongly semi-sequential: (i), and the comparisons $M(l_i, k_i)$ define non-decreasing text positions $l_i$.
iii. Sequential: (i), and $M(l,k) = 1 \Rightarrow t_{l-(k-1)}^{l-1} = p_1^{k-1}$, i.e. a text symbol is compared only if it follows a prefix of the pattern. Example: p = abcde, t = xxxxxabxxxabcxxxabcde.
iv. Strongly sequential: (i), (ii) and (iii).

Example: Naive / brute force algorithm
(Figure: the pattern abcde is shifted by +1 at a time along the text xxxxxabxxxabcxxxabcde.)
• Every text position is an alignment position.
• The text is scanned until...
  – the pattern is found: then done.
  – a mismatch occurs: then shift by one and retry.
• Sequential algorithm.

Knuth-Morris-Pratt type algorithms (1)
(Figure: the pattern ababcde is shifted by +S along the text xxxxxabxxxabcxxxabcde.)
• Idea (Morris-Pratt): disregard APs already known not to be followed by a prefix of p.
• Knowledge used:
  – the already processed part of the pattern
  – pre-processing of p.
• Strongly sequential algorithm.
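The naive scan and the Morris-Pratt idea above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the helper names `border_table`, `naive` and `morris_pratt` are hypothetical, positions are 0-based (the slides count from 1), both functions report all occurrences instead of stopping at the first one, and the pattern is assumed to be non-empty. Each executed comparison corresponds to one pair with M(l,k) = 1.

```python
def border_table(p):
    """border[k] = length of the longest proper border of p[:k] (border[0] = -1)."""
    border = [-1] + [0] * len(p)
    j = -1
    for k in range(len(p)):
        while j >= 0 and p[j] != p[k]:
            j = border[j]
        j += 1
        border[k + 1] = j
    return border


def naive(t, p):
    """Brute force: every text position is tried as an alignment position."""
    matches, comparisons = [], 0
    for ap in range(len(t) - len(p) + 1):
        k = 0
        while k < len(p):
            comparisons += 1              # t[ap + k] is compared to p[k]
            if t[ap + k] != p[k]:
                break                     # mismatch: shift by one and retry
            k += 1
        if k == len(p):
            matches.append(ap)
    return matches, comparisons


def morris_pratt(t, p):
    """Morris-Pratt scan: skip alignment positions ruled out by the matched prefix."""
    border = border_table(p)
    matches, comparisons = [], 0
    j = 0                                 # number of pattern symbols currently matched
    for l, symbol in enumerate(t):
        while True:
            comparisons += 1              # t[l] is compared to p[j]
            if p[j] == symbol:
                j += 1
                break
            if j == 0:
                break                     # no prefix matched: move on in the text
            j = border[j]                 # shift the pattern by j - border[j]
        if j == len(p):
            matches.append(l - len(p) + 1)
            j = border[j]
    return matches, comparisons


if __name__ == "__main__":
    text, pattern = "xxxxxabxxxabcxxxabcde", "abcde"
    print(naive(text, pattern))
    print(morris_pratt(text, pattern))
```

In this sketch the text pointer never moves backwards, and the total number of comparisons of `morris_pratt` stays below 2n for a text of length n, which is the kind of linear behaviour the later slides make precise.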

Knuth-Morris-Pratt type algorithms (2)
(Figures: in both variants the pattern ababcde is shifted by S along the text xxxxxabxxxabcxxxabcde.)
• Morris-Pratt:
  $S = \min\{\, k;\ \min\{ s > 0 : p_{s+1}^{k-1} = p_1^{k-(s+1)} \} \,\}$
• Knuth-Morris-Pratt (KMP also skips mismatching letters):
  $S = \min\{\, k;\ \min\{ s : p_{s+1}^{k-1} = p_1^{k-(s+1)} \text{ and } p_k^k \ne p_{k-s}^{k-s} \} \,\}$

Pattern Matching - Complexity
• Number of comparisons on the text segment $[r,s]$:
  $c_{r,s}(t,p) = \sum_{l \in [r,s]} \sum_{k \in [1,m]} M(l,k)$
• Overall complexity: $c_{1,n} =: c_n$
• If the pattern or the text is a realization of a random sequence, we write $C_n$.
• Question: what is the complexity of KMP?

Subadditivity – Deterministic Sequence
Fekete (1923)
• Subadditivity: $x_{m+n} \le x_m + x_n \ \Rightarrow\ \lim_{n\to\infty} \frac{x_n}{n} = \inf_{m \ge 1} \frac{x_m}{m}$
• Superadditivity: $x_{m+n} \ge x_m + x_n \ \Rightarrow\ \lim_{n\to\infty} \frac{x_n}{n} = \sup_{m \ge 1} \frac{x_m}{m}$

Example: Longest Common Subsequence
  ababcafbcdabcde
  abcdeabcdfabcab
  LCS: "abcabcdabc" (10)
Split after position 8:
  ababcafb | cdabcde
  abcdeabc | dfabcab
  LCS: "abcab" (5), "dabc" (4)
• Definition:
  $L_{1,n} = \max\{ K : X_{i_k} = Y_{j_k} \text{ for } 1 \le k \le K, \text{ where } 1 \le i_1 < i_2 < \dots < i_K \le n \text{ and } 1 \le j_1 < j_2 < \dots < j_K \le n \}$
• Superadditive: $L_{1,n} \ge L_{1,m} + L_{m,n}$
• Hence, with $a_n = E[L_{1,n}]$:
  $\lim_{n\to\infty} \frac{a_n}{n} = \sup_{m \ge 1} \frac{E[L_m]}{m} = \alpha \overset{?}{=} 0.8284$ (conjectured by Steele in 1982)

Subadditivity – "Almost subadditive"
De Bruijn and Erdős (1952)
• Let $c_n$ be a positive and non-decreasing sequence with $\sum_{k=1}^{\infty} \frac{c_k}{k^2} < \infty$.
• "Almost subadditive": $x_{m+n} \le x_m + x_n + c_{m+n} \ \Rightarrow\ \lim_{n\to\infty} \frac{x_n}{n}$ exists.
• Unlike in the exactly subadditive case, this limit need not equal $\inf_{m \ge 1} x_m/m$ (for example $x_n = n - \sqrt{n}$ with $c_n = \sqrt{n}/2$ has limit 1, but $x_1/1 = 0$).

Subadditive Ergodic Theorem
Kingman (1976), Liggett (1985)
i. $X_{0,n} \le X_{0,m} + X_{m,n}$
ii. $\forall k$: $\{X_{nk,(n+1)k},\ n \ge 1\}$ is a stationary sequence
iii. The distribution of $\{X_{m,m+k},\ k \ge 1\}$ does not depend on $m$
iv. $E[X_{0,1}] < \infty$ and $E[X_{0,n}] \ge c_0 n$, where $c_0 > -\infty$
Then:
  $\lim_{n\to\infty} \frac{E[X_{0,n}]}{n} = \inf_{m \ge 1} \frac{E[X_{0,m}]}{m} =: \gamma$
and $X := \lim_{n\to\infty} \frac{X_{0,n}}{n}$ exists (a.s.) with $E[X] = \gamma$. If the stationary sequences in (ii) are ergodic, then $X = \gamma$ (a.s.).

Almost Subadditive Ergodic Theorem
Derriennic (1983)
• Subadditivity can be relaxed to
  $X_{0,n} \le X_{0,m} + X_{m,n} + A_n$ with $\lim_{n\to\infty} \frac{E[A_n]}{n} = 0$
• Then, too: $\lim_{n\to\infty} \frac{X_{0,n}}{n} = \gamma$ (a.s.)
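The superadditivity of the longest common subsequence from the LCS slide can be checked on the slide's own strings. The following Python sketch uses the standard dynamic program for the LCS length; the function name `lcs_length` and the split index `m = 8` are choices made here for illustration (the slide reports the values 10, 5 and 4 for the whole strings, the first parts and the second parts).

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


X = "ababcafbcdabcde"
Y = "abcdeabcdfabcab"
m = 8  # split point used on the slide

whole = lcs_length(X, Y)
left = lcs_length(X[:m], Y[:m])
right = lcs_length(X[m:], Y[m:])

# Superadditivity: concatenating common subsequences of the two parts
# gives a common subsequence of the whole strings.
print(whole, left, right, whole >= left + right)
```

Superadditivity holds for every split point, because a common subsequence of the left parts followed by one of the right parts is always a common subsequence of the full strings.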

Martingales
• A sequence $Y_n = f(X_1, \dots, X_n)$, $n \ge 0$, is a martingale with respect to the filtration $\mathcal{F}_n = \sigma(X_0, \dots, X_n)$ if for all $n \ge 0$:
  $E|Y_n| < \infty$ and $E[Y_{n+1} \mid X_0, X_1, \dots, X_n] = E[Y_{n+1} \mid \mathcal{F}_n] = Y_n$
• $E[Y_{n+1} \mid \mathcal{F}_n]$ defines a random variable depending on the knowledge contained in $X_1, \dots, X_n$.

Martingale Differences
• The martingale difference is defined as
  $D_n = Y_n - Y_{n-1}$, so that $Y_n = Y_0 + \sum_{i=1}^{n} D_i$
• Observe:
  $E[D_{n+1} \mid \mathcal{F}_n] = E[Y_{n+1} \mid \mathcal{F}_n] - E[Y_n \mid \mathcal{F}_n] = Y_n - Y_n = 0$

Azuma's Inequality (1)
• Let $Y_n = f_n(X_1, \dots, X_n)$ be a martingale.
• Define the martingale difference as
  $D_i = E[Y_n \mid \mathcal{F}_i] - E[Y_n \mid \mathcal{F}_{i-1}]$
  (the mean of the same element, but depending on different knowledge)
• Observe: $E[Y_n \mid \mathcal{F}_n] = Y_n$ and $E[Y_n \mid \mathcal{F}_0] = E[Y_n]$, so
  $\sum_{i=1}^{n} D_i = E[Y_n \mid \mathcal{F}_n] - E[Y_n \mid \mathcal{F}_0] = Y_n - E[Y_n]$
  (deviation from the mean)

Hoeffding's Inequality
• Let $Y_n$ be a martingale.
• Let there exist constants $c_n$ such that for all $n \ge 0$:
  $|Y_n - Y_{n-1}| = |D_n| \le c_n$
• Then:
  $\Pr\{ |Y_n - Y_0| \ge x \} = \Pr\left\{ \left| \sum_{i=1}^{n} D_i \right| \ge x \right\} \le 2 \exp\left( - \frac{x^2}{2 \sum_{i=1}^{n} c_i^2} \right)$

Azuma's Inequality (2)
• Summary:
  – If $D_i$ is bounded, we know how to assess the deviation from the mean.
  – So now we need a bound on $D_i$.
• Trick: Let $\hat{X}_i$ be an independent copy of $X_i$.
• Then:
  $E[f_n(X_1, \dots, X_i, \dots, X_n) \mid \mathcal{F}_{i-1}] = E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n) \mid \mathcal{F}_i]$

Azuma's Inequality (3)
• Hence:
  $D_i = E[f_n(X_1, \dots, X_i, \dots, X_n) \mid \mathcal{F}_i] - E[f_n(X_1, \dots, X_i, \dots, X_n) \mid \mathcal{F}_{i-1}]$
  $\quad\ = E[f_n(X_1, \dots, X_i, \dots, X_n) \mid \mathcal{F}_i] - E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n) \mid \mathcal{F}_i]$
• And we can postulate: $|D_i| \le c_i$

Azuma's Inequality (4)
• Let $Y_n = f_n(X_1, \dots, X_n)$ be a martingale.
• If there exist constants $c_i$ such that
  $|f_n(X_1, \dots, X_i, \dots, X_n) - f_n(X_1, \dots, \hat{X}_i, \dots, X_n)| \le c_i$
  where $\hat{X}_i$ is an independent copy of $X_i$,
• Then:
  $\Pr\{ |Y_n - E[Y_n]| \ge x \} = \Pr\{ |f_n(X_1, \dots, X_i, \dots, X_n) - E[f_n(X_1, \dots, \hat{X}_i, \dots, X_n)]| \ge x \} \le 2 \exp\left( - \frac{x^2}{2 \sum_{i=1}^{n} c_i^2} \right)$

KMP: Unavoidable alignment positions
• A position $i$ in the text is called an unavoidable AP if, for any $r, l$ such that $r \le i$ and $l \ge i + m$, it is an AP when the algorithm is run on $t_r^l$.
• KMP-like algorithms have the same set of unavoidable alignment positions
  $U = \bigcup_{l=1}^{n} \{ U_l \}$, where $U_l = \min\{ \min_{1 \le k \le l} \{ k : t_k^l \preceq p \},\ l + 1 \}$
  and $t_k^l \preceq p$ means that $t_k^l$ is a prefix of $p$.
• Example: (Figure: pattern abcde against the text xxxxxabxxxabcxxxabcde, with a text position $l$ and the corresponding $U_l$ marked.)

Pattern Matching: l-convergence
• An algorithm is l-convergent if there exists an increasing sequence of unavoidable alignment positions $\{U_i\}_{i=1}^{n}$ satisfying $U_{i+1} - U_i \le l$.
• l-convergence indicates the maximum size of "jumps" for an algorithm.

KMP: Establishing m-convergence
• Let AP be an alignment position.
• Define: $l = AP + m$
• As $|p| = m$: $l - (m - 1) \le U_l \le l$
• Hence: $U_l - AP \le m$, and so KMP-like algorithms are m-convergent.

KMP: Establishing subadditivity (1)
• If $c_n$ (the number of comparisons) is subadditive, we can prove linear complexity of KMP-like algorithms.
• We have to show that $c_n$ is (almost) subadditive:
  $c_{1,n} \le c_{1,r} + c_{r,n} + a$
• Approach: an l-convergent sequential algorithm satisfies
  $|c_{1,n} - c_{1,r} - c_{r,n}| \le m^2 + lm$

KMP: Establishing subadditivity (2)
• Proof:
  – $U_r$: the smallest unavoidable AP greater than $r$.
  – We split $c_{1,n} - c_{1,r} - c_{r,n}$ into $c_{1,n} - c_{1,r} - c_{U_r,n}$ and $c_{r,n} - c_{U_r,n}$:
    $c_{1,n} - c_{1,r} - c_{r,n} = (c_{1,n} - c_{1,r} - c_{U_r,n}) - (c_{r,n} - c_{U_r,n})$
(Figure: the text axis with the positions $r$ and $U_r$ marked.)

KMP: Establishing subadditivity (3)
(Figure: the text around $r$ and $U_r$, split into comparisons contributing to $c_{1,n}$ and $c_{1,r}$, comparisons contributing to $c_{1,n}$ only (the regions $S_1$ and $S_2$), and comparisons contributing to $c_{1,n}$ and $c_{U_r,n}$.)
• Comparisons done after $r$ with AP before $r$:
  $S_1 = \sum_{AP < r} \sum_{i \ge r} M(i,\ i - AP + 1) \le m^2$
• Comparisons with AP between $r$ and $U_r$:
  $S_2 = \sum_{r \le AP < U_r} \sum_{i \le m} M(AP + (i - 1),\ i) \le lm$
• No more than $m$ comparisons can be saved at $U_r$.

KMP: Establishing subadditivity (4)
(Figure: the text around $r$ and $U_r$, split into comparisons contributing to $c_{r,n}$ only (the region $S_3$) and comparisons contributing to $c_{r,n}$ and $c_{U_r,n}$.)
• Comparisons with AP between $r$ and $U_r$:
  $S_3 = \sum_{AP = r}^{U_r - 1} \sum_{i} M(AP + (i - 1),\ i) \le lm$
• No more than $m$ comparisons can be saved at $U_r$.

KMP: Establishing subadditivity (5)
• So, using the bounds on $S_1$, $S_2$ and $S_3$, we are able to bound:
  $|c_{1,n} - c_{1,r} - c_{r,n}| \le m^2 + lm$
• We have shown that $c_n$ is (almost) subadditive:
  $c_{1,n} \le c_{1,r} + c_{r,n} + a$
• Now we are able to apply the Subadditive Ergodic Theorem.

KMP: Different Modeling Assumptions
• Deterministic model: text and pattern are non-random.
• Semi-random model: the text is a realization of a stationary and ergodic sequence, the pattern is given.
• Stationary model: both text and pattern are realizations of a stationary and ergodic sequence.

KMP: Applying the Subadditive Ergodic Theorem
• We have shown: $c_n$ is (almost) subadditive.
• Deterministic model:
  $\lim_{n\to\infty} \frac{\max_t c_n(t,p)}{n} = \alpha_1(p)$
• Semi-random model:
  $\lim_{n\to\infty} \frac{C_n(p)}{n} = \alpha_2(p)$ (a.s.)
  $\lim_{n\to\infty} \frac{E_t[C_n(p)]}{n} = \alpha_2(p)$
• Stationary model:
  $\lim_{n\to\infty} \frac{E_{t,p}[C_n]}{n} = \alpha_3$

KMP: Applying Azuma's Inequality
• $C_n$ satisfies:
  $|C_n(T_1, \dots, T_i, \dots, T_n) - C_n(T_1, \dots, \hat{T}_i, \dots, T_n)| \le 2m^2$
  where $\hat{T}_i$ is an independent copy of $T_i$.
• So, using Azuma's Inequality with $c_i = 2m^2$ and $x = \varepsilon n$, where $\alpha$ denotes the linearity constant of the model under consideration:
  $\Pr\{ |C_n - n\alpha| \ge \varepsilon n \} \le 2 \exp\left( - \frac{\varepsilon^2 n^2 (1 + o(1))}{2 n (2m^2)^2} \right) = 2 \exp\left( - \frac{\varepsilon^2 n}{8 m^4} (1 + o(1)) \right)$
• $C_n$ is concentrated around its mean: $E[C_n] = n\alpha(1 + o(1))$

Conclusion
• Using the Subadditive Ergodic Theorem we can show that there exists a linearity constant for the worst case and for the average case, i.e. KMP has linear complexity.
• The Subadditive Ergodic Theorem proves the existence of this constant but says nothing about how to compute it.
• Using Azuma's Inequality we can show that the number of comparisons is well concentrated around its mean.
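The linear growth and the bounded-difference property used for Azuma's Inequality can be observed empirically. The Python sketch below rests on several assumptions made here, not taken from the slides: it uses the Morris-Pratt variant, an i.i.d. uniform text over the alphabet "ab" (a simple special case of a stationary ergodic source), the example pattern "ababb", and hypothetical helper names `mp_comparisons`, `ratios` and `max_one_symbol_effect`. It estimates C_n / n over random texts and measures how much C_n changes when a single text symbol is resampled.

```python
import random


def mp_comparisons(t, p):
    """Number of text-pattern comparisons of a Morris-Pratt scan of t for p."""
    m = len(p)
    border = [-1] + [0] * m               # border table of the pattern
    j = -1
    for k in range(m):
        while j >= 0 and p[j] != p[k]:
            j = border[j]
        j += 1
        border[k + 1] = j
    comparisons, j = 0, 0
    for symbol in t:
        while True:
            comparisons += 1
            if p[j] == symbol:
                j += 1
                break
            if j == 0:
                break
            j = border[j]
        if j == m:                        # full match: continue with the border
            j = border[j]
    return comparisons


def ratios(p, n, trials, alphabet="ab", seed=0):
    """C_n / n for independent random texts of length n."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        t = "".join(rng.choice(alphabet) for _ in range(n))
        out.append(mp_comparisons(t, p) / n)
    return out


def max_one_symbol_effect(p, n, trials, alphabet="ab", seed=1):
    """Largest observed change of C_n when one text symbol is resampled."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        t = [rng.choice(alphabet) for _ in range(n)]
        t_hat = list(t)
        t_hat[rng.randrange(n)] = rng.choice(alphabet)   # independent copy of one T_i
        worst = max(worst, abs(mp_comparisons(t, p) - mp_comparisons(t_hat, p)))
    return worst


if __name__ == "__main__":
    pattern = "ababb"
    for n in (100, 1000, 5000):
        r = ratios(pattern, n, trials=100)
        print(n, min(r), sum(r) / len(r), max(r))
    print(max_one_symbol_effect(pattern, 1000, trials=100), "<=", 2 * len(pattern) ** 2)
```

In such a run the spread between the smallest and largest ratio should shrink as n grows, which is the concentration the slides derive. The simulation only estimates the constant for one pattern and one source; it does not compute it.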