Sequential Pattern Matching

Joint Advanced Student School 2004
Complexity Analysis of String Algorithms
Sequential Pattern Matching:
Analysis of Knuth-Morris-Pratt type algorithms
using the Subadditive Ergodic Theorem
14 July 2017
Tobias Reichl

Overview
1. Pattern Matching
   • Sequential Algorithms
   • Knuth-Morris-Pratt Algorithm
2. Probabilistic tools
   • Subadditive Ergodic Theorem
   • Martingales and Azuma's Inequality
3. Analysis of KMP Algorithms
   • Properties of KMP
   • Establishing subadditivity
   • Analysis

Pattern Matching
• Text t = t_1^n, pattern p = p_1^m.
  Example: p = abcde, t = xxxxxabxxxabcxxxabcde.
• Pattern-text comparison:
  M(l, k) = 1 if t_l is compared to p_k, and 0 otherwise.
• Alignment position AP: a text position at which the pattern is aligned, i.e.
  M(AP + (k - 1), k) = 1 for some k.

Sequential Algorithms - Definition
i.   Semi-sequential: the alignment positions AP are non-decreasing.
ii.  Strongly semi-sequential: (i), and the comparisons M(l_i, k_i) define
     non-decreasing text positions l_i.
iii. Sequential: (i), and M(l, k) = 1 ⟹ t_{l-(k-1)}^{l-1} = p_1^{k-1},
     i.e. text is compared only if it follows a match with a prefix of the pattern.
     Example: p = abcde, t = xxxxxabxxxabcxxxabcde.
iv.  Strongly sequential: (i), (ii) and (iii).

Example: Naive / brute force algorithm
[Figure: pattern abcde shifted along xxxxxabxxxabcxxxabcde one position (+1) at a time]
• Every text position is an alignment position.
• Text is scanned until...
  – pattern is found - then done.
  – mismatch occurs - then shift by one and retry.
• Sequential algorithm.

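A minimal Python sketch of this brute-force scan (the function and variable names are illustrative, they do not come from the talk): every text position is tried as an alignment position, every pattern-text comparison is recorded as a 1-indexed pair (l, k) with M(l, k) = 1, and the scan stops at the first full match, as described above.

    def naive_count(text, pattern):
        # Brute-force matcher; records each comparison M(l, k) = 1 as a 1-indexed pair (l, k).
        n, m = len(text), len(pattern)
        comparisons, match_at = [], None
        for ap in range(1, n - m + 2):        # every text position is an alignment position
            for k in range(1, m + 1):
                comparisons.append((ap + k - 1, k))
                if text[ap + k - 2] != pattern[k - 1]:
                    break                     # mismatch - shift by one and retry
            else:
                match_at = ap                 # pattern found - then done (as on the slide)
                break
        return match_at, comparisons

    ap, cmps = naive_count("xxxxxabxxxabcxxxabcde", "abcde")
    print(ap, len(cmps))                      # alignment position of the match, comparisons so far
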
Knuth-Morris-Pratt type algorithms (1)
[Figure: pattern ababcde shifted along xxxxxabxxxabcxxxabcde by S positions at once]
• Idea (Morris-Pratt): disregard alignment positions already known not to be
  followed by a prefix of p.
• Knowledge used:
  – the already processed part of the pattern
  – pre-processing of p
• Strongly sequential algorithm.

Knuth-Morris-Pratt type algorithms (2)
• Morris-Pratt:
  [Figure: ababcde shifted along xxxxxabxxxabcxxxabcde after a mismatch]
  S = min{ k, min{ s > 0 : p_{s+1}^{k-1} = p_1^{k-(s+1)} } }
• Knuth-Morris-Pratt (KMP also skips the mismatching letter):
  [Figure: the same shift, but the mismatched text letter is not compared again]
  S = min{ k, min{ s : p_{s+1}^{k-1} = p_1^{k-(s+1)} and p_k ≠ p_{k-s} } }

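The shift S can be read off a precomputed failure table. A minimal sketch (my own naming, not from the talk): fail[k] is the length of the longest proper prefix of p_1^k that is also its suffix, so after k matched characters the Morris-Pratt shift is S = k - fail[k]; Knuth-Morris-Pratt strengthens the table by additionally requiring the next pattern character to differ.

    def mp_failure(p):
        # fail[k]: length of the longest proper border of p_1^k (prefix that is also a suffix)
        fail = [0] * (len(p) + 1)
        j = 0
        for k in range(1, len(p)):
            while j > 0 and p[k] != p[j]:
                j = fail[j]
            if p[k] == p[j]:
                j += 1
            fail[k + 1] = j
        return fail

    fail = mp_failure("ababcde")
    print([k - fail[k] for k in range(1, 8)])   # Morris-Pratt shifts after k matched characters
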
Pattern Matching - Complexity
c_{r,s}(t, p) = Σ_{l ∈ [r,s]} Σ_{k ∈ [1,m]} M(l, k)

• Overall complexity: c_{1,n} =: c_n
• Pattern or text is a realization of a random sequence: C_n
• Question: complexity of KMP?

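Reusing the hypothetical naive_count sketch from above, c_{r,s} is just a count over the recorded comparison pairs (again, the helper name is mine):

    def c(comparisons, r, s):
        # c_{r,s}: number of comparisons whose text position l lies in [r, s]
        return sum(1 for (l, k) in comparisons if r <= l <= s)

    ap, cmps = naive_count("xxxxxabxxxabcxxxabcde", "abcde")
    print(c(cmps, 1, 21))   # the overall complexity c_n = c_{1,n}
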
Subadditivity – Deterministic Sequence
Fekete (1923)
• Subadditivity:   x_{m+n} ≤ x_m + x_n
  ⟹  lim_{n→∞} x_n/n = inf_{m≥1} x_m/m
• Superadditivity: x_{m+n} ≥ x_m + x_n
  ⟹  lim_{n→∞} x_n/n = sup_{m≥1} x_m/m

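A tiny numeric illustration of Fekete's lemma (my own example, not from the talk): x_n = n + √n is subadditive because the square root is, and x_n/n decreases toward inf_{m≥1} x_m/m = 1.

    import math

    def x(n):
        # subadditive: sqrt(m + n) <= sqrt(m) + sqrt(n), hence x(m + n) <= x(m) + x(n)
        return n + math.sqrt(n)

    for n in (10, 100, 1000, 10000, 100000):
        print(n, x(n) / n)   # tends to inf_{m >= 1} x(m)/m = 1
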
Example: Longest Common Subsequence
  ababcafbcdabcde
  abcdeabcdfabcab        LCS: "abcabcdabc" (10)

  ababcafb | cdabcde
  abcdeabc | dfabcab     LCS of the halves: "abcab" (5), "dabc" (4)

L_{1,n} = max{ K : X_{i_k} = Y_{j_k} for 1 ≤ k ≤ K,
               where 1 ≤ i_1 < i_2 < … < i_K ≤ n and 1 ≤ j_1 < j_2 < … < j_K ≤ n }

• Superadditive: L_{1,n} ≥ L_{1,m} + L_{m,n}
• Hence: lim_{n→∞} E[L_n]/n = sup_{m≥1} E[L_m]/m = α
  with α ≈ 0.8284? (conjectured by Steele in 1982)

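A short Python check of this superadditivity on the example strings from the slide (the dynamic program is the standard one; names are mine). The deck reports 10 for the whole strings and 5 + 4 for the two halves.

    def lcs_len(a, b):
        # classic O(|a|*|b|) dynamic program for the longest-common-subsequence length
        prev = [0] * (len(b) + 1)
        for ca in a:
            cur = [0]
            for j, cb in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    x = "ababcafbcdabcde"
    y = "abcdeabcdfabcab"
    whole = lcs_len(x, y)                                   # slide: 10
    halves = lcs_len(x[:8], y[:8]) + lcs_len(x[8:], y[8:])  # slide: 5 + 4
    print(whole, ">=", halves)                              # superadditivity: L_{1,n} >= L_{1,m} + L_{m,n}
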
Subadditivity – "Almost subadditive"
de Bruijn and Erdős (1952)
• Let c_n be a positive, non-decreasing sequence with Σ_{k≥1} c_k/k² < ∞.
• "Almost subadditive": x_{m+n} ≤ x_m + x_n + c_{m+n}
  ⟹  lim_{n→∞} x_n/n = inf_{m≥1} x_m/m

Subadditive Ergodic Theorem
Kingman (1976), Liggett (1985)
i.   X_{0,n} ≤ X_{0,m} + X_{m,n}
ii.  For every k, { X_{nk,(n+1)k}, n ≥ 1 } is a stationary sequence
iii. The distribution of { X_{m,m+k}, k ≥ 1 } does not depend on m
iv.  E[X_{0,1}] < ∞ and E[X_{0,n}] ≥ c_0·n, where c_0 > -∞

Then
  γ := lim_{n→∞} E[X_{0,n}]/n = inf_{m≥1} E[X_{0,m}]/m
and
  lim_{n→∞} X_{0,n}/n = Ξ   (a.s.)   with E[Ξ] = γ.

Almost Subadditive Ergodic Theorem
Derriennic (1983)
• Subadditivity can be relaxed to
  X_{0,n} ≤ X_{0,m} + X_{m,n} + A_n   with   lim_{n→∞} E[A_n]/n = 0.
• Then, too:  lim_{n→∞} X_{0,n}/n = Ξ   (a.s.)

Martingales
• A sequence Y_n = f(X_1, …, X_n), n ≥ 0, is a martingale with respect to the
  filtration F_n = σ(X_0, …, X_n) if for all n ≥ 0:
  E[|Y_n|] < ∞
  E[Y_{n+1} | X_0, X_1, …, X_n] = E[Y_{n+1} | F_n] = Y_n
• E[Y_{n+1} | F_n] defines a random variable depending on the knowledge
  contained in (X_1, …, X_n).

Martingale Differences
• The martingale difference is defined as D_n = Y_n - Y_{n-1}, so that
  Y_n = Y_0 + Σ_{i=1}^n D_i
• Observe:
  E[D_{n+1} | F_n] = E[Y_{n+1} | F_n] - E[Y_n | F_n] = Y_n - Y_n = 0

Azuma's Inequality (1)
• Let Y_n = f_n(X_1, …, X_n) be a martingale.
• Define the martingale differences as
  D_i = E[Y_n | F_i] - E[Y_n | F_{i-1}]
  (the mean of the same quantity, but depending on different knowledge)
• Observe: E[Y_n | F_n] = Y_n and E[Y_n | F_0] = E[Y_n], so
  Σ_{i=1}^n D_i = E[Y_n | F_n] - E[Y_n | F_0] = Y_n - E[Y_n]
  (the deviation from the mean)

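A small sanity check of this construction (entirely my own toy example): let Y_n be the number of heads in n fair coin flips. The Doob differences D_i = E[Y_n | F_i] - E[Y_n | F_{i-1}] then equal X_i - 1/2, telescope to Y_n - E[Y_n], and are bounded by 1/2.

    from itertools import product

    n = 4                                        # four fair coin flips X_1..X_n in {0, 1}

    def f(xs):                                   # Y_n = f(X_1, ..., X_n): number of heads
        return sum(xs)

    def doob(prefix):
        # E[Y_n | F_i] for an observed prefix of i flips, averaging over the remaining flips
        rest = n - len(prefix)
        return sum(f(prefix + t) for t in product((0, 1), repeat=rest)) / 2 ** rest

    for outcome in product((0, 1), repeat=n):
        z = [doob(outcome[:i]) for i in range(n + 1)]       # Z_0 = E[Y_n], ..., Z_n = Y_n
        d = [z[i] - z[i - 1] for i in range(1, n + 1)]      # martingale differences D_i
        assert abs(sum(d) - (f(outcome) - z[0])) < 1e-12    # telescopes to Y_n - E[Y_n]
        assert all(abs(di) <= 0.5 for di in d)              # |D_i| <= c_i = 1/2
    print("checked all", 2 ** n, "outcomes")
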
Hoeffding's Inequality
• Let (Y_n) be a martingale.
• Let there exist constants c_n with |Y_n - Y_{n-1}| = |D_n| ≤ c_n for all n ≥ 0.
• Then:
  Pr( |Y_n - Y_0| ≥ x ) = Pr( |Σ_{i=1}^n D_i| ≥ x ) ≤ 2 exp( -x² / (2 Σ_{i=1}^n c_i²) )

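A quick simulation of the bound (my own example, not from the talk): a fair ±1 random walk is a martingale with |D_i| ≤ c_i = 1, so Pr(|Y_n - Y_0| ≥ x) ≤ 2 exp(-x²/(2n)); the empirical tail frequency stays below the bound.

    import math, random

    random.seed(1)
    n, x, trials = 200, 30, 20000
    hits = 0
    for _ in range(trials):
        y = sum(random.choice((-1, 1)) for _ in range(n))   # Y_n - Y_0 with |D_i| <= c_i = 1
        if abs(y) >= x:
            hits += 1
    bound = 2 * math.exp(-x * x / (2 * n))                  # 2 exp(-x^2 / (2 sum c_i^2))
    print(hits / trials, "<=", bound)
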
Azuma's Inequality (2)
• Summary:
  – If D_i is bounded, we know how to assess the deviation from the mean.
  – So now we need a bound on D_i.
• Trick: let X̂_i be an independent copy of X_i.
• Then:
  E[ f_n(X_1, …, X_i, …, X_n) | F_{i-1} ] = E[ f_n(X_1, …, X̂_i, …, X_n) | F_i ]

Azuma's Inequality (3)
• Hence:
  D_i = E[ f_n(X_1, …, X_i, …, X_n) | F_i ] - E[ f_n(X_1, …, X_i, …, X_n) | F_{i-1} ]
      = E[ f_n(X_1, …, X_i, …, X_n) | F_i ] - E[ f_n(X_1, …, X̂_i, …, X_n) | F_i ]
• And we can postulate: |D_i| ≤ c_i

Azuma's Inequality (4)
• Let Y_n = f_n(X_1, …, X_n) be a martingale.
• Suppose there exist constants c_i such that
  | f_n(X_1, …, X_i, …, X_n) - f_n(X_1, …, X̂_i, …, X_n) | ≤ c_i,
  where X̂_i is an independent copy of X_i.
• Then:
  Pr( |Y_n - E[Y_n]| ≥ x ) ≤ 2 exp( -x² / (2 Σ_{i=1}^n c_i²) )

KMP: Unavoidable alignment positions
• A text position i is called an unavoidable AP if, for any r and l with r ≤ i and
  l ≥ i + m, it is an alignment position when the algorithm is run on t_r^l.
• KMP-like algorithms have the same set of unavoidable alignment positions
  U = ∪_{l=1}^n { U_l }, where
  U_l = min{ min{ k : 1 ≤ k ≤ l and t_k^l is a prefix of p }, l + 1 }
• Example: p = abcde, t = xxxxxabxxxabcxxxabcde
  [Figure: U_l shown below each text position l]

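A minimal sketch computing U_l as defined above for the running example (the function name is mine; a position k ≤ l counts when the text segment t_k..t_l is a prefix of p):

    def unavoidable_positions(text, pattern):
        # U_l = min{ min{ k <= l : t_k..t_l is a prefix of p }, l + 1 }, all 1-indexed
        m, out = len(pattern), []
        for l in range(1, len(text) + 1):
            u = l + 1
            for k in range(max(1, l - m + 1), l + 1):    # longer segments cannot be prefixes
                if pattern.startswith(text[k - 1:l]):
                    u = k
                    break
            out.append(u)
        return out

    print(unavoidable_positions("xxxxxabxxxabcxxxabcde", "abcde"))
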
Pattern Matching: l-convergence
• An algorithm is l-convergent if there exists an increasing sequence of
  unavoidable alignment positions (U_i)_{i=1}^n satisfying U_{i+1} - U_i ≤ l.
• l-convergence bounds the maximum size of the "jumps" the algorithm makes.

KMP: Establishing m-convergence
• Let AP be an alignment position.
• Define l := AP + m.
• Since |p| = m, we have l - m + 1 ≤ U_l ≤ l.
• Hence U_l ≤ AP + m, and so KMP-like algorithms are m-convergent.

KMP: Establishing subadditivity (1)
• If c_n (the number of comparisons) is subadditive, we can prove linear
  complexity of KMP-like algorithms.
• We have to show that c_n is (almost) subadditive:
  c_{1,n} ≤ c_{1,r} + c_{r,n} + a
• Approach: an l-convergent sequential algorithm satisfies
  c_{1,n} ≤ c_{1,r} + c_{r,n} + m² + lm

KMP: Establishing subadditivity (2)
• Proof:
  – U_r: the smallest unavoidable AP greater than r.
  – We split c_{1,n} - c_{1,r} - c_{r,n} into
    ( c_{1,n} - c_{1,r} - c_{U_r,n} )  and  ( c_{r,n} - c_{U_r,n} ).
  [Figure: the text split at positions r and U_r, with braces marking c_{1,n},
   c_{1,r} + c_{U_r,n} and c_{r,n} - c_{U_r,n}]

KMP: Establishing subadditivity (3)
[Figure: comparisons of the run on t_1^n around r and U_r, marked as contributing
 to c_{1,n} only (S_1, S_2), to c_{1,n} and c_{1,r}, or to c_{1,n} and c_{U_r,n}]

• Comparisons done after r with AP before r:
  S_1 = Σ_{AP ≤ r} Σ_{i ≤ m, AP+(i-1) > r} M(AP + (i-1), i) ≤ m²
• Comparisons with AP between r and U_r:
  S_2 = Σ_{r < AP ≤ U_r} Σ_{i ≤ m} M(AP + (i-1), i) ≤ lm
• No more than m comparisons can be saved at U_r.

KMP: Establishing subadditivity (4)
[Figure: comparisons of the run on t_r^n around r and U_r, marked as contributing
 to c_{r,n} only (S_3) or to c_{r,n} and c_{U_r,n}]

• Comparisons with AP between r and U_r:
  S_3 = Σ_{r ≤ AP ≤ U_r - 1} Σ_{i ≤ m} M(AP + (i-1), i) ≤ lm
• No more than m comparisons can be saved at U_r.

KMP: Establishing subadditivity (5)
• So we are able to bound:
  c_{1,n} - ( c_{1,r} + c_{r,n} ) ≤ S_1 + S_2 - S_3 ≤ m² + lm
• We have shown that c_n is (almost) subadditive:
  c_{1,n} ≤ c_{1,r} + c_{r,n} + a
• Now we are able to apply the Subadditive Ergodic Theorem.

KMP: Different Modeling Assumptions
• Deterministic Model:
  Text and pattern are non-random.
• Semi-Random Model:
  The text is a realization of a stationary and ergodic sequence; the pattern is given.
• Stationary Model:
  Both text and pattern are realizations of stationary and ergodic sequences.

KMP: Applying the Subadditive Ergodic Theorem
• We have shown: c_n is (almost) subadditive.
• Deterministic Model:
  lim_{n→∞} max_t c_n(t, p) / n = α_1(p)
• Semi-Random Model:
  lim_{n→∞} C_n(p) / n = α_2(p)   (a.s.)
  lim_{n→∞} E_t[C_n(p)] / n = α_2(p)
• Stationary Model:
  lim_{n→∞} E_{t,p}[C_n] / n = α_3

KMP: Applying Azuma's Inequality
• C_n satisfies
  | C_n(T_1, …, T_i, …, T_n) - C_n(T_1, …, T̂_i, …, T_n) | ≤ 2m²,
  where T̂_i is an independent copy of T_i.
• So, using Azuma's Inequality:
  Pr( |C_n - αn| ≥ εn ) ≤ 2 exp( -ε²n(1 + o(1)) / (2(2m²)²) )
• C_n is concentrated around its mean:
  E[C_n] = αn(1 + o(1))

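A rough empirical check of this bounded-difference property (my own sketch, not from the talk: a Morris-Pratt style comparison counter, with a single changed text symbol standing in for the independent copy T̂_i):

    import random

    def mp_comparisons(text, pattern):
        # counts text-pattern comparisons of a Morris-Pratt style left-to-right scan
        fail, j = [0] * (len(pattern) + 1), 0
        for k in range(1, len(pattern)):
            while j > 0 and pattern[k] != pattern[j]:
                j = fail[j]
            if pattern[k] == pattern[j]:
                j += 1
            fail[k + 1] = j
        count, j = 0, 0
        for ch in text:
            while True:
                count += 1                        # one comparison of ch with p_{j+1}
                if ch == pattern[j]:
                    j += 1
                    break
                if j == 0:
                    break
                j = fail[j]
            if j == len(pattern):                 # full match: restart from the longest border
                j = fail[j]
        return count

    random.seed(2)
    p, m = "ababa", 5
    t = [random.choice("ab") for _ in range(2000)]
    base = mp_comparisons("".join(t), p)
    worst = 0
    for i in range(len(t)):                       # change one text symbol at a time
        s = list(t)
        s[i] = "b" if s[i] == "a" else "a"
        worst = max(worst, abs(mp_comparisons("".join(s), p) - base))
    print(worst, "<=", 2 * m * m)                 # the slide's bound 2m^2
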
Conclusion
• Using the Subadditive Ergodic Theorem we can show that a linearity constant
  exists for the worst case and the average case, respectively:
  KMP has linear complexity.
• The Subadditive Ergodic Theorem proves the existence of this constant but says
  nothing about how to compute it.
• Using Azuma's Inequality we can show that the number of comparisons is well
  concentrated around its mean.