On The Exact Complexity Of String Matching
Livio Colussi*
Zvi Galil†
Raffaele Giancarlo‡
(Extended Abstract)
Summary of Results
We investigate the maximal number of character comparisons made by a linear-time string matching algorithm, given a text of length n and a pattern of length m over a general alphabet. We denote it by c(n, m) or approximate it by (1 + C)n, where C is a universal constant. We add the subscript "on-line" when we restrict attention to on-line algorithms and the superscript "1" when we consider algorithms that find only one occurrence of the pattern in the text. It is well known that n ≤ c(n, m) ≤ 2n − m + 1, or 0 ≤ C ≤ 1. The upper bound was established 20 years ago [16, 18] and no progress has been made for 19 years. The only lower bound has been the obvious one, c(n, m) ≥ n. We improve these bounds and determine C_on-line exactly. The bounds below contain the period size z of the pattern.
Upper Bounds:

(i) A simple and practical on-line linear-time, i.e., O(n + m), algorithm that performs at most n + ⌊(n − m)·min(…)⌋ ≤ 1.5n − 0.5m character comparisons. Experimental results show that the new algorithm is better than the Knuth-Morris-Pratt [16] and Boyer-Moore [5] algorithms in the worst case.

(ii) A slightly more complicated on-line linear-time algorithm that performs at most n + ⌊(n − m)·min(1/3, …)⌋ ≤ (4/3)n − (1/3)m character comparisons.

For m = 1, 2 and for m = 3, except for patterns of the form aba, the two algorithms make at most n character comparisons. This bound also holds for the important special case of nonperiodic patterns, i.e., z = m, of any length. The O(n + m) time bound of both algorithms does not depend on the alphabet size.

Lower Bounds:

(iii) For 0 < m ≤ 2, c(n, m) ≥ n for infinitely many n's. For odd m ≥ 3, c(n, m) ≥ n + ⌊…⌋ for infinitely many n's.

(iv) For 0 < m ≤ 2, c_on-line(n, m) ≥ n for infinitely many n's. For odd m ≥ 3, c_on-line(n, m) ≥ n + ⌊…⌋ for infinitely many n's.

(v) For 0 < m ≤ 3, c¹_on-line(n, m) ≥ n for infinitely many n's. For even m ≥ 4, c¹_on-line(n, m) ≥ n + ⌊…⌋ for infinitely many n's.

Remarks:

1. It follows that … ≤ C ≤ C_on-line = ….

2. The lower bounds in (iii), (iv) and (v) are tight for m = 1, 2, 3 and for many families of patterns. Moreover, they apply also to randomized algorithms that use character comparisons to perform string matching.

3. The simple algorithm of (i) is obtained by using several times Hoare's correctness proof of programs [11] as a tool to obtain a speed-up.

4. The algorithms use a novel approach which has no equivalent in any of the known string matching algorithms. They compare from left to right only a selected subset of positions, leaving "holes" on the way. In case all positions from the selected subset match, the algorithms fill up the holes from right to left.

5. The analysis of the algorithms is quite complex and requires the proof of new properties of strings and the invention of a new charging mechanism.

6. Apostolico and Crochemore [2] have independently shown that c(n, m) ≤ 1.5n − m + 1, a bound weaker than the one we have obtained.

* Dipartimento di Matematica Pura ed Applicata, University of Padova, ITALY.
† Dept. of Computer Science, Columbia University, NY 10027, U.S.A., and Tel-Aviv University, Tel-Aviv, ISRAEL. Work supported in part by NSF Grants CCR-86-05-353 and CCR-88-14977.
‡ AT&T Bell Labs, Murray Hill, NJ 07974, U.S.A. On leave from University of Palermo, ITALY, and SUNY at Buffalo, U.S.A. Work performed while at Columbia University. Work supported in part by NSF Grants CCR-86-05-353 and CCR-88-14977 and by the Italian Ministry of University and Scientific Research, Project "Algoritmi e Sistemi di Calcolo".

CH2925-6/90/0000/0135$01.00 © 1990 IEEE
1 Introduction
Given a computational problem, the ultimate goal is to determine its computational complexity exactly. However, this goal is usually unattainable for several reasons. One is that the exact complexity or even the constant factors depend on the machine model. Consequently, an exact determination of the time required by the problem is possible only if we restrict the model of computation, for example using a comparison tree model or considering straight line programs over a given set of operations.

For several problems involving order statistics the exact number of comparisons is known. The simplest of these are computing the maximum or minimum in n − 1 comparisons [folklore], computing the maximum and the minimum in ⌈3n/2⌉ − 2 comparisons [19], and computing the largest two elements in n + ⌈log n⌉ − 2 comparisons [14, 22]. For many other problems, such knowledge is partial or not available. For instance, exact bounds for selecting the third largest element are known for all but a finite set of cases [12, 13, 23]. For a classic and important problem such as sorting, the exact number of comparisons is known only up to n = 12 [1, 8] and for some special cases [15]. A fascinating open problem is to find the exact number of comparisons for determining the median: the best upper bound is 3n [21] and the best lower bound is 2n [4].

In this abstract, our model of computation is a RAM with uniform cost criterion, or alternatively a binary decision tree model [1]. We evaluate the time complexity of algorithms counting all operations and the number of comparisons. For example, the straightforward algorithm for computing the minimum of n numbers takes O(n) time and performs n − 1 comparisons in our model.

We investigate the exact complexity of string matching over a general alphabet. String matching is the problem of finding all occurrences of a given pattern p[1, m] in a given text t[1, n]. We say that the pattern occurs at text position i if t[i + 1, i + m] = p[1, m]. By "general alphabet" we refer to the case of an infinite alphabet or of a finite alphabet unknown to the algorithm (thus it must work if the alphabet is {a, b, c, d} or {♥, ♦, ♣, ♠}). Most known linear-time, i.e., O(n + m), string matching algorithms do not depend on the knowledge of the alphabet, since they only compare symbols of the pattern with symbols of the text. Another common feature of these algorithms is that they preprocess the pattern, i.e., they gather knowledge about the structure of the pattern and then start looking for it in the text. We count only the character comparisons that these algorithms perform while looking for occurrences of the pattern in the text, excluding the ones performed during a possible preprocessing step. (At most 2m − 1 comparisons are needed for the preprocessing.) However, we do account for the time taken by the preprocessing.

For this kind of algorithms, the best upper bound known for character comparisons is 2n − m + 1 and corresponds to the number of comparisons made by the Knuth-Morris-Pratt algorithm [16], which we call KMP in short. This bound was obtained 20 years ago and no progress has been made for 19 years. Even the amazingly efficient Boyer-Moore algorithm [5] (BM for short) could not beat the 2n − m + 1 bound. Indeed, Knuth [16] showed that BM performs 6n character comparisons assuming that the pattern does not occur in the text. Under the same assumption, Guibas and Odlyzko [10] reduced this bound to 4n and, very recently, Cole [6] has obtained a tight bound of 3n. Apostolico and Giancarlo [3] designed a variation of BM that achieves the 2n − m + 1 bound, irrespective of how many times the pattern occurs in the text. All the mentioned analyses of BM are quite intricate. Recently, Perrin and Crochemore [7] devised a time-space optimal string matching algorithm that makes at most 2n − m + 1 character comparisons. All these results have indicated that 2n − m + 1 might be best possible. One of the results of our paper is to show that this is not the case. Note that one can use the failure function used by KMP to derive deterministic finite automata [1] which perform string matching in exactly n comparisons. But this class of algorithms must know the alphabet to work correctly and has a running time that does depend (by a multiplicative factor) on the alphabet size. The algorithms we are interested in have a running time independent of the alphabet size.

The best lower bound known is the obvious one: c(n, m) ≥ n. Notice that Rivest [20] showed that any string matching algorithm must examine, in the worst case, n − m + 1 text characters to find an occurrence of the pattern in the text. Rivest's model is incomparable to ours. On the one hand it is stronger, as one character examination involves possibly many character comparisons. On the other hand his algorithms must have prior knowledge of the alphabet.

We denote by c(n, m) the maximal number of comparisons needed by a linear-time string matching algorithm, excluding any preprocessing step. It follows from the discussion above that n ≤ c(n, m) ≤ 2n − m + 1. Our major goal is to determine c(n, m) exactly. Similarly, we want to determine c_on-line(n, m) (c¹_on-line(n, m), resp.) exactly, where the subscript indicates that we restrict attention to on-line algorithms (on-line algorithms that find only one occurrence of the pattern in the text). On-line algorithms process the text in increasing order of its positions and their access to the text is limited to a sliding window of size m. The best known bounds for c_on-line(n, m) and c¹_on-line(n, m) have been no better than the ones for c(n, m). Let z denote the period size of the pattern, i.e., the minimal integer i such that p[1, m − i] = p[i + 1, m]. The results of this paper can be summarized as follows.
1. New upper and lower bounds for c(n, m), c_on-line(n, m) and c¹_on-line(n, m).

2. A new simple on-line linear-time algorithm that makes at most n + ⌊(n − m)·min(…)⌋ ≤ n + ⌊(n − m)/2⌋ character comparisons. Experimentally, it is always better than the KMP and BM algorithms in the worst case.

3. A new method to design and improve algorithms.

4. New properties of periodicities in strings.

5. New accounting techniques dealing with an intricate time analysis.

We can show that, for 0 < m ≤ 2, c(n, m) = c_on-line(n, m) = n for infinitely many n's. c(n, m), c_on-line(n, m) and c¹_on-line(n, m) are bounded above by n + ⌊(n − m)·min(1/3, …)⌋. For odd m ≥ 3, c(n, m) ≥ n + ⌊…⌋ and c_on-line(n, m) ≥ n + ⌊…⌋ for infinitely many n's. For even m ≥ 4, c¹_on-line(n, m) ≥ n + ⌊…⌋. The lower bounds are tight for m = 1, 2, 3 and, for m > 3, they are tight for many families of patterns. We remark that for the important class of nonperiodic patterns, i.e., z = m, we obtain an exact bound c(n, m) = c_on-line(n, m) = n. All the known linear-time string matching algorithms require more than n comparisons for this class of patterns. Moreover, for almost all patterns of length m, z = m − O(1) and thus the number of comparisons is no larger than n + O(n/m).
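The period z that appears in all of these bounds can be computed during preprocessing from the standard KMP failure function; the following minimal sketch illustrates one way to do it (the function name is ours):

```python
def period(p):
    """Return the period z of p: the minimal i > 0 with p[1, m-i] = p[i+1, m].

    Computed as m minus the length of the longest proper border of p,
    via the usual KMP failure-function recurrence.
    """
    m = len(p)
    fail = [0] * (m + 1)          # fail[q]: length of longest border of p[:q]
    k = 0
    for q in range(1, m):
        while k > 0 and p[q] != p[k]:
            k = fail[k]
        if p[q] == p[k]:
            k += 1
        fail[q + 1] = k
    return m - fail[m]
```

For instance, period("abab") is 2, while a nonperiodic pattern such as "abc" has z = m = 3.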
The algorithms use a novel approach that is a nontrivial combination of KMP and BM. Assume that we want to check whether p[1, m] is equal to t[i + 1, i + m]. KMP checks for the equality of these two strings proceeding from left to right. BM does the same check from right to left. Our algorithms do the same check by first matching a selected subset of pattern positions from left to right (as in KMP). If no mismatch is found during this pass, the remaining pattern positions are processed from right to left (as in BM).
We use the correctness of programs and its corresponding proof rules from Hoare's axiomatic semantics [11] as a tool to speed up algorithms. It may be the first example in which correctness of programs plays a crucial role in improving the efficiency of an algorithm. Informally, we proceed as follows. We write a correctness proof of the naive string matching algorithm. Then, we use the proof rule of null program statements (see Section 2 for definitions) to identify places in which the program is not using some information. We use such information and derive, very easily, KMP. We apply the same procedure to KMP and obtain a very simple algorithm improving KMP. (The algorithm is so simple that we present it in full in Section 3.) A more precise description of the above technique is given in Section 2.
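As a concrete anchor for this process, here is a comparison-counting version of the naive algorithm from which the derivation starts (a sketch; the function name and the counting convention are ours):

```python
def naive_matches(p, t):
    """Naive string matching: slide p over t one position at a time.

    Returns (occurrences, comparisons); occurrence i means that
    t[i+1, i+m] = p[1, m] in the paper's 1-indexed notation.
    """
    n, m = len(t), len(p)
    occ, cmps = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            cmps += 1                 # one character comparison
            if p[j] != t[i + j]:
                break
            j += 1
        else:                         # no mismatch: p occurs at position i
            occ.append(i)
    return occ, cmps
```

A correctness proof of this loop contains null statements whose pre- and postconditions are not equivalent: after a mismatch at p[j + 1], the program simply discards everything it has learned about t[i + 1, i + j]. Exploiting exactly that discarded information is what yields KMP.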
In order to analyze our algorithms, we have to investigate several combinatorial problems on strings. Although a wide body of knowledge is available on combinatorics on strings (see for instance Encyclopedia of Mathematics [17]), none of the problems we address has been studied. We are interested in the dynamics of periodicities in a given string w. For instance, one of the problems we address can be stated as follows: we are given an integer r and an arbitrary set A of positions of w satisfying a certain property P. We want to determine how many positions i ∈ A are such that i − r satisfies the same property P. We derive tight bounds on this number for several nontrivial properties. Due to space limitations, we cannot present here any of these results.
As we describe in detail in Section 4, our algorithms
sometimes forget the result of some comparisons when
the pattern is moved over the text. These comparisons
may have to be repeated in the future. On the positive
side, our algorithms avoid testing some text positions.
Since we want a tight analysis, we must account for
every comparison that the algorithms perform in the
worst case. An “accounting” idea would be to amortize
repeated comparisons against untested text positions.
The implementation of this idea is complicated by the
fact that both the untested text positions and the forgotten comparisons are irregularly distributed over the
text and there seems to be no way to obtain tight upper
bounds on the cardinality of both sets. We overcome
this difficulty by using a new accounting scheme, which
can be intuitively described as follows: The algorithms
are given a portfolio of n units and an unlimited credit
line. The units in the portfolio have a due date and they
cannot be spent before that date. The due date is not
known in advance and depends on the execution of the
algorithms. At any given time, our analysis establishes
how much money the algorithms need and how much
money can be spent from the portfolio. If the available
money does not cover the current needs, the credit line
is used. A more technical outline is given in Section 4.
We remark that obliviousness in a string matching algorithm is not new. Indeed, the difficulty of the analysis of BM given in [3, 6, 10, 16] is mostly due to the obliviousness of BM. However, none of the techniques in [3, 6, 10, 16] can be used in our case since the mechanics behind the obliviousness of BM is different from that of our algorithms, as explained in Section 4.
The lower bounds are general enough to apply even
to randomized algorithms that use character comparisons to perform string matching. We use an “unrestricted” adversary that need not know the strategy of
the algorithm beforehand. It fixes the pattern and then
the text is chosen adaptively according to the questions asked by the algorithm. In Section 5, we give proofs using the pattern aba (alternative patterns are mom, dad, bob ...). We remark that for m = 3, we obtain the best lower bounds for C and C_on-line and the exact value of the latter. For m = 4, we obtain the best lower bound for C¹_on-line.

We remark that Apostolico and Crochemore [2] have independently shown that c(n, m) ≤ 1.5n − m + 1 by means of a simple modification of KMP. Their bound is weaker than the one given here and their variation of KMP seems to be a special case of the simplest of our new algorithms.

A number of open problems remain. It would be interesting to obtain the exact value of C. It is still a challenge to close the gap between the upper and lower bounds for c(n, m), c_on-line(n, m) and c¹_on-line(n, m), respectively. These gaps grow as m grows. Note that we do not have a lower bound for c¹(n, m), the number of comparisons for off-line algorithms that search for only one occurrence of the pattern in the text. It would be also interesting to apply our design technique to other problems.

2 Correctness Proof of Programs as a Tool To Improve Algorithms

The design technique used to obtain our algorithm is the formal verification of programs [11]. Let Prog be a program solving a problem P and let A and B be two assertions in the correctness proof of Prog. Let a null program statement be a statement that does nothing. Assume that A is the precondition of a null program statement in Prog and that B is the postcondition of the same statement. Recall that in Hoare axiomatic semantics, the proof rule of a null program statement is {A} null {B}, provided that A implies B. We say that such a null program statement is true if A ≡ B does not hold. Our technique is based on the observation that a true null program statement is an indicator that some information about the problem is not used by Prog. Using that information or avoiding its computation helps in improving the efficiency of Prog. Therefore, we proceed as follows:

1. Write down a formal correctness proof of the pseudo-code of the algorithm.

2. Look for true null program statements in the pseudo-code.

3. Determine the information that the algorithm is neglecting. Either use it to speed the algorithm up or find a way not to compute that information.

We apply this scheme to the naive string matching algorithm and, very easily, obtain KMP. Then, we apply it to KMP and obtain the algorithm that we present in the next section. Application of the technique is simple but, due to space limitations, we cannot show any detail.

3 The New String Matching Algorithm

The algorithm that we present is a blend of the two best known and most efficient string matching algorithms devised so far: Knuth-Morris-Pratt [16] and Boyer-Moore [5].

Let w be a string. We denote the positions i, i + 1, ..., j of w as [i, j]; w[i, j] is the substring of w that starts at position i and ends at position j of w. Thus, w = w[1, s] for some s.

Given an alignment of the pattern p[1, m] with text characters t[i + 1, i + m], KMP checks whether these two strings are equal proceeding from left to right. BM performs the same check proceeding from right to left. Both algorithms then shift the pattern over the text by some precomputed amount and the check is repeated on the resulting new alignment. The KMP and BM approaches to string matching seem to be orthogonal and no efficient and mutually advantageous way of combining them has been found so far. The main difficulty in combining the two approaches is to find a partition of the set of m pattern positions such that there is an efficient strategy to shift the pattern over the text when a mismatch is found. In what follows, we design such a partition and the corresponding strategy. Our starting point is KMP.

We need a few definitions. We say that a string w[1, m] is k-periodic if w[1, m − k] = w[k + 1, m]. We refer to k as a period of w[1, m]. The period of a string w is the minimal integer l > 0 such that l is a period of w. Given a string w[1, m] and an index j < m, we say that kmin(j) is defined and equal to d if d is the minimal integer such that w[1, j] is d-periodic and w[j − d + 1] ≠ w[j + 1]. Thus, kmin(j) is defined if a periodicity ends at j. If no such d exists we say that kmin(j) is undefined.

Let the pattern be aligned with t[i + 1, i + m]. Assume that KMP has found out that p[1, j] = t[i + 1, i + j] and that p[j + 1] ≠ t[i + j + 1], for some j, 0 ≤ j < m. The pattern must be shifted over the text by some amount. There are two possibilities: shift past text position i + j + 1 (where the mismatch occurred) and not shift past text position i + j + 1 (thus having some overlap with the part that was matched). KMP chooses the first possibility when there is no integer g such that p[1, j] is g-periodic and p[j + 1] ≠ p[j − g + 1],
i.e., kmin(j) is undefined, since it can be easily shown that there cannot be any occurrence of the pattern in the interval [i + 1, i + j + 1]. KMP chooses the second possibility when there is an integer g such that p[1, j] is g-periodic and p[j + 1] ≠ p[j − g + 1], i.e., kmin(j) is defined. Then, KMP shifts the pattern over the text by kmin(j) positions since no occurrence of the pattern is in the interval [i + 1, i + kmin(j)]. Moreover, there may still be an occurrence of the pattern at i + kmin(j) + 1, since p[1, j − kmin(j)] = p[kmin(j) + 1, j] = t[i + 1 + kmin(j), i + j] and a pattern character different from p[j + 1] will be aligned with t[i + j + 1] (recall the definition of kmin).
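Under the definition above, kmin can be written out directly. This brute-force sketch (the name kmin_table and the dictionary output are our illustration; the paper computes kmin in O(m) during preprocessing) follows the definition literally, with 1-indexed positions as in the paper:

```python
def kmin_table(p):
    """kmin(j) for 1 <= j < m: the minimal d such that p[1, j] is d-periodic
    (p[1, j-d] = p[d+1, j]) and p[j-d+1] != p[j+1]; None when undefined.

    Python strings are 0-indexed, so the paper's position j maps to p[j-1].
    """
    m = len(p)
    table = {}
    for j in range(1, m):
        table[j] = None
        for d in range(1, j + 1):           # candidate periods of p[1, j]
            if p[:j - d] == p[d:j] and p[j - d] != p[j]:
                table[j] = d
                break
    return table
```

For the pattern aba used later in the lower-bound proofs, kmin(1) = 1 while kmin(2) is undefined: position 2 is compared in the left-to-right pass, and position 3 is left as a "hole".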
We claim that if kmin(j) is defined and p[j + 1] ≠ t[i + j + 1], we need not know that p[1, j] = t[i + 1, i + j] to deduce that the pattern must be shifted by kmin(j) positions over the text. Indeed, we can reach the same conclusion under weaker conditions. Moreover, we can still use some of the knowledge acquired about t[i + 1, i + j] after the shift by kmin(j) takes place. Let h_1 < h_2 < ... < h_nd, h_nd ≤ m − 1, be the pattern positions such that kmin is defined and let first(x) denote the least integer y such that x ≤ h_y. The following lemma is a precise statement of our claim (see Figure 1).
An operative conclusion that we can draw from Lemma 1 is that when checking (from left to right) if p[1, m] = t[i + 1, i + m], we can compare only pattern positions j + 1 having kmin(j) defined and shift by kmin(j) in case of a mismatch. (It is here that we use the paradigm of not computing unneeded information.) Now, assume we know that p[h_s + 1] = t[i + h_s + 1], 1 ≤ s ≤ nd; we must decide in which order to compare the pattern positions p[j + 1] having kmin(j) undefined with the corresponding text positions, and what to do in the case that a mismatch is found. If we find out that p[j + 1] ≠ t[i + j + 1], kmin(j) undefined, we can shift past text position i + j + 1. (Recall, a shift with overlap must have kmin(j) defined.) In order to guarantee the longest shifts while allowing the algorithm to retain more of the knowledge it acquires about t[i + 1, i + m], we process all pattern positions q + 1, kmin(q) undefined, in decreasing order, i.e., from right to left.
Lemma 2 describes more precisely such processing (see Figure 2). Let h_{nd+1} > h_{nd+2} > ... > h_{m-1} be all pattern positions such that kmin is undefined, h_{nd+1} ≤ m − 1. Let h_m = 0. Given a position j of the pattern, let rmin(j) be the minimal period of p[1, m] greater than j.
Lemma 1 Let the pattern be aligned with t[i + 1, i + m]. Assume that p[h_s + 1] = t[i + h_s + 1], for 1 ≤ s < r ≤ nd, and p[h_r + 1] ≠ t[i + h_r + 1]. Let i_new = i + kmin(h_r). Then, there is no occurrence of the pattern in the interval [i + 1, i_new] and the pattern can be shifted over the text by kmin(h_r) positions. Moreover, p[h_s + 1] = t[i_new + h_s + 1], for 1 ≤ s < first(h_r − kmin(h_r)), and string matching can be restarted by testing whether p[h_{first(h_r − kmin(h_r))} + 1] = t[i_new + h_{first(h_r − kmin(h_r))} + 1].
The operative conclusion that we can draw from Lemma 2 is that when we have completed a left-to-right pass and start a right-to-left pass, we are guaranteed that we need not consider text positions t[i + rmin(h_r) + 1, i + m] ever again, since a prefix of the pattern matches t[i + rmin(h_r) + 1, i + m] and each future shift affects only the length of the match, since we always shift by a period. (Here again, we use the paradigm.)
The subsequences h_1 < h_2 < ... < h_nd and h_{nd+1} > h_{nd+2} > ... > h_m provide a partition of the integers {0, ..., m − 1} which induces a partition on the pattern positions {1, ..., m}. The way to put these two sequences together to perform string matching is given by Lemmas 1 and 2. Indeed, we check whether p[1, m] = t[i + 1, i + m] using the sequence h_1 + 1, h_2 + 1, ..., h_nd + 1, h_{nd+1} + 1, ..., h_m + 1 in the order given. We remark that such a sequence can be computed in O(m) time and 2m − 1 character comparisons using the same preprocessing algorithm of KMP [16].

The algorithm that we have obtained is so simple and compact that we can actually give the code. We define two tables disp and start having m + 1 entries each as follows: disp[i] = kmin(h_i) and start[i] = first(h_i − kmin(h_i)), 1 ≤ i ≤ nd; disp[i] = rmin(h_i) and start[i] = first(m − rmin(h_i)), nd < i ≤ m; disp[m + 1] = d and start[m + 1] = first(m − d), where d is the period of the pattern. These tables are computed in the preprocessing stage.

Variable pstart is equal to the index from which the algorithm starts examining the sequence h_1, ..., h_m at the beginning of each iteration. Variable tlast denotes an index such that text characters in t[1, tlast] are not examined anymore by the algorithm.
Lemma 2 Let the pattern be aligned with t[i + 1, i + m]. Assume that p[h_s + 1] = t[i + h_s + 1], 1 ≤ s < r, and that p[h_r + 1] ≠ t[i + h_r + 1], r > nd. Let i_new = i + rmin(h_r). Then, there is no occurrence of the pattern in the interval [i + 1, i_new]. Moreover, p[1, m − rmin(h_r)] = t[i_new + 1, i + m] and the string matching process can be restarted by testing whether p[h_{first(m − rmin(h_r))} + 1] = t[i_new + h_{first(m − rmin(h_r))} + 1].
Algorithm SM
begin
    b ← 0; pstart ← 1; tlast ← 0;
    repeat
        e ← pstart;
        while e ≤ m and tlast < b + h_e + 1 and p[h_e + 1] = text[b + h_e + 1] do
            e ← e + 1;
        if e = m + 1 or tlast ≥ b + h_e + 1 then
            occurrence found;
        if e > nd then tlast ← b + m;
        b ← b + disp[e];
        pstart ← start[e];
    until b > n − m;
end
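The tables that SM consumes can be reconstructed from the definitions above. The sketch below builds h, nd, disp and start by brute force (O(m²); the paper obtains them in O(m) with KMP-style preprocessing), under our reading of the definitions; all helper names are ours:

```python
def preprocess(p):
    """Build the comparison order h (a 0-indexed list standing for h_1..h_m)
    and the 1-indexed shift tables disp[1..m+1], start[1..m+1] for SM."""
    m = len(p)

    def kmin(j):              # minimal d: p[1,j] is d-periodic, p[j-d+1] != p[j+1]
        for d in range(1, j + 1):
            if p[:j - d] == p[d:j] and p[j - d] != p[j]:
                return d
        return None

    pattern_periods = [d for d in range(1, m + 1) if p[:m - d] == p[d:]]
    z = pattern_periods[0]    # the period of p

    defined = [j for j in range(m) if kmin(j) is not None]         # h_1 < ... < h_nd
    undefined = sorted((j for j in range(m) if kmin(j) is None),
                       reverse=True)                               # h_{nd+1} > ... > h_m = 0
    h, nd = defined + undefined, len(defined)

    def first(x):             # least y with x <= h_y (y in 1..nd), else nd + 1
        for y in range(nd):
            if x <= h[y]:
                return y + 1
        return nd + 1

    def rmin(j):              # minimal period of p greater than j
        return min(d for d in pattern_periods if d > j)

    disp, start = [None], [None]        # dummy entry so tables are 1-indexed
    for i, hi in enumerate(h, start=1):
        if i <= nd:
            disp.append(kmin(hi))
            start.append(first(hi - kmin(hi)))
        else:
            disp.append(rmin(hi))
            start.append(first(m - rmin(hi)))
    disp.append(z)                      # disp[m+1] = period of p
    start.append(first(m - z))
    return h, nd, disp, start
```

For p = aba, working the definitions by hand gives h = (1, 2, 0) with nd = 1, disp = (1, 3, 2, 2) and start = (1, 1, 1, 1), which is what the sketch returns.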
4 Outline of the Analysis and Improvement of SM
We say that an integer i is an iteration if algorithm SM sets variable b = i during its execution. An iteration i corresponds to an alignment of the pattern with t[i + 1, i + m]. Moreover, we refer to each position j of the pattern, 1 < j ≤ m, having kmin(j − 1) defined, as a nohole. We refer to the remaining pattern positions as holes. Notice that SM tries to match the noholes first from left to right, skipping the holes.
One can observe a potentially serious limitation to a good performance of SM: after the pattern is shifted by some amount, there are situations in which the algorithm forgets that some of the text positions match the corresponding pattern positions and may compare them again. Consider a nohole h_s + 1 of the pattern, for some fixed s ≤ nd. It can happen that, while kmin(h_s) is defined, there may exist a nohole h_{s'} + 1, s < s' ≤ nd, such that kmin(h_s − kmin(h_{s'})) is undefined. In other words, noholes are unstable under shifts corresponding to interrupted periodicities. The effect of this instability is that a text position matching a nohole in a given iteration i may be aligned with a hole in the next iteration and be considered again if the algorithm starts matching pattern positions from right to left. Figure 1 gives an example of matched text positions that may be compared again.
Obliviousness in a string matching algorithm is not new: BM forgets part of what it knows about the text after a shift. This phenomenon has been extensively studied. However, none of the techniques used in [3, 6, 9, 10, 16] to account for the obliviousness of BM can be used in our case. Intuitively, the obliviousness of BM is macroscopic whereas that of SM is microscopic. Indeed, BM forgets that contiguous chunks of text match some pattern substrings, whereas SM forgets that some text characters, not necessarily contiguous and irregularly distributed over the text, match some pattern characters.

Another property of SM is that it skips over some text positions while matching noholes. Then the pattern is shifted over the text. The text positions skipped over and to the left of the current alignment are left untested, i.e., they are never compared with any pattern
position. Figures 1 and 2 give a qualitative description
of this phenomenon. Again, we remark that the text
positions left untested are irregularly distributed over
the text.
In order to obtain a tight analysis of SM, one would like to use all the untested text positions to pay for the mismatches and some of the repeated comparisons resulting from the obliviousness of SM. The achievement of this goal requires tight upper bounds on the number of comparisons forgotten after a shift, and a flexible and precise charging scheme that allows us to amortize character comparisons against untested text positions. Such tasks are complicated by the arbitrariness and irregularity of the distributions (over the text) of forgotten comparisons and untested positions. We now give the main points of our analysis, omitting the combinatorics on strings on which they are based.
We need a few definitions. An iteration r matches a suffix of the pattern if t[r + h_j + 1] = p[h_j + 1], 1 ≤ j ≤ nd, and, at least, t[r + m] = p[m]. Notice that this definition implies that variable tlast in algorithm SM is set to r + m. Two iterations r and u, r < u, have a suffix-prefix overlap if r matches a suffix of the pattern and u ≤ r + m − 1. For any iteration u, let tlast(u) be the value of tlast in algorithm SM when u starts. One can show that u < tlast(u) if and only if u has a suffix-prefix overlap with an iteration u' < u.
The iterations of algorithm SM are divided into sequences of consecutive iterations. There are two kinds of sequences, light and heavy, defined as follows.

A sequence s_1, s_2, ..., s_y of consecutive iterations is light if s_1 is either the first iteration of the string matching algorithm and does not match a suffix of the pattern, or it is preceded by an iteration that ends a heavy sequence. In addition, s_i ≥ tlast(s_i), 1 < i ≤ y, and either s_y is the last iteration of the algorithm or u matches a suffix of the pattern, where u is the iteration succeeding s_y.

A sequence u_1, u_2, ..., u_g of consecutive iterations is heavy if u_1 is either the first iteration of the string matching algorithm or it is preceded by an iteration that ends a sequence, either light or heavy. In addition, u_i < tlast(u_i), 1 < i ≤ g, and either u_g is the last iteration of the algorithm or tlast(s) ≤ s, where s is the iteration succeeding u_g.

Given an iteration v starting either a light or a heavy sequence, let c(v) be the number of noholes in the pattern between 0 and h_{pstart(v)}. We can show:
Lemma 3 Let s_1, s_2, ..., s_y be a light sequence and let u_1 be the iteration that starts the succeeding heavy sequence. Algorithm SM performs at most u_1 − s_1 + c(u_1) − c(s_1) character comparisons during the execution of such a light sequence.

Proof Idea: The proof is by induction. Fix an iteration s_i. Let G be the set of text positions in [s_i + 1, s_{i+1}] aligned with holes of the pattern when iteration s_i starts. Let F be the set of character comparisons that SM forgets when iteration s_{i+1} starts. It is possible to show that |F| ≤ |G| ≤ s_{i+1} − s_i − c(s_i). □
Theorem 2 Let p[1, m] be a pattern with period z and let t[1, n] be a text. Algorithm SM' finds all occurrences of p[1, m] in t[1, n] in O(n + m) time. In the worst case, the algorithm performs at most n + ⌊(n − m)·min(1/3, …)⌋ character comparisons.
Lemma 4 Let p[1, m] be a pattern with period z. Let u_1, u_2, ..., u_g be a heavy sequence and let s_1 be the succeeding iteration. Algorithm SM performs at most s_1 − u_1 + c(s_1) − c(u_1) + ⌊(s_1 − u_1)·min(…)⌋ character comparisons during the execution of such a heavy sequence.
5 Lower Bounds
We now address the question of how many character comparisons between the text and the pattern are necessary in order to find all occurrences of the pattern in the text for general alphabets, or when no knowledge of the alphabet is available. We also address the same question for on-line algorithms that find only one occurrence of the pattern in the text.

Our model of computation is a RAM with uniform cost criterion [1], slightly modified. We assume that the text t[1, n] is stored in a special random access buffer. This buffer can be accessed only through a server which is not part of the algorithm executing on the RAM. The server accepts only requests of the form "p[i] = t[j]?" or "t[i] = t[j]?" and it provides an answer to each request in one unit of time. Therefore, in our model, any string matching algorithm can have knowledge about the pattern, but it can acquire knowledge about the text only through comparisons with the pattern. Alternatively, we could use a binary decision tree model.

Let A_p be any algorithm that correctly finds all occurrences of p[1, m] in any given text t[1, n]. We say that A_p decides for text position i if, based on the character comparisons performed up to that point, it can establish whether or not there is an occurrence of the pattern starting at i. We remark that an algorithm can decide for several text positions at the same time. A_p is on-line if the text character that can be compared with a pattern character at a given time is in the interval [i, i + m − 1], where i is the minimal text position not decided yet by A_p.
U I ,u2, . ,u g
into d consecutive subsequences U1, u2, , u d as fol< ~ i be~ the
- iterations
~
lows. Let u ; ~< U;, <
among ul,u2, . ,u g that match a suffix of the pattern.
Then, U1 = {ul, ., ui1},U2 = {uil+l,.. ., uia}, ...,
ud = { ~ ; ~ - ~ + 1 ,,ug}.
. . . w e show that the sum of all
the character comparisons made by the last elements
in Ul,...,Ud
is bounded by SI - u1 - c(u1) c(s1).
The sum of all the character comparisons made by
the remaining elements in U1,. . ., u d is bounded by
[(SI - ul)(m i n ( y - * l ) j .
0
Proof Idea: We partition the sequence
-
+
Lower Bounds
.
+
Summing the bounds in Lemmas 3 and 4 for all possible light and heavy sequences, we obtain:
Theorem 1 Let p [ l , m ] be a pattern with period z and
let t[l,n] be a text. Algorithm S M finds all occurrences of p [ l ,m] in t [ l ,n] in O ( n + m ) time. In the
worst case, the algorithm performs at most n [(n ”) )J ch(Ira cter co mpa Tisons.
m)(
+
The analysis of algorithm S M shows that its main
source of inefficiency (for character comparisons) is due
to some shifts by one (of the pattern over the text)
during heavy sequences of iterations. We can devise
a new algorithm (algorithm S M ’ ) that takes care of
this shortcoming of S M while preserving all of its positive features. Informally , SM’ simulates S M when
executing an iteration in a light sequence or an iteration in a heavy sequence that does not cause a shift by
one if a mismatch is found. When executing an iteration in a heavy sequence that can cause a shift by one,
SM’ decides whether or not it is appropriate to simulate S M on that iteration. The decision is based on the
length of the suffix-prefix overlap between the current
iteration and the most recent iteration that matched
a suffix of the pattern. When SM‘ does not simulate
SM during iteration U ,it tries to find the longest prefix
of t[tlast(u) 1,n] that matches the regular expression
a+b, where a‘b is a prefix of p for some 1 > 0. We can
show:
+
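As a concrete illustration of this model, the text server and an algorithm that accesses the text only through it can be sketched as follows. All names here are ours (the paper defines the model abstractly), indices are 0-based rather than the paper's 1-based, and the brute-force matcher is only a placeholder; its cost can reach m(n - m + 1), far above the bounds analyzed above.

```python
# Sketch of the comparison-server model: the text is hidden behind a server
# that answers only the two allowed request kinds, "p[i] = t[j]?" and
# "t[i] = t[j]?", charging one unit per request.
class Server:
    def __init__(self, pattern, text):
        self._p = pattern
        self._t = text
        self.requests = 0          # total units charged so far

    def pt(self, i, j):            # request "p[i] = t[j]?"
        self.requests += 1
        return self._p[i] == self._t[j]

    def tt(self, i, j):            # request "t[i] = t[j]?"
        self.requests += 1
        return self._t[i] == self._t[j]


def naive_matcher(server, m, n):
    """Find all occurrences using only server requests (brute force)."""
    occ = []
    for s in range(n - m + 1):
        # all() short-circuits at the first mismatch, as a naive scan would
        if all(server.pt(j, s + j) for j in range(m)):
            occ.append(s)
    return occ
```

For example, with pattern aba and text ababa the naive matcher issues 7 requests (3 + 1 + 3 over the three alignments), already above the n + ⌊(n - m)/3⌋-type budgets of SM′.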
We prove lower bounds for c(n,m), c_on-line(n,m), and c¹_on-line(n,m). The bounds are derived using an adversary that need not know the strategy of the algorithm beforehand. Indeed, it selects the pattern; then it observes the questions that the algorithm asks about the text, and it fixes the text based on these questions. In doing so, the adversary forces the algorithm to "make mistakes" (i.e., to compare certain text characters twice). In what follows, we give a proof only for c_on-line(n,m).
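The adversary's case analysis in the proof of Theorem 3 (pattern aba) can be sketched as a small decision table. The encoding of questions below is our own convention: a question is ('pt', pos, ch) for "t[n0+pos] = ch?" or ('tt', pos, known) for "t[n0+pos] = t[j]?", where known is t[j] when j ≤ n0 and None when j is the other still-undetermined position.

```python
# Sketch of the adversary for p = aba: given the algorithm's first question
# about the undetermined positions n0+1, n0+2, commit to an extension of the
# text so that the answer to that first question is "no".
def extend(question):
    kind, pos, val = question
    if kind == 'tt' and val is None:          # case 3, j is the other new position
        return 'ba'                            # n1 = n0 + 2
    if (pos, val) in [(1, 'b'), (2, 'a')]:     # case 1 (case 3 may reduce to it)
        return 'aba'                           # n1 = n0 + 3
    return 'ba'                                # case 2 (case 3 may reduce to it)

def first_answer(question, ext):
    """The adversary's answer once the extension is fixed; always False."""
    kind, pos, val = question
    t = 'a' + ext                              # t[n0] = 'a', then the extension
    if kind == 'tt' and val is None:           # t[n0+1] compared with t[n0+2]
        return t[1] == t[2]
    return t[pos] == val
```

Enumerating all possible first questions confirms that every branch of the table yields the answer "no", i.e., the algorithm's first comparison beyond t[n0] is always wasted.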
Theorem 3 Assume that the alphabet size is greater than 2. For 0 < m ≤ 2, c_on-line(n,m) ≥ n for infinitely many n's. For odd m ≥ 3, c_on-line(n,m) ≥ n + ⌊2(n-m)/(m+3)⌋ for infinitely many n's.

Proof: The proof for the cases m = 1, 2 is obvious. We give a proof for the case m = 3; it can be generalized to any odd m > 3.

We choose the pattern p[1,3] = aba and show that there exists a sequence of texts of increasing length for which the lower bound holds. The adversary fixes the text on line, as the algorithm keeps asking for more text characters. The first step of our construction, for a text of length 3, is as follows. The adversary sets t[1,3] = p[1,3]. Any on-line algorithm must perform at least 3 comparisons between text and pattern characters in order to discover the occurrence at 1. Therefore, the lower bound holds.

Now assume that there exists a text t[1,n0] having p[1,3] as suffix, for which the lower bound holds and for which the algorithm already knows all occurrences of the pattern in the text. Assume also that no comparison was made beyond n0. We extend t[1,n0] to a text t[1,n1] having p[1,3] as suffix, for which the lower bound holds and no comparison is made beyond n1.

Since the algorithm already knows all occurrences of the pattern in the text and t[1,n0] has p[1,3] as suffix, the leftmost undecided position in t is n0. The on-line algorithm can ask questions involving text characters in t[n0, n0+2]. Since the algorithm knows that t[n0] = a and we are interested in a lower bound, we can assume it will not ask questions involving this text position.

The adversary fixes the text characters t[n0+1, n0+2], as well as n1 and t[n1], according to the first question that the algorithm asks about these text positions. It will force the algorithm to "make mistakes", as its first answer will be "no". This is done as follows.

1. The algorithm asks first "t[n0+1] = b?" or "t[n0+2] = a?". The adversary fixes n1 = n0+3 and t[n0+1, n1] = aba. Therefore, the answer to the question is no. The algorithm can rule out text position n0, since no occurrence can start there. But it cannot rule out text position n0+1, since t[n0+1, n0+3] = aba. The algorithm must ask at least 3 additional questions to find the occurrence of the pattern starting at n0+1. Therefore, the total number of questions involving text positions in t[n0+1, n1] is at least 4. Since the algorithm has made at least n0 + ⌊(n0-3)/3⌋ character comparisons before making comparisons with characters beyond t[n0], adding 4 to this bound and recalling that n1 = n0+3, we obtain a lower bound of n1 + ⌊(n1-3)/3⌋ for text t[1,n1]. Moreover, such a text has the pattern as suffix and no comparison was made beyond n1.

2. The algorithm asks first "t[n0+2] = b?" or "t[n0+1] = a?". The adversary fixes n1 = n0+2 and t[n0+1, n1] = ba. Therefore, the answer to the question is no. Now, there is an occurrence of the pattern in the text starting at n0, and any algorithm must perform at least 2 additional comparisons to find it. Therefore, the total number of questions involving text positions in t[n0+1, n1] is at least 3. Since the algorithm has made at least n0 + ⌊(n0-3)/3⌋ character comparisons before making comparisons with characters beyond t[n0], adding 3 to this bound and recalling that n1 = n0+2, we obtain a lower bound of n1 + ⌊(n1-3)/3⌋ for text t[1,n1]. (In fact, we get a larger bound in this case.) Moreover, such a text has the pattern as suffix and no comparison was made beyond n1.

3. The algorithm asks first "t[n0+1] = t[j]?" ("t[n0+2] = t[j]?", resp.), j ≤ n0+2. We consider two subcases. If j = n0+2 (j = n0+1, resp.), the adversary fixes n1 = n0+2 and t[n0+1, n0+2] = ba; the bound follows by a reasoning analogous to the one in (2.). If j ≤ n0, then: if t[j] = b (t[j] = a, resp.), the adversary fixes n1 = n0+3 and t[n0+1, n1] = aba, and a proof that the bound holds can be obtained as in (1.); if t[j] = a (t[j] = b, resp.), the adversary fixes n1 = n0+2 and t[n0+1, n0+2] = ba, and a proof that the bound holds can be obtained as in (2.).

For m = 3, the theorem follows inductively. □

For the general case, we can show:

Theorem 4 Assume that the alphabet size is greater than 2. For 0 < m ≤ 2, c(n,m) ≥ n for infinitely many n's. For odd m ≥ 3, c(n,m) ≥ n + ⌊2(n-m)/(m+3)⌋ for infinitely many n's.

The proof of Theorem 4 is more complicated than that of Theorem 3. In particular, to handle comparisons of the form "t[i] = t[j]?" we define a graph whose nodes are text positions and whose edges are the comparisons between them. To obtain the lower bound, we need to keep track of the number of connected components of this graph.

Finally, we prove a lower bound for c¹_on-line(n,m), the number of comparisons needed to find on-line an occurrence of the pattern in the text or, alternatively, to compute the predicate that has value true if and only if the pattern occurs in the text.

Theorem 5 Assume that the alphabet size is greater than 2. For 0 < m ≤ 3, c¹_on-line(n,m) ≥ n for infinitely many n's. For even m ≥ 4, c¹_on-line(n,m) ≥ n + ⌊2(n-m)/(m+2)⌋ for infinitely many n's.
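For contrast with these bounds, the classical upper bound c(n,m) ≤ 2n - m + 1 of [16, 18] cited in the summary can be observed directly by instrumenting a textbook Knuth-Morris-Pratt matcher with a comparison counter. The sketch below is a standard KMP implementation of our own (not the algorithm SM of this paper); only text-pattern comparisons during the scan are counted.

```python
# Textbook KMP matcher instrumented to count character comparisons made
# against the text; the worst case meets the classical 2n - m + 1 bound.
def kmp_count(p, t):
    m = len(p)
    # failure function: f[i] = length of the longest proper border of p[0..i]
    f = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    occ, count, k = [], 0, 0
    for i, c in enumerate(t):
        while True:
            count += 1                 # one text-pattern comparison
            if p[k] == c:
                k += 1
                break
            if k == 0:
                break
            k = f[k - 1]               # shift the pattern, keep text position
        if k == m:                     # full match ending at text position i
            occ.append(i - m + 1)
            k = f[k - 1]
    return occ, count
```

On pattern ab and text aaaaaa the counter reaches 11 = 2·6 - 2 + 1, so the classical bound is tight for this input.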
References

[1] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.

[2] A. Apostolico and M. Crochemore. Optimal canonization of all substrings of a string. Technical Report TR 89-75, LITP, Universite Paris 7, France, October 1989.

[3] A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. on Computing, 15:98-105, 1986.

[4] S.W. Bent and J.W. John. Finding the median requires 2n comparisons. In Proc. 17th Symposium on Theory of Computing, pages 213-216. ACM, 1985.

[5] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Comm. of the ACM, 20:762-772, 1977.

[6] R. Cole. Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm. Manuscript, 1990.

[7] M. Crochemore and D. Perrin. Two-way pattern matching. Technical report, Dept. of Computer Science, University of Paris, Paris, France, 1989.

[8] L.R. Ford and S.M. Johnson. A tournament problem. American Mathematical Monthly, 66:387-389, 1959.

[9] Z. Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. Comm. of the ACM, 22:505-508, 1979.

[10] L.J. Guibas and A.M. Odlyzko. A new proof of the linearity of the Boyer-Moore string matching algorithm. SIAM J. on Computing, 9:672-682, 1980.

[11] C.A.R. Hoare. An axiomatic basis for computer programming. Comm. of the ACM, 12:576-580, 1969.

[12] D.G. Kirkpatrick. Topics in the complexity of combinatorial algorithms. Technical report, Dept. of Computer Science, University of Toronto, Toronto, Canada, 1974.

[13] D.G. Kirkpatrick. A unified lower bound for selection and set partitioning problems. J. of the ACM, 28:150-165, 1981.

[14] S.S. Kislitsyn. On the selection of the kth element of an ordered set by pairwise comparison. Sibirskii Mat. Zhurnal, 5:557-564, 1964.

[15] D.E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1973.

[16] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. on Computing, 6:189-195, 1977.

[17] M. Lothaire. Combinatorics on Words. In G.-C. Rota, editor, Encyclopedia of Mathematics. Addison-Wesley, Reading, MA, 1983.

[18] J.H. Morris and V.R. Pratt. A linear pattern matching algorithm. Technical report, Dept. of Computer Science, University of California, Berkeley, CA, 1970.

[19] I. Pohl. A sorting problem and its complexity. Comm. of the ACM, 15:462-464, 1972.

[20] R.L. Rivest. On the worst-case behavior of string-searching algorithms. SIAM J. on Computing, 6:669-674, 1977.

[21] A. Schonhage, M. Paterson, and N. Pippenger. Finding the median. J. of Computer and System Sciences, 13:184-199, 1976.

[22] J. Schreier. On tournament elimination systems. Mathesis Polska, 7:154-160, 1932.

[23] C.K. Yap. New upper bounds for selection. Comm. of the ACM, 19:501-508, 1976.
Figure 1: A mismatch with a selected position (a nohole) in iteration b. The shift is by the period kmin(h_b) of the interrupted periodicity. The algorithm resumes with the first nohole aligned with a text position not before the position of the mismatch (1), since smaller noholes (e.g. (2)) must match. The comparison of (3) is forgotten, because after the shift it is aligned with a hole. On the other hand, if (4) was not compared before iteration b, it will not be compared at all.
Figure 2: A mismatch with a hole. A complete suffix of the pattern is matched. The shift is by the smallest period of the pattern past the mismatch. (In the first alignment a suffix of the pattern is shown and after the shift a prefix is shown.) The algorithm resumes as in Figure 1. After the shift the algorithm will only compare text positions greater than tlast = b + m.