Chapter 14 Another Bit-Parallel Approach to Solve Problem 1: The

Chapter 15
Algorithm
The WM2 Algorithm, the NB
The NB Algorithm to solve Problem 1 can be understood by considering the
following data where T  accgtctgtbbbbbbbbbbb , P  cgcc and k  1 . We can
easily see that the substring bbb of T can be ignored because there is simply no
hope to have any substring whose edit distance with P can be less than or equal to
k . So, our question is: How can we determine a substring of T can be ignored.
The Algorithms discussed in this chapter belong to a class of approximate string
matching algorithms based upon the filtering concept. That is, the algorithm is
divided into two stages: the filtering stage and the checking stage. In the
filtering stage, we eliminate regions of the text string T which cannot contain
solutions. In the checking stage, we check whether there are solutions in all of
the regions which remain.
In the previous chapters dealing with Problem 1 of the approximate string
matching, no algorithm involves opening windows. In the algorithms introduced in
this chapter, windows are opened. They are opened in the checking stage.
Section 15.1 Rule A3 in the Filtering Stage
In this section, we shall introduce a mechanism to perform the filtering. Let us
consider the case where we are given two strings A and B . We divide B into two
pieces: B1 and B2 . Assume that ED( A, B)  1 . We claim that either B1 , or B2 ,
will appear in A exactly. Assume the otherwise, let the substring in A which has
the shortest edit distance with B1 ( B2 ) be denoted as A1 ( A2 ) . ED( A1 , B1 )  1 and
ED( A2 , B2 )  1 . Thus it takes at least 1 step to transform B1 ( B2 ) into A1 ( A2 ) .
Since B1 and B2 are non-overlapping, ED( A, B)  1 which is contradictory to our
assumption that ED( A, B)  1 .
The above discussion is based upon the following theorem proved by Navarro
and Baeza-Yates in 1999.
Theorem 15.1-1 NB99
Let A and B be two strings. Let B be divided into j pieces
B1 , B2 , , B j . If ED( A, B )  k , then there exist at least one piece Bi and a
substring Ai in A such that ED( Ai , Bi )   k  .
 j 
Proof:
15-1
We prove by contradiction. Suppose there does not exist one piece Bi and a
substring Ai in A such that ED( Ai , Bi )   k  . Then for every Bi , let Ai
 j 
be the substring in A which has the shortest edit distance with Bi .
ED( Ai , Bi )   k  . It thus takes at least more than
 j 
into Ai . Since there are j Bi ’s, we conclude
contradictory to our assumption that ED( A, B )  k .
 k  steps to transform B
i
 j 
that ED( A, B )  k which is
Thus then there exist at least
one piece Bi and a substring Ai in A such that ED( Ai , Bi )   k  .
 j 
A special case of the above theorem in which j  k  1 was proved in [WM92].
The above theorem will be our Rule A3 for solving Problem 1. To use Rule
A3, we may simply let j  k  1 . This was pointed out in [WM92]. Thus we shall
call the algorithm based upon j  k  1 the WM2 Algorithm because in Chapter 13,
In this case  k   0 and there will be
 j 
at least one Bi which will exactly appear in A .
we also introduced the WM1 Algorithm..
Example 15.1-1
The First Example of Rule A3
Let A  agttgca and B  attgca . It can be seen that ED( A, B)  1 . Thus
k  1 . Let j  k  1  2 . We divide B  attgca into two pieces, say B1  att
and B2  gca . We can see that B2  gca exactly appears in A  agttgca .
Example 15.1-2
The Second Example of Rule A3
Let A  aatgga and B  attgca . Suppose we divide B  attgca as we did
in Example 15.1-1., namely B1  att and B2  gca . We can see that neither B1
nor B2 exactly in A  aatgga , we may conclude that ED( A, B)  1 .
Example 15.1-3 The Third Example of Rule A3
Let A  actgctaatga , B  agtaatga . It can be seen that ED( A, B)  3 . In
this case, k  3 . Let j  k  1  4 . We divide B  agtaatga into four pieces: ag,
ta, at and ga. We can see that the last three pieces all appear in A  actgctaatga .
Note that for every scheme of dividing B  agtaatga into four pieces, there
must be one piece which appears exactly in A  actgctaatga . For example, we may
divide B  agtaatga into agt, aat, g and a. Again, we can see that the last three
pieces all appear exactly in A .
15-2
We must be careful about the following points:
(1) For any dividing of B into j  k  1 pieces, if no piece of B exactly
appears in A , we are sure that ED( A, B )  k .
(2) If for some dividing, one piece of B exactly appears in A ,, we cannot
conclude that ED( A, B)  k . Further checking is still needed.
(3) If the dividing of B is such that one piece contains only one character, that
piece will very likely appear exactly in A and thus a further checking will be
performed. To avoid this kind of trouble, it is advised to divide B into equal sized
pieces.
We may use Rule A3 in the following way to solve Problem 1.
Rule A3 for Solving Problem 1
For a given pattern P , we divide it into j  k  1 pieces. If none of these
pieces exactly appears in the given text T , we claim that there is no solution for
Problem 1 with error bound k in this case. If some pieces exactly appear in T ,
we open windows in T which contain pieces of P and ignore all windows
which do not contain any piece of P .
In the above, we discussed the filtering stage. The next stage is the checking
stage in which windows are opened. We of course have to know the appropriate size
and location of a window. This will be discussed in the next section.
Section 15.2
The Checking Stage of the WM2Algorithm.
Suppose that we are given two strings A and B with sizes n and m
respectively, if ED( A, B )  k , the difference between n and m can be neither too
small nor too large. In fact, we will use Rule A2 which was presented in Section
11.5..
Rule A2
Given two strings A and B with sizes n and m respectively, if
ED( A, B )  k , m  k  n  m  k .
Suppose that, by using Rule A3, we have found some substring which may
contain a solution, we may now open a window containing this substring. Based upon
Rule A2, we conclude that the window size should be m  2k as shown in Fig.
15.2-1.
15-3
k
m
k
Fig. 15.2-1 The window size for the BN Algorithm
Our next point to note is the exact location of the window. As we indicated
before, the window must contain a substring Ti which exactly matches with a
piece Pi of P . Our window should thus be located in such a way that Ti is
aligned with Pi
To prove this, we show that an error may occur if Ti is not
aligned with Pi
Let us consider the following case:
1
2
3
4
5
6
7
8
9
10
c
g
t
g
a
a
T
=
t
g
c
a
P
=
a
c
c
t
Assume that k  1 .
We can see that there is a solution ending at t 7 .
Let j  k  1  2 . Suppose we divide P  acct into two pieces, namely
P1  ac and P2  ct . Since P1  ac appears exactly in T , we know that there is
hope for a solution. As we indicated before, we need to open a window whose size
is 4  2  6 . Consider the window W  T (1,6)  tgcacg which contains P1  ac
and suppose pattern P is aligned with W as below:
T
=
P
=
1
2
3
4
5
6
7
8
9
10
t
g
c
a
c
g
t
g
a
a
a
c
c
t
For this case, although the window contains P1  ac , P1  ac is not aligned
with ac in the window. For this window and the location of P , actually no
solution can be found. Let us now align P1  ac exactly with the ac in T as
shown below:
T
=
P
=
1
2
3
4
5
6
7
8
9
10
t
g
c
a
c
g
t
g
a
a
a
c
c
t
15-4
The window now is W  T (3,8)  cacgtg . It can be seen that the solution
ending at t 7 is in this window. If we use Seller’s Algorithm on this window W and
P , a solution ending at t 7 can be found.
The reader may now appreciate that the WM2 Algorithm as a filtering algorithm
because some windows are not opened. If we do not Rule A3 to filter out many
regions, we would have to open much more windows as we pointed out before.
Let Ti be a substring in T which is identical to a piece Pi . Since we
know that if in a window W containing Ti , Ti is not aligned with Pi , this
window may miss a solution, we conclude that for any window W containing
Ti , Ti must be aligned with Pi .
For a
Pi which appears in T , we construct a window by the following
procedure:
Procedure A for the Opening of a Window in the WM2 Algorithm
Procedure A
Step 1. Let Ti be a substring in T which is identical to Pi .
Step 2. Align Ti in T with Pi in P .
Step 3. Let S be the substring in T which is aligned with P by Step 2.
Step 4. Extend S in T k locations to the right and k locations to the left.
The resulting substring is window W .
Fig. 15.2-1 illustrates the resulting window.
W
m
k
Ti
S
Pi
P
Fig. 15.2-1 An illustration of Procedure A
15-5
k
It is understood here that if the alignment causes the window to be out of the
boundary, as shown in Fig. 15.2-2, that widow is to be ignored because the window
cannot cover the entire pattern P .
T
W
T
Ti
k
P
k
Pi
Fig. 15.2-2 An out of the boundary window
TheWM2 Algorithm works as follows:
Algorithm 15.1
The WM2 Algorithm for Solving Problem 1
Input: A text string T  t1t 2 t n , a pattern string P  p1 p2  pm and an error
bound k .
Output: Every location i in T such that there exists a suffix of T (1, i ) whose
edit distance with P is less than or equal to k .
Step 1: Divide P into j  k  1 pieces P1 , P2 , , Pj .
Step 2: Check whether any Pi exactly appears in T . If no Pi exactly appears
in T , declare the solution is  and exit.
Step 3: Otherwise, for every Pi which exactly appears in T , open a window by
using Procedure A.
Step 4. Apply any algorithm which solves Problem 1 on W and P and report
the result.
Example 15.2-1
Consider the following set of data where k  1 .
1
2
3
4
5
6
7
8
9
10
11
12
13
t
g
t
a
t
t
g
c
t
T
=
a
c
g
t
P
=
c
g
t
g
Let j  k  1  2 . We divide P  cgtg into P1  cg and P2  tg . Both
P1 and P2 appear exactly in T . The solution of Problem 1 can be found by
opening the following windows:
1. Window 1: T (1,6) which aligns P1  cg in P with a cg in T as follows:
15-6
T=
1
2
3
4
5
6
a
c
g
t
t
g
c
g
t
g
P=
From this window, we can see that locations 5 and 6 are solutions.
2. Window 2:
T ( 2,7) which aligns P2  tg in P with a tg in T as follows:
2
3
4
5
6
7
c
g
t
t
g
t
c
g
t
g
In this window, location 6 is a solution.
3. Window 3: T (7,12) which aligns P2  tg in P with a tg in T as follows:
7
8
9
10
11
12
t
g
t
t
g
c
c
g
t
g
No solution for Problem 1 can be found in this window.
Conclusion:
The solution is {5,6} .
Example 15.2-2
Consider the following data with k  1 .
1
2
3
4
5
6
7
8
9
10
11
12
13
t
g
t
a
t
t
g
c
t
T
=
a
c
g
t
P
=
c
a
t
c
We divide P  catc into P1  ca and P2  tc . Since neither P1  ca nor
P2  tc exactly appears in T , we conclude the solution is  .
Section 15.3
The NB Algorithm
15-7
In the above, the WM2 Algorithm was introduced. In the WM2 Algorithm, the
pattern is divided into k  1 pieces. It can be imagined that each piece will be
relatively small. Thus there is a high probability that many pieces may exactly
appear in the text string. For each such a case, we have to open a window to check
whether the pattern appears in the window within the error bound. This may be
quite time-consuming.
To understand the NB Algorithm, we need two lemmas based upon Theorem
15.1-1.
Lemma 15.3-1
Let A and B
B1 , B2 , , B j .
be two strings.
Let B be divided into j pieces
L
Let B  B1 B2  B j1 and B R  B j1 1 B j1 2  B j .
If
ED( A, B )  k , then either for B L , there exists a substring A L in A such that
 j k
ED( A L , B L )   1  or for B R , there exists a substring A R in A such that
 j 
 ( j  j1 )k 
ED( A R , B R )  
.
j


Lemma 15.3-2
Let A and B be two strings.
Let B be divided into k  1 pieces
L
B1 , B2 , , Bk 1 . Let B  B1 B2  B j1 and B R  B j11B j12  Bk 1 . If no Bi ,
1  i  j1 ( j1  1  i  k  1) , appears in A , then there exists no substring A L ( A R )
 j k 
 (k  1  j1 )k  
in A such that ED( AL , B L )   1  ED( AR , B R )  
.
k  1  
 k  1 

Example 15.3-1
Consider T  cctagattttagcctct , P  acagcgaa and k  3 .
to WM2 Algorithm, we would do the following:
Thus, according
(1) Divide P into k  1  4 pieces, namely P1  ac, P2  ag , P3  cg and P4  aa .
(2) We find only P2  ag appears exactly in T and it appears twice.
(3) We open two windows with size m  2k  8  2  3  14 .
For the NB Algorithm, we proceed as follows:
(1) Merge P1 and P2 into P L  acag and P3 and P4 into P R  cgaa .
(2) We now have j1  2 . Applying Lemma 15.3-1, we now try to see whether there
is a substring in T which has edit distance with P L or P R less than or equal to
 2k   6 
 k  1   4   1
15-8
(3) By using Lemma 15.3-2, we may ignore P R .
(4) To test P L , we again have to open two windows. But the size of the each
m
 2 1  4  2  6 .
window is now
2
In fact, in this example, P L fails the test and thus we may terminate with no
solution. We can see that the NB Algorithm is less time-consuming in this example.
Algorithm 15.2 The NB Algorithm for Solving Problem 1
Input: A text string T  t1t 2 t n , a pattern string P  p1 p2  pm and an error
bound k .
Output: Every location i in T such that there exists a suffix of T (1, i ) whose
edit distance with P is less than or equal to k .
Step 1: Split P into k  1 pieces, named P1 , P 2 , , P k 1 .
Step 2: Check whether any piece P j exactly appears in T , where 1  j  k  1 .
If no piece exactly appears in T , declare the solution is  and exit.
Step 3: Let jmid  (k  1) 2 .
Step 3.1 If there exists a j  jmid such that P j exactly appears in T , for each
such P j , open a window W in T by using Procedure A with P j as the
input, apply any algorithm to solve Problem 1 on W and string
P L  P1 P 2  P jmid with error bound  jmid  k (k  1) . For each P j whose
answer is positive, label it as P j '
Step 3.2 If there exists a j  jmid such that P j exactly appears in T , for each
such P j , open a window W in T by using Procedure A with P j as the
input, apply any algorithm to solve Problem 1 on W and string
P R  P jmid 1 P jmid 2  P k 1 with error bound (k  1  jmid )  k (k  1) . For each
P j whose answer is positive, label it as P j '
Step 3.3. If neither P L nor P R appears in T with their error bound,
declare the solution is  and exit.
Step 4 For each P j ' obtained in Steps 3.1 and 3.2, open a window W in T by
using Procedure A with P j as the input, apply any algorithm to solve Problem 1 on
W and string P with error bound k . Report the solutions if there is any.
Example 15.3-2
Let the input data as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
T = a t a g g t t g c a g g t
a g t
a a t
P = t g c a g c c c
k 3
(1) P1  tg, P 2  ca, P 3  gc and P 4  cc .
15-9
(2) j mid  2 .
(3) Only P1  tg and P 2  ca appear exactly in T .
(4) P L  P1 P 2  tgca . We ignore P R  P 3 P 4 because none of P 3 and P 4
appears exactly in T .
(5) For P1  tg , we open W  T (6,11)  ttgcag which aligns with P1  tg .
There exists one substring in this window having edit distance smaller than or equal to
1 with P L  tgca . Thus we label P1 as P1' .
(6) Similarly, we label P 2 as P 2' . The window is the same.
(7) We open window T (4,17)  ggttgcaggtagta and find that there is a substring
inside it, namely T (7,14)  tgcaggta whose edit distance with P  tgcagccc is
3  k . Thus location 14 is the solution.
Example 15.3.-3
Let the input data as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
T = a t a g g t t g c a g g t
a g t
a a t
P = t c c a t c c c
k 3
(1) P1  tc, P 2  ca, P 3  tc and P 4  cc
(2) j mid  2 .
(3) Only P 2  ca appears exactly in T .
(4) P L  P1 P 2  tcca . We ignore P R  P 3 P 4 because none of P 3 and P 4
appears exactly in T .
(5) For P 2  ca , we open window W  T (6,11)  ttgcag which aligns with
P 2  ca . There exists no substring in this window having edit distance smaller than
or equal to 1 with P L  tcca .
(6) Terminate and report no solution.
15-10