PART I

PART II
Approximate String
Matching Algorithms
Chapter 10
The Edit Distance
In Part I of this book, the algorithms are all exact string matching algorithms. The
problem is: Given a text string T and a pattern string P , the exact string
matching problem is to find whether P appears in T and if it does, where it
appears. For the approximate string matching problem, we are also given a text
string T and a pattern string P . We ask whether a substring which is quite similar
to P appears in T . For example, let us consider the following case:
        1  2  3  4  5  6  7  8  9 10 11 12 13
    T = a  c  g  t  t  t  a  a  c  t  t  g  c

and

    P = t  t  c  a  c

We can see that $T(5,9) = \text{ttaac}$ is quite similar to $P = \text{ttcac}$. Therefore, we may
say that there is an approximate solution of this problem.
To precisely define the approximate string matching problem, we need a precise
definition of similarity. We shall use the edit distance to measure the similarity
between two strings:
Section 10.1
The Definition of Edit Distance
Given two strings, we define three operations: insertion, deletion and substitution
between them.
Insertion:
Let

        1  2  3  4  5  6  7  8  9 10
    A = a  c  t  a  c  g  t  g  a  a

and

        1  2  3  4  5  6  7  8  9
    B = a  c  t  c  g  t  g  a  a

Suppose we insert "a" between $b_3$ and $b_4$; string B will then become identical to
string A.
Deletion:
Suppose we have

        1  2  3  4  5  6  7  8  9 10 11
    B = a  c  t  a  g  c  g  t  g  a  a

By deleting $b_5$, string B will become identical to string A.
Substitution:
Suppose that we have:

        1  2  3  4  5  6  7  8  9 10
    B = a  c  t  a  c  t  t  g  a  a

If we substitute $b_6 = \text{t}$ by g, string B will become identical to string A.
Definition 10.1-1
Edit Distance
Given two strings A and B, the edit distance between A and B, denoted as
$ED(A, B)$, is defined as the minimum number of insertions, deletions and
substitutions needed to transform string B to string A. Although these
operations can be performed on both A and B, we stipulate that all operations
are performed on string B. Note that this rule does not lose generality, because
an insertion into A has the same effect as a deletion from B, and vice versa.
Example 10.1-1
Let us assume that

        1  2  3  4  5  6  7  8  9 10
    A = a  c  c  t  a  g  t  t  a  g

and

        1  2  3  4  5  6  7  8  9 10 11
    B = a  c  t  g  a  g  t  a  a  g  t

It can be proved that the following four operations will transform string B to
string A:
1. inserting c after $b_2$.
2. deleting $b_4 = \text{g}$.
3. substituting $b_8 = \text{a}$ by t.
4. deleting $b_{11} = \text{t}$.
It can also be proved that the minimum number of insertions, deletions and
substitutions to transform string B to string A is 4. Thus $ED(A, B) = 4$.
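The effect of a list of operations such as the one above can be checked mechanically. The following is a minimal sketch, not part of the original text: the function name `apply_edits` and the tuple encoding of the operations are hypothetical, positions refer to the original string B, and it is assumed that no two operations share a position.

```python
def apply_edits(b, edits):
    """Apply edits given as 1-based positions of the ORIGINAL string b.

    ("ins", k, ch): insert ch after b_k (k = 0 inserts at the front)
    ("del", k):     delete b_k
    ("sub", k, ch): replace b_k by ch
    Assumption: no two edits refer to the same position k.
    """
    chars = list(b)
    # Work from the highest position down so earlier positions stay valid.
    for op in sorted(edits, key=lambda e: e[1], reverse=True):
        kind, k = op[0], op[1]
        if kind == "ins":
            chars.insert(k, op[2])
        elif kind == "del":
            del chars[k - 1]
        elif kind == "sub":
            chars[k - 1] = op[2]
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return "".join(chars)


A = "acctagttag"
B = "actgagtaagt"
edits = [("ins", 2, "c"), ("del", 4), ("sub", 8, "t"), ("del", 11)]
print(apply_edits(B, edits) == A)
```

The check only confirms that the four operations transform B into A; it does not show that four is the minimum.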
Example 10.1-2
We have

        1  2  3  4  5  6  7  8
    A = a  c  c  g  t  a  t  g

and

        1  2  3  4  5  6  7  8  9
    B = c  c  g  a  c  c  c  g  a

It can be proved that we need at least six operations to transform string B to
string A. Thus $ED(A, B) = 6$. The following is one set of these six operations:
1. inserting a before $b_1$.
2. inserting t after $b_3$.
3. deleting $b_5 = \text{c}$ and $b_6 = \text{c}$.
4. substituting $b_7 = \text{c}$ by t.
5. deleting $b_9 = \text{a}$.
Having defined the edit distance, we need an algorithm to find the edit distance
between strings. The following section will give the first method.
Section 10.2 The First Dynamic Programming Algorithm
to Find the Edit Distance
The problem of finding the edit distance is an optimization problem and can be solved
by the dynamic programming approach. We are given two strings: $A = a_1 a_2 \cdots a_i$
and $B = b_1 b_2 \cdots b_j$. Our job is to find $ED(A, B)$. Let us first define a new term,
denoted as $ed(i, j) = ED(A(1, i), B(1, j))$. Then we have the following statement:
If $a_i = b_j$, $ed(i, j) = ed(i-1, j-1)$.
The above statement is obviously correct because the addition of a pair of
characters which are identical to each other to $A(1, i-1) = a_1 a_2 \cdots a_{i-1}$ and
$B(1, j-1) = b_1 b_2 \cdots b_{j-1}$ will not alter the edit distance between $A(1, i-1)$ and
$B(1, j-1)$.
If $a_i \neq b_j$, we have to examine each of the following three operations:
1. Inserting a character which is equal to $a_i$ after the character $b_j$ in B as
shown in Fig. 10.2-1.

    A = a_1  a_2  ...  a_{i-1}  a_i
    B = b_1  b_2  ...  b_j      a_i  (inserted)

Fig. 10.2-1 An insertion operation in the finding of the edit distance.
In this case, we merely have to take a look at $A(1, i-1) = a_1 a_2 \cdots a_{i-1}$ and
$B(1, j) = b_1 b_2 \cdots b_j$. Suppose we have already computed
$ed(i-1, j) = ED(A(1, i-1), B(1, j))$; we then simply add 1 to this distance as this 1 is
the cost of the insertion. Thus, in this case,
$$ed(i, j) = ed(i-1, j) + 1.$$
2. Deleting $b_j$ as shown in Fig. 10.2-2.

    A = a_1  a_2  ...  a_{i-1}  a_i
    B = b_1  b_2  ...  b_{j-1}  b_j  (deleted)

Fig. 10.2-2 A deletion operation in the finding of the edit distance.
In this case, we should only compute the edit distance between
$A(1, i) = a_1 a_2 \cdots a_i$ and $B(1, j-1) = b_1 b_2 \cdots b_{j-1}$ and add 1, which is the cost of
deleting $b_j$, to it. Thus, in this case,
$$ed(i, j) = ed(i, j-1) + 1.$$
3. Substituting $b_j$ by $a_i$ as shown in Fig. 10.2-3.

    A = a_1  a_2  ...  a_{i-1}  a_i
    B = b_1  b_2  ...  b_{j-1}  b_j  (b_j substituted by a_i)

Fig. 10.2-3 A substitution operation in the finding of the edit distance.
In this case, we merely have to compute the edit distance between
$A(1, i-1) = a_1 a_2 \cdots a_{i-1}$ and $B(1, j-1) = b_1 b_2 \cdots b_{j-1}$ and add 1, which is the cost of
substitution, to it. Thus, in this case,
$$ed(i, j) = ed(i-1, j-1) + 1.$$
Although there are three possible operations, we select the one that yields the
smallest value. Thus, we have the following recursive formula for the first edit
distance finding algorithm:
Procedure 10.2-1 The Procedure to Find the Edit Distance for the First Edit
Distance Finding Algorithm
If $a_i = b_j$,
$$ed(i, j) = ed(i-1, j-1) \tag{10.2-1}$$
If $a_i \neq b_j$,
$$ed(i, j) = \min\begin{cases} ed(i-1, j) + 1 & \text{(I)} \\ ed(i, j-1) + 1 & \text{(D)} \\ ed(i-1, j-1) + 1 & \text{(S)} \end{cases} = 1 + \min\begin{cases} ed(i-1, j) \\ ed(i, j-1) \\ ed(i-1, j-1) \end{cases} \tag{10.2-2}$$
$$ed(i, 0) = i,\ 0 \le i \le n; \qquad ed(0, j) = j,\ 0 \le j \le m \tag{10.2-3}$$
Here I, D and S mark the insertion, deletion and substitution cases respectively.
The first algorithm to compute the edit distance between two strings is as
follows:
Algorithm 10.1 The First Algorithm to Compute the Edit Distance between
Two Strings
Input: Strings A(1, n) and B(1, m)
Output: The edit distance between strings A and B.
For i = 0 to n, $ed(i, 0) = i$
For j = 0 to m, $ed(0, j) = j$
For i = 1 to n,
    For j = 1 to m,
        If $a_i = b_j$, $ed(i, j) = ed(i-1, j-1)$
        If $a_i \neq b_j$, $ed(i, j) = \min\{ed(i-1, j) + 1,\ ed(i, j-1) + 1,\ ed(i-1, j-1) + 1\} = 1 + \min\{ed(i-1, j),\ ed(i, j-1),\ ed(i-1, j-1)\}$
Report $ed(n, m)$ as the edit distance between A and B.
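Algorithm 10.1 translates almost line by line into code. The following Python sketch is one possible rendering, not the authors' implementation; the function names `edit_distance_table` and `edit_distance` are chosen here for illustration, and strings are 0-indexed in Python, so $a_i$ is `a[i - 1]`.

```python
def edit_distance_table(a, b):
    """Return the table ed, where ed[i][j] = ED(A(1,i), B(1,j))."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i              # Equation (10.2-3): i insertions into an empty B
    for j in range(m + 1):
        ed[0][j] = j              # Equation (10.2-3): j deletions from B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]            # Equation (10.2-1)
            else:
                ed[i][j] = 1 + min(ed[i - 1][j],       # insertion
                                   ed[i][j - 1],       # deletion
                                   ed[i - 1][j - 1])   # substitution
    return ed


def edit_distance(a, b):
    """Edit distance ED(A, B) = ed(n, m)."""
    return edit_distance_table(a, b)[len(a)][len(b)]


print(edit_distance("acctagttag", "actgagtaagt"))
```

The table uses (n + 1)(m + 1) entries and each entry is filled in constant time, so both the time and the space of this sketch are proportional to nm.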
For Equation (10.2-3), $ed(i, 0)$ is the edit distance between $A(1, i) = a_1 a_2 \cdots a_i$
and an empty string B. Thus we must insert i elements into B in order to
transform B to $A(1, i)$. That is why $ed(i, 0) = i$. Similarly, $ed(0, j)$ is the edit
distance between an empty string A and $B(1, j) = b_1 b_2 \cdots b_j$. Because we need to delete j
elements of B in order to transform $B(1, j) = b_1 b_2 \cdots b_j$ to A, $ed(0, j) = j$. We
must note that $ED(A, B) = ed(n, m)$ if the lengths of A and B are n and m
respectively.
Let us examine Equation (10.2-1) again. We did say that $a_i = b_j$ will cause
$ed(i, j) = ed(i-1, j-1)$. But we did not say that $ed(i, j) = ed(i-1, j-1)$ can
only be caused by $a_i = b_j$. This is explained by the following example.
Example 10.2-1
Let A = accgatgc and B = aaacga. $ED(A, B)$ is found as shown in Table
10.2-1.
Table 10.2-1 The calculation of the edit distance between
A = accgatgc and B = aaacga.

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0             0    1    2    3    4    5    6    7    8
     1   a         1    0    1    2    3    4    5    6    7
     2   a         2    1    1    2    3    3    4    5    6
     3   a         3    2    2    2    3    3    4    5    6
     4   c         4    3    2    2    3    4    4    5    5
     5   g         5    4    3    3    2    3    4    4    5
     6   a         6    5    4    4    3    2    3    4    5

In the above table, we can find $ed(i, j)$ for $1 \le i \le 8$ and $1 \le j \le 6$. For
instance, we can see $ed(3,4) = 2$ and $ed(6,6) = 3$.
Let us see whether $ed(3,4) = 2$ is correct. $A(1,3) = \text{acc}$ and $B(1,4) = \text{aaac}$. By deleting $b_2$ and
substituting $b_3 = \text{a}$ by c, we can transform $B(1,4) = \text{aaac}$ to $A(1,3) = \text{acc}$. Thus
$ed(3,4) = 2$ is correct.
Let us see how the computation is done by examining some cases. Consider
$ed(5,3)$. In this case, i = 5 and j = 3. Note that $a_5 = \text{a} = b_3$. According
to Equation (10.2-1), $ed(5,3) = ed(4,2) = 3$.
Consider $ed(5,5)$. In this case, i = 5 and j = 5. Since
$a_5 = \text{a} \neq b_5 = \text{g}$, we apply Equation (10.2-2):
$$ed(5,5) = \min\begin{cases} ed(i-1, j) + 1 = ed(4,5) + 1 = 2 + 1 = 3 \\ ed(i, j-1) + 1 = ed(5,4) + 1 = 4 + 1 = 5 \\ ed(i-1, j-1) + 1 = ed(4,4) + 1 = 3 + 1 = 4 \end{cases} = 3$$
The case of $ed(5,5)$ confirms our earlier statement. Note that
$ed(5,5) = ed(4,4)$. But we cannot say that this is caused by $a_5 = b_5$. In fact, we
note that $a_5 \neq b_5$.
The Tracing Back of the Dynamic Programming Table
In the above, we showed how to find the edit distance. But, it is usually not
sufficient to find the edit distance. We also want to know how many insertions,
deletions and substitutions are involved in the edit distance. To achieve this, we
need to perform a tracing back in the dynamic programming table.
To trace back from $ed(i, j)$, we use the following procedure:
Procedure 10.2-2: Procedure for Tracing Back of the Dynamic Programming
Table for the Edit Distance
Case 1: If $a_i \neq b_j$, let $x = \min\{ed(i-1, j) + 1,\ ed(i, j-1) + 1,\ ed(i-1, j-1) + 1\}$ and point $ed(i, j)$ to
the locations of x. Note that there may be more than one such location.
Case 2: If $a_i = b_j$, let $x = \min\{ed(i-1, j) + 1,\ ed(i, j-1) + 1,\ ed(i-1, j-1)\}$ and point $ed(i, j)$ to
the locations of x. Note that there may be more than one such location.
The correctness of the above procedure will be proved later as we present
Procedure 10.3-1. It is obvious that this procedure is correct for the case of $a_i \neq b_j$,
by consulting Equation (10.2-2). But it is not obvious at all that this procedure is
also correct for the case of $a_i = b_j$. Essentially, by consulting Equation
(10.2-1), we need to prove the following:
If $a_i = b_j$,
$$ed(i-1, j-1) = \min\begin{cases} ed(i-1, j) + 1 \\ ed(i, j-1) + 1 \\ ed(i-1, j-1) \end{cases}$$
But Procedure 10.2-2 points out another critical point as follows:
If $a_i = b_j$, $ed(i-1, j-1)$ may also be equal to $ed(i-1, j) + 1$ or $ed(i, j-1) + 1$.
This happens when $ed(i-1, j-1) = ed(i-1, j) + 1$ or
$ed(i-1, j-1) = ed(i, j-1) + 1$.
Let us consider Table 10.2-1. Let i = 1 and j = 2. In this case, $a_1 = b_2$.
Therefore, $ed(i, j) = ed(1,2) = ed(i-1, j-1) = ed(0,1) = 1$. But $ed(1,2) = 1$
may also be caused by another situation. Note that $ed(i, j-1) = ed(1,1) = 0$. Thus
$ed(i, j) = ed(1,2) = ed(i, j-1) + 1$. This means that $ed(i, j) = 1$ can also be caused
by the deletion of $b_2$. Let us take a look at the situation. Note that $A(1,1) = \text{a}$
and $B(1,2) = \text{aa}$. Since $a_1 = b_2$, we may say that $ed(1,2) = ed(0,1) = 1$. But we
may also delete $b_2 = \text{a}$ and transform $B(1,2)$ into $B(1,1) = \text{a}$ which is exactly
equal to $A(1,1) = \text{a}$. Thus $ed(1,2) = ed(1,1) + 1 = 0 + 1 = 1$. What we are saying is
that we can transform $B(1,2) = \text{aa}$ into $A(1,1) = \text{a}$ through a deletion operation
although $a_1 = b_2$.
One may find cases discussed above in Table 10.2-1. They occur at locations
(1,2), (1,3), (1,6) and (5,1). In all such cases, $a_i = b_j$, yet the tracing back will point
not only to $(i-1, j-1)$. It may also point to $(i-1, j)$ or $(i, j-1)$.
In general, there are three possibilities for the tracing back by using Procedure
10.2-2 as shown in the following figures:
Case 1: $(i, j) \to (i-1, j)$ (insertion)

    (i-1, j-1)    (i, j-1)
    (i-1, j)      (i, j)

Fig. 10.2-4 The tracing back when $ed(i-1, j) < ed(i, j)$ (inserting $a_i$ after $b_j$).
For instance, in Table 10.2-1, (4, 4) points back to (3, 4), as an insertion is done.
Case 2: $(i, j) \to (i, j-1)$ (deletion)

    (i-1, j-1)    (i, j-1)
    (i-1, j)      (i, j)

Fig. 10.2-5 The tracing back when $ed(i, j-1) < ed(i, j)$ (deleting $b_j$).
For instance, in Table 10.2-1, (4, 6) points back to (4, 5), as a deletion is done.
Case 3: $(i, j) \to (i-1, j-1)$ (substitution or $b_j = a_i$)

    (i-1, j-1)    (i, j-1)
    (i-1, j)      (i, j)

Fig. 10.2-6 The tracing back when $ed(i-1, j-1) < ed(i, j)$
(substituting $b_j$ by $a_i$) or $b_j = a_i$.
We trace back from $(i, j)$ to $(i-1, j-1)$ if any of the following conditions is
satisfied.
(1) $ed(i-1, j-1) < ed(i, j)$. It occurs when a substitution is done. For
instance, in Table 10.2-2, (4, 2) will point back to (3, 1) because $b_2$ may be
substituted by $a_4$.
(2) $b_j = a_i$. In this case, $ed(i-1, j-1) = ed(i, j)$. For instance, (5, 6) points
back to (4, 5) because $a_5 = b_6$.
One of the tracing backs from location (8,6) by using Procedure 10.2-2 is
displayed in Table 10.2-2.
Table 10.2-2 One tracing back of location (8,6) in Table 10.2-1
(locations on the tracing-back path are shown in brackets)

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0            [0]   1    2    3    4    5    6    7    8
     1   a         1   [0]   1    2    3    4    5    6    7
     2   a         2    1   [1]   2    3    3    4    5    6
     3   a         3    2   [2]   2    3    3    4    5    6
     4   c         4    3    2   [2]   3    4    4    5    5
     5   g         5    4    3    3   [2]   3    4    4    5
     6   a         6    5    4    4    3   [2]  [3]  [4]  [5]

From the above table, we can see that in this tracing back, there are one
substitution, one deletion and three insertions in the determining of $ed(8,6)$ as
follows:
1. substituting $b_2 = \text{a}$ by $a_2 = \text{c}$,
2. deleting $b_3 = \text{a}$,
3. inserting $a_6 = \text{t}$, $a_7 = \text{g}$ and $a_8 = \text{c}$ after $b_6$.
Note that for $ed(8,6)$, there are many other paths of tracing back.
Let us consider location (4,4). In this case, there are many paths of tracing back
as shown in Table 10.2-3.
Table 10.2-3 The tracing back of location (4,4) in Table 10.2-1

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0             0    1    2    3    4    5    6    7    8
     1   a         1    0    1    2    3    4    5    6    7
     2   a         2    1    1    2    3    3    4    5    6
     3   a         3    2    2    2    3    3    4    5    6
     4   c         4    3    2    2    3    4    4    5    5
     5   g         5    4    3    3    2    3    4    4    5
     6   a         6    5    4    4    3    2    3    4    5

As can be seen, there are four paths for tracing back from location (4,4),
displayed as below. We use S, I, D, and M to denote substitution, insertion, deletion
and matching respectively.
(1) (4, 4) S (3, 3) S (2, 2) S (1, 1) M (0, 0)
(2) (4, 4) I (3, 4) M (2, 3) D (2, 2) S (1, 1) M (0, 0)
(3) (4, 4) I (3, 4) M (2, 3) S (1, 2) D (1, 1) M (0, 0)
(4) (4, 4) I (3, 4) M (2, 3) S (1, 2) M (0, 1) D (0, 0)
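The tracing back of Procedure 10.2-2 can be sketched as a recursive enumeration of all paths. The code below is illustrative only: `predecessors` and `all_paths` are hypothetical names, and instead of recomputing the minimum x the sketch tests which neighbors attain the already computed value $ed(i, j)$, which gives the same locations provided the procedure is correct.

```python
def edit_distance_table(a, b):
    """ed[i][j] = ED(A(1,i), B(1,j)), filled as in Algorithm 10.1."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i
    for j in range(m + 1):
        ed[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(ed[i - 1][j], ed[i][j - 1], ed[i - 1][j - 1])
    return ed


def predecessors(ed, a, b, i, j):
    """Locations that (i, j) points back to, labeled I, D, S or M."""
    preds = []
    if i > 0 and ed[i - 1][j] + 1 == ed[i][j]:
        preds.append(("I", (i - 1, j)))          # insertion of a_i
    if j > 0 and ed[i][j - 1] + 1 == ed[i][j]:
        preds.append(("D", (i, j - 1)))          # deletion of b_j
    if i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            if ed[i - 1][j - 1] == ed[i][j]:
                preds.append(("M", (i - 1, j - 1)))   # matching
        elif ed[i - 1][j - 1] + 1 == ed[i][j]:
            preds.append(("S", (i - 1, j - 1)))       # substitution
    return preds


def all_paths(ed, a, b, i, j):
    """Every tracing-back path from (i, j) to (0, 0)."""
    if i == 0 and j == 0:
        return [[(0, 0)]]
    paths = []
    for label, (pi, pj) in predecessors(ed, a, b, i, j):
        for tail in all_paths(ed, a, b, pi, pj):
            paths.append([(i, j), label] + tail)
    return paths


A, B = "accgatgc", "aaacga"
ed = edit_distance_table(A, B)
for path in all_paths(ed, A, B, 4, 4):
    print(" ".join(str(step) for step in path))
```

The number of paths can grow very quickly with the string lengths, so a full enumeration like this is only practical for small examples.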
The reader must understand that the following statements are all wrong:
*1. $ed(i, j)$ cannot be smaller than $ed(i-1, j)$ and $ed(i, j-1)$.
A counter-example can be seen at location (4,5) of Table 10.2-3 because
$ed(4,5) = 2 < ed(4,4) = 3$. Besides, $ed(4,5) = 2 < ed(3,5) = 3$.
This might be against our intuition because we may easily think that going
along a row from left to right or along a column from top down will see the edit
distances increasing or at least maintaining the same value. Let us consider
Column 6 of Table 10.2-3 (the column of i = 5). In this column, $ed(5,4) = 4$, but $ed(5,5) = 3$. That is,
as we go down from location (5,4), the value decreases. This can be explained as
follows: Note that $A(1,5) = \text{accga}$ and $B(1,4) = \text{aaac}$. The edit distance between
$A(1,5)$ and $B(1,4)$ is 4 as seen below, where d, s and i mark a deletion, a
substitution and an insertion respectively:

    A(1,5)   a   -   c   c   g   a
    B(1,4)   a   a   a   c   -   -
                 d   s       i   i

If we go along the column downward, $B(1,4)$ becomes $B(1,5) = \text{aaacg}$ and
the edit distance is reduced to 3 as shown in the following table.

    A(1,5)   a   -   c   c   g   a
    B(1,5)   a   a   a   c   g   -
                 d   s           i

Let us consider another example. Consider Row 6 of Table 10.2-3 (the row of j = 6).
$B(1,6) = \text{aaacga}$ and $A(1,3) = \text{acc}$ as shown below. The edit distance between them
is 4 as indicated.

    A(1,3)   a   -   c   c   -   -
    B(1,6)   a   a   a   c   g   a
                 d   s       d   d

Suppose $A(1,3) = \text{acc}$ becomes $A(1,4) = \text{accg}$. The situation is now shown
as below and the edit distance is reduced to 3.

    A(1,4)   a   -   c   c   g   -
    B(1,6)   a   a   a   c   g   a
                 d   s           d

*2. Whenever $ed(i-1, j-1) = ed(i, j)$, $a_i = b_j$.
Consider the location (5,5). In this location
$ed(i, j) = ed(5,5) = 3 = ed(i-1, j-1) = ed(4,4)$, yet $a_i = a_5 = \text{a} \neq b_j = b_5 = \text{g}$.
Let us redisplay Equation (10.2-2) as below:
If $a_i \neq b_j$,
$$ed(i, j) = 1 + \min\begin{cases} ed(i-1, j) \\ ed(i, j-1) \\ ed(i-1, j-1) \end{cases}$$
Suppose $a_i \neq b_j$ and $ed(i-1, j)$ is the minimum of
$ed(i-1, j)$, $ed(i, j-1)$ and $ed(i-1, j-1)$. Then according to the above equation, we
have $ed(i, j) = ed(i-1, j) + 1$. But it may be the case that
$ed(i-1, j-1) = ed(i-1, j) + 1$. Consequently, we have $ed(i, j) = ed(i-1, j-1)$.
The same argument can be applied to the case where $ed(i, j-1)$ is the minimum of
$ed(i-1, j)$, $ed(i, j-1)$ and $ed(i-1, j-1)$.
Such examples can be found in Table 10.2-1 at locations (1,4), (1,5), (2,5), (3,1), (4,6), (5,5), (6,1),
(6,2), (6,5), (6,6), (7,1), (7,2), (7,6), (8,1), (8,2) and (8,5), among others.
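Locations of this kind can be listed by a direct scan of the table. The sketch below is only an illustration; the helper name `equal_diagonal_mismatches` is hypothetical and the table is filled as in Algorithm 10.1.

```python
def edit_distance_table(a, b):
    """ed[i][j] = ED(A(1,i), B(1,j)), filled as in Algorithm 10.1."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i
    for j in range(m + 1):
        ed[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(ed[i - 1][j], ed[i][j - 1], ed[i - 1][j - 1])
    return ed


def equal_diagonal_mismatches(a, b):
    """Locations (i, j) with ed(i,j) = ed(i-1,j-1) although a_i != b_j."""
    ed = edit_distance_table(a, b)
    return [(i, j)
            for i in range(1, len(a) + 1)
            for j in range(1, len(b) + 1)
            if ed[i][j] == ed[i - 1][j - 1] and a[i - 1] != b[j - 1]]


print(equal_diagonal_mismatches("accgatgc", "aaacga"))
```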
Let us first define a new term.
Definition 10.2-1 Diagonal of a Matrix
Given a matrix, a diagonal d is a continuous sequence of locations $(i, j)$ where
$d = i - j$.
The following statements are all true:
1. For a row, or a column, the values in the dynamic programming table may
decrease.
2. In a diagonal of the dynamic programming table to compute the edit distance,
the values are non-decreasing.
To demonstrate this, we redisplay Table 10.2-1 as Table 10.2-4. For both
diagonals shown in Table 10.2-4, the values are non-decreasing along them.
Table 10.2-4 The non-decreasing of the diagonals in a dynamic programming table
for computing the edit distance

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0             0    1    2    3    4    5    6    7    8
     1   a         1    0    1    2    3    4    5    6    7
     2   a         2    1    1    2    3    3    4    5    6
     3   a         3    2    2    2    3    3    4    5    6
     4   c         4    3    2    2    3    4    4    5    5
     5   g         5    4    3    3    2    3    4    4    5
     6   a         6    5    4    4    3    2    3    4    5

3. The difference between the value of any location and that of any
of its neighbors can be only -1, 0 or 1.
For example, in Table 10.2-4, $ed(6,6) = ed(5,5)$ and $ed(4,4) = ed(3,3) + 1$.
The above two statements are quite important and will be proved later.
Section 10.3 Rule A1 and the Second Dynamic Programming Algorithm
to Find the Edit Distance
In this section, we shall introduce the second dynamic programming algorithm to find
the edit distance. This algorithm is actually similar to the first one. Yet, it is quite
significant because many approximate string matching algorithms use this approach.
We are given two strings $A = a_1 a_2 \cdots a_i$ and $B = b_1 b_2 \cdots b_j$. The recursive
formula of the second dynamic programming algorithm to compute the edit distance
between two strings is as follows:
Procedure 10.3-1 The Recursive Formula for the Second Edit Distance Finding
Algorithm
$$ed(i, j) = \min\begin{cases} ed(i, j-1) + 1 \\ ed(i-1, j) + 1 \\ ed(i-1, j-1) + eq(i, j) \end{cases} \tag{10.3-1}$$
where $eq(i, j) = 0$ if $a_i = b_j$ and $eq(i, j) = 1$ if $a_i \neq b_j$.
$$ed(i, 0) = i,\ 0 \le i \le n; \qquad ed(0, j) = j,\ 0 \le j \le m \tag{10.3-2}$$
Comparing the above equations with the recursive formulas expressed in
Equations (10.2-1), (10.2-2) and (10.2-3), we can see that when $a_i \neq b_j$, the
recursive formulas of the second algorithm are identical to those of the first one. To
prove that Equation (10.3-1) is equivalent to Equations (10.2-1) and (10.2-2), it
seems that we only have to take care of the case when $a_i = b_j$. That is, we have to
prove that if $a_i = b_j$, $ed(i-1, j-1) + eq(i, j) = ed(i-1, j-1)$ is smaller than or
equal to $ed(i, j-1) + 1$ and $ed(i-1, j) + 1$, and $ed(i, j) = ed(i-1, j-1)$.
Let us first examine Table 10.2-1 which is again displayed as Table 10.3-1 below.
We can easily see one special property of the elements inside the table. Consider
any element in the dynamic programming table. We can observe that the difference
between it and its neighbors is between -1 and 1. For example, consider i = 3 and
j = 3. We note that $ed(3,3) - ed(4,3) = 2 - 3 = -1$ and
$ed(3,3) - ed(3,4) = 2 - 2 = 0$.
Table 10.3-1 Table 10.2-1 redisplayed

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0             0    1    2    3    4    5    6    7    8
     1   a         1    0    1    2    3    4    5    6    7
     2   a         2    1    1    2    3    3    4    5    6
     3   a         3    2    2    2    3    3    4    5    6
     4   c         4    3    2    2    3    4    4    5    5
     5   g         5    4    3    3    2    3    4    4    5
     6   a         6    5    4    4    3    2    3    4    5

Before we prove the formal lemma, let us give the following claims:
Claim 10.3-1: If $|x - y| \le 1$ and $x \le y$, then $|(x + 1) - y| \le 1$.
Claim 10.3-2: If $|x - y| \le 2$ and $x \le y$, then $|(x + 1) - y| \le 1$.
We now prove the following:
Lemma 10.3-1: In each element of the dynamic programming table to compute
the edit distance, for all i and j,
$$|ed(i, j) - ed(i, j-1)| \le 1 \tag{10.3-3}$$
and
$$|ed(i, j) - ed(i-1, j)| \le 1. \tag{10.3-4}$$
Proof:
We prove by induction. For i = 1 and j = 1, $ed(0,0) = 0$, $ed(0,1) = 1$ and
$ed(1,0) = 1$. Thus, $ed(i, j) = ed(1,1) = 0$ or 1. Thus this lemma is true for i = 1
and j = 1.
Note that this lemma implies that
$$|ed(i, j) - ed(i-1, j)| \le 1 \tag{10.3-5}$$
$$|ed(i, j) - ed(i, j-1)| \le 1 \tag{10.3-6}$$
Assume that this lemma is true for $(i-1, j-1)$.
If $a_i = b_j$, according to Equation (10.2-1), $ed(i, j) = ed(i-1, j-1)$. But, as
assumed and according to Equations (10.3-5) and (10.3-6),
$$|ed(i-1, j-1) - ed(i, j-1)| \le 1 \tag{10.3-7}$$
and
$$|ed(i-1, j-1) - ed(i-1, j)| \le 1 \tag{10.3-8}$$
Thus, by substituting $ed(i, j) = ed(i-1, j-1)$ into Equations (10.3-7) and
(10.3-8), we again have that
$$|ed(i, j) - ed(i, j-1)| \le 1$$
and
$$|ed(i, j) - ed(i-1, j)| \le 1.$$
From Equations (10.3-7) and (10.3-8), we also have:
$$|ed(i, j-1) - ed(i-1, j)| \le 2 \tag{10.3-9}$$
If $a_i \neq b_j$, according to Equation (10.2-2),
$$ed(i, j) = \min\begin{cases} ed(i-1, j) + 1 \\ ed(i, j-1) + 1 \\ ed(i-1, j-1) + 1 \end{cases} = 1 + \min\begin{cases} ed(i-1, j) \\ ed(i, j-1) \\ ed(i-1, j-1) \end{cases}$$
Case 1: $ed(i-1, j-1)$ is the minimum of $ed(i-1, j)$, $ed(i, j-1)$ and $ed(i-1, j-1)$.
Then, we have $ed(i, j) = ed(i-1, j-1) + 1$. Besides, since $ed(i-1, j-1)$ is the
minimum of $ed(i-1, j)$, $ed(i, j-1)$ and $ed(i-1, j-1)$, we have
$$ed(i-1, j-1) \le ed(i-1, j) \tag{10.3-10}$$
and
$$ed(i-1, j-1) \le ed(i, j-1) \tag{10.3-11}$$
Using $ed(i, j) = ed(i-1, j-1) + 1$, Equations (10.3-10) and (10.3-11) and Claim
10.3-1, we conclude:
$$|ed(i, j) - ed(i, j-1)| \le 1$$
and
$$|ed(i, j) - ed(i-1, j)| \le 1.$$
Case 2: Without losing generality, we assume $ed(i, j-1)$ is the minimum of
$ed(i-1, j)$, $ed(i, j-1)$ and $ed(i-1, j-1)$. Then we have $ed(i, j) = ed(i, j-1) + 1$.
Using $ed(i, j) = ed(i, j-1) + 1$, Equation (10.3-9) and Claim 10.3-2, again, we have:
$$|ed(i, j) - ed(i, j-1)| \le 1$$
and
$$|ed(i, j) - ed(i-1, j)| \le 1.$$
Thus the proof.
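Lemma 10.3-1 can also be checked empirically. The sketch below is not a proof: it fills tables for randomly generated strings over the alphabet {a, c, g, t} (the seed, the alphabet and the helper name `neighbours_differ_by_at_most_one` are assumptions made for this illustration) and tests Equations (10.3-3) and (10.3-4) at every location.

```python
import random


def edit_distance_table(a, b):
    """ed[i][j] = ED(A(1,i), B(1,j)), filled as in Algorithm 10.1."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i
    for j in range(m + 1):
        ed[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(ed[i - 1][j], ed[i][j - 1], ed[i - 1][j - 1])
    return ed


def neighbours_differ_by_at_most_one(ed):
    """True if Equations (10.3-3) and (10.3-4) hold at every location."""
    for i in range(len(ed)):
        for j in range(len(ed[0])):
            if j > 0 and abs(ed[i][j] - ed[i][j - 1]) > 1:
                return False
            if i > 0 and abs(ed[i][j] - ed[i - 1][j]) > 1:
                return False
    return True


random.seed(0)  # fixed seed so the run is repeatable
for _ in range(200):
    a = "".join(random.choice("acgt") for _ in range(random.randint(0, 12)))
    b = "".join(random.choice("acgt") for _ in range(random.randint(0, 12)))
    assert neighbours_differ_by_at_most_one(edit_distance_table(a, b))
print("Lemma 10.3-1 held on all sampled pairs")
```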
Although in the above proof, we have proved the general case for both $a_i = b_j$
and $a_i \neq b_j$, we will only make use of the case where $a_i = b_j$. For $a_i = b_j$, we
have $ed(i, j) = ed(i-1, j-1)$. Therefore, we have the following lemma:
Lemma 10.3-2
If $a_i = b_j$,
$$|ed(i-1, j-1) - ed(i, j-1)| \le 1$$
and
$$|ed(i-1, j-1) - ed(i-1, j)| \le 1.$$
What is the significance of Lemma 10.3-2? Let us assume that we have
$|a - b| \le 1$. There are three cases:
Case 1: $a - b = 1$. In this case, we have $a = b + 1$.
Case 2: $a - b = 0$. In this case, we have $a = b$ and we may say that $a \le b + 1$.
Case 3: $a - b = -1$. In this case, we have $a = b - 1$ and we may say that
$a \le b + 1$.
In conclusion, we have
Claim 10.3-3
If $|a - b| \le 1$, then $a \le b + 1$.
Combining Lemma 10.3-2 and Claim 10.3-3, we have:
Lemma 10.3-3
If $a_i = b_j$, $ed(i-1, j-1) \le ed(i, j-1) + 1$ and $ed(i-1, j-1) \le ed(i-1, j) + 1$.
Consider the recursive formula in Equation (10.3-1). According to this formula,
when $a_i = b_j$, $ed(i-1, j-1) + eq(i, j) = ed(i-1, j-1)$. But from Lemma 10.3-2
and Claim 10.3-3,
we can see that when $a_i = b_j$, $ed(i-1, j-1) \le ed(i, j-1) + 1$ and
$ed(i-1, j-1) \le ed(i-1, j) + 1$.
Therefore, when $a_i = b_j$,
$$ed(i, j) = \min\begin{cases} ed(i, j-1) + 1 \\ ed(i-1, j) + 1 \\ ed(i-1, j-1) + eq(i, j) \end{cases} = \min\begin{cases} ed(i, j-1) + 1 \\ ed(i-1, j) + 1 \\ ed(i-1, j-1) \end{cases} = ed(i-1, j-1)$$
Thus, Equations (10.3-1) and (10.3-2) are correct. Besides, by proving the
correctness of Procedure 10.3-1, we have also proved the correctness of the tracing
procedure introduced in Section 10.2, namely Procedure 10.2-2.
The second algorithm to compute the edit distance between two strings is now
given below.
Algorithm 10.2 The Second Algorithm to Compute the Edit Distance between
Two Strings
Input: Strings A(1, n) and B(1, m)
Output: The edit distance between strings A and B.
For i = 0 to n, $ed(i, 0) = i$
For j = 0 to m, $ed(0, j) = j$
For i = 1 to n,
    For j = 1 to m,
        $ed(i, j) = \min\{ed(i, j-1) + 1,\ ed(i-1, j) + 1,\ ed(i-1, j-1) + eq(i, j)\}$
        where $eq(i, j) = 0$ if $a_i = b_j$ and $eq(i, j) = 1$ if $a_i \neq b_j$
Report $ed(n, m)$ as the edit distance between A and B.
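A minimal Python sketch of Algorithm 10.2 follows. It is one possible rendering rather than a reference implementation; the function name `edit_distance_second` is chosen here for illustration. The only difference from the first algorithm is that the three-way minimum of Equation (10.3-1) is taken in every cell, with `eq` deciding the cost of the diagonal step.

```python
def edit_distance_second(a, b):
    """Edit distance by the recursive formula of Equation (10.3-1)."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i                      # Equation (10.3-2)
    for j in range(m + 1):
        ed[0][j] = j                      # Equation (10.3-2)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            eq = 0 if a[i - 1] == b[j - 1] else 1
            ed[i][j] = min(ed[i][j - 1] + 1,        # deletion of b_j
                           ed[i - 1][j] + 1,        # insertion of a_i
                           ed[i - 1][j - 1] + eq)   # matching or substitution
    return ed[n][m]


print(edit_distance_second("accgatgc", "aaacga"))
```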
Let us now explain the beauty of Equations (10.3-1) and (10.3-2). Consider
i = 1 and j = 2 in Table 10.3-1. In this case, since $a_i = a_1 = b_j = b_2 = \text{a}$, we have
$$ed(1,2) = \min\begin{cases} ed(1,1) + 1 = 0 + 1 = 1 \\ ed(0,2) + 1 = 2 + 1 = 3 \\ ed(0,1) + 0 = 1 + 0 = 1 \end{cases} = 1$$
This time, we note that there are two locations causing $ed(1,2) = 1$. The first
location is $ed(1,1)$ which corresponds to a deletion and the second location is
$ed(0,1)$ which corresponds to a matching. If we use Equation (10.2-1), only
$ed(0,1)$ is used. Note that Equation (10.2-1) gives the correct result, but fails to
remind us that there is another way to achieve the same edit distance. We now can
understand the tracing back discussed in the above section in the case where $a_i = b_j$.
A similar situation occurs at i = 1 and j = 6.
In the following, we shall prove another important property of the dynamic
programming table to compute the edit distance, namely the non-decreasing of the
values of any diagonal in the table. We shall call this Rule A1.
Rule A1: On the diagonals of a dynamic programming table to compute the edit
distance, the values are non-decreasing.
For instance, consider Table 10.3-2 which is a redisplay of Table 10.2-4. There
are two diagonals as shown. For one of them, d = -2, and for the other one, d = 0.
Table 10.3-2 Table 10.2-4 redisplayed

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0             0    1    2    3    4    5    6    7    8
     1   a         1    0    1    2    3    4    5    6    7
     2   a         2    1    1    2    3    3    4    5    6
     3   a         3    2    2    2    3    3    4    5    6
     4   c         4    3    2    2    3    4    4    5    5
     5   g         5    4    3    3    2    3    4    4    5
     6   a         6    5    4    4    3    2    3    4    5

As can be seen, for both diagonals, the values in the locations along the
diagonals are non-decreasing. This is summarized in the following lemma.
Lemma 10.3-3 Along any diagonal of the dynamic programming table to
compute the edit distance, the values are non-decreasing.
Proof:
Consider any location $(i, j)$. If $a_i = b_j$, $ed(i, j) = ed(i-1, j-1)$. Therefore this
lemma is true for this case.
If $a_i \neq b_j$,
$$ed(i, j) = \min\begin{cases} ed(i-1, j) + 1 \\ ed(i, j-1) + 1 \\ ed(i-1, j-1) + 1 \end{cases} \tag{10.3-12}$$
Case 1: From Equation (10.3-12), $ed(i, j) = ed(i-1, j-1) + 1$. Then
$$ed(i, j) \ge ed(i-1, j-1)$$
and the lemma holds.
Case 2: From Equation (10.3-12), without losing generality,
$ed(i, j) = ed(i-1, j) + 1$. By Equation (10.3-6), we have:
$$|ed(i-1, j-1) - ed(i-1, j)| \le 1$$
By Claim 10.3-3,
$$ed(i-1, j-1) \le ed(i-1, j) + 1$$
But,
$$ed(i, j) = ed(i-1, j) + 1$$
Therefore, $ed(i-1, j-1) \le ed(i, j)$,
or, equivalently,
$$ed(i, j) \ge ed(i-1, j-1).$$
Thus the proof.
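Rule A1 can be observed directly on the table of the running example. The sketch below is illustrative; the helper names `diagonal` and `diagonals_non_decreasing` are hypothetical, and the diagonal index is assumed to be $d = i - j$.

```python
def edit_distance_table(a, b):
    """ed[i][j] = ED(A(1,i), B(1,j)), filled by Equation (10.3-1)."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i
    for j in range(m + 1):
        ed[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            eq = 0 if a[i - 1] == b[j - 1] else 1
            ed[i][j] = min(ed[i][j - 1] + 1, ed[i - 1][j] + 1, ed[i - 1][j - 1] + eq)
    return ed


def diagonal(ed, d):
    """Values on the diagonal d = i - j (assumed convention), in increasing i."""
    return [ed[i][i - d] for i in range(len(ed)) if 0 <= i - d < len(ed[0])]


def diagonals_non_decreasing(ed):
    """Rule A1: ed(i,j) >= ed(i-1,j-1) at every location with i, j >= 1."""
    return all(ed[i][j] >= ed[i - 1][j - 1]
               for i in range(1, len(ed))
               for j in range(1, len(ed[0])))


ed = edit_distance_table("accgatgc", "aaacga")
print(diagonal(ed, 0))
print(diagonals_non_decreasing(ed))
```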
Rule A1 will be used in Chapter 12 to design elegant algorithms to solve some
approximate string matching problems.
Section 10.4 The MP80 Algorithm to Find the Edit Distance
In this section, we will introduce the MP80 Algorithm to find the edit distance.
Let us use Equation (10.3-1), redisplayed as follows:
$$ed(i, j) = \min\begin{cases} ed(i, j-1) + 1 \\ ed(i-1, j) + 1 \\ ed(i-1, j-1) + eq(i, j) \end{cases} \tag{10.4-1}$$
where $eq(i, j) = 0$ if $a_i = b_j$ and $eq(i, j) = 1$ if $a_i \neq b_j$.
From the above equation, we can see that $ed(i, j)$ is determined by four
parameters. We now point out one idea: if, in a pre-processing step, we compute the
resulting $ed(i, j)$ for all possible values of $ed(i, j-1)$, $ed(i-1, j)$, $ed(i-1, j-1)$ and
$eq(i, j)$, then the dynamic programming process can be made very efficient.
Let us assume that we are computing the case shown in Table 10.2-1 which is
redisplayed as below:
Table 10.4-1 A dynamic programming table

     j   B    i    0    1    2    3    4    5    6    7    8
              A         a    c    c    g    a    t    g    c
     0             0    1    2    3    4    5    6    7    8
     1   a         1    0    1    2    3    4    5    6    7
     2   a         2    1    1    2    3    3    4    5    6
     3   a         3    2    2    2    3    3    4    5    6
     4   c         4    3    2    2    3    4    4    5    5
     5   g         5    4    3    3    2    3    4    4    5
     6   a         6    5    4    4    3    2    3    4    5

Suppose we have already pre-processed the case so that we have the following
pre-processed data:
If $ed(i-1, j-1) = 3$, $ed(i, j-1) = 2$, $ed(i-1, j) = 4$ and $eq(i, j) = 1$, then
$ed(i, j) = \min(3 + 1, 2 + 1, 4 + 1) = \min(4, 3, 5) = 3$.
Then, suppose we compute the case where i = 4 and j = 6. In this case,
$a_i = a_4 = \text{g} \neq b_j = b_6 = \text{a}$. Thus $eq(i, j) = 1$ and we can now use the above result
and immediately obtain $ed(4,6) = 3$.
Let us consider another case. Suppose we have already computed the following:
If $ed(i-1, j-1) = 2$, $ed(i, j-1) = 3$, $ed(i-1, j) = 3$ and $eq(i, j) = 0$, then
$ed(i, j) = \min(2 + 0, 3 + 1, 3 + 1) = \min(2, 4, 4) = 2$.
Then, suppose we compute the case where i = 5 and j = 6. In this case,
$a_i = a_5 = \text{a} = b_j = b_6$. Thus $eq(i, j) = 0$ and we can now use the above result
and immediately obtain $ed(5,6) = 2$.
But the above approach has a big problem. There are an infinite number of
possible values of $ed(i-1, j-1)$. They may assume 1, 2, ..., and so on. In the
following, we will introduce another method which avoids the problem.
Let us take a look at Table 10.4-1. Although this is a large table, we actually are
only interested in the values of the last row. That is, we are only interested in
$ed(i, 6)$ for $1 \le i \le 8$. The important thing is that we know the value of
$ed(0,6) = 6$ which is an initial value.
Suppose we have found the value of $ed(1,6) - ed(0,6) = -1$; we can obtain
$ed(1,6) = ed(0,6) + (-1) = 6 - 1 = 5$ immediately. If we know $ed(i, 6) - ed(i-1, 6)$
for $1 \le i \le 8$, we would obtain the last row entirely.
The question is: How can we know $ed(i, 6) - ed(i-1, 6)$?
We define two new terms:
Definition 10.4-1 V(i, j) and H(i, j) for the MP80 Algorithm to Find the Edit
Distances
$$V(i, j) = ed(i, j) - ed(i, j-1)$$
$$H(i, j) = ed(i, j) - ed(i-1, j) \tag{10.4-2}$$
Fig. 10.4-1 illustrates the relationship between these terms.

                              ed(i,j-1)
                                  |
                               V(i,j)
                                  |
    ed(i-1,j) ---- H(i,j) ---- ed(i,j)

Fig. 10.4-1 The relationship between V(i, j) and H(i, j)
Substituting Equation (10.4-1) into Equation (10.4-2), we have:
$$\begin{aligned} V(i, j) &= ed(i, j) - ed(i, j-1) \\ &= \min\begin{cases} ed(i, j-1) - ed(i, j-1) + 1 \\ ed(i-1, j) - ed(i, j-1) + 1 \\ ed(i-1, j-1) - ed(i, j-1) + eq(i, j) \end{cases} \\ &= \min\begin{cases} 1 \\ (ed(i-1, j) - ed(i-1, j-1)) - (ed(i, j-1) - ed(i-1, j-1)) + 1 \\ eq(i, j) - (ed(i, j-1) - ed(i-1, j-1)) \end{cases} \\ &= \min\begin{cases} 1 \\ V(i-1, j) - H(i, j-1) + 1 \\ eq(i, j) - H(i, j-1) \end{cases} \end{aligned} \tag{10.4-3}$$
where $eq(i, j) = 0$ if $a_i = b_j$ and $eq(i, j) = 1$ if $a_i \neq b_j$.
Similarly, we can derive:
$$H(i, j) = \min\begin{cases} 1 \\ H(i, j-1) - V(i-1, j) + 1 \\ eq(i, j) - V(i-1, j) \end{cases} \tag{10.4-4}$$
where $eq(i, j) = 0$ if $a_i = b_j$ and $eq(i, j) = 1$ if $a_i \neq b_j$.
In summary, we have the following formulas:
Procedure 10.4-1 Formulas to Compute V(i, j) and H(i, j) for the MP80
Algorithm.
$eq(i, j) = 0$ if $a_i = b_j$
$eq(i, j) = 1$ if $a_i \neq b_j$
$$V(i, j) = \min\begin{cases} 1 \\ V(i-1, j) - H(i, j-1) + 1 \\ eq(i, j) - H(i, j-1) \end{cases}$$
$$H(i, j) = \min\begin{cases} 1 \\ H(i, j-1) - V(i-1, j) + 1 \\ eq(i, j) - V(i-1, j) \end{cases}$$
Let us consider some examples.
Example 10.4-1
For the case in Table 10.4-1, let i = 1 and j = 1. Then, we have:
$$H(i, j-1) = ed(i, j-1) - ed(i-1, j-1) = ed(1,0) - ed(0,0) = 1 - 0 = 1$$
and
$$V(i-1, j) = ed(i-1, j) - ed(i-1, j-1) = ed(0,1) - ed(0,0) = 1 - 0 = 1$$
Note that in the above computation, $ed(1,0)$, $ed(0,1)$ and $ed(0,0)$ are all
initial conditions. Since $a_i = a_1 = b_j = b_1 = \text{a}$, $eq(i, j) = eq(1,1) = 0$. Substituting
the above results into Equation (10.4-3), we obtain:
$$V(i, j) = V(1,1) = \min\begin{cases} 1 \\ 1 - 1 + 1 \\ 0 - 1 \end{cases} = \min\begin{cases} 1 \\ 1 \\ -1 \end{cases} = -1$$
Once we have obtained $V(1,1)$, we can obtain $ed(1,1)$ because
$ed(i, j) = ed(i, j-1) + V(i, j)$. Thus,
$$ed(1,1) = ed(1,0) + V(1,1) = 1 + (-1) = 1 - 1 = 0$$
This is correct as can be seen in Table 10.4-1.
Similarly, we can compute $H(i, j) = H(1,1)$ as follows:
$$H(i, j) = H(1,1) = \min\begin{cases} 1 \\ 1 - 1 + 1 \\ 0 - 1 \end{cases} = \min\begin{cases} 1 \\ 1 \\ -1 \end{cases} = -1$$
Since $ed(i, j) = ed(i-1, j) + H(i, j)$, we can also obtain
$$ed(1,1) = ed(0,1) + H(1,1) = 1 + (-1) = 1 - 1 = 0$$
Example 10.4-2
Let us consider the case i = 1 and j = 6 from the case in Table 10.4-1.
First, note that because $a_i = a_1 = b_j = b_6 = \text{a}$, $eq(i, j) = 0$.
Assume that we have already obtained $ed(1,5)$, $ed(0,5)$ and $ed(0,6)$; we may compute
$$H(i, j-1) = ed(i, j-1) - ed(i-1, j-1) = ed(1,5) - ed(0,5) = 4 - 5 = -1$$
and
$$V(i-1, j) = ed(i-1, j) - ed(i-1, j-1) = ed(0,6) - ed(0,5) = 6 - 5 = 1.$$
Then
$$H(i, j) = H(1,6) = \min\begin{cases} 1 \\ -1 - 1 + 1 \\ 0 - 1 \end{cases} = \min\begin{cases} 1 \\ -1 \\ -1 \end{cases} = -1$$
Since we know that $ed(0,6) = 6$, we obtain that
$$ed(1,6) = ed(0,6) + H(1,6) = 6 + (-1) = 6 - 1 = 5$$
By comparing this with the result in Table 10.4-1, we can see that this result is
correct.
By examining Equations (10.4-3) and (10.4-4), we can consider $V(i-1,j)$, $H(i,j-1)$ and $eq(i,j)$ as inputs and $V(i,j)$ and $H(i,j)$ as outputs, as illustrated in Fig. 10.4-2.
Fig. 10.4-2 An illustration of Equations (10.4-3) and (10.4-4). The four cells are $ed(i-1,j-1)$, $ed(i,j-1)$, $ed(i-1,j)$ and $ed(i,j)$. Inputs: $H(i,j-1)$, $V(i-1,j)$ and $eq(i,j)$. Outputs: $V(i,j)$ and $H(i,j)$.
To give the reader a better feel for Equations (10.4-3) and (10.4-4), we may also illustrate them as in Fig. 10.4-3.
Fig. 10.4-3 The inputs and outputs of Equations (10.4-3) and (10.4-4). Both equations take $V(i-1,j)$, $H(i,j-1)$ and $eq(i,j)$ as inputs; Equation (10.4-3) outputs $V(i,j)$ and Equation (10.4-4) outputs $H(i,j)$.
Let us give more examples.
Example 10.4-3
Let $V(i-1,j) = 0$, $H(i,j-1) = -1$ and $eq(i,j) = 1$. Then, from Equation (10.4-3), we obtain

$$V(i,j) = \min\begin{cases} 1 \\ 0 - (-1) + 1 \\ 1 - (-1) \end{cases} = \min\begin{cases} 1 \\ 2 \\ 2 \end{cases} = 1$$

From Equation (10.4-4), we obtain

$$H(i,j) = \min\begin{cases} 1 \\ -1 - 0 + 1 \\ 1 - 0 \end{cases} = \min\begin{cases} 1 \\ 0 \\ 1 \end{cases} = 0$$

A possible case for this is illustrated in Fig. 10.4-4.

Fig. 10.4-4 A possible case for Example 10.4-3: the characters compared are a and c, with $ed(i-1,j-1) = 3$, $ed(i,j-1) = 2$, $ed(i-1,j) = 3$ and $ed(i,j) = 3$.
Example 10.4-4
Let $V(i-1,j) = 1$, $H(i,j-1) = -1$ and $eq(i,j) = 0$. Then, from Equation (10.4-3), we obtain

$$V(i,j) = \min\begin{cases} 1 \\ 1 - (-1) + 1 \\ 0 - (-1) \end{cases} = \min\begin{cases} 1 \\ 3 \\ 1 \end{cases} = 1$$

From Equation (10.4-4), we obtain

$$H(i,j) = \min\begin{cases} 1 \\ -1 - 1 + 1 \\ 0 - 1 \end{cases} = \min\begin{cases} 1 \\ -1 \\ -1 \end{cases} = -1$$

A possible case for this is illustrated in Fig. 10.4-5.

Fig. 10.4-5 A possible case for Example 10.4-4: both characters compared are a, with $ed(i-1,j-1) = 3$, $ed(i,j-1) = 2$, $ed(i-1,j) = 4$ and $ed(i,j) = 3$.
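The two examples above can be cross-checked numerically. The sketch below takes three corner values of a 2 by 2 block of the edit distance table, derives $H(i,j-1)$ and $V(i-1,j)$ from them, and recovers $ed(i,j)$ in three ways: through $V(i,j)$, through $H(i,j)$, and through the usual edit distance recurrence. The function name `fourth_corner` and its argument names are hypothetical, and the corner values are the ones consistent with the inputs of Examples 10.4-3 and 10.4-4.

```python
def fourth_corner(ed_diag, ed_top, ed_left, eq):
    """Recover ed(i,j) from its three neighbours.

    ed_diag = ed(i-1,j-1), ed_top = ed(i,j-1), ed_left = ed(i-1,j).
    All names here are illustrative.
    """
    h_prev = ed_top - ed_diag                      # H(i, j-1)
    v_prev = ed_left - ed_diag                     # V(i-1, j)
    v = min(1, v_prev - h_prev + 1, eq - h_prev)   # Equation (10.4-3)
    h = min(1, h_prev - v_prev + 1, eq - v_prev)   # Equation (10.4-4)
    via_v = ed_top + v                             # ed(i,j-1) + V(i,j)
    via_h = ed_left + h                            # ed(i-1,j) + H(i,j)
    direct = min(ed_top + 1, ed_left + 1, ed_diag + eq)
    return via_v, via_h, direct


print(fourth_corner(3, 2, 3, 1))   # Example 10.4-3: (3, 3, 3)
print(fourth_corner(3, 2, 4, 0))   # Example 10.4-4: (3, 3, 3)
```

All three routes agree in both cases, as expected, since Equations (10.4-3) and (10.4-4) are only a rewriting of the recurrence in terms of differences.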
To implement this idea, we compute Equations (10.4-3) and (10.4-4) for all possible cases of the inputs. Let us assume that the size of the vocabulary is 4. There are $4^2$ possible $a_i$ and $b_j$ pairs. There are three possible values for each of $V(i,j)$ and $H(i,j)$. Therefore there are $3^2 = 9$ possible cases of pairs of $V(i,j)$ and $H(i,j)$. In total, there are $4^2 \times 3^2 = 16 \times 9 = 144$ possible cases.
Once we have performed the pre-processing of these 144 cases, they can be stored in an array to be retrieved later. Then, instead of using Algorithm 10-1, we may use the following algorithm.
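Before turning to the algorithm itself, the pre-processing step can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary is taken to be the four letters a, c, g, t, the three possible values of $V$ and $H$ are taken to be $-1$, $0$ and $1$, and a Python dictionary is used in place of the array mentioned in the text. The names `ALPHABET`, `DELTAS` and `build_table` are hypothetical.

```python
from itertools import product

ALPHABET = "acgt"      # assumed four-letter vocabulary
DELTAS = (-1, 0, 1)    # assumed possible values of V and H


def build_table():
    """Tabulate (V(i,j), H(i,j)) for every combination of inputs."""
    table = {}
    for a, b, v_prev, h_prev in product(ALPHABET, ALPHABET, DELTAS, DELTAS):
        eq = 0 if a == b else 1
        v = min(1, v_prev - h_prev + 1, eq - h_prev)   # Equation (10.4-3)
        h = min(1, h_prev - v_prev + 1, eq - v_prev)   # Equation (10.4-4)
        table[(a, b, v_prev, h_prev)] = (v, h)
    return table


TABLE = build_table()
print(len(TABLE))                  # 144
print(TABLE[("a", "a", 1, 1)])     # (-1, -1)
```

After the table is built, each cell of the computation reduces to one lookup keyed by the two characters and the two incoming differences.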
Algorithm 10.3 An Edit Distance Finding Algorithm Based upon the MP80 Algorithm
Input: Strings $A(1,n)$ and $B(1,m)$
Output: The edit distance between strings $A$ and $B$.
$ed(0,m) = m$
For $i = 1$ to $n$, $H(i,0) = 1$
For $j = 1$ to $m$, $V(0,j) = 1$
For $i = 1$ to $n$
  For $j = 1$ to $m$

$$V(i,j) = \min\begin{cases} 1 \\ V(i-1,j) - H(i,j-1) + 1 \\ eq(i,j) - H(i,j-1) \end{cases}$$

$$H(i,j) = \min\begin{cases} 1 \\ H(i,j-1) - V(i-1,j) + 1 \\ eq(i,j) - V(i-1,j) \end{cases}$$

  where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$
For $i = 1$ to $n$
  $ed(i,m) = ed(i-1,m) + H(i,m)$
Report $ed(n,m)$ as the edit distance between strings $A$ and $B$.
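The algorithm above can be sketched in Python as follows. This is a plain, cell-by-cell rendering under stated assumptions: it evaluates the two formulas directly rather than retrieving pre-processed cases, it keeps only one column of $V$ values at a time, and the function names are illustrative. A conventional dynamic programming routine is included only to cross-check the result.

```python
def edit_distance_mp(a, b):
    """Edit distance computed from the V and H differences (sketch)."""
    n, m = len(a), len(b)
    v_prev = [0] + [1] * m            # V(0, j) = 1 for j = 1..m; index 0 unused
    ed = m                            # ed(0, m) = m
    for i in range(1, n + 1):
        h = 1                         # H(i, 0) = 1
        v_cur = [0] * (m + 1)
        for j in range(1, m + 1):
            eq = 0 if a[i - 1] == b[j - 1] else 1
            # Here h still holds H(i, j-1) and v_prev[j] holds V(i-1, j).
            v = min(1, v_prev[j] - h + 1, eq - h)
            h = min(1, h - v_prev[j] + 1, eq - v_prev[j])
            v_cur[j] = v
        ed += h                       # ed(i, m) = ed(i-1, m) + H(i, m)
        v_prev = v_cur
    return ed


def edit_distance_dp(a, b):
    """Conventional table-based edit distance, used only as a check."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


A, B = "actacgtgaa", "actcgtgaa"      # strings from Section 10.1
print(edit_distance_mp(A, B), edit_distance_dp(A, B))   # 1 1
```

The two routines agree on this pair, which differs by a single insertion.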
Section 10.5
References
[L66]: Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966, Vol. 10, No. 8, pp. 707-710.
[MP80]: Masek, W. J. and Paterson, M. S. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 1980, Vol. 20, No. 1, pp. 18-31.
[S80]: Sellers, P. H. The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, 1980, Vol. 1, No. 4, pp. 359-373.
[WF74]: Wagner, R. A. and Fischer, M. J. The String-to-String Correction Problem. Journal of the ACM, 1974, Vol. 21, No. 1, pp. 168-173.
