PART II Approximate String Matching Algorithms

Chapter 10 The Edit Distance

In Part I of this book, all the algorithms are exact string matching algorithms. Given a text string $T$ and a pattern string $P$, the exact string matching problem is to find whether $P$ appears in $T$ and, if it does, where it appears. For the approximate string matching problem, we are also given a text string $T$ and a pattern string $P$, and we ask whether a substring which is quite similar to $P$ appears in $T$. For example, let us consider the following case:

      1  2  3  4  5  6  7  8  9 10 11 12 13
T =   a  c  g  t  t  t  a  a  c  t  t  g  c

and

P =   t  t  c  a  c

We can see that $T(5,9) = \mathtt{ttaac}$ is quite similar to $P = \mathtt{ttcac}$. Therefore, we may say that there is an approximate solution of this problem. To define the approximate string matching problem precisely, we need a precise definition of similarity. We shall use the edit distance to measure the similarity between two strings.

Section 10.1 The Definition of Edit Distance

Given two strings, we define three operations between them: insertion, deletion and substitution.

Insertion: Let

      1  2  3  4  5  6  7  8  9 10
A =   a  c  t  a  c  g  t  g  a  a

and

      1  2  3  4  5  6  7  8  9
B =   a  c  t  c  g  t  g  a  a

If we insert "a" between $b_3$ and $b_4$, string $B$ becomes identical to string $A$.

Deletion: Suppose we have

      1  2  3  4  5  6  7  8  9 10 11
B =   a  c  t  a  g  c  g  t  g  a  a

By deleting $b_5$, string $B$ becomes identical to string $A$.

Substitution: Suppose that we have

      1  2  3  4  5  6  7  8  9 10
B =   a  c  t  a  c  t  t  g  a  a

If we substitute $b_6 = t$ by $g$, string $B$ becomes identical to string $A$.

Definition 10.1-1 Edit Distance
Given two strings $A$ and $B$, the edit distance between $A$ and $B$, denoted as $ED(A,B)$, is defined as the minimum number of insertions, deletions and substitutions needed to transform string $B$ to string $A$.

Although these operations could be performed on both $A$ and $B$, we stipulate that all operations are performed on string $B$. This rule does not lose generality, because an insertion into $A$ has the same effect as a deletion from $B$, and vice versa.
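The three operations can be replayed on the strings used above. The following is a minimal sketch; the helper names `insert_after`, `delete_at` and `substitute_at` are introduced here for illustration only and use the 1-based positions of the text.

```python
def insert_after(b: str, k: int, ch: str) -> str:
    """Insert ch after the k-th character of b (1-based; k = 0 inserts at the front)."""
    return b[:k] + ch + b[k:]


def delete_at(b: str, k: int) -> str:
    """Delete the k-th character of b (1-based)."""
    return b[:k - 1] + b[k:]


def substitute_at(b: str, k: int, ch: str) -> str:
    """Replace the k-th character of b (1-based) by ch."""
    return b[:k - 1] + ch + b[k:]


A = "actacgtgaa"

# Insertion: put "a" between b3 and b4 of B = actcgtgaa.
after_insertion = insert_after("actcgtgaa", 3, "a")
# Deletion: remove b5 from B = actagcgtgaa.
after_deletion = delete_at("actagcgtgaa", 5)
# Substitution: replace b6 = t by g in B = actacttgaa.
after_substitution = substitute_at("actacttgaa", 6, "g")
```

In each of the three cases a single operation turns $B$ into $A$, so the edit distance in each case is 1.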
Example 10.1-1

Let us assume that

      1  2  3  4  5  6  7  8  9 10
A =   a  c  c  t  a  g  t  t  a  g

and

      1  2  3  4  5  6  7  8  9 10 11
B =   a  c  t  g  a  g  t  a  a  g  t

The following four operations transform string $B$ to string $A$:

1. inserting $c$ after $b_2$.
2. deleting $b_4 = g$.
3. substituting $b_8 = a$ by $t$.
4. deleting $b_{11} = t$.

It can also be proved that the minimum number of insertions, deletions and substitutions needed to transform string $B$ to string $A$ is 4. Thus $ED(A,B) = 4$.

Example 10.1-2

We have

      1  2  3  4  5  6  7  8
A =   a  c  c  g  t  a  t  g

and

      1  2  3  4  5  6  7  8  9
B =   c  c  g  a  c  c  c  g  a

It can be proved that we need at least six operations to transform string $B$ to string $A$. Thus $ED(A,B) = 6$. The following is one set of these six operations:

1. inserting $a$ before $b_1$.
2. inserting $t$ after $b_3$.
3. deleting $b_5 = c$ and $b_6 = c$.
4. substituting $b_7 = c$ by $t$.
5. deleting $b_9 = a$.

Having defined the edit distance, we need an algorithm to find the edit distance between two strings. The following section gives the first method.

Section 10.2 The First Dynamic Programming Algorithm to Find the Edit Distance

The problem of finding the edit distance is an optimization problem and can be solved by the dynamic programming approach. We are given two strings $A = a_1 a_2 \cdots a_i$ and $B = b_1 b_2 \cdots b_j$. Our job is to find $ED(A,B)$. Let us first define a new term, $ed(i,j) = ED(A(1,i), B(1,j))$. Then we have the following statement:

If $a_i = b_j$, $ed(i,j) = ed(i-1,j-1)$.

This statement is correct because appending a pair of identical characters to $A(1,i-1) = a_1 a_2 \cdots a_{i-1}$ and $B(1,j-1) = b_1 b_2 \cdots b_{j-1}$ does not alter the edit distance between $A(1,i-1)$ and $B(1,j-1)$.

If $a_i \neq b_j$, we have to examine the following operations:

1. Inserting a character which is equal to $a_i$ after the character $b_j$ in $B$, as shown in Fig. 10.2-1.

A =   a_1  a_2  ...  a_{i-1}  a_i
B =   b_1  b_2  ...  b_j      [a_i inserted]

Fig. 10.2-1 An insertion operation in the finding of the edit distance.
In this case, we merely have to look at $A(1,i-1) = a_1 a_2 \cdots a_{i-1}$ and $B(1,j) = b_1 b_2 \cdots b_j$. Suppose we have already computed $ed(i-1,j) = ED(A(1,i-1), B(1,j))$. We then simply add 1 to this distance, as this 1 is the cost of the insertion. Thus, in this case, $ed(i,j) = ed(i-1,j) + 1$.

2. Deleting $b_j$, as shown in Fig. 10.2-2.

A =   a_1  a_2  ...  a_{i-1}  a_i
B =   b_1  b_2  ...  b_{j-1}  [b_j deleted]

Fig. 10.2-2 A deletion operation in the finding of the edit distance.

In this case, we only have to compute the edit distance between $A(1,i) = a_1 a_2 \cdots a_i$ and $B(1,j-1) = b_1 b_2 \cdots b_{j-1}$ and add 1, the cost of deleting $b_j$, to it. Thus, in this case, $ed(i,j) = ed(i,j-1) + 1$.

3. Substituting $b_j$ by $a_i$, as shown in Fig. 10.2-3.

A =   a_1  a_2  ...  a_{i-1}  a_i
B =   b_1  b_2  ...  b_{j-1}  [b_j substituted by a_i]

Fig. 10.2-3 A substitution operation in the finding of the edit distance.

In this case, we merely have to compute the edit distance between $A(1,i-1) = a_1 a_2 \cdots a_{i-1}$ and $B(1,j-1) = b_1 b_2 \cdots b_{j-1}$ and add 1, the cost of the substitution, to it. Thus, in this case, $ed(i,j) = ed(i-1,j-1) + 1$.

Although there are three possible operations, we have to select one of them, namely the cheapest. Thus we have the following recursive formula for the first edit distance finding algorithm.

Procedure 10.2-1 The Procedure to Find the Edit Distance for the First Edit Distance Finding Algorithm

If $a_i = b_j$,
$$ed(i,j) = ed(i-1,j-1) \tag{10.2-1}$$

If $a_i \neq b_j$,
$$ed(i,j) = \min\left\{\begin{array}{ll} ed(i-1,j) + 1 & \text{(I)} \\ ed(i,j-1) + 1 & \text{(D)} \\ ed(i-1,j-1) + 1 & \text{(S)} \end{array}\right. = 1 + \min\left\{\begin{array}{l} ed(i-1,j) \\ ed(i,j-1) \\ ed(i-1,j-1) \end{array}\right. \tag{10.2-2}$$

$$ed(i,0) = i,\ 0 \le i \le n; \qquad ed(0,j) = j,\ 0 \le j \le m \tag{10.2-3}$$

The first algorithm to compute the edit distance between two strings is as follows:

Algorithm 10.1 The First Algorithm to Compute the Edit Distance between Two Strings
Input: Strings $A(1,n)$ and $B(1,m)$
Output: The edit distance between strings $A$ and $B$.
For $i = 0$ to $n$, $ed(i,0) = i$
For $j = 0$ to $m$, $ed(0,j) = j$
For $i = 1$ to $n$
    For $j = 1$ to $m$
        If $a_i = b_j$, $ed(i,j) = ed(i-1,j-1)$
        If $a_i \neq b_j$, $ed(i,j) = 1 + \min\{ed(i-1,j),\ ed(i,j-1),\ ed(i-1,j-1)\}$
Report $ed(n,m)$ as the edit distance between $A$ and $B$.

For Equation (10.2-3), $ed(i,0)$ is the edit distance between $A(1,i) = a_1 a_2 \cdots a_i$ and an empty string $B$. We must insert $i$ elements into $B$ in order to transform $B$ to $A(1,i)$. That is why $ed(i,0) = i$. Similarly, $ed(0,j)$ is the edit distance between an empty string $A$ and $B(1,j) = b_1 b_2 \cdots b_j$. Because we need to delete $j$ elements of $B$ in order to transform $B(1,j)$ to the empty string $A$, $ed(0,j) = j$. We must note that $ED(A,B) = ed(n,m)$ if the lengths of $A$ and $B$ are $n$ and $m$ respectively.

Let us examine Equation (10.2-1) again. We did say that $a_i = b_j$ will cause $ed(i,j) = ed(i-1,j-1)$. But we did not say that $ed(i,j) = ed(i-1,j-1)$ can only be caused by $a_i = b_j$. This is explained by the following example.

Example 10.2-1

Let $A = \mathtt{accgatgc}$ and $B = \mathtt{aaacga}$. $ED(A,B)$ is found as shown in Table 10.2-1.

Table 10.2-1 The calculation of the edit distance between $A = \mathtt{accgatgc}$ and $B = \mathtt{aaacga}$.

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

In the above table, we can find $ed(i,j)$ for $1 \le i \le 8$ and $1 \le j \le 6$. For instance, $ed(3,4) = 2$ and $ed(6,6) = 3$. Let us see whether $ed(3,4) = 2$ is correct. $A(1,3) = \mathtt{acc}$ and $B(1,4) = \mathtt{aaac}$. By deleting $b_2$ and substituting $b_3 = a$ by $c$, we can transform $B(1,4) = \mathtt{aaac}$ to $A(1,3) = \mathtt{acc}$. Thus $ed(3,4) = 2$ is correct.

Let us see how the computation is done by examining some cases. Consider $ed(5,3)$. In this case, $i = 5$ and $j = 3$. Note that $a_5 = a = b_3$. According to Equation (10.2-1), $ed(5,3) = ed(4,2) = 3$. Consider $ed(5,5)$. In this case, $i = 5$ and $j = 5$. Since $a_5 = a \neq b_5 = g$, we apply Equation (10.2-2):
$$ed(5,5) = \min\left\{\begin{array}{l} ed(i-1,j) + 1 = ed(4,5) + 1 = 2 + 1 = 3 \\ ed(i,j-1) + 1 = ed(5,4) + 1 = 4 + 1 = 5 \\ ed(i-1,j-1) + 1 = ed(4,4) + 1 = 3 + 1 = 4 \end{array}\right. = 3$$

The case of $ed(5,5)$ confirms our earlier statement: $ed(5,5) = ed(4,4)$, but we cannot say that this is caused by $a_5 = b_5$. In fact, $a_5 \neq b_5$.

The Tracing Back of the Dynamic Programming Table

In the above, we showed how to find the edit distance. But it is usually not sufficient to find the edit distance alone. We also want to know how many insertions, deletions and substitutions are involved in it. To achieve this, we need to trace back in the dynamic programming table. To trace back from $ed(i,j)$, we use the following procedure:

Procedure 10.2-2 Procedure for Tracing Back of the Dynamic Programming Table for the Edit Distance

Case 1: If $a_i \neq b_j$, let
$$x = \min\left\{\begin{array}{l} ed(i-1,j) + 1 \\ ed(i,j-1) + 1 \\ ed(i-1,j-1) + 1 \end{array}\right.$$
and point $ed(i,j)$ to the locations of $x$. Note that there may be more than one such location.

Case 2: If $a_i = b_j$, let
$$x = \min\left\{\begin{array}{l} ed(i-1,j) + 1 \\ ed(i,j-1) + 1 \\ ed(i-1,j-1) \end{array}\right.$$
and point $ed(i,j)$ to the locations of $x$. Note that there may be more than one such location.

The correctness of the above procedure will be proved later, when we present Procedure 10.3-1. It is obvious that this procedure is correct for the case of $a_i \neq b_j$, by consulting Equation (10.2-2). But it is not obvious at all that this procedure is also correct for the case of $a_i = b_j$. Essentially, by consulting Equation (10.2-1), we need to prove the following:

If $a_i = b_j$,
$$ed(i-1,j-1) = \min\left\{\begin{array}{l} ed(i-1,j) + 1 \\ ed(i,j-1) + 1 \\ ed(i-1,j-1) \end{array}\right.$$

Procedure 10.2-2 also brings out another critical point: if $a_i = b_j$, $ed(i-1,j-1)$ may also be equal to $ed(i-1,j) + 1$ or $ed(i,j-1) + 1$, in which case the tracing back points to more than one location.

Let us consider Table 10.2-1. Let $i = 1$ and $j = 2$. In this case, $a_1 = b_2$. Therefore, $ed(i,j) = ed(1,2) = ed(i-1,j-1) = ed(0,1) = 1$.
But $ed(1,2) = ed(0,1)$ may be caused by another situation. Note that $ed(i,j-1) = ed(1,1) = 0$. Thus $ed(i,j) = ed(1,2) = ed(i,j-1) + 1$. This means that $ed(i,j) = 1$ can also be caused by the deletion of $b_2$. Let us take a look at the situation. Note that $A(1,1) = \mathtt{a}$ and $B(1,2) = \mathtt{aa}$. Since $a_1 = b_2$, we may say that $ed(1,2) = ed(0,1) = 1$. But we may also delete $b_2 = a$ and transform $B(1,2)$ into $B(1,1) = \mathtt{a}$, which is exactly equal to $A(1,1) = \mathtt{a}$. Thus $ed(1,2) = ed(1,1) + 1 = 0 + 1 = 1$. What we are saying is that we can transform $B(1,2) = \mathtt{aa}$ into $A(1,1) = \mathtt{a}$ through a deletion operation although $a_1 = b_2$.

One may find such cases in Table 10.2-1. They occur at locations (1,2), (1,3), (1,6) and (5,1). In all such cases, $a_i = b_j$, yet the tracing back points not only to $(i-1,j-1)$. It also points to $(i-1,j)$ or $(i,j-1)$.

In general, there are three possibilities for the tracing back by using Procedure 10.2-2, as shown in the following figures. Each figure shows the four neighboring locations $(i-1,j-1)$, $(i,j-1)$, $(i-1,j)$ and $(i,j)$.

Case 1: $(i,j) \rightarrow (i-1,j)$ (insertion)

Fig. 10.2-4 The tracing back when $ed(i-1,j) + 1 = ed(i,j)$ (inserting $a_i$ after $b_j$).

For instance, in Table 10.2-1, (4,4) points back to (3,4), as an insertion is done.

Case 2: $(i,j) \rightarrow (i,j-1)$ (deletion)

Fig. 10.2-5 The tracing back when $ed(i,j-1) + 1 = ed(i,j)$ (deleting $b_j$).

For instance, in Table 10.2-1, (4,6) points back to (4,5), as a deletion is done.

Case 3: $(i,j) \rightarrow (i-1,j-1)$ (substitution or $b_j = a_i$)

Fig. 10.2-6 The tracing back when $ed(i-1,j-1) + 1 = ed(i,j)$ (substituting $b_j$ by $a_i$) or $b_j = a_i$.

We trace back from $(i,j)$ to $(i-1,j-1)$ if any of the following conditions is satisfied.

(1) $a_i \neq b_j$ and $ed(i-1,j-1) + 1 = ed(i,j)$. This occurs when a substitution is done. For instance, in Table 10.2-2, (4,2) points back to (3,1) because $b_2$ may be substituted by $a_4$.

(2) $b_j = a_i$. In this case, $ed(i-1,j-1) = ed(i,j)$. For instance, (5,6) points back to (4,5) because $a_5 = b_6$.
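The table computation of Procedure 10.2-1 and the tracing back of Procedure 10.2-2 can be sketched as follows. This is an illustrative sketch, not the book's code: the function names are hypothetical, the table is stored as `ed[i][j]` with `i` indexing $A$ and `j` indexing $B$ (the printed tables show $j$ as the row index), and on the borders only the neighbors that exist are considered.

```python
def ed_table(a: str, b: str) -> list[list[int]]:
    """Fill ed[i][j] = ED(A(1,i), B(1,j)) using Equations (10.2-1) to (10.2-3)."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i                      # i insertions into an empty B
    for j in range(m + 1):
        ed[0][j] = j                      # j deletions from B(1,j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]
            else:
                ed[i][j] = 1 + min(ed[i - 1][j], ed[i][j - 1], ed[i - 1][j - 1])
    return ed


def predecessors(ed, a, b, i, j):
    """Locations that (i, j) points back to under Procedure 10.2-2.

    Labels: I = insertion, D = deletion, S = substitution, M = matching.
    """
    candidates = []
    if i > 0:
        candidates.append((ed[i - 1][j] + 1, "I", (i - 1, j)))
    if j > 0:
        candidates.append((ed[i][j - 1] + 1, "D", (i, j - 1)))
    if i > 0 and j > 0:
        same = a[i - 1] == b[j - 1]
        cost = ed[i - 1][j - 1] + (0 if same else 1)
        candidates.append((cost, "M" if same else "S", (i - 1, j - 1)))
    if not candidates:
        return []                         # location (0, 0) has no predecessor
    x = min(value for value, _, _ in candidates)
    return [(op, loc) for value, op, loc in candidates if value == x]


def count_paths(ed, a, b, i, j) -> int:
    """Number of distinct tracing-back paths from (i, j) to (0, 0)."""
    if (i, j) == (0, 0):
        return 1
    return sum(count_paths(ed, a, b, *loc) for _, loc in predecessors(ed, a, b, i, j))


A, B = "accgatgc", "aaacga"
ed = ed_table(A, B)
```

With these strings, `predecessors(ed, A, B, 1, 2)` returns both the deletion location (1,1) and the matching location (0,1), which is the situation discussed above for $a_1 = b_2$. The recursive `count_paths` is only meant for small tables such as this one.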
One of the tracing backs from location (8,6) by using Procedure 10.2-2 is displayed in Table 10.2-2.

Table 10.2-2 One tracing back of location (8,6) in Table 10.2-1

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

The traced path visits (8,6), (7,6), (6,6), (5,6), (4,5), (3,4), (2,3), (2,2), (1,1) and (0,0). In this tracing back, there are one substitution, one deletion and three insertions involved in determining $ed(8,6)$, as follows:

1. substituting $b_2 = a$ by $a_2 = c$,
2. deleting $b_3 = a$,
3. inserting $a_6 = t$, $a_7 = g$ and $a_8 = c$ after $b_6$.

Note that for $ed(8,6)$, there are many other paths of tracing back.

Let us consider location (4,4). In this case, there are several paths of tracing back, as shown in Table 10.2-3.

Table 10.2-3 The tracing back of location (4,4) in Table 10.2-1

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

As can be seen, there are four paths for tracing back from location (4,4), displayed below. We use S, I, D and M to denote substitution, insertion, deletion and matching respectively.

(1) (4,4) -S-> (3,3) -S-> (2,2) -S-> (1,1) -M-> (0,0)
(2) (4,4) -I-> (3,4) -M-> (2,3) -D-> (2,2) -S-> (1,1) -M-> (0,0)
(3) (4,4) -I-> (3,4) -M-> (2,3) -S-> (1,2) -D-> (1,1) -M-> (0,0)
(4) (4,4) -I-> (3,4) -M-> (2,3) -S-> (1,2) -M-> (0,1) -D-> (0,0)

The reader must understand that the following statements are all wrong:

*1. $ed(i,j)$ cannot be smaller than $ed(i-1,j)$ and $ed(i,j-1)$.

A counter-example can be seen at location (4,5) of Table 10.2-3, because $ed(4,5) = 2 < ed(4,4) = 3$. Besides, $ed(4,5) = 2 < ed(3,5) = 3$. This might be against our intuition, because we may easily think that going along a row from left to right, or along a column from the top down, we will see the edit distances increasing or at least staying the same.
Let us consider the column $i = 5$ of Table 10.2-3. In this column, $ed(5,4) = 4$, but $ed(5,5) = 3$. That is, as we go down from location (5,4), the value decreases. This can be explained as follows. Note that $A(1,5) = \mathtt{accga}$ and $B(1,4) = \mathtt{aaac}$. The edit distance between $A(1,5)$ and $B(1,4)$ is 4, as seen below (d, s and i denote deletion, substitution and insertion):

A(1,5)   a  -  c  c  g  a
B(1,4)   a  a  a  c  -  -
            d  s     i  i

If we go along the column downward, $B(1,4)$ becomes $B(1,5) = \mathtt{aaacg}$ and the edit distance is reduced to 3, as shown below.

A(1,5)   a  -  c  c  g  a
B(1,5)   a  a  a  c  g  -
            d  s        i

Let us consider another example. Consider the row $j = 6$ of Table 10.2-3, with $B(1,6) = \mathtt{aaacga}$ and $A(1,3) = \mathtt{acc}$, as shown below. The edit distance between them is 4, as indicated.

A(1,3)   a  -  c  c  -  -
B(1,6)   a  a  a  c  g  a
            d  s     d  d

Suppose $A(1,3) = \mathtt{acc}$ becomes $A(1,4) = \mathtt{accg}$. The situation is now as shown below and the edit distance is reduced to 3.

A(1,4)   a  -  c  c  g  -
B(1,6)   a  a  a  c  g  a
            d  s        d

*2. Whenever $ed(i-1,j-1) = ed(i,j)$, $a_i = b_j$.

Consider the location (5,5). In this location, $ed(i,j) = ed(5,5) = 3 = ed(i-1,j-1) = ed(4,4)$, yet $a_i = a_5 = a \neq b_j = b_5 = g$. Let us redisplay Equation (10.2-2) as below:

If $a_i \neq b_j$,
$$ed(i,j) = 1 + \min\left\{\begin{array}{l} ed(i-1,j) \\ ed(i,j-1) \\ ed(i-1,j-1) \end{array}\right.$$

Suppose $a_i \neq b_j$ and $ed(i-1,j)$ is the minimum of $ed(i-1,j)$, $ed(i,j-1)$ and $ed(i-1,j-1)$. Then according to the above equation, we have $ed(i,j) = ed(i-1,j) + 1$. But it may be the case that $ed(i-1,j-1) = ed(i-1,j) + 1$. Consequently, we have $ed(i,j) = ed(i-1,j-1)$. The same argument can be applied to the case where $ed(i,j-1)$ is the minimum of $ed(i-1,j)$, $ed(i,j-1)$ and $ed(i-1,j-1)$. Such examples can be found at locations (1,4), (1,5), (2,5), (3,1), (4,6), (5,5), (6,1), (6,2), (6,5), (6,6), (7,1), (7,2), (7,6), (8,1), (8,2) and (8,5).

Let us first define a new term.

Definition 10.2-1 Diagonal of a Matrix
Given a matrix, a diagonal $d$ is a continuous sequence of locations $(i,j)$ where $d = i - j$.

The following statements are all true:

1. Along a row or a column, the values in the dynamic programming table may decrease.

2. Along a diagonal of the dynamic programming table to compute the edit distance, the values are non-decreasing. To demonstrate this, we redisplay Table 10.2-1 as Table 10.2-4. For both diagonals marked in Table 10.2-4, the values are non-decreasing along them.

Table 10.2-4 The non-decreasing of the diagonals in a dynamic programming table for computing the edit distance

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

3. The difference between the value of any location and that of any of its neighbors can only be -1, 0 or 1. For example, in Table 10.2-4, $ed(6,6) = ed(5,5)$ and $ed(4,4) - ed(3,3) = 1$.

The last two statements are quite important and will be proved later.

Section 10.3 Rule A1 and the Second Dynamic Programming Algorithm to Find the Edit Distance

In this section, we shall introduce the second dynamic programming algorithm to find the edit distance. This algorithm is similar to the first one. Yet it is quite significant, because many approximate string matching algorithms use this approach.

We are given two strings $A = a_1 a_2 \cdots a_i$ and $B = b_1 b_2 \cdots b_j$. The recursive formula of the second dynamic programming algorithm to compute the edit distance between two strings is as follows:

Procedure 10.3-1 The Recursive Formula for the Second Edit Distance Finding Algorithm

$$ed(i,j) = \min\left\{\begin{array}{l} ed(i,j-1) + 1 \\ ed(i-1,j) + 1 \\ ed(i-1,j-1) + eq(i,j) \end{array}\right. \tag{10.3-1}$$
where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$.

$$ed(i,0) = i,\ 0 \le i \le n; \qquad ed(0,j) = j,\ 0 \le j \le m \tag{10.3-2}$$

Comparing the above equations with the recursive formulas expressed in Equations (10.2-1), (10.2-2) and (10.2-3), we can see that when $a_i \neq b_j$, the recursive formulas of the second algorithm are identical to those of the first one.
To prove that Equation (10.3-1) is equivalent to Equations (10.2-1) and (10.2-2), we therefore only have to take care of the case when $a_i = b_j$. That is, we have to prove that if $a_i = b_j$, then $ed(i-1,j-1) + eq(i,j) = ed(i-1,j-1)$ is smaller than or equal to both $ed(i,j-1) + 1$ and $ed(i-1,j) + 1$, so that $ed(i,j) = ed(i-1,j-1)$.

Let us first examine Table 10.2-1, which is displayed again as Table 10.3-1 below. We can easily see one special property of the elements inside the table. Consider any element in the dynamic programming table. The difference between it and any of its neighbors is between -1 and 1. For example, consider $i = 3$ and $j = 3$. We note that $ed(3,3) - ed(4,3) = 2 - 3 = -1$ and $ed(3,3) - ed(3,4) = 2 - 2 = 0$.

Table 10.3-1 Table 10.2-1 redisplayed

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

Before we prove the formal lemma, let us give the following claims:

Claim 10.3-1: If $|x - y| \le 1$ and $x \le y$, then $|(x+1) - y| \le 1$.

Claim 10.3-2: If $|x - y| \le 2$ and $x \le y$, then $|(x+1) - y| \le 1$.

We now prove the following:

Lemma 10.3-1: In the dynamic programming table to compute the edit distance, for all $i$ and $j$,
$$|ed(i,j) - ed(i,j-1)| \le 1 \tag{10.3-3}$$
and
$$|ed(i,j) - ed(i-1,j)| \le 1. \tag{10.3-4}$$

Proof: We prove by induction. For $i = 1$ and $j = 1$, $ed(0,0) = 0$, $ed(0,1) = 1$ and $ed(1,0) = 1$. Thus $ed(i,j) = ed(1,1) = 0$ or $1$, and this lemma is true for $i = 1$ and $j = 1$.

Note that this lemma implies that
$$|ed(i,j) - ed(i+1,j)| \le 1 \tag{10.3-5}$$
$$|ed(i,j) - ed(i,j+1)| \le 1 \tag{10.3-6}$$

Assume that this lemma is true for $(i-1,j-1)$.

If $a_i = b_j$, according to Equation (10.2-1), $ed(i,j) = ed(i-1,j-1)$.
But, as assumed and according to Equations (10.3-5) and (10.3-6),
$$|ed(i-1,j-1) - ed(i,j-1)| \le 1 \tag{10.3-7}$$
and
$$|ed(i-1,j-1) - ed(i-1,j)| \le 1 \tag{10.3-8}$$

Thus, by substituting $ed(i,j) = ed(i-1,j-1)$ into Equations (10.3-7) and (10.3-8), we again have
$$|ed(i,j) - ed(i,j-1)| \le 1 \quad \text{and} \quad |ed(i,j) - ed(i-1,j)| \le 1.$$

From Equations (10.3-7) and (10.3-8), we also have:
$$|ed(i,j-1) - ed(i-1,j)| \le 2 \tag{10.3-9}$$

If $a_i \neq b_j$, according to Equation (10.2-2),
$$ed(i,j) = \min\left\{\begin{array}{l} ed(i-1,j) + 1 \\ ed(i,j-1) + 1 \\ ed(i-1,j-1) + 1 \end{array}\right. = 1 + \min\left\{\begin{array}{l} ed(i-1,j) \\ ed(i,j-1) \\ ed(i-1,j-1) \end{array}\right.$$

Case 1: $ed(i-1,j-1)$ is the minimum of $ed(i-1,j)$, $ed(i,j-1)$ and $ed(i-1,j-1)$.

Then we have $ed(i,j) = ed(i-1,j-1) + 1$. Besides, since $ed(i-1,j-1)$ is the minimum, we have
$$ed(i-1,j-1) \le ed(i-1,j) \tag{10.3-10}$$
and
$$ed(i-1,j-1) \le ed(i,j-1) \tag{10.3-11}$$

Using $ed(i,j) = ed(i-1,j-1) + 1$, Equations (10.3-7), (10.3-8), (10.3-10) and (10.3-11), and Claim 10.3-1, we conclude:
$$|ed(i,j) - ed(i,j-1)| \le 1 \quad \text{and} \quad |ed(i,j) - ed(i-1,j)| \le 1.$$

Case 2: Without losing generality, we assume that $ed(i,j-1)$ is the minimum of $ed(i-1,j)$, $ed(i,j-1)$ and $ed(i-1,j-1)$.

Then we have $ed(i,j) = ed(i,j-1) + 1$, so $|ed(i,j) - ed(i,j-1)| \le 1$ holds immediately. Since $ed(i,j-1)$ is the minimum, we also have $ed(i,j-1) \le ed(i-1,j)$. Using $ed(i,j) = ed(i,j-1) + 1$, Equation (10.3-9) and Claim 10.3-2, we again have:
$$|ed(i,j) - ed(i,j-1)| \le 1 \quad \text{and} \quad |ed(i,j) - ed(i-1,j)| \le 1.$$

This completes the proof.

Although we have proved the general case for both $a_i = b_j$ and $a_i \neq b_j$ in the above proof, we will only make use of the case where $a_i = b_j$. For $a_i = b_j$, we have $ed(i,j) = ed(i-1,j-1)$. Therefore, we have the following lemma:

Lemma 10.3-2: If $a_i = b_j$, $|ed(i-1,j-1) - ed(i,j-1)| \le 1$ and $|ed(i-1,j-1) - ed(i-1,j)| \le 1$.

What is the significance of Lemma 10.3-2? Let us assume that we have $|a - b| \le 1$. There are three cases:

Case 1: $a - b = -1$. In this case, we have $a = b - 1$, so $a \le b + 1$.
Case 2: $a - b = 0$. In this case, we have $a = b$ and we may say that $a \le b + 1$.
Case 3: $a - b = 1$. In this case, we have $a = b + 1$ and we may say that $a \le b + 1$.

In conclusion, we have

Claim 10.3-3: If $|a - b| \le 1$, then $a \le b + 1$.
Combining Lemma 10.3-2 and Claim 10.3-3, we have:

Lemma 10.3-3: If $a_i = b_j$, $ed(i-1,j-1) \le ed(i,j-1) + 1$ and $ed(i-1,j-1) \le ed(i-1,j) + 1$.

Consider the recursive formula in Equation (10.3-1). According to this formula, when $a_i = b_j$, $ed(i-1,j-1) + eq(i,j) = ed(i-1,j-1)$. But from Lemma 10.3-2 and Claim 10.3-3, we can see that when $a_i = b_j$, $ed(i-1,j-1) \le ed(i,j-1) + 1$ and $ed(i-1,j-1) \le ed(i-1,j) + 1$. Therefore, when $a_i = b_j$,

$$ed(i,j) = \min\left\{\begin{array}{l} ed(i,j-1) + 1 \\ ed(i-1,j) + 1 \\ ed(i-1,j-1) + eq(i,j) \end{array}\right. = \min\left\{\begin{array}{l} ed(i,j-1) + 1 \\ ed(i-1,j) + 1 \\ ed(i-1,j-1) \end{array}\right. = ed(i-1,j-1).$$

Thus, Equations (10.3-1) and (10.3-2) are correct. Besides, by proving the correctness of Procedure 10.3-1, we have also proved the correctness of the tracing procedure introduced in Section 10.2, namely Procedure 10.2-2.

The second algorithm to compute the edit distance between two strings is now given below.

Algorithm 10.2 The Second Algorithm to Compute the Edit Distance between Two Strings
Input: Strings $A(1,n)$ and $B(1,m)$
Output: The edit distance between strings $A$ and $B$.
For $i = 0$ to $n$, $ed(i,0) = i$
For $j = 0$ to $m$, $ed(0,j) = j$
For $i = 1$ to $n$
    For $j = 1$ to $m$
        $ed(i,j) = \min\{ed(i,j-1) + 1,\ ed(i-1,j) + 1,\ ed(i-1,j-1) + eq(i,j)\}$
        where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$
Report $ed(n,m)$ as the edit distance between $A$ and $B$.

Let us now explain the beauty of Equations (10.3-1) and (10.3-2). Consider $i = 1$ and $j = 2$ in Table 10.3-1. In this case, since $a_i = a_1 = b_j = b_2 = a$, we have

$$ed(1,2) = \min\left\{\begin{array}{l} ed(1,1) + 1 = 0 + 1 = 1 \\ ed(0,2) + 1 = 2 + 1 = 3 \\ ed(0,1) + 0 = 1 + 0 = 1 \end{array}\right. = 1$$

This time, we note that there are two locations causing $ed(1,2) = 1$. The first location is (1,1), which corresponds to a deletion, and the second location is (0,1), which corresponds to a matching. If we use Equation (10.2-1), only $ed(0,1)$ is used. Note that Equation (10.2-1) gives the correct result, but fails to remind us that there is another way to achieve the same edit distance.
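Algorithm 10.2 can be sketched in a few lines. The sketch below is illustrative only: the function names are hypothetical, and `ed[i][j]` is indexed with `i` over $A$ and `j` over $B$. The helper `minimizing_terms` reports which of the three terms of Equation (10.3-1) attain the minimum at a given location.

```python
def ed_table_second(a: str, b: str) -> list[list[int]]:
    """Fill the table with Equations (10.3-1) and (10.3-2)."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i
    for j in range(m + 1):
        ed[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            eq = 0 if a[i - 1] == b[j - 1] else 1
            ed[i][j] = min(ed[i][j - 1] + 1,        # deletion of b_j
                           ed[i - 1][j] + 1,        # insertion of a_i
                           ed[i - 1][j - 1] + eq)   # matching or substitution
    return ed


def minimizing_terms(ed, a, b, i, j) -> list[str]:
    """Names of the terms of Equation (10.3-1) that equal ed(i, j), for i, j >= 1."""
    eq = 0 if a[i - 1] == b[j - 1] else 1
    terms = {
        "deletion": ed[i][j - 1] + 1,
        "insertion": ed[i - 1][j] + 1,
        "diagonal": ed[i - 1][j - 1] + eq,
    }
    return sorted(name for name, value in terms.items() if value == ed[i][j])


A, B = "accgatgc", "aaacga"
ed = ed_table_second(A, B)
distance = ed[len(A)][len(B)]
```

For location (1,2), `minimizing_terms` reports both the deletion term and the diagonal term, matching the two locations identified in the text.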
We can now understand the tracing back discussed in the above section in the case where $a_i = b_j$. A similar situation occurs at $i = 1$ and $j = 6$.

In the following, we shall prove another important property of the dynamic programming table to compute the edit distance, namely that the values along any diagonal of the table are non-decreasing. We shall call this Rule A1.

Rule A1: On the diagonals of a dynamic programming table to compute the edit distance, the values are non-decreasing.

For instance, consider Table 10.3-2, which is a redisplay of Table 10.2-4. Two diagonals are marked. For one of them, $d = -2$, and for the other one, $d = 0$.

Table 10.3-2 Table 10.2-4 redisplayed

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

As can be seen, for both diagonals, the values in the locations along the diagonals are non-decreasing. This is summarized in the following lemma.

Lemma 10.3-4: Along any diagonal of the dynamic programming table to compute the edit distance, the values are non-decreasing.

Proof: Consider any location $(i,j)$. If $a_i = b_j$, $ed(i,j) = ed(i-1,j-1)$. Therefore this lemma is true for this case.

If $a_i \neq b_j$,
$$ed(i,j) = \min\left\{\begin{array}{l} ed(i-1,j) + 1 \\ ed(i,j-1) + 1 \\ ed(i-1,j-1) + 1 \end{array}\right. \tag{10.3-12}$$

Case 1: From Equation (10.3-12), $ed(i,j) = ed(i-1,j-1) + 1$. Then $ed(i,j) > ed(i-1,j-1)$ and the lemma holds.

Case 2: From Equation (10.3-12), without losing generality, $ed(i,j) = ed(i-1,j) + 1$. By Equation (10.3-6), we have:
$$|ed(i-1,j-1) - ed(i-1,j)| \le 1$$
By Claim 10.3-3,
$$ed(i-1,j-1) \le ed(i-1,j) + 1$$
But
$$ed(i,j) = ed(i-1,j) + 1$$
Therefore, $ed(i-1,j-1) \le ed(i,j)$, or, equivalently, $ed(i,j) \ge ed(i-1,j-1)$.

This completes the proof.

Rule A1 will be used in Chapter 12 to design elegant algorithms to solve some approximate string matching problems.
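Both Lemma 10.3-1 and Rule A1 can be checked numerically on small inputs. The following sketch does so for the running example and, as an extra sanity check that is not part of the text, for all pairs of short strings over a two-letter alphabet; the function names are hypothetical.

```python
from itertools import product


def ed_table(a: str, b: str) -> list[list[int]]:
    """Dynamic programming table of Equations (10.3-1) and (10.3-2)."""
    n, m = len(a), len(b)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        ed[i][0] = i
    for j in range(m + 1):
        ed[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            eq = 0 if a[i - 1] == b[j - 1] else 1
            ed[i][j] = min(ed[i][j - 1] + 1, ed[i - 1][j] + 1, ed[i - 1][j - 1] + eq)
    return ed


def neighbors_differ_by_at_most_one(ed) -> bool:
    """Lemma 10.3-1: adjacent entries in a row or a column differ by at most 1."""
    n, m = len(ed) - 1, len(ed[0]) - 1
    rows_ok = all(abs(ed[i][j] - ed[i][j - 1]) <= 1
                  for i in range(n + 1) for j in range(1, m + 1))
    cols_ok = all(abs(ed[i][j] - ed[i - 1][j]) <= 1
                  for i in range(1, n + 1) for j in range(m + 1))
    return rows_ok and cols_ok


def diagonals_non_decreasing(ed) -> bool:
    """Rule A1: ed(i, j) >= ed(i-1, j-1) everywhere."""
    n, m = len(ed) - 1, len(ed[0]) - 1
    return all(ed[i][j] >= ed[i - 1][j - 1]
               for i in range(1, n + 1) for j in range(1, m + 1))


example = ed_table("accgatgc", "aaacga")

# Exhaustive check over all strings of length 0 to 3 on the alphabet {a, c}.
words = ["".join(p) for k in range(4) for p in product("ac", repeat=k)]
all_small_cases_hold = all(
    neighbors_differ_by_at_most_one(ed_table(x, y))
    and diagonals_non_decreasing(ed_table(x, y))
    for x in words for y in words
)
```

Such a check does not replace the proofs, but it is a quick way to catch a wrongly filled table.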
Section 10.4 The MP80 Algorithm to Find the Edit Distance

In this section, we will introduce the MP80 Algorithm to find the edit distance. Let us use Equation (10.3-1), redisplayed as follows:

$$ed(i,j) = \min\left\{\begin{array}{l} ed(i,j-1) + 1 \\ ed(i-1,j) + 1 \\ ed(i-1,j-1) + eq(i,j) \end{array}\right. \tag{10.4-1}$$
where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$.

From the above equation, we can see that $ed(i,j)$ is determined by four parameters. We now point out one idea: if, in a pre-processing step, we compute the resulting $ed(i,j)$ for all possible values of $ed(i,j-1)$, $ed(i-1,j)$, $ed(i-1,j-1)$ and $eq(i,j)$, then the dynamic programming process can be made very efficient. Let us assume that we are computing the case shown in Table 10.2-1, which is redisplayed below:

Table 10.4-1 A dynamic programming table

  i        0  1  2  3  4  5  6  7  8
  A           a  c  c  g  a  t  g  c
j B
0          0  1  2  3  4  5  6  7  8
1 a        1  0  1  2  3  4  5  6  7
2 a        2  1  1  2  3  3  4  5  6
3 a        3  2  2  2  3  3  4  5  6
4 c        4  3  2  2  3  4  4  5  5
5 g        5  4  3  3  2  3  4  4  5
6 a        6  5  4  4  3  2  3  4  5

Suppose we have already pre-processed the case so that we have the following pre-processed data:

If $ed(i-1,j-1) = 3$, $ed(i,j-1) = 2$, $ed(i-1,j) = 4$ and $eq(i,j) = 1$, then $ed(i,j) = \min(3+1, 2+1, 4+1) = \min(4,3,5) = 3$.

Then, suppose we compute the case where $i = 4$ and $j = 6$. In this case, $a_i = a_4 = g \neq b_j = b_6 = a$. Thus $eq(i,j) = 1$ and we can now use the above result and immediately obtain $ed(4,6) = 3$.

Let us consider another case. Suppose we have already computed the following:

If $ed(i-1,j-1) = 2$, $ed(i,j-1) = 3$, $ed(i-1,j) = 3$ and $eq(i,j) = 0$, then $ed(i,j) = \min(2+0, 3+1, 3+1) = \min(2,4,4) = 2$.

Then, suppose we compute the case where $i = 5$ and $j = 6$. In this case, $a_i = a_5 = a = b_j = b_6$. Thus $eq(i,j) = 0$ and we can now use the above result and immediately obtain $ed(5,6) = 2$.

But the above approach has a big problem. The number of possible values of $ed(i-1,j-1)$ is unbounded: it may be 1, 2, and so on, growing with the lengths of the strings. In the following, we will introduce another method which avoids the problem.
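The pre-processing idea and its drawback can be made concrete with a small sketch. The bound `max_value` and the function name `build_lookup` are assumptions introduced here for illustration; the text itself does not fix any bound, which is exactly the problem it points out. The sketch also stores combinations of values that can never occur in a real table.

```python
from itertools import product


def build_lookup(max_value: int) -> dict[tuple[int, int, int, int], int]:
    """Tabulate ed(i,j) for every combination of
    (ed(i-1,j-1), ed(i,j-1), ed(i-1,j), eq(i,j)) with table values in 0..max_value."""
    values = range(max_value + 1)
    table = {}
    for diag, left, up, eq in product(values, values, values, (0, 1)):
        # Equation (10.4-1): deletion, insertion, and matching or substitution.
        table[(diag, left, up, eq)] = min(left + 1, up + 1, diag + eq)
    return table


lookup = build_lookup(8)

# The two pre-processed cases used in the text.
ed_4_6 = lookup[(3, 2, 4, 1)]   # ed(3,5) = 3, ed(4,5) = 2, ed(3,6) = 4, eq = 1
ed_5_6 = lookup[(2, 3, 3, 0)]   # ed(4,5) = 2, ed(5,5) = 3, ed(4,6) = 3, eq = 0
```

The table has $2(\texttt{max\_value}+1)^3$ entries, so it grows with the largest distance that may appear, which depends on the string lengths.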
Let us take a look at Table 10.4-1. Although this is a large table, we are actually only interested in the values of the last row. That is, we are only interested in $ed(i,6)$ for $1 \le i \le 8$. The important thing is that we know the value of $ed(0,6) = 6$, which is an initial value. Suppose we have found that $ed(1,6) - ed(0,6) = -1$. We can then obtain $ed(1,6) = ed(0,6) + (-1) = 6 - 1 = 5$ immediately. If we know $ed(i,6) - ed(i-1,6)$ for $1 \le i \le 8$, we can obtain the last row entirely. The question is: how can we know $ed(i,6) - ed(i-1,6)$? We define two new terms:

Definition 10.4-1 $V(i,j)$ and $H(i,j)$ for the MP80 Algorithm to Find the Edit Distance

$$V(i,j) = ed(i,j) - ed(i,j-1), \qquad H(i,j) = ed(i,j) - ed(i-1,j) \tag{10.4-2}$$

Fig. 10.4-1 illustrates the relationship between these terms.

                          ed(i,j-1)
                              |  V(i,j)
ed(i-1,j)  -- H(i,j) --    ed(i,j)

Fig. 10.4-1 The relationship between $V(i,j)$ and $H(i,j)$

Substituting Equation (10.4-1) into Equation (10.4-2), we have:

$$\begin{aligned} V(i,j) &= ed(i,j) - ed(i,j-1) \\ &= \min\left\{\begin{array}{l} ed(i,j-1) - ed(i,j-1) + 1 \\ ed(i-1,j) - ed(i,j-1) + 1 \\ ed(i-1,j-1) - ed(i,j-1) + eq(i,j) \end{array}\right. \\ &= \min\left\{\begin{array}{l} 1 \\ (ed(i-1,j) - ed(i-1,j-1)) - (ed(i,j-1) - ed(i-1,j-1)) + 1 \\ eq(i,j) - (ed(i,j-1) - ed(i-1,j-1)) \end{array}\right. \\ &= \min\left\{\begin{array}{l} 1 \\ V(i-1,j) - H(i,j-1) + 1 \\ eq(i,j) - H(i,j-1) \end{array}\right. \end{aligned} \tag{10.4-3}$$
where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$.

Similarly, we can derive:

$$H(i,j) = \min\left\{\begin{array}{l} 1 \\ H(i,j-1) - V(i-1,j) + 1 \\ eq(i,j) - V(i-1,j) \end{array}\right. \tag{10.4-4}$$
where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$.

In summary, we have the following formulas:

Procedure 10.4-1 Formulas to Compute $V(i,j)$ and $H(i,j)$ for the MP80 Algorithm

$eq(i,j) = 0$ if $a_i = b_j$; $eq(i,j) = 1$ if $a_i \neq b_j$.

$$V(i,j) = \min\left\{\begin{array}{l} 1 \\ V(i-1,j) - H(i,j-1) + 1 \\ eq(i,j) - H(i,j-1) \end{array}\right.$$

$$H(i,j) = \min\left\{\begin{array}{l} 1 \\ H(i,j-1) - V(i-1,j) + 1 \\ eq(i,j) - V(i-1,j) \end{array}\right.$$

Let us consider some examples.

Example 10.4-1

For the case in Table 10.4-1, let $i = 1$ and $j = 1$.
Then, we have:

$$H(i,j-1) = ed(i,j-1) - ed(i-1,j-1) = ed(1,0) - ed(0,0) = 1 - 0 = 1$$
and
$$V(i-1,j) = ed(i-1,j) - ed(i-1,j-1) = ed(0,1) - ed(0,0) = 1 - 0 = 1$$

Note that in the above computation, $ed(1,0)$, $ed(0,1)$ and $ed(0,0)$ are all initial conditions. Since $a_i = a_1 = b_j = b_1 = a$, $eq(i,j) = eq(1,1) = 0$. Substituting the above results into Equation (10.4-3), we obtain:

$$V(1,1) = \min\left\{\begin{array}{l} 1 \\ 1 - 1 + 1 \\ 0 - 1 \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ 1 \\ -1 \end{array}\right. = -1$$

Once we have obtained $V(1,1)$, we can obtain $ed(1,1)$, because $ed(i,j) = ed(i,j-1) + V(i,j)$:

$$ed(1,1) = ed(1,0) + V(1,1) = 1 + (-1) = 0$$

This is correct, as can be seen in Table 10.4-1. Similarly, we can compute $H(1,1)$ as follows:

$$H(1,1) = \min\left\{\begin{array}{l} 1 \\ 1 - 1 + 1 \\ 0 - 1 \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ 1 \\ -1 \end{array}\right. = -1$$

Since $ed(i,j) = ed(i-1,j) + H(i,j)$, we can also obtain

$$ed(1,1) = ed(0,1) + H(1,1) = 1 + (-1) = 0$$

Example 10.4-2

Let us consider the case $i = 1$ and $j = 6$ for the case in Table 10.4-1. First, note that because $a_i = a_1 = b_j = b_6 = a$, $eq(i,j) = 0$. Assume that we have already obtained $ed(1,5)$, $ed(0,5)$ and $ed(0,6)$. We may compute

$$H(i,j-1) = ed(i,j-1) - ed(i-1,j-1) = ed(1,5) - ed(0,5) = 4 - 5 = -1$$
and
$$V(i-1,j) = ed(i-1,j) - ed(i-1,j-1) = ed(0,6) - ed(0,5) = 6 - 5 = 1.$$

Then

$$H(1,6) = \min\left\{\begin{array}{l} 1 \\ -1 - 1 + 1 \\ 0 - 1 \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ -1 \\ -1 \end{array}\right. = -1$$

Since we know that $ed(0,6) = 6$, we obtain

$$ed(1,6) = ed(0,6) + H(1,6) = 6 + (-1) = 5$$

By comparing this with the result in Table 10.4-1, we can see that this result is correct.

By examining Equations (10.4-3) and (10.4-4), we can consider $V(i-1,j)$, $H(i,j-1)$ and $eq(i,j)$ as inputs and $V(i,j)$ and $H(i,j)$ as outputs, as illustrated in Fig. 10.4-2.

Fig. 10.4-2 An illustration of Equations (10.4-3) and (10.4-4). Inputs: $H(i,j-1)$ between $ed(i-1,j-1)$ and $ed(i,j-1)$, $V(i-1,j)$ between $ed(i-1,j-1)$ and $ed(i-1,j)$, and $eq(i,j)$. Outputs: $V(i,j)$ between $ed(i,j-1)$ and $ed(i,j)$, and $H(i,j)$ between $ed(i-1,j)$ and $ed(i,j)$.

To give the reader more feeling about Equations (10.4-3) and (10.4-4), we may also illustrate them as in Fig. 10.4-3.

Fig. 10.4-3 The inputs and outputs of Equations (10.4-3) and (10.4-4). Inputs $V(i-1,j)$, $H(i,j-1)$ and $eq(i,j)$ feed Equation (10.4-3), which outputs $V(i,j)$, and Equation (10.4-4), which outputs $H(i,j)$.

Let us give more examples.

Example 10.4-3

Let $V(i-1,j) = 0$, $H(i,j-1) = -1$ and $eq(i,j) = 1$. Then, from Equation (10.4-3), we obtain

$$V(i,j) = \min\left\{\begin{array}{l} 1 \\ 0 - (-1) + 1 \\ 1 - (-1) \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ 2 \\ 2 \end{array}\right. = 1$$

From Equation (10.4-4), we obtain

$$H(i,j) = \min\left\{\begin{array}{l} 1 \\ -1 - 0 + 1 \\ 1 - 0 \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ 0 \\ 1 \end{array}\right. = 0$$

A possible case for this is illustrated in Fig. 10.4-4.

Fig. 10.4-4 A possible case for Example 10.4-3: $a_i = a$, $b_j = c$, $ed(i-1,j-1) = 3$, $ed(i,j-1) = 2$, $ed(i-1,j) = 3$ and $ed(i,j) = 3$.

Example 10.4-4

Let $V(i-1,j) = 1$, $H(i,j-1) = -1$ and $eq(i,j) = 0$. Then, from Equation (10.4-3), we obtain

$$V(i,j) = \min\left\{\begin{array}{l} 1 \\ 1 - (-1) + 1 \\ 0 - (-1) \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ 3 \\ 1 \end{array}\right. = 1$$

From Equation (10.4-4), we obtain

$$H(i,j) = \min\left\{\begin{array}{l} 1 \\ -1 - 1 + 1 \\ 0 - 1 \end{array}\right. = \min\left\{\begin{array}{l} 1 \\ -1 \\ -1 \end{array}\right. = -1$$

A possible case for this is illustrated in Fig. 10.4-5.

Fig. 10.4-5 A possible case for Example 10.4-4: $a_i = a$, $b_j = a$, $ed(i-1,j-1) = 3$, $ed(i,j-1) = 2$, $ed(i-1,j) = 4$ and $ed(i,j) = 3$.

To implement this idea, we compute Equations (10.4-3) and (10.4-4) for all possible cases of the inputs. Let us assume that the size of the vocabulary is 4. There are $4^2$ possible pairs of $a_i$ and $b_j$. By Lemma 10.3-1, there are only three possible values, -1, 0 and 1, for each of the inputs $V(i-1,j)$ and $H(i,j-1)$. Therefore there are $3^2 = 9$ possible pairs of these inputs. In total, there are $4^2 \times 3^2 = 16 \times 9 = 144$ possible cases. Once we have performed the pre-processing of these 144 cases, they can be stored in an array to be retrieved later. Then, instead of using Algorithm 10.1, we may use the following algorithm.

Algorithm 10.3 An Edit Distance Finding Algorithm Based upon the MP80 Algorithm
Input: Strings $A(1,n)$ and $B(1,m)$
Output: The edit distance between strings $A$ and $B$.
$ed(0,m) = m$
For $i = 1$ to $n$, $H(i,0) = 1$
For $j = 1$ to $m$, $V(0,j) = 1$
For $i = 1$ to $n$
    For $j = 1$ to $m$
        $V(i,j) = \min\{1,\ V(i-1,j) - H(i,j-1) + 1,\ eq(i,j) - H(i,j-1)\}$
        $H(i,j) = \min\{1,\ H(i,j-1) - V(i-1,j) + 1,\ eq(i,j) - V(i-1,j)\}$
        where $eq(i,j) = 0$ if $a_i = b_j$ and $eq(i,j) = 1$ if $a_i \neq b_j$
For $i = 1$ to $n$
    $ed(i,m) = ed(i-1,m) + H(i,m)$
Report $ed(n,m)$ as the edit distance between strings $A$ and $B$.

Section 10.5 References

[L66]: Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966, Vol. 10, No. 8, pp. 707-710.
[MP80]: Masek, W. J. and Paterson, M. S. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 1980, Vol. 20, No. 1, pp. 12-31.
[S80]: Sellers, P. H. The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, 1980, Vol. 1, No. 4, pp. 359-373.
[WF74]: Wagner, R. A. and Fischer, M. J. The String-to-String Correction Problem. Journal of the ACM, 1974, Vol. 21, No. 1, pp. 168-173.