CHAPTER 4
The Sequence Alignment Problem

We often need to compare two sequences. For instance, we may be given two DNA sequences that are not exactly the same, and we would like to line them up in such a way that information can be extracted from them. Consider, for example, the following two sequences:

    ATTCATTACAACCGCTATG
    ACCCATCAACAACCGCTATG

These two sequences do not look similar to each other. If we perform a sequence alignment operation on them, the result is the following:

    ATTCATTA-CAACCGCTATG
    ACCCATCAACAACCGCTATG

We can now see that these sequences are indeed quite similar to each other. In a certain sense, sequence alignment lets us measure the similarity of two sequences.

There are many reasons to perform the sequence alignment operation. Suppose that we are given a DNA sequence X and that we also have a database containing a set of DNA sequences. It is desirable to search through the database to find the sequence which is most similar to X. To compare X with any sequence Y in the database, we have to perform the sequence alignment operation on X and Y.

Another application is data compression. Consider the above two sequences. Instead of recording both sequences, we merely have to record one of them and, for the other sequence, store the difference.

In this chapter, we shall introduce several methods to perform the sequence alignment operation.

4.1 The Sequence Alignment Problem

The sequence alignment problem can be illustrated by considering the two sequences $S_1$ = GAACTG and $S_2$ = GAGCTG. Two alignments are shown below:

    GAACTG---
    GA---GCTG

    GAACTG
    GAGCTG

The second alignment is clearly much better than the first one. We shall use a scoring rule to measure the quality of an alignment. Unless stated otherwise, we shall use the following rule throughout this chapter:

1. If $a_i$ is aligned with $b_j$ and $a_i = b_j$, the score is +2.
2. If $a_i$ or $b_j$ is aligned with a blank, the score is -1.
3. If $a_i$ is aligned with $b_j$ and $a_i \neq b_j$, the score is -1.

For the following alignment, the score is $2 \times 2 + 5 \times (-1) = -1$.

    ab-bcad
    aecb---

Our sequence alignment problem is as follows: Given two sequences, find an alignment which has the highest score.

This problem can be solved by the dynamic programming approach through the following reasoning. Let $A_{i,j}$ denote the score of the optimal alignment of $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.

1. If $a_i$ is equal to $b_j$, then $a_i$ should be aligned with $b_j$ and the score should be increased by 2: $A_{i,j} = A_{i-1,j-1} + 2$. To find $A_{i-1,j-1}$ means that we should find an optimal alignment of $a_1 a_2 \ldots a_{i-1}$ and $b_1 b_2 \ldots b_{j-1}$.
2. If $a_i$ is not equal to $b_j$, then there are three possibilities:
   (a) $a_i$ is aligned with $b_j$ and the score is decreased by 1: $A_{i,j} = A_{i-1,j-1} - 1$.
   (b) $a_i$ is aligned with - and the score is decreased by 1: $A_{i,j} = A_{i-1,j} - 1$.
   (c) $b_j$ is aligned with - and the score is decreased by 1: $A_{i,j} = A_{i,j-1} - 1$.
   Among the above three alignments, we simply pick the one which gives the highest $A_{i,j}$.

In summary, to find the optimal alignment, we may use the following formula:

$$A_{0,0} = 0, \qquad A_{i,0} = -i, \qquad A_{0,j} = -j,$$

$$A_{i,j} = \begin{cases} \max\{A_{i-1,j-1} - 1,\ A_{i-1,j} - 1,\ A_{i,j-1} - 1\} & \text{if } a_i \neq b_j, \\ A_{i-1,j-1} + 2 & \text{if } a_i = b_j. \end{cases} \qquad (4.1)$$

Note that $A_{0,0}$ means that - is aligned with -, $A_{i,0}$ means that $a_1 a_2 \ldots a_i$ is aligned with $i$ blanks, and $A_{0,j}$ means that $j$ blanks are aligned with $b_1 b_2 \ldots b_j$.

We now give an example. Let $S_1$ = abbcad and $S_2$ = eacb. The computation of $A_{i,j}$ is illustrated in Figure 4.1.

Figure 4.1: The Computation of the Optimal Alignment of abbcad and eacb.

The tracing back can be done as follows:

(1) If $(i, j)$ points to $(i-1, j-1)$, $a_i$ is aligned with $b_j$.
(2) If $(i, j)$ points to $(i-1, j)$, $a_i$ is aligned with -.
(3) If $(i, j)$ points to $(i, j-1)$, $b_j$ is aligned with -.
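The recurrence in Equation 4.1 and the trace-back rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `global_align` and its parameter names are chosen here for the example, and when several optimal alignments exist the trace-back simply reports one of them (it tries the diagonal move first, then the upward move).

```python
def global_align(s1, s2, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_s1, aligned_s2) for one optimal global alignment.

    Hypothetical helper: the names and default scores follow the scoring
    rule of this section (+2 for a match, -1 for a mismatch or a blank).
    """
    m, n = len(s1), len(s2)
    # A[i][j] = score of an optimal alignment of s1[:i] and s2[:j]
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * gap          # a_1..a_i aligned with i blanks
    for j in range(1, n + 1):
        A[0][j] = j * gap          # j blanks aligned with b_1..b_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + match
            else:
                A[i][j] = max(A[i - 1][j - 1] + mismatch,
                              A[i - 1][j] + gap,
                              A[i][j - 1] + gap)

    # Trace back from (m, n) to (0, 0), collecting columns in reverse order.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if s1[i - 1] == s2[j - 1] else mismatch
            if A[i][j] == A[i - 1][j - 1] + step:    # a_i aligned with b_j
                top.append(s1[i - 1]); bottom.append(s2[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and A[i][j] == A[i - 1][j] + gap:   # a_i aligned with -
            top.append(s1[i - 1]); bottom.append("-")
            i -= 1
            continue
        top.append("-"); bottom.append(s2[j - 1])    # b_j aligned with -
        j -= 1
    return A[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = global_align("abbcad", "eacb")
print(score)
print(row1)
print(row2)
```

The table needs $(m+1)(n+1)$ entries for sequences of lengths $m$ and $n$, so both time and space grow as the product of the two lengths.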
By tracing back the table, we can conclude that three optimal alignments between abbcad and eacb are as follows:

    -abbcad    -abbcad    -abbcad
    eacb---    ea--cb-    ea--c-b

There is another way to express the idea of Equation 4.1. Let $\delta(x, y)$ be defined as follows:

$$\delta(x, y) = 2 \ \text{ if } x = y, \qquad \delta(x, y) = -1 \ \text{ if } x \neq y,$$

where $x$ or $y$ may be a blank, so that $\delta(a_i, -) = \delta(-, b_j) = -1$. Then Equation 4.1 becomes

$$A_{i,j} = \max \begin{cases} A_{i-1,j-1} + \delta(a_i, b_j) \\ A_{i-1,j} + \delta(a_i, -) \\ A_{i,j-1} + \delta(-, b_j) \end{cases} \qquad (4.2)$$

It is worthwhile to examine one more example in detail so that the reader can get a better feeling for this alignment algorithm. Consider $S_1$ = eab and $S_2$ = acb. Let us first examine the meaning of $A_{0,0}$, $A_{1,0}$ and $A_{0,1}$:

    A_{0,0} corresponds to    -
                              -

    A_{1,0} corresponds to    e
                              -

    A_{0,1} corresponds to    -
                              a

This is why $A_{0,0} = 0$, $A_{1,0} = -1$ and $A_{0,1} = -1$. The reader should understand that $A_{2,0}$ corresponds to

    ea
    --

and $A_{0,3}$ corresponds to

    ---
    acb

Having $A_{0,0}$, $A_{1,0}$ and $A_{0,1}$, we can now determine $A_{1,1}$. Here $a_1 = e$ and $b_1 = a$. Thus $\delta(a_1, b_1) = \delta(e, a) = -1$. Since $a_1 \neq b_1$, we have three choices, as illustrated below:

Choice 1:
    e
    a
which corresponds to $A_{0,0} + \delta(a_1, b_1) = 0 - 1 = -1$.

Choice 2:
    -e
    a-
which corresponds to $A_{0,1} + \delta(a_1, -) = -1 - 1 = -2$.

Choice 3:
    e-
    -a
which corresponds to $A_{1,0} + \delta(-, b_1) = -1 - 1 = -2$.

$$A_{1,1} = \max \begin{cases} A_{0,0} + \delta(a_1, b_1) = 0 - 1 = -1 \\ A_{0,1} + \delta(a_1, -) = -1 - 1 = -2 \\ A_{1,0} + \delta(-, b_1) = -1 - 1 = -2 \end{cases} = A_{0,0} + \delta(a_1, b_1) = -1.$$

This means that the optimal alignment of $a_1$ with $b_1$ is as follows:

    e
    a

We calculate $A_{1,2}$. Note that $a_1 = e$ and $b_2 = c$. Thus $\delta(a_1, b_2) = \delta(e, c) = -1$. Since $a_1 \neq b_2$, we have three choices:

Choice 1:
    -e
    ac
which corresponds to $A_{0,1} + \delta(a_1, b_2) = -1 - 1 = -2$.

Choice 2:
    --e
    ac-
which corresponds to $A_{0,2} + \delta(a_1, -) = -2 - 1 = -3$.

Choice 3:
    e-
    ac
which corresponds to $A_{1,1} + \delta(-, b_2) = -1 - 1 = -2$.

We may choose either Choice 1 or Choice 3. Assume that we choose Choice 1. Then $A_{1,2} = A_{0,1} + \delta(a_1, b_2) = -1 - 1 = -2$. An optimal alignment of $a_1$ with $b_1 b_2$ is

    -e
    ac

We calculate $A_{2,1}$. Here $a_2 = a$, $b_1 = a$ and $a_2 = b_1$. Thus $\delta(a_2, b_1) = \delta(a, a) = 2$. We simply decide that $A_{2,1} = A_{1,0} + \delta(a_2, b_1) = -1 + 2 = 1$. The optimal alignment of $a_1 a_2$ with $b_1$ is

    ea
    -a

We calculate $A_{2,2}$. Here $a_2 = a$, $b_2 = c$ and $a_2 \neq b_2$. Thus $\delta(a_2, b_2) = \delta(a, c) = -1$.
Since $a_2 \neq b_2$, we have three choices:

Choice 1:
    ea
    ac
which corresponds to $A_{1,1} + \delta(a_2, b_2) = -1 - 1 = -2$.

Choice 2:
    -ea
    ac-
which corresponds to $A_{1,2} + \delta(a_2, -) = -2 - 1 = -3$.

Choice 3:
    ea-
    -ac
which corresponds to $A_{2,1} + \delta(-, b_2) = 1 - 1 = 0$.

We choose Choice 3: $A_{2,2} = A_{2,1} - 1 = 0$. The optimal alignment of $a_1 a_2$ with $b_1 b_2$ is

    ea-
    -ac

Since $a_3 = b_3$, we have an optimal alignment of $a_1 a_2 a_3$ with $b_1 b_2 b_3$ as

    ea-b
    -acb

which corresponds to $A_{3,3} = A_{2,2} + \delta(a_3, b_3) = 0 + 2 = 2$.

The longest common subsequence problem is similar to this alignment problem. The only difference lies in the scoring function. For the longest common subsequence problem, an exact match scores 1 and everything else scores 0. In other words, there are only awards and no penalties.

4.2 The Local Alignment Problem

The local alignment problem is defined as follows: We are given two sequences $S_1$ and $S_2$. Find a subsequence $S_1'$ of $S_1$ and a subsequence $S_2'$ of $S_2$ such that the score obtained by aligning $S_1'$ and $S_2'$ is the highest among all possible subsequences of $S_1$ and $S_2$. (Here a subsequence means a run of consecutive characters, as in the example below.)

Consider the following sequences:

    S1 = abbbcc
    S2 = adddcc

Using the global alignment method introduced in the previous section, we obtain the following:

    S1 = abbbcc
    S2 = adddcc

The score is $3 \times 2 + 3 \times (-1) = 3$.

Suppose we let $S_1'$ = cc and $S_2'$ = cc. Then we obtain a local alignment with score $2 \times 2 = 4$, which is higher than that obtained by the global alignment method.

The local alignment problem can again be solved by dynamic programming, with one mechanism added. We scan from the beginning. As soon as the score becomes negative, we reset it to zero. For the above example, as soon as we scan to abbb of $S_1$ and addd of $S_2$, the optimal alignment already has a negative score. We therefore reset the alignment. That is, we forget about the previous alignment and start all over again. In this way, we obtain an optimal local alignment, which is cc of $S_1$ against cc of $S_2$.
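The reset-to-zero mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `local_align` is hypothetical, the first row and the first column of the table are taken to be zero (so that an alignment may start anywhere), and the trace-back starts from a highest-scoring cell and stops as soon as a zero entry is reached.

```python
def local_align(s1, s2, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_part_of_s1, aligned_part_of_s2).

    Hypothetical helper illustrating the reset-to-zero idea; the default
    scores are the +2 / -1 / -1 rule used earlier in this chapter.
    """
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]   # assumed zero boundary
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            # A negative score is never kept: it is reset to zero.
            A[i][j] = max(0,
                          A[i - 1][j - 1] + d,
                          A[i - 1][j] + gap,
                          A[i][j - 1] + gap)
            if A[i][j] > best:
                best, best_i, best_j = A[i][j], i, j

    # Trace back from the best cell until the score drops to zero.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and A[i][j] > 0:
        d = match if s1[i - 1] == s2[j - 1] else mismatch
        if A[i][j] == A[i - 1][j - 1] + d:
            top.append(s1[i - 1]); bottom.append(s2[j - 1])
            i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] + gap:
            top.append(s1[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("abbbcc", "adddcc"))   # (4, 'cc', 'cc')
```

Only the best score is tracked here; if several cells share the maximum, the first one encountered in row order is used.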
The recurrence formula 4.2 in Section 4.1 now becomes the following, where the boundary entries are zero because a negative score is always reset to zero:

$$A_{i,0} = 0, \qquad A_{0,j} = 0,$$

$$A_{i,j} = \max \begin{cases} 0 \\ A_{i-1,j-1} + \delta(a_i, b_j) \\ A_{i-1,j} + \delta(a_i, -) \\ A_{i,j-1} + \delta(-, b_j) \end{cases}$$

Let us consider the following two strings:

    S1 = abbcdae
    S2 = afgfde

Figure 4.2 illustrates our computations.

Figure 4.2: The Computation of an Optimal Local Alignment.

As shown in Figure 4.2, we have found the optimal local alignment shown below:

    dae
    d-e

4.3 The Affine Gap Penalty

Consider the following two sequences:

    S1 = ACTTGATCC
    S2 = AGTTAGTAGTCC

An optimal alignment of the above pair of sequences is as follows:

    S1 = ACTT-G-A-TCC
    S2 = AGTTAGTAGTCC

There are three gaps in the above alignment, where a gap is defined as a string of consecutive spaces. Let us now consider the following alignment:

    S1 = ACTT---GATCC
    S2 = AGTTAGTAGTCC

In this alignment, there is only one gap. A gap is caused by a mutational event which removed a sequence of residues, and a single mutational event is more likely than several events. Therefore one long gap is often preferable to several short gaps. This preference can be achieved by imposing an affine gap penalty on the alignment.

An affine gap penalty is defined as $P_g + kP_e$ for a gap with $k$, $k \geq 1$, spaces, where $P_g \geq 0$ and $P_e \geq 0$. When this gap penalty is used, we have to let $\delta(x, -) = \delta(-, x) = 0$. $P_g$ is related to the initiation of a gap and $P_e$ is related to the length of the gap.

We shall use our previous scoring function for aligned characters. That is, $\delta(x, y) = 2$ if $x = y$ and $\delta(x, y) = -1$ if $x \neq y$. Suppose we further let $P_g = 4$ and $P_e = 1$. Then, for the first alignment, the score is

$$8 \times 2 - 1 - 3 \times (4 + 1 \times 1) = 16 - 1 - 15 = 0.$$

For the second alignment, the score is

$$6 \times 2 - 3 \times 1 - (4 + 3 \times 1) = 12 - 3 - 7 = 2.$$

That is, if we use this gap penalty function, the second alignment is better than the first one.

The problem of finding an alignment with an affine gap penalty can still be solved by the dynamic programming approach. Let $A_{i,j}$ denote the score of an optimal alignment of $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.

1. If $a_i$ is equal to $b_j$, then $a_i$ should be aligned with $b_j$ and the score should be increased by 2: $A_{i,j} = A_{i-1,j-1} + 2$. That is, we should find an optimal alignment of $a_1 a_2 \ldots a_{i-1}$ and $b_1 b_2 \ldots b_{j-1}$.
2. If $a_i$ is not equal to $b_j$, then there are five possibilities:
   (a) $a_i$ is aligned with $b_j$ and the score is decreased by 1: $A_{i,j} = A_{i-1,j-1} - 1$.
   (b) $a_i$ is aligned with -, this - is in the midst of an existing gap and the score is decreased by $P_e$: $A_{i,j} = A_{i-1,j} - P_e$.
   (c) $a_i$ is aligned with -, this - initiates a new gap and the score is decreased by $P_g + P_e$: $A_{i,j} = A_{i-1,j} - P_g - P_e$.
   (d) $b_j$ is aligned with -, this - is in the midst of an existing gap and the score is decreased by $P_e$: $A_{i,j} = A_{i,j-1} - P_e$.
   (e) $b_j$ is aligned with -, this - initiates a new gap and the score is decreased by $P_g + P_e$: $A_{i,j} = A_{i,j-1} - P_g - P_e$.
   Among the above five possible choices, we simply choose the one which gives the highest $A_{i,j}$.

To simplify the computation, we may also define three functions as follows.

1. $A_1(i, j)$ is the score of an optimal alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$ under the condition that $a_i$ is aligned with $b_j$. (Note that $a_i$ is not necessarily equal to $b_j$.)
2. $A_2(i, j)$ is the score of an optimal alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$ under the condition that $a_i$ is aligned with -.
3. $A_3(i, j)$ is the score of an optimal alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$ under the condition that $b_j$ is aligned with -.

Then, clearly, $A_{i,j} = \max\{A_1(i, j), A_2(i, j), A_3(i, j)\}$.

How do we compute $A_1(i, j)$, $A_2(i, j)$ and $A_3(i, j)$? First of all, the following is obvious:

$$A_1(i, j) = A_{i-1,j-1} + \delta(a_i, b_j).$$

Now consider $A_2(i, j)$. By definition, we force $a_i$ to be aligned with -. There are two possibilities:

(1) This - is in the midst of an existing gap. Then $A_2(i, j) = A_2(i-1, j) - P_e$.
(2) This - initiates a new gap. Then $A_2(i, j) = A_{i-1,j} - P_g - P_e$.

Thus,

$$A_2(i, j) = \max\{A_2(i-1, j) - P_e,\ A_{i-1,j} - P_g - P_e\}.$$

Similarly,

$$A_3(i, j) = \max\{A_3(i, j-1) - P_e,\ A_{i,j-1} - P_g - P_e\}.$$
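The three functions above can be sketched directly as three tables. This is only an illustration under stated assumptions: the function name `affine_score` is hypothetical, only the optimal score is returned (no trace-back), and the boundary values are chosen here so that a leading gap of $k$ spaces costs $P_g + kP_e$, while table entries that cannot occur are set to minus infinity.

```python
NEG_INF = float("-inf")


def affine_score(s1, s2, match=2, mismatch=-1, pg=4, pe=1):
    """Score of an optimal alignment under the affine gap penalty pg + k*pe.

    Hypothetical helper; defaults follow the example values Pg = 4, Pe = 1.
    """
    m, n = len(s1), len(s2)
    A = [[NEG_INF] * (n + 1) for _ in range(m + 1)]    # best overall
    A2 = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # a_i aligned with -
    A3 = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # b_j aligned with -

    # Assumed boundary values: one leading gap of i (or j) spaces.
    A[0][0] = 0
    for i in range(1, m + 1):
        A[i][0] = A2[i][0] = -pg - i * pe
    for j in range(1, n + 1):
        A[0][j] = A3[0][j] = -pg - j * pe

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            a1 = A[i - 1][j - 1] + d                       # A1(i, j)
            A2[i][j] = max(A2[i - 1][j] - pe,              # extend a gap
                           A[i - 1][j] - pg - pe)          # open a new gap
            A3[i][j] = max(A3[i][j - 1] - pe,
                           A[i][j - 1] - pg - pe)
            A[i][j] = max(a1, A2[i][j], A3[i][j])
    return A[m][n]


print(affine_score("ACTTGATCC", "AGTTAGTAGTCC"))   # 2
```

Keeping $A_2$ and $A_3$ as separate tables is what lets the recurrence tell an extended gap (cost $P_e$) from a newly opened one (cost $P_g + P_e$).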
In summary, we may use the following formulas:

$$A_{0,0} = A_2(0, 0) = A_3(0, 0) = 0,$$
$$A_2(i, 0) = A_{i,0} = -P_g - iP_e \quad \text{for } i > 0,$$
$$A_3(0, i) = A_{0,i} = -P_g - iP_e \quad \text{for } i > 0,$$
$$A_2(0, i) = A_3(i, 0) = -\infty \quad \text{for } i > 0,$$
$$A_{i,j} = \max\{A_1(i, j),\ A_2(i, j),\ A_3(i, j)\},$$
$$A_1(i, j) = A_{i-1,j-1} + \delta(a_i, b_j),$$
$$A_2(i, j) = \max\{A_2(i-1, j) - P_e,\ A_{i-1,j} - P_g - P_e\},$$
$$A_3(i, j) = \max\{A_3(i, j-1) - P_e,\ A_{i,j-1} - P_g - P_e\}.$$

Figure 4.3 gives the computation of the $A_{i,j}$'s for the two sequences given at the beginning of this section. The optimal alignment is as follows:

    ACTT---GATCC
    AGTTAGTAGTCC

Figure 4.3: The Computation of an Optimal Alignment with Affine Gap Penalty.

4.4 The Multiple Sequence Alignment Problem

In the previous sections, we only studied the alignment between two sequences. It is natural to extend this problem to the multiple sequence alignment problem, in which more than two sequences are involved. Consider the following case where three sequences are involved:

    S1 = ATTCGAT
    S2 = TTGAG
    S3 = ATGCT

A very good alignment of these three sequences is shown below:

    S1 = ATTCGAT
    S2 = -TT-GAG
    S3 = AT--GCT

Note that the alignment between every pair of sequences is quite good.

Let us assume that a gene is found. We may translate this gene into its corresponding amino acid sequence. Once this is done, we can search through a relevant protein database to find similar protein sequences, in the hope that we can better understand the function of this gene. Before searching this database, it is necessary to align the proteins into an appropriate form through multiple matching.

The multiple alignment problem is similar to the two sequence alignment problem. Instead of defining a fundamental scoring function $\delta(x, y)$, we have to define a scoring function involving more variables. Let us assume that there are, say, three sequences. Instead of considering $a_i$ and $b_j$, we now have to consider the matching of $a_i$, $b_j$ and $c_k$. That is, a $\delta(x, y, z)$ has to be defined.
Then, instead of finding $A_{i,j}$ as we did in the previous sections, we have to find $A_{i,j,k}$. The difficulty is that, when we determine $A_{i,j,k}$, we have to consider all of the following:

$$A_{i-1,j,k},\quad A_{i,j-1,k},\quad A_{i,j,k-1},\quad A_{i-1,j-1,k},\quad A_{i,j-1,k-1},\quad A_{i-1,j,k-1},\quad A_{i-1,j-1,k-1}.$$

The reader can easily imagine that the number of cases to be considered increases tremendously as the number of sequences involved increases.

From the previous sections, we know that if there are two sequences of length $n$, we need to construct an $n \times n$ table to find an optimal alignment when the dynamic programming approach is used. Thus it takes $O(n^2)$ steps to solve a two sequence alignment problem. If there are three sequences, we can obtain an optimal alignment in $O(n^3)$ steps. Given $k$ input sequences, we need at least $O(n^k)$ steps if the dynamic programming approach is used.

Let there be $k$ sequences. Let a scoring function be defined between two sequences. Assume that this scoring function is linear. Given $k$ input sequences, the sum of pairs multiple sequence alignment problem is to find an alignment of these $k$ sequences which maximizes the sum of the scores of all pairs of sequence alignments among them. If $k$, the number of input sequences, is a variable, this problem has been proved to be NP-complete. Thus there is not much hope of finding a polynomial algorithm for the sum of pairs multiple sequence alignment problem, and approximation algorithms are needed. In the following sections, we will introduce some approximation algorithms for the multiple sequence alignment problem.

4.5 The Gusfield Approximation Algorithm for the Sum of Pairs Multiple Sequence Alignment Problem

In this section, we shall introduce an approximation algorithm, proposed by Gusfield, to solve one version of the multiple sequence alignment problem. First of all, we shall use a different kind of scoring function.
Consider the following two sequences:

    S1 = GCCAT
    S2 = GAT

A possible alignment between these two sequences is

    S1' = GCCAT
    S2' = G--AT

For this alignment, there are three exact matches and two mismatches. The distance induced by this alignment is 2. That is, we define $\delta(x, y) = 0$ if $x = y$ and $\delta(x, y) = 1$ if $x \neq y$. For an alignment

    S1' = a_1', a_2', ..., a_n'
    S2' = b_1', b_2', ..., b_n'

the distance between the two sequences induced by the alignment is defined as

$$\sum_{i=1}^{n} \delta(a_i', b_i').$$

The reader can easily see that this distance function, denoted as $d(S_i, S_j)$, has the following characteristics:

(1) $d(S_i, S_i) = 0$
(2) $d(S_i, S_j) \leq d(S_i, S_k) + d(S_j, S_k)$

The second property is called the triangle inequality.

There is another point which we have to emphasize. For a two sequence alignment, the pair (-, -) will never occur. In a multiple sequence alignment, however, it is possible to have a "-" matched with a "-". Consider $S_1$, $S_2$ and $S_3$ as follows:

    S1 = ACTC
    S2 = AC
    S3 = ATCG

Let us first align $S_1$ and $S_2$ as follows:

    ACTC
    A--C

Then suppose we align $S_3$ with $S_1$ as follows:

    ACTC-
    A-TCG

The three sequences are finally aligned as below:

    S1 = ACTC-
    S2 = A--C-
    S3 = A-TCG

This time, - is matched with - twice. Note that when $S_3$ is aligned with $S_1$, a - is added to the already aligned $S_1$.

Given two sequences $S_i$ and $S_j$, the minimum induced aligned distance is denoted as $D(S_i, S_j)$. This notation will be used throughout this section.

Let us now consider the following four sequences. We would like to find the sequence which has the shortest distances to all other sequences.

    S1 = ATGCTC
    S2 = AGAGC
    S3 = TTCTG
    S4 = ATTGCATGC

We align the four sequences in pairs.
    S1 = ATGCTC
    S2 = A-GAGC          D(S1, S2) = 3

    S1 = ATGCTC
    S3 = TT-CTG          D(S1, S3) = 3

    S1 = AT-GC-T-C
    S4 = ATTGCATGC       D(S1, S4) = 3

    S2 = AGAGC
    S3 = TTCTG           D(S2, S3) = 5

    S2 = A--G-A-GC
    S4 = ATTGCATGC       D(S2, S4) = 4

    S3 = -TT-C-TG-
    S4 = ATTGCATGC       D(S3, S4) = 4

$$D(S_1, S_2) + D(S_1, S_3) + D(S_1, S_4) = 3 + 3 + 3 = 9$$
$$D(S_2, S_1) + D(S_2, S_3) + D(S_2, S_4) = 3 + 5 + 4 = 12$$
$$D(S_3, S_1) + D(S_3, S_2) + D(S_3, S_4) = 3 + 5 + 4 = 12$$
$$D(S_4, S_1) + D(S_4, S_2) + D(S_4, S_3) = 3 + 4 + 4 = 11$$

It can be seen that $S_1$ has the shortest distances to all other sequences. In some sense, we may say that $S_1$ is the most similar to the others. We shall call this sequence the center of the sequences. Let us now formally define this concept.

Given a set $S$ of $k$ sequences, the center of this set of sequences is the sequence $S_i$ which minimizes

$$\sum_{X \in S \setminus \{S_i\}} D(S_i, X).$$

There are $k(k-1)/2$ pairs of sequences. Each pair can be aligned by using the dynamic programming approach. Thus it takes polynomial time to find a center.

Our approximation algorithm works as follows:

Algorithm 4.1 An approximation algorithm to find an approximate solution for the sum of pairs multiple sequence alignment problem.
Input: $k$ sequences.
Output: An alignment of the $k$ sequences with performance ratio smaller than 2.
Step 1: Find the center of these $k$ sequences. Without losing generality, we may assume that $S_1$ is the center.
Step 2: Let $i = 2$.
Step 3: Find an optimal alignment between $S_i$ and $S_1$. Add spaces to the already aligned sequences $S_1, S_2, \ldots, S_{i-1}$ if necessary.
Step 4: If $i = k$, output the final alignment; otherwise, let $i = i + 1$ and go to Step 3.

Let us consider the four sequences discussed above:

    S1 = ATGCTC
    S2 = AGAGC
    S3 = TTCTG
    S4 = ATTGCATGC

As we showed before, $S_1$ is the center. We now align $S_2$ with $S_1$ as follows:

    S1 = ATGCTC
    S2 = A-GAGC

Add $S_3$ by aligning $S_3$ with $S_1$:

    S1 = ATGCTC
    S3 = -TTCTG

Thus the alignment becomes:

    S1 = ATGCTC
    S2 = A-GAGC
    S3 = -TTCTG

Add $S_4$ by aligning $S_4$ with $S_1$.
    S1 = AT-GC-T-C
    S4 = ATTGCATGC

This time, spaces are added to the aligned $S_1$. Thus, spaces have to be added to the aligned $S_2$ and $S_3$. The final alignment is:

    S1 = AT-GC-T-C
    S2 = A--GA-G-C
    S3 = -T-TC-T-G
    S4 = ATTGCATGC

As we can see, this is a typical approximation algorithm, as we only align all sequences with respect to $S_1$.

Let $d(S_i, S_j)$ denote the distance between $S_i$ and $S_j$ induced by this approximation algorithm, and let

$$App = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(S_i, S_j).$$

Let $d^*(S_i, S_j)$ denote the distance between $S_i$ and $S_j$ induced by an optimal multiple sequence alignment, and let

$$Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(S_i, S_j).$$

We shall show in the following that $App \leq 2\,Opt$.

Before the formal proof, let us note that $d(S_1, S_i) = D(S_1, S_i)$. This can be easily seen by examining the above example:

    S1 = ATGCTC
    S2 = A-GAGC

Thus $D(S_1, S_2) = 3$. At the end of the algorithm, $S_1$ and $S_2$ are aligned as follows:

    S1 = AT-GC-T-C
    S2 = A--GA-G-C

so $d(S_1, S_2) = 3 = D(S_1, S_2)$. This distance is not changed because $\delta(-, -) = 0$.

The proof of $App \leq 2\,Opt$ is as follows:

$$App = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(S_i, S_j) \leq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} \big( d(S_i, S_1) + d(S_1, S_j) \big) \quad \text{(triangle inequality)}$$

$$= 2(k-1) \sum_{i=2}^{k} d(S_1, S_i) \quad \text{(since } d(S_1, S_i) = d(S_i, S_1)\text{)}.$$

Since $d(S_1, S_i) = D(S_1, S_i)$ for all $i$, we have

$$App \leq 2(k-1) \sum_{i=2}^{k} D(S_1, S_i). \qquad (4.3)$$

Let us now find $Opt$:

$$Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(S_i, S_j).$$

First, let us note that $D(S_i, S_j)$ is the distance induced by an optimal two sequence alignment. Thus $D(S_i, S_j) \leq d^*(S_i, S_j)$, and

$$Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(S_i, S_j) \geq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} D(S_i, S_j).$$

But note that $S_1$ is the center. Thus

$$Opt \geq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} D(S_i, S_j) \geq \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) = k \sum_{j=2}^{k} D(S_1, S_j). \qquad (4.4)$$

Considering (4.3) and (4.4), we have

$$App \leq \frac{2(k-1)}{k}\, Opt \leq 2\,Opt.$$

The error rate induced by this approximation algorithm is

$$\frac{App - Opt}{Opt} \leq 1.$$

4.6 The Minimal Spanning Tree Preservation Approach for Multiple Sequence Alignment

Given a set of sequences, between every pair of them, there is an optimal alignment which induces a minimum distance.
After a multiple sequence alignment, many of these distances will no longer be optimal. Of course, we would like to preserve as many of them as possible. Since we cannot preserve all of the inter-sequence distances, we must select a set of them to preserve. Gusfield's approach selects one sequence, and the alignments between this sequence and all other sequences are optimal. Thus his approach preserves $k - 1$ distances, where $k$ is the number of input sequences. The particular sequence which his method selects is the one which is, in some sense, most similar to the other sequences.

Let $D$ denote the distance matrix based upon optimal alignments between every pair of the input sequences. After a multiple sequence alignment $M$ is performed, the distance between a pair of sequences may be changed. Let $D_m$ denote the new distance matrix based upon the distances between sequences after $M$ is performed. Let $MST(D)$ be a minimal spanning tree constructed based upon the distance matrix $D$ and let $MST(D_m)$ denote a minimal spanning tree based upon $D_m$. Our minimal spanning tree approach for the multiple sequence alignment problem stipulates that all distances on $MST(D)$ are exactly preserved after the alignment is made. We shall prove later that every $MST(D)$ is also an $MST(D_m)$.

Our minimal spanning tree preservation algorithm is as follows:

Algorithm 4.2 A Minimal Spanning Tree Preservation Approach for the Multiple Sequence Alignment Problem.
Input: $k$ sequences $S_1, S_2, \ldots, S_k$.
Output: A multiple sequence alignment $M$ such that $MST(D) = MST(D_m)$, where each entry of $D$ is the distance induced by an optimal alignment between two sequences and each entry of $D_m$ is the distance induced by $M$.
Step 1: Compute the $D(S_i, S_j)$'s by applying the dynamic programming algorithm.
Step 2: Find the minimal spanning tree of $D$, i.e., $MST(D)$.
Step 3: For every edge, say $e_i$, on $MST(D)$:
    Let the sequences connected by $e_i$ be $S_{i1}$ and $S_{i2}$.
    Find an optimal alignment between $S_{i1}$ and $S_{i2}$.
    Add spaces to the already aligned sequences $\{S_{j1}, S_{j2} \mid 1 \leq j < i\}$ if necessary.
Step 4: Output the final alignment.

We illustrate our idea in the following example.

Example 1: Consider four sequences as follows.

    S1 = ATGCTC
    S2 = ATGAGC
    S3 = TTCTG
    S4 = ATGCATGC

Step 1 finds the pairwise distances optimally by the dynamic programming algorithm.

    S1 = ATGCTC
    S2 = ATGAGC          D(S1, S2) = 2

    S1 = ATGCTC
    S3 = TT-CTG          D(S1, S3) = 3

    S1 = ATGC-T-C
    S4 = ATGCATGC        D(S1, S4) = 2

    S2 = ATGAGC
    S3 = TTCTG-          D(S2, S3) = 4

    S2 = ATG-A-GC
    S4 = ATGCATGC        D(S2, S4) = 2

    S3 = -TTC-TG-
    S4 = ATGCATGC        D(S3, S4) = 4

Therefore the distance matrix $D$ is as shown in Table 4.1.

          S1   S2   S3   S4
    S1          2    3    2
    S2               4    2
    S3                    4
    S4

Table 4.1: The Distance Matrix D of Example 1

A minimal spanning tree $MST(D)$ is determined in Step 2, as shown in Figure 4.4.

Figure 4.4: An MST(D)

The edges on $MST(D)$ are $e(S_1, S_2)$, $e(S_2, S_4)$ and $e(S_1, S_3)$. We construct an optimal alignment for each edge in Step 3.

For $e(S_1, S_2)$:

    S1 = ATGCTC
    S2 = ATGAGC

For $e(S_2, S_4)$:

    (S1 = ATG-C-TC)
     S2 = ATG-A-GC
     S4 = ATGCATGC

For $e(S_1, S_3)$:

     S1 = ATG-C-TC
    (S2 = ATG-A-GC)
     S3 = TT--C-TG
    (S4 = ATGCATGC)

Note that whenever blanks are added to $S_1$ in aligning it with $S_3$, the same blanks have to be added to $S_2$ and $S_4$. Such blanks do not affect the previous optimal alignments for the pairs $S_1$ with $S_2$ and $S_2$ with $S_4$. The final alignment result is

    S1 = ATG-C-TC
    S2 = ATG-A-GC
    S3 = TT--C-TG
    S4 = ATGCATGC

with $D_m$ as shown in Table 4.2.

          S1   S2   S3   S4
    S1          2    3    4
    S2               5    2
    S3                    7
    S4

Table 4.2: The Distance Matrix Dm Produced by Algorithm 4.2 for Example 1.

Figure 4.5 shows a minimal spanning tree of $D_m$, which is exactly the same as the $MST(D)$ shown in Figure 4.4.

Figure 4.5: An MST(Dm).

As described previously, due to the optimality of $D(S_i, S_j)$ in Step 1, we have $D(S_i, S_j) \leq D_m(S_i, S_j)$ for $1 \leq i, j \leq k$.
Furthermore, if $S_i$ and $S_j$ are connected on $MST(D)$, Algorithm 4.2 exactly preserves the distance between $S_i$ and $S_j$ found before the alignment, because $S_i$ and $S_j$ are optimally aligned and adding blanks does not change the distance. Thus $D(S_i, S_j) = D_m(S_i, S_j)$ if $S_i$ and $S_j$ are connected by an edge in $MST(D)$, and $D(S_i, S_j) \leq D_m(S_i, S_j)$ if $S_i$ and $S_j$ are not connected by an edge in $MST(D)$. Thus we have Theorem 4.1. Without losing generality, let us assume that the distances of $D$ are all distinct.

Theorem 4.1 $MST(D)$ is equal to $MST(D_m)$.

Proof. According to Kruskal's algorithm, the distances on $MST(D)$ form the smallest set of distances that can be chosen without causing cycles. Since our algorithm preserves every distance between $S_i$ and $S_j$ if $S_i$ and $S_j$ are connected by an edge on $MST(D)$, and $D_m(S_i, S_j) \geq D(S_i, S_j)$ if $S_i$ and $S_j$ are not connected by an edge in $MST(D)$, the application of Kruskal's algorithm on $D_m$ will produce exactly the same minimal spanning tree. Thus $MST(D) = MST(D_m)$.

In fact, the order of the distances of the edges on $MST(D)$ is also preserved on $MST(D_m)$ by Algorithm 4.2. We have Corollary 4.1 as follows.

Corollary 4.1 Let $e(a, b)$ and $e(c, d)$ be two edges on $MST(D)$. If $D(a, b) \leq D(c, d)$, then $D_m(a, b) \leq D_m(c, d)$.

Proof. By Theorem 4.1, $MST(D) = MST(D_m)$ when Algorithm 4.2 is used. That is, $D_m(a, b) = D(a, b) \leq D_m(c, d) = D(c, d)$.

4.7 The Edit Distance Concept

Sequence alignment may be viewed as a method to measure the similarity of two sequences. After an alignment, one can compute the Hamming distance between the aligned sequences. The smaller the Hamming distance is, the more similar the two sequences are to each other. In this section, we shall introduce a concept, called the edit distance, which is also used quite often to measure the similarity between two sequences.

Let us consider two sequences $A = a_1 a_2 \cdots a_m$ and $B = b_1 b_2 \cdots b_n$.
We may transform A to B by the following three edit operations: deletion of a character from A, insertion of a character into A, and substitution of a character in A with another character. For example, let A = GTAAHTY and B = TAHHYC. A can be transformed to B by the following:

(1) Deleting the first character G of A. Sequence A becomes A = TAAHTY.
(2) Substituting the third character of A, namely A, by H. Sequence A becomes A = TAHHTY.
(3) Deleting the fifth character of A, namely T, from A. Sequence A becomes A = TAHHY.
(4) Inserting C after the last character of A. Sequence A becomes A = TAHHYC, which is identical to B.

We can associate a cost with each operation. The edit distance is the minimum cost associated with the edit operations needed to transform sequence A to sequence B. If the cost is one for each operation, the edit distance becomes the minimum number of edit operations needed to transform A to B. In the above example, if the cost of each operation is one, the edit distance between A and B is 4, as at least four edit operations are needed.

The edit distance can be found by the dynamic programming approach. A recursive formula similar to that used for finding the longest common subsequence or an optimal alignment between two sequences can be easily formulated. Let $C(\mathrm{i})$, $C(\mathrm{d})$ and $C(\mathrm{s})$ denote the costs of insertion, deletion and substitution respectively. Let $A(i, j)$ denote the edit distance between $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$. Then $A(i, j)$ can be expressed as follows:

$$A(0, 0) = 0, \qquad A(i, 0) = i\,C(\mathrm{d}), \qquad A(0, j) = j\,C(\mathrm{i}),$$

$$A(i, j) = \begin{cases} A(i-1, j-1) & \text{if } a_i = b_j, \\ \min\{A(i-1, j) + C(\mathrm{d}),\ A(i-1, j-1) + C(\mathrm{s}),\ A(i, j-1) + C(\mathrm{i})\} & \text{otherwise.} \end{cases}$$

Actually, it is easy to see that the edit distance finding problem is equivalent to the optimal alignment problem. We do not intend to present a formal proof here, as the equivalence can be easily seen from the similarity of the respective recursive formulas. Instead, we shall use the example presented above to illustrate our point.
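The recursive formula for $A(i, j)$ above can be sketched as a short table-filling routine. This is a minimal illustration: the function name `edit_distance` and the parameter names `cost_ins`, `cost_del` and `cost_sub` (standing for the insertion, deletion and substitution costs) are chosen here for the example, and all three costs default to one.

```python
def edit_distance(a, b, cost_ins=1, cost_del=1, cost_sub=1):
    """Minimum total cost of edit operations that transform a into b.

    Hypothetical helper; follows the recursive formula for A(i, j).
    """
    m, n = len(a), len(b)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * cost_del      # delete a_1..a_i
    for j in range(1, n + 1):
        A[0][j] = j * cost_ins      # insert b_1..b_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                A[i][j] = A[i - 1][j - 1]                    # no operation
            else:
                A[i][j] = min(A[i - 1][j] + cost_del,        # delete a_i
                              A[i - 1][j - 1] + cost_sub,    # substitute
                              A[i][j - 1] + cost_ins)        # insert b_j
    return A[m][n]


print(edit_distance("GTAAHTY", "TAHHYC"))   # 4
```

With unit costs, the result for the two example sequences agrees with the four edit operations listed earlier in this section.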
Consider A = GTAAHTY and B = TAHHYC again. An optimal alignment would produce the following:

    A = GTAAHTY-
    B = -TAHH-YC

An examination of the above alignment shows the equivalence between edit operations and alignment operations as follows:

(1) $(a_i, b_j)$ with $a_i \neq b_j$ in the alignment finding is equivalent to the substitution operation in the edit distance finding. We substitute $a_i$ by $b_j$ in this case. (No operation is needed when $a_i = b_j$.)
(2) $(a_i, -)$ in the alignment finding is equivalent to deleting $a_i$ from A in the edit distance finding.
(3) $(-, b_j)$ in the alignment finding is equivalent to inserting $b_j$ into A in the edit distance finding.

The reader can use the above rules and the optimal alignment found above to produce the four edit operations.

4.8 The Protein Structure Alignment Problem

In the previous sections, proteins were considered as sequences of characters. Of course, a protein is not only a one-dimensional sequence; it has a 3-dimensional structure. Let us consider the sequences of proteins 1MBC and 2GDM from PDB. They are displayed below:

1MBC:
    VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK
    TEAEMKASED LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP
    IKYLEFISEA IIHVLHSRHP GDFGADAQGA MNKALELFRK DIAAKYKELG
    YQG

2GDM:
    GALTESQAAL VKSSWEEFNA NIPKHTHRFF ILVLEIAPAA KDLFSFLKGT
    SEVPQNNPEL QAHAGKVFKL VYEAAIQLEV TGVVVTDATL KNLGSVHVSK
    GVADAHFPVV KEAILKTIKE VVGAKWSEEL NSAWTIAYDE LAIVIKKEMD
    DAA

These two sequences are apparently far from being alike. Yet they are quite alike in their 3-dimensional structures, as shown in Fig. 4.6.

Fig. 4.6: 3-dimensional structures of proteins (a) 1MBC, (b) 2GDM

Since proteins have 3-dimensional structures, it is meaningful to align two proteins not through their sequences, but rather through their structures. In the following, we shall first briefly describe the basic structure of a protein.

A protein may be viewed as a sequence of amino acids. Let us therefore first introduce the structure of amino acids.
An amino acid consists of an R-group, an amino group and a carboxyl group, as illustrated in Fig. 4.7. Fig. 4.7 The structure of an amino acid An amino acid is divided into two parts: the R group and the backbone. 4--35 All of the amino acids have the same backbone. It is the R-group that makes the difference between amino acids. The backbone of an amino acid consists of four units of hydrogen atoms, one unit of nitrogen atom, two units of carbon atoms and two units of oxygen atoms. The Ca atom is still a carbon atom and is often called the carbon. The carbon is in the center of the backbone. In general, we say that the back-bone of an amino acid is N-Ca-C and one unit of oxygen atom. We usually ignore three units of hydrogen atoms and one unit of oxygen atom because the their structure will be changed when two amino acids are connected together. On the other hand, the atom order of N-Ca-C of the backbone will not change when two amino acids are connected. We may therefore say that the structure of an amino acid is determined by the structure of N-Ca-C. There are 20 different R groups as there are 20 different amino acids. shows R groups of Alamine and Valine. Fig. 4.8 (a) Fig. 4.8 (b) Two R Groups (a) Alamine (b)Valine We now describe how two amino acids are connected in a protein. This is illustrated in Fig. 4.9. When two amino acids are connected, two units of hydrogen atoms of the 4--36 second amino acid and one unit of oxygen atom of the first amino acid are combined to form a water molecule and this water molecule is released. The carbon atom of the first amino acid is connected to the nitrogen atom of the second amino acid. Fig. 4.9 The connection of two amino acids When two amino acids are connected, a bond between the carbon atom of the first amino acid and the nitrogen atom of the second amino acid is formed and this bond is called the peptide bond. As shown in Fig. 
4.10, the two Cα atoms, the carbon atom, the oxygen atom, the nitrogen atom and the hydrogen atom bonded to the nitrogen form the peptide plane. This plane consists of six atoms, none of which can rotate relative to the others. Yet the plane itself can be rotated. In other words, the orientations of the peptide planes determine the 3-dimensional structure of a protein.

Fig. 4.10 The peptide bond and the peptide plane

The amino acids in a protein turn in the 3-dimensional space. Fig. 4.11 shows some typical examples.

Fig. 4.11 Some examples of protein structures

There are two basic 3-dimensional structures of a protein, namely the alpha-helix and the beta-sheet. We shall not elaborate on these structures and only give illustrations. Fig. 4.12 shows a typical alpha-helix and Fig. 4.13 shows a typical beta-sheet. Inside an alpha-helix, hydrogen atoms of the backbone N-H groups form hydrogen bonds with oxygen atoms of backbone C=O groups elsewhere in the helix, as shown in Fig. 4.12. The alpha-helix is stable because of these bonds. There are also such bonds in a beta-sheet, as illustrated in Fig. 4.13. The difference between alpha-helices and beta-sheets lies in the direction of the hydrogen bonds with respect to the backbone. In a beta-sheet, the hydrogen bonds are perpendicular to the backbones, whereas in an alpha-helix, the bonds run along the axis of the helix.

Fig. 4.12 An alpha-helix

Fig. 4.13 A beta-sheet

Since, as explained above, the structure of a protein is determined by the directions of the peptide planes, we may simply consider the 3-dimensional coordinates of all of the N atoms and all of the C atoms in the backbone. Table 4.3 gives the coordinates for a segment of the protein 1MBC, which is found in the sperm whale.
Amino acid (position)   Atom   X-axis    Y-axis    Z-axis
VAL (1)                 C      -2.562    14.402    15.817
VAL (1)                 N      -4.094    14.896    13.982
LEU (2)                 C      -1.183    13.755    18.42
LEU (2)                 N      -1.328    14.772    16.218
SER (3)                 C      -0.208    12.829    21.09
SER (3)                 N      -1.114    12.521    18.854
GLU (4)                 C       1.495    12.367    23.459
GLU (4)                 N      -0.513    12.922    22.391
GLY (5)                 C       3.076    10.146    21.957
GLY (5)                 N       1.157    11.102    23.292
GLU (6)                 C       4.352    11.966    19.722
GLU (6)                 N       2.656    10.455    20.723
TRP (7)                 C       5.812    13.847    21.59
TRP (7)                 N       3.806    13.045    20.289
GLN (8)                 C       7.781    11.997    22.792
GLN (8)                 N       5.528    13.045    22.591
LEU (9)                 C       9.152    11.472    20.222
LEU (9)                 N       7.405    11.102    21.857
VAL (10)                C      10.019    14.279    19.655
VAL (10)                N       8.462    12.367    19.488

Table 4.3 The 3-dimensional coordinates of a segment of protein 1MBC

Since the structure of a protein is defined by a sequence of 3-dimensional points, we may also use vectors to represent the structure. For example, suppose the first N atom is located at (26, 10, 4) and the first C atom is located at (27, 9, 5). These two consecutive atoms define a vector, which is (27-26, 9-10, 5-4) = (1, -1, 1). For every pair of N atoms and C atoms, we calculate a vector this way. We may therefore say that the structure of a protein is characterized by a sequence of vectors. That is, suppose that a protein sequence is $a_1 a_2 \cdots a_n$. Then its structure is $A_1 A_2 \cdots A_n$ where $A_i$, $1 \le i \le n$, is the vector defined by the 3-dimensional coordinates of the N atom and the C atom of $a_i$.

Given the structures of two proteins, can we somehow determine whether they are similar to each other or not? Let these proteins be characterized by $A_1 A_2 \cdots A_m$ and $B_1 B_2 \cdots B_n$ where all $A_i$'s and $B_j$'s are vectors. Then, we can perform an alignment between $A_1 A_2 \cdots A_m$ and $B_1 B_2 \cdots B_n$. When we consider $A_i$ and $B_j$, we match them in the alignment if the angle between them is rather small. Let $\theta(i, j)$ denote the angle between vectors $A_i$ and $B_j$. Suppose we pair $A_i$ and $B_j$ if and only if $\theta(i, j) < \pi/4$.
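The vector representation and the angle test can be illustrated with a small Python sketch. It is not part of the original method description: the function names `bond_vector`, `angle_between` and `can_pair` are hypothetical, the vector direction (C atom minus N atom) follows the worked example in the text, and the threshold of π/4 is the one used in the scoring formula of this section.

```python
import math

def bond_vector(n_atom, c_atom):
    """Vector from the N atom to the C atom (C minus N, as in the text's example)."""
    return tuple(c - n for n, c in zip(n_atom, c_atom))

def angle_between(u, v):
    """Angle in radians between two non-zero 3-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

def can_pair(u, v, threshold=math.pi / 4):
    """True if the angle between u and v is below the threshold (assumed pi/4)."""
    return angle_between(u, v) < threshold

# Worked example from the text: N at (26, 10, 4), C at (27, 9, 5).
example = bond_vector((26, 10, 4), (27, 9, 5))

# Vectors for VAL (1) and LEU (2), using the coordinates of Table 4.3.
val1 = bond_vector((-4.094, 14.896, 13.982), (-2.562, 14.402, 15.817))
leu2 = bond_vector((-1.328, 14.772, 16.218), (-1.183, 13.755, 18.42))
angle_val1_leu2 = angle_between(val1, leu2)
```

Here `example` reproduces the vector (1, -1, 1) computed in the text, and `can_pair` is the test that decides whether two vectors may be matched in the alignment.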
Then we can use dynamic programming to find an alignment between the two vector sequences $A_1 A_2 \cdots A_m$ and $B_1 B_2 \cdots B_n$ as follows. Let $C(i, j)$ denote the optimal score of an alignment between $A_1 A_2 \cdots A_i$ and $B_1 B_2 \cdots B_j$, and let $\theta(i, j)$ be the angle between $A_i$ and $B_j$. Then

$$C(i, 0) = 0, \qquad C(0, j) = 0,$$

$$C(i, j) = \max \left\{ \begin{array}{l} C(i-1, j-1) + \cos(\theta(i, j)) - \cos(\pi/4) \\ C(i-1, j) \\ C(i, j-1) \end{array} \right.$$

Note that in the above formula, we use $\cos(\theta(i, j)) - \cos(\pi/4)$. This value will be positive if and only if $\theta(i, j) < \pi/4$.

In the following, we show a part of the protein structure alignment result of proteins 1MBC and 2GDM.

1MBC  KTE**A*EM* KAS**EDL** KKHG**VTVL *TAL**GAIL
2GDM  TSEVPQNNPE LQAHAG*KVF K**LVYE**A AI*QLEVTG*

1MBC  K*KKGHHEAE LKPL*AQS** HATKHK**IP
2GDM  *VVVT**D** A**TLK*NLG S***VHVSK*

*: Gap
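The recurrence for C(i, j) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the alignment above: the names `cos_angle`, `align_structures` and `traceback` are hypothetical, the threshold is taken to be π/4 as in the formula, and ties in the traceback are broken in favor of pairing the two vectors.

```python
import math

def cos_angle(u, v):
    """Cosine of the angle between two non-zero 3-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align_structures(A, B, threshold=math.pi / 4):
    """Fill the table C, with C[i][0] = C[0][j] = 0 as in the recurrence."""
    m, n = len(A), len(B)
    C = [[0.0] * (n + 1) for _ in range(m + 1)]
    # gain[i][j] is cos(theta(i, j)) - cos(threshold) for A_i and B_j (1-based).
    gain = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            gain[i][j] = cos_angle(A[i - 1], B[j - 1]) - math.cos(threshold)
            C[i][j] = max(C[i - 1][j - 1] + gain[i][j],  # pair A_i with B_j
                          C[i - 1][j],                   # A_i aligned with a gap
                          C[i][j - 1])                   # B_j aligned with a gap
    return C, gain

def traceback(C, gain):
    """Recover one optimal alignment as (i, j) pairs; None marks a gap."""
    i, j = len(C) - 1, len(C[0]) - 1
    pairs = []
    while i > 0 and j > 0:
        if C[i][j] == C[i - 1][j - 1] + gain[i][j]:
            pairs.append((i, j)); i -= 1; j -= 1
        elif C[i][j] == C[i - 1][j]:
            pairs.append((i, None)); i -= 1
        else:
            pairs.append((None, j)); j -= 1
    while i > 0:
        pairs.append((i, None)); i -= 1
    while j > 0:
        pairs.append((None, j)); j -= 1
    return pairs[::-1]

# Toy input (made up for illustration): A_1 and A_3 point the same way as
# B_1 and B_2 respectively, while A_2 is perpendicular to both.
A = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
B = [(1, 0, 0), (0, 0, 1)]
C, gain = align_structures(A, B)
score = C[len(A)][len(B)]
pairs = traceback(C, gain)
```

On this toy input, A_1 is paired with B_1, A_2 is aligned with a gap and A_3 is paired with B_2. Because a gap costs nothing in the recurrence, the score is simply the sum of the gains of the paired vectors.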