CHAPTER 4

The Sequence Alignment Problem
We often need to compare two sequences. For instance, we may be given two DNA sequences
which are not exactly the same, yet we would like to line them up in some way such that
information can be extracted from them. Consider, for example, the following two
sequences:
ATTCATTACAACCGCTATG
ACCCATCAACAACCGCTATG
At first glance, these two sequences do not look similar to each other. If we perform a
sequence alignment operation on these sequences, the result will be the following:
ATTCATTA-CAACCGCTATG
ACCCATCAACAACCGCTATG
Then we can see that these sequences are indeed quite similar to each other. In a
certain sense, sequence alignment lets us measure the similarity of two sequences.
There are many reasons for us to perform the sequence alignment operation. Suppose
that we are given a DNA sequence X and we also have a database containing a set of DNA
sequences. It is quite desirable for us to search through the database to find the sequence
which is most similar to X. To compare X with any sequence, say Y, in the
database, we have to perform the sequence alignment operation on X and Y. Another
application is data compression. Consider the above two sequences. Instead of
recording both sequences, we merely have to record one of them and then, for the other
sequence, store only the differences. In this chapter, we shall introduce several
methods to perform the sequence alignment operation.
4.1 The Sequence Alignment Problem
The sequence alignment problem can be illustrated by considering the two sequences
$S_1$ = GAACTG and $S_2$ = GAGCTG. Two alignments are shown below:
GAACTG---
GA---GCTG

GAACTG
GAGCTG
It is obvious that the second alignment is much better than the first one.
We shall use a scoring rule to measure the goodness of an alignment. Throughout
this chapter, we shall use the following rule:
1. If $a_i$ is aligned with $b_j$ and $a_i = b_j$, the score is +2.
2. If $a_i$ or $b_j$ is aligned with a blank, the score is -1.
3. If $a_i$ is aligned with $b_j$ and $a_i \neq b_j$, the score is -1.
For the following alignment, the score is $2 \times 2 + 5 \times (-1) = -1$.
ab-bcad
aecb---
Our sequence alignment problem is as follows: Given two sequences, find an
alignment which has the highest score.
Again, this problem can be solved by the dynamic programming approach through
the following reasoning. Let $A(i, j)$ denote the score of an optimal alignment of
$a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
1. If $a_i$ is equal to $b_j$, then $a_i$ should be aligned with $b_j$ and the score should be
increased by 2: $A(i, j) = A(i-1, j-1) + 2$. To find $A(i-1, j-1)$ means that
we should find an optimal alignment of $a_1 a_2 \ldots a_{i-1}$ and $b_1 b_2 \ldots b_{j-1}$.
2. If $a_i$ is not equal to $b_j$, then there are three possibilities:
(a) $a_i$ is aligned with $b_j$ and the score is decreased by 1: $A(i, j) = A(i-1, j-1) - 1$.
(b) $a_i$ is aligned with - and the score is decreased by 1: $A(i, j) = A(i-1, j) - 1$.
(c) $b_j$ is aligned with - and the score is decreased by 1: $A(i, j) = A(i, j-1) - 1$.
Among the above three alignments, we simply pick the alignment which gives the
highest $A(i, j)$.
In summary, to find the optimal alignment, we may use the following formula:
$$A(0, 0) = 0, \qquad A(i, 0) = -i, \qquad A(0, j) = -j,$$
$$A(i, j) = \begin{cases} \max\{A(i-1, j-1) - 1,\; A(i-1, j) - 1,\; A(i, j-1) - 1\} & \text{if } a_i \neq b_j \\ A(i-1, j-1) + 2 & \text{if } a_i = b_j \end{cases} \qquad (4.1)$$
Note that $A(0, 0)$ means that - is aligned with -, $A(i, 0)$ means that
$a_1 a_2 \ldots a_i$ is aligned with blanks, and $A(0, j)$ means that blanks are aligned
with $b_1 b_2 \ldots b_j$.
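Equation 4.1 translates almost line by line into a table-filling routine. The Python sketch below is one possible transcription; the function name `global_alignment_score` and the list-of-lists table are assumptions made for illustration.

```python
def global_alignment_score(s1: str, s2: str) -> int:
    """Fill the table A(i, j) of Equation 4.1 and return A(m, n)."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = -i          # a_1 ... a_i aligned with blanks
    for j in range(1, n + 1):
        A[0][j] = -j          # blanks aligned with b_1 ... b_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 2
            else:
                A[i][j] = max(A[i - 1][j - 1] - 1,   # a_i aligned with b_j
                              A[i - 1][j] - 1,       # a_i aligned with -
                              A[i][j - 1] - 1)       # b_j aligned with -
    return A[m][n]


print(global_alignment_score("GAACTG", "GAGCTG"))
```

The table has $(m+1)(n+1)$ entries and each entry is computed in constant time, so the running time is proportional to the product of the two sequence lengths.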
We now give an example. Let $S_1$ = abbcad and $S_2$ = eacb. The computation of
$A(i, j)$ is illustrated in Figure 4.1.
Figure 4.1: The Computation of the Optimal Alignment of abbcad and eacb.
The tracing back can be done as follows:
(1) If $(i, j)$ points to $(i-1, j-1)$, $a_i$ is aligned with $b_j$.
(2) If $(i, j)$ points to $(i-1, j)$, $a_i$ is aligned with -.
(3) If $(i, j)$ points to $(i, j-1)$, $b_j$ is aligned with -.
By tracing back the table, we can conclude that three optimal alignments between
abbcad and eacb are as follows:
-abbcad
eacb---

-abbcad
ea--cb-

-abbcad
ea--c-b
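The three tracing-back rules can be sketched in Python as follows. This is a minimal illustration, not the only way to organize the trace: instead of storing pointers, it re-derives each pointer by checking which neighbouring entry produced $A(i, j)$, and it returns only one optimal alignment (ties are broken in the order diagonal, up, left). The function name `global_alignment` is hypothetical.

```python
def global_alignment(s1: str, s2: str):
    """Return (score, row1, row2) for one optimal alignment under the +2 / -1 rule."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = -i
    for j in range(1, n + 1):
        A[0][j] = -j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 2 if s1[i - 1] == s2[j - 1] else -1
            A[i][j] = max(A[i - 1][j - 1] + s, A[i - 1][j] - 1, A[i][j - 1] - 1)

    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = 2 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else -1
        if i > 0 and j > 0 and A[i][j] == A[i - 1][j - 1] + s:
            row1.append(s1[i - 1]); row2.append(s2[j - 1])   # rule (1)
            i, j = i - 1, j - 1
        elif i > 0 and A[i][j] == A[i - 1][j] - 1:
            row1.append(s1[i - 1]); row2.append("-")         # rule (2)
            i -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1])         # rule (3)
            j -= 1
    return A[m][n], "".join(reversed(row1)), "".join(reversed(row2))


score, top, bottom = global_alignment("abbcad", "eacb")
print(score)
print(top)
print(bottom)
```

A different tie-breaking order would reach a different one of the optimal alignments; enumerating all of them requires following every pointer that attains the maximum.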
There is another way to express the same idea as Equation 4.1. Let $\sigma(x, y)$ be
defined as follows:
$$\sigma(x, y) = 2 \text{ if } x = y, \qquad \sigma(x, y) = -1 \text{ if } x \neq y.$$
Then Equation 4.1 becomes
$$A(i, j) = \max \begin{cases} A(i-1, j-1) + \sigma(a_i, b_j) \\ A(i-1, j) + \sigma(a_i, -) \\ A(i, j-1) + \sigma(-, b_j) \end{cases} \qquad (4.2)$$
Perhaps it is worthwhile to examine one more example in detail so that the
reader can get a better feeling for this alignment algorithm. Consider
$S_1$ = eab and $S_2$ = acb.
Let us first examine the meaning of $A(0,0)$, $A(1,0)$ and $A(0,1)$.
$A(0,0)$ corresponds to
-
-
$A(1,0)$ corresponds to
e
-
$A(0,1)$ corresponds to
-
a
This is why $A(0,0) = 0$, $A(1,0) = -1$ and $A(0,1) = -1$.
The reader should understand that $A(2,0)$ corresponds to
ea
--
and $A(0,3)$ corresponds to
---
acb
Having $A(0,0)$, $A(1,0)$ and $A(0,1)$, we can now determine $A(1,1)$.
Note that $a_1$ = e and $b_1$ = a. Thus $\sigma(a_1, b_1) = \sigma(e, a) = -1$.
Since $a_1 \neq b_1$, we have three choices, as illustrated below:
Choice 1:
e
a
which corresponds to $A(0,0) + \sigma(a_1, b_1) = 0 - 1 = -1$.
Choice 2:
-e
a-
which corresponds to $A(0,1) + \sigma(a_1, -) = -1 - 1 = -2$.
Choice 3:
e-
-a
which corresponds to $A(1,0) + \sigma(-, b_1) = -1 - 1 = -2$.
$$A(1,1) = \max \begin{cases} A(0,0) + \sigma(a_1, b_1) = 0 - 1 = -1 \\ A(0,1) + \sigma(a_1, -) = -1 - 1 = -2 \\ A(1,0) + \sigma(-, b_1) = -1 - 1 = -2 \end{cases} = A(0,0) + \sigma(a_1, b_1) = -1$$
This means that the optimum alignment of $a_1$ with $b_1$ is as follows:
e
a
We calculate $A(1,2)$.
Note that $a_1$ = e and $b_2$ = c. Thus $\sigma(a_1, b_2) = \sigma(e, c) = -1$.
Since $a_1 \neq b_2$, we have three choices:
Choice 1:
-e
ac
which corresponds to $A(0,1) + \sigma(a_1, b_2) = -1 - 1 = -2$.
Choice 2:
--e
ac-
which corresponds to $A(0,2) + \sigma(a_1, -) = -2 - 1 = -3$.
Choice 3:
e-
ac
which corresponds to $A(1,1) + \sigma(-, b_2) = -1 - 1 = -2$.
We may choose either Choice 1 or Choice 3. Assume that we choose Choice 1.
Then $A(1,2) = A(0,1) + \sigma(a_1, b_2) = -1 - 1 = -2$. An optimal alignment of $a_1$ with $b_1 b_2$
is
-e
ac
We calculate $A(2,1)$.
Here $a_2$ = a, $b_1$ = a and $a_2 = b_1$. Thus $\sigma(a_2, b_1) = \sigma(a, a) = 2$.
We simply decide that
$$A(2,1) = A(1,0) + \sigma(a_2, b_1) = -1 + 2 = 1.$$
The optimal alignment of $a_1 a_2$ with $b_1$ is
ea
-a
We calculate $A(2,2)$.
Here $a_2$ = a, $b_2$ = c and $a_2 \neq b_2$. Thus $\sigma(a_2, b_2) = \sigma(a, c) = -1$.
Since $a_2 \neq b_2$, we have three choices:
Choice 1:
ea
ac
which corresponds to $A(1,1) + \sigma(a_2, b_2) = -1 - 1 = -2$.
Choice 2:
-ea
ac-
which corresponds to $A(1,2) - 1 = -2 - 1 = -3$.
Choice 3:
ea-
-ac
which corresponds to $A(2,1) - 1 = 1 - 1 = 0$.
We choose Choice 3: $A(2,2) = A(2,1) - 1 = 0$. The optimal alignment of $a_1 a_2$ with
$b_1 b_2$ is
ea-
-ac
Since $a_3 = b_3$, we shall have an optimal alignment of $a_1 a_2 a_3$ with $b_1 b_2 b_3$ as
ea-b
-acb
which corresponds to $A(2,2) + \sigma(a_3, b_3) = 0 + 2 = 2$.
It should be obvious that the longest common subsequence problem is similar to this
alignment problem. The only difference lies in the scoring function. For the longest
common subsequence problem, an exact match scores 1 and everything else scores 0. In
other words, there are only rewards and no penalties.
4.2 The Local Alignment Problem
The local alignment problem is defined as follows: We are given two sequences
$S_1$ and $S_2$. Find a subsequence $S_1'$ of $S_1$ and a subsequence $S_2'$ of $S_2$ such
that the score obtained by aligning $S_1'$ and $S_2'$ is the highest among all possible
subsequences of $S_1$ and $S_2$. Here a subsequence means a run of consecutive characters.
Consider the following sequences:
S1 = abbbcc
S 2 = adddcc
Using the global alignment method introduced in the previous section, we will obtain
the following
S1 = abbbcc
S2 = adddcc
The score is $3 \times 2 + 3 \times (-1) = 3$.
Suppose we let $S_1'$ = cc and $S_2'$ = cc. Then we obtain a local alignment with score
$2 + 2 = 4$, which is higher than that obtained by the global alignment method.
This local alignment problem can be solved by inserting one simple mechanism into the
global alignment method. We scan from the beginning. As soon as the score becomes negative, we
reset it to zero. For the above example, as soon as we scan to abbb of $S_1$ and addd of
$S_2$, the optimal alignment already has a negative score. We therefore reset the
alignment. That is, we forget about the previous alignment and start all over again.
In this way, we will obtain an optimal local alignment which is cc of $S_1$ against
cc of $S_2$. Since a local alignment may start anywhere and an empty alignment has score 0,
the boundary entries are also 0. The recurrence formula 4.2 in Section 4.1 now becomes as follows:
$$A(i, 0) = 0, \qquad A(0, j) = 0,$$
$$A(i, j) = \max \begin{cases} 0 \\ A(i-1, j-1) + \sigma(a_i, b_j) \\ A(i-1, j) + \sigma(a_i, -) \\ A(i, j-1) + \sigma(-, b_j) \end{cases}$$
The score of an optimal local alignment is the largest entry of the table.
Let us consider the following two strings
S1 = abbcdae
S 2 = afgfde
Figure 4.2 illustrates our computations.
Figure 4.2: The Computation of an Optimal Local Alignment.
As shown in Figure 4.2, we have found an optimal local alignment as shown below:
dae
d-e
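The reset-to-zero mechanism can be sketched in Python as below. Two points are assumptions of this sketch rather than statements of the text: the boundary entries are initialized to 0 (the usual choice for local alignment, so that an alignment may start at any position), and the answer is read off as the largest entry of the table. The function name `local_alignment_score` is hypothetical.

```python
def local_alignment_score(s1: str, s2: str) -> int:
    """Best local alignment score under the +2 / -1 rule with reset to zero."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]   # boundary entries stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 2 if s1[i - 1] == s2[j - 1] else -1
            A[i][j] = max(0,                       # forget a negative prefix
                          A[i - 1][j - 1] + s,
                          A[i - 1][j] - 1,
                          A[i][j - 1] - 1)
            best = max(best, A[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))   # cc against cc
print(local_alignment_score("abbcdae", "afgfde"))  # dae against d-e
```

To recover the aligned substrings themselves, one would trace back from the largest entry until an entry equal to 0 is reached.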
4.3 The Affine Gap Penalty
Consider the following two sequences
S1 = ACTTGATCC
S 2 = AGTTAGTAGTCC
An optimal alignment of the above pair of sequences, under the scoring rule of Section 4.1,
is as follows.
S1 = ACTT-G-A-TCC
S2 = AGTTAGTAGTCC
There are three gaps in the above alignment, where a gap is defined as a string of
consecutive spaces. Let us now consider the following alignment.
S1 = ACTT---GATCC
S2 = AGTTAGTAGTCC
In this alignment, there is only one gap.
We understand that a gap is caused by a mutational event which removed a sequence
of residues. We also understand that a single mutational event is more likely than
several events. Therefore one long gap is often preferable to several short gaps. This can
be achieved by imposing an affine gap penalty on the alignment.
An affine gap penalty is defined as $P_g + k P_e$ for a gap with $k$, $k \geq 1$, spaces where
$P_g > 0$ and $P_e > 0$. $P_g$ is related to the initiation of a gap and $P_e$ is related to the
length of the gap. When this gap penalty is used, we have to let
$\sigma(x, -) = \sigma(-, x) = 0$.
We shall use our previous scoring function. That is, $\sigma(x, y) = 2$ if $x = y$ and
$\sigma(x, y) = -1$ if $x \neq y$. Suppose we further let $P_g = 4$ and $P_e = 1$. Then, for the first
alignment, the score is
$$8 \times 2 - 1 - 3 \times (4 + 1 \times 1) = 16 - 1 - 15 = 0.$$
For the second alignment, the score is
$$6 \times 2 - 3 \times 1 - (4 + 3 \times 1) = 12 - 3 - 7 = 2.$$
That is, if we use this gap penalty function, the second alignment is better than the first
one.
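The two hand computations can be reproduced with a short Python sketch that scores a given alignment under the affine gap penalty. The function name `affine_alignment_score` and the parameter names `p_g` and `p_e` are illustrative; the defaults are the values $P_g = 4$ and $P_e = 1$ used above.

```python
def affine_alignment_score(row1: str, row2: str, p_g: int = 4, p_e: int = 1) -> int:
    """Score two aligned rows: +2 match, -1 mismatch, P_g + k*P_e per gap of k blanks."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    open_gap = None  # row (1 or 2) holding the gap that is currently being extended
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            row = 1 if x == "-" else 2
            if row != open_gap:
                score -= p_g      # a new gap is initiated
            score -= p_e          # every blank pays the extension penalty
            open_gap = row
        else:
            score += 2 if x == y else -1
            open_gap = None
    return score


print(affine_alignment_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))  # three gaps
print(affine_alignment_score("ACTT---GATCC", "AGTTAGTAGTCC"))  # one gap
```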
The problem of finding an alignment with an affine gap penalty can still be solved by
the dynamic programming approach. Let $A(i, j)$ denote the score of an optimal
alignment of $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
1. If $a_i$ is equal to $b_j$, then $a_i$ should be aligned with $b_j$ and the score
should be increased by 2: $A(i, j) = A(i-1, j-1) + 2$. That is, we should find
an optimal alignment of $a_1 a_2 \ldots a_{i-1}$ and $b_1 b_2 \ldots b_{j-1}$.
2. If $a_i$ is not equal to $b_j$, then there are five possibilities.
(a) $a_i$ is aligned with $b_j$ and the score is decreased by 1:
$A(i, j) = A(i-1, j-1) - 1$.
(b) $a_i$ is aligned with -, this - is in the midst of an existing gap and the
score is decreased by $P_e$: $A(i, j) = A(i-1, j) - P_e$.
(c) $a_i$ is aligned with -, this - initiates a new gap and the score is decreased
by $P_g + P_e$: $A(i, j) = A(i-1, j) - P_g - P_e$.
(d) $b_j$ is aligned with -, this - is in the midst of an existing gap and the
score is decreased by $P_e$: $A(i, j) = A(i, j-1) - P_e$.
(e) $b_j$ is aligned with -, this - initiates a new gap and the score is
decreased by $P_g + P_e$: $A(i, j) = A(i, j-1) - P_g - P_e$.
Among the above five possible choices, we simply choose the one which gives the
highest $A(i, j)$.
To simplify the computation, we may also define three functions as follows.
1. $A_1(i, j)$ is the score of an optimal alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$
under the condition that $a_i$ is aligned with $b_j$. (Note that $a_i$ is not
necessarily equal to $b_j$.)
2. $A_2(i, j)$ is the score of an optimal alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$
under the condition that $a_i$ is aligned with -.
3. $A_3(i, j)$ is the score of an optimal alignment of $a_1 a_2 \ldots a_i$ with $b_1 b_2 \ldots b_j$
under the condition that $b_j$ is aligned with -.
Then, clearly
$$A(i, j) = \max\{A_1(i, j), A_2(i, j), A_3(i, j)\}.$$
How do we compute $A_1(i, j)$, $A_2(i, j)$ and $A_3(i, j)$? First of all, the following is
obvious:
$$A_1(i, j) = A(i-1, j-1) + \sigma(a_i, b_j).$$
Now, consider $A_2(i, j)$. Note that by definition, we force $a_i$ to be aligned with -.
Here are two possibilities:
(1) This - is in the midst of an existing gap. Then
$A_2(i, j) = A_2(i-1, j) - P_e$.
(2) This - initiates a new gap. Then
$A_2(i, j) = A(i-1, j) - P_g - P_e$.
Thus,
$$A_2(i, j) = \max\{A_2(i-1, j) - P_e,\; A(i-1, j) - P_g - P_e\}.$$
Similarly,
$$A_3(i, j) = \max\{A_3(i, j-1) - P_e,\; A(i, j-1) - P_g - P_e\}.$$
In summary, we may use the following formula.
$$A(0,0) = A_2(0,0) = A_3(0,0) = 0.$$
$$A_2(i, 0) = A(i, 0) = -P_g - i P_e \quad \text{for } i > 0.$$
$$A_3(0, i) = A(0, i) = -P_g - i P_e \quad \text{for } i > 0.$$
$$A_2(0, i) = A_3(i, 0) = -\infty \quad \text{for } i > 0.$$
$$A(i, j) = \max\{A_1(i, j), A_2(i, j), A_3(i, j)\}.$$
$$A_1(i, j) = A(i-1, j-1) + \sigma(a_i, b_j).$$
$$A_2(i, j) = \max\{A_2(i-1, j) - P_e,\; A(i-1, j) - P_g - P_e\}.$$
$$A_3(i, j) = \max\{A_3(i, j-1) - P_e,\; A(i, j-1) - P_g - P_e\}.$$
In the following, we give the computation of the $A(i, j)$'s where the two sequences are
those given at the beginning of this section.
Figure 4.3: The Computation of an Optimal Alignment with Affine Gap Penalty.
The optimal alignment is as follows.
ACTT---GATCC
AGTTAGTAGTCC
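The recurrences for $A$, $A_2$ and $A_3$ summarized above can be sketched in Python as follows. The function name `affine_gap_score` and the parameter names are hypothetical, and the unreachable boundary entries are represented by negative infinity, which is an assumption of this sketch. Only the optimal score is returned; a trace back would need to remember which of the three tables each maximum came from.

```python
NEG_INF = float("-inf")


def affine_gap_score(s1: str, s2: str, p_g: int = 4, p_e: int = 1):
    """Optimal alignment score with gap penalty P_g + k*P_e (match +2, mismatch -1)."""
    m, n = len(s1), len(s2)
    A = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    A2 = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # a_i aligned with -
    A3 = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # b_j aligned with -
    A[0][0] = A2[0][0] = A3[0][0] = 0
    for i in range(1, m + 1):
        A[i][0] = A2[i][0] = -p_g - i * p_e
    for j in range(1, n + 1):
        A[0][j] = A3[0][j] = -p_g - j * p_e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 2 if s1[i - 1] == s2[j - 1] else -1
            a1 = A[i - 1][j - 1] + s
            A2[i][j] = max(A2[i - 1][j] - p_e,            # extend an existing gap
                           A[i - 1][j] - p_g - p_e)       # initiate a new gap
            A3[i][j] = max(A3[i][j - 1] - p_e,
                           A[i][j - 1] - p_g - p_e)
            A[i][j] = max(a1, A2[i][j], A3[i][j])
    return A[m][n]


print(affine_gap_score("ACTTGATCC", "AGTTAGTAGTCC"))
```

With $P_g = 4$ and $P_e = 1$ the result for the two sequences of this section agrees with the score of the one-gap alignment computed by hand.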
4.4 The Multiple Sequence Alignment Problem
In previous sections, we only studied the alignment between two sequences. It is natural
to extend this problem to the multiple sequence alignment problem in which more than
two sequences are involved. Consider the following case where three sequences are
involved.
S1 = ATTCGAT
S 2 = TTGAG
S 3 = ATGCT
A very good alignment of these three sequences is now shown as follows.
S1 = ATTCGAT
S 2 = -TT-GAG
S 3 = AT--GCT
It is noted that the alignment between every pair of sequences is quite good.
Let us assume that a gene is found. We may translate this gene into its corresponding
amino acid sequence. Once this is done, we can search through a relevant protein
database to find similar protein sequences in the hope that we can better understand the
function of this gene. Before searching this database, it is necessary to align the
proteins into an appropriate form through multiple matching.
The multiple alignment problem is similar to the two sequence alignment problem.
Instead of defining a fundamental scoring function $\sigma(x, y)$, we have to define a scoring
function involving more variables. Let us assume that there are, say, three sequences.
Instead of considering $a_i$ and $b_j$, we now have to consider the matching of $a_i$, $b_j$ and
$c_k$. That is, a $\sigma(x, y, z)$ has to be defined. Then we note that instead of finding
$A(i, j)$ as we did in the previous sections, we have to find $A(i, j, k)$. The problem is:
when we determine $A(i, j, k)$, we have to consider the following.
$A(i-1, j, k)$
$A(i, j-1, k)$
$A(i, j, k-1)$
$A(i-1, j-1, k)$
$A(i, j-1, k-1)$
$A(i-1, j, k-1)$
$A(i-1, j-1, k-1)$
The reader can easily imagine that the number of cases to be considered increases
tremendously as the number of sequences involved increases. In previous sections, we
saw that if there are two sequences and the lengths of the sequences are $n$, we need to
construct an $n \times n$ table to find an optimal alignment if the dynamic programming
approach is used. Thus it takes $O(n^2)$ steps to solve a two sequence alignment
problem. If there are three sequences, we can obtain an optimal alignment in $O(n^3)$
steps. Given $k$ input sequences, we need at least $O(n^k)$ steps if the dynamic
programming approach is used.
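To make the growth concrete, the entries that $A(i, j, k)$ depends on correspond to the non-zero vectors of 0s and 1s that are subtracted from the index, so for $k$ sequences each table entry looks at $2^k - 1$ earlier entries. The short Python sketch below enumerates them; the function name `predecessor_offsets` is illustrative.

```python
from itertools import product


def predecessor_offsets(k: int):
    """All non-zero 0/1 vectors of length k; each is subtracted from the current index."""
    return [d for d in product((0, 1), repeat=k) if any(d)]


for offsets in (predecessor_offsets(2), predecessor_offsets(3)):
    print(len(offsets), offsets)
```

For two sequences this gives the three familiar neighbours, and for three sequences it gives the seven entries listed above.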
Let there be $k$ sequences. Let a scoring function be defined between two
sequences. Assume that this scoring function is linear. Given $k$ input sequences, the
sum of pairs multiple sequence alignment problem is to find an alignment of these $k$
sequences which maximizes the sum of scores of all pairs of sequence alignments among
them. If $k$, the number of input sequences, is a variable, this problem has been proved to be
NP-complete. Thus there is not much hope of finding a polynomial-time algorithm to
solve the sum of pairs multiple sequence alignment problem, and approximation
algorithms are needed. In the following sections, we will introduce some approximation
algorithms to solve the multiple sequence alignment problem.
4.5 The Gusfield Approximation Algorithm for the Sum of
Pairs Multiple Sequence Alignment Problem
In this section, we shall introduce an approximation algorithm, proposed by Gusfield, to
solve one version of the multiple sequence alignment problem. First of all, we shall use
a different kind of scoring function. Consider the following two sequences:
S1 = GCCAT
S 2 = GAT
A possible alignment between these two sequences is
S1' = GCCAT
S2' = G--AT
For this alignment, there are three exact matches and two mismatches. The distance
induced by this alignment is 2. That is, we define $\sigma(x, y) = 0$ if $x = y$ and
$\sigma(x, y) = 1$ if $x \neq y$. For an alignment
$$S_1' = a_1', a_2', \ldots, a_n'$$
$$S_2' = b_1', b_2', \ldots, b_n'$$
the distance between the two sequences induced by the alignment is defined as
$$\sum_{i=1}^{n} \sigma(a_i', b_i').$$
The reader can easily see that this distance function, denoted as $d(S_i, S_j)$, has the
following characteristics:
(1) $d(S_i, S_i) = 0$
(2) $d(S_i, S_j) \leq d(S_i, S_k) + d(S_j, S_k)$
The second property is called the triangle inequality.
There is another point which we have to emphasize here. Note that for a
two sequence alignment, the pair (-, -) will never occur. But, in a multiple sequence
alignment, it is possible to have a "-" matched with a "-". Consider $S_1$, $S_2$ and $S_3$ as
follows:
S1 = ACTC
S2 = AC
S3 = ATCG
Let us first align $S_1$ and $S_2$ as follows:
ACTC
A--C
Then suppose we align $S_3$ with $S_1$ as follows:
ACTC-
A-TCG
The three sequences are finally aligned as below:
S1 = ACTC-
S2 = A--C-
S3 = A-TCG
This time, - is matched with - twice. Note that when $S_3$ is aligned with $S_1$, - is
added to the already aligned $S_1$.
Given two sequences $S_i$ and $S_j$, the minimum induced aligned distance is denoted
as $D(S_i, S_j)$. This notation will be used throughout this section.
Let us now consider the following four sequences. We would like to find the sequence
which has the shortest distances to all other sequences.
S1 = ATGCTC
S2 = AGAGC
S3 = TTCTG
S4 = ATTGCATGC
We align the four sequences in pairs.
S1 = ATGCTC
S2 = A-GAGC
D(S1, S2) = 3

S1 = ATGCTC
S3 = TT-CTG
D(S1, S3) = 3

S1 = AT-GC-T-C
S4 = ATTGCATGC
D(S1, S4) = 3

S2 = AGAGC
S3 = TTCTG
D(S2, S3) = 5

S2 = A--G-A-GC
S4 = ATTGCATGC
D(S2, S4) = 4

S3 = -TT-C-TG-
S4 = ATTGCATGC
D(S3, S4) = 4

D(S1, S2) + D(S1, S3) + D(S1, S4) = 3 + 3 + 3 = 9
D(S2, S1) + D(S2, S3) + D(S2, S4) = 3 + 5 + 4 = 12
D(S3, S1) + D(S3, S2) + D(S3, S4) = 3 + 5 + 4 = 12
D(S4, S1) + D(S4, S2) + D(S4, S3) = 3 + 4 + 4 = 11
It can be seen that $S_1$ has the shortest distances to all other sequences. In some
sense, we may say that $S_1$ is most similar to the others. We shall call this sequence the
center of the sequences.
Let us now formally define this concept.
Given a set $S$ of $k$ sequences, the center of this set of sequences is the sequence $S_i$
which minimizes
$$\sum_{X \in S \setminus \{S_i\}} D(S_i, X).$$
There are $k(k-1)/2$ pairs of sequences. Each pair can be aligned by using the
dynamic programming approach. It can be seen that it takes polynomial time to find a
center.
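Finding the center can be sketched in Python as follows. The helper computes $D(S_i, S_j)$ with the 0/1 scoring function of this section by the usual dynamic programming table, kept one row at a time. The names `min_induced_distance` and `find_center` are assumptions made for illustration.

```python
def min_induced_distance(a: str, b: str) -> int:
    """D(a, b): minimum number of mismatched or blank columns over all alignments."""
    prev = list(range(len(b) + 1))                  # row for the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # x aligned with y
                           prev[j] + 1,             # x aligned with -
                           cur[j - 1] + 1))         # y aligned with -
        prev = cur
    return prev[-1]


def find_center(seqs):
    """Return (index of the center, list of distance sums for every sequence)."""
    sums = [sum(min_induced_distance(s, t) for j, t in enumerate(seqs) if j != i)
            for i, s in enumerate(seqs)]
    return sums.index(min(sums)), sums


center, sums = find_center(["ATGCTC", "AGAGC", "TTCTG", "ATTGCATGC"])
print(center, sums)
```

For the four sequences above the sums agree with the hand computation, and index 0, that is $S_1$, is reported as the center.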
Our approximation algorithm works as follows:
Algorithm 4.1 An approximation algorithm to find an approximate solution for the
sum of pairs multiple sequence alignment problem.
Input: $k$ sequences.
Output: An alignment of the $k$ sequences with performance ratio smaller than 2.
Step 1: Find the center of these $k$ sequences. Without loss of generality, we may
assume that $S_1$ is the center.
Step 2: Let $i = 2$.
Step 3: Find an optimal alignment between $S_i$ and $S_1$. Add spaces to the already
aligned sequences $S_1, S_2, \ldots, S_{i-1}$ if necessary.
Step 4: If $i = k$, output the final alignment; otherwise, let $i = i + 1$ and go to Step 3.
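Steps 2 to 4 can be sketched in Python as below, assuming the center has already been chosen. This is a minimal illustration rather than a reference implementation: `align_pair` and `center_star` are hypothetical names, ties between optimal pairwise alignments are broken arbitrarily (so the rows may differ from a hand-made alignment while inducing the same distances to the center), and when both the existing alignment and the new pairwise alignment have a blank in the center at the same position, the sketch lets them share one column.

```python
def align_pair(a: str, b: str):
    """One optimal alignment of a and b under the 0/1 distance; returns two rows."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    ra, rb, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb))


def center_star(seqs, center: int = 0):
    """Align every sequence to seqs[center] and merge the pairwise alignments."""
    order = [center] + [i for i in range(len(seqs)) if i != center]
    rows = [list(seqs[center])]                    # rows[0] is always the center
    for idx in order[1:]:
        ca, sa = align_pair(seqs[center], seqs[idx])
        merged = [[] for _ in range(len(rows) + 1)]
        p = q = 0
        while p < len(rows[0]) or q < len(ca):
            old = rows[0][p] if p < len(rows[0]) else None
            new = ca[q] if q < len(ca) else None
            if old == "-" and new != "-":          # column exists only in the old alignment
                for r, row in enumerate(rows):
                    merged[r].append(row[p])
                merged[-1].append("-"); p += 1
            elif new == "-" and old != "-":        # new blank in the center: pad old rows
                for r in range(len(rows)):
                    merged[r].append("-")
                merged[-1].append(sa[q]); q += 1
            else:                                  # same center character (or shared blank)
                for r, row in enumerate(rows):
                    merged[r].append(row[p])
                merged[-1].append(sa[q]); p += 1; q += 1
        rows = merged
    result = [None] * len(seqs)
    for row, idx in zip(rows, order):
        result[idx] = "".join(row)
    return result


for row in center_star(["ATGCTC", "AGAGC", "TTCTG", "ATTGCATGC"]):
    print(row)
```

Because blanks are only ever added as whole columns, the distance between the center and every other sequence in the final alignment stays equal to the pairwise optimum.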
Let us consider the four sequences discussed above:
S1 = ATGCTC
S 2 = AGAGC
S 3 = TTCTG
S 4 = ATTGCATGC
As we showed before, S1 is the center.
We now align S 2 with S1 as follows:
S1 = ATGCTC
S 2 = A-GAGC
Add S 3 by aligning S 3 with S1 .
S1 = ATGCTC
S 3 = -TTCTG
Thus the alignment becomes:
S1 = ATGCTC
S 2 = A-GAGC
S 3 = -TTCTG
Add $S_4$ by aligning $S_4$ with $S_1$.
S1 = AT-GC-T-C
S4 = ATTGCATGC
This time, spaces are added to the aligned $S_1$. Thus, spaces have to be added to
the aligned $S_2$ and $S_3$ as well. The final alignment is:
S1 = AT-GC-T-C
S2 = A--GA-G-C
S3 = -T-TC-T-G
S4 = ATTGCATGC
As we can see, this is a typical approximation algorithm, as we only align all
sequences with respect to $S_1$.
Let $d(S_i, S_j)$ denote the distance between $S_i$ and $S_j$ induced by this approximation
algorithm. Let
$$App = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(S_i, S_j).$$
Let $d^*(S_i, S_j)$ denote the distance between $S_i$ and $S_j$ induced by an optimal multiple
sequence alignment. Let
$$Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(S_i, S_j).$$
We shall show in the following that
$$App \leq 2\,Opt.$$
Before the formal proof, let us note that $d(S_1, S_i) = D(S_1, S_i)$. This can be easily
seen by examining the above example.
S1 = ATGCTC
S2 = A-GAGC
Thus $D(S_1, S_2) = 3$.
At the end of the algorithm, $S_1$ and $S_2$ are aligned as follows:
S1 = AT-GC-T-C
S2 = A--GA-G-C
$d(S_1, S_2) = 3 = D(S_1, S_2)$. This distance is not changed because $\sigma(-, -) = 0$.
The proof of $App \leq 2\,Opt$ is as follows:
$$App = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(S_i, S_j) \leq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} \left[ d(S_i, S_1) + d(S_1, S_j) \right] \quad \text{(triangle inequality)}$$
$$= 2(k-1) \sum_{i=2}^{k} d(S_1, S_i) \quad (\text{since } d(S_1, S_i) = d(S_i, S_1)).$$
Since $d(S_1, S_i) = D(S_1, S_i)$ for all $i$, we have
$$App \leq 2(k-1) \sum_{i=2}^{k} D(S_1, S_i). \qquad (4.3)$$
Let us now find $Opt$.
$$Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(S_i, S_j).$$
First, let us note that $D(S_i, S_j)$ is the distance induced by an optimal two sequence
alignment. Thus
$$D(S_i, S_j) \leq d^*(S_i, S_j),$$
and
$$Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(S_i, S_j) \geq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} D(S_i, S_j).$$
But, note that $S_1$ is the center. Thus
$$Opt \geq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} D(S_i, S_j) \geq \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) = k \sum_{j=2}^{k} D(S_1, S_j). \qquad (4.4)$$
Considering (4.3) and (4.4), we have
$$App \leq \frac{2(k-1)}{k}\,Opt \leq 2\,Opt.$$
The error rate induced by this approximation algorithm is
$$\frac{App - Opt}{Opt} \leq 1.$$
4.6 The Minimal Spanning Tree Preservation Approach for
Multiple Sequence Alignment
Given a set of sequences, between every pair of them there is an optimal alignment
which induces a minimum distance. After a multiple sequence alignment, many of these
pairwise distances will no longer be optimal. Of course, we would like to
preserve as many of them as possible. Since we cannot preserve all of the inter-sequence
distances, we must select a set of them to preserve. In Gusfield's approach, one sequence
is selected and the alignments between this sequence and all other sequences are
optimal. Thus his approach preserves $k - 1$ distances, where $k$ is the number
of input sequences. The particular sequence which his method selects is the one which is
most similar, in some sense, to the other sequences.
Let $D$ denote the distance matrix based upon optimal alignments between every
pair of the input sequences. After a multiple sequence alignment $M$ is performed, the
distance between a pair of sequences may be changed. Let $D_m$ denote the new
distance matrix based upon the new distances between sequences after $M$ is performed.
Let $MST(D)$ be a minimal spanning tree constructed based upon the distance matrix $D$
and let $MST(D_m)$ denote a minimal spanning tree based upon $D_m$. Our minimal
spanning tree approach for the multiple sequence alignment problem stipulates that all
distances on $MST(D)$ are exactly preserved after the alignment is made. We shall
prove later that every $MST(D)$ is also an $MST(D_m)$.
Our minimal spanning tree preservation algorithm is as follows:
Algorithm 4.2 A Minimal Spanning Tree Preservation Approach for the Multiple
Sequence Alignment Problem.
Input: $k$ sequences $S_1, S_2, \ldots, S_k$.
Output: A multiple sequence alignment $M$ such that $MST(D) = MST(D_m)$, where
each entry of $D$ is the distance induced by an optimal alignment between two sequences
and each entry of $D_m$ is the distance induced by $M$.
Step 1: Compute the $D(S_i, S_j)$'s by applying the dynamic programming algorithm.
Step 2: Find the minimal spanning tree of $D$, i.e., $MST(D)$.
Step 3: For every edge, say $e_i$, on $MST(D)$:
Let the sequences connected by $e_i$ be $S_{i1}$ and $S_{i2}$.
Find an optimal alignment between $S_{i1}$ and $S_{i2}$.
Add spaces to the already aligned sequences $\{S_{j1}, S_{j2} \mid 1 \leq j < i\}$ if
necessary.
Step 4: Output the final alignment.
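Step 2 can be sketched with Kruskal's algorithm, which the proof later in this section also refers to. The Python sketch below covers only Step 2; the merging of alignments along the tree edges in Step 3 is not shown. The function name `kruskal_mst` and the matrix literal, filled with the pairwise distances of Example 1 below, are illustrative.

```python
def kruskal_mst(dist):
    """Return the edges (i, j, weight) of a minimal spanning tree of a distance matrix."""
    k = len(dist)
    parent = list(range(k))

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(k) for j in range(i + 1, k))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # the edge joins two components, so no cycle is created
            parent[ri] = rj
            tree.append((i, j, w))
    return tree


# Pairwise distances of Example 1 (index 0 stands for S1, index 3 for S4).
D = [[0, 2, 3, 2],
     [2, 0, 4, 2],
     [3, 4, 0, 4],
     [2, 2, 4, 0]]
print(kruskal_mst(D))
```

When several distances are equal, as in this matrix, more than one minimal spanning tree exists, and the tree found depends on how ties are ordered; every such tree has the same total weight.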
We illustrate our idea in the following example.
Example 1: Consider four sequences as follows.
S1 = ATGCTC
S 2 = ATGAGC
S 3 = TTCTG
S 4 = ATGCATGC
Step 1 finds the pairwise distances optimally by the dynamic programming
algorithm.
S1 = ATGCTC
S2 = ATGAGC
D(S1, S2) = 2

S1 = ATGCTC
S3 = TT-CTG
D(S1, S3) = 3

S1 = ATGC-T-C
S4 = ATGCATGC
D(S1, S4) = 2

S2 = ATGAGC
S3 = TTCTG-
D(S2, S3) = 4

S2 = ATG-A-GC
S4 = ATGCATGC
D(S2, S4) = 2

S3 = -TTC-TG-
S4 = ATGCATGC
D(S3, S4) = 4

Therefore the distance matrix $D$ is shown in Table 4.1.
      S1   S2   S3   S4
S1          2    3    2
S2               4    2
S3                    4
S4
Table 4.1: The Distance Matrix D of Example 1
A minimal spanning tree $MST(D)$ is determined in Step 2 as shown in Figure 4.4.
Figure 4.4: An MST(D)
The edges on $MST(D)$ are $e(S_1, S_2)$, $e(S_2, S_4)$ and $e(S_1, S_3)$. We construct an
optimal alignment for each edge in Step 3.
For $e(S_1, S_2)$,
S1 = ATGCTC
S2 = ATGAGC
For $e(S_2, S_4)$,
(S1 = ATG-C-TC)
S2 = ATG-A-GC
S4 = ATGCATGC
For $e(S_1, S_3)$,
S1 = ATG-C-TC
(S2 = ATG-A-GC)
S3 = TT--C-TG
(S4 = ATGCATGC)
Note that any blanks added to $S_1$ when it is aligned with $S_3$ are also added to
$S_2$ and $S_4$. They would not affect the previous optimal alignments for
the pairs $S_1$ with $S_2$ and $S_2$ with $S_4$ respectively. The final alignment result is
S1 = ATG-C-TC
S 2 = ATG-A-GC
S 3 = TT--C-TG
S 4 = ATGCATGC
with $D_m$ as shown in Table 4.2.
      S1   S2   S3   S4
S1          2    3    4
S2               5    2
S3                    7
S4
Table 4.2: The Distance Matrix Dm Produced by Algorithm 4.2 for Example 1.
Figure 4.5 shows a minimal spanning tree of $D_m$ which is exactly the same as
$MST(D)$ shown in Figure 4.4.
Figure 4.5: An MST(Dm).
As described previously, due to the optimality of $D(S_i, S_j)$ in Step 1, we have
$$D(S_i, S_j) \leq D_m(S_i, S_j) \quad \text{for } 1 \leq i, j \leq k.$$
Furthermore, if $S_i$ and $S_j$ are connected on $MST(D)$, Algorithm
4.2 exactly preserves the distance between $S_i$ and $S_j$ before the alignment because $S_i$
and $S_j$ are optimally aligned and adding blanks will not change the distance. Thus
$$D(S_i, S_j) = D_m(S_i, S_j)$$
if $S_i$ and $S_j$ are connected by an edge in $MST(D)$ and
$$D(S_i, S_j) \leq D_m(S_i, S_j)$$
if $S_i$ and $S_j$ are not connected by an edge in $MST(D)$.
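These two relations can be checked on Example 1 with a short Python sketch that recomputes the induced distances from the final alignment. The function name `induced_distance_matrix` is hypothetical; the dictionary `D` and the list `tree_edges` simply restate the pairwise distances and tree edges of the example with 0-based indices.

```python
def induced_distance_matrix(rows):
    """Distances induced by a multiple alignment; a column (-, -) contributes 0."""
    k = len(rows)
    return [[sum(x != y for x, y in zip(rows[i], rows[j])) for j in range(k)]
            for i in range(k)]


final_alignment = ["ATG-C-TC",   # S1
                   "ATG-A-GC",   # S2
                   "TT--C-TG",   # S3
                   "ATGCATGC"]   # S4
Dm = induced_distance_matrix(final_alignment)

D = {(0, 1): 2, (0, 2): 3, (0, 3): 2, (1, 2): 4, (1, 3): 2, (2, 3): 4}
tree_edges = [(0, 1), (1, 3), (0, 2)]

for (i, j), d in D.items():
    relation = "=" if Dm[i][j] == d else "<"
    on_tree = "tree edge" if (i, j) in tree_edges else "non-tree pair"
    print(f"D(S{i + 1}, S{j + 1}) = {d} {relation} Dm = {Dm[i][j]}  ({on_tree})")
```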
Thus we have Theorem 4.1.
Without loss of generality, let us assume that the distances of $D$ are all distinct.
Theorem 4.1 $MST(D)$ is equal to $MST(D_m)$.
Proof. According to Kruskal's algorithm, the distances on $MST(D)$ form the
smallest set of distances that can be chosen without causing cycles. Since our algorithm preserves every
distance between $S_i$ and $S_j$ if $S_i$ and $S_j$ are connected by an edge on $MST(D)$,
and $D_m(S_i, S_j) \geq D(S_i, S_j)$ if $S_i$ and $S_j$ are not connected by an edge in
$MST(D)$, the application of Kruskal's algorithm on $D_m$ will produce exactly the same
minimal spanning tree. Thus $MST(D) = MST(D_m)$.
In fact, the order of the distances of the edges on $MST(D)$ is also preserved on
$MST(D_m)$ by Algorithm 4.2. We have Corollary 4.1 as follows.
Corollary 4.1 Let $e(a, b)$ and $e(c, d)$ be two edges on $MST(D)$. If $D(a, b) \leq D(c, d)$,
then $D_m(a, b) \leq D_m(c, d)$.
Proof. By Theorem 4.1, $MST(D) = MST(D_m)$ by using Algorithm 4.2. That is,
$D_m(a, b) = D(a, b) \leq D(c, d) = D_m(c, d)$.
4.7 The Edit Distance Concept
Sequence alignment may be viewed as a method to measure the similarity of two
sequences. After an alignment, one can then compute the Hamming distance between
the aligned sequences. The smaller the Hamming distance is, the more similar these two
sequences are to each other. In this section, we shall introduce a concept, called the edit
distance, which is also used quite often to measure the similarity between two sequences.
Let us consider two sequences $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$. We may
transform A to B by the following three edit operations: deletion of a character from A,
insertion of a character into A and substitution of a character in A with another
character. For example, let
A = GTAAHTY
and
B = TAHHYC.
A can be transformed to B by the following:
(1) Deleting the first character G of A. Sequence A becomes A = TAAHTY.
(2) Substituting the third character of A, namely A, by H. Sequence A becomes A =
TAHHTY.
(3) Deleting the fifth character of A, namely T, from A. Sequence A becomes A =
TAHHY.
(4) Inserting C after the last character of A. Sequence A becomes A = TAHHYC, which is
identical to B.
We can associate a cost with each operation. The edit distance is the minimum cost
associated with the edit operations needed to transform sequence A to sequence B. If the
cost is one for each operation, the edit distance becomes the minimum number of edit
operations needed to transform A to B. In the above example, if the cost of each
operation is one, the edit distance between A and B is 4, as at least four edit operations are
needed.
The edit distance can be found by the dynamic programming
approach. A recursive formula similar to that used for finding the longest common
subsequence or an optimal alignment between two sequences can be easily formulated. Let
$C(i)$, $C(d)$ and $C(s)$ denote the costs of insertion, deletion and substitution respectively.
Let $A(i, j)$ denote the edit distance between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$. Then, $A(i, j)$
can be expressed as follows:
$$A(0, 0) = 0, \qquad A(i, 0) = i\,C(d), \qquad A(0, j) = j\,C(i),$$
$$A(i, j) = \begin{cases} A(i-1, j-1) & \text{if } a_i = b_j \\ \min\{A(i-1, j) + C(d),\; A(i-1, j-1) + C(s),\; A(i, j-1) + C(i)\} & \text{otherwise} \end{cases}$$
Actually, it is easy to see that the edit distance finding problem is equivalent to the
optimal alignment problem. We do not intend to present a formal proof here as it can be
easily seen from the similarity of the respective recursive formulas. Instead, we shall use the
example presented above to illustrate our point.
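The recursive formula for the edit distance $A(i, j)$ given earlier in this section can be sketched in Python as follows. The function name `edit_distance` and the keyword parameters `c_ins`, `c_del` and `c_sub`, which stand for $C(i)$, $C(d)$ and $C(s)$, are assumptions made for illustration; all costs default to one.

```python
def edit_distance(a: str, b: str, c_ins: int = 1, c_del: int = 1, c_sub: int = 1) -> int:
    """Minimum total cost of the edit operations that transform a into b."""
    m, n = len(a), len(b)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * c_del           # delete a_1 ... a_i
    for j in range(1, n + 1):
        A[0][j] = j * c_ins           # insert b_1 ... b_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                A[i][j] = A[i - 1][j - 1]
            else:
                A[i][j] = min(A[i - 1][j] + c_del,
                              A[i - 1][j - 1] + c_sub,
                              A[i][j - 1] + c_ins)
    return A[m][n]


print(edit_distance("GTAAHTY", "TAHHYC"))
```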
Consider A = GTAAHTY and B = TAHHYC again. An optimal alignment would
produce the following:
A = GTAAHTY-
B = -TAHH-YC
An examination of the above alignment shows the equivalence between edit operations
and alignment operations as follows:
(1) $(a_i, b_j)$ with $a_i \neq b_j$ in the alignment is equivalent to the substitution operation
in edit distance finding. We substitute $a_i$ by $b_j$ in this case. If $a_i = b_j$, no operation
is needed.
(2) $(a_i, -)$ in the alignment is equivalent to deleting $a_i$ from A in edit distance
finding.
(3) $(-, b_j)$ in the alignment is equivalent to inserting $b_j$ into A in edit
distance finding.
The reader can use the above rules and the optimal alignment found above to
produce the four edit operations.
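The three rules can be applied mechanically, as in the following Python sketch. The function name `edit_operations` and the tuple representation of an operation are illustrative choices.

```python
def edit_operations(row_a: str, row_b: str):
    """Read the edit operations off an alignment of A (top row) and B (bottom row)."""
    ops = []
    for x, y in zip(row_a, row_b):
        if x == y:
            continue                          # identical characters: nothing to do
        if y == "-":
            ops.append(("delete", x))         # rule (2): (a_i, -)
        elif x == "-":
            ops.append(("insert", y))         # rule (3): (-, b_j)
        else:
            ops.append(("substitute", x, y))  # rule (1): (a_i, b_j) with a_i != b_j
    return ops


for op in edit_operations("GTAAHTY-", "-TAHH-YC"):
    print(op)
```

Applied to the alignment above, this reproduces the deletion of G, the substitution of A by H, the deletion of T and the insertion of C.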
4.8 The Protein Structure Alignment Problem
In the previous sections, proteins are considered as sequences of characters. Of course,
a protein is not only a one-dimensional sequence; it has a 3-dimensional structure. Let us
consider the sequences of proteins 1MBC and 2GDM from PDB. They are displayed
below:
1MBC:
VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL EKFDRFKHLK TEAEMKASED
LKKHGVTVLT ALGAILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP
GDFGADAQGA MNKALELFRK DIAAKYKELG YQG
2GDM:
GALTESQAAL VKSSWEEFNA NIPKHTHRFF ILVLEIAPAA KDLFSFLKGT SEVPQNNPEL
QAHAGKVFKL VYEAAIQLEV TGVVVTDATL KNLGSVHVSK GVADAHFPVV KEAILKTIKE
VVGAKWSEEL NSAWTIAYDE LAIVIKKEMD DAA
It is apparent that these two sequences are far from being alike. Yet, they are quite
alike in their 3-dimensional structure, as shown in Fig. 4.6.
Fig. 4.6: 3-dimensional structures of proteins (a) 1MBC, (b) 2GDM
Since proteins have 3-dimensional structures, it is meaningful to align two proteins
not through their sequences, but through their structures. In the following, we shall
first briefly describe the basic structure of a protein.
A protein may be viewed as a sequence of amino acids. Let us therefore first
introduce the structure of amino acids. An amino acid consists of an R-group, an amino
group and a carboxyl group, as illustrated in Fig. 4.7.
Fig. 4.7 The structure of an amino acid
An amino acid is divided into two parts: the R-group and the backbone. All of the
amino acids have the same backbone. It is the R-group that makes the difference
between amino acids. The backbone of an amino acid consists of four hydrogen atoms,
one nitrogen atom, two carbon atoms and two oxygen atoms. The Cα atom is a carbon
atom and is often called the α carbon. The α carbon is in the center of the backbone.
In general, we say that the backbone of an amino acid is N-Cα-C together with one
oxygen atom. We usually ignore three of the hydrogen atoms and one of the oxygen atoms
because their arrangement changes when two amino acids are connected together. On the
other hand, the atom order N-Cα-C of the backbone does not change when two amino
acids are connected. We may therefore say that the structure of an amino acid is
determined by the structure of N-Cα-C.
There are 20 different R-groups as there are 20 different amino acids. Fig. 4.8
shows the R-groups of Alanine and Valine.
Fig. 4.8 Two R-groups (a) Alanine (b) Valine
We now describe how two amino acids are connected in a protein. This is illustrated
in Fig. 4.9. When two amino acids are connected, the hydroxyl group (one oxygen atom
and one hydrogen atom) of the carboxyl group of the first amino acid and one hydrogen
atom of the amino group of the second amino acid are combined to form a water molecule
and this water molecule is released. The carbon atom of the first amino acid is
connected to the nitrogen atom of the second amino acid.
Fig. 4.9 The connection of two amino acids
When two amino acids are connected, a bond between the carbon atom of the first
amino acid and the nitrogen atom of the second amino acid is formed and this bond is
called the peptide bond. As illustrated in Fig. 4.10, the two Cα atoms, the carbon atom,
the oxygen atom, the nitrogen atom and the hydrogen atom attached to the nitrogen atom
form the peptide plane. This plane consists of six atoms, and none of them can rotate
relative to the others. Yet the plane itself can be rotated. In other words, the
orientations of the peptide planes determine the 3-dimensional structure of a protein.
Fig. 4.10 The peptide bond and the peptide plane
The amino acids in a protein turn in the 3-dimensional space. Fig. 4.11 shows
some typical examples.
Fig. 4.11 Some examples of protein structures
There are two basic 3-dimensional structures of a protein, namely the alpha-helix
and the beta-sheet. We shall not elaborate on these structures and only give illustrations.
Fig. 4.12 shows a typical alpha-helix and Fig. 4.13 shows a typical beta-sheet. Inside an
alpha-helix, there is a hydrogen bond between a hydrogen atom of the backbone and an
oxygen atom of another amino acid in the backbone, as shown in Fig. 4.12. The alpha-helix
is stable because of these bonds. There are also such bonds in a beta-sheet, as
illustrated in Fig. 4.13. The difference between alpha-helices and beta-sheets lies in
the direction of the hydrogen bonds with respect to the backbone. In a beta-sheet, the
hydrogen bonds are roughly perpendicular to the backbones, whereas in an alpha-helix,
the bonds run roughly along the axis of the helix.
Fig. 4.12 An alpha-helix
Fig. 4.13 A beta-sheet
As explained above, the structure of a protein is determined by the direction of the
peptide planes. We may therefore simply consider the 3-dimensional coordinates of all
of the N atoms and all of the C atoms in the backbone. The following table, Table 4.3,
gives the coordinates of a segment of the protein 1MBC, which can be found in the sperm
whale.
Amino acid (position)   Atom   X-axis   Y-axis   Z-axis
VAL (1)                 C      -2.562   14.402   15.817
                        N      -4.094   14.896   13.982
LEU (2)                 C      -1.183   13.755   18.42
                        N      -1.328   14.772   16.218
SER (3)                 C      -0.208   12.829   21.09
                        N      -1.114   12.521   18.854
GLU (4)                 C       1.495   12.367   23.459
                        N      -0.513   12.922   22.391
GLY (5)                 C       3.076   10.146   21.957
                        N       1.157   11.102   23.292
GLU (6)                 C       4.352   11.966   19.722
                        N       2.656   10.455   20.723
TRP (7)                 C       5.812   13.847   21.59
                        N       3.806   13.045   20.289
GLN (8)                 C       7.781   11.997   22.792
                        N       5.528   13.045   22.591
LEU (9)                 C       9.152   11.472   20.222
                        N       7.405   11.102   21.857
VAL (10)                C      10.019   14.279   19.655
                        N       8.462   12.367   19.488
Table 4.3 The 3-dimensional coordinates of a segment of protein 1MBC
Since the structure of a protein is defined by a sequence of 3-dimensional points, we
may also use vectors to represent the structure. For example, suppose the first N atom is
located at (26,10,4) and the first C atom is located at (27,9,5). These two consecutive
atoms define the vector (27-26, 9-10, 5-4) = (1,-1,1). For every pair of an N atom
and a C atom, we calculate a vector this way. We may therefore say that the structure of
a protein is characterized by a sequence of vectors. That is, suppose that a protein
sequence is $a_1 a_2 \cdots a_n$. Then its structure is $A_1 A_2 \cdots A_n$ where $A_i$,
$1 \le i \le n$, is the vector defined by the 3-dimensional coordinates of the N atom and
the C atom of $a_i$.
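As a small illustration, the vector computation can be sketched in Python. The helper name `backbone_vector` is hypothetical, and the sketch assumes that the vector points from the N atom to the C atom of the same amino acid, as in the example above. The two residues used are the first two rows of Table 4.3.

```python
def backbone_vector(n_atom, c_atom):
    """Return the vector from the N atom to the C atom of one amino acid."""
    return tuple(c - n for n, c in zip(n_atom, c_atom))


# The worked example from the text: N at (26,10,4), C at (27,9,5).
example = backbone_vector((26, 10, 4), (27, 9, 5))
print(example)

# First two residues of Table 4.3, stored as (name, N coordinates, C coordinates).
segment = [
    ("VAL", (-4.094, 14.896, 13.982), (-2.562, 14.402, 15.817)),
    ("LEU", (-1.328, 14.772, 16.218), (-1.183, 13.755, 18.42)),
]

# One vector per amino acid; this list plays the role of A_1 A_2 ... A_n.
vectors = [backbone_vector(n, c) for _, n, c in segment]
for (name, _, _), v in zip(segment, vectors):
    print(name, v)
```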
Given the structures of two proteins, can we somehow determine whether they are
similar to each other or not? Let these proteins be characterized by $A_1 A_2 \cdots A_m$ and
$B_1 B_2 \cdots B_n$ where all $A_i$'s and $B_j$'s are vectors. Then, we can perform an alignment
between $A_1 A_2 \cdots A_m$ and $B_1 B_2 \cdots B_n$. When we consider $A_i$ and $B_j$, we match them
in the alignment if the angle between them is rather small. Let $\theta(i, j)$ denote the angle
between vectors $A_i$ and $B_j$. Suppose we pair $A_i$ and $B_j$ if and only if
$\theta(i, j) < \frac{\pi}{4}$. Then we can use dynamic programming to find an alignment between
$A_1 A_2 \cdots A_m$ and $B_1 B_2 \cdots B_n$ as follows:
Let $C(i, j)$ denote the optimal score of an alignment between
$A_1 A_2 \cdots A_i$ and $B_1 B_2 \cdots B_j$. Then

$$
\begin{aligned}
C(i,0) &= 0 \\
C(0,j) &= 0 \\
C(i,j) &= \max \begin{cases} C(i-1,j-1) + \cos(\theta(i,j)) - \cos(\frac{\pi}{4}) \\ C(i-1,j) \\ C(i,j-1) \end{cases}
\end{aligned}
$$
Note that in the above formula, we use $\cos(\theta(i, j)) - \cos(\frac{\pi}{4})$. This value will be
positive if and only if $\theta(i, j) < \frac{\pi}{4}$.
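The recurrence for C(i, j) can be sketched in Python as follows. This is a minimal illustration under assumptions, not a definitive implementation: the names `cos_angle` and `structure_alignment_score` are hypothetical, the vectors are plain tuples assumed to be non-zero, and only the optimal score is returned, not the alignment itself.

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two non-zero 3-dimensional vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def structure_alignment_score(A, B, threshold=math.pi / 4):
    """Return C(m, n) for vector sequences A (length m) and B (length n)."""
    m, n = len(A), len(B)
    # C[i][0] = C[0][j] = 0 are the boundary conditions.
    C = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Positive exactly when the angle between A_i and B_j is below the threshold.
            gain = cos_angle(A[i - 1], B[j - 1]) - math.cos(threshold)
            C[i][j] = max(C[i - 1][j - 1] + gain,  # pair A_i with B_j
                          C[i - 1][j],             # leave A_i unpaired
                          C[i][j - 1])             # leave B_j unpaired
    return C[m][n]


# Two identical two-vector structures: both pairs are matched.
same = [(1, 0, 0), (0, 1, 0)]
print(structure_alignment_score(same, same))

# Perpendicular vectors are never paired, so the score stays at 0.
print(structure_alignment_score([(1, 0, 0)], [(0, 1, 0)]))
```

Because pairing two vectors whose angle exceeds the threshold contributes a negative gain, the maximum simply skips such pairs, which matches the pairing rule stated before the recurrence.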
In the following, we show a part of the protein structure alignment result of
proteins 1MBC and 2GDM.
1MBC  KTE**A*EM* KAS**EDL** KKHG**VTVL *TAL**GAIL
2GDM  TSEVPQNNPE LQAHAG*KVF K**LVYE**A AI*QLEVTG*

1MBC  K*KKGHHEAE LKPL*AQS** HATKHK**IP
2GDM  *VVVT**D** A**TLK*NLG S***VHVSK*

*: Gap