BINF6201/8201
Sequence alignment algorithms 2
10-12-2016
Global alignment vs. local alignment
Ø The Needleman-Wunsch algorithm gives the optimal alignment of two sequences over their entire lengths; the resulting alignment is therefore a global alignment.
Ø We compute the global alignment of two sequences when we believe
that the domain arrangements of the two sequences are similar.
[Figure: sequences a and b aligned over their entire lengths (global alignment)]
Ø However, very often we are more interested in aligning the sub-regions/domains of two sequences. In such cases, we want to find the optimal local alignment between the two sequences.
[Figure: sequences a and b with a locally aligned region flanked by non-alignable regions (local alignment)]
Smith-Waterman local alignment algorithm
Ø Smith-Waterman algorithm (1981) uses dynamic programming to find
the optimal local alignment between two sequences.
Ø The algorithm is modified from the Needleman-Wunsch algorithm.
Ø To identify the optimal local alignment, the algorithm terminates an ongoing alignment between the first i letters of a and the first j letters of b if the alignment is not promising, i.e., when H(i,j) is negative, and restarts a new alignment by assigning H(i,j) = 0.
Ø Therefore, H(i,j) is the score for the alignment starting from the last
terminating point to the i-th and j-th positions in a and b, respectively.
[Figure: an alignment path is terminated whenever H(i,j) < 0 and H(i,j) is reset to 0; the path ending at the maximal H(i,j) contains the optimal local alignment]
Smith-Waterman local alignment algorithm
Ø If we use the linear gap penalty function, then the recursion relation of the Smith-Waterman algorithm is

H(i, j) = max { H(i−1, j−1) + S(ai, bj)   (diagonal)
                H(i−1, j) − g             (vertical)
                H(i, j−1) − g             (horizontal)
                0                         (restart) }

with the initial conditions H(i, 0) = 0 and H(0, j) = 0.
Ø The optimal local alignment can be identified from the cell in the alignment matrix that has the maximal score.
Ø The alignment is recovered by backtracking from this cell until a zero is encountered. Of course, the trace of how each H(i,j) was computed needs to be stored in another matrix.
Ø As in the case of global alignment, this algorithm has a time complexity of O(mn).
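The recursion, maximal-cell search, and backtracking described above can be sketched in Python. This is a minimal illustration, not a reference implementation; the +3/−1 match/mismatch scoring function in the usage line is an assumed stand-in for a real substitution matrix such as PAM250.

```python
def smith_waterman(a, b, score, g=6):
    """Smith-Waterman local alignment with linear gap penalty W(l) = g*l.

    score(x, y) returns the substitution score for letters x and y.
    Returns (best score, aligned piece of a, aligned piece of b).
    """
    m, n = len(a), len(b)
    # Initial conditions: the first row and column are zeros.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # diagonal
                H[i - 1][j] - g,                              # vertical
                H[i][j - 1] - g,                              # horizontal
                0,                                            # restart
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the maximal cell until a zero is encountered.
    i, j = best_pos
    aln_a, aln_b = [], []
    while H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            aln_a.append(a[i - 1]); aln_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - g:
            aln_a.append(a[i - 1]); aln_b.append('-'); i -= 1
        else:
            aln_a.append('-'); aln_b.append(b[j - 1]); j -= 1
    return best, ''.join(reversed(aln_a)), ''.join(reversed(aln_b))

# Toy usage: the shared GATTACA region is recovered as the local alignment.
best, pa, pb = smith_waterman("XXXGATTACAYYY", "ZZGATTACAZZ",
                              lambda x, y: 3 if x == y else -1)
```

Note how the flanking mismatching regions never pull the alignment along: their cells stay at 0 because of the restart option.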
Smith-Waterman local alignment algorithm
Ø Initialize the first row and column, and then compute each cell from
the upper left corner to the bottom right corner.
[Figure: the alignment matrix with a along the rows and b along the columns; the first row and column are initialized to 0, and each cell H(i, j) is computed from its three neighbors: H(i−1, j−1) + S(ai, bj) (diagonal), H(i−1, j) − g (vertical), and H(i, j−1) − g (horizontal)]
Smith-Waterman local alignment algorithm
Ø  Using our toy sequences,
a: SHAKE and
b: SPEARE,
we can compute the following alignment matrix using a linear gap penalty function, W(l) = −6l, with amino acid alignment scores taken from PAM250.
[Figure: the filled alignment matrix for SHAKE vs. SPEARE]
Ø Through backtracking, we obtain the following local alignment:
SHAKE
PEARE
Alignment algorithms with a general gap penalty
Ø If the gap penalty function has the general form W(l), then when filling a cell in the alignment matrix by a horizontal or vertical move, we need to determine the optimal value of l, i.e., how many spaces should have been inserted before the current one.
Ø The recursion relation for the global alignment is given by

H(i, j) = max { H(i−1, j−1) + S(ai, bj)          (diagonal)
                max_{1≤l≤i} [H(i−l, j) − W(l)]   (vertical)
                max_{1≤l≤j} [H(i, j−l) − W(l)]   (horizontal) }

with the initial conditions H(0, 0) = 0, H(i, 0) = −W(i) and H(0, j) = −W(j).
Ø To compute a cell, up to i + j + 1 calculations have to be made, which has a time complexity of O(m + n), or O(n).
Ø Therefore the entire algorithm runs in O(n³) time, which is considered too slow, even though it is a polynomial-time algorithm.
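A direct transcription of this recursion shows where the extra factor of n comes from: the two inner maximizations over the gap length l. A sketch, assuming the gap penalty W is supplied as a Python function:

```python
def nw_general_gap(a, b, score, W):
    """Global alignment score with a general gap penalty function W(l).

    Each cell scans all gap lengths l in its row and column, so the
    whole algorithm runs in O(n^3) time for sequences of length ~n.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # Initial conditions: H(0,0) = 0, H(i,0) = -W(i), H(0,j) = -W(j).
    for i in range(1, m + 1):
        H[i][0] = -W(i)
    for j in range(1, n + 1):
        H[0][j] = -W(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),       # diagonal
                max(H[i - l][j] - W(l) for l in range(1, i + 1)),  # vertical
                max(H[i][j - l] - W(l) for l in range(1, j + 1)),  # horizontal
            )
    return H[m][n]

# With a linear penalty W(l) = 6l this reduces to ordinary Needleman-Wunsch:
s = nw_general_gap("AG", "A", lambda x, y: 3 if x == y else -1, lambda l: 6 * l)
```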
Alignment algorithms with a general gap penalty
Ø  The recursion relation for the local alignment version is

H(i, j) = max { H(i−1, j−1) + S(ai, bj)          (diagonal)
                max_{1≤l≤i} [H(i−l, j) − W(l)]   (vertical)
                max_{1≤l≤j} [H(i, j−l) − W(l)]   (horizontal)
                0                                (restart) }

with the initial condition H(i, 0) = H(0, j) = 0.
Ø To avoid too many adjacent short gaps separated by very short alignments, the following relation must hold:
W(l1 + l2) ≤ W(l1) + W(l2).
[Figure: a single gap of length l1 + l2 (favorable alignment) vs. two gaps of lengths l1 and l2 separated by a short match (unfavorable alignment)]
That is, the penalty for one long gap should not be greater than the penalty for two short gaps that add up to the same length. In general, this relation holds if the gap opening penalty is larger than any mismatch and gap extension penalty, i.e., S(a, b) > −gext > −gopen.
Alignment algorithms with affine gap penalty
Ø Because using a general form of gap penalty function slows down the
algorithm, an affine gap penalty function is preferred.
Ø When using an affine gap penalty function in dynamic programming, we only need to differentiate between the case in which a gap is first being introduced and the case in which a gap is being extended.
Ø Let M(i, j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which ai is aligned with bj.
[Diagram: M(i, j) — ai aligned with bj]
Ø Let I(i, j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which ai is aligned with a space.
[Diagram: I(i, j) — ai aligned with a space]
Alignment algorithms with affine gap penalty
Ø Let J(i, j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which bj is aligned with a space.
[Diagram: J(i, j) — bj aligned with a space]
Ø  The score of the best alignment to this point is the best of the three cases,
H(i, j) = max(M(i, j), I(i, j), J(i, j)).
Ø  Therefore, we need to fill four separate matrices.
Ø  To compute M(i, j), we consider the following three possibilities:
1.  ai−1 aligns with bj−1, so we extend an alignment: M(i, j) = M(i−1, j−1) + S(ai, bj).
2.  ai−1 aligns with a space, so we end a gap: M(i, j) = I(i−1, j−1) + S(ai, bj).
3.  bj−1 aligns with a space, so we end a gap: M(i, j) = J(i−1, j−1) + S(ai, bj).
Alignment algorithms with affine gap penalty
Ø  Therefore, we have the following recursion to compute M(i, j),

M(i, j) = max { M(i−1, j−1) + S(ai, bj)
                I(i−1, j−1) + S(ai, bj)
                J(i−1, j−1) + S(ai, bj) }

Initialization: M(0, 0) = 0, M(0, j) = −∞ and M(i, 0) = −∞.
Ø  To compute I(i, j), we consider the following two possibilities:
1.  ai−1 aligns with bj, so we open a gap: I(i, j) = M(i−1, j) − gopen.
2.  ai−1 aligns with a space, so we extend the gap: I(i, j) = I(i−1, j) − gext.
Ø We do not need to consider the case in which bj aligns with a space, as it never happens by design (a high cost for opening a gap).
Ø  Therefore, we have the following recursion to compute I(i, j),

I(i, j) = max { M(i−1, j) − gopen
                I(i−1, j) − gext }

Initialization: I(0, 0) = −∞, I(0, j) = −∞, I(1, j) = −gopen, I(i, 0) = −gopen − gext(i−1).
Alignment algorithms with affine gap penalty
Ø Similarly, to compute J(i, j), we consider the following two possibilities:
1.  bj−1 aligns with ai, so we open a gap: J(i, j) = M(i, j−1) − gopen.
2.  bj−1 aligns with a space, so we extend the gap: J(i, j) = J(i, j−1) − gext.
Ø  Therefore, we have the following recursion to compute J(i, j),

J(i, j) = max { M(i, j−1) − gopen
                J(i, j−1) − gext }

Initialization: J(0, 0) = −∞, J(i, 0) = −∞, J(i, 1) = −gopen, J(0, j) = −gopen − gext(j−1).
Ø We fill the dynamic programming matrix H(i, j) by using the recursive relation

H(i, j) = max { M(i, j)   (diagonal)
                I(i, j)   (vertical)
                J(i, j)   (horizontal) }

with the initial conditions H(0, 0) = 0, H(i, 0) = −gopen − (i−1)gext, and H(0, j) = −gopen − (j−1)gext.
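The three recursions for M, I, and J can be combined into a single pass, in the style of Gotoh's algorithm. A sketch computing only the global alignment score; the +2/−1 scoring function and the penalty values in the usage line are assumptions for illustration:

```python
def nw_affine(a, b, score, g_open, g_ext):
    """Global alignment score with affine gap penalty W(l) = g_open + (l-1)*g_ext.

    M(i,j): a_i aligned with b_j; I(i,j): a_i aligned with a space;
    J(i,j): b_j aligned with a space.
    """
    NEG = float("-inf")
    m, n = len(a), len(b)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    J = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):          # leading gap in b
        I[i][0] = -g_open - g_ext * (i - 1)
    for j in range(1, n + 1):          # leading gap in a
        J[0][j] = -g_open - g_ext * (j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = score(a[i - 1], b[j - 1])
            # M: end in a match/mismatch, coming from any of the three states.
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1]) + s
            # I: open a gap after a match, or extend an existing gap.
            I[i][j] = max(M[i - 1][j] - g_open, I[i - 1][j] - g_ext)
            # J: same, with the space in the other sequence.
            J[i][j] = max(M[i][j - 1] - g_open, J[i][j - 1] - g_ext)
    # H(m, n) = max of the three states at the bottom-right cell.
    return max(M[m][n], I[m][n], J[m][n])

# Toy usage: +2 match, -1 mismatch, g_open = 3, g_ext = 1.
best = nw_affine("AAG", "AG", lambda x, y: 2 if x == y else -1, 3, 1)
```

Here the optimal alignment is AAG over A-G (score 2 − 3 + 2 = 1), which beats ending with a gap (2 − 1 − 3 = −2).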
Alignment algorithms with affine gap penalty
Ø In this algorithm design, an I- or J-type alignment can only be followed by the same type or by an M-type alignment; we thus prevent alignments in which a space in one sequence is immediately followed by a space in the other sequence, such as:
S--HAKE
SPE-ARE
Ø To see this, set gopen to a value large enough that
S(a, b) > −gopen.
Ø Then alignment (I) always scores higher than alignment (II), because S(H, E) > −gopen:
(I) S-HAKE    (II) S--HAKE
    SPEARE         SPE-ARE
Alignment algorithms with affine gap penalty
Ø As before, by adding the restart option and initialization conditions to this global alignment algorithm, we can produce an algorithm for local alignment:

H(i, j) = max { M(i, j)   (diagonal)
                I(i, j)   (vertical)
                J(i, j)   (horizontal)
                0         (restart) }

with the initial condition H(i, 0) = H(0, j) = 0.
Effect of scoring parameters on the alignment
Ø Although the dynamic programming algorithm guarantees the optimal alignment between two sequences under a given scoring system, changing the scoring system will produce different alignments.
Ø Shown left are alignments between human and yeast hexokinase proteins using different gap opening penalties.
Online pairwise alignment programs
Ø The Needleman-Wunsch algorithm with an affine gap penalty function has been implemented by a few groups, and the programs are freely available as both standalone and web-based applications.
Ø Examples include the needle program in the EMBOSS package and the GGSEARCH program in the FASTA package.
Ø Web server of GGSEARCH: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=GGSEARCH
Online pairwise alignment programs
Ø The Smith-Waterman algorithm with an affine gap penalty function has also been implemented by a few groups, and the programs are freely available as both standalone and web-based applications.
Ø Examples include the water program in the EMBOSS package and the SSEARCH program in the FASTA package.
Ø Web server of SSEARCH: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=SSEARCH
Multiple sequence alignment
Ø  Multiple sequence alignment (MSA): an alignment of three or more sequences, such that the alignment has the maximal score given a scoring matrix and gap penalty function.
Ø  Theoretically, MSA can be solved by multidimensional dynamic programming; however, this has time complexity O(N^S) for aligning S sequences of length N, so it can only be applied to a few sequences.
Ø  In fact, it has been shown that MSA is an NP-hard problem; therefore, there is no known efficient algorithm to solve it.
Ø  Various heuristic algorithms have been proposed to align multiple sequences. They generally perform well when the sequences to be aligned are not too distantly related to one another.
Ø  Most of these heuristic algorithms, such as the Clustal W/X and Clustal Omega programs, use a progressive alignment method to align multiple sequences.
Progressive alignment algorithm
Ø This method starts with the most confident pairwise alignment, and then gradually adds each sequence or group of sequences to the already aligned MSA using a guide tree.
Ø For example, the Clustal algorithms first construct a phylogenetic tree of the sequences to be aligned using the pairwise alignments of the sequences.
Ø The evolutionary distance between two sequences can be estimated by the Kimura estimator,
d = −ln(1 − D − 0.2D²),
where D is the observed fraction of differing sites.
Ø The evolutionary distance can also be calculated from the alignment score using the Feng and Doolittle formula,
d = −100 [ln(S − Srand) − ln(Sident − Srand)],
where Srand is the average score for aligning two random sequences, and Sident is the average score for aligning two identical sequences.
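Both estimators are one-line computations; a sketch, assuming D and the alignment scores are already available from a pairwise alignment:

```python
import math

def kimura_distance(D):
    """Kimura's estimator of evolutionary distance from the observed
    fraction of differing sites D: d = -ln(1 - D - 0.2*D^2)."""
    return -math.log(1.0 - D - 0.2 * D * D)

def feng_doolittle_distance(S, S_rand, S_ident):
    """Feng-Doolittle distance from an alignment score S:
    d = -100 * [ln(S - S_rand) - ln(S_ident - S_rand)]."""
    return -100.0 * (math.log(S - S_rand) - math.log(S_ident - S_rand))
```

Note that feng_doolittle_distance gives 0 when S equals S_ident (identical sequences) and grows as S approaches the random-alignment score.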
Progressive alignment algorithm
Ø Clustal uses the neighbor-joining method to construct the guide tree from the computed evolutionary distance matrix.
Ø Using the tree as a guide, Clustal aligns the two pairs of sequences that have the shortest evolutionary distances, i.e., HXK2 RAT and HXK2 HUMAN, and HXK1 RAT and HXK1 HUMAN, using the Needleman-Wunsch global alignment algorithm.
Ø These two pairwise alignments are then aligned to form a cluster of four sequences.
Ø This process is repeated until all clusters are joined to form a single cluster.
Alignment algorithm of two clusters of sequences
Ø The algorithm for aligning two clusters of sequences is essentially the same as that for aligning two sequences, but all the sequences in a cluster are treated as if they were a single sequence.
Ø If a space is introduced into a cluster, it is inserted at the same position in all sequences in the cluster.
Alignment algorithm of two clusters of sequences
Ø To score the aligned sites i and j in two clusters, we can use the average of the individual scores for the amino acid pairs that can be formed between the clusters,

S[cluster1(i), cluster2(j)] = (1 / n1n2) Σ_{k=1}^{n1} Σ_{t=1}^{n2} S[a1k(i), a2t(j)],

where n1 and n2 are the numbers of sequences, and a1k(i) and a2t(j) are the amino acids at sites i and j in the k-th and t-th sequences in clusters 1 and 2, respectively.
Ø For example, the score for aligning the following sites from two clusters,
Position i in cluster 1: P, A
Position j in cluster 2: I, R
would be

S[cluster1(i), cluster2(j)] = 1/(2×2) [S(P, I) + S(P, R) + S(A, I) + S(A, R)].
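The average-of-pairs score is straightforward to compute; a minimal sketch, where a site column is given as the string of letters at that position, one per sequence in the cluster (the +2/−1 toy scoring in the usage line is an assumption):

```python
def cluster_site_score(col1, col2, score):
    """Average of score(x, y) over all amino acid pairs formed between
    the letters at site i of cluster 1 (col1) and site j of cluster 2 (col2)."""
    n1, n2 = len(col1), len(col2)
    total = sum(score(x, y) for x in col1 for y in col2)
    return total / (n1 * n2)

# Toy usage: column (P, A) vs column (I, R), as in the example above.
s = cluster_site_score("PA", "IR", lambda x, y: 2 if x == y else -1)
```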
Problems of progressive alignment algorithms
Ø Because of the heuristic nature of the progressive alignment algorithm, a globally optimal alignment is not guaranteed.
Ø In particular, errors made in an earlier step cannot be corrected by later steps.
Ø To avoid the bias caused by aligning very closely related sequences in the earlier steps, Clustal uses a weighted scoring system, i.e., a smaller weight is given to closely related sequences when computing the alignment scores,

S[cluster1(i), cluster2(j)] = (1 / n1n2) Σ_{k=1}^{n1} Σ_{t=1}^{n2} w_{k,t} S[a1k(i), a2t(j)],

where w_{k,t} is the weight for the pair formed by the k-th sequence in cluster 1 and the t-th sequence in cluster 2.
Ø Sometimes, manual adjustment is needed in the regions that are not aligned very well.
Online multiple sequence alignment programs
Ø The popular multiple sequence alignment programs include:
1.  Clustal W/X/Omega: http://www.clustal.org/
2.  T-Coffee: http://www.tcoffee.org/Projects_home_page/
t_coffee_home_page.html
3.  MUSCLE: http://www.drive5.com/muscle/
Ø Since MSA algorithms are still an active research area, new algorithms and programs are expected in the future.
Ø A recent development is the SATe algorithm, which iteratively constructs the guide tree and the alignment until a convergence criterion is met: http://people.ku.edu/~jyu/sate/sate.html