Bioinformatics Algorithms
AIMS 2009
Marshall Hampton
1. Sequence Alignment
In these notes I mostly try to stick with the notation and presentation in Ewens
and Grant’s “Statistical Methods in Bioinformatics”, which I think has been ordered
for the AIMS library.
(1) What are alignments? It is perhaps easiest to see from a simple example. We
will begin with global pairwise alignments.
Let's consider alignments of two sequences: WHY and WHAT. Here are a
few alignments:

WHY-
WHAT

W-HY
WHAT

-W-HY
W-HAT

-W-H-Y-
W-H-A-T

The “-” symbol stands for an indel, an insertion or deletion. A series of 1
or more indels is called a gap.
We will not consider pairwise alignments in which an indel is aligned with
an indel. So the following is not a valid alignment, because its third column
pairs two indels:

WH-Y-
WH-AT

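As a minimal sketch of these rules, the check below treats two rows as a valid global alignment when they have equal length, no column pairs an indel with an indel, and each row gives back its sequence once the indels are removed. The function name `is_valid_alignment` and its parameters are illustrative assumptions, not notation from Ewens and Grant.

```python
def is_valid_alignment(row_x: str, row_y: str, x: str, y: str, gap: str = "-") -> bool:
    """Return True if (row_x, row_y) is a global pairwise alignment of x and y."""
    # Both rows of an alignment must have the same number of columns.
    if len(row_x) != len(row_y):
        return False
    # No column may pair an indel with an indel.
    if any(a == gap and b == gap for a, b in zip(row_x, row_y)):
        return False
    # Removing the indels must give back the original sequences.
    return row_x.replace(gap, "") == x and row_y.replace(gap, "") == y


print(is_valid_alignment("WHY-", "WHAT", "WHY", "WHAT"))
print(is_valid_alignment("WH-Y-", "WH-AT", "WHY", "WHAT"))
```

The first call checks one of the valid alignments listed above; the second checks the invalid one with a column of two indels.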
(2) The number of global pairwise alignments: how many alignments are there
between two sequences of length m and n? The short answer is, too many!
To be more precise, consider two sequences $x = X_1 X_2 X_3 \dots X_m$ and
$y = Y_1 Y_2 \dots Y_n$. Let's denote the number of alignments by $c(m, n)$. It's difficult
to get this number exactly, so instead we will calculate a lower bound. Let
$g(m, n)$ be the number of alignments between x and y that differ in their
set of aligned letter pairs, so that alignments with the same aligned pairs are
counted only once. So for example the following two alignments of WHY
and WHAT would be counted together:

W-HY-
-WHAT

-WHY-
W-HAT

since they both match H-H and Y-A. The number $k$ of such aligned letter pairs
can be 0 to $\min(m, n)$. For each such $k$ there are $\binom{m}{k}$ ways of choosing the
letters from x and $\binom{n}{k}$ ways of choosing the letters from y. So
$$g(m, n) = \sum_{k=0}^{\min(m,n)} \binom{m}{k} \binom{n}{k}.$$
This is bad news, since that function grows very quickly. For simplicity,
consider the case that $m = n$. Then $g(n, n) = \binom{2n}{n}$ (a good exercise if you
don't know this identity). This can be approximated using Stirling's formula
$n! \approx \sqrt{2\pi}\, n^{n+1/2} e^{-n}$ as
$$g(n, n) \approx \frac{2^{2n}}{\sqrt{\pi n}}.$$
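The approximation for g(n, n) can be checked by writing the binomial coefficient in terms of factorials and substituting Stirling's formula for each one. The intermediate steps below are filled in here and are not taken from the original notes:

```latex
\[
\binom{2n}{n} = \frac{(2n)!}{(n!)^2}
\approx \frac{\sqrt{2\pi}\,(2n)^{2n+1/2}\, e^{-2n}}{\left(\sqrt{2\pi}\, n^{n+1/2}\, e^{-n}\right)^2}
= \frac{2^{2n+1/2}\, n^{2n+1/2}}{\sqrt{2\pi}\, n^{2n+1}}
= \frac{2^{2n}}{\sqrt{\pi n}}.
\]
```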
For two sequences of length 1000 (a common size to arise in practice), we
have $g(1000, 1000) \approx 10^{600}$, a ridiculous number. It would be impossible to
consider all the alignments.
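The lower bound can be evaluated directly with exact integer arithmetic. The sketch below assumes Python 3.8 or later (for `math.comb`). The names `g` and `stirling_estimate_log10` are chosen here for illustration. It checks the identity g(n, n) = C(2n, n) on small cases and compares the exact size of g(1000, 1000) with the Stirling estimate.

```python
import math


def g(m: int, n: int) -> int:
    """Lower bound g(m, n) = sum over k of C(m, k) * C(n, k)."""
    return sum(math.comb(m, k) * math.comb(n, k) for k in range(min(m, n) + 1))


def stirling_estimate_log10(n: int) -> float:
    """Base-10 logarithm of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


# Check the identity g(n, n) = C(2n, n) on a few small cases.
for n in range(1, 8):
    assert g(n, n) == math.comb(2 * n, n)

# Number of decimal digits of the exact value, next to the estimated exponent.
print(len(str(g(1000, 1000))), stirling_estimate_log10(1000))
```

Working with the logarithm of the estimate avoids forming a floating-point number far outside the range of a double.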
(3) Scoring systems. Because of the vast number of alignments, we need to use a
simple scoring system to be able to find an optimal alignment. The simplest
system of any use considers each pair in an alignment separately, and adds
up those individual scores. Each letter-letter or letter-indel pair has some score
associated with it.
As an example, consider the DNA alphabet A,C,G,T. A simple choice would
be to score any match as +1, any mismatch of letters as -1, and a gap-letter
pair as -2. We will discuss later how such values might be chosen.
In general we will denote the letter-gap score by −d where d ≥ 0, and the
score between letters i and j by s(i, j).
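A minimal sketch of this column-by-column scoring, using the example values from the text (match +1, mismatch -1, letter-indel pair -2) as defaults. The function names `pair_score` and `alignment_score` are assumptions made for illustration.

```python
def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1,
               d: int = 2, gap: str = "-") -> int:
    """Score of one alignment column: -d for a letter-indel pair, else s(a, b)."""
    if a == gap or b == gap:
        return -d
    return match if a == b else mismatch


def alignment_score(row_x: str, row_y: str, **scores: int) -> int:
    """Total score of an alignment: the sum of its column scores."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(pair_score(a, b, **scores) for a, b in zip(row_x, row_y))


# Three matches and one letter-indel pair.
print(alignment_score("ACGT", "AC-T"))
```

Other values of d, or a full substitution table for s(i, j), would slot into `pair_score` without changing the summation.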
(4) The Needleman-Wunsch algorithm: a dynamic programming algorithm. The
basic idea is to break the problem down into smaller sub-problems whose solutions
can be reused. In this setting, we consider finding the highest-scoring alignment
of some initial subsequences (prefixes) $x_{1,i} = X_1 X_2 \dots X_i$ and
$y_{1,j} = Y_1 Y_2 \dots Y_j$. Let $B(i, j)$ be the score of
a highest-scoring alignment between such subsequences.
$B(i, 0)$ will be the score of $x_{1,i}$ aligned with $i$ indels, so it must be
$B(i, 0) = -di$. Similarly $B(0, j) = -dj$. We define $B(0, 0) = 0$. Then we can begin
to fill out a matrix of $B(i, j)$ values. The key idea is to note that a
highest-scoring alignment between $x_{1,i}$ and $y_{1,j}$ can only end in three possible ways:
$X_i$ aligned to $Y_j$, $X_i$ aligned to a gap, or $Y_j$ aligned to a gap. This gives us a
recursion relation:
$$B(i, j) = \max\big(B(i-1, j-1) + s(i, j),\; B(i-1, j) - d,\; B(i, j-1) - d\big),$$
where $s(i, j)$ is shorthand for the score $s(X_i, Y_j)$ of the letter pair.
This lets us fill out an (m + 1) by (n + 1) matrix of B(i, j) scores. At each
step we must keep track of which choice provided the maximum, so we can
retrace our steps. There may be a tie, in which case we can arbitrarily choose
one direction or keep track of them all.
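The boundary values and the recursion translate almost line for line into code. The sketch below fills only the score matrix (no pointers yet) and assumes the simple match/mismatch scoring from item (3). The function name `nw_matrix` and its parameters are illustrative.

```python
from typing import List


def nw_matrix(x: str, y: str, match: int = 1, mismatch: int = -1,
              d: int = 2) -> List[List[int]]:
    """Fill the (m + 1) by (n + 1) matrix of Needleman-Wunsch scores B(i, j)."""
    m, n = len(x), len(y)
    B = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: a prefix aligned entirely against indels.
    for i in range(1, m + 1):
        B[i][0] = -d * i
    for j in range(1, n + 1):
        B[0][j] = -d * j
    # Interior: best of the three ways the alignment can end.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            B[i][j] = max(B[i - 1][j - 1] + s,  # X_i aligned to Y_j
                          B[i - 1][j] - d,      # X_i aligned to a gap
                          B[i][j - 1] - d)      # Y_j aligned to a gap
    return B


for row in nw_matrix("AGTC", "ATCG"):
    print(row)
```

Each entry is computed once from three neighbours, so the work is proportional to (m + 1)(n + 1), in contrast to the enormous number of alignments counted in item (2).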
Let's consider a short example where x = AGTC and y = ATCG. We
will use the scoring system described above, with each indel worth −2, each match
worth 1, and each mismatch worth −1. We start by filling in B(0, 0) = 0,
B(i, 0) = −2i and B(0, j) = −2j (rows are indexed by the letters of x, columns
by the letters of y):

       -    A    T    C    G
  -    0   -2   -4   -6   -8
  A   -2
  G   -4
  T   -6
  C   -8

Then we can fill in the upper-left corner, B(1, 1), from the recursion formula
above. The maximum is the first entry, B(0, 0) + s(1, 1) = 0 + 1 = 1. We keep
track of this choice with an arrow pointing to (0, 0):

       -    A    T    C    G
  -    0   -2   -4   -6   -8
  A   -2    1
  G   -4
  T   -6
  C   -8

After filling all the entries, we get the following matrix with pointers that
tell us how to back-trace to get the optimal alignment:

       -    A    T    C    G
  -    0   -2   -4   -6   -8
  A   -2    1   -1   -3   -5
  G   -4   -1    0   -2   -2
  T   -6   -3    0   -1   -3
  C   -8   -5   -2    1   -1

The path back corresponds to the alignment:

AGTC-
A-TCG

with a score of −1 (from the lower right corner).
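The back-trace can be sketched by storing, for every entry, which of the three choices gave the maximum. In this illustrative version ties are broken in the fixed order diagonal, up, left, which is one arbitrary choice among those the text allows. The function name `needleman_wunsch` and the pointer labels are assumptions.

```python
def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1, d: int = 2):
    """Return (score, row_x, row_y) for one optimal global alignment of x and y."""
    m, n = len(x), len(y)
    B = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        B[i][0], ptr[i][0] = -d * i, "up"
    for j in range(1, n + 1):
        B[0][j], ptr[0][j] = -d * j, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [(B[i - 1][j - 1] + s, "diag"),
                       (B[i - 1][j] - d, "up"),
                       (B[i][j - 1] - d, "left")]
            # max returns the first maximal option, so ties prefer "diag".
            B[i][j], ptr[i][j] = max(options, key=lambda t: t[0])
    # Follow the pointers from the lower-right corner back to (0, 0).
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":      # X_i aligned to Y_j
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":      # X_i aligned to a gap
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:                   # Y_j aligned to a gap
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return B[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("AGTC", "ATCG")
print(score)
print(row_x)
print(row_y)
```

On the example sequences this follows the path (4,4), (4,3), (3,2), (2,1), (1,1), (0,0) through the matrix.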
(5) Smith-Waterman local pairwise alignment. Often we do not want a global
alignment, but rather to find a good match between substrings of the two sequences. This is a local alignment. The Needleman-Wunsch algorithm can
be modified a little to accomplish this; the result is the Smith-Waterman algorithm. The changes we need to make are to consider a matrix of values
$L(i, j) = \max\big(0,\; L(i-1, j-1) + s(i, j),\; L(i-1, j) - d,\; L(i, j-1) - d\big)$,
that is, the same recursion as before with 0 as an extra option and with
boundary values $L(i, 0) = L(0, j) = 0$, and to
trace back from the maximum entry in the matrix until an entry equal to 0 is reached.
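A minimal sketch of the local variant, in its usual formulation: the maximum with 0 is taken inside the recursion, the boundary row and column are 0, and the back-trace starts at the largest entry and stops at the first 0. The function name `smith_waterman` and the match/mismatch defaults are illustrative assumptions. When several entries share the maximum, this version keeps the first one found.

```python
def smith_waterman(x: str, y: str, match: int = 1, mismatch: int = -1, d: int = 2):
    """Return (score, row_x, row_y) for one best local alignment of x and y."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]  # boundary values stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            L[i][j] = max(0,
                          L[i - 1][j - 1] + s,
                          L[i - 1][j] - d,
                          L[i][j - 1] - d)
            if L[i][j] > best:
                best, best_pos = L[i][j], (i, j)
    # Trace back from the maximum entry until an entry equal to 0 is reached.
    row_x, row_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and L[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if L[i][j] == L[i - 1][j - 1] + s:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif L[i][j] == L[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("AGTC", "ATCG"))
```

For the sequences of the global example, the best local alignment is the shared substring TC.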