Bioinformatics Algorithms
AIMS 2009
Marshall Hampton

1. Sequence Alignment

In these notes I mostly try to stick with the notation and presentation in Ewens and Grant's "Statistical Methods in Bioinformatics", which I think has been ordered for the AIMS library.

(1) What are alignments? It is perhaps easiest to see from a simple example. We will begin with global pairwise alignments. Let's consider alignments of two sequences: WHY and WHAT. Here are a few alignments:

    WHY-
    WHAT

    W-HY
    WHAT

    -W-HY
    W-HAT

    -W-H-Y-
    W-H-A-T

The "-" symbol stands for an indel, an insertion or deletion. A series of one or more indels is called a gap. We will not consider pairwise alignments in which gaps are aligned with each other. So the following is not a valid alignment, because its third column pairs an indel with an indel:

    WH-Y-
    WH-AT

(2) The number of global pairwise alignments: how many alignments are there between two sequences of length $m$ and $n$? The short answer is: too many! To be more precise, consider two sequences $x = X_1 X_2 X_3 \ldots X_m$ and $y = Y_1 Y_2 \ldots Y_n$. Let's denote the number of alignments by $c(m, n)$. It's difficult to get this number exactly, so instead we will calculate a lower bound. Let $g(m, n)$ be the number of alignments between $x$ and $y$ which have distinct sets of aligned letter pairs, so that alignments which pair up the same letters are counted only once; then $g(m, n) \le c(m, n)$. So for example the following two alignments of WHY and WHAT would be counted together:

    W-HY-
    -WHAT

    -WHY-
    W-HAT

since they both pair H with H and Y with A. The number $k$ of such aligned letter pairs can be anything from 0 to $\min(m, n)$. For each such $k$ there are $\binom{m}{k}$ ways of choosing the $k$ letters from $x$ and $\binom{n}{k}$ ways of choosing the letters from $y$. So

$$g(m, n) = \sum_{k=0}^{\min(m,n)} \binom{m}{k} \binom{n}{k}.$$

This is bad news, since that function grows very quickly. For simplicity, consider the case that $m = n$. Then $g(n, n) = \binom{2n}{n}$ (a good exercise if you don't know this identity). This can be approximated using Stirling's formula

$$n! \approx \sqrt{2\pi}\, n^{n+1/2} e^{-n}$$

as

$$g(n, n) \approx \frac{2^{2n}}{\sqrt{\pi n}}.$$

For two sequences of length 1000 (a common size to arise in practice), we have $g(1000, 1000) \approx 10^{600}$, a ridiculous number.
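As a quick numerical check of the counting argument above, here is a small Python sketch. The function names `g` and `log10_stirling_estimate` are my own choices for illustration; the code only evaluates the sum and the Stirling estimate stated above, using the standard library.

```python
from math import comb, log10, pi


def g(m: int, n: int) -> int:
    """Lower bound g(m, n) = sum over k of C(m, k) * C(n, k)."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


def log10_stirling_estimate(n: int) -> float:
    """log10 of the estimate 2^(2n) / sqrt(pi * n) for g(n, n)."""
    return 2 * n * log10(2) - 0.5 * log10(pi * n)


print(g(3, 4))                       # WHY vs WHAT: 35
print(g(10, 10) == comb(20, 10))     # the identity g(n, n) = C(2n, n): True
print(log10_stirling_estimate(1000)) # about 600.3, i.e. roughly 10^600
```

Python integers have arbitrary precision, so `g(1000, 1000)` can also be evaluated exactly; it has 601 decimal digits, in line with the estimate.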
It would be impossible to consider all the alignments.

(3) Scoring systems. Because of the vast number of alignments, we need a simple scoring system in order to be able to find an optimal alignment. The simplest system of any use considers each pair in an alignment separately, and adds up those individual scores. Each pair of letters and/or indels has some score associated to it. As an example, consider the DNA alphabet A, C, G, T. A simple choice would be to score any match as +1, any mismatch of letters as −1, and an indel-letter pair as −2. We will discuss later how such values might be chosen. In general we will denote the letter-indel score by $-d$ where $d \ge 0$, and the score between letters $a$ and $b$ by $s(a, b)$.

(4) The Needleman-Wunsch algorithm: a dynamic programming algorithm. The basic idea is to break the problem down into common sub-problems. In this setting, we consider finding the highest-scoring alignment of the initial subsequences $x_{1,i} = X_1 X_2 \ldots X_i$ and $y_{1,j} = Y_1 Y_2 \ldots Y_j$. Let $B(i, j)$ be the score of a highest-scoring alignment between such subsequences. $B(i, 0)$ is the score of $x_{1,i}$ aligned with $i$ indels, so it must be $B(i, 0) = -id$. Similarly $B(0, j) = -jd$. We define $B(0, 0) = 0$. Then we can begin to fill out a matrix of $B(i, j)$ values. The key idea is to note that a highest-scoring alignment between $x_{1,i}$ and $y_{1,j}$ can only end in three possible ways: $X_i$ aligned to $Y_j$, $X_i$ aligned to an indel, or $Y_j$ aligned to an indel. This gives us a recursion relation:

$$B(i, j) = \max\bigl(B(i-1, j-1) + s(X_i, Y_j),\; B(i-1, j) - d,\; B(i, j-1) - d\bigr)$$

This lets us fill out an $(m + 1)$ by $(n + 1)$ matrix of $B(i, j)$ scores. At each step we must keep track of which choice provided the maximum, so we can retrace our steps. There may be a tie, in which case we can arbitrarily choose one direction or keep track of them all.

Let's consider a short example where $x$ = AGTC and $y$ = ATCG. We will use the scoring system described above, with each indel worth −2, each match worth 1, and each mismatch worth −1.
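The recursion and the trace-back can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the function name `needleman_wunsch`, the pointer labels, and the tie-breaking rule (prefer the diagonal move) are not part of the notes, and the default scores are the +1 / −1 / −2 example values.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y, B)."""
    m, n = len(x), len(y)
    # B[i][j]: best score for aligning x[:i] with y[:j].
    B = [[0] * (n + 1) for _ in range(m + 1)]
    # ptr[i][j]: which of the three choices produced B[i][j].
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        B[i][0], ptr[i][0] = -d * i, "gap_in_y"
    for j in range(1, n + 1):
        B[0][j], ptr[0][j] = -d * j, "gap_in_x"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (B[i - 1][j - 1] + s, "diag"),   # X_i aligned to Y_j
                (B[i - 1][j] - d, "gap_in_y"),   # X_i aligned to an indel
                (B[i][j - 1] - d, "gap_in_x"),   # Y_j aligned to an indel
            ]
            # max returns the first maximal candidate, so ties prefer "diag".
            B[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the lower right corner to (0, 0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "gap_in_y":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return B[m][n], "".join(reversed(ax)), "".join(reversed(ay)), B


score, ax, ay, B = needleman_wunsch("AGTC", "ATCG")
print(score)  # -1
print(ax)     # AGTC-
print(ay)     # A-TCG
```

Here `B[i][j]` is indexed with `i` running over the letters of x and `j` over the letters of y. Filling the matrix takes on the order of m times n steps, compared with the astronomically many alignments counted earlier.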
We start by filling in $B(0, 0) = 0$, $B(i, 0) = -2i$ and $B(0, j) = -2j$. In the matrices below the columns are labeled by the letters of $x$ = AGTC and the rows by the letters of $y$ = ATCG, so $B(i, j)$ sits in column $i$ and row $j$:

         -    A    G    T    C
    -    0   -2   -4   -6   -8
    A   -2
    T   -4
    C   -6
    G   -8

Then we can fill in the upper-left corner, $B(1, 1)$, from the recursion formula above. The maximum is the first entry, $B(0, 0) + s(X_1, Y_1) = 0 + 1 = 1$. We keep track of this choice with an arrow pointing to $(0, 0)$:

         -    A    G    T    C
    -    0   -2   -4   -6   -8
    A   -2    1
    T   -4
    C   -6
    G   -8

After filling all the entries, we get the following matrix; the pointers recorded along the way tell us how to back-trace to get the optimal alignment:

         -    A    G    T    C
    -    0   -2   -4   -6   -8
    A   -2    1   -1   -3   -5
    T   -4   -1    0    0   -2
    C   -6   -3   -2   -1    1
    G   -8   -5   -2   -3   -1

Following the pointers back from the lower right corner gives the path $(4,4) \to (4,3) \to (3,2) \to (2,1) \to (1,1) \to (0,0)$, which corresponds to the alignment:

    AGTC-
    A-TCG

with a score of −1 (from the lower right corner).

(5) Smith-Waterman local pairwise alignment. Often we do not want a global alignment, but rather to find a good match between substrings of the two sequences. This is a local alignment. The Needleman-Wunsch algorithm can be modified a little to accomplish this, which gives the Smith-Waterman algorithm. We fill in a matrix of values $L(i, j)$ using the same recursion as before, except that 0 is added as a fourth option, so that no entry is ever negative:

$$L(i, j) = \max\bigl(0,\; L(i-1, j-1) + s(X_i, Y_j),\; L(i-1, j) - d,\; L(i, j-1) - d\bigr),$$

with $L(i, 0) = L(0, j) = 0$. Instead of starting at the lower right corner, we trace back from the maximum entry in the matrix, and we stop as soon as we reach an entry equal to 0.
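A minimal Python sketch of the local alignment procedure follows, under the standard formulation in which each entry is the maximum of 0 and the three usual options computed from neighbouring entries of the same matrix. The function name `smith_waterman`, the pointer labels, and the choice to report only the first maximal entry found are my own assumptions for illustration.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=2):
    """Local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # L[i][j]: best score of a local alignment ending at X_i and Y_j (never < 0).
    L = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),                       # start a fresh local alignment
                (L[i - 1][j - 1] + s, "diag"),   # X_i aligned to Y_j
                (L[i - 1][j] - d, "gap_in_y"),   # X_i aligned to an indel
                (L[i][j - 1] - d, "gap_in_x"),   # Y_j aligned to an indel
            ]
            L[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if L[i][j] > best:
                best, best_pos = L[i][j], (i, j)

    # Trace back from the maximum entry until an entry equal to 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and L[i][j] > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "gap_in_y":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("AGTC", "ATCG"))  # (2, 'TC', 'TC')
```

For the example sequences AGTC and ATCG with the same scores as before, the best local alignment is the shared substring TC with score 2, whereas the best global alignment scored −1.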