CS, Technion

Sequence Alignment I
Lecture #2
Background Readings: The second chapter (pages 12-45) in the textbook,
Biological Sequence Analysis, Durbin et al., 2001.
This class has been edited from Nir Friedman’s lecture which is available at
www.cs.huji.ac.il/~nir.
Changes made by Dan Geiger, then Shlomo Moran.
Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: alphabet Σ of 4 letters
• Protein: alphabet Σ of 20 letters
Sequence Comparison (cont)
• Finding similarity between sequences is important for many biological questions
For example:
• Find similar proteins
  • Allows us to predict function & structure
• Locate similar subsequences in DNA
  • Allows us to identify (e.g.) regulatory elements
• Locate DNA sequences that might overlap
  • Helps in sequence assembly
Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indels)
Choosing Alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
Scoring Alignments
Intuition:
• Similar sequences evolved from a common ancestor
• Evolution changed the sequences from this ancestral sequence by mutations:
  • Replacement: one letter replaced by another
  • Deletion: deletion of a letter
  • Insertion: insertion of a letter
• Scoring of sequence similarity should examine how many and which operations took place
Simple Scoring Rule
Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2
The score of an alignment is the sum of its position scores
Example
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
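The simple scoring rule is easy to check mechanically. The following is a minimal sketch, assuming an alignment is given as two equal-length strings that use "-" for a gap; the function name `score_alignment` and its keyword parameters are illustrative and not part of the lecture.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two gaps")
        if a == "-" or b == "-":
            total += indel        # insertion or deletion
        elif a == b:
            total += match        # perfect match
        else:
            total += mismatch     # replacement
    return total


first = score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)
```

Counting columns by hand is error-prone for long alignments, so a helper of this kind is a convenient sanity check on worked examples.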
More General Scores
• The choice of the +1, -1, and -2 scores is quite arbitrary
• Depending on the context, some changes are more plausible than others
  • Exchange of an amino acid by one with similar properties (size, charge, etc.)
    vs.
  • Exchange of an amino acid by one with opposite properties
• Probabilistic interpretation: (e.g.) how likely is one alignment versus another?
Additive Scoring Rules
• We define a scoring function by specifying a function
  $\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$
  • $\sigma(x, y)$ is the score of replacing x by y
  • $\sigma(x, -)$ is the score of deleting x
  • $\sigma(-, x)$ is the score of inserting x
• The score of an alignment is the sum of position scores
The Optimal Score
• The optimal (maximal) score between two sequences is the maximal score of all alignments of these sequences, namely,
  $d(s_1, s_2) = \max_{\text{alignment of } s_1 \text{ and } s_2} \mathrm{score}(\text{alignment})$
• Computing the maximal score and actually finding an alignment that yields the maximal score are closely related tasks with similar algorithms.
• We now address these problems.
Computing Optimal Score
• How can we compute the optimal score?
• If |s| = n and |t| = m, the number A(m,n) of possible "legal" alignments is large!
  Exercise: Show that
  $\binom{m+n}{m} \le A(m,n) \le \binom{m+n}{m}^2$
• The additive form of the score allows us to use dynamic programming to compute the optimal score efficiently.
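To get a feel for how quickly the number of alignments grows, here is a small sketch that counts them. It assumes that a "legal" alignment is any sequence of columns with no column made of two gaps, and that alignments differing only in the order of adjacent gap columns count as distinct; under that assumption the count satisfies A(m,n) = A(m-1,n-1) + A(m-1,n) + A(m,n-1) with A(m,0) = A(0,n) = 1. The function name `count_alignments` is illustrative.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of alignments of sequences of lengths m and n (assumed definition)."""
    if m == 0 or n == 0:
        return 1  # only one way: every remaining letter faces a gap
    return (count_alignments(m - 1, n - 1)   # last column pairs two letters
            + count_alignments(m - 1, n)     # last column is (letter, -)
            + count_alignments(m, n - 1))    # last column is (-, letter)


for m, n in [(1, 1), (2, 2), (4, 3), (10, 10)]:
    lower = comb(m + n, m)
    print(m, n, lower, count_alignments(m, n), lower ** 2)
```

Each printed line shows the lower bound, the count, and the upper bound from the exercise, which makes it clear that enumerating all alignments is not practical even for short sequences.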
Recursive Argument
• Suppose we have two sequences: s[1..n+1] and t[1..m+1]
The best alignment must be one of three cases:
1. The last column is (s[n+1], t[m+1])
2. The last column is (s[n+1], -)
3. The last column is (-, t[m+1])
In case 1:
$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$
In case 2, where the last column is (s[n+1], -):
$d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$
In case 3, where the last column is (-, t[m+1]):
$d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$
Recursive Argument
Define the notation:
$V[i,j] = d(s[1..i], t[1..j])$
• Using our recursive argument, we get the following recurrence for V:
$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$
[Figure: the cell V[i+1,j+1] is computed from its three neighbors V[i,j], V[i,j+1] and V[i+1,j]]
Recursive Argument
• Of course, we also need to handle the base cases in the recursion:
$V[0,0] = 0$
$V[i+1,0] = V[i,0] + \sigma(s[i+1], -)$
$V[0,j+1] = V[0,j] + \sigma(-, t[j+1])$
For example, the entry -4 in the first column is the score of AA versus --.
We fill the matrix using the recurrence rule (S = AAAC labels the rows, T = AGC the columns):

S \ T        A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2
A  2   -4
A  3   -6
C  4   -8
Dynamic Programming Algorithm
We continue to fill the matrix using the recurrence rule
Dynamic Programming Algorithm
S \ T        A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1
A  2   -4
A  3   -6
C  4   -8
V[1,1] = max(V[0,0] + 1, V[0,1] - 2, V[1,0] - 2) = max(1, -4, -4) = 1
The +1 option aligns A with A; each -2 option adds an indel (-A versus A-, or A- versus -A).
Dynamic Programming Algorithm
S \ T        A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1   -1   -3
A  2   -4   -1    0
A  3   -6   -3
C  4   -8   -5
Dynamic Programming Algorithm
S \ T        A    G    C
        0    1    2    3
   0    0   -2   -4   -6
A  1   -2    1   -1   -3
A  2   -4   -1    0   -2
A  3   -6   -3   -2   -1
C  4   -8   -5   -4   -1
Conclusion: d(AAAC,AGC) = -1
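The matrix-filling procedure can be sketched in a few lines. This is a minimal illustration, assuming the simple scoring rule (match +1, mismatch -1, indel -2) in place of a general scoring function; the name `global_alignment_matrix` is illustrative, and Python strings are 0-indexed, so s[i] here plays the role of s[i+1] in the recurrence.

```python
def global_alignment_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Return the full matrix V with V[i][j] = d(s[:i], t[:j])."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against the empty string costs only indels.
    for i in range(n):
        V[i + 1][0] = V[i][0] + indel
    for j in range(m):
        V[0][j + 1] = V[0][j] + indel
    # Recurrence: each cell is the best of its three neighbors.
    for i in range(n):
        for j in range(m):
            sub = match if s[i] == t[j] else mismatch
            V[i + 1][j + 1] = max(V[i][j] + sub,         # replace / match
                                  V[i][j + 1] + indel,   # delete s[i]
                                  V[i + 1][j] + indel)   # insert t[j]
    return V


V = global_alignment_matrix("AAAC", "AGC")
for row in V:
    print(row)
```

The bottom-right entry V[n][m] is the optimal score d(s, t).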
Reconstructing the Best Alignment
• To reconstruct the best alignment, we record which case(s) in the recursive rule maximized the score
[Figure: the filled matrix for S = AAAC and T = AGC]
Reconstructing the Best Alignment
• We now trace back a path that corresponds to the best alignment
AAAC
AG-C
[Figure: the filled matrix for S = AAAC and T = AGC, with the traceback path]
Reconstructing the Best Alignment
• Sometimes, more than one alignment has the best score
AAAC    AAAC    AAAC
A-GC    -AGC    AG-C
[Figure: the filled matrix for S = AAAC and T = AGC, with the traceback paths]
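The traceback can be sketched as a walk from V[n][m] back to V[0][0] that follows every neighbor whose value explains the current cell. This is a minimal illustration under the simple scoring rule (match +1, mismatch -1, indel -2); the name `all_best_alignments` is illustrative, and instead of storing pointers the sketch re-tests the three cases during the walk.

```python
def all_best_alignments(s, t, match=1, mismatch=-1, indel=-2):
    """Enumerate every alignment of s and t that attains the optimal score."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    def walk(i, j):
        # Yield (top, bottom) rows of every optimal alignment of s[:i], t[:j].
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                for a, b in walk(i - 1, j - 1):
                    yield a + s[i - 1], b + t[j - 1]
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            for a, b in walk(i - 1, j):
                yield a + s[i - 1], b + "-"
        if j > 0 and V[i][j] == V[i][j - 1] + indel:
            for a, b in walk(i, j - 1):
                yield a + "-", b + t[j - 1]

    return sorted(walk(n, m))


for top, bottom in all_best_alignments("AAAC", "AGC"):
    print(top)
    print(bottom)
    print()
```

The number of optimal alignments can grow quickly with sequence length, so in practice one usually follows a single path rather than enumerating all of them.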
Time Complexity
Space: O(mn)
Time: O(mn)
• Filling the matrix: O(mn)
• Backtrace: O(m+n)
Space Complexity
• In real-life applications, n and m can be very large
• The space requirements of O(mn) can be too demanding
  • If m = n = 1000, we need 1MB space
  • If m = n = 10000, we need 100MB space
• We can afford to perform extra computation to save space
  • Looping over a million operations takes less than a second on modern workstations
• Can we trade time for space?
Why Do We Need So Much Space?
To compute V[n,m] = d(s[1..n], t[1..m]), we need only O(min(n,m)) space:
• Compute V[i,j] column by column, storing only two columns in memory (or row by row if rows are shorter).
Note however that
• This "trick" fails when we need to reconstruct the optimal alignment.
• Traceback information requires O(mn) memory.
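The two-row idea can be sketched as follows. This is a minimal illustration under the simple scoring rule (match +1, mismatch -1, indel -2); the name `global_score_linear_space` is illustrative. Because this scoring rule is symmetric in the two sequences, the sketch swaps them so that the stored rows run along the shorter one.

```python
def global_score_linear_space(s, t, match=1, mismatch=-1, indel=-2):
    """Optimal global score d(s, t) using O(min(len(s), len(t))) memory."""
    if len(t) > len(s):
        s, t = t, s  # keep the shorter sequence along the stored row
    prev = [j * indel for j in range(len(t) + 1)]   # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * len(t)           # first entry is the base case
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,        # diagonal neighbor
                          prev[j] + indel,          # neighbor in the previous row
                          curr[j - 1] + indel)      # neighbor in the current row
        prev = curr                                 # older rows are discarded
    return prev[len(t)]


print(global_score_linear_space("AAAC", "AGC"))
```

Only the score survives; the discarded rows are exactly the information a traceback would need.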
Space Efficient Version: Outline
Input: Sequences s[1,n] and t[1,m] to be aligned.
Idea: perform divide and conquer
• If n = 1, align s[1,1] and t[1,m]
• Else, find a position (n/2, j) at which some best alignment crosses the midpoint
• Construct alignments
  • A = s[1,n/2] vs t[1,j]
  • B = s[n/2+1,n] vs t[j+1,m]
• Return AB
Finding the Midpoint
• The score of the best alignment that goes through (n/2, j) equals:
  d(s[1,n/2],t[1,j]) + d(s[n/2+1,n],t[j+1,m])
• Thus, we need to compute these two quantities for all values of j
Finding the Midpoint (Algorithm)
Define
• F[i,j] = d(s[1,i],t[1,j])
• B[i,j] = d(s[i+1,n],t[j+1,m])
• F[i,j] + B[i,j] = score of the best alignment through (i,j)
• We compute F[i,j] as we did before
• We compute B[i,j] in exactly the same manner, going "backward" from B[n,m]
• Requires linear space complexity
Time Complexity Analysis
• Time to find a midpoint: cnm (c is a constant)
• The sizes of the recursive sub-problems are (n/2, j) and (n/2, m-j), hence
  $T(n,m) = cnm + T(n/2, j) + T(n/2, m-j)$
Lemma: $T(n,m) \le 2cnm$
Proof (by induction):
$T(n,m) \le cnm + 2c(n/2)j + 2c(n/2)(m-j) = 2cnm$.
Thus, the time complexity is linear in the size nm of the problem.
At worst, twice the cost of the regular solution.
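The divide-and-conquer scheme can be sketched as follows. This is a minimal illustration under the simple scoring rule (match +1, mismatch -1, indel -2); all function names are illustrative. The backward scores B are obtained by running the same forward pass on the reversed strings, which is one common way to realize "going backward".

```python
def last_row(s, t, match=1, mismatch=-1, indel=-2):
    """Return the list of d(s, t[:j]) for all j, using two rows of memory."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + indel, curr[j - 1] + indel)
        prev = curr
    return prev


def align_single(c, t, match, mismatch, indel):
    """Base case n = 1: best alignment of the single letter c against t."""
    if not t:
        return c, "-"
    k = t.find(c)
    sub = match if k >= 0 else mismatch
    k = max(k, 0)                      # no equal letter: pair c with t[0]
    if sub >= 2 * indel:               # pairing beats deleting c and inserting t[k]
        return "-" * k + c + "-" * (len(t) - k - 1), t
    return c + "-" * len(t), "-" + t


def linear_space_alignment(s, t, match=1, mismatch=-1, indel=-2):
    """Return (top, bottom) rows of one optimal global alignment."""
    if not s:
        return "-" * len(t), t
    if len(s) == 1:
        return align_single(s, t, match, mismatch, indel)
    mid, m = len(s) // 2, len(t)
    fwd = last_row(s[:mid], t, match, mismatch, indel)              # F[mid, j]
    bwd = last_row(s[mid:][::-1], t[::-1], match, mismatch, indel)  # B[mid, j] = bwd[m - j]
    j = max(range(m + 1), key=lambda k: fwd[k] + bwd[m - k])        # best crossing point
    a_top, a_bot = linear_space_alignment(s[:mid], t[:j], match, mismatch, indel)
    b_top, b_bot = linear_space_alignment(s[mid:], t[j:], match, mismatch, indel)
    return a_top + b_top, a_bot + b_bot


def alignment_score(top, bottom, match=1, mismatch=-1, indel=-2):
    """Score a gapped alignment column by column."""
    return sum(indel if "-" in (a, b) else match if a == b else mismatch
               for a, b in zip(top, bottom))


top, bottom = linear_space_alignment("AAAC", "AGC")
print(top)
print(bottom)
print(alignment_score(top, bottom))
```

Note that Python slicing copies substrings; a production version would pass index ranges instead, but the slices keep the sketch short.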
Local Alignment
Consider now a different question:
• Can we find similar substrings of s and t?
• Formally, given s[1..n] and t[1..m], find i, j, k, and l such that d(s[i..j],t[k..l]) is maximal
Local Alignment
• As before, we use dynamic programming
• We now want to set V[i,j] to record the score of the best alignment of a suffix of s[1..i] and a suffix of t[1..j]
• How should we change the recurrence rule?
  • Same as before, but with an option to start afresh
• The result is called the Smith-Waterman algorithm
Local Alignment
New option:
• We can start a new match instead of extending a previous alignment
$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \\ 0 \quad \text{(alignment of empty suffixes)} \end{cases}$
$V[0,0] = 0$
$V[i+1,0] = \max(0, V[i,0] + \sigma(s[i+1], -))$
$V[0,j+1] = \max(0, V[0,j] + \sigma(-, t[j+1]))$
Local Alignment Example
s = TAATA
t = TACTAA

S \ T       T   A   C   T   A   A
        0   1   2   3   4   5   6
   0    0   0   0   0   0   0   0
T  1    0
A  2    0
A  3    0
T  4    0
A  5    0
Local Alignment Example
s = TAATA
t = TACTAA

S \ T       T   A   C   T   A   A
        0   1   2   3   4   5   6
   0    0   0   0   0   0   0   0
T  1    0   1   0   0   1   0   0
A  2    0   0   2   0   0   2   1
A  3    0
T  4    0
A  5    0
Local Alignment Example
s = TAATA
t = TACTAA

S \ T       T   A   C   T   A   A
        0   1   2   3   4   5   6
   0    0   0   0   0   0   0   0
T  1    0   1   0   0   1   0   0
A  2    0   0   2   0   0   2   1
A  3    0   0   1   1   0   1   3
T  4    0   1   0   0   2   0   1
A  5    0   0   2   0   0   3   1
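The local-alignment recurrence, together with a traceback that stops at the first 0, can be sketched as follows. This is a minimal illustration under the simple scoring rule (match +1, mismatch -1, indel -2); the name `local_alignment` is illustrative. Since the indel score is negative, the base cases max(0, ...) are all 0, so the first row and column are simply left at 0. When several cells share the maximal score, the sketch reports the first one in row-major order.

```python
def local_alignment(s, t, match=1, mismatch=-1, indel=-2):
    """Return (V, best score, top row, bottom row) of one best local alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                       # start afresh
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return V, best, "".join(reversed(top)), "".join(reversed(bottom))


V, best, top, bottom = local_alignment("TAATA", "TACTAA")
for row in V:
    print(row)
print(best, top, bottom)
```

Unlike the global case, the answer is read from the maximal cell anywhere in the matrix, not from the bottom-right corner.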
Variants of Sequence Alignment
We have seen two variants of sequence alignment:
• Global alignment
• Local alignment
Other variants in the book and in tutorial time:
1. Finding the best overlap
2. Using an affine cost $\gamma(g) = -d - (g-1)e$ for gaps of length g. The -d is for introducing a gap and the -e is for continuing the gap. We used d = e = 2. We could use a smaller e.
These variants are based on the same basic idea of dynamic programming.
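The affine gap cost can be handled with three matrices instead of one. The following is a sketch of that standard three-state formulation, not something taken from these slides; it assumes match +1 and mismatch -1, and the name `affine_global_score` and the matrix names M, X, Y are illustrative. With d = e = 2 it reduces to the linear indel score used so far.

```python
NEG = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, d=2, e=2):
    """Optimal global score when a gap of length g costs d + (g - 1) * e."""
    n, m = len(s), len(t)
    # M: last column pairs two letters; X: last column is (s[i], -);
    # Y: last column is (-, t[j]).
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e       # one gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e       # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - d,   # open a new gap
                          X[i - 1][j] - e,   # extend the current gap
                          Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d,
                          Y[i][j - 1] - e,
                          X[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("AAAC", "AGC"))          # d = e = 2: linear gap cost
print(affine_global_score("AAAAC", "AC", e=1))     # cheaper gap extension
```

A smaller e makes one long gap cheaper than several short ones, which is the behavior the affine cost is meant to capture.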
Remark: Edit Distance
Instead of speaking about the score of an alignment, one often talks about an edit distance between two sequences, defined to be the "cost" of the "cheapest" set of edit operations needed to transform one sequence into the other.
• The cheapest operation is "no change"
• The next cheapest operation is "replace"
• The most expensive operation is "add space"
Our goal is now to minimize the cost of operations. With costs taken as negated scores, this is exactly what we did when maximizing the score.
Where do scoring rules come from ?
We have defined an additive scoring function by specifying a function $\sigma(\cdot, \cdot)$ such that
• $\sigma(x, y)$ is the score of replacing x by y
• $\sigma(x, -)$ is the score of deleting x
• $\sigma(-, x)$ is the score of inserting x
But how do we come up with the "correct" score?
Answer: By encoding experience of what are similar sequences for the task at hand.