L3 - UCSD CSE

Vineet Bafna
\[
S[i,j] = \max \begin{cases}
0 \\
S[i-1,j-1] + C(s_i, t_j) \\
S[i-1,j] + C(s_i, -) \\
S[i,j-1] + C(-, t_j)
\end{cases}
\]
How can we compute the local alignment itself?
[Figure: example local-alignment score matrix S, with sequence letters along the axes (T, G, C, A and T, T, C, A, A) and cell scores between 0 and 2.]
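One way to recover the local alignment itself is to remember the highest-scoring cell and trace back from it until a cell with score 0 is reached. The sketch below assumes a simple scoring scheme (match +1, mismatch -1, gap -1); these values, the function name `local_align`, and the example strings are illustrative choices and are not taken from the slides.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    n, m = len(s), len(t)
    # S[i][j] = best score of a local alignment ending at s[i], t[j] (1-based)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + c,   # s_i aligned to t_j
                          S[i - 1][j] + gap,     # s_i aligned to a gap
                          S[i][j - 1] + gap)     # t_j aligned to a gap
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j

    # Trace back from the best cell until the score drops to 0.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        c = match if s[i - 1] == t[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + c:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(local_align("TGCA", "TTCAA"))  # (2, 'CA', 'CA')
```

The traceback re-derives which of the three cases produced each cell, so no separate pointer matrix is stored.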
Gaps appear together
• It is more likely for gaps to be contiguous
• The penalty for a gap of length l should be go + ge * l
\[
S[i,j] = \max_{l} \begin{cases}
0 \\
S[i-1,j-1] + C(s_i, t_j) \\
S[i-l,j] + g_o + g_e \cdot l \\
S[i,j-l] + g_o + g_e \cdot l
\end{cases}
\]
• What is the time taken for this?
• What are the values that l can take?
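A direct implementation of the recurrence with the maximum over l may help answer both questions. In the sketch below, l ranges over 1..i for gaps in t and 1..j for gaps in s, so each cell costs O(n + m) work and the total is O(nm(n + m)). The function name and the default values are assumptions for illustration; go and ge are taken to be negative, since the recurrence adds them.

```python
def local_affine_naive(s, t, go=-2, ge=-1, match=1, mismatch=-1):
    """Best local alignment score with gap cost go + ge * l, tried for every l."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [0, S[i - 1][j - 1] + c]
            # a gap of length l that ends by deleting s[i-l+1..i]
            candidates += [S[i - l][j] + go + ge * l for l in range(1, i + 1)]
            # a gap of length l that ends by inserting t[j-l+1..j]
            candidates += [S[i][j - l] + go + ge * l for l in range(1, j + 1)]
            S[i][j] = max(candidates)
            best = max(best, S[i][j])
    return best


# With a high match reward, opening one gap of length 1 pays off:
# A-C against AGC scores 5 + 5 + (go + ge) = 7.
print(local_affine_naive("AC", "AGC", match=5, mismatch=-4))  # 7
```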
Can we get rid of the extra dimension?
• Define D[i,j]: Score of the best alignment, given that the final column is a ‘deletion’ (si is aligned to a gap)
• Define I[i,j]: Score of the best alignment, given that the final column is an insertion (tj is aligned to a gap)
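[Figure: the deletion case is an optimum alignment of s[1..i-1] and t[1..j] followed by a column with s[i] against a gap; the insertion case is an optimum alignment of s[1..i] and t[1..j-1] followed by a column with t[j] against a gap.]
\[
S[i,j] = \max \begin{cases}
0 \\
S[i-1,j-1] + C(s_i, t_j) \\
D[i,j] \\
I[i,j]
\end{cases}
\]
\[
D[i,j] = \max \begin{cases}
D[i-1,j] + g_e \\
S[i-1,j] + g_o + g_e
\end{cases}
\]
\[
I[i,j] = \max \begin{cases}
I[i,j-1] + g_e \\
S[i,j-1] + g_o + g_e
\end{cases}
\]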
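The three recurrences above can be sketched as a single O(nm) loop. The code below is a minimal illustration, assuming negative values for go and ge and a simple match/mismatch score; the function name and defaults are hypothetical. D and I are initialised to minus infinity on the borders so that a gap can only start from a cell of S.

```python
NEG_INF = float("-inf")


def local_affine(s, t, go=-2, ge=-1, match=1, mismatch=-1):
    """Best local alignment score with affine gaps, using the S, D, I tables."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: s_i vs gap
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: gap vs t_j
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if s[i - 1] == t[j - 1] else mismatch
            # extend an existing deletion, or open a new one
            D[i][j] = max(D[i - 1][j] + ge, S[i - 1][j] + go + ge)
            # extend an existing insertion, or open a new one
            I[i][j] = max(I[i][j - 1] + ge, S[i][j - 1] + go + ge)
            S[i][j] = max(0, S[i - 1][j - 1] + c, D[i][j], I[i][j])
            best = max(best, S[i][j])
    return best


print(local_affine("AC", "AGC", match=5, mismatch=-4))  # 7
```

Each cell now takes constant time, because the choice of gap length is folded into the "extend or open" decision.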
\[
S[i,j] = \max \begin{cases}
S[i-1,j-1] + C(s_i, t_j) \\
S[i-1,j] + C(s_i, -) \\
S[i,j-1] + C(-, t_j)
\end{cases}
\]
• How much space do we need?
• Is the space requirement too much?
Fig. 1. Dot-plot representation of sample assembly comparison results
Istrail, Sorin et al. (2004) Proc. Natl. Acad. Sci. USA 101, 1916-1921
Copyright ©2004 by the National Academy of Sciences
• Score computation
\[
S[i,j] = \max \begin{cases}
S[i-1,j-1] + C(s_i, t_j) \\
S[i-1,j] + C(s_i, -) \\
S[i,j-1] + C(-, t_j)
\end{cases}
\]
For i = 1 to n
    For j = 1 to m
        i2 = i%2; i1 = (i-1)%2;
        S[i2, j] = max { S[i1, j-1] + C(si, tj),
                         S[i1, j] + C(si, -),
                         S[i2, j-1] + C(-, tj) }
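The two-row loop above can be written out as follows. This is a sketch for the global alignment score only, with an assumed scoring scheme (match +1, mismatch -1, gap -1) and a hypothetical function name; the row index is taken modulo 2 exactly as in the pseudocode.

```python
def global_score_two_rows(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score of s and t, keeping only two rows of S."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(2)]
    # Row 0: the empty prefix of s aligned to t[1..j] costs j gaps.
    S[0] = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        i2, i1 = i % 2, (i - 1) % 2        # current row, previous row
        S[i2][0] = i * gap                 # s[1..i] aligned to the empty prefix
        for j in range(1, m + 1):
            c = match if s[i - 1] == t[j - 1] else mismatch
            S[i2][j] = max(S[i1][j - 1] + c,
                           S[i1][j] + gap,
                           S[i2][j - 1] + gap)
    return S[n % 2][m]


print(global_score_two_rows("AC", "AGC"))  # 1
```

Only the score survives; the rows needed for a traceback have been overwritten.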
• In linear space, we can compute each row of the D.P. table
• We need to compute the optimum path from the origin (0,0) to (m,n)
• At i=n/2, we know scores of all the optimal paths ending at that row.
• Define F[j] = S[n/2,j]
• One of these j is on the true path. Which one?
• Let Sb[i,j] be the optimal score of aligning s[i+1..n] with t[j+1..m]
\[
S_b[i,j] = \max \begin{cases}
S_b[i+1,j+1] + C(s_{i+1}, t_{j+1}) \\
S_b[i+1,j] + C(s_{i+1}, -) \\
S_b[i,j+1] + C(-, t_{j+1})
\end{cases}
\]
• Boundary cases?
• Sb[n,j]? Sb[i,m]?
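A small sketch may make the boundary cases concrete. Under the stated definition, Sb[n,j] aligns the empty suffix of s with t[j+1..m], so it is a run of m-j gaps, and Sb[i,m] is likewise a run of n-i gaps. The code fills the whole table for clarity and assumes match +1, mismatch -1, gap -1; the function name is hypothetical. With 0-based Python strings, `s[i]` is the 1-based character s[i+1], the first character of the suffix.

```python
def backward_scores(s, t, match=1, mismatch=-1, gap=-1):
    """Sb[i][j] = optimal global score of aligning s[i+1..n] with t[j+1..m]."""
    n, m = len(s), len(t)
    Sb = [[0] * (m + 1) for _ in range(n + 1)]   # Sb[n][m] = 0: nothing left
    for j in range(m - 1, -1, -1):
        Sb[n][j] = Sb[n][j + 1] + gap            # only t[j+1..m] remains
    for i in range(n - 1, -1, -1):
        Sb[i][m] = Sb[i + 1][m] + gap            # only s[i+1..n] remains
        for j in range(m - 1, -1, -1):
            c = match if s[i] == t[j] else mismatch
            Sb[i][j] = max(Sb[i + 1][j + 1] + c,
                           Sb[i + 1][j] + gap,
                           Sb[i][j + 1] + gap)
    return Sb


Sb = backward_scores("AC", "AGC")
print(Sb[0][0])  # 1, the same as the forward score S[n][m]
```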
• Let Sb[i,j] be the optimal score of aligning s[i+1..n] with t[j+1..m]
• Define B[j] = Sb[n/2,j]
• One of these j is on the true path. Which one?
• At the optimal coordinate j,
  – F[j]+B[j]=S[n,m]
• In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path.
• Align(1..n,1..m)
  – For all 1<=j<=m
    • Compute F[j]=S(n/2,j)
  – For all 1<=j<=m
    • Compute B[j]=Sb(n/2,j)
  – j* = argmax_j {F[j]+B[j]}
  – X = Align(1..n/2,1..j*)
  – Y = Align(n/2+1..n,j*+1..m)
  – Return X,j*,Y
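The Align procedure above can be sketched in Python as follows. This divide-and-conquer scheme is commonly attributed to Hirschberg. The sketch assumes a linear gap cost with match +1, mismatch -1, gap -1; the helper names `last_row` and `score` are hypothetical, and B is obtained by running the forward pass on the reversed strings, which is valid for this symmetric scoring. String slices are used for brevity, so the sketch is not tuned to the exact O(m) space bound.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-1):
    """Scores of aligning all of s with every prefix of t, using two rows."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            c = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + c, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev


def align(s, t, match=1, mismatch=-1, gap=-1):
    """Return (aligned s, aligned t) for an optimal global alignment."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Base case: pair the single character with its best partner in t,
        # unless leaving it against a gap is cheaper.
        k = max(range(len(t)), key=lambda j: match if t[j] == s else mismatch)
        c = match if t[k] == s else mismatch
        if c >= 2 * gap:
            return "-" * k + s + "-" * (len(t) - k - 1), t
        return s + "-" * len(t), "-" + t
    mid, m = len(s) // 2, len(t)
    F = last_row(s[:mid], t, match, mismatch, gap)
    # B_rev[k] = score of aligning s[mid:] with the last k characters of t
    B_rev = last_row(s[mid:][::-1], t[::-1], match, mismatch, gap)
    j_star = max(range(m + 1), key=lambda j: F[j] + B_rev[m - j])
    xs, xt = align(s[:mid], t[:j_star], match, mismatch, gap)
    ys, yt = align(s[mid:], t[j_star:], match, mismatch, gap)
    return xs + ys, xt + yt


def score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of an explicit alignment given as two equal-length gapped strings."""
    return sum(gap if "-" in (x, y) else match if x == y else mismatch
               for x, y in zip(a, b))


a, b = align("AGTACGCA", "TATGC")
print(a)
print(b)
print(score(a, b) == last_row("AGTACGCA", "TATGC")[-1])  # True
```

The final line checks the recovered alignment against the score from the plain two-row pass, which is a convenient sanity test when changing the scoring parameters.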
• T(nm) = c.nm + T(nm/2) = O(nm)
• Space = O(m)
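One way to see why the recurrence solves to O(nm): the two recursive calls cover (n/2)·j* and (n/2)·(m - j*) cells, which total nm/2 whatever the value of j*. Unrolling the recurrence then gives a geometric series, with c the constant from the recurrence above.

```latex
\begin{align*}
\frac{n}{2}\, j^* + \frac{n}{2}\,(m - j^*) &= \frac{nm}{2} \\
T(nm) &= c\,nm + T\!\left(\frac{nm}{2}\right)
       = c\,nm \left(1 + \frac{1}{2} + \frac{1}{4} + \cdots \right)
       \le 2c\,nm = O(nm)
\end{align*}
```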
• Name an early Bioinformatics Researcher (preferably dead)
• Name an early Bioinformatics Researcher who is a woman
• We have seen that affine gap penalties help concentrate the gaps in small regions.
• What about substitution errors? Are all substitutions alike?
• Scoring protein sequence alignments is a much more complex task than scoring DNA
  – Not all substitutions are equal
• Problem was first worked on by Pauling and collaborators
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – “One size does not fit all”
  – Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
  – Different proteins might evolve at different rates and we need to normalize for that
DNA has structure.
• So far, we considered a simple match/mismatch criterion.
• The nucleotides can be grouped into Purines (A,G) and Pyrimidines (C,T).
• Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)
• Transversions are more heavily penalized than transitions.
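A scoring function C that distinguishes the two kinds of substitution might look like the sketch below. The numeric scores are placeholders chosen only so that a transversion costs more than a transition; they are not values from the slides, and the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a DNA column x/y, penalizing transversions more than transitions."""
    if x == y:
        return match
    # A substitution within one group (A<->G or C<->T) is a transition.
    same_group = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_group else transversion


print(substitution_score("A", "G"))  # -1 (transition)
print(substitution_score("A", "C"))  # -2 (transversion)
```

A function like this can stand in for C(si, tj) in any of the recurrences without changing the dynamic program.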
• Suppose we are searching with a mouse protein.
• Blast returns proteins ranked by score
  – Top hit is to human
  – Somewhere below is Drosophila
  – Which one will you trust?
[Figure labels: hum, mus, dros]
