Comp. Genomics
Recitation 4
Multiple Sequence Alignment or
Computational Complexity –
What is it good for?
Outline
• MSA is very expensive
• Speedup – Carrillo-Lipman
• Approximation algorithms
• Heuristics
MSA is expensive
• The running time of DP MSA is $O(2^N L^N)$ for N sequences of length L
• E.g., to align 10 proteins of 50 residues each we need $10^{20}$ operations
• If one operation takes one microsecond, it takes about 3 million years ($10^{14}$ seconds) to align these 10 proteins
• How do we obtain practical algorithms?
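The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `dp_msa_operations` is made up for illustration, the one-microsecond cost per operation is the slide's assumption, and constant factors are ignored.

```python
def dp_msa_operations(num_seqs: int, length: int) -> int:
    """Rough operation count 2^N * L^N for DP over an N-dimensional lattice."""
    return (2 ** num_seqs) * (length ** num_seqs)

SECONDS_PER_YEAR = 365.25 * 24 * 3600

ops = dp_msa_operations(10, 50)   # 10 proteins of 50 residues each
seconds = ops * 1e-6              # assumed cost: one microsecond per operation
years = seconds / SECONDS_PER_YEAR
print(f"{ops:.2e} operations, about {years:.2e} years")
```

Adding one more sequence multiplies the count by 2L, which is why exact DP stops being usable after a handful of sequences.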
Exercise 1
• Definitions:
• $D(x_i, x_j)$ - the cost of the optimal pairwise alignment between the i-th and j-th sequences
• $A^*$ - the optimal SOP MSA ($c(A^*)$ = its cost)
• $A^*_{i,j}$ - the projection of the optimal SOP MSA on the ij plane ($c(A^*_{i,j})$ = its cost)
• $c'$ = an upper bound on $c(A^*)$
• Given D and $c'$, find an upper bound on $c(A^*_{u,v})$.
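The Carrillo-Lipman bound
• A bound on the cost of the optimal MSA: $c' \ge c(A^*)$
• The cost of the optimal MSA is the sum of the costs of its projections on the ij planes: $c(A^*) = \sum_{i<j} c(A^*_{i,j})$
• Break the sum: $c(A^*) = c(A^*_{u,v}) + \sum_{i<j,\,(i,j) \ne (u,v)} c(A^*_{i,j})$
• The optimal pairwise alignment is better than any other: $c(A^*_{i,j}) \ge D(x_i, x_j)$
• Hence $c(A^*_{u,v}) \le c' - \sum_{i<j,\,(i,j) \ne (u,v)} D(x_i, x_j)$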
The Carrillo-Lipman bound
• Which cells can we cancel?
• Those whose projection on any 2-dimensional plane falls in a cell such that the optimal 2-dimensional path through that cell costs more than the bound
Exercise 2
• We are building an MSA of x, y and z, of sizes n, m, and l:
• Prefixes: D(x[1..i], y[1..j]) = 5, D(x[1..i], z[1..k]) = 7, D(y[1..j], z[1..k]) = 3
• Suffixes: D(x[i+1..n], y[j+1..m]) = 3, D(x[i+1..n], z[k+1..l]) = 7, D(y[j+1..m], z[k+1..l]) = 4
• Carrillo-Lipman gave us: $c(A^*_{x,y}) \le 13$, $c(A^*_{x,z}) \le 14$, $c(A^*_{y,z}) \le 15$ (we assume lower scores are better)
• True or false: the cell (i,j,k) needs to be considered in the MSA
Solution
• The cost of the optimal path through (i,j) on the xy plane is 5 + 3 = 8
• The Carrillo-Lipman bound is 13, so the projection to the xy plane does not cancel the cell
• Similarly, the other bounds do not cancel the cell: 7 + 7 = 14 ≤ 14 on the xz plane and 3 + 4 = 7 ≤ 15 on the yz plane
• The claim is true
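The check in this solution can be written as a short Python sketch. The function name `cell_survives` and the dictionary layout are illustrative assumptions; the numbers are the ones given in Exercise 2, and lower scores are taken to be better.

```python
from typing import Dict, Tuple

Plane = Tuple[str, str]

def cell_survives(prefix_cost: Dict[Plane, int],
                  suffix_cost: Dict[Plane, int],
                  bound: Dict[Plane, int]) -> bool:
    """Return True if no pairwise projection cancels the cell."""
    for plane, limit in bound.items():
        # Best pairwise path forced through the projected cell:
        # optimal prefix alignment plus optimal suffix alignment.
        best_through_cell = prefix_cost[plane] + suffix_cost[plane]
        if best_through_cell > limit:
            return False  # this plane rules the cell out
    return True

prefix = {("x", "y"): 5, ("x", "z"): 7, ("y", "z"): 3}
suffix = {("x", "y"): 3, ("x", "z"): 7, ("y", "z"): 4}
bound = {("x", "y"): 13, ("x", "z"): 14, ("y", "z"): 15}

keep = cell_survives(prefix, suffix, bound)
print(keep)
```

The xz plane is the tight one here: lowering its bound from 14 to 13 would be enough to cancel the cell.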
Approximation algorithms
• Carrillo-Lipman is still impractical for many long sequences
• Hence, our goal is to obtain faster algorithms
• Approximation algorithms guarantee a solution within a certain factor of OPT
• Good: constant factor algorithms (e.g., 1.785
approximation)
• Worse: approximation ratio dependent on the
input size (e.g., log(n))
• Not always good empirically, but important for
inspiring good heuristics
Approximation ratio
• Cost (score) of OPT is c(OPT)
• Cost (score) of ALG is c(ALG)
• Approximation ratio: c(ALG)/c(OPT)
• The ratio has to be maintained for any legal input
Reminder: SP MSA
• Input: strings $S_1, \dots, S_k$ of length n.
• d(i,j) – the distance between $S_i$ and $S_j$ as induced by the MSA
• Sum-of-pairs (SP) score: $\sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d(i,j)$
• Goal: find the MSA with minimum SP score
• We'll look for minimal scores (lower is better)
The center-star algorithm
• Assumptions:
1. The triangle inequality holds
2. σ(-,-) = 0
3. σ(x,y) = σ(y,x)
• Input: strings $S_1, \dots, S_k$.
• The algorithm:
1. Find the string $S^*$ that minimizes $\sum_{S \in \{S_1, \dots, S_k\} \setminus \{S^*\}} D(S^*, S)$
2. Iteratively add all other sequences to the alignment
• Running time: $O(k^2 n^2)$
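Step 1 of the algorithm above (choosing the center) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are made up, and unit costs (1 per mismatch, 1 per gap) are an assumption chosen because they satisfy the triangle inequality. Step 2, the merging of the pairwise alignments, is left out.

```python
from typing import Sequence

def alignment_cost(s: str, t: str, mismatch: int = 1, gap: int = 1) -> int:
    """D(s, t): minimum global alignment cost, computed row by row."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def choose_center(seqs: Sequence[str]) -> int:
    """Index of the string with the smallest total distance to all others."""
    # D(s, s) = 0, so including the string itself does not change the sum.
    totals = [sum(alignment_cost(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)

seqs = ["ACGT", "ACGA", "TCGT", "ACGTT"]  # toy input, for illustration only
center = choose_center(seqs)
print(center, seqs[center])
```

The double loop over all pairs is where the $O(k^2 n^2)$ running time comes from: $O(k^2)$ pairwise DP computations of $O(n^2)$ each.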
Exercise 3
Find the approximation ratio of the center-star algorithm.
Use the following definitions:
• M* - an optimal alignment
• M - the alignment produced by this algorithm
• d(i,j) - the distance M induces on the pair $S_i, S_j$
• $v(M) = \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} d(i,j)$
• D(S,T) – min cost of alignment between S and T
• For all i: $d(1,i) = D(S_1, S_i)$, taking $S_1$ to be the center string (we perform an optimal alignment between $S'_1$ and $S_i$, and σ(-,-) = 0)
Solution
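By the triangle inequality:
$$v(M) = \sum_{i=1}^{k} \sum_{j \ne i} d(i,j) \le \sum_{i=1}^{k} \sum_{j \ne i} \left[ d(1,i) + d(1,j) \right] = 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1, S_l)$$
Since every induced pairwise alignment costs at least the optimal one, and $S_1$ minimizes the sum of distances to all other strings:
$$v(M^*) = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) = k \sum_{j=2}^{k} D(S_1, S_j)$$
Therefore:
$$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$$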
Randomized approximation algorithms
• Now we want to reduce the $O(k^2 n^2)$ time further
• Randomized algorithms perform well for most of their random choices, and thus have good performance on average
• Approximation ratio of a randomized algorithm: E(c(ALG)/c(OPT))
• The expectation is over the choices of the algorithm (coin tosses)
Center-star reminder
• Algorithm:
• Compute the sequence “closest-to-all-others”
• Align all sequences to it
• Approximation:
• (2 − 2/k) approximation to the optimal sum-of-pairs (SP) alignment
• Complexity:
• $O(k^2)$ pairwise alignments to choose the center star: $O(k^2 n^2)$ – bottleneck!
• $O(k)$ pairwise alignments to construct the MSA: $O(kn^2)$
Random-star
• What if instead of picking the best sequence as
the starting point, we pick a random sequence
from the group?
• Ex4: Show that for any r, we can construct an example (choose cost function, sequences and number of sequences) such that the algorithm can be more than r times worse than OPT: $d(M_b)/d(M_{opt}) > r$
Solution
• Bad sequence: CCC…C (k letters)
• Other sequences: k copies of AGAGAG…AG and k copies of GAGAGA…GA (k letters each)
• Costs: 1 for gap and 1 for mismatch
Solution
• c(ALG), with the bad sequence chosen as the center:
• $k^2$ couples with k mismatches (AGAG vs GAGA).
• 2k couples with k mismatches (CCCC vs AGAG/GAGA).
• $d(M_b) = k^3 + 2k^2$.
• c(OPT):
• We have $k^2$ couples with 2 gaps (AGAG vs GAGA).
• k couples with k mismatches (CCCC vs AGAG).
• k couples with k-1 mismatches and 2 gaps (CCCC vs GAGA).
• $d(M_{opt}) \le 4k^2 + k$ (the cost of this alignment)
• Ratio: $d(M_b)/d(M_{opt}) \ge (k^3 + 2k^2) / (4k^2 + k) = \Omega(k)$
• Choose k large enough…
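To see how large k has to be for a given r, the two cost formulas from this solution can be evaluated directly. The Python sketch below only plugs numbers into $k^3 + 2k^2$ and $4k^2 + k$; the function names are illustrative, and no alignment is actually computed.

```python
from fractions import Fraction

def bad_center_cost(k: int) -> int:
    """Cost of the alignment built around the bad center CCC...C."""
    return k ** 3 + 2 * k ** 2

def good_alignment_cost(k: int) -> int:
    """Cost of the shifted alignment used to bound OPT from above."""
    return 4 * k ** 2 + k

def ratio_lower_bound(k: int) -> Fraction:
    """Exact value of (k^3 + 2k^2) / (4k^2 + k)."""
    return Fraction(bad_center_cost(k), good_alignment_cost(k))

def smallest_k_exceeding(r: int) -> int:
    """Smallest k for which the ratio bound is strictly larger than r."""
    k = 1
    while ratio_lower_bound(k) <= r:
        k += 1
    return k

for k in (10, 100, 1000):
    print(k, float(ratio_lower_bound(k)))
print(smallest_k_exceeding(10))
```

The ratio grows roughly like k/4, so any target r is exceeded once k is a little above 4r.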
Random-star
• Apparently the idea is not "that bad"
• Ex5: Show that if we choose the starting sequence from the k-size group uniformly at random, we get an expected 2-approximation: $E(d(M_b)/d(M_{opt})) \le 2$
Solution
$$E\left(\frac{d(M_b)}{d(M_{opt})}\right) = \sum_b \frac{d(M_b)}{d(M_{opt})} P(M_b) = \frac{1}{k} \cdot \frac{\sum_b d(M_b)}{d(M_{opt})}$$
$$d(M_{opt}) \ge \sum_i \sum_{j \ne i} D(S_i, S_j)$$
$$\sum_b d(M_b) = \sum_b \sum_i \sum_{j \ne i} d_b(S_i, S_j) \le \sum_b \sum_i \sum_{j \ne i} \left[ d_b(S_i, S_b) + d_b(S_b, S_j) \right] = \sum_b \left[ 2(k-1) \sum_{i \ne b} d_b(S_i, S_b) \right] = 2(k-1) \sum_b \sum_{i \ne b} D(S_i, S_b)$$
$$E\left(\frac{d(M_b)}{d(M_{opt})}\right) \le \frac{2(k-1) \sum_i \sum_{j \ne i} D(S_i, S_j)}{k \sum_i \sum_{j \ne i} D(S_i, S_j)} = \frac{2(k-1)}{k} < 2$$
Random-star
• Ex6: Use this idea for an $O(kn^2)$ algorithm!
• Solution: choose a sequence at random and proceed as in center-star
• Complexity:
• No need for the initial pairwise alignments
• All subsequent alignments can be implemented in $O(kn^2)$
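A minimal Python sketch of random-star is shown below, assuming unit costs (1 per mismatch, 1 per gap). All names (`align_pair`, `add_to_star`, `random_star`) are hypothetical. The sketch favors clarity over speed: the k-1 pairwise alignments cost $O(kn^2)$ as stated above, but the merge step copies whole rows and is not tuned to that bound.

```python
import random
from typing import List, Optional, Tuple

GAP = "-"

def align_pair(s: str, t: str, mismatch: int = 1, gap: int = 1) -> Tuple[str, str]:
    """Optimal global alignment of s and t (minimum cost), with traceback."""
    n, m = len(s), len(t)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + gap, cost[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def add_to_star(rows: List[str], center_aln: str, new_aln: str) -> List[str]:
    """Merge a pairwise alignment (center vs. new) into the MSA; rows[0] is the center."""
    merged: List[List[str]] = [[] for _ in range(len(rows) + 1)]
    center_row, p, q = rows[0], 0, 0
    while p < len(center_row) or q < len(center_aln):
        msa_gap = p < len(center_row) and center_row[p] == GAP
        new_gap = q < len(center_aln) and center_aln[q] == GAP
        if msa_gap and not new_gap:
            # Column only in the MSA: the new sequence gets a gap.
            for r, row in enumerate(rows):
                merged[r].append(row[p])
            merged[-1].append(GAP); p += 1
        elif new_gap and not msa_gap:
            # Column only in the pairwise alignment: old rows get a gap.
            for r in range(len(rows)):
                merged[r].append(GAP)
            merged[-1].append(new_aln[q]); q += 1
        else:
            # Same center character (or a gap in both): share the column.
            for r, row in enumerate(rows):
                merged[r].append(row[p])
            merged[-1].append(new_aln[q]); p += 1; q += 1
    return ["".join(col) for col in merged]

def random_star(seqs: List[str], rng: Optional[random.Random] = None) -> List[str]:
    """Star alignment around a uniformly random center; rows follow the input order."""
    rng = rng or random.Random()
    center = rng.randrange(len(seqs))
    rows, order = [seqs[center]], [center]
    for idx, s in enumerate(seqs):
        if idx != center:
            center_aln, new_aln = align_pair(seqs[center], s)
            rows = add_to_star(rows, center_aln, new_aln)
            order.append(idx)
    result = [""] * len(seqs)
    for row, idx in zip(rows, order):
        result[idx] = row
    return result

for row in random_star(["ACGT", "AGT", "ACGGT", "CGT"], random.Random(0)):
    print(row)
```

Passing a seeded `random.Random` makes the choice of center reproducible; without it, a different center may be picked on every run.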
Progressive alignment
• A heuristic method for finding MSA
• There is no analytic bound on accuracy
• Development is guided by empirical
results
• Examples: Feng-Doolittle, CLUSTALW
Progressive alignment
• General algorithm:
1. Globally align every pair of sequences
2. Use the alignment scores to construct a
“guide tree”
3. Combine alignments (sequence-sequence, sequence-profile, profile-profile) according to the guide tree
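Step 2 can be illustrated with a small Python sketch that turns a pairwise distance matrix into a guide tree. The slides do not say which tree-building method is used, so average-linkage clustering is used here purely as a simple stand-in; real tools may use other methods. The function name `guide_tree` and the distance values (unit-cost distances between the toy sequences WLW, WNW and FRF) are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple, Union

Tree = Union[int, tuple]  # a leaf index or a pair of subtrees

def guide_tree(dist: List[List[float]]) -> Tree:
    """Average-linkage clustering; leaves are sequence indices."""
    # cluster id -> (subtree, list of member leaves)
    clusters: Dict[int, Tuple[Tree, List[int]]] = {
        i: (i, [i]) for i in range(len(dist))
    }
    next_id = len(dist)

    def average_distance(a: int, b: int) -> float:
        members_a, members_b = clusters[a][1], clusters[b][1]
        total = sum(dist[x][y] for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Merge the two closest clusters; their alignments would be combined first.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: average_distance(*pair))
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1

    tree, _members = next(iter(clusters.values()))
    return tree

# Illustrative distances for the sequences WLW (0), WNW (1), FRF (2).
dist = [
    [0, 1, 3],
    [1, 0, 3],
    [3, 3, 0],
]
tree = guide_tree(dist)
print(tree)
```

Reading the nested tuples from the inside out gives the merge order for step 3: the innermost pair is aligned first, then the remaining sequence is aligned to that result.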
Progressive alignment
[Figure: progressive alignment example with the sequences WLW, WNW and FRF, showing the gapped rows WL-W, W-NW and F-RF]