Bioinformatics Algorithms and
Data Structures
Chapter 14.6-8: Multiple Alignment
Lecturer: Dr. Rose
Slides by: Dr. Rose
February 28, 2003
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Sum-of-Pairs
Defn. The sum-of-pairs (SP) score of a multiple
alignment is the sum of the scores of all the pairwise
global alignments that it induces.
From the previous example:
1  A A T - G G T T T
2  A A - C G T T A T
3  T A T C G - A A T
SP = 4 + 5 + 4 = 13
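The arithmetic above can be checked with a short sketch. It assumes a unit-cost scheme that the slide does not state explicitly (match = 0, mismatch = 1, space = 1, and two opposing spaces cost 0); `pair_cost`, `induced_score`, and `sp_score` are illustrative names, and the three rows are the example alignment written out as strings.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, space=1):
    """Cost of one aligned pair (assumed unit costs; match or space/space = 0)."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return space
    return mismatch

def induced_score(row_a, row_b):
    """Score of the pairwise alignment induced by two rows of the alignment."""
    return sum(pair_cost(a, b) for a, b in zip(row_a, row_b))

def sp_score(rows):
    """Sum-of-pairs score: sum of all induced pairwise scores."""
    return sum(induced_score(a, b) for a, b in combinations(rows, 2))

alignment = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
print([induced_score(a, b) for a, b in combinations(alignment, 2)])  # [4, 5, 4]
print(sp_score(alignment))  # 13
```

Under these assumed costs the three induced pairwise scores are 4, 5, and 4, which agrees with the SP value on the slide.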
Q: What theoretical justification is there for adopting
the SP score?
Wait for response…..
A: None. Or rather none more than for any other
multiple alignment scoring scheme.
In practice it is a good heuristic and is popular.
Q: How can we compute a global alignment M with
minimum sum-of-pairs score?
A: Why, dynamic programming, of course!
Assuming that we want to align k strings
Q: What time complexity for the DP solution?
A: Θ(n^k) for k strings of length n. Exact SP alignment has been shown to be
NP-complete.
Q: So what should we do?
A: Choose a small k.
In practice, the NP-completeness of a problem often
does not mean that the sky is falling.
Q: How will k affect the recurrence relation?
The recurrence relation for k = 3 strings, where i, j, and k now index
positions in S1, S2, and S3, is:
D(i, j, k) = min[
D(i - 1, j - 1, k - 1) + ?,
D(i - 1, j - 1, k) + ?,
D(i - 1, j, k - 1) + ?,
D(i, j - 1, k - 1) + ?,
D(i - 1, j, k) + ?,
D(i, j - 1, k) + ?,
D(i, j, k - 1) + ?]
Let’s consider each term of the recurrence in turn:
1. D(i - 1, j - 1, k - 1) is the diagonal cell in all three dimensions.
Q: What should be the SP transition cost for D(i - 1, j - 1, k - 1) → D(i, j, k)?
Recall for k = 2: if S1(i) = S2(j) the cost is the match cost;
otherwise S1(i) ≠ S2(j) and we incur the mismatch cost.
A: The sum of the three pairwise match comparisons, i.e., ij, jk, ik.
Let m(i, j) denote the pairwise character match
function defined as:
m(i, j) = matchCost if the characters S1(i) and S2(j) match
m(i, j) = mismatchCost if they mismatch
Here m(j, k) compares S2(j) with S3(k), and m(i, k) compares S1(i) with S3(k).
Then the SP transition cost for D(i - 1, j - 1, k - 1) → D(i, j, k)
is m(i, j) + m(j, k) + m(i, k).
Hence the term cost is:
1. D(i - 1, j - 1, k - 1) + m(i, j) + m(j, k) + m(i, k)
The next term:
2. D(i - 1, j - 1, k) is the diagonal cell in the first two dimensions.
Q: What should be the SP transition cost for D(i - 1, j - 1, k) → D(i, j, k)?
We have two types of cases to consider:
1. The pairwise diagonal case: (i - 1, j - 1) → (i, j)
2. The two pairwise space insertion cases:
(i - 1, k) → (i, k) and (j - 1, k) → (j, k)
The cost will be the sum of the pairwise match and
space insertion costs.
1. m(i, j) for (i - 1, j - 1) → (i, j), and
2. spacecost for (i - 1, k) → (i, k) and spacecost for (j - 1, k) → (j, k).
Then the SP transition cost for D(i - 1, j - 1, k) → D(i, j, k)
is m(i, j) + 2 * spacecost.
Hence the term cost is:
2. D(i - 1, j - 1, k) + m(i, j) + 2 * spacecost
Similarly, the third and fourth term costs are:
3. D(i - 1, j, k - 1) + m(i, k) + 2 * spacecost
4. D(i, j - 1, k - 1) + m(j, k) + 2 * spacecost
Note the similarity in the fifth, sixth, and seventh
terms:
5. D(i - 1, j, k) + ?
6. D(i, j - 1, k) + ?
7. D(i, j, k - 1) + ?
Q: What should be the cost for transitions from them?
For D(i -1, j , k) we have two types of cases to
consider:
1. The pairwise no-change case: (j, k) → (j, k)
2. The two pairwise space insertion cases:
(i - 1, j) → (i, j) and (i - 1, k) → (i, k)
Then the SP transition cost for D(i - 1, j, k) → D(i, j, k)
is 0 + 2 * spacecost.
Hence the term cost is:
5. D(i - 1, j, k) + 2 * spacecost
Similarly, the sixth and seventh term costs are:
6. D(i, j - 1, k) + 2 * spacecost
7. D(i, j, k - 1) + 2 * spacecost
Hence D(i, j, k) = min[
D(i - 1, j - 1, k - 1) + m(i, j) + m(j, k) + m(i, k),
D(i - 1, j - 1, k) + m(i, j) + 2 * spacecost,
D(i - 1, j, k - 1) + m(i, k) + 2 * spacecost,
D(i, j - 1, k - 1) + m(j, k) + 2 * spacecost,
D(i - 1, j, k) + 2 * spacecost,
D(i, j - 1, k) + 2 * spacecost,
D(i, j, k - 1) + 2 * spacecost]
Q: What about the boundary cells on the 3 faces of the
table?
1. D(i, j, 0)
2. D(i, 0, k)
3. D(0, j, k)
Observation: Each case degenerates into the familiar
two-string alignment distance + space costs for
the empty string argument.
Approach: represent these cases in terms of pair-wise
distance + space costs.
Let D_{1,2}(i, j) denote the pairwise distance between
S1[1..i] and S2[1..j]. D_{1,3}(i, k) and D_{2,3}(j, k) are
analogously defined.
Consider D(i, j, 0):
D(i, j, 0) = D_{1,2}(i, j) + ? * spaceCost
Q: What is the space cost, i.e., how many spaces?
A: i for S1 and j for S2, since every character of S1[1..i] and S2[1..j]
is aligned against a space in the empty prefix of S3. Hence:
D(i, j, 0) = D_{1,2}(i, j) + (i + j) * spaceCost
By this argument, the boundary cells are given by:
1. D(i, j, 0) = D_{1,2}(i, j) + (i + j) * spaceCost
2. D(i, 0, k) = D_{1,3}(i, k) + (i + k) * spaceCost
3. D(0, j, k) = D_{2,3}(j, k) + (j + k) * spaceCost
4. D(0, 0, 0) = 0
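The seven-term recurrence and its boundary cells can be sketched as one backward DP. This is a minimal sketch, not the lecture's implementation: it assumes matchCost = 0, mismatchCost = 1, and spaceCost = 1, and the names `pair_cost` and `sp_distance3` are illustrative. Each of the seven moves contributes one alignment column (a character where an index advances, a space where it does not), and the column is scored by summing its three pairs, which reproduces the term costs derived above. Predecessors outside the table are skipped, which yields the boundary formulas without special cases.

```python
from itertools import product

def pair_cost(a, b, mismatch=1, space=1):
    """Cost of one aligned pair (assumed unit costs; match or space/space = 0)."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return space
    return mismatch

def sp_distance3(s1, s2, s3):
    """Minimum SP distance of a global alignment of three strings."""
    strs = (s1, s2, s3)
    n1, n2, n3 = len(s1), len(s2), len(s3)
    # The seven predecessor offsets: every non-zero vector in {0,1}^3.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    inf = float("inf")
    D = [[[inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # Lexicographic order guarantees every predecessor is filled in first.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue  # predecessor lies outside the table
            # Column added by this move: a character where the index
            # advances, a space where it does not.
            col = [s[p] if d else "-"
                   for s, p, d in zip(strs, (pi, pj, pk), (di, dj, dk))]
            step = (pair_cost(col[0], col[1]) + pair_cost(col[0], col[2])
                    + pair_cost(col[1], col[2]))
            D[i][j][k] = min(D[i][j][k], D[pi][pj][pk] + step)
    return D[n1][n2][n3]

# Boundary check: D(2, 2, 0) = D_{1,2}(2, 2) + (2 + 2) * spaceCost = 1 + 4
print(sp_distance3("AC", "AG", ""))  # 5
```

The table has (n1 + 1)(n2 + 1)(n3 + 1) cells and each cell looks back at seven neighbors, which is the Θ(n^k) behavior for k = 3.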
Sum-of-Pairs Speedup
Q: How can we speed up our DP approach?
A: Use forward dynamic programming.
Note: so far we have used backward dynamic
programming, i.e., cell (i, j, k) looks back to the
seven cells that can influence its value.
In contrast: forward DP sends the result of cell (i, j, k)
forward to the seven cells whose value it could
influence.
Q: How does this speed things up?
A: It doesn’t, if we always send cell (i, j, k)’s value
forward.
The only significant way to speed up the Θ(n^k) computation is to
avoid computing all n^k cells in the DP table.
We will use forward DP to reduce the number of cells
that we compute in the DP table.
Let’s rethink this problem:
• View the optimal alignment problem as the shortest
path through the weighted edit distance graph.
• We are looking for the shortest path from (0,0,0) to
(n,n,n).
• When node (i, j, k) is computed, we have the shortest
path from (0,0,0) to (i, j, k).
• The value of node (i, j, k) is sent forward to the seven
neighboring nodes that it can influence.
Let w be a node reached by an outgoing edge from (i, j, k).
• The true shortest distance from (0,0,0) to w is the
value computed after w has been updated by
every node with an edge into it.
• A queue is used to order the nodes for processing.
• The final shortest distance for the node v at the
head of the queue is set and node v is removed.
• Every neighbor w of v is then updated; w is
placed in the queue if it is not already there.
At this point we borrow an A*-like idea:
If (i, j, k) is not on the shortest path from (0,0,0) to
(n,n,n), then avoid passing its value forward.
More importantly, avoid putting into the queue those of its
neighbors that are not already there.
The trick is deciding that (i, j, k) is not on the shortest path
from (0,0,0) to (n,n,n).
Q: How do we pull this rabbit out of our hat?
Define d_{1,2}(i, j) to be the edit distance between
suffixes S1[i..n] and S2[j..n]. Define d_{1,3}(i, k) and
d_{2,3}(j, k) analogously.
Note: these edit distances can be computed in O(n^2)
via DP on the reversed strings.
Observation: any shortest path from (i, j, k) to (n,n,n)
must have distance at least d_{1,2}(i, j) + d_{1,3}(i, k) +
d_{2,3}(j, k).
Suppose we have an alignment (from
somewhere) with an SP distance score z.
Core idea:
if D(i, j, k) + d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k) > z,
then node (i, j, k) cannot be on any shortest
path.
• Do not pass its value forward.
• Do not put its neighbors reached by outgoing
edges onto the queue.
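The pruning rule can be sketched as a forward DP for three strings. This is an illustration under stated assumptions, not the MSA program: costs are match = 0, mismatch = 1, space = 1; the queue is a min-heap on the cell coordinates (lexicographic order is a topological order of the graph, so a cell's value is final when it is removed); indices are 0-based, so cell (i, j, k) means that i, j, and k characters have been consumed and the suffix bounds are taken on `s1[i:]`, `s2[j:]`, `s3[k:]`. The names `suffix_distances` and `pruned_sp3` are hypothetical.

```python
import heapq
from itertools import product

def pair_cost(a, b, mismatch=1, space=1):
    """Cost of one aligned pair (assumed unit costs; match or space/space = 0)."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return space
    return mismatch

def suffix_distances(s, t):
    """d[i][j] = edit distance between the suffixes s[i:] and t[j:]."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            options = []
            if i < n and j < m:
                options.append(d[i + 1][j + 1] + pair_cost(s[i], t[j]))
            if i < n:
                options.append(d[i + 1][j] + pair_cost(s[i], "-"))
            if j < m:
                options.append(d[i][j + 1] + pair_cost("-", t[j]))
            d[i][j] = min(options, default=0)
    return d

def pruned_sp3(s1, s2, s3, z):
    """Forward DP with pruning; z is the SP score of any known alignment.

    Returns (optimal SP distance, number of cells taken off the queue)."""
    strs = (s1, s2, s3)
    end = (len(s1), len(s2), len(s3))
    d12 = suffix_distances(s1, s2)
    d13 = suffix_distances(s1, s3)
    d23 = suffix_distances(s2, s3)
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    D = {(0, 0, 0): 0}
    queue = [(0, 0, 0)]
    processed = 0
    while queue:
        cell = heapq.heappop(queue)      # D[cell] is final at this point
        i, j, k = cell
        processed += 1
        if D[cell] + d12[i][j] + d13[i][k] + d23[j][k] > z:
            continue  # pruned: value not sent forward, neighbors not queued
        for move in moves:
            nxt = tuple(c + d for c, d in zip(cell, move))
            if any(x > e for x, e in zip(nxt, end)):
                continue
            col = [s[c] if d else "-" for s, c, d in zip(strs, cell, move)]
            step = (pair_cost(col[0], col[1]) + pair_cost(col[0], col[2])
                    + pair_cost(col[1], col[2]))
            if nxt not in D:
                D[nxt] = D[cell] + step
                heapq.heappush(queue, nxt)   # first time this cell is reached
            else:
                D[nxt] = min(D[nxt], D[cell] + step)
    return D.get(end), processed

a, b, c = "AATGGTTT", "AACGTTAT", "TATCGAAT"
exact, all_cells = pruned_sp3(a, b, c, float("inf"))  # no pruning at all
best, kept = pruned_sp3(a, b, c, 13)                  # z from a known alignment
print(exact == best, kept <= all_cells)               # True True
```

With z set to infinity nothing is pruned and every cell is processed; with a finite z from a real alignment the result is unchanged, and only cells that pass the bound forward their values.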
Benefits of being able to prune cell (i, j, k):
• We automatically prune many of its
descendants.
• We don’t process all n^k cells in a k-string
problem. Big win!
• The computation is still exact and will find the
optimal alignment.
The program called MSA implements the speedup we
are discussing.
Cold shower:
• MSA can align 6 strings with n ≈ 200.
• It is unlikely to be able to align tens or hundreds of
strings.
Still, the full table has 200^6 cells (= 6.4 × 10^13 cells), so this
would otherwise be impossible.
Bounded-Error Approximation
for SP-Alignment
Q: Where do we get z from?
A: We will use a bounded-error approximation
method.
Properties of the specific method we will discuss:
1. Polynomial worst-case time complexity
2. The SP-score is less than twice the optimal value.
Idea: focus on alignments consistent with a tree.
Q: What do we mean by “consistent with a tree”?
Informal explanation:
• A graph edge denotes a relation between two nodes.
• Recall that D(Si, Sj) is the optimal weighted edit distance
between Si and Sj.
• We could let D(Si, Sj) be the edge relation between the node
labeled Si and the node labeled Sj.
Informal explanation continued:
• Suppose we have a multiple alignment M.
• Suppose we construct an unrooted tree from a
subset of such edges between nodes labeled
with strings from M.
• We call the alignment of the strings
represented in the tree consistent with the tree.
Recall that D(Si, Sj) is the edge relation.
Example from text:
3 A X X _ Z
1 A X _ _ Z
2 A _ X _ Z
4 A Y _ _ Z
5 A Y X X Z
[Figure: tree whose nodes are labeled 1 through 5]
Defn. More formally, let:
• S be a set of distinct strings.
• T be an unrooted tree whose nodes are labeled with
strings from set S.
• M be a multiple alignment of the strings in S.
M is consistent with T if the induced pairwise
alignment of Si and Sj has score D(Si, Sj) for each
pair of strings (Si, Sj) that label adjacent nodes in
T.
Thm. For any set of strings S and for any tree T
whose nodes are labeled by distinct strings from
set S, we can efficiently find a multiple alignment
M(T) of S that is consistent with T.
Proof sketch: construct M(T) of S one string at a time.
Base case:
• Pick two strings Si and Sj labeling nodes adjacent in T.
• Create M_2(T), a two-string alignment with distance
D(Si, Sj).
Inductive Hypothesis: Assume the theorem holds for
k ≥ 2 strings, i.e., M_k(T) is consistent with T.
Inductive Step: show that the theorem holds for k + 1
strings.
• Pick a string Sj not in M_k(T) such that it labels a node
adjacent to a node labeled Si already in M_k(T).
• Optimally align Sj with Si (Si with its spaces, as it appears in M_k(T)).
• Add Sj (Sj with spaces) to M_k(T), creating M_{k+1}(T).
See the detailed proof (pg. 348) for how the issue
of inserted spaces is handled.
By construction:
• Sj and Si have distance D(Si, Sj) in M_{k+1}(T).
• M_{k+1}(T) is consistent with T.
By induction, M(T) of S is consistent with T and is
efficiently computed.
We need some more definitions at this point:
Defn. The center string Sc ∈ S, where S is a set of k strings, is the
string that minimizes M = Σ_{Sj ∈ S} D(Sc, Sj).
Defn. The center star is a star tree of k nodes, with the
center node labeled Sc and each of the k - 1 remaining
nodes labeled by a distinct string in S - {Sc}.
[Figure: center star with Sc at the center and leaves S1, S2, S3, S4, S5, S6]
Defn. The multiple alignment Mc of the strings in S is the
multiple alignment consistent with the center star.
Defn. Let d(Si, Sj) denote the score of the pairwise
alignment of strings Si and Sj induced by Mc.
Defn. Let d(M) denote the score of the alignment M.
Observations:
• d(Si, Sj) ≥ D(Si, Sj)
• d(Mc) = Σ_{i<j} d(Si, Sj)
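The construction of Mc can be sketched as follows. This is a minimal sketch under assumed unit costs (match = 0, mismatch = 1, space = 1, space against space = 0); `align`, `induced`, and `center_star` are illustrative names. Each string is optimally aligned with the center, and the pairwise alignments are merged on the center row so that a space, once inserted, stays in place, which is the step the inductive proof relies on.

```python
from itertools import combinations

def align(s, t, mismatch=1, space=1):
    """Optimal global alignment: returns (distance, gapped s, gapped t)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * space
    for j in range(1, m + 1):
        D[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + space, D[i][j - 1] + space)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:                      # traceback from (n, m)
        sub = mismatch if i and j and s[i - 1] != t[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + space:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def induced(row_a, row_b, mismatch=1, space=1):
    """Score d(Si, Sj) of the pairwise alignment induced by two rows."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a != b:
            total += space if "-" in (a, b) else mismatch
    return total

def center_star(strings):
    """Return (order, rows): rows[0] is the center, rows[r] is strings[order[r]]."""
    k = len(strings)
    dist = [[align(a, b)[0] for b in strings] for a in strings]
    c = min(range(k), key=lambda i: sum(dist[i]))   # index of the center string
    order, rows = [c], [strings[c]]
    for i in (x for x in range(k) if x != c):
        _, ac, ai = align(strings[c], strings[i])
        new = [[] for _ in range(len(rows) + 1)]
        p = q = 0
        while p < len(rows[0]) or q < len(ac):
            if p < len(rows[0]) and rows[0][p] == "-":
                col = [r[p] for r in rows] + ["-"]; p += 1   # existing gap column
            elif q < len(ac) and ac[q] == "-":
                col = ["-"] * len(rows) + [ai[q]]; q += 1    # new gap column
            else:
                col = [r[p] for r in rows] + [ai[q]]; p += 1; q += 1
            for row, ch in zip(new, col):
                row.append(ch)
        rows = ["".join(r) for r in new]
        order.append(i)
    return order, rows

S = ["AATGGTTT", "AACGTTAT", "TATCGAAT", "AATCGTAT"]
order, rows = center_star(S)
print("\n".join(rows))
print(sum(induced(a, b) for a, b in combinations(rows, 2)))  # d(Mc)
```

In the result, each row's induced alignment with the center row has score exactly D(Sc, Sj), while other pairs may score higher than their optimal pairwise distance.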
Defn. The triangle inequality with respect to a scoring scheme is
the relation s(x, z) ≤ s(x, y) + s(y, z) for
any three characters x, y, and z.
We can extend the triangle inequality from the scoring
scheme for characters to string alignment.
Lemma. If a 2-string scoring scheme that satisfies the
triangle inequality is used, then for any Si and Sj:
d(Si, Sj) ≤ d(Si, Sc) + d(Sc, Sj) = D(Si, Sc) + D(Sc, Sj)
Proof sketch: Notice that for each column of Mc, with x, y, and z the
characters of Si, Sc, and Sj in that column, we have:
s(x, z) ≤ s(x, y) + s(y, z)
Summing over the columns gives the inequality in the lemma.
The equality holds since all strings are optimally aligned
with Sc.
We can now establish the bounded-error
approximation:
Defn. Let M* denote the optimal alignment of the k
strings of S.
Defn. Let d*(Si, Sj) denote the pairwise alignment
score of the strings Si and Sj induced by M*.
Thm. d(Mc)/d(M*) ≤ 2(k - 1)/k < 2
See the proof on page 350 for details (it basically depends
on the previous lemma).
Corollary:
(k/2) M ≤ Σ_{i<j} D(Si, Sj) ≤ d(M*) ≤ d(Mc) ≤ [2(k - 1)/k] Σ_{i<j} D(Si, Sj)
• Recall that M = Σ_{Sj ∈ S} D(Sc, Sj).
• The alignment score D(Si, Sj) is not based on Mc or M*.
Observation: d(Mc) / Σ_{i<j} D(Si, Sj) gives a measure of the
goodness of Mc and is guaranteed to be less than 2.
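For readers without the text at hand, the bound can be derived in three lines from the lemma. This is a sketch of the standard argument, assuming a symmetric scoring scheme with D(S, S) = 0; sums over i ≠ j run over ordered pairs, so each unordered pair is counted twice.

```latex
\begin{align*}
2\,d(M_c) &= \sum_{i \ne j} d(S_i, S_j)
  \le \sum_{i \ne j} \bigl[ D(S_i, S_c) + D(S_c, S_j) \bigr]
  = 2(k-1) \sum_{S_j \in S} D(S_c, S_j) = 2(k-1)\,M, \\
2\,d(M^*) &= \sum_{i \ne j} d^*(S_i, S_j)
  \ge \sum_{i \ne j} D(S_i, S_j)
  = \sum_{i=1}^{k} \sum_{S_j \in S} D(S_i, S_j) \ge k\,M, \\
\frac{d(M_c)}{d(M^*)} &\le \frac{2(k-1)\,M}{k\,M} = \frac{2(k-1)}{k} < 2 .
\end{align*}
```

The last step of the second line uses the fact that Sc minimizes the sum of distances to the other strings, so each of the k inner sums is at least M.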
Consensus Objective Functions
First fact of consensus representations:
There is no consensus as to how to define consensus.
Consequently, we will look at several definitions.
Steiner consensus strings:
Defn. Given a set of strings S and a string S´, the
consensus error of S´ relative to S is
E(S´) = Σ_{Sj ∈ S} D(S´, Sj).
S´ is not required to be a member of S.
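As a small illustration, the consensus error can be computed directly from the definition. The sketch assumes D is unit-cost edit distance (mismatch = 1, space = 1), and `edit_distance` and `consensus_error` are illustrative names. The candidate may or may not belong to S.

```python
def edit_distance(s, t, mismatch=1, space=1):
    """Weighted edit distance D(s, t) using two rows of the DP table."""
    prev = [j * space for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * space]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (0 if a == b else mismatch),
                           prev[j] + space,
                           cur[j - 1] + space))
        prev = cur
    return prev[-1]

def consensus_error(candidate, strings):
    """E(S') = sum of D(S', Sj) over all Sj in S."""
    return sum(edit_distance(candidate, s) for s in strings)

S = ["AATGGTTT", "AACGTTAT", "TATCGAAT"]
for s in S:                               # candidates drawn from S
    print(s, consensus_error(s, S))
print(consensus_error("AATCGTAT", S))     # a candidate that is not in S
```

Taking the minimum of `consensus_error` over the members of S gives the center string Sc and its value M.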
Defn. Given a set of strings S, an optimal Steiner
string S* for S minimizes the consensus error
E(S*).
S* is not required to be a member of S.
Observations:
• In S* we are trying to capture the essential common
features in S.
• Computing E(S*) appears to be a hard problem.
No efficient method for finding S* is known.
• We will consider an approximate method.
Lemma: Assume that S contains k strings and that the
scoring scheme satisfies the triangle inequality.
Then there exists a string S´ ∈ S such that E(S´)/E(S*) ≤ 2.
Q: What does this lemma say?
(Proof sketch next slide)
Proof sketch:
For any j, D(S´, Sj) ≤ D(S´, S*) + D(S*, Sj), so
E(S´) = Σ_{Sj ∈ S} D(S´, Sj) = Σ_{Sj ≠ S´} D(S´, Sj) and
Σ_{Sj ≠ S´} D(S´, Sj) ≤ Σ_{Sj ≠ S´} [D(S´, S*) + D(S*, Sj)].
But Σ_{Sj ≠ S´} [D(S´, S*) + D(S*, Sj)] = (k - 2) D(S´, S*) +
E(S*).
Therefore E(S´) ≤ (k - 2) D(S´, S*) + E(S*).
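One way to finish the argument, sketched here under the assumption that S´ is chosen as the string of S closest to S* and that E(S*) > 0:

```latex
\begin{align*}
E(S^*) &= \sum_{S_j \in S} D(S^*, S_j) \ge k\, D(S', S^*)
  && \text{since } D(S', S^*) \le D(S_j, S^*) \text{ for every } S_j \in S, \\
\frac{E(S')}{E(S^*)} &\le \frac{(k-2)\, D(S', S^*)}{E(S^*)} + 1
  \le \frac{k-2}{k} + 1 = 2 - \frac{2}{k} < 2 .
\end{align*}
```

So the lemma holds with the slightly sharper constant 2 - 2/k, which is the constant that appears in the center string theorem.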
Q: Where do we find a good candidate for S´?
A: Sc, the center string.
Recall that Sc minimizes Σ_{Sj ∈ S} D(Sc, Sj).
Thm. E(Sc)/E(S*) ≤ 2 - 2/k, assuming the scoring
scheme satisfies the triangle inequality.
Proof. Follows immediately from the previous lemma
and the observation that E(Sc) ≤ E(S´).
Consensus strings from multiple alignment
Defn. Let M be a multiple alignment of the strings in S. The
consensus character of column i of M is the
character that minimizes the summed distance to
all the characters in column i.
Note:
• The summed distance depends on the pairwise scoring scheme.
• The plurality character is the consensus character for some
scoring schemes.
Defn. Let d(i) denote the minimum sum in column i.
Defn. The consensus string S_M derived from
alignment M is the concatenation of the consensus
characters for each column of M.
Q: How can we evaluate the goodness of S_M?
A: One possibility is Goodness(S_M) = Σ_i D(S_M, Si),
i.e., see how good a Steiner string S_M is.
Consider a different approach…..
Defn. The alignment error of S_M, a consensus string
containing q characters, is Σ_{i=1}^{q} d(i).
Defn. The alignment error of M is defined as the alignment error of S_M, its consensus string.
Example:
1  A A T - G G T T T
2  A A - C G T T A T
3  T A T C G - A A T
   A A T C G - T A T   Consensus (alignment error of ?)
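The column-wise definitions can be tried on the example with a short sketch. It assumes unit costs (match = 0, mismatch = 1, space = 1, space against space = 0), restricts the candidate consensus characters to those present in the column (which is sufficient under unit costs but an assumption in general), and breaks ties by taking the smallest character in sort order. The names `pair_cost` and `consensus` are illustrative.

```python
def pair_cost(a, b, mismatch=1, space=1):
    """Cost of one aligned pair (assumed unit costs; match or space/space = 0)."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return space
    return mismatch

def consensus(rows):
    """Return (consensus string S_M, list of column minima d(i))."""
    chars, d = [], []
    for column in zip(*rows):
        # Summed distance from each candidate character to the whole column.
        cost = {c: sum(pair_cost(c, x) for x in column)
                for c in sorted(set(column))}
        best = min(cost, key=cost.get)   # first minimum in sorted order
        chars.append(best)
        d.append(cost[best])
    return "".join(chars), d

alignment = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
s_m, d = consensus(alignment)
print(s_m)     # AATCG-TAT
print(d)       # [1, 0, 1, 1, 0, 2, 1, 1, 0]
print(sum(d))  # 7
```

Under these assumed costs the consensus row matches the one on the slide, and the alignment error works out to 7; a different scoring scheme would change both the column minima and possibly the consensus characters.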
Defn. The optimal consensus multiple alignment is a
multiple alignment M whose consensus string S_M
has the smallest alignment error over all possible
multiple alignments of S.
The 3 notions of consensus we have discussed are:
1. The Steiner string S* defined from S.
2. The consensus string S_M derived from M, with
goodness related to its function as a Steiner
string.
3. The consensus string S_M derived from M, with
goodness related to its ability to reflect the
column-wise properties of M.
Surprisingly (or not), they lead to the same multiple
alignment.
Let’s investigate the assertion that these concepts result in
the same multiple alignment.
Let S be a set of k strings.
Let T be the star tree with the Steiner string S* at the root
and each of the k strings of S at a distinct leaf of
T. Then:
Defn. The multiple alignment consistent with S* is the
multiple alignment of S ∪ {S*} consistent with T.
Thm. Let S denote the consensus string of the optimal
consensus multiple alignment.
1. Removing the spaces from S results in the
optimal Steiner string S*.
2. Removal of S* from the multiple alignment
consistent with S* results in the optimal consensus
multiple alignment of S.
Proof on page 353.
Q: Why should we care about this theorem?
A: The theorem stating E(Sc)/E(S*) ≤ 2 - 2/k, plus this
theorem, can be used to approximate the optimal
consensus alignment:
1. Find the center string Sc. Recall that the center string Sc ∈ S,
where S is a set of k strings, is the string that minimizes
M = Σ_{Sj ∈ S} D(Sc, Sj).
2. Place Sc at the center of a k node star.
[Figure: center star with Sc at the center and leaves S1, S2, S3, S4, S5, S6]
3. Label each leaf with a string from S.
4. Construct the multiple alignment M consistent with
this tree T.
Recall: M is consistent with T if the induced pairwise
alignment of Si and Sj has score D(Si, Sj) for each pair
of strings (Si, Sj) that label adjacent nodes in T.
Revelation: The multiple alignment M is the same as
Mc used to approximate the SP objective function.
Thm. The multiple alignment Mc created by the
center star method has:
1. An SP score ≤ (2 - 2/k) times the score of the optimal SP
alignment.
2. A consensus alignment error ≤ (2 - 2/k) times the alignment
error of the optimal consensus multiple alignment.
Phylogenetic trees: Multiple
alignment
Phylogenetic tree: a depiction of the evolutionary
history of a set of taxa. The leaves of the tree are
labeled by taxa names.
Convention:
• Each edge (u, v) denotes an ancestor-descendant
relation.
• This relation may be on the basis of morphological
attributes or sequence similarity.
• The internal nodes represent extinct taxa.
• The leaves represent currently existing taxa.
Two related problems:
1. Find a multiple alignment for a tree:
a) Given a phylogenetic tree, deduce sequences for the internal
nodes to optimize some objective function.
b) Find the multiple alignment consistent with the tree.
c) Delete the deduced sequences (internal node labels).
2. Find a tree from a set of leaf sequences (Chapter 17).
Let T be a tree with leaf nodes labeled with distinct
strings from a set S.
Defn. a phylogenetic alignment for T is an assignment
of one string to each internal node.
Note: strings labeling internal nodes need not come
from S.
Recall that D(S1, S2) denotes the edit distance between
strings S1 and S2.
Defn. The edge distance of edge (i, j) is D(Si, Sj)
where Si and Sj are the strings labeling nodes i
and j, respectively.
Defn. Path distance is the sum of edge distances along
the path.
Defn. Phylogenetic alignment distance is the sum of
all edge distances in the tree.
Phylogenetic alignment problem for T:
Find an assignment of strings to internal nodes of T that
minimizes the distance of the alignment.
[Figure: example tree T with nodes labeled S1 through S20]
Phylogenetic alignment problem for T:
• The general problem is too hard (NP-complete).
• We will consider a heuristic approximate
solution.
  • The solution is within twice the minimal distance.
  • The approach has polynomial time complexity.
Defn. A lifted alignment is a phylogenetic alignment
in which the string assigned to each internal node
is also assigned to one of its children.
Example:
[Figure: a lifted alignment on a tree with leaves S1 through S12; each internal node repeats the label of one of its children]
Lifted Alignment Observation:
Each internal node v is labeled by a leaf label appearing in
the subtree rooted at v.
Plan:
1. Construct a lifted alignment TL.
2. Initial approach: conceptually transform the optimal
phylogenetic alignment.
Q: Why do we say “conceptually”?
A: Because we don’t have T*, the optimal phylogenetic
alignment.
3. Demonstrate a property of TL: its total distance is at most twice
the optimal phylogenetic alignment distance.
4. Next: show how to compute TL efficiently using DP.
Creating TL:
• Start with input tree T, with leaves labeled by distinct
strings.
• Let T* denote the optimal phylogenetic alignment for T.
(This is the assignment of strings to internal nodes of T
that minimizes the total of all edge distances.)
• Successively lift each internal node.
• An internal node can only be lifted if all of its children
have been lifted.
• Leaf nodes are defined to be lifted.
Q: How do we “lift” a node?
Let S*_v denote the label of node v in T*.
Assume that v’s children have been lifted.
WLOG let the labels of v’s children be S1, S2, ..., Sk
from S.
[Figure: example node v labeled S*_v with children labeled S1, S2, S3, S4 and edge distances 5, 3, 4, 6]
Find the string Sj among the children that is closest to
S*_v, i.e., the string Sj such that D(S*_v, Sj) ≤ D(S*_v,
Si) for all i from 1 to k.
Replace S*_v with Sj.
[Figure: the example node v before and after lifting; v is relabeled S4 and the edge from v to its child labeled S4 has distance 0]
Example: after the lifting operation, the edge distances change.
Claim: The lifted alignment TL has total distance less than
or equal to twice that of the optimal phylogenetic
alignment T* of T.
Sketch of proof:
Suppose e = (v, w) (v the parent of w) is a nonzero-length edge in TL.
Suppose v is labeled Sj ∈ S, and w is labeled Si ∈ S.
If Sj ≠ Si then the distance of e in TL is D(Sj, Si) ≤ D(Sj, S*_v) + D(S*_v, Si).
But D(Sj, S*_v) + D(S*_v, Si) ≤ 2 * D(S*_v, Si).
Q: Why is this true?
A: Because D(Sj, S*_v) ≤ D(S*_v, Si): Sj was chosen as the child label closest to S*_v.
Sketch of proof (continued):
What about paths?
Let Pe denote the path from v to the leaf labeled Si in T*. By the
triangle inequality, D(S*_v, Si) is at most the sum of the edge
distances along Pe.
So if e is a nonzero-length edge in TL, then its distance is
at most twice the distance of Pe in T*.
The lifted alignment can be computed with DP.
Let Tv be the subtree of T rooted at node v.
Defn. d(v, S) denotes the distance of the best lifted
alignment of Tv where v is labeled with S.
Obviously, S must be the label of a leaf in Tv.
d(v, S) is computed from the leaves up.
1. The leaves are already considered “lifted”.
2. d(v, S) for a parent of leaves is computed by:
d(v, S) = Σ_{S´} D(S, S´), where S´ ranges over the labels of the
children of v.
3. The general recurrence for an internal node is:
d(v, S) = Σ_{v´} min_{S´} [D(S, S´) + d(v´, S´)],
where v´ ranges over the children of v and S´ over the labels of leaves in T_{v´}.
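The recurrence can be sketched as a recursion over the tree. This is a minimal sketch under assumptions: D is unit-cost edit distance, the tree is given as a hypothetical `children` dictionary plus a `leaf_label` dictionary, and distances are recomputed on demand rather than read from a precomputed table of pairwise distances as the time analysis assumes.

```python
def edit_distance(s, t, mismatch=1, space=1):
    """Weighted edit distance D(s, t) using two rows of the DP table."""
    prev = [j * space for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * space]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (0 if a == b else mismatch),
                           prev[j] + space,
                           cur[j - 1] + space))
        prev = cur
    return prev[-1]

def lifted_table(v, children, leaf_label, dist=edit_distance):
    """Return {S: d(v, S)} for every leaf label S in the subtree rooted at v."""
    kids = children.get(v, [])
    if not kids:
        return {leaf_label[v]: 0}          # a leaf is already lifted
    tables = [lifted_table(c, children, leaf_label, dist) for c in kids]
    labels = [S for table in tables for S in table]
    # d(v, S) = sum over children v' of min over S' of [D(S, S') + d(v', S')]
    return {S: sum(min(dist(S, S2) + d for S2, d in table.items())
                   for table in tables)
            for S in labels}

# Hypothetical tree: root -> (u, c), u -> (a, b); leaves a, b, c carry strings.
children = {"root": ["u", "c"], "u": ["a", "b"]}
leaf_label = {"a": "AA", "b": "AB", "c": "BB"}
table = lifted_table("root", children, leaf_label)
print(table)                 # {'AA': 3, 'AB': 2, 'BB': 2}
print(min(table.values()))   # 2
```

The minimum entry of the root's table is the distance of the best lifted alignment; recording which S´ achieves each inner minimum would recover the labels themselves.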
Time analysis:
Assume that T has k leaves.
Assume that all (k choose 2) pairwise distances have been computed.
Q: How long does this take?
A: O(N^2), where N is the total length of all the k strings.
Why is this true? How can we explain it?
The processing at an internal node is O(k^2).
Why is this true?
Then the total time is O(N^2 + k^3).
Why O(N^2 + k^3) and not O(N^2 + k^2)?
Bottom line: we can compute the optimal lifted alignment in
time that is polynomial in the length of the strings and the
size of the tree.