Sections 17.4-6 - cse.sc.edu - University of South Carolina

Bioinformatics Algorithms and
Data Structures
Chapter 17.4-6: Strings and
Evolutionary Trees
Lecturer: Dr. Rose
Slides by: Dr. Rose
April 10, 2007
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Ultrametric Problem Centrality
Four related tree problems:
1. Ultrametric
2. Additive
3. Binary perfect phylogeny
4. Tree compatibility
All can be solved as ultrametric tree problems.
Recall tree compatibility reduces to perfect phylogeny.
Now we reduce additive tree & (binary) perfect
phylogeny problems to the ultrametric tree problem.
Ultrametric Problem: Additive Trees
• Goal: reduce the additive tree problem to the ultrametric problem
• Complexity: O(n^2) reduction
• Approach: create a matrix D´ that is ultrametric if and only if D is additive.
• We will start by describing a reduction that involves a tree T for D and a tree T´ for D´.
• We will then describe a direct reduction of D to D´.
Ultrametric Problem: Additive Trees
• Assume that D is additive.
• Assume that we know an additive tree T for D.
• Assume that each of the n taxa in D labels a leaf of T.
• Idea: label the nodes of T to create an ultrametric tree T´.
• Q: How can we do this?
Ultrametric Problem: Additive Trees
A: we will do the following:
– Select one node as the root
– Stretch the leaf edges so that the leaves are equidistant from the root.
• Let v be the row of D containing the largest entry.
• Let m_v denote the value of this entry.
• Select node v as the root of T.
• This creates a directed tree.
Ultrametric Problem: Additive Trees
Example:
• A is the row of D containing the largest entry.
• Select node A as the root of T.

    A  B  C  D
A   0  3  7  6
B      0  6  5
C         0  3
D            0

[Figure: the additive tree T for D, drawn unrooted and then rooted at A. Leaf edges: A 2, B 1, C 2, D 1; the internal edge has weight 3.]
Ultrametric Problem: Additive Trees
Stretch leaf edges:
– For each leaf i, add m_A - D(A, i) to the leaf edge.
– The leaves are now equidistant from A.

    A  B  C  D
A   0  3  7  6
B      0  6  5
C         0  3
D            0

[Figure: T rooted at A before and after stretching. Amounts added to the leaf edges, shown in parentheses: B (4), C (0), D (1).]
Ultrametric Problem: Additive Trees
The resulting tree T is:
–
–
–
A
2
a rooted edge-weighted tree
distance mv from root to every leaf
each internal node is equidistant to
leaves in its subtree.
3
(0)
C
UNIVERSITY OF SOUTH CAROLINA
2
1
(4)
B
1
(1)
D
College of Engineering & Information Technology
Ultrametric Problem: Additive Trees
Since each internal node is equidistant to the leaves in its subtree:
• Label each internal node by this unique distance.
• These labels can be used to define an ultrametric matrix D´.
• D´(i, j) is the label at the least common ancestor of leaves i and j in T´.
Q: How can we go directly from matrix D to matrix D´ without involving T and T´?
Ultrametric Problem: Additive Trees
Consider leaves i & j in T:
– Let node w be their least common ancestor.
– Let x be the distance from the root v to w.
– Let y be the distance from node w to leaf i.
[Figure: root v with a path of length x down to w; w has a path of length y to leaf i and a path of length z to leaf j.]
Ultrametric Problem: Additive Trees
Q: What is the distance from w to i in T´?
A: y + m_v - D(v, i).
Q: Where does m_v - D(v, i) come from?
A: Recall that we add m_v - D(v, i) to stretch the leaf edge into i.
Ultrametric Problem: Additive Trees
Gusfield presents the following lemma:
Without knowing T or T´ explicitly, we can deduce that
D´(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2
Q: Is this equation correct?
With D(i, j) = y + z, D(v, i) = x + y and D(v, j) = x + z:
D´(i, j) = m_v + ((y + z) - (x + y) - (x + z))/2 = m_v - 2x/2 = m_v - x
Should it instead be:
D´(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j), i.e., D´(i, j) = 2m_v - 2x?
A: Only if D´(i, j) is meant to be the leaf-to-leaf path length in T´. The lemma's value m_v - x is the label at the least common ancestor w (slide 9), which is half that path length. Either choice gives an ultrametric matrix, so the reduction is unaffected.
[Figure: root v, distance x to w, then y to leaf i and z to leaf j.]
Ultrametric Problem: Additive Trees
This brings us to the following Theorem:
If D is an additive matrix, then D´ is ultrametric, where
D´(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2
Proof. We’ve shown that:
D´(i, j) = y + m_v - D(v, i)
y = D(v, i) - x
x = (D(v, i) + D(v, j) - D(i, j))/2
Substituting the second and third lines into the first gives D´(i, j) = m_v - x, which is the equation in the theorem.
D´ satisfies the ultrametric requirement, because D´(i, j) is the label at the least common ancestor of i and j in T´.
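The theorem can be checked numerically on the running example. The sketch below is only an illustration: the function names `ultrametric_from_additive` and `is_ultrametric` are made up here, and the ultrametric test is the usual three-point condition (in every triple the two largest values are equal), which the slides do not spell out.

```python
from itertools import combinations

def ultrametric_from_additive(D, v):
    """Return D' with D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2."""
    n = len(D)
    m_v = max(D[v])
    Dp = [[0.0] * n for _ in range(n)]  # diagonal stays 0
    for i, j in combinations(range(n), 2):
        Dp[i][j] = Dp[j][i] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return Dp

def is_ultrametric(Dp, eps=1e-9):
    """Three-point check: in every triple the two largest distances are equal."""
    for i, j, k in combinations(range(len(Dp)), 3):
        _, b, c = sorted((Dp[i][j], Dp[i][k], Dp[j][k]))
        if c - b > eps:
            return False
    return True

# Example matrix D from the slides, taxa ordered A, B, C, D.
D = [[0, 3, 7, 6],
     [3, 0, 6, 5],
     [7, 6, 0, 3],
     [6, 5, 3, 0]]
v = max(range(len(D)), key=lambda r: max(D[r]))  # first row holding the largest entry
D_prime = ultrametric_from_additive(D, v)
print(D_prime)
print(is_ultrametric(D_prime), is_ultrametric(D))
```

Here `v` comes out as row A, and the additive matrix D itself fails the three-point check while D´ passes it.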
Ultrametric Problem: Additive Trees
Q: What is the value of y?
A: y = D(v, i) - x.
Q: What is the value of x in terms of values in D?
A: x = (D(v, i) + D(v, j) - D(i, j))/2
Ultrametric Problem: Additive Trees
So: D additive ⇒ D´ ultrametric
By contraposition: D´ not ultrametric ⇒ D not additive
Q: does D´ ultrametric ⇒ D additive?
A: Theorem: D´ ultrametric ⇒ D additive
Proof. (constructive)
• Let T´´ be the ultrametric tree for D´
• Assign weights to edges of T´´
  – Note: the sum of edge weights from a leaf to an ancestor must match the ancestor’s label.
  – For each edge (p, q), assign the weight |p - q|, where p and q are the labels at its endpoints (a leaf has label 0).
Ultrametric Problem: Additive Trees
• Assign weights to edges of T´´, continued
  – Note that the path distance between leaves i and j is twice the value labeling their least common ancestor.
  – Hence this path distance is 2D´(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j)
  – Now shrink the edge into each leaf i by m_v - D(v, i)
  – The path from leaf i to leaf j now has length D(i, j)
The result is an additive tree for matrix D from D´’s ultrametric tree.
Putting all of this together results in a method for constructing an additive tree for an additive matrix.
Ultrametric Problem: Additive Trees
Additive Tree Algorithm
– Create matrix D´ from D.
– Create ultrametric tree T´´ from D´
– Create T from T´´
  • Label edge (p, q) with the value |p - q|
  • For each leaf i, shrink the leaf edge by m_v - D(v, i)
Note: no step takes more than O(n^2) time.
Thm. An additive tree for an additive matrix can be constructed in O(n^2) time.
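The three steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the O(n^2) method from the text: the ultrametric tree is built by a simple recursive split on the largest entry, which is easy to read but not tuned for running time. The function name and the tree encoding (nested `(label, children)` pairs, a leaf being `(0.0, index)`) are hypothetical. Rather than returning the tree, the function returns the leaf-to-leaf path lengths in T so they can be compared against D.

```python
from itertools import combinations

def additive_tree_leaf_distances(D, names):
    """Run the three steps and return the leaf-to-leaf path lengths in T."""
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    # Step 1: D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2.
    Dp = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        Dp[i][j] = Dp[j][i] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2

    # Step 2: ultrametric tree T'' (simple recursive split, assumes D' is ultrametric).
    def build(leaves):
        if len(leaves) == 1:
            return (0.0, leaves[0])
        top = max(Dp[a][b] for a, b in combinations(leaves, 2))
        groups, rest = [], list(leaves)
        while rest:
            a = rest[0]
            group = [b for b in rest if b == a or Dp[a][b] < top]
            rest = [b for b in rest if b not in group]
            groups.append(group)
        return (top, [build(g) for g in groups])

    # Step 3: weight each edge |p - q|, shrink leaf edges, record root-to-leaf paths.
    paths = {}
    def walk(node, dist, key, path):
        label, body = node
        here = path + [(key, dist)]
        if isinstance(body, int):          # reached leaf `body`
            paths[body] = here
            return
        for c, child in enumerate(body):
            w = abs(label - child[0])
            if isinstance(child[1], int):  # leaf edge: shrink by m_v - D(v, i)
                w -= m_v - D[v][child[1]]
            walk(child, dist + w, key + (c,), here)
    walk(build(list(range(n))), 0.0, (), [])

    def tree_dist(i, j):
        pi, pj = paths[i], paths[j]
        k = 0                              # index of the least common ancestor
        while k + 1 < min(len(pi), len(pj)) and pi[k + 1][0] == pj[k + 1][0]:
            k += 1
        return pi[-1][1] + pj[-1][1] - 2 * pi[k][1]

    return {(names[i], names[j]): tree_dist(i, j)
            for i, j in combinations(range(n), 2)}

D = [[0, 3, 7, 6],
     [3, 0, 6, 5],
     [7, 6, 0, 3],
     [6, 5, 3, 0]]
dist = additive_tree_leaf_distances(D, "ABCD")
print(dist)
```

For the example matrix, every path length in the derived tree equals the corresponding entry of D, which is the sanity check the slides perform by comparing trees.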
Ultrametric Problem: Additive Trees
Example: Given D, first find D´ (here v = A and m_v = 7)
Recall: D´(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2

D
    A  B  C  D
A   0  3  7  6
B      0  6  5
C         0  3
D            0

D´
    A  B  C  D
A   0  7  7  7
B      0  5  5
C         0  2
D            0
Ultrametric Problem: Additive Trees
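Example: From D´ find T´´
Recall: label each edge (p, q) by |p - q|

D´
    A  B  C  D
A   0  7  7  7
B      0  5  5
C         0  2
D            0

[Figure: the ultrametric tree T´´ for D´. The root (label 7) has leaf A (edge 7) and an internal node labeled 5 (edge 2) as children; node 5 has leaf B (edge 5) and an internal node labeled 2 (edge 3); node 2 has leaves C and D (edge 2 each).]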
Ultrametric Problem: Additive Trees
Example: From T´´ find T
Recall: shrink leaf edge i by m_v - D(v, i)

D
    A  B  C  D
A   0  3  7  6
B      0  6  5
C         0  3
D            0

[Figure: T´´ with its edge weights, and the tree T obtained by shrinking the leaf edges: A from 7 to 0, B from 5 to 1, C stays 2, D from 2 to 1. The internal edges 2 and 3 are unchanged.]
Ultrametric Problem: Additive Trees
Example: Finally compare the derived T with the original tree as a sanity check.

    A  B  C  D
A   0  3  7  6
B      0  6  5
C         0  3
D            0

[Figure: the derived tree T beside the original additive tree for D. The two agree once the zero-length edge into A is contracted.]
Ultrametric Problem: Perfect Phylogeny
We now recast perfect phylogeny in terms of an
ultrametric tree problem.
Defn. D_M: the n by n matrix of shared characters
More formally:
Given the n by m character matrix M, define the n by n matrix D_M: for each pair of objects p and q, set D_M(p, q) to be the number of characters that p and q both possess.
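As a small illustration of the definition, D_M can be computed directly from a 0/1 character matrix. The function names and the two tiny matrices below are hypothetical and chosen only for this sketch. The min-ultrametric test used here is the three-point condition with the minimum in place of the maximum (in every triple the two smallest values are equal).

```python
from itertools import combinations

def shared_character_matrix(M):
    """D_M(p, q) = number of characters (columns of M) that p and q both possess."""
    n = len(M)
    return [[sum(a & b for a, b in zip(M[p], M[q])) for q in range(n)]
            for p in range(n)]

def is_min_ultrametric(DM):
    """In every triple of distinct objects the two smallest entries are equal."""
    for p, q, r in combinations(range(len(DM)), 3):
        a, b, _ = sorted((DM[p][q], DM[p][r], DM[q][r]))
        if a != b:
            return False
    return True

# Hypothetical matrix with nested character sets, so a perfect phylogeny exists.
M_good = [[1, 1, 0],
          [1, 0, 0],
          [0, 0, 1]]
# Hypothetical matrix whose two characters overlap without nesting.
M_bad = [[1, 1],
         [1, 0],
         [0, 1]]
DM_good = shared_character_matrix(M_good)
DM_bad = shared_character_matrix(M_bad)
print(DM_good, is_min_ultrametric(DM_good))
print(DM_bad, is_min_ultrametric(DM_bad))
```

The diagonal entry D_M(p, p) is simply the number of characters p possesses; only the off-diagonal entries enter the three-point check.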
Ultrametric Problem: Perfect Phylogeny
Lemma: If M has a perfect phylogeny, then D_M is a min-ultrametric matrix.
Proof: convert M’s perfect phylogeny T to a min-ultrametric tree for D_M
– Let T be the perfect phylogeny for M.
– Label T’s root with zero.
– Traverse T from top to bottom; for each node v:
  • Let p_v be the number labeling node v’s parent.
  • Let e_v be the number of characters labeling the edge into v.
  • Label node v with the sum p_v + e_v
Ultrametric Problem: Perfect Phylogeny
– The label of node v is the number of characters common to all leaves in the subtree rooted at v.
– If v is the immediate parent of leaves p and q, then the label of v is D_M(p, q).
– The numbers labeling nodes on any path from the root are strictly increasing.
⇒ The result is a min-ultrametric tree for matrix D_M.
Ultrametric Problem: Perfect Phylogeny
Algorithm: perfect phylogeny via ultrametrics:
1. Create matrix D_M from M.
2. Attempt to create a min-ultrametric tree T´ from D_M. If not possible, then M has no perfect phylogeny.
3. If T´ was successfully created in step 2:
  • Attempt to label its edges with the m characters of M.
  • If not possible, then M has no perfect phylogeny.
  • Otherwise the modified T´ is the perfect phylogeny T.
Note: D_M may have a min-ultrametric tree T´ even when M has no perfect phylogeny, hence the check in step 3.
Ultrametric Problem: Perfect Phylogeny
Final notes on the centrality of the ultrametric problem.
We can see that the following problems:
1. perfect phylogeny
2. tree compatibility
can be cast as ultrametric problems.
This is not an efficient way to address these problems.
Maximum Parsimony
Maximum parsimony:
• Perfect phylogeny is a special instance
• Can be viewed as a Steiner tree problem on a hypercube
Presentation Approach:
• Introduce Steiner trees
• Hypercube graphs
• Maximum parsimony as a Steiner tree problem
Maximum Parsimony
Definitions:
Let N be a set of nodes
Let E be a set of undirected edges with non-negative weights
Let G = (N, E) be an undirected graph
Let X ⊆ N be a subset of nodes.
A Steiner tree ST for X is any connected subtree of G that contains all nodes of X and possibly nodes in N - X.
Weighted Steiner Tree Problem: Given G and X, find the Steiner tree of minimum total weight.
Maximum Parsimony
More Definitions:
A hypercube of dimension d is an undirected graph with 2^d nodes, labeled 0..2^d - 1. Adjacent nodes differ in exactly one bit position of their labels.
The weighted Steiner tree problem on hypercubes: G must be a hypercube.
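The adjacency rule translates directly into bit operations on the integer labels. The helper names below are illustrative only.

```python
def neighbors(node, d):
    """Nodes adjacent to `node` in the d-dimensional hypercube: flip one bit."""
    return [node ^ (1 << bit) for bit in range(d)]

def adjacent(a, b):
    """True when labels a and b differ in exactly one bit position."""
    diff = a ^ b
    return diff != 0 and (diff & (diff - 1)) == 0

print(neighbors(0b000, 3))      # the three nodes one bit away from 000
print(adjacent(0b101, 0b100))   # differ in one bit
print(adjacent(0b101, 0b110))   # differ in two bits
```

Every node has exactly d neighbors, one per bit position.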
Maximum Parsimony
More Definitions:
Maximum Parsimony: Occam’s razor applied to
phylogenetic reconstruction. A preference for trees
requiring fewer evolutionary events to explain data.
Gusfield’s definition:
The Maximum Parsimony problem is the unweighted
Steiner tree problem on a d-dimensional hypercube.
Maximum Parsimony
More about the hypercube formulation of MP:
– The input taxa X are described as d-length binary vectors.
– Recall: adjacent nodes differ in only one label bit position.
– Correspondingly, taxa that differ by a single mutation will be adjacent.
⇒ There is a Steiner tree for the nodes of X with l edges iff there is a corresponding phylogenetic tree that entails l character-state mutations.
Steiner interpretation of Perfect
Phylogeny
Define a nontrivial binary character to be a character
contained by some taxa but not all.
Consider an MP dataset of d nontrivial binary
characters
Q: what is the minimal number of mutations in the
MP tree?
A: at least d, since each nontrivial character must change state at least once somewhere in the tree.
Steiner interpretation of Perfect
Phylogeny
Q: What is the relation to binary perfect phylogeny?
A: the binary perfect phylogeny problem is
equivalent to asking if there is an MP solution
with a cost of exactly d.
Q: What about generalized perfect phylogeny?
A: It’s similar. The lower bound must reflect:
– the number of character states in the input taxa.
– a character having r states in the input taxa is allowed only r - 1 transitions.
Steiner interpretation of Perfect
Phylogeny
Complexity:
• No efficient solution is known for the Steiner tree problem on unweighted graphs.
• There is a polynomial-time solution for the generalized perfect phylogeny problem when r is fixed.
⇒ This particular Steiner tree problem can be answered in polynomial time.
Steiner interpretation of Perfect
Phylogeny
MP approximations:
– The weighted Steiner tree problem on hypercubes is NP-hard.
– There is an approximation method with an error bound of a factor of 11/6.
– Also, a minimum spanning tree (MST) can be used to find a Steiner tree with weight less than twice that of the optimal Steiner tree.
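The MST bound can be sketched for the unweighted hypercube, where the shortest path between two nodes has length equal to their Hamming distance. The code below runs Prim's algorithm on the complete graph over the input taxa with Hamming-distance weights. The function names and the three example taxa are assumptions made for this illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def mst_weight(taxa):
    """Prim's algorithm on the complete graph over taxa, weighted by Hamming distance."""
    # best[i] = cheapest known connection from taxon i to the tree built so far.
    best = {i: hamming(taxa[0], taxa[i]) for i in range(1, len(taxa))}
    total = 0
    while best:
        i = min(best, key=best.get)   # closest taxon not yet in the tree
        total += best.pop(i)
        for j in best:
            best[j] = min(best[j], hamming(taxa[i], taxa[j]))
    return total

taxa = ["000", "110", "011"]          # hypothetical 3-character binary taxa
print(mst_weight(taxa))
```

For these taxa every pair is at Hamming distance 2, so the MST has weight 4. A Steiner tree through the extra node 010 uses only 3 edges, which is consistent with the factor-of-two bound.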
Phylogenetic Alignment
Recall:
• Phylogenetic alignment was discussed in section 14.8
• The focus was on deriving a multiple alignment informed by evolutionary history.
• The tree focused emphasis on specific alignment groupings
• Internal node sequences were a secondary artifact
Phylogenetic Alignment
Phylogenetic alignment as a parsimony problem:
In contrast:
• We are now interested in the internal sequences
• These sequences are waypoints in the evolutionary trajectory leading to the extant taxa
• Phylogenetic alignment is thus a parsimony problem
Phylogenetic Alignment
Hypothesis: optimal phylogenetic alignment describes
evolutionary history.
Assumptions:
– Edit distance realistically models evolutionary distance
– Globally optimal phylogenetic alignment captures the essence of the evolutionary process
We will look at minimum mutation, a variant of phylogenetic alignment
Fitch-Hartigan minimum mutation
problem
Defn. minimum mutation problem – variant of
phylogenetic alignment problem.
Input comprised of:
1. Tree
2. Strings labeling the leaves
3. A multiple alignment of those strings
Fitch-Hartigan minimum mutation
problem
Q: If you are given the tree and the multiple alignment,
what is left to compute?
A: the mutations that account for the input data.
These mutations should form a minimum sequence of site mutations that is compatible with:
1. the given tree and
2. the given multiple alignment.
Fitch-Hartigan minimum mutation
problem
Q: How is the input data used to determine the
minimum sequence of mutations?
1. The multiple alignment associates each amino acid
with a specific position.
2. The evolutionary history of the sequences is then
treated as a combined but independent
evolutionary history of each position.
3. The tree guides the order of mutations for each
position.
Fitch-Hartigan minimum mutation
problem
Assumptions:
– Each column of the alignment can be solved separately
– The strings labeling inner nodes adhere to the same alignment
The problem reduces to a computation at a single position.
Fitch-Hartigan minimum mutation
problem
Minimum mutation for a single position:
Input:
1. rooted tree with n nodes
2. Each leaf is labeled by a single character
Output:
1. Each interior node is labeled by a single character
2. The labeling minimizes the number of edges between
nodes with different labels.
Fitch-Hartigan minimum mutation
problem
Algorithmic approach: Dynamic Programming
• Let T_v denote the subtree rooted at node v
• Let C(v) be the cost of the optimal solution for T_v
• Let C(v, x) be the cost when v must be labeled by x
• Let v_i denote the ith child of node v
• Base case: for each leaf v, specify C(v) and C(v, x) for all x ∈ S, where S is the alphabet:
  – C(v) = 0 and C(v, x) = 0 if leaf v is labeled by x.
  – C(v, x) = ∞ if leaf v is not labeled by x.
Fitch-Hartigan minimum mutation
problem
When v is an internal node:
$C(v) = \min_x C(v, x)$
$C(v, x) = \sum_i \min\bigl(C(v_i) + 1,\; C(v_i, x)\bigr)$
The recurrence relations start from the base cases.
• Bottom up from the leaves
• Backtracking is used after all C(v, x) are computed to extract the solution.
Fitch-Hartigan minimum mutation
problem
Backtracking process:
• The root r is labeled by a character x such that C(r) = C(r, x)
• The traversal is then top-down
• If v is labeled x, then v_i is labeled:
  – character x if C(v_i) + 1 > C(v_i, x)
  – otherwise a character y such that C(v_i) = C(v_i, y)
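The recurrences and the backtracking rule above can be sketched together for a single position. The tree encoding (a `children` dictionary plus a `leaf_char` dictionary), the function name, and the small four-leaf tree are assumptions made for this illustration. It is not the 15-node example from the slides.

```python
import math

def min_mutation(children, leaf_char, root, alphabet):
    """Return (C(root), labeling) for one alignment column."""
    cost = {}   # cost[v][x] = C(v, x)
    best = {}   # best[v]    = C(v)

    def up(v):  # bottom-up pass
        kids = children.get(v, [])
        if not kids:
            cost[v] = {x: 0 if leaf_char[v] == x else math.inf for x in alphabet}
        else:
            for k in kids:
                up(k)
            cost[v] = {x: sum(min(best[k] + 1, cost[k][x]) for k in kids)
                       for x in alphabet}
        best[v] = min(cost[v].values())

    label = {}

    def down(v, x):  # top-down backtracking
        label[v] = x
        for k in children.get(v, []):
            if best[k] + 1 > cost[k][x]:
                down(k, x)                                   # keep the parent's character
            else:
                down(k, min(alphabet, key=lambda c: cost[k][c]))  # y with C(k) = C(k, y)

    up(root)
    down(root, min(alphabet, key=lambda c: cost[root][c]))
    return best[root], label

# Hypothetical tree: r -> (u, w), u -> (a, b), w -> (c, d).
children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
leaf_char = {"a": "A", "b": "B", "c": "B", "d": "B"}
total, label = min_mutation(children, leaf_char, "r", "AB")
print(total, label)
```

On this tree one mutation suffices: every internal node is labeled B and the only change is on the edge into leaf a.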
Fitch-Hartigan minimum mutation
problem
Let’s evaluate an example:
Base case: C(v) = 0 and C(v, x) = 0 if leaf v is labeled by x; otherwise C(v, x) = ∞.
$C(v, x) = \sum_i \min\bigl(C(v_i) + 1,\; C(v_i, x)\bigr)$
$C(v) = \min_x C(v, x)$
[Figure: a rooted tree with leaves 1-8, each labeled A or B, and internal nodes 9-15, beside two blank tables to fill in: C(v, x) for x = A and x = B, and C(v), each over nodes 1-15.]
Fitch-Hartigan minimum mutation
problem
Time complexity:
Bottom-up portion
– Let s = |S|
– Each node is evaluated with respect to each x ∈ S.
– For n nodes this gives O(ns)
The backtracking portion is O(n)
Overall O(ns)
Maximum Parsimony
• Most widely used tree building algorithm
• Differs from distance-based algorithms:
  – Does not actually build trees from distances
  – Parsimony is used to compute the cost of a tree
  – A search strategy is used to search through all topologies
  – Goal: find the tree topology with the overall minimum cost
Traditional Parsimony
Algorithm: Traditional parsimony [Fitch 1971]
• Goal: count the number of substitutions at a site.
• Method: recursion, keeping track of
  – C, the current cost
  – R_k, the residues at k, the current node
Traditional Parsimony
Algorithm: Traditional parsimony [Fitch 1971]
C = 0, k = root   // initialize the cost and the current node
TP(k) {
  if k is a leaf then return {x_k}   // x_k is the residue at leaf k
  R_left = TP(k.left)
  R_right = TP(k.right)
  if R_left ∩ R_right ≠ ∅ then return R_left ∩ R_right
  else {
    C = C + 1
    return R_left ∪ R_right } }
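A runnable rendering of the recursion might look like the sketch below. The tree encoding is an assumption made here: a leaf is a one-character string and an internal node is a `(left, right)` pair. The four-leaf tree is a made-up example, not the one on the slides.

```python
def fitch(tree):
    """Return (residue set at the root, number of substitutions) for one site."""
    cost = 0

    def tp(k):
        nonlocal cost
        if isinstance(k, str):          # leaf: its residue set is {x_k}
            return {k}
        r_left, r_right = tp(k[0]), tp(k[1])
        common = r_left & r_right
        if common:                      # non-empty intersection: no extra cost
            return common
        cost += 1                       # empty intersection: one substitution
        return r_left | r_right

    root_set = tp(tree)
    return root_set, cost

tree = (("A", "B"), ("B", "B"))         # hypothetical example tree
print(fitch(tree))
```

The cost is kept in a local variable rather than a global one so that the function can be called repeatedly, once per alignment column.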
Traditional Parsimony
Let’s evaluate an example:
if R_left ∩ R_right ≠ ∅ then return R_left ∩ R_right
else C = C + 1, return R_left ∪ R_right
[Figure: a rooted tree with leaves 1-8, each labeled A or B, and internal nodes 9-15, beside blank tables of R and C over nodes 1-15 to fill in.]
Traditional Parsimony
There is a traceback procedure for finding ancestral
assignments.
Q: How do you think the traceback works?
A: Start from the root:
1. Pick a residue
2. Pick the same residue for each child set if possible
3. If a child set does not contain the parent’s residue,
randomly select a residue from its set.
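The traceback steps can be sketched as follows, again on a hypothetical four-leaf tree with leaves encoded as one-character strings and internal nodes as `(left, right)` pairs. One deliberate deviation: where the slide says to select a residue randomly, this sketch takes the alphabetically smallest one so that the output is reproducible.

```python
def fitch_sets(tree):
    """Bottom-up pass: return (residue set, left subtree, right subtree) per node."""
    if isinstance(tree, str):
        return ({tree}, None, None)
    left, right = fitch_sets(tree[0]), fitch_sets(tree[1])
    common = left[0] & right[0]
    return (common or (left[0] | right[0]), left, right)

def traceback(node, parent_residue=None):
    """Top-down pass: keep the parent's residue when the node's set allows it."""
    residues, left, right = node
    if parent_residue in residues:
        choice = parent_residue
    else:
        choice = min(residues)          # deterministic stand-in for a random pick
    if left is None:                    # leaf
        return choice
    return (choice, traceback(left, choice), traceback(right, choice))

tree = (("A", "B"), ("B", "B"))         # hypothetical example tree
print(traceback(fitch_sets(tree)))
```

The result nests `(residue, left, right)` triples. Here every internal node receives B, and the single substitution falls on the edge into the leaf labeled A.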
Traditional Parsimony
Let’s perform the traceback on our example:
[Figure: the same tree (leaves 1-8 labeled A or B, internal nodes 9-15) with blank tables of R and C over nodes 1-15, used to carry out the traceback.]