Agenda for today
• Phylogenic tree construction
– Introduction to phylogenic trees
∗ Ultrametricality
∗ Additive distance
– Distance-based approaches
– Parsimony
• Phylogenic tree building and multi-sequence alignment
Phylogenic trees
• Rooted trees, tracing evolutionary divergence
• Without loss of generality, assume binary branching
• Strings at the leaves of trees
• Internal nodes labeled or unlabeled
– Labels may hypothesize ancestor strings
• Nodes or links in tree may have scores
– Scores may represent distance or time
Clade of Apes
Ultrametric trees
• Ultrametric trees have real numbers at the internal nodes
• Numbers at the nodes must strictly decrease along any path from the root to a leaf
• For our purposes, strings at the leaves of the tree
• Pairwise scores between leaves:
number associated with the lowest-common-ancestor
• Defines an ultrametric (symmetric) matrix of pairwise distances
– Diagonals zero; off-diagonal positive
• Can construct a unique ultrametric tree efficiently from a given
ultrametric matrix (see Gusfield)
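One way to test whether a given matrix is ultrametric is the standard three-point condition: for every triple of leaves, the two largest of the three pairwise distances must be equal. This condition is not stated on the slides; the sketch below assumes it, together with the zero diagonal, positive off-diagonal entries, and symmetry listed above. The function name and the counterexample matrix are hypothetical; `ULTRA` is the example matrix from the next slide.

```python
from itertools import combinations

def is_ultrametric(d):
    """Three-point test on a square matrix given as a list of lists."""
    n = len(d)
    if any(d[i][i] != 0 for i in range(n)):
        return False
    for i, j in combinations(range(n), 2):
        if d[i][j] != d[j][i] or d[i][j] <= 0:
            return False
    for i, j, k in combinations(range(n), 3):
        _, mid, top = sorted((d[i][j], d[i][k], d[j][k]))
        if mid != top:  # the two largest of the three must be equal
            return False
    return True

# Leaves A..E; values copied from the example matrix, filled in symmetrically.
ULTRA = [
    [0, 9, 9, 6, 4],
    [9, 0, 4, 9, 9],
    [9, 4, 0, 9, 9],
    [6, 9, 9, 0, 6],
    [4, 9, 9, 6, 0],
]
# Hypothetical counterexample: distances 3, 4, 5 have no tie for the maximum.
NOT_ULTRA = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]

print(is_ultrametric(ULTRA), is_ultrametric(NOT_ULTRA))
```

With floating-point distances the equality test would need a tolerance.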
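Example ultrametric matrix/tree
    A  B  C  D  E
A   0  9  9  6  4
B      0  4  9  9
C         0  9  9
D            0  6
E               0
[Figure: ultrametric tree. The root is labeled 9. Its children are a node labeled 6, whose children are D and a node labeled 4 over leaves A and E, and a node labeled 4 over leaves B and C.]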
• Variant: min-ultrametric trees are strictly increasing down the tree
– (just a sign change, same procedures apply)
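The rule "pairwise score = number at the lowest common ancestor" can be sketched directly. The nested-tuple encoding `(label, left, right)` and the function name are assumptions made for illustration; the tree itself is my reading of the example figure.

```python
def leaf_distances(tree):
    """tree is a leaf name (str) or a tuple (label, left, right).

    Returns (list of leaves, {frozenset({x, y}): distance}).
    """
    if isinstance(tree, str):
        return [tree], {}
    label, left, right = tree
    l_leaves, dist = leaf_distances(left)
    r_leaves, r_dist = leaf_distances(right)
    dist.update(r_dist)
    for x in l_leaves:
        for y in r_leaves:
            # x and y sit on opposite sides, so this node is their LCA
            dist[frozenset((x, y))] = label
    return l_leaves + r_leaves, dist

# Example tree: root 9 over (6 over D and (4 over A, E)) and (4 over B, C)
TREE = (9, (6, "D", (4, "A", "E")), (4, "B", "C"))
leaves, dist = leaf_distances(TREE)
for pair in (("A", "E"), ("A", "D"), ("B", "C"), ("A", "B")):
    print(pair, dist[frozenset(pair)])
```

Each pair of leaves is assigned exactly once, at the node where the two first meet, so the work is proportional to the number of pairs.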
Molecular clock theory
• Ultrametric trees generally involve time since divergence
• The molecular clock theory of Zuckerkandl and Pauling states
– For any given protein, accepted mutations in the amino acid
sequence occur at a constant rate (from Gusfield)
– Accepted means not impacting function
– Hence number of changes is proportional to time
– (rate varies depending on protein)
• Evidence of mutation can be based on sequence edits
• Most real data is not ultrametric – assumptions too strong
Additive-distance trees
• Relaxes some assumptions on the constancy of the rate
• Scores are labeled on links rather than internal nodes of the tree
• Distance between two leaves (strings) is the sum of scores on links
between the leaves
• We can move away from straight phylogenic trees and allow strings
to label internal nodes
– “Compact” additive-distance trees introduce no additional nodes
beyond leaves
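As a minimal sketch of "distance = sum of scores on the links between the leaves", the code below walks an edge-weighted tree from each leaf and accumulates link scores. The edge-list encoding, the function name, and the internal node names `root`, `x`, `y`, `z` are assumptions for illustration; the link scores are my reading of the example tree on the next slide.

```python
def path_distances(edges, leaves):
    """edges: list of (u, v, score). Returns {frozenset({a, b}): path length}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        seen = {src: 0}
        stack = [src]
        while stack:  # depth-first walk; a tree has exactly one path to each node
            u = stack.pop()
            for v, w in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + w
                    stack.append(v)
        for dst in leaves:
            if dst != src:
                dist[frozenset((src, dst))] = seen[dst]
    return dist

EDGES = [
    ("root", "x", 2), ("root", "y", 2),
    ("x", "D", 3), ("x", "z", 1),
    ("z", "A", 2), ("z", "E", 2),
    ("y", "B", 2), ("y", "C", 2),
]
dist = path_distances(EDGES, ["A", "B", "C", "D", "E"])
print(dist[frozenset(("A", "E"))], dist[frozenset(("A", "D"))], dist[frozenset(("A", "B"))])
```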
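Example additive-distance tree
    A  B  C  D  E
A   0  9  9  6  4
B      0  4  9  9
C         0  9  9
D            0  6
E               0
[Figure: additive-distance tree for this matrix. The root has two links of score 2, one to each of two internal nodes. The first internal node has a link of score 3 to D and a link of score 1 to the parent of A and E; A and E each hang from that parent by a link of score 2. The second internal node has links of score 2 to each of B and C.]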
• Ultrametric matrices can be represented with additive-distance trees
• O(n^2) algorithms for building additive-distance trees from n×n matrices
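Whether a matrix admits an additive-distance tree can be tested with the standard four-point condition, which is not stated on the slides: for every four leaves, of the three ways to pair them up, the two largest pair-sums must be equal. The sketch below assumes that condition; the function name and the perturbed matrix are hypothetical.

```python
from itertools import combinations

def is_additive(d):
    """Four-point test on a symmetric matrix given as a list of lists."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        if sums[1] != sums[2]:  # the two largest pair-sums must tie
            return False
    return True

# The ultrametric example matrix (leaves A..E); ultrametric implies additive.
ULTRA = [
    [0, 9, 9, 6, 4],
    [9, 0, 4, 9, 9],
    [9, 4, 0, 9, 9],
    [6, 9, 9, 0, 6],
    [4, 9, 9, 6, 0],
]
# Hypothetical perturbation: d(A, B) changed from 9 to 7 breaks additivity.
BROKEN = [row[:] for row in ULTRA]
BROKEN[0][1] = BROKEN[1][0] = 7

print(is_additive(ULTRA), is_additive(BROKEN))
```

This check is O(n^4) and only answers yes or no; it is not one of the O(n^2) construction algorithms mentioned above.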
Building phylogenic trees
• Given a set of sequences, how can we build a phylogenic tree from
those sequences?
• Two main approaches
– Distance-based approaches (minimize distance)
– Parsimony (fewest required changes)
• Distance-based approaches are well suited to ultrametric and
additive distance trees
– Many problems are not so well behaved
• Parsimony does not make such assumptions
Simple distance-based tree building
• Standard approach (around since the late 1950s) involves agglomerative
pairwise distance-based clustering
• Unweighted pair group method using arithmetic averages,
aka “UPGMA”
• Builds binary tree by iteratively merging two closest clusters
• Initialize with each string as a cluster unto itself
• Cluster distances are based on pairwise distances
$$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
Efficient re-calculation of pairwise distances
• When two clusters Ci, Cj are merged into Ck , must calculate
pairwise distances to other remaining clusters
• Efficient re-calculation possible:
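$$
\begin{aligned}
d(C_k, C_l) &= \frac{1}{|C_k|\,|C_l|} \sum_{x \in C_k} \sum_{y \in C_l} d(x, y) \\
&= \frac{1}{(|C_i| + |C_j|)\,|C_l|} \left( \sum_{x \in C_i} \sum_{y \in C_l} d(x, y) + \sum_{x \in C_j} \sum_{y \in C_l} d(x, y) \right) \\
&= \frac{|C_l|\,|C_i|\, d(C_i, C_l) + |C_l|\,|C_j|\, d(C_j, C_l)}{(|C_i| + |C_j|)\,|C_l|} \\
&= \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}
\end{aligned}
$$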
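The identity above can be checked numerically. This is a small sketch, with hypothetical function names, that compares the direct average over all cross-cluster pairs against the size-weighted shortcut. The matrix is the five-leaf additive example used later in these slides.

```python
from itertools import product
from math import isclose

# Five-leaf example matrix (leaves A..E), filled in symmetrically.
DIST = [
    [0, 8, 8, 5, 4],
    [8, 0, 2, 5, 6],
    [8, 2, 0, 5, 6],
    [5, 5, 5, 0, 3],
    [4, 6, 6, 3, 0],
]
A, B, C, D, E = range(5)

def avg_dist(dist, ci, cj):
    """Average of d(x, y) over all x in ci, y in cj (the definition)."""
    return sum(dist[x][y] for x, y in product(ci, cj)) / (len(ci) * len(cj))

def merged_dist(d_il, d_jl, n_i, n_j):
    """Distance from the merged cluster to Cl, using only the two old distances."""
    return (n_i * d_il + n_j * d_jl) / (n_i + n_j)

ci, cj, cl = [A], [D, E], [B, C]
direct = avg_dist(DIST, ci + cj, cl)
shortcut = merged_dist(avg_dist(DIST, ci, cl), avg_dist(DIST, cj, cl), len(ci), len(cj))
print(direct, shortcut, isclose(direct, shortcut))
```

The shortcut needs only two stored distances and two cluster sizes per update, rather than a pass over every pair of strings.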
Clustering by distance
• If the distance matrix is ultrametric, UPGMA is the
right approach
• If, however, the distance matrix is not ultrametric,
but additive, then there are better clustering methods than UPGMA
• Why not join the closest as UPGMA suggests?
– Because the two closest strings need not be neighbors (share a parent) in the additive tree
– Additivity is a very different requirement from minimum distance
UPGMA clustering of additive-distance matrix
    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
[Figure: tree so far, with B and C joined.]
UPGMA clustering of additive-distance matrix
     A  BC   D   E
A    0   8   5   4
BC       0   5   6
D            0   3
E                0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
[Figure: tree so far, with B and C joined and D and E joined.]
UPGMA clustering of additive-distance matrix
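      A   BC   DE
A     0    8  4.5
BC         0  5.5
DE              0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
[Figure: tree so far, with B and C joined, D and E joined, and A joined to the (D, E) cluster.]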
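The three steps on these slides (find the lowest score, merge, update distances) can be put together as follows. This is a sketch under assumptions rather than a reference implementation: clusters are nested tuples, ties are broken by dictionary order, and the function name is hypothetical.

```python
def upgma(names, d):
    """Return (tree, merges): a nested-tuple tree and the list of (a, b, distance)."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    dist = {frozenset((i, j)): d[i][j]
            for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)           # lowest score in the matrix
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        merges.append((ti, tj, dist.pop(pair)))
        for l in clusters:                       # size-weighted update
            d_il = dist.pop(frozenset((i, l)))
            d_jl = dist.pop(frozenset((j, l)))
            dist[frozenset((next_id, l))] = (ni * d_il + nj * d_jl) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges

DIST = [
    [0, 8, 8, 5, 4],
    [8, 0, 2, 5, 6],
    [8, 2, 0, 5, 6],
    [5, 5, 5, 0, 3],
    [4, 6, 6, 3, 0],
]
tree, merges = upgma("ABCDE", DIST)
print(tree)
for a, b, score in merges:
    print(a, b, score)
```

On this matrix the merges are (B, C) at 2, (D, E) at 3, then A with (D, E) at 4.5, matching the three slides above.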
Additive matrix
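    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
Truth:
[Figure: additive tree ((B, C), (D, (A, E))). Leaf links: A 3, B 1, C 1, D 1, E 1. The parent of A and E hangs by a link of score 1 from the node that also holds D. The two links at the root have scores 2 and 1, so the path from that node to the parent of B and C has length 3.]
What UPGMA found:
[Figure: tree ((B, C), (A, (D, E))).]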
Neighbor joining in additive distance
• Two key ideas in modifying clustering for additive trees
• “Normalize” distance by average distances over set of leaves L
$$D(x, y) = d(x, y) - \frac{\sum_{z \in L} \bigl( d(x, z) + d(y, z) \bigr)}{|L| - 2}$$
• Change distance recalculation after merging clusters
• If $C_k = C_i \cup C_j$ then for all nodes m
$$d(k, m) = \frac{1}{2} \bigl( d(i, m) + d(j, m) - d(i, j) \bigr)$$
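The normalized score can be computed from row sums, since the sum over z of d(x, z) + d(y, z) is just the row sum for x plus the row sum for y. The sketch below is illustrative only (the function name is hypothetical) and applies the formula to the five-leaf additive matrix used in these slides.

```python
def nj_scores(d):
    """Return {(i, j): D(i, j)} for i < j, with D as defined above."""
    n = len(d)
    row = [sum(r) for r in d]  # row[x] = sum over z of d(x, z)
    return {(i, j): d[i][j] - (row[i] + row[j]) / (n - 2)
            for i in range(n) for j in range(i + 1, n)}

NAMES = "ABCDE"
DIST = [
    [0, 8, 8, 5, 4],
    [8, 0, 2, 5, 6],
    [8, 2, 0, 5, 6],
    [5, 5, 5, 0, 3],
    [4, 6, 6, 3, 0],
]
scores = nj_scores(DIST)
best = min(scores, key=scores.get)
print(NAMES[best[0]], NAMES[best[1]], scores[best])
```

The lowest score identifies the pair to merge first, here B and C.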
Additive neighbor joining
Distances d:
    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
→ normalized scores D:
       A       B       C       D       E
A      0   -7.33   -7.33   -9.33  -10.67
B              0     -12      -8   -7.33
C                      0      -8   -7.33
D                              0   -9.33
E                                      0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
$$d(k, m) = \frac{1}{2} \bigl( d(i, m) + d(j, m) - d(i, j) \bigr)$$
Additive neighbor joining
Distances d after merging B and C:
     A  BC   D   E
A    0   7   5   4
BC       0   4   5
D            0   3
E                0
→ normalized scores D:
     A   BC    D    E
A    0   -9   -9  -10
BC        0  -10   -9
D              0   -9
E                   0
• Two possible merges
• Additive-distance tree (unlike ultrametric) not necessarily unique
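The merge update and the resulting tie can be reproduced with a short sketch. Function names are hypothetical, and the merged cluster is appended as the last row and column, so the row order differs from the slide.

```python
def nj_scores(d):
    """Normalized scores D(i, j) = d(i, j) - (rowsum_i + rowsum_j) / (n - 2)."""
    n = len(d)
    row = [sum(r) for r in d]
    return {(i, j): d[i][j] - (row[i] + row[j]) / (n - 2)
            for i in range(n) for j in range(i + 1, n)}

def nj_merge(names, d, i, j):
    """Merge clusters i and j using d(k, m) = (d(i, m) + d(j, m) - d(i, j)) / 2."""
    keep = [m for m in range(len(d)) if m not in (i, j)]
    new_row = [0.5 * (d[i][m] + d[j][m] - d[i][j]) for m in keep]
    new_d = [[d[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
    new_d.append(new_row + [0.0])
    new_names = [names[m] for m in keep] + [names[i] + names[j]]
    return new_names, new_d

NAMES = ["A", "B", "C", "D", "E"]
DIST = [
    [0, 8, 8, 5, 4],
    [8, 0, 2, 5, 6],
    [8, 2, 0, 5, 6],
    [5, 5, 5, 0, 3],
    [4, 6, 6, 3, 0],
]
names, d = nj_merge(NAMES, DIST, 1, 2)       # merge B and C
scores = nj_scores(d)
lowest = min(scores.values())
tied = [(names[i], names[j]) for (i, j), s in scores.items() if abs(s - lowest) < 1e-9]
print(names, d[-1])
print(lowest, tied)
```

Two pairs share the lowest score, (A, E) and (D, BC), which are the two possible merges noted above.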
Multiple trees (basically unrooted)
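[Figure: two rooted drawings of the additive tree over leaves A to E, one for each possible merge; link scores are 1, 2, or 3. The drawings differ in where the root is placed.]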
Internal nodes
• Just finding the tree-topology is not necessarily the end product
• What about labels on the internal nodes?
• A phylogenic tree is hypothesizing a point of divergence
– There was an ancestor string at that point
– Can we hypothesize the string in addition to the point of divergence?
• One method is “maximum parsimony” or just “parsimony”
• Sort of an Occam’s razor approach: hypothesize as few mutations
as necessary
Parsimony
• Parsimony methods are generally presented for a given tree
– Multiple trees can be compared, but searching over all possible
trees is generally intractable
• For the current discussion, assume that we are given a tree
– Perhaps derived via iterative pairwise alignment
• Since we have a tree, assume that we have a multi-sequence alignment consistent with that tree
• Given the tree and multi-sequence alignment of leaves, parsimony
looks for the minimum substitutions/mutations over the tree
Phylogenic Alignment problem
• Given a phylogenic tree T , the phylogenic alignment problem is:
– Label the internal nodes of T such that the overall distance of
the alignment is minimized
– Overall distance is the sum of all parent/child distances
– Usually different “sites” in the string are modeled independently
• Minimum mutation problem (Fitch-Hartigan)
– Phylogenic Alignment problem when given multi-sequence
alignment of the leaves
– Efficient dynamic programming when given tree T
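For a single site with unit cost per change, the minimum mutation count can be sketched with Fitch's set-based rule: take the intersection of the children's candidate sets if it is non-empty, otherwise take the union and count one mutation. The sketch below treats the gap as a fifth character and handles sites independently, as noted above. The tree shape is my reading of the figures in the example that follows, and the function name is hypothetical.

```python
# Aligned leaf strings from the example that follows.
SEQS = {
    "A": "CATG-AAG",
    "D": "G-AG-ATT",
    "B": "G-CATCCT",
    "E": "C--G-AGT",
    "C": "G-GATGCT",
}
# Assumed topology, read off the example figures: ((B, C), (D, (A, E)))
TREE = (("B", "C"), ("D", ("A", "E")))

def fitch(tree, column):
    """tree: leaf name or (left, right); column: {leaf: char}.

    Returns (candidate characters at this node, mutations in the subtree).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    (l_set, l_cost), (r_set, r_cost) = fitch(tree[0], column), fitch(tree[1], column)
    common = l_set & r_set
    if common:
        return common, l_cost + r_cost
    return l_set | r_set, l_cost + r_cost + 1

n_sites = len(SEQS["A"])
per_site = [fitch(TREE, {k: s[i] for k, s in SEQS.items()})[1] for i in range(n_sites)]
print(per_site, sum(per_site))
```

Each site costs one pass over the tree, so the total work is proportional to the number of nodes times the number of sites.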
Continue with additive tree example
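    A  B  C  D  E
A   0  8  8  5  4
B      0  2  5  6
C         0  5  6
D            0  3
E               0
[Figure: the additive tree ((B, C), (D, (A, E))). Leaf links: A 3, B 1, C 1, D 1, E 1; a link of score 1 above the parent of A and E; links of score 2 and 1 at the root.]
A: CATG-AAG
D: G-AG-ATT
B: G-CATCCT
E: C--G-AGT
C: G-GATGCT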
Minimum mutation dynamic programming
[Figure: the tree ((B, C), (D, (A, E))) with the aligned strings at its leaves.]
D: G−AG−ATT
B: G−CATCCT
A: CATG−AAG
E: C−−G−AGT
C: G−GATGCT
Minimum mutation dynamic programming
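Cost table at the parent of A (CATG−AAG) and E (C−−G−AGT); one row per character at the node, one column per site:
C: 02222222
T: 22122221
A: 21222012
G: 22202211
−: 21120222
Other leaves: D: G−AG−ATT, B: G−CATCCT, C: G−GATGCT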
Minimum mutation dynamic programming
Cost table at the parent of B (G−CATCCT) and C (G−GATGCT):
C: 22122102
T: 22220220
A: 22202222
G: 02122122
−: 20222222
The table at the parent of A and E is unchanged:
C: 02222222
T: 22122221
A: 21222012
G: 22202211
−: 21120222
Minimum mutation dynamic programming
Cost table at the parent of D (G−AG−ATT) and of the (A, E) node:
C: 13322233
T: 23222221
A: 22222023
G: 13302222
−: 21220233
The tables at the parent of A and E and at the parent of B and C are unchanged.
Minimum mutation dynamic programming
Cost table at the root:
C: 23422232
T: 33421231
A: 33512233
G: 13412233
−: 31421343
The tables at the three lower internal nodes are unchanged.
Minimum mutation backtrace
Root: G−−ATCCT
[Figure: the root now carries the string above; cost tables remain at the other three internal nodes.]
Minimum mutation backtrace
Root: G−−ATCCT
Parent of D and of the (A, E) node: G−−G−ACT
[Figure: cost tables remain at the parent of A and E and at the parent of B and C.]
Minimum mutation backtrace
Root: G−−ATCCT
Parent of B and C: G−−ATCCT
Parent of D and of the (A, E) node: G−−G−ACT
[Figure: a cost table remains only at the parent of A and E.]
Minimum mutation backtrace
Root: G−−ATCCT
  Parent of B and C: G−−ATCCT
    B: G−CATCCT
    C: G−GATGCT
  Parent of D and of the (A, E) node: G−−G−ACT
    D: G−AG−ATT
    Parent of A and E: C−−G−AAT
      A: CATG−AAG
      E: C−−G−AGT
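The backtrace can be sketched as a top-down pass over the cost tables: pick a cheapest character at the root, then at each child pick the character that minimizes its table entry plus one if it differs from the parent's choice. The code below assumes unit costs and the tree shape read off the figures, and breaks ties by alphabet order, so individual labels may differ from the slides while the total number of mutations stays the same. Function names are hypothetical.

```python
ALPHABET = "CTAG-"
INF = float("inf")
SEQS = {
    "A": "CATG-AAG",
    "D": "G-AG-ATT",
    "B": "G-CATCCT",
    "E": "C--G-AGT",
    "C": "G-GATGCT",
}
TREE = (("B", "C"), ("D", ("A", "E")))  # assumed topology

def fill_tables(tree, seqs, memo):
    """Bottom-up pass: memo[node] is a list of per-site {char: cost} dicts."""
    if isinstance(tree, str):
        memo[tree] = [{c: (0 if c == s else INF) for c in ALPHABET} for s in seqs[tree]]
        return memo
    for child in tree:
        fill_tables(child, seqs, memo)
    n_sites = len(memo[tree[0]])
    memo[tree] = [
        {c: sum(min(memo[ch][i][x] + (x != c) for x in ALPHABET) for ch in tree)
         for c in ALPHABET}
        for i in range(n_sites)
    ]
    return memo

def backtrace(tree, memo, parent=None, labels=None):
    """Top-down pass: choose a string for every node, given the parent's string."""
    labels = {} if labels is None else labels
    chosen = []
    for i, col in enumerate(memo[tree]):
        if parent is None:
            chosen.append(min(ALPHABET, key=lambda c: col[c]))
        else:
            chosen.append(min(ALPHABET, key=lambda c: col[c] + (c != parent[i])))
    labels[tree] = "".join(chosen)
    if not isinstance(tree, str):
        for child in tree:
            backtrace(child, memo, labels[tree], labels)
    return labels

def total_mutations(tree, labels):
    """Sum of parent/child mismatches over every link of the tree."""
    if isinstance(tree, str):
        return 0
    return sum(
        sum(a != b for a, b in zip(labels[tree], labels[ch])) + total_mutations(ch, labels)
        for ch in tree
    )

labels = backtrace(TREE, fill_tables(TREE, SEQS, {}))
print(labels[TREE], total_mutations(TREE, labels))
```

The labeling shown on the slide also costs 14 mutations when its parent/child mismatches are counted link by link.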
Minimum mutation without tree
• For a given tree, we have seen that this algorithm has an efficient
dynamic programming solution
• No efficient algorithm for exploring all possible phylogenic trees
for a set of strings
• However, for a given order of strings at the leaves, a variant of the
CYK algorithm could be used
– Typically used with context-free grammars
– O(k^3 n) for k strings and an n-column multi-sequence alignment
– Could result in any binary tree shape over the ordered set
Iterative approaches to MSA and tree building
• Having trees helps with multi-sequence alignment
• Having multi-sequence alignment helps with trees
• Iterative approaches can be used. One example:
1. Build a multi-sequence alignment using iterative alignment
2. Build an initial phylogenic tree using neighbor joining
3. Perform minimum mutation phylogenic alignment
4. Build a new multi-sequence alignment using the full phylogenic tree
5. Throw away the strings on internal nodes, preserving the MSA
6. Go back to step 2, using the new multi-sequence alignment