
Chapter 3: Phylogenetics
3.2 Computing Phylogeny
Prof. Yechiam Yemini (YY)
Computer Science Department
Columbia University
Overview
 Computing trees
 Distance-based techniques
 Maximum Parsimony (MP) techniques
 Maximum likelihood techniques
This chapter is based on Durbin Chapter 7
Also recommended: The Phylogenetic Handbook, Salemi and Vandamme 2004
Can We Tell Evolution From Homology?
[Figure: gene copies 1A, 2A, 3A and 1B, 2B, 3B arise through duplication and speciation events; only a partial sample of the homologs is observed.]
How do we tell the right tree?
[Figure: two candidate phylogenies over the sampled copies 1A, 3A, 2B.]
Phylogeny: Computing Trees
INPUT: aligned sequences
AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT
OUTPUT: a phylogenetic tree
[Figure: an unrooted tree whose leaves U, V, W, X, Y carry the input sequences.]
Brute Force Approach
Brute Force
 Enumerate all trees
 Compute some measure of evolutionary likelihood
 Select best tree
How many rooted trees are there with n leaves?
 n=2 leaves => 1 tree
 n=3 leaves => attach the 3rd leaf to any of 3 edges => 3 trees
Let T(n) = # rooted trees with n leaves; E(n) = # edges (counting the edge above the root)
 T(2)=1, E(2)=3; T(3)=3, E(3)=5
 Adding a leaf creates two new edges => E(n) = E(n-1)+2 => E(n) = 2n-1
 T(n) = T(n-1)·E(n-1) = T(n-1)·(2n-3) => T(n) = 1·3·5·…·(2n-3)
 For n=20 leaves this is ~10^21
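The recurrence is easy to evaluate numerically; a minimal Python sketch (the function name is ours):

    from functools import reduce

    def num_rooted_trees(n: int) -> int:
        """T(n) = 1*3*5*...*(2n-3), the number of rooted binary trees
        on n labeled leaves (n >= 2)."""
        return reduce(lambda acc, k: acc * k, range(3, 2 * n - 2, 2), 1)

    print(num_rooted_trees(3))    # 3
    print(num_rooted_trees(20))   # 8200794532637891559375, ~10^21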
Approaches
 Distance-based
 Tree should best model the evolutionary distance metric among the taxa
 Character-based [Maximum Parsimony (MP)]
 Tree should minimize the number of changes
 Maximum likelihood (ML)
 Tree should maximize the likelihood of the observed sequences
Distance-Based Techniques
I. Distance-Based Techniques
 Key idea:
  Compute an evolutionary distance metric D among S = {U,V,W,X,Y}
  Compute a tree on S that best fits the distances D
 Formally:
  Given: an n×n distance matrix D
  Compute: a weighted tree T on n leaves that “best fits” D
 How to establish evolutionary distance measures?
  Distance ~ number of changes (e.g., AA substitutions)
  Next chapter: evaluating distance using Markovian evolution models
Is There A Tree That Perfectly Fits D?
Not every distance metric D can be modeled by a tree
How can we tell which distance metrics model a tree?

       U  V  W  X               U  V  W  X
    U                        U
    V  2                     V  1
    W  2  2                  W  1  2
    X  2  2  1               X  2  1  1

[Figure: the left matrix is realized exactly by a tree with pendant edges U:1, V:1, W:.5, X:.5 and an internal edge of .5; for the right matrix no such tree exists (marked "?").]
The Four-Point Condition
 A distance matrix corresponding to a tree is called additive
 THEOREM: D is additive if and only if, for every four indices i,j,k,l (suitably ordered), the maximum and the median of the three pairwise sums are identical:
    Dij + Dkl ≤ Dik + Djl = Dil + Djk
 This suggests how to connect 4 points into a tree to fit D: join (i,j) and (k,l) through a central edge
Check on the two matrices above: the additive one satisfies DUV + DWX < DUW + DVX = DUX + DVW (3 < 4 = 4), while the non-additive one yields sums 2, 2, 4, whose maximum and median differ.
[Figure: the quartet tree joining pairs (i,j) and (k,l) through a central edge.]
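The condition is straightforward to test; a sketch (the dict-of-dicts layout and helper names are ours):

    from itertools import combinations

    def is_additive(D, tol=1e-9):
        """Four-point check on a symmetric dict-of-dicts distance matrix:
        for every quartet, the two largest of the three pairwise sums
        must be equal."""
        for i, j, k, l in combinations(list(D), 4):
            sums = sorted([D[i][j] + D[k][l],
                           D[i][k] + D[j][l],
                           D[i][l] + D[j][k]])
            if abs(sums[2] - sums[1]) > tol:    # max must equal median
                return False
        return True

    def full(upper):
        """Expand {(a, b): d} into a symmetric matrix with zero diagonal."""
        m = {t: {t: 0.0} for pair in upper for t in pair}
        for (a, b), v in upper.items():
            m[a][b] = m[b][a] = v
        return m

    additive = full({("U","V"): 2, ("U","W"): 2, ("U","X"): 2,
                     ("V","W"): 2, ("V","X"): 2, ("W","X"): 1})
    non_add  = full({("U","V"): 1, ("U","W"): 1, ("U","X"): 2,
                     ("V","W"): 2, ("V","X"): 1, ("W","X"): 1})
    print(is_additive(additive), is_additive(non_add))   # True False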
How Do We Handle Non-Additive D?
Additive metrics are very useful
 They provide a perfect fit with a tree model; the tree is easily computed from D
But evolutionary distance metrics are often non-additive
How do we handle a non-additive metric?
Fitch & Margoliash: find a tree T minimizing the least-squares fit:
    E(T) = Σi,j (dij(T) − Dij)²
 This problem is NP-hard => need heuristics
 Fitch & Margoliash (1968) – exhaustive search
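The objective is just a sum of squared residuals between tree-induced and observed distances; a minimal sketch (NumPy; the helper name is ours):

    import numpy as np

    def ls_error(d_tree: np.ndarray, D: np.ndarray) -> float:
        """Fitch-Margoliash objective E(T) = sum over pairs i<j of
        (d_ij(T) - D_ij)^2, for two n x n distance matrices."""
        resid = d_tree - D
        i, j = np.triu_indices_from(resid, k=1)   # each unordered pair once
        return float((resid[i, j] ** 2).sum())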
Closest-Pair Clustering
Idea: use D to guide closest-pair clustering
 Extend D to clusters by UPGMA/WPGMA averaging
UPGMA Algorithm
Initialization
 Initialize n clusters Ci = {Si}
 Initialize T with a leaf for each cluster Ci
Iteration
 Find Ci, Cj with the smallest distance Dij
 Create a new cluster Ck = Ci ∪ Cj
 Add a new node to T for Ck and connect it to Ci, Cj
 If all nodes are connected to a tree, exit; otherwise assign Dki = Dkj = Dij/2
 and compute the distance Dkl to every other cluster Cl:
    Dkl = (Dil·|Ci| + Djl·|Cj|) / (|Ci| + |Cj|)
 Repeat the iteration
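A compact sketch of this loop (the data layout and helper name are our assumptions, not the course's code):

    import itertools

    def upgma(dist):
        """UPGMA sketch. dist: {(a, b): d} over unordered pairs of taxa.
        Returns (subtree, size, height): subtree is a nested tuple and
        height is the molecular-clock distance from cluster to leaves."""
        D = {frozenset(p): d for p, d in dist.items()}
        clusters = {t: (t, 1, 0.0) for t in {x for p in dist for x in p}}
        while len(clusters) > 1:
            i, j = min(itertools.combinations(clusters, 2),
                       key=lambda p: D[frozenset(p)])   # closest pair
            (ti, ni, _), (tj, nj, _) = clusters[i], clusters[j]
            dij = D[frozenset((i, j))]
            k = i + "+" + j                             # merged cluster's name
            for l in clusters:                          # size-weighted averaging
                if l not in (i, j):
                    D[frozenset((k, l))] = (D[frozenset((i, l))] * ni +
                                            D[frozenset((j, l))] * nj) / (ni + nj)
            del clusters[i], clusters[j]
            clusters[k] = ((ti, tj), ni + nj, dij / 2)
        return next(iter(clusters.values()))

    # Toy run (illustrative numbers): the A,B pair merges first, at height 2
    print(upgma({("A", "B"): 4, ("A", "C"): 8, ("B", "C"): 8}))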
UPGMA: Molecular Clock Property
 Uniform distance from root to leaves
 Distance to root ~ evolutionary clock
 Species are assumed to take identical time to evolve
[Figure: an ultrametric tree with leaves 1-5 and internal nodes 6-9; each internal node created by merging clusters i,j sits at height 0.5·Dij, e.g. 0.5·D45, 0.5·D67, 0.5·D18.]
Notes
Complexity is O(n²)
Averaging redistributes distances to overcome non-additivity
Clustering can lead to substantial errors and is very sensitive
This limits the applications of clustering
 How do we overcome the sensitivity of UPGMA?

Example of failure:
       U   V   W   X
    U
    V  22
    W  24   6
    X  32  14  10

[Figure: the real (additive) tree has topology ((U,V),(W,X)) with pendant edges U:20, V:2, W:1, X:9 and internal edge 3, so the closest pair (V,W) are not siblings; UPGMA nevertheless joins V and W first (height 3), then X (height 6), then U (height 13), producing the wrong topology.]
Improvements Through Bootstrapping
 Bootstrapping: a statistical technique to increase robustness
 Scenario: given a sample S(ω) and a result R(S) computed from S
 Bootstrapping:
  o Resample S to get S’(ω)
  o Evaluate R(S’(ω))
  o Evaluate the match of R(S) with the values R(S’(ω))
 Here S = the columns of the aligned sequences (n of them); R(S) = the tree
  S’(ω) = a sample of n random columns of S, with possible repetitions
  Compute the phylogenetic tree R(S’(ω))
  Use {R(S’(ω))} to compute consensus/likelihood values for the branches of R(S)
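Column resampling is a one-liner; a small sketch (function name and data layout are ours):

    import random

    def bootstrap_replicates(seqs, B, seed=0):
        """Yield B bootstrap replicates of an alignment.
        seqs: dict taxon -> aligned sequence (all the same length).
        Each replicate redraws n columns uniformly with replacement."""
        rng = random.Random(seed)
        n = len(next(iter(seqs.values())))
        for _ in range(B):
            cols = [rng.randrange(n) for _ in range(n)]   # columns, with repetition
            yield {t: "".join(s[c] for c in cols) for t, s in seqs.items()}

    # Usage sketch: build a tree from each replicate with any method above, then
    # report for each branch of R(S) the fraction of replicate trees containing it.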
Bootstrapping Example
[Figure: bootstrap replicates of an alignment and the branch-support values they induce on the original tree.]
Closest Pair vs. Evolutionary Neighbors
Additivity: Dij+Dkl < Dik+Djl = Dil+Djk
[Figure: the quartet tree joining (i,j) and (k,l).]
 UPGMA overcomes non-additivity by averaging distances
 But the closest pair may not be evolutionary neighbors
 The evolutionary tree distances may diverge greatly; averaging distorts neighborhoods
Example: in the quartet ((U,V),(W,X)) with pendant edges U:1, V:1, W:4, X:22 and internal edge 1, the leaves W and X are neighbors, yet D(W,X)=26 is the largest entry of D; a closest-pair rule will never join them.
Neighbor Joining [Saitou & Nei 87; Studier & Keppler 88]
Neighbor joining heuristic:
join the closest clusters that are also far from everything else
 Define: Rk = Σi≠k Dik, the divergence of k
 Cluster the pair k,m that minimizes D’km = Dkm − (Rk+Rm)/(n−2)
 [Define rk = Rk/(n−2) and consider Dkm − rk − rm]
For the quartet above, D(U,V)=2, D(U,W)=D(V,W)=6, D(U,X)=D(V,X)=24, D(W,X)=26, so:

    r:  U: 16, V: 16, W: 19, X: 37

    D':      U     V     W
    V      -30
    W      -29   -29
    X      -29   -29   -30

The minima (-30) are attained exactly by the two true neighbor pairs, (U,V) and (W,X).
Neighbor Joining Algorithm
 Initialization (same as UPGMA): initialize n clusters Ci = {Si}
 Iteration:
  1. Compute rk = Σi≠k Dik /(n−2) for each cluster k
  2. Find (k,m) minimizing Dkm − rk − rm
  3. Define a new node i and set Dis = 0.5(Dks + Dms − Dkm) for all s
  4. Join node i to k and m with edges of respective lengths:
     Dki = 0.5(Dkm + rk − rm),  Dmi = 0.5(Dkm + rm − rk)
  5. Repeat until all nodes are connected
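The whole iteration fits in a few lines of Python; the sketch below (helper names and the tie-breaking rule, first minimum found, are our choices) follows steps 1-5 directly:

    import itertools

    def neighbor_joining(dist):
        """Neighbor-joining sketch. dist: {(a, b): d} over unordered pairs
        of taxon names. Returns weighted edges (node, node, length);
        internal nodes are named N1, N2, ..."""
        D = {frozenset(p): d for p, d in dist.items()}
        nodes = sorted({x for p in dist for x in p})
        edges, counter = [], itertools.count(1)
        while len(nodes) > 2:
            n = len(nodes)
            r = {k: sum(D[frozenset((k, i))] for i in nodes if i != k) / (n - 2)
                 for k in nodes}                                 # step 1
            k, m = min(itertools.combinations(nodes, 2),
                       key=lambda p: D[frozenset(p)] - r[p[0]] - r[p[1]])  # step 2
            dkm = D[frozenset((k, m))]
            new = f"N{next(counter)}"
            for s in nodes:                                      # step 3
                if s not in (k, m):
                    D[frozenset((new, s))] = 0.5 * (D[frozenset((k, s))] +
                                                    D[frozenset((m, s))] - dkm)
            edges.append((k, new, 0.5 * (dkm + r[k] - r[m])))    # step 4
            edges.append((m, new, 0.5 * (dkm + r[m] - r[k])))
            nodes = [s for s in nodes if s not in (k, m)] + [new]
        a, b = nodes
        edges.append((a, b, D[frozenset((a, b))]))
        return edges

Run on the 6-taxon matrix of the worked example that follows, this reproduces the slides' branch lengths (A:1 and B:4 to the first new node, and so on, down to the final length-1 edge).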
Example: Step 1: Compute the Divergences rX

        A     B     C     D     E     F
  A           5     4     7     6     8
  B     5           7    10     9    11
  C     4     7           7     6     8
  D     7    10     7           5     9
  E     6     9     6     5           8
  F     8    11     8     9     8

  Σ    30    42    32    38    34    44
  r   7.5  10.5     8   9.5   8.5    11

Step 1: compute rk = Σi≠k Dik /(n−2)
 Sum the columns, then divide by n−2 = 6−2 = 4
From The Phylogenetic Handbook, Salemi and Vandamme 2004
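A quick NumPy check of the divergences (the matrix is copied from the table above):

    import numpy as np

    labels = list("ABCDEF")
    D = np.array([[0, 5, 4, 7, 6, 8],
                  [5, 0, 7, 10, 9, 11],
                  [4, 7, 0, 7, 6, 8],
                  [7, 10, 7, 0, 5, 9],
                  [6, 9, 6, 5, 0, 8],
                  [8, 11, 8, 9, 8, 0]], dtype=float)
    r = D.sum(axis=0) / (len(labels) - 2)
    print(dict(zip(labels, r)))
    # {'A': 7.5, 'B': 10.5, 'C': 8.0, 'D': 9.5, 'E': 8.5, 'F': 11.0}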
Step 2: Find the Neighboring Pair
 Step 2: evaluate the neighboring distance matrix Nkm = Dkm − (rk + rm)
 [subtract the r row and column]
 Find (k,m) minimizing Nkm
 Create a new node U and attach it to k,m

  N:      A      B      C      D      E
  B     -13
  C   -11.5  -11.5
  D     -10    -10  -10.5
  E     -10    -10  -10.5    -13
  F   -10.5  -10.5    -11  -11.5  -11.5

Min{Nkm} = −13, attained by (A,B) and (D,E); pick (A,B) and create the new node U.
Note: UPGMA would instead connect the closest pair by raw distance, (A,C) with D=4.
Steps 3,4: Join Neighbors, Update Distances
Step 3: compute the branch lengths of U to A and B
 DAU = 0.5(DAB + rA − rB) = 0.5(5 − 3) = 1
 DBU = 0.5(DAB + rB − rA) = 0.5(5 + 3) = 4
Step 4: update the distance matrix
 DUX = 0.5(DAX + DBX − DAB)
 DUC = 0.5(4+7−5) = 3; DUD = 0.5(7+10−5) = 6
 DUE = 0.5(6+9−5) = 5; DUF = 0.5(8+11−5) = 7

        U   C   D   E   F
  U         3   6   5   7
  C     3       7   6   8
  D     6   7       5   9
  E     5   6   5       8
  F     7   8   9   8

[Tree so far: U joined to A (length 1) and B (length 4).]
Repeat Steps 1/2/3/4
 Step 1: compute rk = Σi≠k Dik /(n−2), now with n−2 = 3:
    r(U) = 7, r(C) = 8, r(D) = 9, r(E) = 8, r(F) = 10.7
 Step 2: compute the neighboring pair:
    Min{NXY = DXY − rX − rY} = −12 => (U,C) or (D,E); pick (U,C) and create V
 Step 3: join neighbors; compute branch lengths:
    DUV = 0.5(DUC + rU − rC) = 1; DCV = 2
 Step 4: re-compute distances: DVX = 0.5(DUX + DCX − DUC):

        V   D   E   F
  V         5   4   6
  D     5       5   9
  E     4   5       8
  F     6   9   8

[Tree so far: V joined to U (1) and C (2); U joined to A (1) and B (4).]
Repeat
        V   D   E   F
  V         5   4   6
  D     5       5   9
  E     4   5       8
  F     6   9   8

 Step 1: compute rk, now with n−2 = 2: r(V) = 7.5, r(D) = 9.5, r(E) = 8.5, r(F) = 11.5
 Step 2: Min{NXY = DXY − rX − rY} = −13 => (V,F) or (D,E); pick (D,E) and create W
 Step 3: DWD = 0.5(DDE + rD − rE) = 3; DWE = 2
 Step 4: DWX = 0.5(DDX + DEX − DDE):

        V   W   F
  V         2   6
  W     2       6
  F     6   6

[Tree so far: W joined to D (3) and E (2).]
Repeat
        V   W   F
  V         2   6
  W     2       6
  F     6   6

 Step 1: compute rk, now with n−2 = 1: r(V) = 8, r(W) = 8, r(F) = 12
 Step 2: Min{NXY = DXY − rX − rY} = −14 for every pair => pick (V,F) and create Z
 Step 3: DZV = 0.5(DVF + rV − rF) = 1; DZF = 5
 Step 4: DZX = 0.5(DVX + DFX − DVF): DZW = 1

        Z   W
  Z         1
  W     1

[Tree so far: Z joined to V (1) and F (5).]
Complete
The two remaining nodes, Z and W, are joined by an edge of length 1.
[Figure: the completed unrooted tree with edge lengths
 A-U: 1, B-U: 4, U-V: 1, C-V: 2, V-Z: 1, F-Z: 5, Z-W: 1, W-D: 3, W-E: 2.]
Notes on Neighbor Joining
Complexity is O(n³) for the straightforward implementation (about n iterations, each scanning O(n²) candidate pairs)
Does not depend on the molecular clock assumption
Heavily used in practice [e.g., Clustal W]
But can be sensitive to non-additivity
Maximum Parsimony
(character-based phylogeny)
Key Idea: Minimize Changes
 Reconsider the problem: find the “best” tree to explain the evolution of sequences
Motivation: focus on the evolution of individual positions
    ATTACTG, ATTACTA, GTTGCTA, ATTGCTA
 “Distance” loses information about evolutionary changes
 Key idea: find the tree requiring the fewest changes to explain the data
Example: for the four leaf sequences AAG, AGA, AAA, GGA, labeling all internal nodes AAA explains the data with C=4 changes (GGA alone costs 2); a better arrangement, with internal labels AAA, AAA, AGA, needs only C=3.
More Generally
 Taxa are considered as sets of attributes: characters
 “character” = DNA position, gene order, morphological feature…
 “character state” = a value assumed by a character
 Characters evolve through state changes
 Evolutionary tree represents changes in character states
 MP-tree seeks to minimize state changes
MP Example
http://evolution.berkeley.edu/evosite/evo101/IIC1aUsingparsimony.shtml
[Figure: taxa scored on binary-state characters; the preferred tree explains the data with 1 state change.]

MP Example
[Figure: two candidate trees for the same data, requiring 7 and 6 state changes; MP prefers the 6-change tree.]
Example: Evolution of a Gene
www.life.uiuc.edu/ib/335/MolSyst.html
Character = position; state = nucleotide
[Figure: taxa with aligned gene sequences and the inferred MP tree.]

Example: Evolution of a Gene
http://home.cc.umanitoba.ca/~psgendb/GDE/phylogeny/parsimony/phylip.parsimony.html
Character = position; state = nucleotide
[Figure: PHYLIP-style parsimony example.]

Example
Pevzner 2003 Genome Research
MP rearrangements of chromosome X
[Figure: rearrangement phylogeny of chromosome X.]
The Max Parsimony (MP) Problem
“Big” MP:
 Input: a set of n aligned sequences of length k
 Output: a phylogenetic tree T such that
  o T has n leaves labeled with the input sequences (taxa)
  o T has internal nodes labeled with sequences of length k (states)
  o T minimizes the total Hamming distance summed over its edges
[Figure: a tree over leaves AAG, AAA, GGA, AGA with internal labels AAA, AAA, AGA and cost H=3.]
This is a Steiner-tree-type problem
 Can be shown to be NP-hard [Gusfield, Foulds]
 But often the number of sequences considered is small
“Small” MP:
 Input: a tree with sequence-labeled leaves
 Output: a labeling of the internal nodes that maximizes parsimony (minimizes changes)
MP Basics
 Consider {ATA, ATT, GTT, GTA, GGT}
 The first column (A,A,G,G,G) admits different arrangements and identifies the likely mutation: the MP arrangement separates the two A's from the three G's (1 mutation); other arrangements need 2 mutations
[Figure: an MP tree (1 mutation) vs. an alternative tree (2 mutations) for the first column.]
 The second column (T,T,T,T,G) provides no clue about the likely grouping: every tree explains it with a single mutation
[Figure: two different trees, each explaining the second column with one T→G change.]
 Non-informative position (informative positions need at least 2 states, each appearing at least twice)
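The informativeness test is easy to automate; a small sketch (the definition coded here, at least two states each appearing at least twice, is the standard one):

    from collections import Counter

    def informative_positions(seqs):
        """Flag parsimony-informative columns of an alignment (a list of
        equal-length strings): a column is informative iff at least two
        states each occur in at least two taxa."""
        flags = []
        for pos in range(len(seqs[0])):
            counts = Counter(s[pos] for s in seqs)
            flags.append(sum(1 for c in counts.values() if c >= 2) >= 2)
        return flags

    print(informative_positions(["ATA", "ATT", "GTT", "GTA", "GGT"]))
    # [True, False, True]; only columns 1 and 3 are informative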
MP Basics
[Figure: MP trees for column 1 (A,A,G,G,G) and column 3 (A,T,T,A,T) of {ATA, ATT, GTT, GTA, GGT}.]
Merge the MP trees of columns 1 & 3:
[Figure: the two full MP trees over {ATA, ATT, GTT, GTA, GGT} consistent with both columns.]
Two MP trees
Example (N. Friedman)
Aardvark: CAGGTA
Bison: CAGACA
Chimp: CGGGTA
Dog: TGCACT
Elephant: TGCGTA
[Figure: an MP tree over the five taxa, internal nodes labeled with inferred sequences (CAGGTA, CGGGTA, TGGGTA, TGCGTA).]
Example: Evolution of Protein Domains
http://ai.stanford.edu/~serafim/CS374_2006/
[Figure: presence/absence (0/1) patterns of domains D1, D2, D3 evolving along a tree; total cost: 3.]
 C. Chothia et al., “Evolution of the Protein Repertoire”, Science vol. 300, 13 June 2003
 T. Przytycka et al., “Graph Theoretical Insights…”, RECOMB 2005, LNBI 3500, pp. 311-325, 2005
Single-Site MP: The Fitch Algorithm
Problem:
 Input: a tree T with labeled leaves
 Output: labels of the internal nodes of an MP tree + cost C
Step 1: assign to each node x a set of candidate labels S(x), traversing T in postorder (leaves to root):
 If x is a leaf, then S(x) = {label of x}; C ⇐ 0
 If x has children y, z:
    S(x) = S(y)∩S(z) if S(y)∩S(z) ≠ ∅
    else S(x) = S(y)∪S(z), C ⇐ C+1
Step 2: assign to each node x a character value v(x), traversing T in preorder (root to leaves):
 If y is the parent of x and v(y) ∈ S(x), then v(x) ⇐ v(y)
 else v(x) = any label from S(x)
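A direct transcription of the two passes (the tree encoding and helper names are ours):

    def fitch(tree, leaf_label):
        """Fitch's algorithm for one character. tree: nested tuples with
        leaf names (str) at the tips; leaf_label: dict leaf -> state.
        Returns (chosen labels keyed by subtree, parsimony cost C)."""
        sets, choice, cost = {}, {}, 0

        def up(node):                      # Step 1: postorder label sets
            nonlocal cost
            if isinstance(node, str):
                sets[node] = {leaf_label[node]}
            else:
                y, z = node
                up(y); up(z)
                both = sets[y] & sets[z]
                sets[node] = both or (sets[y] | sets[z])
                cost += 0 if both else 1
            return sets[node]

        def down(node, parent_state):      # Step 2: preorder final choice
            s = sets[node]
            choice[node] = parent_state if parent_state in s else min(s)
            if not isinstance(node, str):
                for child in node:
                    down(child, choice[node])

        up(tree)
        down(tree, None)
        return choice, cost

    tree = ((("1", "2"), "3"), ("4", "5"))                      # toy tree
    labels = {"1": "A", "2": "A", "3": "G", "4": "G", "5": "A"}
    print(fitch(tree, labels)[1])   # cost 2, as in the example that follows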
Step 1: Computing Candidate Labels
[Figure: the bottom-up pass on an example tree whose leaves are labeled A or G. Sibling sets are intersected when they overlap and united when disjoint ({A}∪{G} = {A, G}, incrementing C); the run shown ends with C=2.]
Step 2: Selecting MP Labels
[Figure: the top-down pass over the same tree. The root picks a label from its set {A, G}; each node inherits its parent's label whenever that label is in its own set, otherwise it picks any member. The result is an MP labeling with C=2.]
Notes
 The algorithm is fast: O(nk), where n = # nodes and k = # character values
 It selects one particular MP labeling (there may be others)
[Figure: two alternative MP labelings of the same tree, both achieving C=2.]
 Run separately for each character, then merge the results
 May be generalized to weighted parsimony:
  Sankoff's generalization: different costs for different changes
Heuristic MP Algorithms
 Use Steiner-tree heuristic algorithms
 Branch-and-bound search
  Represent the search space as a tree
  (nodes at the k-th level represent phylogenetic trees for the first k species)
  Find the best-scoring search node and use it as a bound
  Branch to the children of this search node
 Nearest-neighbor interchange (NNI): switch subtrees
 Simulated annealing
 …
Maximum Likelihood Approach
(III) Maximum Likelihood Approaches
(Based on N. Friedman's slides)
Key idea: compute the maximum likelihood tree
 Many models of changes (trees) can yield the observed data
 Compute the tree that maximizes the likelihood
Problem 1: given T, compute the probability P(S|T)
 S = {x1,…,xn} are the observed sequences
 Need a probability model of the changes generated by T:
  o Background probabilities: q(a)
  o Mutation probabilities: P(a|b,t)
Problem 2: compute the T that maximizes P(S|T)
 This is the complex part
[Figure: a tree with observed leaves x1, x2, x3, internal nodes x4, x5, and branch lengths t1,…,t4.]
Tree Likelihood Computation
 Define P(Lk|a) = the probability of the subtree below node k, given xk = a
 Init: for every leaf k, P(Lk|a) = 1 if xk = a; 0 otherwise
 Iteration: if k is a node with children i and j, then
    P(Lk|a) = Σb,c P(b|a,ti)·P(Li|b)·P(c|a,tj)·P(Lj|c)
 Termination: the likelihood is
    P(x1,…,xn|T,t) = Σa P(Lroot|a)·q(a)
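A minimal sketch of this recursion (Felsenstein's pruning; the data structures are our assumptions):

    import numpy as np

    STATES = "ACGT"

    def prune(node, branch_P, leaf_state):
        """Return the vector L with L[a] = P(data below node | state a at node).
        node: nested tuple (left, right), or a taxon name (str) at a tip;
        branch_P: dict child -> 4x4 matrix with entry [a, b] = P(b | a, t_child);
        leaf_state: dict taxon -> observed nucleotide."""
        if isinstance(node, str):                      # leaf: indicator vector
            L = np.zeros(len(STATES))
            L[STATES.index(leaf_state[node])] = 1.0
            return L
        left, right = node
        # condition on this node's state and sum over each child's state
        return ((branch_P[left] @ prune(left, branch_P, leaf_state)) *
                (branch_P[right] @ prune(right, branch_P, leaf_state)))

    def site_likelihood(root, branch_P, leaf_state, q):
        """Termination: sum the root vector against background frequencies q."""
        return float(q @ prune(root, branch_P, leaf_state))

    def jc(t):
        """Jukes-Cantor transition matrix for branch length t (standard formula)."""
        p = 0.25 * (1.0 - np.exp(-4.0 * t / 3.0))
        return np.full((4, 4), p) + np.eye(4) * (1.0 - 4.0 * p)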
Maximum Likelihood (ML)
 Score each tree by
    P(X1,…,Xn|T,t) = Πm P(x1[m],…,xn[m]|T,t)
 (assumes independent positions)
 Find the highest-scoring tree:
  Exhaustive search
  Sampling methods (Metropolis)
  Approximation (consider only a subset of trees)
Comparison
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Neighbor joining:
 Uses only pairwise distances
 Minimizes the distance between nearest neighbors
 Very fast
 Easily trapped in local optima
 Good for generating a tentative tree, or for choosing among multiple trees

Maximum parsimony:
 Uses only shared derived characters
 Minimizes the total distance
 Slow
 Assumptions fail when evolution is rapid
 Good for very small data sets and for testing trees built using other methods

Maximum likelihood:
 Uses all data
 Maximizes the tree likelihood given specific parameter values
 Very slow
 Highly dependent on the assumed evolution model
 Best option when tractable (<30 taxa)
Conclusions
 Computing phylogeny is an area of active research
 Hundreds of algorithms.
 New models: phylogenetic networks (generalize trees)
 New challenges: whole genome phylogeny
 Account for multi-site changes: replication, transpositions…
 New algorithms
 Applications
 Epidemiology
 Cancer diagnosis
 ….