Building phylogenetic trees
Jurgen Mourik &
Richard Vogelaars
Utrecht University
Overview
• Background
• Making a tree from pairwise distances;
• Parsimony;
– <break>;
• Assessing the trees: the bootstrap;
• Simultaneous alignment and phylogeny;
• Application: Phylip
Background
• Phylogenetic tree: diagram showing evolutionary
lineages of species/genes
• Trees are used:
– To understand lineage of various species
– To understand how various functions evolved
– To inform multiple alignments
Phylogenetic tree approaches
• Distance:
– UPGMA
– Neighbour-joining
• Parsimony:
– Traditional parsimony
– Weighted parsimony
Making a tree from pairwise
distances
• Given a set of sequences you want to build a
tree.
• Compute the distances dij between each pair i, j
of the sequences.
• There are many different distance measures.
• One such measure between clusters is the average distance between pairs of
sequences from each cluster.
UPGMA
• Unweighted Pair Group Method using arithmetic
Averages.
• It works by clustering the sequences, at each
stage combining two clusters and at the same
time creating a new node in a tree, using a
distance measure.
Distance between points
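• The distance d_ij between two clusters Ci and Cj is the average distance
between pairs of sequences, one from each cluster:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
• |Ci| and |Cj| denote the number of sequences in clusters i and j.
• Example for two single-sequence clusters i and l:
$$d_{il} = \frac{1}{1 \cdot 1}\,(d_{il}) = 4$$
[Figure: clusters i, j and l with their pairwise distances marked]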
Distance between clusters
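• Let Ck be the union of clusters Ci and Cj, then
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
• where Cl is any other cluster.
• Example with d_il = 4, d_jl = 3 and single-sequence clusters i and j:
$$d_{kl} = \frac{4 \cdot 1 + 3 \cdot 1}{1 + 1} = \frac{7}{2} = 3.5$$
[Figure: clusters i and j merged into k, with l another cluster]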
Building the tree: UPGMA
Initialisation:
Assign each sequence i to its own cluster Ci,
Define one leaf of T for each sequence, and place at height zero.
Iteration:
Determine the two clusters i, j for which dij is minimal.
Define a new cluster k by $C_k = C_i \cup C_j$, and define dkl for all l.
Define a node k with daughter nodes i and j, and place it at height dij /2.
Add k to the current clusters and remove i and j.
Termination:
When only two clusters i, j remain, place the root at height dij /2.
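The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the `frozenset`-keyed distance dictionary and the nested-tuple tree `(left, right, height)` are all choices made for this example. The loop also performs the final merge, which places the root at height dij/2 as in the termination step.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves with UPGMA.

    names: list of leaf labels (hypothetical input format).
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns a nested tuple (left, right, height); leaves are their labels.
    """
    # cluster id -> (subtree, number of sequences in the cluster)
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Determine the two clusters i, j for which d_ij is minimal.
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        dij = d[frozenset((i, j))]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        k = (i, j)  # id of the new cluster C_k = C_i u C_j
        # d_kl is the size-weighted average of d_il and d_jl.
        for l in clusters:
            d[frozenset((k, l))] = (
                d[frozenset((i, l))] * n_i + d[frozenset((j, l))] * n_j
            ) / (n_i + n_j)
        # The new node is placed at height d_ij / 2.
        clusters[k] = ((tree_i, tree_j, dij / 2), n_i + n_j)
    (tree, _), = clusters.values()
    return tree


# Distances d_il = 4 and d_jl = 3 as in the cluster-distance example;
# d_ij = 2 is an assumed value for illustration.
example = {
    frozenset(("i", "j")): 2.0,
    frozenset(("i", "l")): 4.0,
    frozenset(("j", "l")): 3.0,
}
tree = upgma(["i", "j", "l"], example)
```

Here i and j are merged first at height 1, the merged cluster is at distance 3.5 from l, and the root ends up at height 1.75.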
UPGMA: Initialisation
UPGMA: Iteration 1
UPGMA: Iteration 2
UPGMA: Iteration 3
UPGMA: Termination
[Figures: one slide per step of a worked UPGMA example]
Properties of UPGMA
• Molecular clock & ultrametric property of
distances
• Additivity
Properties of UPGMA:
Molecular clock & ultrametric
• The molecular clock assumption: divergence of
sequences is assumed to occur at the same rate
at all points in the tree.
• If this holds, then the data is said to be
ultrametric.
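Whether a distance matrix is ultrametric can be checked with the standard three-point condition: in every triple of leaves, the two largest of the three pairwise distances are equal. The sketch below assumes this condition as the test (the slide itself does not spell it out); the function name and the `frozenset`-keyed dictionary are illustrative choices.

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """Three-point condition: in each triple the two largest distances are equal."""
    for a, b, c in combinations(names, 3):
        _, middle, largest = sorted(
            (d[frozenset((a, b))], d[frozenset((a, c))], d[frozenset((b, c))])
        )
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical distances: a and b are close, c is equally far from both.
clock_like = {
    frozenset(("a", "b")): 2.0,
    frozenset(("a", "c")): 4.0,
    frozenset(("b", "c")): 4.0,
}
print(is_ultrametric(["a", "b", "c"], clock_like))
```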
Properties of UPGMA:
Additivity
• Given a tree, its edge lengths are said to be additive if the
distance between any pair of leaves is the sum of the lengths of
the edges on the path connecting them.
$$d_{im} = d_{ik} + d_{km}$$
$$d_{jm} = d_{jk} + d_{km}$$
$$d_{ij} = d_{ik} + d_{jk}$$
$$d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$$
[Figure: leaves i and j joined at internal node k, which connects to m]
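The expression for d_km follows from the three additive relations by substitution; the intermediate step, written out:

```latex
\begin{align*}
d_{im} + d_{jm} - d_{ij}
  &= (d_{ik} + d_{km}) + (d_{jk} + d_{km}) - (d_{ik} + d_{jk}) \\
  &= 2\,d_{km}
\end{align*}
```

Dividing by two gives the formula for d_km.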
Neighbour-joining
• Neighbour-joining constructs a tree by iteratively joining
subtrees (like UPGMA).
• Produces an unrooted tree.
• Doesn’t make the molecular clock assumption,
so the distances need not be ultrametric.
Distances in Neighbour-joining
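• Given a new internal node k, the distance to another node m is given by:
$$d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$$
• The lengths of the edges joining i and j to k are:
$$d_{ik} = \tfrac{1}{2}(d_{ij} + d_{im} - d_{jm})$$
$$d_{jk} = d_{ij} - d_{ik}$$
[Figure: leaves i and j joined at internal node k, which connects to m]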
Distances in Neighbour-joining
• Generalizing this so that the distances to all other leaves
are taken into account:
$$d_{ik} = \tfrac{1}{2}(d_{ij} + r_i - r_j)$$
• where
$$r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$$
• and |L| denotes the size of the set L of leaves.
[Figure: leaves i and j joined at internal node k, which connects to m]
Building the tree:
Neighbour-joining
Initialisation:
Define T to be the set of leaf nodes, one for each given
sequence, and put L=T.
Iteration:
Pick a pair i, j in L for which $D_{ij} = d_{ij} - (r_i + r_j)$ is minimal,
where $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$.
Define a new node k and set $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$, for all m in L.
Add k to T with edges of lengths $d_{ik} = \tfrac{1}{2}(d_{ij} + r_i - r_j)$, $d_{jk} = d_{ij} - d_{ik}$,
joining k to i and j, respectively.
Remove i and j from L and add k.
Termination:
When L consists of two leaves i and j, add the remaining edge
between i and j, with length dij.
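A minimal Python sketch of these steps follows. The function name, the `frozenset`-keyed distance dictionary, the edge-list output and the generated node names (`node1`, `node2`, ...) are assumptions made for this example; leaf labels are assumed not to clash with the generated names.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Return the edges (node, node, length) of the unrooted neighbour-joining tree.

    names: list of leaf labels; dist: dict keyed by frozenset({a, b}).
    """
    d = dict(dist)
    L = list(names)
    edges = []
    n_new = 0
    while len(L) > 2:
        # r_i: summed distance from i to all other nodes in L, divided by |L| - 2.
        r = {
            i: sum(d[frozenset((i, m))] for m in L if m != i) / (len(L) - 2)
            for i in L
        }
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(
            combinations(L, 2),
            key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]),
        )
        dij = d[frozenset((i, j))]
        n_new += 1
        k = f"node{n_new}"  # hypothetical name for the new internal node
        for m in L:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij
                )
        dik = 0.5 * (dij + r[i] - r[j])
        edges += [(i, k, dik), (j, k, dij - dik)]
        L = [m for m in L if m not in (i, j)] + [k]
    # Termination: join the last two nodes with the remaining edge.
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges
```

Because the input need not be ultrametric, the result is an edge list rather than a rooted tree with node heights.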
Rooting trees
• Finding a root in an unrooted tree is sometimes
accomplished by using an outgroup:
– A species known to be more distantly related to the remaining
species than they are to each other
• The point where the outgroup joins the rest of the
tree is the best candidate for the root position
[Figure: unrooted tree with nodes i, j, k, l and m; the outgroup and the candidate root are marked]
Comments on distance based
methods
• If the given data is ultrametric (and these
distances represent real distances), then UPGMA
will identify the correct tree.
• If the data is additive (and these distances
represent real distances), then Neighbour-joining
will identify the correct tree.
• Otherwise, the methods may not recover the
correct tree, but they may still be reasonable
heuristics.
Phylogenetic tree approaches
• Distance:
– UPGMA
– Neighbour-joining
• Parsimony:
– Traditional parsimony
– Weighted parsimony
Parsimony
• Probably the most widely used tree-building method.
• Finds the tree that explains the data with a
minimal number of changes.
• Instead of building a tree, it assigns a cost to a
given tree.
• Two components of the parsimony algorithm can
be distinguished:
– The computation of a cost for a given tree;
– A search through all trees, to find the overall
minimum of this cost.
Parsimony example
• Given the following sequences: AAG,AAA,GGA,AGA.
• Several trees could explain the phylogeny
Traditional Parsimony
• Count the number of substitutions
• At each node keep:
– a list of minimal cost residues
– the current cost
• Post-order traversal of the tree
Traditional Parsimony
Initialisation:
Set current cost C=0 and k=2n-1, the number of the root
node.
Recursion: To obtain the set Rk:
If k is a leaf node:
Set $R_k = \{x^k_u\}$, the residue at site u of the sequence at leaf k.
If k is not a leaf node:
Compute Ri, Rj for the daughter nodes i, j of k, and
set $R_k = R_i \cap R_j$ if this intersection is not empty, or else
set $R_k = R_i \cup R_j$ and increment C.
Termination:
Minimal cost of tree = C.
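The recursion can be sketched in Python as a post-order traversal that returns the set Rk together with the cost accumulated below node k. The nested-tuple tree representation (leaves are the sequences themselves) and the function names are assumptions of this example.

```python
def fitch(tree, site):
    """Return (R_k, cost) for one alignment column of a rooted binary tree."""
    if isinstance(tree, str):            # leaf: R_k is the single residue
        return {tree[site]}, 0
    (r_i, c_i), (r_j, c_j) = fitch(tree[0], site), fitch(tree[1], site)
    common = r_i & r_j
    if common:                           # non-empty intersection: no new change
        return common, c_i + c_j
    return r_i | r_j, c_i + c_j + 1      # empty: take the union, increment cost


def parsimony_cost(tree, n_sites):
    """Sum the per-site costs over all columns."""
    return sum(fitch(tree, u)[1] for u in range(n_sites))


# Two of the possible trees for the sequences AAG, AAA, GGA, AGA.
print(parsimony_cost((("AAG", "AAA"), ("GGA", "AGA")), 3))  # 3
print(parsimony_cost((("AAG", "GGA"), ("AAA", "AGA")), 3))  # 4
```

The first tree needs one substitution per column, while the second needs two changes in the middle column.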
Weighted Parsimony
• Extension of the traditional parsimony.
• Adds a cost function S(a,b) for each substitution
of a by b.
• Post-order traversal of the tree
• Aim is now to minimize the cost.
Weighted Parsimony
Initialisation:
Set k=2n-1, the number of the root node.
Recursion: Compute Sk(a) for all a as follows:
If k is a leaf node:
Set $S_k(a) = 0$ for $a = x^k_u$, $S_k(a) = \infty$ otherwise.
If k is not a leaf node:
Compute Si(a), Sj(a) for all a at the daughter nodes i, j,
and define $S_k(a) = \min_b (S_i(b) + S(a,b)) + \min_b (S_j(b) + S(a,b))$.
Termination:
Minimal cost of tree = $\min_a S_{2n-1}(a)$.
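A hedged Python sketch of this recursion for a single column is shown below. The tree format (nested tuples with sequences at the leaves), the function names and the example cost function `unit_cost` are assumptions for illustration; with a unit cost for every substitution the result should coincide with traditional parsimony.

```python
import math


def sankoff(tree, site, alphabet, S):
    """Return {a: S_k(a)} for the subtree rooted at this node, for one column."""
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed residue, infinity otherwise.
        return {a: 0 if a == tree[site] else math.inf for a in alphabet}
    s_i, s_j = (sankoff(child, site, alphabet, S) for child in tree)
    return {
        a: min(s_i[b] + S(a, b) for b in alphabet)
        + min(s_j[b] + S(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_cost(tree, n_sites, alphabet, S):
    """Minimal cost of the tree: sum over columns of min_a S_root(a)."""
    return sum(min(sankoff(tree, u, alphabet, S).values()) for u in range(n_sites))


def unit_cost(a, b):
    """Example substitution cost: 0 for a match, 1 for any change."""
    return 0 if a == b else 1


print(weighted_cost((("AAG", "AAA"), ("GGA", "AGA")), 3, "ACGT", unit_cost))  # 3
```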
Break
• Questions so far?
• After the break:
– Assessing the trees: the bootstrap;
– Simultaneous alignment and phylogeny;
– Application: Phylip
Branch and bound
• Parsimony itself cannot build a tree!
• Using simple enumeration methods, the number
of trees becomes very large very fast.
• How to build the trees?
– Stochastically
– Branch and bound
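To give a feel for how fast the number of trees grows: there are (2n-5)!! unrooted and (2n-3)!! rooted binary trees on n labelled leaves. The short sketch below evaluates these counts; the function names are chosen for this example.

```python
def odd_double_factorial(m):
    """Product of the odd numbers 3, 5, ..., m (equals 1 when m < 3)."""
    result = 1
    for factor in range(3, m + 1, 2):
        result *= factor
    return result


def n_unrooted_trees(n_leaves):
    return odd_double_factorial(2 * n_leaves - 5)


def n_rooted_trees(n_leaves):
    return odd_double_factorial(2 * n_leaves - 3)


for n in (4, 10, 20):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

For 10 leaves this already gives 2,027,025 unrooted and 34,459,425 rooted trees.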
Branch and bound
• B&B uses the parsimony algorithm.
• It is guaranteed to find the overall best tree.
• It systematically builds trees by increasing the
number of leaves.
• Adding a leaf can never lower the cost, so it abandons a
particular avenue of tree building whenever the current
incomplete tree (T*) has cost(T*) > cost(Tmin), where Tmin
is the best complete tree found so far.
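The idea can be sketched in Python for small inputs. This is an illustrative toy, not how PHYLIP implements it: trees are nested tuples with sequences at the leaves, each new leaf is attached to every edge of the current tree (in a rooted representation, so some unrooted topologies are visited more than once), and the cost is the traditional parsimony count. All names are assumptions of this example.

```python
def fitch(tree, site):
    """Traditional parsimony for one column: returns (residue set, cost)."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    (r_i, c_i), (r_j, c_j) = fitch(tree[0], site), fitch(tree[1], site)
    if r_i & r_j:
        return r_i & r_j, c_i + c_j
    return r_i | r_j, c_i + c_j + 1


def insertions(tree, leaf):
    """Yield every tree obtained by attaching leaf to one edge of tree."""
    yield (tree, leaf)                      # attach above the current root
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def branch_and_bound(seqs):
    """Return (minimal cost, one tree achieving it) for two or more sequences."""
    n_sites = len(seqs[0])
    best = {"cost": float("inf"), "tree": None}

    def cost(tree):
        return sum(fitch(tree, u)[1] for u in range(n_sites))

    def extend(tree, remaining):
        c = cost(tree)
        if c > best["cost"]:                # bound: abandon this avenue
            return
        if not remaining:
            if c < best["cost"]:
                best["cost"], best["tree"] = c, tree
            return
        for t in insertions(tree, remaining[0]):
            extend(t, remaining[1:])

    extend((seqs[0], seqs[1]), list(seqs[2:]))
    return best["cost"], best["tree"]


print(branch_and_bound(["AAG", "AAA", "GGA", "AGA"])[0])  # 3
```

Pruning is safe because removing a leaf from a tree can never increase its parsimony cost, so an incomplete tree that is already too expensive cannot lead to a better complete tree.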
The Bootstrap
• A measure of how much a tree should be trusted.
• Use the bootstrap as a method of assessing the
significance of some phylogenetic feature.
The Bootstrap (2)
• The bootstrap works as follows:
– Given a dataset of an alignment of sequences.
– Generate an artificial dataset of the same size as the original
dataset by picking columns from the alignment at random
with replacement.
– Apply the tree building algorithm to this artificial dataset.
– Repeat selection and tree building procedure n times.
– The frequency with which a chosen phylogenetic feature
appears is taken to be a measure of the confidence we can
have in this feature.
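The resampling loop can be sketched as follows. The tree-building method and the feature test are passed in as functions; `build_tree` and `has_feature` are hypothetical callables supplied by the user, and the function name and `seed` parameter are likewise choices made for this example.

```python
import random


def bootstrap_support(alignment, build_tree, has_feature, n=100, seed=0):
    """Fraction of bootstrap replicates whose tree shows the chosen feature.

    alignment:   list of equal-length aligned sequences (strings).
    build_tree:  hypothetical function mapping an alignment to a tree.
    has_feature: hypothetical function mapping a tree to True or False.
    """
    rng = random.Random(seed)
    n_cols = len(alignment[0])
    hits = 0
    for _ in range(n):
        # Pick columns at random with replacement, same number as the original.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        replicate = ["".join(seq[c] for c in cols) for seq in alignment]
        if has_feature(build_tree(replicate)):
            hits += 1
    return hits / n
```

Whole columns are resampled, so every sequence in a replicate is built from the same column choices and the alignment stays consistent.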
Simultaneous alignment and
phylogeny
• Simultaneously aligning sequences and finding a
plausible phylogeny:
– Sankoff & Cedergren’s gap-substitution algorithm;
– Hein’s affine cost algorithm.
• Both find an optimal alignment given a tree.
Sankoff & Cedergren’s gap-substitution algorithm
• Guaranteed to find ancestral sequences, and
alignments of them and the leaf sequences, which together
minimize a tree-based parsimony-type cost.
• It uses a character-substitution model of gaps.
• The algorithm is a combination of two known
methods:
– Dynamic programming method (Chapter 6);
– Weighted Parsimony algorithm.
Hein’s affine cost algorithm
• It uses affine gap penalties.
• Faster than the Sankoff & Cedergren algorithm.
• The aim is to find sequences z at a given node
aligned to both of the sequences x and y at the
daughter nodes satisfying:
$$S(x, z) + S(z, y) = S(x, y)$$
• where S is the total cost for a given alignment of
two sequences (each mismatch costs 1, each match 0).
Hein’s affine cost algorithm
• Compared to equation (2.16) (alignment with
affine gap scores), here the algorithm searches
for the minimal cost path.
• The affine gap cost for a gap of length k is
d+(k-1)e, where e<=d.
$$V^M(i,j) = \min\{V^M(i-1,j-1),\; V^X(i-1,j-1),\; V^Y(i-1,j-1)\} + S(x_i, y_j)$$
$$V^X(i,j) = \min\{V^M(i-1,j) + d,\; V^X(i-1,j) + e\}$$
$$V^Y(i,j) = \min\{V^M(i,j-1) + d,\; V^Y(i,j-1) + e\}$$
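The three recurrences translate into a small dynamic program. The sketch below computes only the minimal cost S(x, y), not the alignment or the sequence graph; the function name, the default values d = 2 and e = 1, and the boundary initialisation (a leading gap of length k costs d + (k-1)e) are assumptions of this example.

```python
import math


def affine_cost(x, y, d=2, e=1, mismatch=1):
    """Minimal alignment cost of x and y with gap cost d + (k - 1) * e."""
    n, m = len(x), len(y)
    inf = math.inf
    VM = [[inf] * (m + 1) for _ in range(n + 1)]  # x_i aligned to y_j
    VX = [[inf] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    VY = [[inf] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    VM[0][0] = 0
    for i in range(1, n + 1):
        VX[i][0] = d + (i - 1) * e
    for j in range(1, m + 1):
        VY[0][j] = d + (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if x[i - 1] == y[j - 1] else mismatch
            VM[i][j] = min(VM[i - 1][j - 1], VX[i - 1][j - 1], VY[i - 1][j - 1]) + s
            VX[i][j] = min(VM[i - 1][j] + d, VX[i - 1][j] + e)
            VY[i][j] = min(VM[i][j - 1] + d, VY[i][j - 1] + e)
    return min(VM[n][m], VX[n][m], VY[n][m])


print(affine_cost("CAC", "CTCACA"))  # 5, i.e. d + 2e + 1 with d = 2, e = 1
```

With these parameters a candidate ancestor z of x = CAC and y = CTCACA can be tested by checking whether `affine_cost(x, z) + affine_cost(z, y)` equals `affine_cost(x, y)`.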
Dynamic programming matrix
for two sequences
[Figure: dynamic programming matrix for two sequences; each cell (i, j) holds the values VM, VX and VY; d=2, e=1]
Hein’s affine cost algorithm
• Find the z for which S(x, z) + S(z, y) = S(x, y), so that the
total cost is minimal.
• From the matrix follows:
– C--AC-
– CAC---
• CAC could be a possible z.
[Figure: tree with the candidate ancestor CAC(?)]
Hein’s affine cost algorithm
• Which z could serve best as ancestor of the leaves CAC and CTCACA?
– CAC(?)
– CACACA(?)
– CACAC(?)
[Figure: three copies of the tree with leaves CAC and CTCACA, one per candidate ancestor]
Hein’s affine cost algorithm
• Costs for the candidate ancestors z of x = CAC and y = CTCACA:
– z = CAC: $S(\mathrm{CAC}, \mathrm{CAC}) = 0$ and $S(\mathrm{CAC}, \mathrm{CTCACA}) = d + 2e + 1$, total $d + 2e + 1$
– z = CACACA: $S(\mathrm{CACACA}, \mathrm{CAC}) = d + 2e$ and $S(\mathrm{CACACA}, \mathrm{CTCACA}) = 1$, total $d + 2e + 1$
– z = CACAC: $S(\mathrm{CACAC}, \mathrm{CAC}) = d + e$ and $S(\mathrm{CACAC}, \mathrm{CTCACA}) = d + 1$, total $2d + e + 1$
Sequence graph
• Follow a path through the dynamic programming
matrix.
• Derive a graph from this matrix.
• Whenever a cell is used by an optimal path a
vertex is added to the graph.
Sequence graph
[Figure: Graph 1]
Sequence graph:
line arrangement
[Figure: Graph 1 and Graph 2]
Sequence graph:
replacing the dummy edges
[Figure: Graph 2 and Graph 3]
Dynamic Programming matrix:
TAC – Graph 3
[Figure: dynamic programming matrix for TAC against Graph 3]
Ancestors
• Possible ancestral sequences for the leaf
sequences TAC, CAC and CTCACA given the tree
shown.
• Derived from the sequence graphs.
[Figure: tree with leaves TAC, CAC and CTCACA, internal nodes labelled CAC, and the values 1 and 5 marked]
Limitations of Hein’s model
• Hein’s algorithm takes the minimal cost
sequences at each node upward.
• This can fail to give the overall optimum.
• Suppose the cost for a gap of length k is:
– 13+3(k-1)
• and the cost of a mismatch is:
– 4
• Suppose the leaves are G and GTT.
Limitations of Hein’s model
• The eligible ancestors of G and GTT would be G and GTT
themselves, since each gives a total cost of
13+3=16.
• GT would not be eligible because of its total cost
of 2*13=26.
• Now suppose the tree continues upward from the ancestor of G and
GTT and there is a third leaf, GT.
– The total cost for the ineligible GT would be lower than
for either G or GTT.
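The arithmetic behind this comparison, written out under the stated costs (gap of length k costs 13+3(k-1)). The notation g(k) for the gap cost and z for the ancestor of G and GTT is introduced here for illustration, and the third leaf GT is assumed to attach at cost S(z, GT):

```latex
\begin{align*}
g(1) &= 13, \qquad g(2) = 13 + 3 = 16 \\
z = \mathrm{G}:   &\quad S(\mathrm{G},\mathrm{G}) + S(\mathrm{G},\mathrm{GTT}) + S(\mathrm{G},\mathrm{GT}) = 0 + 16 + 13 = 29 \\
z = \mathrm{GTT}: &\quad S(\mathrm{GTT},\mathrm{G}) + S(\mathrm{GTT},\mathrm{GTT}) + S(\mathrm{GTT},\mathrm{GT}) = 16 + 0 + 13 = 29 \\
z = \mathrm{GT}:  &\quad S(\mathrm{GT},\mathrm{G}) + S(\mathrm{GT},\mathrm{GTT}) + S(\mathrm{GT},\mathrm{GT}) = 13 + 13 + 0 = 26
\end{align*}
```

So the locally worse choice GT (26 against 16 for the pair alone) gives the lower total once the third leaf is included.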
Application: PHYLIP
(Phylogeny Inference Package)
• Many features, among them:
– Traditional (unrooted) parsimony
– Branch and bound to find all most parsimonious
trees
Application: PHYLIP
• Test dataset:
Jurgen    AACGUGGCCAAAU
Alpha     ACCGCCGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUCUCGGCCU
Epsilon   GAAAUCUCGAUCC
Richard   GGGCUCUCGGCUC
Demo
Questions?