Building phylogenetic trees
Jurgen Mourik & Richard Vogelaars
Utrecht University

Overview
• Background
• Making a tree from pairwise distances;
• Parsimony;
  – <break>;
• Assessing the trees: the bootstrap;
• Simultaneous alignment and phylogeny;
• Application: Phylip

Background
• Phylogenetic tree: diagram showing evolutionary lineages of species/genes
• Trees are used:
  – To understand lineage of various species
  – To understand how various functions evolved
  – To inform multiple alignments

Phylogenetic tree approaches
• Distance:
  – UPGMA
  – Neighbour-joining
• Parsimony:
  – Traditional parsimony
  – Weighted parsimony

Making a tree from pairwise distances
• Given a set of sequences you want to build a tree.
• Compute the distances $d_{ij}$ between each pair $i, j$ of the sequences.
• There are many different distance measures.
• Average distance between pairs of sequences from each cluster.

UPGMA
• Unweighted Pair Group Method using arithmetic Averages.
• It works by clustering the sequences, at each stage combining two clusters and at the same time creating a new node in a tree, using a distance measure.

Distance between points
  $d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$
• $|C_i|$ and $|C_j|$ denote the number of sequences in clusters $i$ and $j$.
• Example: $d_{il} = \frac{1}{1 \cdot 1}(d_{il}) = 4$
[Figure: points i, j and l with pairwise distances labelled 2, 3 and 4]

Distance between clusters
• Let $C_k$ be the union of clusters $C_i$ and $C_j$, then
  $d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$
• Where $C_l$ is any other cluster.
• Example: $d_{kl} = \frac{4 \cdot 1 + 3 \cdot 1}{1 + 1} = \frac{7}{2} = 3.5$
[Figure: points i, j and l, with i and j merged into cluster k]

Building the tree: UPGMA
Initialisation:
  Assign each sequence $i$ to its own cluster $C_i$.
  Define one leaf of $T$ for each sequence, and place at height zero.
Iteration:
  Determine the two clusters $i, j$ for which $d_{ij}$ is minimal.
  Define a new cluster $k$ by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all $l$.
  Define a node $k$ with daughter nodes $i$ and $j$, and place it at height $d_{ij}/2$.
  Add $k$ to the current clusters and remove $i$ and $j$.
Termination:
  When only two clusters $i, j$ remain, place the root at height $d_{ij}/2$.

[Figures: UPGMA worked example on five slides: Initialisation, Iteration 1, Iteration 2, Iteration 3, Termination]

Properties of UPGMA
• Molecular clock & ultrametric property of distances
• Additivity

Properties of UPGMA: Molecular clock & ultrametric
• The molecular clock assumption: divergence of sequences is assumed to occur at the same rate at all points in the tree.
• If this holds, the data are said to be ultrametric.

Properties of UPGMA: Additivity
• Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them.
  $d_{im} = d_{ik} + d_{km}$
  $d_{jm} = d_{jk} + d_{km}$
  $d_{ij} = d_{ik} + d_{jk}$
  $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$
[Figure: leaves i, j and m joined at internal node k]

Neighbour-joining
• N-j constructs a tree by iteratively joining subtrees (like UPGMA).
• Produces an unrooted tree.
• Does not make the molecular clock assumption, so the distances need not be ultrametric.

Distances in Neighbour-joining
• Given a new internal node $k$, the distance to another node $m$ is given by:
  $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$
  $d_{ik} = \frac{1}{2}(d_{ij} + d_{im} - d_{jm})$
  $d_{jk} = d_{ij} - d_{ik}$
[Figure: leaves i, j and m joined at internal node k]

Distances in Neighbour-joining
• Generalizing this so that the distances to all other leaves are taken into account:
  $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$
• Where
  $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$
• And $|L|$ denotes the size of the set $L$ of leaves.
[Figure: leaves i, j and m joined at internal node k]
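Looking back at the UPGMA procedure (initialisation, iteration, termination), the steps can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `upgma`, the nested-tuple tree representation and the toy distances are assumptions made for illustration only.

```python
from itertools import combinations

def upgma(names, d):
    """Cluster leaves with UPGMA.

    names: leaf labels; d: dict mapping frozenset({a, b}) to the distance d_ab.
    Returns (root, heights): root is a nested tuple of leaf labels and
    heights maps every node (leaf label or nested tuple) to its height.
    """
    d = dict(d)                         # work on a copy of the distances
    size = {n: 1 for n in names}        # |C_i| for every current cluster
    heights = {n: 0.0 for n in names}   # leaves are placed at height zero
    while len(size) > 1:
        # Pick the two current clusters whose distance is minimal.
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)                      # new node with daughter nodes i and j
        heights[k] = d[frozenset((i, j))] / 2
        for l in size:
            if l not in (i, j):
                # Size-weighted average of the distances from i and j to l.
                d[frozenset((k, l))] = (
                    d[frozenset((i, l))] * size[i]
                    + d[frozenset((j, l))] * size[j]
                ) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        del size[i], size[j]
    (root,) = size                      # the last merge created the root
    return root, heights

# Hypothetical ultrametric toy data (not from the slides).
pairs = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 6,
         ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6}
dist = {frozenset(p): v for p, v in pairs.items()}
tree, heights = upgma(["A", "B", "C", "D"], dist)
print(tree, heights[tree])
```

The final merge of the two remaining clusters plays the role of the termination step, so the root ends up at half the distance between them.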
Building the tree: Neighbour-joining
Initialisation:
  Define $T$ to be the set of leaf nodes, one for each given sequence, and put $L = T$.
Iteration:
  Pick a pair $i, j$ in $L$ for which $D_{ij}$, defined by $D_{ij} = d_{ij} - (r_i + r_j)$, is minimal, where $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$.
  Define a new node $k$ and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all $m$ in $L$.
  Add $k$ to $T$ with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining $k$ to $i$ and $j$, respectively.
  Remove $i$ and $j$ from $L$ and add $k$.
Termination:
  When $L$ consists of two leaves $i$ and $j$, add the remaining edge between $i$ and $j$, with length $d_{ij}$.

Rooting trees
• Finding a root in an unrooted tree is sometimes accomplished by using an outgroup:
  – A species known to be more distantly related to the remaining species than they are to each other
• The point where the outgroup joins the rest of the tree is the best candidate for the root position.
[Figure: unrooted tree with nodes i, j, k, l, m; the outgroup and the candidate root are marked]

Comments on distance based methods
• If the given data is ultrametric (and these distances represent real distances), then UPGMA will identify the correct tree.
• If the data is additive (and these distances represent real distances), then Neighbour-joining will identify the correct tree.
• Otherwise, the methods may not recover the correct tree, but they may still be reasonable heuristics.

Phylogenetic tree approaches
• Distance:
  – UPGMA
  – Neighbour-joining
• Parsimony:
  – Traditional parsimony
  – Weighted parsimony

Parsimony
• Most widely used tree building algorithm(?).
• Finds the tree that explains the data with a minimal number of changes.
• Instead of building a tree, it assigns a cost to a given tree.
• Two components of the parsimony algorithm can be distinguished:
  – The computation of a cost for a given tree;
  – A search through all trees, to find the overall minimum of this cost.
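Before turning to the parsimony details, the neighbour-joining procedure described earlier (initialisation, iteration, termination) can be sketched as follows. This is a sketch under stated assumptions: the function name `neighbour_joining`, the internal node labels `node0`, `node1`, ... and the toy distances are hypothetical, chosen only to illustrate the formulas for $r_i$, $D_{ij}$, $d_{ik}$ and $d_{km}$.

```python
from itertools import combinations

def neighbour_joining(names, d):
    """Return the edges (node, node, length) of a neighbour-joining tree.

    names: leaf labels; d: dict mapping frozenset({a, b}) to the distance d_ab.
    """
    d = dict(d)                  # work on a copy of the distances
    L = list(names)              # current set of leaves
    edges = []
    counter = 0                  # used to name new internal nodes (hypothetical labels)
    while len(L) > 2:
        # r_i: sum of distances from i to all other leaves, divided by |L| - 2.
        r = {i: sum(d[frozenset((i, m))] for m in L if m != i) / (len(L) - 2)
             for i in L}
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]))
        d_ij = d[frozenset((i, j))]
        k = f"node{counter}"
        counter += 1
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Distances from the new node k to every remaining leaf m.
        for m in L:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L                     # termination: join the last two nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges

# Hypothetical additive toy data generated from a tree with edges
# A-X = 1, B-X = 3, X-Y = 1, C-Y = 1, D-Y = 3 (not from the slides).
pairs = {("A", "B"): 4, ("A", "C"): 3, ("A", "D"): 5,
         ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 4}
dist = {frozenset(p): v for p, v in pairs.items()}
nj_edges = neighbour_joining(["A", "B", "C", "D"], dist)
for edge in nj_edges:
    print(edge)
```

In this toy data the smallest raw distance is $d_{AC} = 3$, although A and C are not neighbours in the generating tree; the correction by $r_i + r_j$ is what lets the sketch join A with B and C with D.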
Parsimony example
• Given the following sequences: AAG, AAA, GGA, AGA.
• Several trees could explain the phylogeny.

Traditional Parsimony
• Count the number of substitutions.
• At each node keep:
  – a list of minimal cost residues
  – the current cost
• Post-order traversal of the tree

Traditional Parsimony
Initialisation:
  Set current cost $C = 0$ and $k = 2n - 1$, the number of the root node.
Recursion: To obtain the set $R_k$:
  If $k$ is a leaf node:
    Set $R_k = \{x_u^k\}$ (the residue at site $u$ of leaf sequence $k$).
  If $k$ is not a leaf node:
    Compute $R_i$, $R_j$ for the daughter nodes $i, j$ of $k$, and set $R_k = R_i \cap R_j$ if this intersection is not empty, or else set $R_k = R_i \cup R_j$ and increment $C$.
Termination:
  Minimal cost of tree = $C$.

Weighted Parsimony
• Extension of traditional parsimony.
• Adds a cost function $S(a, b)$ for each substitution of $a$ by $b$.
• Post-order traversal of the tree
• The aim is now to minimize the total cost.

Weighted Parsimony
Initialisation:
  Set $k = 2n - 1$, the number of the root node.
Recursion: Compute $S_k(a)$ for all $a$ as follows:
  If $k$ is a leaf node:
    Set $S_k(a) = 0$ for $a = x_u^k$, $S_k(a) = \infty$ otherwise.
  If $k$ is not a leaf node:
    Compute $S_i(a)$, $S_j(a)$ for all $a$ at the daughter nodes $i, j$ and define
    $S_k(a) = \min_b \big(S_i(b) + S(a, b)\big) + \min_b \big(S_j(b) + S(a, b)\big)$
Termination:
  Minimal cost of tree = $\min_a S_{2n-1}(a)$.

Break
• Questions so far?
• After the break:
  – Assessing the trees: the bootstrap;
  – Simultaneous alignment and phylogeny;
  – Application: Phylip

Branch and bound
• Parsimony by itself cannot build a tree!
• With simple enumeration methods the number of trees becomes very large very fast.
• How to build the trees?
  – Stochastically
  – Branch and bound

Branch and bound
• B&B uses the parsimony algorithm.
• It is guaranteed to find the overall best tree.
• It systematically builds trees by increasing the number of leaves.
• It abandons a particular avenue of tree building whenever the current incomplete tree $T^*$ has cost($T^*$) > cost($T_{min}$), the cost of the best complete tree found so far.

The Bootstrap
• A measure of how much a tree should be trusted.
• Use the bootstrap as a method of assessing the significance of some phylogenetic feature.

The Bootstrap (2)
• The bootstrap works as follows:
  – Start from a dataset consisting of an alignment of sequences.
  – Generate an artificial dataset of the same size as the original dataset by picking columns from the alignment at random with replacement.
  – Apply the tree building algorithm to this artificial dataset.
  – Repeat the selection and tree building procedure n times.
  – The frequency with which a chosen phylogenetic feature appears is taken to be a measure of the confidence we can have in this feature.

Simultaneous alignment and phylogeny
• Simultaneously aligning sequences and finding a plausible phylogeny:
  – Sankoff & Cedergren’s gap-substitution algorithm;
  – Hein’s affine cost algorithm.
• Both find an optimal alignment given a tree.

Sankoff & Cedergren’s gap-substitution algorithm
• Is guaranteed to find ancestral sequences, together with alignments of them and the leaf sequences.
• It uses a character-substitution model of gaps.
• Together these minimize a tree-based parsimony-type cost.
• The algorithm is a combination of two known methods:
  – Dynamic programming method (Chapter 6);
  – Weighted Parsimony algorithm.

Hein’s affine cost algorithm
• It uses affine gap penalties.
• Faster than the Sankoff & Cedergren algorithm.
• The aim is to find sequences $z$ at a given node, aligned to both of the sequences $x$ and $y$ at the daughter nodes, satisfying:
  $S(x, z) + S(z, y) = S(x, y)$
• Where $S$ is the total cost for a given alignment of two sequences.
  (mismatch cost = 1, and 0 otherwise)

Hein’s affine cost algorithm
• Compared to equation (2.16) (alignment with affine gap scores), here the algorithm searches for the minimal cost path.
• The affine gap cost for a gap of length $k$ is $d + (k - 1)e$, where $e \le d$.
  $V^M(i, j) = \min\{\, V^M(i-1, j-1) + S(x_i, y_j),\; V^X(i-1, j-1) + S(x_i, y_j),\; V^Y(i-1, j-1) + S(x_i, y_j) \,\}$
  $V^X(i, j) = \min\{\, V^M(i-1, j) + d,\; V^X(i-1, j) + e \,\}$
  $V^Y(i, j) = \min\{\, V^M(i, j-1) + d,\; V^Y(i, j-1) + e \,\}$

Dynamic programming matrix for two sequences
[Figure: dynamic programming matrix indexed by i and j, with the three values $V^M$, $V^X$, $V^Y$ per cell; d = 2, e = 1]

Hein’s affine cost algorithm
• Find the $z$ for which $S(x, z) + S(z, y)$ is minimal, that is, equal to $S(x, y)$.
• From the matrix follows:
  – C--AC-
  – CAC---
• CAC could be a possible $z$.
[Figure: node CAC(?) with daughter nodes CAC and CTCACA]

Hein’s affine cost algorithm
• Which $z$ could serve best as ancestor?
[Figure: three candidate ancestors CAC(?), CACACA(?) and CACAC(?), each with daughter nodes CAC and CTCACA]

Hein’s affine cost algorithm
  Candidate z   S(z, CAC)   S(z, CTCACA)   Sum
  CAC           0           d + 2e + 1     d + 2e + 1
  CACACA        d + 2e      1              d + 2e + 1
  CACAC         d + e       d + 1          2d + e + 1
• With d = 2 and e = 1 the sums are 5, 5 and 6.

Sequence graph
• Follow a path through the dynamic programming matrix.
• Derive a graph from this matrix.
• Whenever a cell is used by an optimal path a vertex is added to the graph.

Sequence graph
[Figure: Graph 1]

Sequence graph: line arrangement
[Figure: Graph 1 and Graph 2]

Sequence graph: replacing the dummy edges
[Figure: Graph 2 and Graph 3]

Dynamic Programming matrix: TAC – Graph 3
[Figure: dynamic programming matrix for TAC against Graph 3]

Ancestors
• Possible ancestral sequences for the leaf sequences TAC, CAC and CTCACA given the tree shown.
• Derived from the sequence graphs.
[Figure: tree with leaves TAC, CAC and CTCACA and ancestral sequences CAC; the values 1 and 5 appear in the figure]
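The affine-gap recursion for $V^M$, $V^X$ and $V^Y$ given earlier for Hein's algorithm can be checked numerically against the candidate costs for CAC, CACACA and CACAC. The following is a minimal sketch, assuming d = 2, e = 1 and a mismatch cost of 1 as on the slides; the function name `affine_cost` and the boundary initialisation (a leading gap of length k costs d + (k - 1)e) are assumptions of this sketch rather than something stated in the deck.

```python
import math

def affine_cost(x, y, d=2, e=1, mismatch=1):
    """Minimal alignment cost S(x, y) with gap cost d + (k - 1) * e."""
    n, m = len(x), len(y)
    inf = math.inf
    # VM: x_i aligned to y_j; VX: x_i aligned to a gap; VY: y_j aligned to a gap.
    VM = [[inf] * (m + 1) for _ in range(n + 1)]
    VX = [[inf] * (m + 1) for _ in range(n + 1)]
    VY = [[inf] * (m + 1) for _ in range(n + 1)]
    VM[0][0] = 0
    for i in range(1, n + 1):            # leading gap opposite x[:i]
        VX[i][0] = d + (i - 1) * e
    for j in range(1, m + 1):            # leading gap opposite y[:j]
        VY[0][j] = d + (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if x[i - 1] == y[j - 1] else mismatch
            VM[i][j] = min(VM[i - 1][j - 1], VX[i - 1][j - 1], VY[i - 1][j - 1]) + s
            VX[i][j] = min(VM[i - 1][j] + d, VX[i - 1][j] + e)
            VY[i][j] = min(VM[i][j - 1] + d, VY[i][j - 1] + e)
    return min(VM[n][m], VX[n][m], VY[n][m])

# Summed cost S(z, CAC) + S(z, CTCACA) for each candidate ancestor z.
totals = {z: affine_cost(z, "CAC") + affine_cost(z, "CTCACA")
          for z in ("CAC", "CACACA", "CACAC")}
print(totals)
```

Under these assumptions the sums for CAC and CACACA equal S(CAC, CTCACA), while CACAC comes out one unit higher.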
Limitations of Hein’s model
• Hein’s algorithm carries the minimal cost sequences at each node upward.
• This can fail to give the overall optimum.
• Suppose the cost for a gap of length k is:
  – 13 + 3(k - 1)
• Mismatch:
  – 4
• Suppose the leaves are G and GTT.

Limitations of Hein’s model
• The eligible ancestors of G and GTT would be G and GTT themselves, since both give a total cost of 13 + 3 = 16.
• GT would not be eligible because its total cost is 2 * 13 = 26.
• Now we want to branch to the ancestor of G and GTT, and there is a third leaf GT.
  – The total cost for the ineligible GT would be lower than for either G or GTT.

Application: PHYLIP (Phylogeny Inference Package)
• Many features, among them:
  – Traditional (unrooted) parsimony
  – Branch and bound to find all most parsimonious trees

Application: PHYLIP
• Test dataset:
  Jurgen    AACGUGGCCAAAU
  Alpha     ACCGCCGCCAAAU
  Beta      AAGGUCGCCAAAC
  Gamma     CAUUUCGUCACAA
  Delta     GGUAUCUCGGCCU
  Epsilon   GAAAUCUCGAUCC
  Richard   GGGCUCUCGGCUC

Demo

Questions?
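As a check on the limitation example for Hein's model (gap cost 13 + 3(k - 1), leaves G and GTT, third leaf GT), the totals can be worked out explicitly. The tree shape assumed here is the one implied by the example: an ancestor $z$ of G and GTT that is then joined to GT. The first lines give the cost below $z$, the last lines the total once GT is attached.

```latex
\begin{align*}
z = G:   &\quad S(G, G) + S(G, GTT) = 0 + (13 + 3) = 16 \\
z = GTT: &\quad S(G, GTT) + S(GTT, GTT) = (13 + 3) + 0 = 16 \\
z = GT:  &\quad S(G, GT) + S(GT, GTT) = 13 + 13 = 26 \\[4pt]
z = G \text{ with leaf } GT:   &\quad 16 + S(G, GT) = 16 + 13 = 29 \\
z = GTT \text{ with leaf } GT: &\quad 16 + S(GTT, GT) = 16 + 13 = 29 \\
z = GT \text{ with leaf } GT:  &\quad 26 + S(GT, GT) = 26 + 0 = 26
\end{align*}
```

So the locally ineligible ancestor GT gives the lower overall cost (26 against 29), which is the failure the example describes.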