Agenda for today
• Phylogenic tree construction
  – Introduction to phylogenic trees
    ∗ Ultrametricality
    ∗ Additive distance
  – Distance-based approaches
  – Parsimony
• Phylogenic tree building and multi-sequence alignment

Phylogenic trees
• Rooted trees, tracing evolutionary divergence
• Without loss of generality, assume binary branching
• Strings at the leaves of trees
• Internal nodes labeled or unlabeled
  – Labels may hypothesize ancestor strings
• Nodes or links in tree may have scores
  – Scores may represent distance or time

Clade of Apes
[Figure: clade of apes]

Ultrametric trees
• Ultrametric trees have real numbers at the internal nodes
• Numbers at the nodes must strictly decrease along any path from the root to a leaf
• For our purposes, strings at the leaves of the tree
• Pairwise score between two leaves: the number at their lowest common ancestor
• Defines an ultrametric (symmetric) matrix of pairwise distances
  – Diagonals zero; off-diagonal positive
• Can construct a unique ultrametric tree efficiently from a given ultrametric matrix (see Gusfield)

Example ultrametric matrix/tree
      A  B  C  D  E
  A   0  9  9  6  4
  B      0  4  9  9
  C         0  9  9
  D            0  6
  E               0
Tree: the root (9) has two children: a node (4) over leaves B and C, and a node (6) over leaf D and a node (4) over leaves E and A
• Variant: min-ultrametric trees are strictly increasing down the tree
  – (just a sign change, same procedures apply)

Molecular clock theory
• Ultrametric trees generally involve time since divergence
• The molecular clock theory of Zuckerkandl and Pauling states:
  – For any given protein, accepted mutations in the amino acid sequence occur at a constant rate (from Gusfield)
  – Accepted means not impacting function
  – Hence number of changes is proportional to time
  – (rate varies depending on protein)
• Evidence of mutation can be based on sequence edits
• Most real data is not ultrametric: the assumptions are too strong

Additive-distance trees
• Relaxes some assumptions on the constancy of the rate
• Scores are labeled on links rather than internal nodes of the tree
• Distance between two leaves (strings) is the sum of scores on links between the leaves
• We can move away
  from straight phylogenic trees and allow strings to label internal nodes
  – “Compact” additive-distance trees introduce no additional nodes beyond leaves

Example additive-distance tree
      A  B  C  D  E
  A   0  9  9  6  4
  B      0  4  9  9
  C         0  9  9
  D            0  6
  E               0
Tree (scores on links): A, B, C and E each hang on a link of score 2 and D on a link of score 3; the parent of A and E joins the parent of D by a link of score 1; the root joins the parent of D and the parent of B and C by links of score 2 each
• Ultrametric matrices can be represented with additive-distance trees
• O(n^2) algorithms for building additive-distance trees from n×n matrices

Building phylogenic trees
• Given a set of sequences, how can we build a phylogenic tree from those sequences?
• Two main approaches
  – Distance-based approaches (minimize distance)
  – Parsimony (fewest required changes)
• Distance-based approaches are well suited to ultrametric and additive-distance trees
  – Many problems are not so well behaved
• Parsimony does not make such assumptions

Simple distance-based tree building
• Standard approach (around since the late 1950s) involves agglomerative pairwise distance-based clustering
• Unweighted pair group method using arithmetic averages, aka “UPGMA”
• Builds binary tree by iteratively merging two closest clusters
• Initialize with each string as a cluster unto itself
• Cluster distances are based on pairwise distances
  \[ d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y) \]

Efficient re-calculation of pairwise distances
• When two clusters C_i, C_j are merged into C_k, must calculate pairwise distances to other remaining clusters
• Efficient re-calculation possible:
  \[
  \begin{aligned}
  d(C_k, C_l) &= \frac{1}{|C_k|\,|C_l|} \sum_{x \in C_k} \sum_{y \in C_l} d(x, y) \\
              &= \frac{1}{(|C_i| + |C_j|)\,|C_l|} \Bigl( \sum_{x \in C_i} \sum_{y \in C_l} d(x, y) + \sum_{x \in C_j} \sum_{y \in C_l} d(x, y) \Bigr) \\
              &= \frac{|C_l|\,|C_i|\, d(C_i, C_l) + |C_l|\,|C_j|\, d(C_j, C_l)}{(|C_i| + |C_j|)\,|C_l|} \\
              &= \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}
  \end{aligned}
  \]

Clustering by distance
• If the distance matrix is ultrametric, UPGMA is the right approach
• If, however, the distance matrix is not ultrametric, but additive, then there are better clustering methods than UPGMA
• Why not join the closest as UPGMA suggests?
  – Because the two closest strings need not share a parent node in the tree
  – Additivity is a very different requirement from minimum distance

UPGMA clustering of additive-distance matrix
      A  B  C  D  E
  A   0  8  8  5  4
  B      0  2  5  6
  C         0  5  6
  D            0  3
  E               0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
  \[ d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|} \]
Tree so far: B and C joined (lowest score d(B, C) = 2)

UPGMA clustering of additive-distance matrix
       A   BC    D    E
  A    0    8    5    4
  BC        0    5    6
  D              0    3
  E                   0
• Same steps: find the lowest score, merge columns/rows, update distances
Tree so far: (B, C) and (D, E) joined (lowest score d(D, E) = 3)

UPGMA clustering of additive-distance matrix
       A   BC   DE
  A    0    8   4.5
  BC        0   5.5
  DE             0
• Same steps: find the lowest score, merge columns/rows, update distances
Tree so far: A joined with (D, E) (lowest score 4.5); (B, C) joins last

Additive matrix
      A  B  C  D  E
  A   0  8  8  5  4
  B      0  2  5  6
  C         0  5  6
  D            0  3
  E               0
Truth: A and E are siblings (links 3 and 1); their parent joins D (link 1) through a link of 1; B and C are siblings (links 1 and 1); the two subtrees are 3 apart through the root (links 2 and 1)
What UPGMA found: D and E are siblings, A joins them next, and (B, C) joins last

Neighbor joining in additive distance
• Two key ideas in modifying clustering for additive trees
• “Normalize” distance by average distances over set of leaves L
  \[ D(x, y) = d(x, y) - \frac{\sum_{z \in L} \bigl( d(x, z) + d(y, z) \bigr)}{|L| - 2} \]
• Change distance recalculation after merging clusters
• If C_k = C_i ∪ C_j then for all nodes m
  \[ d(k, m) = \tfrac{1}{2} \bigl( d(i, m) + d(j, m) - d(i, j) \bigr) \]

Additive neighbor joining
Distances d:
      A  B  C  D  E
  A   0  8  8  5  4
  B      0  2  5  6
  C         0  5  6
  D            0  3
  E               0
→ Normalized distances D:
       A       B       C       D       E
  A    0    -7.33   -7.33   -9.33  -10.67
  B            0   -12      -8      -7.33
  C                    0    -8      -7.33
  D                            0    -9.33
  E                                    0
• Find the lowest score in the matrix
• Merge columns/rows
• Update distances
  \[ d(k, m) = \tfrac{1}{2} \bigl( d(i, m) + d(j, m) - d(i, j) \bigr) \]

Additive neighbor joining
Distances d after merging B and C:
       A   BC    D    E
  A    0    7    5    4
  BC        0    4    5
  D              0    3
  E                   0
→ Normalized distances D:
       A   BC    D    E
  A    0   -9   -9  -10
  BC        0  -10   -9
  D              0   -9
  E                   0
• Two possible merges: (A, E) and (BC, D) both score -10
• Additive-distance tree (unlike ultrametric) not necessarily unique

Multiple trees (basically unrooted)
[Figure: the two trees obtained from the two possible merges, with scores on the links]

Internal nodes
• Just finding the tree topology is not necessarily the end product
• What about
  labels on the internal nodes?
• A phylogenic tree is hypothesizing a point of divergence
  – There was an ancestor string at that point
  – Can we hypothesize the string in addition to the point of divergence?
• One method is “maximum parsimony” or just “parsimony”
• Sort of an Occam’s razor approach: hypothesize as few mutations as necessary

Parsimony
• Parsimony methods are generally presented for a given tree
  – Multiple trees can be compared, but searching over all possible trees is generally intractable
• For the current discussion, assume that we are given a tree
  – Perhaps derived via iterative pairwise alignment
• Since we have a tree, assume that we have a multi-sequence alignment consistent with that tree
• Given the tree and multi-sequence alignment of leaves, parsimony looks for the minimum substitutions/mutations over the tree

Phylogenic Alignment problem
• Given a phylogenic tree T, the phylogenic alignment problem is:
  – Label the internal nodes of T such that the overall distance of the alignment is minimized
  – Overall distance is the sum of all parent/child distances
  – Usually different “sites” in the string are modeled independently
• Minimum mutation problem (Fitch-Hartigan)
  – Phylogenic Alignment problem when given multi-sequence alignment of the leaves
  – Efficient dynamic programming when given tree T

Continue with additive tree example
      A  B  C  D  E
  A   0  8  8  5  4
  B      0  2  5  6
  C         0  5  6
  D            0  3
  E               0
Tree: A and E are siblings (links 3 and 1); their parent joins D (link 1) through a link of 1; B and C are siblings (links 1 and 1); the two subtrees are 3 apart through the root (links 2 and 1)
Aligned strings at the leaves:
  A: CATG−AAG
  B: G−CATCCT
  C: G−GATGCT
  D: G−AG−ATT
  E: C−−G−AGT

Minimum mutation dynamic programming
[Figure: the tree above with the aligned strings at its leaves; a cost table is filled in at each internal node, bottom up]
• Each table has one row per symbol (C, T, A, G, −) and one digit per alignment column: the minimum number of mutations below the node if the node takes that symbol in that column
Costs at the parent of A (CATG−AAG) and E (C−−G−AGT):
  C: 02222222
  T: 22122221
  A: 21222012
  G: 22202211
  −: 21120222
Costs at the parent of B (G−CATCCT) and C (G−GATGCT):
  C: 22122102
  T: 22220220
  A: 22202222
  G: 02122122
  −: 20222222
Costs at the parent of D (G−AG−ATT) and the (A, E) node:
  C: 13322233
  T: 23222221
  A: 22222023
  G: 13302222
  −: 21220233
Costs at the root:
  C: 23422233
  T: 33421331
  A: 33412233
  G: 13412233
  −: 31421343

Minimum mutation backtrace
• Root: G−−ATCCT (a minimum-cost symbol in each column of the root table)
• Parent of D and the (A, E) node: G−−G−ACT
• Parent of B and C: G−−ATCCT
• Parent of A and E: C−−G−AAT
• Leaves: A = CATG−AAG, E = C−−G−AGT, D = G−AG−ATT, B = G−CATCCT, C = G−GATGCT

Minimum mutation without tree
• For a given tree, we have seen that the minimum mutation problem has an efficient dynamic programming solution
• No efficient algorithm for exploring all possible phylogenic trees for a set of strings
• However, for a given order of strings at the leaves, a variant of the CYK algorithm could be used
  – Typically used with context-free grammars
  – O(k^3 n) for k strings and n-column multi-sequence alignment
  – Could result in any binary tree shape over the ordered set

Iterative approaches to MSA and tree building
• Having trees helps with multi-sequence alignment
• Having multi-sequence alignment helps with trees
• Iterative approaches can be used. One example:
  1. Build a multi-sequence alignment using iterative alignment
  2. Build an initial phylogenic tree using neighbor joining
  3. Perform minimum mutation phylogenic alignment
  4. Build new multi-sequence alignment using full phylogenic tree
  5. Throw away strings on internal nodes, preserving MSA
  6. Go back to step 2, using the new multi-sequence alignment
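
The neighbor-joining step used in the loop above can be sketched in Python. This is a minimal illustration of the two formulas from the neighbor-joining slides (the normalized distance D and the update d(k, m)), not a full tree builder. The function names, the frozenset-keyed dictionary, and the label-concatenation convention for merged clusters are assumptions made for this sketch.

```python
from itertools import combinations

def dist(d, x, y):
    """Look up a symmetric distance stored under an unordered pair."""
    return 0.0 if x == y else d[frozenset((x, y))]

def normalized_distances(d, leaves):
    """D(x, y) = d(x, y) - (sum_z d(x, z) + sum_z d(y, z)) / (|L| - 2)."""
    totals = {x: sum(dist(d, x, z) for z in leaves) for x in leaves}
    scale = len(leaves) - 2
    return {
        frozenset((x, y)): dist(d, x, y) - (totals[x] + totals[y]) / scale
        for x, y in combinations(leaves, 2)
    }

def merge(d, leaves, i, j):
    """Replace i and j by k using d(k, m) = (d(i, m) + d(j, m) - d(i, j)) / 2."""
    k = i + j  # hypothetical naming: concatenate the two labels
    rest = [m for m in leaves if m not in (i, j)]
    new_d = {frozenset((x, y)): dist(d, x, y) for x, y in combinations(rest, 2)}
    for m in rest:
        new_d[frozenset((k, m))] = 0.5 * (dist(d, i, m) + dist(d, j, m) - dist(d, i, j))
    return new_d, rest + [k]

# Additive matrix from the worked example (upper triangle only).
leaves = ["A", "B", "C", "D", "E"]
upper = {"AB": 8, "AC": 8, "AD": 5, "AE": 4, "BC": 2,
         "BD": 5, "BE": 6, "CD": 5, "CE": 6, "DE": 3}
d = {frozenset(pair): float(value) for pair, value in upper.items()}

D = normalized_distances(d, leaves)
i, j = sorted(min(D, key=D.get))      # pair with the lowest normalized score
d2, leaves2 = merge(d, leaves, i, j)  # one merge step
D2 = normalized_distances(d2, leaves2)
```

On this matrix the first merge is B with C, and after the update the pairs (A, E) and (BC, D) tie for the lowest normalized score, which is the "two possible merges" situation. Repeating the step until two clusters remain would give the full tree; link scores are not computed here.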
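
The minimum mutation dynamic programming for a fixed tree and alignment can also be sketched in Python. The sketch assumes unit cost per substitution with the gap treated as a fifth symbol, which appears consistent with the cost tables in the worked example; each alignment column is handled independently. The nested-tuple tree encoding, the function names, and the use of ASCII "-" for the gap symbol are choices made for this illustration only.

```python
import math

SYMBOLS = "CTAG-"  # same row order as the cost tables; "-" is the gap

def node_costs(tree, column):
    """Return {symbol: fewest mutations below this node if it takes that symbol}."""
    if isinstance(tree, str):  # leaf: only the observed symbol is allowed
        return {s: 0 if s == column[tree] else math.inf for s in SYMBOLS}
    total = {s: 0 for s in SYMBOLS}
    for child in tree:
        child_costs = node_costs(child, column)
        for s in SYMBOLS:
            # Best child symbol t, paying 1 if it differs from the parent's s.
            total[s] += min(child_costs[t] + (s != t) for t in SYMBOLS)
    return total

def minimum_mutations(tree, alignment):
    """Minimum number of mutations per alignment column (sites are independent)."""
    length = len(next(iter(alignment.values())))
    per_site = []
    for pos in range(length):
        column = {name: seq[pos] for name, seq in alignment.items()}
        per_site.append(min(node_costs(tree, column).values()))
    return per_site

# Tree and aligned strings from the worked example.
TREE = ((("A", "E"), "D"), ("B", "C"))
ALIGNMENT = {
    "A": "CATG-AAG",
    "B": "G-CATCCT",
    "C": "G-GATGCT",
    "D": "G-AG-ATT",
    "E": "C--G-AGT",
}

per_site = minimum_mutations(TREE, ALIGNMENT)
print(per_site, sum(per_site))  # [1, 1, 4, 1, 1, 2, 3, 1] 14
```

A backtrace that picks one minimizing symbol per node, top down, would recover internal labels such as those in the backtrace slides; ties mean the labeling is not unique.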