Phylogeny Analysis

Announcements
Looks like online course evaluation is limited to Chemistry this quarter. So we will presumably do course evaluations on paper on Wednesday.
Friday’s section will be a review session for the Final Exam. We will post practice exam materials by Wednesday.
Final exam: Tues., Dec. 6, 11:30 am - 2:30 pm, here. Covers material from the whole quarter.
Graduate projects are due (by email) prior to the Final (recommended by this Friday, to make it easier for you to study!).

Ancestral State Likelihood
Say you are given a rooted tree of a set of modern sequences X, Y, Z, .... Now consider a given nucleotide site s in the aligned sequences where they differ (i.e. different nucleotides are observed). You are asked to infer the ancestral states (i.e. nucleotides) at this site in the different MRCAs (ancestors) of these sequences, under the Jukes-Cantor model.
State what hidden variables you would use, and their relation to the tree.
State the basic form of the likelihood model you would use by specifying what you would use as the basic transition probability for an edge, and the basic form of how the transition probabilities are combined.

Evolutionary Models

Ancestral State Likelihood Lessons
Most of you understood that the ancestral sequences should be treated as hidden variables, corresponding to the internal nodes of the rooted tree.
A few of you assumed we should model just one ancestral sequence (i.e. the root), forgetting that phylogenies are generally modeled as binary trees (i.e. each node has only two child nodes), so we need two ancestral nodes to give three modern sequences.
Some of you proposed that the mutation probability should just be proportional to rate × time. This is reasonable at short time scales, but at long time scales it “saturates” to an upper bound (e.g. for JC, τij → 3/4).
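The saturation described above can be checked numerically. The following is a minimal sketch, assuming the standard Jukes-Cantor result that the probability a site differs across an edge is 3/4 · (1 − e^(−4d/3)), where d is the expected number of mutation events per site on that edge. The function name and the sample values of d are illustrative and do not come from the slides.

```python
import math

def jc_difference_probability(d: float) -> float:
    """Probability that a site differs across an edge under Jukes-Cantor.

    d is the expected number of mutation events per site on the edge
    (an assumed parameterization, chosen for illustration).
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Short edges: nearly linear in d. Long edges: flattens out at 3/4.
for d in (0.01, 0.1, 1.0, 10.0):
    print(f"d = {d:5.2f}  p(differ) = {jc_difference_probability(d):.4f}")
```

For small d the printed probability is close to d itself, while for large d it approaches 3/4 and never exceeds it.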
Just remember that it’s linear if we assume at most one mutation / site, but turns into exponential decay if multiple mutations / site are possible.
Many of you forgot that each edge also has an associated hidden variable that affects the mutation probability. Under JC, this is the time t associated with this edge (which can also be expressed as the “distance” (1 − π)λt).

Distance Metric Behavior
Under the Jukes-Cantor model (all nucleotides equally probable; all transitions between them equally probable), how do you expect the following distance metrics to behave as a function of evolutionary time t? (Sketch a basic graph with t as the horizontal axis and distance as the vertical axis.)
f: observed letter differences per site
δ: inferred mutation events per site

Phylogenetic Tree Construction

Distance Metric Behavior Lessons
Most of you understood that total mutation events per site will just increase in proportion to the total elapsed time.
Some of you realized that the observed letter differences / site is likely to underestimate the total mutation events per site (because if more than one mutation hits the same site, it will only count as one observed letter difference).
However, most of you assumed that observed letter differences / site will just be proportional to total elapsed time. This is reasonable at short time scales, but impossible at long time scales, since the observed letter differences / site cannot exceed the fraction expected for completely random sequences (e.g. for JC, 3/4). This “saturation” at long time scales gives an exponential decay toward that bound.

Ultrametric vs. Additive
Can the Jukes-Cantor model fit the additive distance assumption? Can the Jukes-Cantor model fit the ultrametric distance assumption? If so, how? If not, why not?

Ultrametric vs. Additive Lessons
3/8 of you got this right.
A few of you interpreted JC as implying the same transition matrix for all time intervals. No.
The JC transition matrix has both a mutation rate parameter λ and a time parameter t: $T^{t} = e^{\lambda t \Pi}$
Note that an ultrametric distance by definition obeys additivity. Yet some of you proposed that JC fits ultrametric but not additive distance. You are misinterpreting what the “additive distance assumption” means: it means that distances are additive (which is true for ultrametric distances).
Most of you did not consider how to prove that distances are guaranteed to be additive. This flows from the Markov property and using the same mutation matrix Π on every edge (but allowing the rate multiplier λ to vary).

Tree Topology Information
You are given an alignment of four sequences, and asked to construct an unrooted tree that minimizes the number of total mutations on the tree, using the alignment columns where all four are aligned. Select all of the following classes of alignment columns that contain information about the tree topology (i.e. excluding those alignment columns from the analysis could change which tree has the minimal number of total mutations):
A. alignment columns where all four sequences share the same letter.
B. alignment columns where three sequences share the same letter, and the remaining sequence has a different letter.
C. alignment columns where two sequences share a letter, and the other two sequences share another letter.
D. alignment columns where all four sequences disagree with each other.

Tree Topology Information Lessons
Common error: interpreting the case where one sequence differs from the other three as informative. But for an unrooted tree, the three possible ways we could pair that sequence (i.e. the three distinct unrooted trees for these four sequences) all imply exactly one mutation. So this case is not informative for picking which of the three trees has the fewest mutations.
Some of you thought of this question in terms of a rooted tree, in which case the 3:1 split would be informative.
But the question was specifically for an unrooted tree.
Some of you chose the columns where all 4 sequences were identical, apparently on the principle of minimizing the number of mutations. But such columns are totally uninformative for choosing the best tree.

UPGMA Applications
UPGMA is a general-purpose clustering algorithm that can be applied to inferring the family tree structure of a set of related sequences. How general is this algorithm? I.e. do you expect it to work on all kinds of evolutionarily related sequences? If yes, briefly explain why; if not, provide an example scenario where you would expect it to fail.

UPGMA Application Lessons
Most of you understood that UPGMA is not totally general.
Some of you correctly saw that UPGMA can only find the correct tree if the distances from each ancestor to all its descendants are equal. (So UPGMA will reconstruct the tree incorrectly if the mutation rates were different on different branches.) Additive distances are not enough to make UPGMA valid.
Some of you saw that the mean-averaging assumption (reduction step) is inadequate if the members of the group have different mutation rates.
Some of you assumed UPGMA uses an incorrect distance metric (e.g. observed differences / site) so it would fail at longer time scales. No. You can input any metric you want into UPGMA.

Viterbi at the Leaf Nodes
Let’s compute p∗(Dα|α) for the simplest possible case, where both child nodes of α are observed (i.e. leaf nodes; call them X, Y). Furthermore, assume that all of the variables (sequences) consist of just a single nucleotide (e.g. X = A, Y = G). Write an equation for how you would compute the Viterbi maximum probability p∗(Dα|α) in this case.

Intro Bioinformatics

Viterbi at the Leaf Nodes Lessons
Most of you applied the general Viterbi formula properly.
Some of you disregarded the fact that X,Y were leaf-nodes, and so have no descendants.
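For the leaf-node case, one way to read the intended answer is that p∗(Dα|α) reduces to the product of the two edge transition probabilities, p(X|α) · p(Y|α), because X and Y are observed (no maximization over their states) and have no descendants (nothing below them to multiply in). The sketch below assumes Jukes-Cantor transition probabilities parameterized by an edge length d (expected mutation events per site). The edge lengths and all function names are hypothetical and are not given in the slides.

```python
import math

NUCLEOTIDES = "ACGT"

def jc_transition(parent: str, child: str, d: float) -> float:
    """Jukes-Cantor probability of observing `child` given `parent`
    across an edge of length d (assumed: expected events per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

def leaf_viterbi(alpha: str, x: str, y: str, d_x: float, d_y: float) -> float:
    """p*(D_alpha | alpha) when both children X, Y are observed leaves."""
    return jc_transition(alpha, x, d_x) * jc_transition(alpha, y, d_y)

# Example from the slide: X = A, Y = G, with illustrative edge lengths.
for alpha in NUCLEOTIDES:
    print(alpha, leaf_viterbi(alpha, "A", "G", 0.1, 0.1))
```

With equal edge lengths, α = A and α = G score the same, and both score higher than α = C or α = T, since those two states would require a mutation on both edges.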
Some of you ignored the fact that X,Y were observed, so there is no need to consider multiple states.

Viterbi at the Root
We wish to find the values of all our variables (i.e. all internal nodes) α, β, γ, ... that maximize the total probability of the tree p(α, β, γ, ..., X, Y, Z, ...), where X, Y, Z, ... are the observed sequences (leaf nodes). Say you are given p∗(Dα|α) for the root node α. How would you find the value of α that maximizes p(α, β, γ, ..., X, Y, Z, ...), and the associated maximum probability p∗(α, β, γ, ..., X, Y, Z, ...)?

Viterbi at the Root Lessons
A quarter of you got this completely right.
Many of you don’t seem to have thought of this in terms of Viterbi recursion. That is, you were already given “the answer” for everything below α, i.e. p∗(Dα|α), so all you had to do was choose the value of α that maximized the product of the probabilities for α and for everything below it.
Some of you referred to other variables β, γ etc. without realizing that everything below the root was already captured by p∗(Dα|α).
Most of you did not consider the prior probability p(α). I get the feeling you’re not really thinking in terms of the basic chain rule.

Tree Reduction
UPGMA is a recursive algorithm that depends on a reduction step that replaces the selected pair of clusters with a new cluster, prior to applying the tree construction algorithm recursively on the “reduced” set of clusters. Let’s consider the details of the reduction step for the algorithm for a simple tree of four sequences X,Y,Z,W. Say sequences X,Y are chosen as nearest neighbors. Indicate precisely how they would be replaced with a new cluster vertex C, by drawing a representative tree before and after this replacement.

UPGMA Tree Reduction Lessons
Again, about a quarter of you got this completely right.
Some of you forgot that all of these algorithms produce binary trees, i.e. each node has two child nodes.
Some of you forgot that UPGMA produces a rooted tree.
Most of you did not consider that the replacement node C must be the same distance from other sequences Z as X and Y were. This is required in order for recursion to work on the “reduced set of distances”.

Tree Reduction
Neighbor Joining is a recursive algorithm that depends on a reduction step that replaces the selected pair of clusters with a new cluster, prior to applying the tree construction algorithm recursively on the “reduced” set of clusters. Let’s consider the details of the reduction step for the algorithm for a simple tree of four sequences X,Y,Z,W. Say sequences X,Y are chosen as nearest neighbors. Indicate precisely how they would be replaced with a new cluster vertex C, by drawing a representative tree before and after this replacement.

NJ Tree Reduction
$\delta(A, X) = \alpha + \gamma; \quad \delta(B, X) = \beta + \gamma; \quad \delta(A, B) = \alpha + \beta$
$\delta(C, X) = \gamma = \frac{\delta(A, X) + \delta(B, X) - \delta(A, B)}{2}$
If we use this formula for C’s distance to every other node X, we have a new set of additive distances that we can just re-apply the NJ recursion to.

NJ Tree Reduction Lessons
The key principle of “tree reduction” is that it must give a reduced set of distances (X and Y replaced by C) that we can just apply the exact same recursion to. I.e. additive distances consistent with the original tree with the X, Y edges removed.
Many of you proposed an averaging rule. I think you were intuitively proposing something close to just removing the X, Y edges as we just showed. Note this is different from UPGMA tree reduction for the same reasons that NJ is different from UPGMA: additive / unrooted instead of ultrametric / rooted.
Some of you assumed a rooted tree. But this requires the ultrametric assumption.

NJ Tree Connectivity Significance?
Consider the following unrooted tree produced by Neighbor Joining.
Note that sequences A,B are connected directly to each other (no intervening edges) even though δAC < δAB and also δBC < δAB. Does this indicate that Neighbor Joining has made an evolutionarily incorrect tree? Does the fact that A,B are directly connected have any valid evolutionary meaning?

NJ Tree Connectivity Significance Lessons
Half of you got this completely right.
Some of you assumed that the connectivity (e.g. the fact that A and B are directly connected to each other) of the unrooted tree has no evolutionary meaning. No. The connectivity records the actual branching history of the sequences. Concretely, if A,B are neighbors then for any other pair of sequences C,D, the distances within the two groups must be less than the distances between them: δ(A, B) + δ(C, D) < δ(A, C) + δ(B, D). I.e. there is an “extra period of history” in connecting the two pairs to each other (A, B → C, D) compared with connecting each pair to itself (A → B, C → D).
One of you thought this means “no other species separated along the way between A and B”. No. This is possible, e.g. if the root is on edge A or edge B.

NJ: Where did the 1/(n-2) factor come from?
$Q(X, Y) = \delta(X, Y) - \frac{1}{n-2} \sum_{Z} \delta(X, Z) - \frac{1}{n-2} \sum_{Z} \delta(Y, Z)$
εX is included (n-1) times in the first sum, and 1 time in the second sum, i.e. n times total. Counting its single occurrence in δ(X, Y) as well, the weight of εX in Q(X,Y) is 1 − n/(n-2) = -2/(n-2). The same logic holds for εY.
For any other sequence Z, εZ appears once in each sum, for a total weight in Q(X,Y) of -2/(n-2).
Therefore any external edge εZ contributes exactly the same to Q(X,Y) for every possible pair X,Y (even if X or Y is Z). The value of εZ will have no effect on choosing the minimal Q(X,Y).
Note that the 1/(n-2) factor is critical for achieving this.

TREE EVALUATION & INTERPRETATION

Monophyletic Group Analysis
[Figure: groupings of sequences 1, 2, 3 against an evolutionary distance axis, one labeled “confident” and one labeled “uncertain”.]
A monophyletic group is a subset of sequences that includes all descendants of their most recent common ancestor (MRCA).
We can only be confident in a group if all other sequences are well separated from it.
Confident monophyletic groups also form a tree!

Monophyletic groups
Say a subset of sequences X, Y, Z ... is found to form a monophyletic group in a given rooted tree. What constraint(s), if any, does this impose on what other monophyletic groups can be found in this tree?

Monophyletic groups Answer
A monophyletic group corresponds to an internal node in a rooted tree. The other possible monophyletic groups in this tree are just the other internal nodes. So if another monophyletic group overlaps this group (i.e. they have a sequence in common), then either it must be a descendant of this group (strictly contained in this group) or vice versa.
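The nesting constraint in the answer above can be illustrated with a small check. This is a minimal sketch, assuming a rooted tree written as nested Python tuples with string leaf names. The example tree, the leaf names W and V, and the function names are hypothetical. It collects the leaf set under every internal node and confirms that any two such sets are either disjoint or nested.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf_set, clade_list) for a rooted tree of nested tuples.

    Each internal node contributes one clade: the set of leaves below it.
    """
    if isinstance(tree, str):          # a leaf: no internal node here
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)               # the clade rooted at this node
    return leaves, found

def compatible(a, b):
    """Two groups can coexist in one rooted tree only if disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

# Hypothetical example tree: ((X, Y), Z) joined with (W, V) at the root.
tree = ((("X", "Y"), "Z"), ("W", "V"))
_, groups = clades(tree)
print(all(compatible(a, b) for a, b in combinations(groups, 2)))
```

By contrast, two groups such as {X, Y} and {Y, Z} overlap without nesting, so `compatible` rejects them: they cannot both be monophyletic in the same rooted tree.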