Phylogeny Analysis

Announcements
Looks like online course evaluation is limited to Chemistry this
quarter. So we will presumably do course evaluations on paper
on Wednesday.
Friday’s section will be a review session for the Final Exam.
We will post practice exam materials by Wednesday.
Final exam Tues., Dec. 6, 11:30 am - 2:30 pm, here.
Covers material from the whole quarter.
Graduate projects due (by email) prior to the Final (recommended
by this Friday, to make it easier for you to study!)
Ancestral State Likelihood
Say you are given a rooted tree of a set of modern sequences
X , Y , Z , .... Now consider a given nucleotide site s in the aligned
sequences where they differ (i.e. different nucleotides are observed).
You are asked to infer the ancestral states (i.e. nucleotides) at this site
in the different MRCAs (ancestors) of these sequences, under the
Jukes-Cantor model.
State what hidden variables you would use, and their relation to
the tree.
State the basic form of the likelihood model you would use by
specifying what you would use as the basic transition probability
for an edge, and the basic form of how the transition probabilities
are combined.
Evolutionary Models
Ancestral State Likelihood Lessons
Most of you understood that the ancestral sequences should be
treated as hidden variables, corresponding to the internal nodes
of the rooted tree.
A few of you assumed we should model just one ancestral
sequence (i.e. the root), forgetting that phylogenies are generally
modeled as binary trees (i.e. each node has only two child nodes)
so we need two ancestral nodes to give three modern sequences.
Some of you proposed that the mutation probability should just be
proportional to rate × time. This is reasonable at short time
scales, but at long time scales it “saturates” to an upper bound
(e.g. for JC, τij → 3/4). Just remember that it’s linear if we
assume at most one mutation / site, but approaches the bound by
exponential decay if multiple mutations / site are possible.
Many of you forgot that each edge also has an associated hidden
variable that affects the mutation probability. Under JC, this is the
time t associated with this edge (which can also be expressed as
the “distance” (1 − π)λt).
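The likelihood form described above can be sketched for a single site. This is a minimal illustration, assuming three leaves X, Y, Z, a root α whose children are β and Z, an internal node β whose children are X and Y, a uniform prior at the root, and edges parameterized by the JC distance d (expected substitutions per site). The function names and edge lengths are hypothetical, not taken from the lecture.

```python
import math
from itertools import product

NUCS = "ACGT"

def jc_prob(a, b, d):
    """JC probability of state b at the end of an edge of distance d, given state a at its start."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def joint(alpha, beta, x, y, z, d):
    """Joint probability of hidden states (alpha, beta) and observed leaves (x, y, z)."""
    return (0.25                              # uniform prior p(alpha), an assumption
            * jc_prob(alpha, beta, d["ab"])   # edge alpha -> beta
            * jc_prob(alpha, z, d["az"])      # edge alpha -> Z
            * jc_prob(beta, x, d["bx"])       # edge beta -> X
            * jc_prob(beta, y, d["by"]))      # edge beta -> Y

def site_likelihood(x, y, z, d):
    """Sum the joint probability over all values of the hidden internal nodes."""
    return sum(joint(a, b, x, y, z, d) for a, b in product(NUCS, repeat=2))

def best_ancestors(x, y, z, d):
    """Most probable joint assignment (alpha, beta) of the internal nodes."""
    return max(product(NUCS, repeat=2), key=lambda ab: joint(ab[0], ab[1], x, y, z, d))

# Hypothetical edge distances, for illustration only.
edges = {"ab": 0.1, "az": 0.2, "bx": 0.1, "by": 0.1}
print(site_likelihood("A", "A", "G", edges))
print(best_ancestors("A", "A", "G", edges))
```

Each edge contributes one transition probability, the factors are multiplied along the tree, and the hidden internal nodes are either summed out (likelihood) or maximized over (ancestral state inference).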
Distance Metric Behavior
Under the Jukes-Cantor model (all nucleotides equally probable; all
transitions between them equally probable), how do you expect the
following distance metrics to behave as a function of evolutionary time
t? (sketch a basic graph with t as the horizontal axis and distance as
the vertical axis):
f: observed letter differences per site
δ : inferred mutation events per site
Phylogenetic Tree Construction
Distance Metric Behavior Lessons
Most of you understood that total mutation events per site will just
increase proportional to the total elapsed time.
Some of you realized that the observed letter differences / site is
likely to underestimate the total mutation events per site (because
if more than one mutation hits the same site, it will only count one
observed letter difference).
However, most of you assumed that observed letter differences /
site will just be proportional to total elapsed time. This is
reasonable at short time scales, but impossible at long time
scales, since the observed letter differences / site cannot exceed
the fraction expected for completely random sequences (e.g. for
JC, 3/4). This “saturation” at long time scales means the curve
approaches that limit by exponential decay.
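The two curves can be compared numerically. The sketch below uses the standard Jukes-Cantor relationship between δ (substitutions per site) and f (observed differences per site); the function names are illustrative.

```python
import math

def observed_fraction(delta):
    """Expected observed differences per site f, given delta substitutions per site under JC."""
    return 0.75 * (1.0 - math.exp(-4.0 * delta / 3.0))

def inferred_distance(f):
    """JC correction: substitutions per site delta implied by an observed fraction f < 3/4."""
    if not 0.0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75) under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)

# delta grows linearly with time, while f flattens out below 3/4.
for delta in (0.01, 0.1, 0.5, 1.0, 3.0, 10.0):
    print(f"delta={delta:5.2f}  f={observed_fraction(delta):.4f}")
```

For small δ the two values nearly coincide; for large δ, f approaches 3/4 while δ keeps growing.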
Ultrametric vs. Additive
Can the Jukes-Cantor model fit the additive distance assumption?
Can the Jukes-Cantor model fit the ultrametric distance
assumption?
If so, how? If not, why not?
Ultrametric vs. Additive Lessons
3/8 of you got this right.
A few of you interpreted JC as implying the same transition matrix
for all time intervals. No. The JC transition matrix has both a
mutation rate parameter λ and time parameter t:
$T^t = e^{\lambda t \Pi}$
Note that an ultrametric distance by definition obeys additivity. Yet
some of you proposed that JC fits ultrametric but not additive
distance. You are misinterpreting what the “additive distance
assumption” means: it means that distances are additive (which
is true for ultrametric distances).
Most of you did not consider how to prove that distances are
guaranteed to be additive. This flows from the Markov property
and using the same mutation matrix Π on every edge (but
allowing the rate multiplier λ to vary).
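The additivity argument can be checked numerically. This sketch assumes the closed-form JC transition matrix written in terms of the edge distance d (which absorbs the rate multiplier and the time for that edge), and verifies that composing two edges gives the same matrix as a single edge of length d1 + d2. Helper names are hypothetical.

```python
import math

def jc_matrix(d):
    """4x4 JC transition matrix for an edge of distance d (expected substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain 4x4 matrix product: chaining two edges under the Markov property."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))

# Two consecutive edges with different rate * time products.
d1, d2 = 0.3, 1.1
composed = matmul(jc_matrix(d1), jc_matrix(d2))
direct = jc_matrix(d1 + d2)
print(max_abs_diff(composed, direct))  # tiny, at the level of floating-point rounding
```

Because the same matrix family is used on every edge, distances along a path simply add, regardless of how the rate varies from edge to edge.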
Tree Topology Information
You are given an alignment of four sequences, and asked to construct
an unrooted tree that minimizes the number of total mutations on the
tree, using the alignment columns where all four are aligned. Select all
of the following classes of alignment columns that contain information
about the tree topology (i.e. excluding those alignment columns from
the analysis could change which tree has the minimal number of total
mutations):
A. alignment columns where all four sequences share the same
letter.
B. alignment columns where three sequences share the same
letter, and the remaining sequence has a different letter.
C. alignment columns where two sequences share a letter, and
the other two sequences share another letter.
D. alignment columns where all four sequences disagree with
each other.
Tree Topology Information Lessons
Common error: interpreting the case where one sequence differs from
the other three as informative. But for an unrooted tree, the three
possible ways we could pair that sequence (i.e. the three distinct
unrooted trees for these four sequences) all imply exactly one
mutation. So this case is not informative for picking which of the
three trees has the fewest mutations.
Some of you thought of this question in terms of a rooted tree, in
which case the 3:1 split would be informative. But the question
was specifically for an unrooted tree.
Some of you chose the columns where all 4 sequences were
identical, apparently on the principle of minimizing the number of
mutations. But such columns are totally uninformative for choosing
the best tree.
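The column classes can be checked by brute force. The sketch below scores each of the three unrooted four-sequence trees with Fitch-style parsimony counting and reports whether a column distinguishes between them; the function names are illustrative.

```python
def fitch_join(s, t):
    """Combine two child state sets; cost 1 if they share no state."""
    common = s & t
    return (common, 0) if common else (s | t, 1)

def quartet_cost(col, pairing):
    """Minimum mutations for one column on the unrooted tree that pairs (a,b) against (c,d)."""
    (a, b), (c, d) = pairing
    left, c1 = fitch_join({col[a]}, {col[b]})
    right, c2 = fitch_join({col[c]}, {col[d]})
    _, c3 = fitch_join(left, right)
    return c1 + c2 + c3

# The three distinct unrooted trees for sequences 0, 1, 2, 3.
TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def is_informative(col):
    """A column is informative only if the trees differ in mutation count."""
    return len({quartet_cost(col, t) for t in TOPOLOGIES}) > 1

for col in ("AAAA", "AAAG", "AAGG", "ACGT"):
    print(col, [quartet_cost(col, t) for t in TOPOLOGIES], is_informative(col))
```

Only the 2:2 column (class C) gives different counts across the three trees.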
UPGMA Applications
UPGMA is a general-purpose clustering algorithm that can be applied
to inferring the family tree structure of a set of related sequences. How
general is this algorithm? I.e. do you expect it to work on all kinds of
evolutionarily related sequences? If yes, briefly explain why; if not,
provide an example scenario where you would expect it to fail.
UPGMA Application Lessons
Most of you understood that UPGMA is not totally general.
Some of you correctly saw that UPGMA can only find the correct
tree if the distances from each ancestor to all its descendants are
equal. (So UPGMA will reconstruct the tree incorrectly if the mutation
rates were different on different branches).
Additive distances are not enough to make UPGMA valid.
Some of you saw that the mean-averaging assumption (reduction
step) is inadequate if the members of the group have different
mutation rates.
Some of you assumed UPGMA uses an incorrect distance metric
(e.g. observed differences / site) so it would fail at longer time
scales. No. You can input any metric you want into UPGMA.
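A small failure scenario, as a sketch: the distances below are exactly additive on a hypothetical tree ((A,B),(C,D)) with external edges A=1, B=5, C=1, D=5 and internal edge 1 (fast evolution on the B and D branches). They are not ultrametric, and the closest pair is not a pair of true neighbors, so the first UPGMA merge is already wrong. All names and edge lengths are made up for illustration.

```python
from itertools import combinations

# Additive distances read off the hypothetical tree ((A,B),(C,D)).
dist = {
    frozenset("AB"): 6, frozenset("AC"): 3, frozenset("AD"): 7,
    frozenset("BC"): 7, frozenset("BD"): 11, frozenset("CD"): 6,
}

def is_ultrametric(dist, names, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances must be equal."""
    for x, y, z in combinations(names, 3):
        d = sorted([dist[frozenset((x, y))], dist[frozenset((x, z))], dist[frozenset((y, z))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

def first_upgma_merge(dist):
    """UPGMA always merges the closest pair first."""
    return min(dist, key=dist.get)

print(is_ultrametric(dist, "ABCD"))     # False
print(sorted(first_upgma_merge(dist)))  # ['A', 'C'], although A's true neighbor is B
```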
Viterbi at the Leaf Nodes
Let’s compute p∗ (Dα |α) for the simplest possible case, where both
child nodes of α are observed (i.e. leaf nodes; call them X , Y ).
Furthermore, assume that all of the variables (sequences) consist of
just a single nucleotide (e.g. X = A, Y = G).
Write an equation for how you would compute the Viterbi maximum
probability p∗ (Dα |α) in this case.
Intro Bioinformatics
Viterbi at the Leaf Nodes Lessons
Most of you applied the general Viterbi formula properly.
Some of you disregarded the fact that X,Y were leaf-nodes, and
so have no descendants.
Some of you ignored the fact that X,Y were observed, so no need
to consider multiple states.
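For observed leaves there is nothing to maximize below them, so the Viterbi term reduces to p∗(Dα|α) = p(X|α) p(Y|α) for each value of α. A minimal sketch, assuming JC transition probabilities and hypothetical edge distances d_x, d_y:

```python
import math

NUCS = "ACGT"

def jc_prob(a, b, d):
    """JC probability of state b at the end of an edge of distance d, given state a at its start."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def viterbi_leaf_parent(x, y, d_x, d_y):
    """p*(D_alpha | alpha) for each alpha when both children X, Y are observed leaves."""
    return {alpha: jc_prob(alpha, x, d_x) * jc_prob(alpha, y, d_y) for alpha in NUCS}

# Example from the question: X = A, Y = G; the edge distances are made up.
scores = viterbi_leaf_parent("A", "G", 0.1, 0.1)
for alpha, p in scores.items():
    print(alpha, p)
```

With equal edge lengths, α = A and α = G score equally and both beat C and T, since each requires only one change instead of two.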
Viterbi at the Root
We wish to find the values of all our variables (i.e. all internal nodes)
α, β , γ, ... that maximize the total probability of the tree
p(α, β , γ, ..., X , Y , Z , ...), where X , Y , Z , ... are the observed
sequences (leaf nodes).
Say you are given p∗ (Dα |α) for the root node α .
How would you find the value of α that maximizes
p(α, β , γ, ..., X , Y , Z , ...), and the associated maximum
probability p∗ (α, β , γ, ..., X , Y , Z , ...)?
Viterbi at the Root Lessons
A quarter of you got this completely right.
Many of you don’t seem to have thought of this in terms of Viterbi
recursion. That is, you were already given “the answer” for
everything below α, i.e. p∗ (Dα |α), so all you had to do was
choose the value of α that maximized the product of the probability
of α and that of everything below it.
Some of you referred to other variables β , γ etc. without realizing
that everything below the root was already captured by p∗ (Dα |α).
Most of you did not consider the prior probability p(α).
I get the feeling you’re not really thinking in terms of the basic
chain rule.
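By the chain rule, the final step is α∗ = argmax over α of p(α) p∗(Dα|α), and the maximum probability is that product evaluated at α∗. A sketch with made-up numbers for p∗(Dα|α):

```python
def viterbi_root(prior, p_star):
    """Return (best alpha, maximum joint probability) = argmax / max of p(alpha) * p*(D_alpha | alpha)."""
    best = max(prior, key=lambda a: prior[a] * p_star[a])
    return best, prior[best] * p_star[best]

# Hypothetical values of p*(D_alpha | alpha), as if handed up by the recursion.
p_star = {"A": 0.02, "C": 0.001, "G": 0.015, "T": 0.001}

uniform = {n: 0.25 for n in "ACGT"}
skewed = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}

print(viterbi_root(uniform, p_star))  # picks A
print(viterbi_root(skewed, p_star))   # the prior shifts the choice to G
```

The second call shows why the prior p(α) cannot be dropped: it can change which α wins.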
Tree Reduction
UPGMA is a recursive algorithm that depends on a reduction step that
replaces the selected pair of clusters with a new cluster, prior to
applying the tree construction algorithm recursively on the “reduced”
set of clusters.
Let’s consider the details of the reduction step for the algorithm for a
simple tree of four sequences X,Y,Z,W. Say sequences X,Y are
chosen as nearest neighbors. Indicate precisely how they would be
replaced with a new cluster vertex C, by drawing a representative tree
before and after this replacement.
UPGMA Tree Reduction Lessons
Again, about a quarter of you got this completely right.
Some of you forgot that all of these algorithms produce binary
trees, i.e. each node has two child nodes.
Some of you forgot that UPGMA produces a rooted tree.
Most of you did not consider that the replacement node C must
be the same distance from other sequences Z as X and Y were.
This is required in order for recursion to work on the “reduced set
of distances”.
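A sketch of the reduction step, assuming the usual size-weighted average for the new cluster's distances and placing C at height δ(X,Y)/2. The example distances are hypothetical and ultrametric, so the average equals the distance that X and Y each already had to Z.

```python
def upgma_reduce(dist, sizes, x, y, new):
    """Replace clusters x and y by cluster `new`; return (distances, sizes, height of new node)."""
    others = [z for z in sizes if z not in (x, y)]
    reduced = {frozenset((a, b)): dist[frozenset((a, b))]
               for i, a in enumerate(others) for b in others[i + 1:]}
    nx, ny = sizes[x], sizes[y]
    for z in others:
        # Size-weighted mean of the two old distances to z.
        reduced[frozenset((new, z))] = (
            nx * dist[frozenset((x, z))] + ny * dist[frozenset((y, z))]) / (nx + ny)
    new_sizes = {z: sizes[z] for z in others}
    new_sizes[new] = nx + ny
    return reduced, new_sizes, dist[frozenset((x, y))] / 2

# Hypothetical ultrametric distances for X, Y, Z, W.
dist = {
    frozenset("XY"): 2, frozenset("XZ"): 6, frozenset("YZ"): 6,
    frozenset("XW"): 10, frozenset("YW"): 10, frozenset("ZW"): 10,
}
sizes = {"X": 1, "Y": 1, "Z": 1, "W": 1}
reduced, new_sizes, height = upgma_reduce(dist, sizes, "X", "Y", "C")
print(reduced, new_sizes, height)
```

The reduced set (C, Z, W) is again ultrametric, so the same procedure can be applied to it unchanged.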
Tree Reduction
Neighbor Joining is a recursive algorithm that depends on a reduction
step that replaces the selected pair of clusters with a new cluster, prior
to applying the tree construction algorithm recursively on the “reduced”
set of clusters.
Let’s consider the details of the reduction step for the algorithm for a
simple tree of four sequences X,Y,Z,W. Say sequences X,Y are
chosen as nearest neighbors. Indicate precisely how they would be
replaced with a new cluster vertex C, by drawing a representative tree
before and after this replacement.
NJ Tree Reduction
δ (A, X ) = α + γ; δ (B , X ) = β + γ; δ (A, B ) = α + β
$\delta(C, X) = \gamma = \frac{\delta(A, X) + \delta(B, X) - \delta(A, B)}{2}$
If we use this formula for C’s distance to every other node X, we have
a new set of additive distances that we can just re-apply the NJ
recursion to.
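The reduction formula can be sketched directly. The example distances are read off a hypothetical tree ((A,B),(X,Y)) with external edges A=1, B=5, X=1, Y=5 and internal edge 1, so the expected reduced distances are δ(C,X) = 2 and δ(C,Y) = 6. The function name is illustrative.

```python
def nj_reduce(dist, names, a, b, new):
    """Replace neighbors a, b by node `new` with d(new, x) = (d(a,x) + d(b,x) - d(a,b)) / 2."""
    d_ab = dist[frozenset((a, b))]
    others = [x for x in names if x not in (a, b)]
    reduced = {frozenset((p, q)): dist[frozenset((p, q))]
               for i, p in enumerate(others) for q in others[i + 1:]}
    for x in others:
        reduced[frozenset((new, x))] = (
            dist[frozenset((a, x))] + dist[frozenset((b, x))] - d_ab) / 2
    return reduced, others + [new]

# Additive distances from the hypothetical tree described above.
dist = {
    frozenset("AB"): 6, frozenset("AX"): 3, frozenset("AY"): 7,
    frozenset("BX"): 7, frozenset("BY"): 11, frozenset("XY"): 6,
}
reduced, names = nj_reduce(dist, ["A", "B", "X", "Y"], "A", "B", "C")
print(reduced, names)
```

The reduced distances match the original tree with the A and B edges cut off at C, which is what allows the same recursion to be re-applied.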
NJ Tree Reduction Lessons
The key principle of “tree reduction” is that it must give a reduced
set of distances (X and Y replaced by C) that we can just apply
the exact same recursion to. I.e. additive distances consistent
with the original tree with the X, Y edges removed.
Many of you proposed an averaging rule. I think you were
intuitively proposing something close to just removing the X, Y
edges as we just showed.
Note this is different from UPGMA tree reduction for the same
reasons that NJ is different from UPGMA: additive / unrooted
instead of ultrametric / rooted.
Some of you assumed a rooted tree. But this requires the
ultrametric assumption.
NJ Tree Connectivity Significance?
Consider the following unrooted tree produced by Neighbor Joining.
Note that sequences A,B are connected directly to each other (no
intervening edges) even though δAC < δAB and also δBC < δAB . Does
this indicate that Neighbor Joining has made an evolutionarily incorrect
tree? Does the fact that A,B are directly connected have any valid
evolutionary meaning?
NJ Tree Connectivity Significance Lessons
Half of you got this completely right.
Some of you assumed that the connectivity (e.g. the fact that A
and B are directly connected to each other) of the unrooted tree
has no evolutionary meaning. No. The connectivity records the
actual branching history of the sequences.
Concretely, if A,B are neighbors then for any other pair of
sequences C,D, the distances within the two groups must be less
than the distances between them:
δ(A, B) + δ(C, D) < δ(A, C) + δ(B, D)
I.e. there is an “extra period of history” on the paths connecting
the two pairs to each other (A, B → C, D) that is absent from the
paths within each pair (A → B, C → D).
One of you thought this means “no other species separated along
the way between A and B”. No. This is possible, e.g. if the root is
on edge A or edge B.
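The inequality can be tested on a concrete case. The sketch below uses a hypothetical tree ((A,B),(C,D)) with long external edges A=4, B=4, short external edges C=1, D=1, and internal edge 1, so that δAC < δAB and δBC < δAB even though A and B are true neighbors.

```python
def quartet_pairing(dist, a, b, c, d):
    """Return the pairing of four sequences with the smallest within-pair distance sum."""
    def D(p, q):
        return dist[frozenset((p, q))]
    sums = {
        ((a, b), (c, d)): D(a, b) + D(c, d),
        ((a, c), (b, d)): D(a, c) + D(b, d),
        ((a, d), (b, c)): D(a, d) + D(b, c),
    }
    return min(sums, key=sums.get), sums

# Additive distances from the hypothetical tree described above.
dist = {
    frozenset("AB"): 8, frozenset("AC"): 6, frozenset("AD"): 6,
    frozenset("BC"): 6, frozenset("BD"): 6, frozenset("CD"): 2,
}
pairing, sums = quartet_pairing(dist, "A", "B", "C", "D")
print(pairing, sums)
```

The pairing (A,B) | (C,D) wins with sum 10 against 12 for the other two; the difference of 2 is twice the internal edge length.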
NJ: Where did the 1/(n-2) factor come from?
$Q(X, Y) = \delta(X, Y) - \frac{1}{n-2} \sum_Z \delta(X, Z) - \frac{1}{n-2} \sum_Z \delta(Y, Z)$
εX is included once (with weight +1) in δ(X,Y) itself, (n-1) times
in the first sum, and 1 time in the second sum, i.e. n times total
in the sums.
Therefore the weight of εX in Q(X,Y) is 1 - n/(n-2) = -2/(n-2).
The same logic holds for εY .
For any other sequence Z, εZ appears once in each sum, for a
total weight in Q(X,Y) of -2/(n-2).
Therefore any external edge εZ contributes exactly the same to
Q(X,Y) for every possible pair X,Y (even if X or Y is Z).
The value of εZ will have no effect on choosing the minimal
Q(X,Y).
Note that the 1/(n-2) factor is critical for achieving this.
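This invariance can be checked numerically. The sketch below computes Q for every pair on a hypothetical four-sequence additive distance set, lengthens one external edge by an arbitrary amount, and confirms that every Q(X,Y) shifts by the same constant, -2·extra/(n-2). Function names and distances are illustrative.

```python
from itertools import combinations

def q_matrix(dist, names):
    """Q(X,Y) = d(X,Y) - (sum_Z d(X,Z) + sum_Z d(Y,Z)) / (n - 2) for every pair."""
    n = len(names)
    def row_sum(x):
        return sum(dist[frozenset((x, z))] for z in names if z != x)
    return {frozenset((x, y)): dist[frozenset((x, y))] - (row_sum(x) + row_sum(y)) / (n - 2)
            for x, y in combinations(names, 2)}

def lengthen_external_edge(dist, leaf, extra):
    """Lengthening a leaf's external edge adds `extra` to every distance involving that leaf."""
    return {pair: (d + extra if leaf in pair else d) for pair, d in dist.items()}

names = "ABCD"
# Additive distances from a hypothetical tree ((A,B),(C,D)).
dist = {
    frozenset("AB"): 8, frozenset("AC"): 6, frozenset("AD"): 6,
    frozenset("BC"): 6, frozenset("BD"): 6, frozenset("CD"): 2,
}
before = q_matrix(dist, names)
after = q_matrix(lengthen_external_edge(dist, "B", 10), names)
shifts = {pair: after[pair] - before[pair] for pair in before}
print(shifts)  # every pair shifts by -2 * 10 / (4 - 2) = -10
```

Since all pairs shift equally, the pair minimizing Q is unchanged.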
TREE EVALUATION & INTERPRETATION
Monophyletic Group Analysis
[Figure: two example trees over groups 1, 2, 3 drawn against an evolutionary distance axis, one labeled “confident” and one labeled “uncertain”.]
A monophyletic group is a subset of sequences that
includes all descendants of their most recent common
ancestor (MRCA). We can only be confident in such a group if all
other sequences are well separated from it. Confident
monophyletic groups also form a tree!
Monophyletic groups
Say a subset of sequences X , Y , Z ... is found to form a monophyletic
group in a given rooted tree. What constraint(s) if any does this
impose on what other monophyletic groups can be found in this tree?
Monophyletic groups Answer
A monophyletic group corresponds to an internal node in a rooted
tree.
The other possible monophyletic groups in this tree are just the
other internal nodes.
So if another monophyletic group overlaps this group (i.e. they
have a sequence in common), then either it must be a
descendant of this group (strictly contained in this group) or vice
versa.
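The constraint can be stated as "any two monophyletic groups are either nested or disjoint". A sketch that enumerates the groups (one per internal node) of a rooted tree written as nested tuples, and checks this property; the tree and the function names are hypothetical.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set of `tree`, list of leaf sets for every internal node below and including it)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, groups = frozenset(), []
    for child in tree:
        child_leaves, child_groups = clades(child)
        leaves |= child_leaves
        groups.extend(child_groups)
    groups.append(leaves)
    return leaves, groups

def nested_or_disjoint(groups):
    """True if every pair of groups is either nested (one contains the other) or disjoint."""
    return all(a <= b or b <= a or not (a & b) for a, b in combinations(groups, 2))

# Hypothetical rooted tree: ((X,Y),Z) on one side of the root, (V,W) on the other.
tree = ((("X", "Y"), "Z"), ("V", "W"))
_, groups = clades(tree)
print([sorted(g) for g in groups])
print(nested_or_disjoint(groups))                        # True
print(nested_or_disjoint(groups + [frozenset("YZ")]))    # False: {Y,Z} cuts across {X,Y}
```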