Dating phylogenetic trees

Phylogenetics and comparative methods
May 8th, 2015
Molecular phylogenetic trees
Reconstruction methods estimate
• unrooted trees
• with branch lengths in genetic change units
Ideal phylogenetic trees
• are rooted
• have branch lengths in units of time
Example of a dated tree
Dating molecular trees
Dating phylogenetic trees is challenging. The basic idea is
that we are given
• a tree topology
• branch lengths, each the product of a rate and a time (b = r × t)
• calibration point(s), e.g. a time estimate T for a node
Given these, can we date all the other nodes in the tree?
A constant evolutionary rate
• To obtain a calibrated tree, the evolutionary model must assume a
relationship between the accumulation of genetic diversity and time
• Zuckerkandl and Pauling (1962): the rate of amino acid
replacements in animal haemoglobins was roughly proportional to
real time, as judged against the fossil record.
Global molecular clock
Under a global molecular clock, the substitution rate r is the same in
every lineage and constant over time. In that case, branch lengths are
simply proportional to time.
There are several reasons why we want to believe in the existence of a
constant global molecular clock
• phylogenetic inference is much simpler when constant rates of
evolution can be assumed
• a constant clock makes it possible to estimate divergence times, and
to date specific events (e.g. migration, hybridisation, host switch, ...)
• the molecular clock relates to Kimura’s theory of neutral evolution
Evolutionary time with a clock
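• estimate the pairwise genetic distance
d = genetic distance
• use paleontological data to determine the date of the common ancestor
T = time since divergence
• estimate the calibration rate (number of substitutions per unit of time)
$r = d_{ac} / (2 T_{ac})$
• calculate the time of divergence for all other nodes
$T_{ab} = d_{ab} / (2r)$
The factor 2 appears because both lineages accumulate substitutions
independently since their split.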
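The calibration steps above can be sketched in a few lines of Python. This is only an illustration under a strict clock: the function names and the numerical distances and calibration age are made-up values, not data from the lecture.

```python
def calibration_rate(d_ac, t_ac):
    """Rate r = d_ac / (2 * T_ac) from a calibrated pair (a, c)."""
    return d_ac / (2.0 * t_ac)


def divergence_time(d_ab, rate):
    """Divergence time T_ab = d_ab / (2 * r) for an uncalibrated pair (a, b)."""
    return d_ab / (2.0 * rate)


# Hypothetical inputs: distances in substitutions per site, time in My.
d_ac = 0.20   # assumed distance between a and c
t_ac = 10.0   # assumed fossil-based age of the a/c ancestor (My)
d_ab = 0.08   # assumed distance between a and b

rate = calibration_rate(d_ac, t_ac)   # substitutions per site per My
t_ab = divergence_time(d_ab, rate)    # inferred age of the a/b ancestor (My)
print(f"rate = {rate:.4f} subs/site/My, T_ab = {t_ab:.2f} My")
```

With these assumed numbers the a/b split comes out younger than the calibration node in the same proportion as the distances, which is all a strict clock can say.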
Strict molecular clock
Substitutions occur randomly according to a Poisson process:
$P(N(t + \delta t) - N(t) = k) = \frac{e^{-\lambda \delta t} (\lambda \delta t)^k}{k!}, \quad k = 0, 1, \ldots$
Number of mutations occurring per million years, with Poisson variance:
• 95% of the lineages 15 my old have between 8 and 22 substitutions
• 8 substitutions could also be < 5 my old!
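The spread quoted above can be checked numerically. The sketch below assumes a rate of one substitution per million years, which is not stated on the slide but is the value that makes a 15 my old lineage expect 15 substitutions; the helper names are illustrative.

```python
import math


def poisson_pmf(k, mean):
    """P(N = k) for a Poisson variable with the given mean (log-space for stability)."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def poisson_interval_prob(lo, hi, mean):
    """P(lo <= N <= hi) for a Poisson variable with the given mean."""
    return sum(poisson_pmf(k, mean) for k in range(lo, hi + 1))


rate_per_my = 1.0  # ASSUMPTION: one substitution per My

# Probability that a 15 my old lineage shows between 8 and 22 substitutions.
p_15 = poisson_interval_prob(8, 22, rate_per_my * 15.0)

# Probability that a lineage only 5 my old already shows 8 or more substitutions.
p_young = 1.0 - poisson_interval_prob(0, 7, rate_per_my * 5.0)

print(f"P(8..22 | 15 My) = {p_15:.3f}")
print(f"P(>= 8 | 5 My)   = {p_young:.3f}")
```

Under this assumed rate the first probability is close to 95%, and the second is far from negligible, which is why a single substitution count gives only a loose date even under a perfect clock.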
Testing the global clock
• strict molecular clock
• all lineages evolve at the same rate
• allows the estimation of the root of the tree and dates of individual
nodes
Zuckerkandl and Pauling, 1962
• unconstrained Felsenstein model
• each branch has its own rate independent of all others
• time and rate are confounded and can only be estimated as a
compound parameter (the branch length)
Felsenstein, 1981
Non-clock phylogenetic tree
• unrooted tree
• 2n − 3 independent branches
• all of the $b_i$ need to be estimated
• Maximum Likelihood
$L(T, b_i, \theta) = \prod_k \mathrm{Prob}(y_k \mid T, b_i, \theta)$
Clock phylogenetic tree
• rooted tree
• n − 1 independent branches
• only the heights of the nodes to estimate
• $b_1 = b_2$, $b_3 = b_4$, $b_6 = b_5 + b_1$, $b_8 = b_7 + b_6 - b_3$
Likelihood ratio test
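• Null model H0 (clock): n − 1 parameters
• Alternative model H1 (no clock): 2n − 3 parameters
• test statistic: $2 (\ln L(H_1) - \ln L(H_0))$
• likelihood ratio test with (2n − 3) − (n − 1) = n − 2 degrees of freedom,
compared against a χ² distribution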
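A minimal sketch of the clock test follows, using only the standard library. The log-likelihood values and the number of taxa are invented for illustration, and `chi2_sf` is a small hand-written helper (a series expansion of the incomplete gamma function), not a library routine; in practice a statistics package would normally supply the χ² tail probability.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def clock_lrt(lnl_clock, lnl_free, n_taxa):
    """Likelihood ratio test of the clock (H0) against the unconstrained model (H1)."""
    stat = 2.0 * (lnl_free - lnl_clock)
    df = n_taxa - 2  # (2n - 3) - (n - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a 10-taxon data set.
stat, df, p_value = clock_lrt(lnl_clock=-5010.0, lnl_free=-5000.0, n_taxa=10)
print(f"2*dlnL = {stat:.1f}, df = {df}, p = {p_value:.4f}")
```

A small p-value means the clock-constrained tree fits significantly worse, so the global clock is rejected for that data set.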
Relative rate tests
The molecular clock test is a very strict statistical test, because the clock
can be rejected even if a single lineage differs from the others.
The idea is to detect such lineages and remove them from the tree
• compare two ingroup lineages for their distance to a single outgroup
• can be modified to test multiple lineages
• for each non-root node
• test if the two descendants of a node have the same branch length
• remove lineages that show significant deviation from the clock hypothesis
• this creates a “linearized tree”
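One concrete way to compare two ingroup lineages against an outgroup is Tajima's (1993) relative rate test, sketched below. This specific test is brought in here as an example and is not named in the slides; the function name and the toy sequences are invented, and the sketch assumes an ungapped alignment of equal-length sequences.

```python
import math


def tajima_relative_rate(seq_a, seq_b, seq_out):
    """Count sites unique to each ingroup lineage and test m1 == m2 (chi-square, 1 df)."""
    m1 = m2 = 0
    for a, b, o in zip(seq_a, seq_b, seq_out):
        if a != b:
            if b == o:
                m1 += 1  # change assigned to lineage A
            elif a == o:
                m2 += 1  # change assigned to lineage B
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of a chi-square with 1 df, written with the error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value


# Toy alignment (not real data): lineage A carries 8 unique changes, B carries 2.
outgroup = "AAAAAAAAAAAA"
lineage_a = "CCCCCCCCAAAA"
lineage_b = "AAAAAAAAAAGG"

m1, m2, chi2, p_value = tajima_relative_rate(lineage_a, lineage_b, outgroup)
print(f"m1 = {m1}, m2 = {m2}, chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

Under a clock the two counts should be roughly equal; a significant imbalance flags one of the two lineages as a candidate for removal when building a linearized tree.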
Solutions to molecular clocks
If the previous tests show deviation from the clock hypothesis, removing taxa
might not be ideal. So how can we try to deal with that?
• if rate variation is random in direction and magnitude in different
DNA regions, combining a large number of data sets might give
reasonable estimates of divergence times
• but grasses have higher rates in plastid, nuclear and mitochondrial
genes
• assume that all genes share common divergence times, but allow the
pattern of rate variation to differ among genes
Local molecular clocks
If the global molecular clock does not hold, we could try to fit local clocks.
• postulate some small number k > 1 of fixed but different rates for
sets of branches
• some models are not identifiable: they do not permit unambiguous
estimation of times and rates
• combinatorially huge number of ways to assign a small number of
rates on a large tree
• test for rate differences at each node to identify subtrees with
common rates
Rate of the rate of evolution
The molecular clock assumption can be relaxed by imposing a weaker
constraint that is still sufficient to allow estimation of divergence
times
• we need to come up with a way to model changes of evolutionary
rates through time
• a tractable solution is to use temporal autocorrelation to model rate
changes among lineages
Penalized likelihood
Modelling rate autocorrelation through lineages
• likelihood with one rate $r_k$ per branch k:
$f(\theta_{SAT} \mid x_1, \ldots, x_n) = \prod_k f(x_k \mid r_k [t_{anc(k)} - t_k])$
• a “smoothing” parameter λ is introduced and can be tuned to allow
greater or lesser rate smoothing:
• $\log f(\theta_{SAT} \mid x_1, \ldots, x_n) - \lambda \Theta(r_1, \ldots, r_n)$
• the penalty $\Theta(r_1, \ldots, r_n)$ can simply measure rate differences
between branches, so that maximizing the objective keeps them small
• constant rate: λ → ∞
• each branch has its own rate: λ = 0
• cross-validation to estimate the smoothing parameter
• allows constraints to be added to the model in the form of known
dates
Sanderson, 2002
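The role of the smoothing parameter can be illustrated with a toy objective. The sketch below uses a simplified penalty, the sum of squared rate differences between each branch and its parent branch, which is an assumption made for illustration and not necessarily the exact form used by Sanderson (2002). The branch names, rates and the log-likelihood value are placeholders.

```python
def roughness_penalty(rates, parent):
    """Sum of squared rate differences between each branch and its parent branch."""
    return sum(
        (rates[branch] - rates[anc]) ** 2
        for branch, anc in parent.items()
        if anc is not None
    )


def penalized_log_likelihood(log_lik, rates, parent, smoothing):
    """Objective: log-likelihood minus smoothing * roughness penalty."""
    return log_lik - smoothing * roughness_penalty(rates, parent)


# Hypothetical four-branch tree: each branch maps to its parent branch.
parent = {"b1": None, "b2": "b1", "b3": "b1", "b4": "b2"}
clock_rates = {"b1": 0.01, "b2": 0.01, "b3": 0.01, "b4": 0.01}
free_rates = {"b1": 0.01, "b2": 0.02, "b3": 0.01, "b4": 0.04}

log_lik = -1234.5  # placeholder; a real value would come from the sequence data

for smoothing in (0.0, 1e3, 1e6):
    score = penalized_log_likelihood(log_lik, free_rates, parent, smoothing)
    print(f"lambda = {smoothing:g}: penalized logL = {score:.3f}")
```

With smoothing set to 0 the penalty vanishes and every branch is free to take its own rate; as smoothing grows, any rate variation is punished more heavily and the optimum is pushed towards a single constant rate.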
Bayesian approach
Modelling rate evolution in a hierarchical Bayesian setting:
$f(r, a, \theta_r, \theta_a, \theta_s \mid X) = \frac{f(X \mid r, a, \theta_s)\, f(r \mid \theta_r)\, f(a \mid \theta_a)\, f(\theta_s)\, f(\theta_r)\, f(\theta_a)}{f(X)}$
• where
• r = substitution rates for branches, a = ages of interior nodes
• $\theta_r$ = lineage-specific rate variation
• $\theta_a$ = model of branching times
• $\theta_s$ = model of sequence evolution
• MCMC to integrate over all possible rate assignments
• gives credible intervals around the rate estimates for each branch and
the obtained dates
• allows constraints to be added to the model in the form of known
dates
Correlated relaxed clock
Uncorrelated relaxed clock
Divergence time influences
When calibrating the divergence times of some internal nodes, the tree
prior is constructed in BEAST using three main ingredients:
1. One or more "calibration densities"
2. A parametric "tree prior" that specifies a density on the topology and
all the divergence times of the tree
3. Zero or more additional constraints on the topology in the form of
subsets of taxa that are constrained to be monophyletic
Fossils
Phylogeography calibration
Volcanic islands are nice... but rare
Fleischer et al. 1998
Other possibilities
Sauquet et al. 2012
Calibration points
Heath et al. 2012
Effects of calibrations I
Sauquet et al. 2012
Effects of calibrations II
Sauquet et al. 2012
Effects of substitution models
Brandley et al. 2012