Relaxing the Molecular Clock to Different Degrees for Different
Substitution Types
Hui-Jie Lee,*,1 Nicolas Rodrigue,2 and Jeffrey L. Thorne1,3
1 Department of Statistics, North Carolina State University
2 Department of Biology, Carleton University, Ottawa, ON, Canada
3 Department of Biological Sciences, North Carolina State University
*Corresponding author: E-mail: [email protected].
Associate editor: Hideki Innan
Abstract
Rates of molecular evolution can vary over time. Diverse statistical techniques for divergence time estimation have been
developed to accommodate this variation. These typically require that all sequence (or codon) positions at a locus change
independently of one another. They also generally assume that the rates of different types of nucleotide substitutions vary
across a phylogeny in the same way. This permits divergence time estimation procedures to employ an instantaneous rate
matrix with relative rates that do not differ among branches. However, previous studies have suggested that some
substitution types (e.g., CpG to TpG changes in mammals) are more clock-like than others. As has been previously noted,
this is biologically plausible given the mutational mechanism of CpG to TpG changes. Through stochastic mapping of
sequence histories from context-independent substitution models, our approach allows for context-dependent nucleotide substitutions to change their relative rates over time. We apply our approach to the analysis of a 0.15 Mb intergenic
region from eight primates. In accord with previous findings, we find comparatively little rate variation over time for CpG
to TpG substitutions but we find more for other substitution types. We conclude by discussing the limitations and
prospects of our approach.
Key words: CpG transition rate, context-dependent substitution, relaxed molecular clock, divergence time estimation.
Introduction
Molecular sequence data provide information about the
amount of evolution, or the branch lengths, between species,
but they do not suffice for disentangling evolutionary rates
and times. When substitution rates change over time, divergence time estimation becomes more challenging because
fossil evidence that separates rates and times on one part
of a phylogeny does not guarantee that rates and times can
be well-separated on other parts of the tree. Just as improved
analysis of fossil data can lead to better divergence time estimates, improved treatment of evolutionary rates can yield
more successful divergence time estimates.
A wide variety of relaxed clock methods have been proposed and studied for facilitating the separation of rates and
times (Sanderson 1997, 2002; Thorne et al. 1998; Huelsenbeck
et al. 2000; Yoder and Yang 2000; Kishino et al. 2001; Aris-Brosou and Yang 2003; Drummond et al. 2006; Rannala and
Yang 2007). One promising direction is to identify factors that
cause rate variation over time or that at least covary with this
rate variation. For example, Lartillot and Delsuc (2012) have
considered how life-history traits such as body size might
covary with nucleotide substitution rates. By combining life
history trait information of extant species with the sequence
data from these species, Lartillot and Delsuc are able to partially separate evolutionary rates and times. They do this by
exploiting covariation across a phylogeny between body size
and evolutionary rates.
Other potential ways to partially or completely separate
rates and times are motivated by other factors that affect
substitution rates. These include the possibility that natural
selection induces a correlation between substitution rate and
effective population size (Ohta 1973) and the ability to better
understand substitution rates by incorporating mutation
data. Although diverse treatments of substitution rate variation over time are available, a typical assumption is that the
relative rates of different substitution types vary among
branches on a phylogenetic tree in relatively simple ways.
For example, the absolute rates of all substitution types
might change over time but the relative rates among types
might be invariant. Alternatively, effective population
size might vary among branches of a tree so that
mutation–selection balance approaches inspired by population genetics can assist in the determination of how rates of
different substitution types vary among branches (e.g., see
Yang and Nielsen 2008; Rodrigue et al. 2010).
Beyond the effects of generation time and natural selection, there are compelling biological reasons to believe that
additional flexibility in substitution rate variation over time is
warranted. Even for a neutrally evolving sequence, variation
across the phylogeny in generation length might be insufficient to completely explain variation across the phylogeny in
substitution rates. Although mutation rate equals substitution rate with neutral evolution (Kimura 1968, 1983), different
mutational mechanisms underlie different types of point
mutations. If mutation occurred only in meiosis, then mutation rate variation over time might be exclusively attributable
to changes in generation length. But this is not the case.
Some mutational mechanisms seem prone to yielding almost
clock-like substitution behavior because they can operate
at nearly any time during a life cycle, whereas other mutational
mechanisms are meiotic and therefore prone to affecting a
sequence site only once per generation. In addition, the relative importance of mutational mechanisms can change
across a phylogeny because the genetic and environmental
components affecting them can change.
© The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: [email protected]
Mol. Biol. Evol. 32(8):1948–1961 doi:10.1093/molbev/msv099 Advance Access publication April 29, 2015
Despite the fact that the molecular clock seldom holds in
empirical analyses, it has been suggested that CpG dinucleotides (i.e., cytosines immediately followed in sequence by guanines) may evolve in a comparatively clock-like fashion in
mammals and primates (Hwang and Green 2004; Kim et al.
2006; Peifer et al. 2008). As CpG methylation exists during
much of the life cycle of germ-line cells (Smallwood and
Kelsey 2012), mutations associated with DNA methylation
can accumulate throughout much of the duration of each
generation (Kim et al. 2006). Hwang and Green (2004) implemented a pioneering Bayesian Markov chain Monte Carlo
(MCMC) method that allows substitution rates to depend
on neighboring nucleotides. They observed that CpG transitions have a relatively clock-like behavior in mammals. Kim et
al. (2006) noted that divergence times between the human
and chimpanzee species pair and the macaque and baboon
species pair are thought to be similar, but the human–
chimpanzee pair has longer generation times. They pointed
out that the macaque–baboon pair had accumulated significantly more transitions at non-CpG sites than the other pair,
probably by virtue of the generation-time effect. However,
inferred amounts of transition substitutions at CpG sites
were similar between pairs. As emphasized by these studies,
methylation-origin mutations at CpG sites may be dependent
on chronological time rather than on time measured in generations. In other words, CpG sites might serve as the basis of
a relatively accurate molecular clock even for a phylogeny
relating lineages with diverse generation times. Inspired by
this earlier work and also by the realization that there are a
variety of other reasons why substitution rates may vary over
time differently for different substitution types, we aim to
investigate the clock-like natures of different kinds of (possibly
context-dependent) nucleotide substitutions and we aim to
simultaneously improve divergence time estimation.
New Approaches
Prior to describing its details, we provide an overview of our
approach that begins with considering the history of
sequence changes on each branch of the phylogeny.
Unfortunately, complete substitution histories are not directly observed when phylogenetically related interspecific
data sets are collected. Instead, only the sequences at the
tips of a rooted evolutionary tree are observed. This lack of
complete information makes the inference problem more
challenging. As allowing different substitution types to
have different clocks is computationally daunting with
context-dependent substitution, we provide another
nonideal solution.
In this study, we collect substitution histories of homologous sequences according to their posterior distribution using
a context-independent substitution model through a slightly
modified version of the PhyloBayes-MPI software (Lartillot
et al. 2013). These substitution histories are then the basis
for inferring context-dependent rates on each branch. Instead
of different genes having different branch lengths as is the
situation for multigene divergence time estimation (Thorne
and Kishino 2002), now we have different kinds of nucleotide
substitutions with different “substitution lengths.” The substitution types share the same set of divergence times, but
each substitution type has its own rate trajectory on the
phylogeny. The substitution lengths and the associated uncertainty for each substitution type can be approximated
given the substitution histories. Based upon these approximations for the different substitution types, the divergence
times and chronological substitution rates of each kind of
substitution are estimated with a relaxed molecular clock
through MCMC (Thorne et al. 1998; Kishino et al. 2001;
Thorne and Kishino 2002).
Augmented Data Likelihood
The observed data are the homologous sequence alignment
D of length N from S species. The species are related through a
rooted phylogenetic tree with a topology that is assumed
known. The root has index 2S 2, tips have indices
j ¼ 0; . . . ; S 1, and nonroot internal nodes have indices
j ¼ S; . . . ; 2S 3. A branch is given the same index as the
node at its end.
Consider a specific branch j on a tree and a specific site i in
a DNA sequence. Assume the full substitution history of the
DNA sequence on branch j is observed. This history includes
the number of substitutions undergone for every site on
branch j, the timing of the substitution events, and the
states before and after each substitution. The substitution
rate per site on branch j from site context a to nucleotide
b will be $\lambda_{abj}$. Here, “context” represents both the state of the
site of interest and (possibly) states at other neighboring sites.
The vector $\boldsymbol{\lambda}_j$ will represent the entire collection of $\lambda_{abj}$ for all
possible contexts a and nucleotides b on branch j.
We are interested in the product of $\lambda_{abj}$ and time for each
branch. We refer to this product as the “substitution length”
of branch j from context a to nucleotide b. Although the
product can be estimated from sequence data, rates and
the branch time are confounded. We will set the branch
time at 1 for all branches. This means that the substitution
length for change from a to b is known on branch j if $\lambda_{abj}$ is
known. Let $\phi_{aij}$ ($\phi_{aij} \in [0, 1]$) be the proportion of time site
i has context a on branch j, with $\Phi_{ij}$ being a vector representing values of the dwell proportions $\phi_{aij}$ for all a, and let $n_{abij}$ be
the number of changes from context a to nucleotide b at site i
on branch j. We let $\mathbf{n}_{ij}$ be the vector of substitution counts
$n_{abij}$ for all a and b.
As an example, consider the substitution history illustrated
in figure 1. If j represents the branch being depicted and
context a represents a CpG dinucleotide, then the
proportion of time that site i = 2 is a CpG site is
$\phi_{a2j} = 0.4/1.0 = 0.4$. Because there was one substitution
from C to b = T for site i = 2 in the CpG dinucleotide context, $n_{ab2j} = 1$. This will contribute to the CpG transition
length. Site 3 in figure 1 also begins branch j as part of a CpG
dinucleotide. Because there was a substitution from C to T
at time 0.4 for site i = 2, the proportion of time that site i = 3
is a CpG site is $\phi_{a3j} = 0.4$. Also, because no substitution
occurred at site i = 3 for the CpG dinucleotide context,
$n_{ab3j} = 0$.
FIG. 1. An example substitution history on a branch for four sites.
If the nucleotide substitution process is treated as a continuous time Markov chain, then the likelihood for the
observed history of site i on branch j is
$$p(\mathbf{n}_{ij}, \Phi_{ij} \mid \boldsymbol{\lambda}_j, s_{ij0}) = \prod_a \prod_b \lambda_{abj}^{n_{abij}} \exp\{-\phi_{aij} \lambda_{abj}\}, \tag{1}$$
where $s_{ij0}$ is the initial state of site i on branch j. The likelihood
without conditioning on the initial state $s_{ij0}$ would be
$$p(\mathbf{n}_{ij}, \Phi_{ij} \mid \boldsymbol{\lambda}_j, s_{ij0})\, p(s_{ij0} \mid \boldsymbol{\lambda}_j) = p(\mathbf{n}_{ij}, \Phi_{ij}, s_{ij0} \mid \boldsymbol{\lambda}_j).$$
One simplification that is often made when modeling
molecular evolution is to assume stationarity of the substitution process. The stationarity assumption means that the
initial state $s_{ij0}$ has some information pertaining to the rate
parameters. We treat $p(s_{ij0} \mid \boldsymbol{\lambda}_j)$ as not being a function of $\boldsymbol{\lambda}_j$.
Therefore, we are only focusing on the transient part of the
likelihood. This means that equation (1) summarizes the
information about $\boldsymbol{\lambda}_j$ in an observed history.
Suppose the complete substitution history of every site is
available. The likelihood for the observed history of all sites on
branch j is then
$$p(\mathbf{n}_j, \Phi_j \mid \boldsymbol{\lambda}_j, \mathbf{s}_{j0}) = \prod_i \prod_a \prod_b \lambda_{abj}^{n_{abij}} \exp\{-\phi_{aij} \lambda_{abj}\} = \prod_a \prod_b \lambda_{abj}^{\sum_i n_{abij}} \exp\Bigl\{-\lambda_{abj} \sum_i \phi_{aij}\Bigr\}, \tag{2}$$
where $\mathbf{n}_j$, $\Phi_j$, and $\mathbf{s}_{j0}$ are, respectively, vectors of the substitution counts $\mathbf{n}_{ij}$, the dwell proportions $\Phi_{ij}$, and the initial
states $s_{ij0}$. The log-likelihood is then
$$\log p(\mathbf{n}_j, \Phi_j \mid \boldsymbol{\lambda}_j, \mathbf{s}_{j0}) = \sum_i \sum_a \sum_b \bigl[ n_{abij} \log \lambda_{abj} - \phi_{aij} \lambda_{abj} \bigr] = \sum_a \sum_b \bigl[ N_{abj} \log \lambda_{abj} - \phi_{aj} \lambda_{abj} \bigr], \tag{3}$$
where the substitution counts $N_{abj} = \sum_i n_{abij}$ and the
summed dwell proportions $\phi_{aj} = \sum_i \phi_{aij}$. Therefore, the sufficient statistics of $\lambda_{abj}$ on branch j consist of the summed
dwell proportions spent in context a on branch j, $\phi_{aj}$, and the
number of changes from context a to nucleotide b on branch
j, $N_{abj}$. By setting the first derivative to zero and solving, we
have the maximum-likelihood estimates
$$\hat{\lambda}_{abj} = \frac{N_{abj}}{\phi_{aj}}.$$
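The figure 1 example can be turned into a small numerical check. The following is a minimal Python sketch, not the authors' implementation: it assumes a branch of duration 1, pools the C and G sites of a CpG dinucleotide as in the strand-symmetric classification, and uses a hypothetical starting sequence `ACGG` in which only sites 2 and 3 (C and G) are taken from the text; the flanking bases and all function names are illustrative assumptions.

```python
from typing import List, Tuple

# Transition partner of each base that can sit in a CpG dinucleotide.
TRANSITION = {"C": "T", "G": "A"}


def in_cpg(seq: List[str], i: int) -> bool:
    """True if site i is the C or the G of a CpG dinucleotide."""
    if seq[i] == "C" and i + 1 < len(seq) and seq[i + 1] == "G":
        return True
    return seq[i] == "G" and i > 0 and seq[i - 1] == "C"


def cpg_transition_stats(start: str, events: List[Tuple[float, int, str]]):
    """Return (N, dwell) for CpG transitions on a branch of duration 1.

    events holds (time, 0-based site, new base); dwell[i] is the proportion
    of the branch that site i spends in the CpG context.
    """
    seq = list(start)
    dwell = [0.0] * len(seq)
    count = 0
    t_prev = 0.0
    # A sentinel event at time 1 closes the last interval of the branch.
    for t, site, new in sorted(events) + [(1.0, -1, "")]:
        for i in range(len(seq)):
            if in_cpg(seq, i):
                dwell[i] += t - t_prev
        if site >= 0:
            if in_cpg(seq, site) and TRANSITION.get(seq[site]) == new:
                count += 1
            seq[site] = new
        t_prev = t
    return count, dwell


# Sites 2 and 3 of figure 1 (0-based 1 and 2); the A and G flanks are assumed.
N, dwell = cpg_transition_stats("ACGG", [(0.4, 1, "T")])
lambda_hat = N / sum(dwell)  # maximum-likelihood estimate N_abj / phi_aj
print(N, dwell, lambda_hat)
```

Under these assumptions the two CpG sites each contribute a dwell proportion of 0.4 and one transition is counted, so the estimated CpG transition length on the branch is about 1/0.8 = 1.25.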
We can consider $\boldsymbol{\lambda}_{ab}$ as a vector of substitution lengths of
changes from context a to nucleotide b on the phylogenetic
tree. We have
$$\boldsymbol{\lambda}_{ab} = \begin{pmatrix} \lambda_{ab1} \\ \vdots \\ \lambda_{ab(2S-2)} \end{pmatrix}. \tag{4}$$
Let $M = \{M_{ij}\}_{1 \le i \le N,\, 1 \le j \le 2S-2}$ represent the substitution
history (stochastic mapping) that generates the molecular
sequence data D such that $M_{ij}$ is the substitution history at
site i on branch j. The substitution history $M_{ij}$ can be summarized by the sufficient statistics for $\lambda_{abj}$. These are $n_{abij}$
and $\phi_{aij}$.
$$M_{ij} = (n_{abij}, \phi_{aij}), \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, 2S - 2. \tag{5}$$
With a context-independent model of sequence change,
there are four possible states (A, T, C, and G) and 4 × 3 = 12
possible types of single-nucleotide substitutions (e.g., A → T,
A → C, etc.). If strand-symmetry is assumed so that types
have the same rate as their complement (e.g., A → C and
T → G have an identical rate), context-independence yields
six types of single nucleotide substitutions: four transversions
and two transitions. These six types of single nucleotide substitutions are listed in table 1. In contrast, there are nine types
of changes (pooling together complementary changes) and
three states (contexts) when classifying a site according to
whether it or its complement might have a methylated C in a
CpG dinucleotide. These nine types of changes are listed in
table 2. They include four transversions and two transitions
for non-CpG sites, and two transversions and one transition
for CpG sites. For instance, if the substitution history is
fully observed, the substitution length of CpG transitions
(Type 9 in table 2) on a branch can be inferred by the
number of CpG transitions divided by the sum over sites of
the proportions of time that CpG sites were located on that
branch. Similarly, the substitution length of non-CpG G → C
and C → G substitutions (Type 1 in table 2) is computed as
the number of G → C and C → G substitutions at non-CpG
sites divided by the summed proportions of time that non-CpG C or G sites existed on that branch. Note that the way
we define substitution lengths can be applied to other types
of context-dependent substitutions not discussed here. It is
possible to derive the substitution lengths for contexts considering any combinations of 5′ and 3′ neighboring
nucleotides.
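The classification into six context-independent and nine context-dependent types can be written as a lookup. This is a small Python sketch of the pooling described above; the function name `context_type` and the `in_cpg` flag are illustrative assumptions, and the type numbers follow tables 1 and 2.

```python
# Strand-symmetric pooling: each substitution shares a type with its complement.
BASE_TYPE = {
    ("G", "C"): 1, ("C", "G"): 1,
    ("G", "T"): 2, ("C", "A"): 2,
    ("T", "A"): 3, ("A", "T"): 3,
    ("T", "G"): 4, ("A", "C"): 4,
    ("G", "A"): 5, ("C", "T"): 5,
    ("A", "G"): 6, ("T", "C"): 6,
}

# Only changes away from C or G can occur in the CpG context (Types 7 to 9).
CPG_TYPE = {1: 7, 2: 8, 5: 9}


def context_type(old: str, new: str, in_cpg: bool) -> int:
    """Return the table 2 type (1 to 9) of a substitution old -> new."""
    base = BASE_TYPE[(old, new)]
    if not in_cpg:
        return base
    if base not in CPG_TYPE:
        raise ValueError("a site in the CpG context must start as C or G")
    return CPG_TYPE[base]


print(context_type("C", "T", in_cpg=True))   # CpG transition
print(context_type("A", "G", in_cpg=False))  # non-CpG transition
```

Passing `in_cpg=False` for every substitution recovers the six context-independent types of table 1.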
Table 1. The Six Single Nucleotide Substitution Types.
Type    Substitutions
1       G → C & C → G
2       G → T & C → A
3       T → A & A → T
4       T → G & A → C
5       G → A & C → T
6       A → G & T → C
NOTE: Complementary substitutions are considered as the same type.

Table 2. The Nine Context-Dependent CpG Substitution Types.
Type    Substitutions
1       Non-CpG G → C & C → G
2       Non-CpG G → T & C → A
3       Non-CpG T → A & A → T
4       Non-CpG T → G & A → C
5       Non-CpG G → A & C → T
6       Non-CpG A → G & T → C
7       CpG G → C & C → G
8       CpG G → T & C → A
9       CpG G → A & C → T
NOTE: Complementary substitutions are considered as the same type.

Sampling Substitution Histories
The “substitution lengths” of different types of substitutions
provide a basis for inferring the rates of each kind of change.
This quantity is hard to estimate directly from the molecular
sequence alignment, but the estimation becomes rather
straightforward if the substitution histories (mappings) can
be fully observed. We therefore employ data augmentation to
help estimate the “substitution lengths” of each kind of
change. That is, the augmented data (unobserved substitution histories) are constructed from the observed sequence
alignment.
Diverse strategies have been developed for sampling substitution histories M conditional upon the sequence alignment D and a vector of parameters $\theta$ that represents the
tree topology, branch lengths, and the parameters of the
substitution process. Sampling histories M from the distribution $\mathrm{Pr}_{\theta}(M \mid D)$ is categorized as endpoint-conditioned sampling because the sequence data D are observed at the
endpoints (tips) of the tree. Endpoint-conditioned sampling
strategies for molecular sequence data have been evaluated
by Hobolth and Stone (2009) and reviewed by Hobolth and
Thorne (2014).
The probability of a mapping conditioned on the molecular sequence alignment can be written as
$$\mathrm{Pr}_{\theta}(M \mid D) = \mathrm{Pr}(M \mid D, \theta) = \frac{\mathrm{Pr}(M, D, \theta)}{\mathrm{Pr}(D, \theta)}. \tag{6}$$
However, the true values of the parameters in $\theta$ are usually
unknown. A mapping sampled from the marginal
distribution $\mathrm{Pr}(M \mid D)$ by integrating out all possible parameter values in $\theta$ has density
$$\mathrm{Pr}(M \mid D) = \int \mathrm{Pr}(M \mid D, \theta)\, \mathrm{Pr}(\theta \mid D)\, d\theta. \tag{7}$$
Lartillot (2006) proposed a sampling algorithm that includes
data augmentation and conjugate Gibbs sampling to obtain samples from the joint posterior probability distribution
$\mathrm{Pr}(M, \theta \mid D)$ and implemented it in the PhyloBayes-MPI software (Lartillot et al. 2013). The algorithm proceeds in two
alternating steps. First, draw a substitution history conditional
on the parameters in $\theta$ in a way that is similar to Nielsen’s
algorithm (Nielsen 2002). Second, resample the parameters in
$\theta$ through a Gibbs sampler that is conditional on the substitution history. A Gibbs sequence is generated after repeating
this process many times, where a subset of samples of $\theta$ and
substitution histories are taken as draws from the full joint
posterior distribution of all parameters. The Monte Carlo estimate of any moments (e.g., mean) for the marginal distribution of the substitution histories can be directly computed
from the realizations of the Gibbs sequence.
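The two alternating steps can be illustrated on a deliberately simple toy problem. The sketch below is not the PhyloBayes-MPI algorithm: it assumes a single branch of duration 1, a two-state symmetric model in which every event flips the state (so the endpoints differ exactly when the number of events is odd), and a gamma prior on the single rate. All names and the simulated data are hypothetical.

```python
import random


def events_with_parity(rate: float, odd: bool, rng: random.Random) -> int:
    """Number of events on a unit-length branch, given the endpoint parity.

    Events arrive as a Poisson process; draws are rejected until the
    parity matches the observed endpoints (odd means the endpoints differ).
    """
    while True:
        n, t = 0, rng.expovariate(rate)
        while t < 1.0:
            n += 1
            t += rng.expovariate(rate)
        if (n % 2 == 1) == odd:
            return n


def gibbs(differs, iterations=300, shape=1.0, prior_rate=1.0, seed=1):
    """Alternate history sampling and a conjugate gamma update of the rate."""
    rng = random.Random(seed)
    rate, draws = 0.5, []
    for _ in range(iterations):
        # Step 1: augment the data with an event count for every site.
        total = sum(events_with_parity(rate, d, rng) for d in differs)
        # Step 2: rate | histories ~ Gamma(shape + total, prior_rate + sites);
        # gammavariate takes a scale, so the rate parameter is inverted.
        rate = rng.gammavariate(shape + total, 1.0 / (prior_rate + len(differs)))
        draws.append(rate)
    return draws


sites = [True] * 20 + [False] * 180  # 20 of 200 sites differ at the endpoints
draws = gibbs(sites)
posterior_mean = sum(draws[100:]) / len(draws[100:])
print(round(posterior_mean, 3))
```

For this toy model the rate that reproduces a 10% difference between endpoints is $-\tfrac{1}{2}\log(0.8) \approx 0.11$, so the posterior mean should land near that value; the exact figure depends on the seed and the prior.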
We use a slightly modified version of the PhyloBayes-MPI
software with an independent-site time-reversible model to
generate the substitution histories. We then use these sampled histories to make inferences about parameters in a
richer context-dependent substitution model. We do this
by relying upon the assumption that the distribution of
substitution histories is robust to substitution model specification. In other words, we assume that the endpoint-conditioned sample from PhyloBayes-MPI for our data set
can be treated as an endpoint-conditioned sample according to our substitution model of interest. For data sets with
long branches that generate high probabilities of multiple
changes per site or high probabilities of changes at consecutive sites, this assumption will be problematic. For data sets
with short branches and little sequence divergence, the assumption should be more appropriate. Further implications
of this as well as potential improvements to it are detailed in
the Discussion section.
Substitution Length Estimation
The PhyloBayes-MPI software samples endpoint-conditioned
substitution histories M and a vector of parameters $\theta$ that
specifies the substitution processes, tree topology, and branch
lengths from the joint posterior distribution $\mathrm{Pr}(M, \theta \mid D)$.
Further details about the PhyloBayes-MPI settings that we
used are in the Materials and Methods section. By only keeping a set of widely spaced realizations of mappings and parameters, we can generate C approximately independent
and identically distributed samples of M and $\theta$ from
$\mathrm{Pr}(M, \theta \mid D)$. That is,
$$\bigl\{ (M^{(c)}, \theta^{(c)}) \bigr\}_{c = 1, \ldots, C} \overset{\mathrm{iid}}{\sim} \mathrm{Pr}(M, \theta \mid D),$$
where $(M^{(c)}, \theta^{(c)})$ is the sample from the cth iteration.
In this notation, $M^{(c)} = \{M^{(c)}_{ij}\}$ ($1 \le i \le N$, $1 \le j \le 2S - 2$, $1 \le c \le C$) such that $M^{(c)}_{ij}$
is the substitution history of site
i on branch j in the cth iteration and has sufficient statistics
that are $n^{(c)}_{abij}$ and $\phi^{(c)}_{aij}$. Note that $M^{(c)}_{ij}$ includes the number of
substitutions from context a to nucleotide b at site i on
branch j and the proportion of time site i on branch j has
context a in the cth iteration.
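Keeping widely spaced realizations amounts to discarding an initial portion of the chain and then subsampling at a fixed interval. A minimal Python sketch follows; the burn-in of 200 and spacing of 100 are hypothetical values chosen only for illustration, and the list of integers stands in for saved $(M, \theta)$ states.

```python
def thin(chain, burn_in, spacing):
    """Drop the first burn_in states, then keep every spacing-th state."""
    return chain[burn_in::spacing]


chain = list(range(1000))  # stand-in for saved (M, theta) states
samples = thin(chain, burn_in=200, spacing=100)
print(len(samples))  # C = 8 retained samples
```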
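The “substitution lengths” of changes from context a to
nucleotide b on the phylogeny $\boldsymbol{\lambda}_{ab}$ can be estimated by maximum likelihood from $M^{(c)}$. The maximum-likelihood estimate of $\boldsymbol{\lambda}_{ab}$ for iteration c is
$$\hat{\boldsymbol{\lambda}}^{(c)}_{ab} = \begin{pmatrix} \hat{\lambda}^{(c)}_{ab1} \\ \vdots \\ \hat{\lambda}^{(c)}_{ab(2S-2)} \end{pmatrix}, \tag{8}$$
where
$$\hat{\lambda}^{(c)}_{abj} = \frac{\sum_i n^{(c)}_{abij}}{\sum_i \phi^{(c)}_{aij}} = \frac{N^{(c)}_{abj}}{\phi^{(c)}_{aj}}. \tag{9}$$
We can estimate $\boldsymbol{\lambda}_{ab}$ by
$$\hat{\boldsymbol{\lambda}}_{ab} = \frac{1}{C} \sum_{c=1}^{C} \hat{\boldsymbol{\lambda}}^{(c)}_{ab}, \tag{10}$$
so that
$$\hat{\lambda}_{abj} = \frac{1}{C} \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj}. \tag{11}$$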
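By assuming the maximum-likelihood estimates are asymptotically normally distributed, we approximate the variance of
$\hat{\lambda}^{(c)}_{abj}$ for iteration c by the inverse Fisher information. This
yields the variance estimate
$$\left. \frac{-1}{\dfrac{d^2 \log p(\mathbf{n}_j, \Phi_j \mid \boldsymbol{\lambda}_j, \mathbf{s}_{j0})}{d\lambda_{abj}^2}} \right|_{\hat{\lambda}^{(c)}_{abj}} = \frac{\bigl(\hat{\lambda}^{(c)}_{abj}\bigr)^2}{N^{(c)}_{abj}} = \frac{N^{(c)}_{abj}}{\bigl(\phi^{(c)}_{aj}\bigr)^2} = \frac{\hat{\lambda}^{(c)}_{abj}}{\phi^{(c)}_{aj}}. \tag{12}$$
By the Law of Total Variance, the variance of $\hat{\lambda}_{abj}$ is
$$\mathrm{Var}(\hat{\lambda}_{abj}) = \mathrm{Var}\bigl[ E\bigl( \hat{\lambda}_{abj} \mid M^{(c)} \bigr) \bigr] + E\bigl[ \mathrm{Var}\bigl( \hat{\lambda}_{abj} \mid M^{(c)} \bigr) \bigr]. \tag{13}$$
The first term can be estimated from the sample variance of
$\hat{\lambda}^{(c)}_{abj}$ and the second term can be estimated from the sample
average of the inverse Fisher information, $\hat{\lambda}^{(c)}_{abj} / \phi^{(c)}_{aj}$.
We can estimate the variance–covariance matrix
of $\hat{\boldsymbol{\lambda}}_{ab}$ with the diagonal elements $\widehat{\mathrm{Var}}(\hat{\lambda}_{abj})$ and off-diagonal
elements $\widehat{\mathrm{Cov}}(\hat{\lambda}_{abj}, \hat{\lambda}_{abj'})$, where
$$\widehat{\mathrm{Var}}(\hat{\lambda}_{abj}) = \frac{1}{C - 1} \sum_{c=1}^{C} \bigl( \hat{\lambda}^{(c)}_{abj} - \hat{\lambda}_{abj} \bigr)^2 + \frac{1}{C} \sum_{c=1}^{C} \frac{\hat{\lambda}^{(c)}_{abj}}{\phi^{(c)}_{aj}}, \tag{14}$$
and
$$\widehat{\mathrm{Cov}}(\hat{\lambda}_{abj}, \hat{\lambda}_{abj'}) = \frac{1}{C - 1} \left[ \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj} \hat{\lambda}^{(c)}_{abj'} - \frac{1}{C} \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj} \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj'} \right]. \tag{15}$$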
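The estimators in equations (9), (11), (14), and (15) can be collected into one routine for a single substitution type. The following Python sketch assumes the summed counts and summed dwell proportions are supplied per sampled history and per branch; the function name and the toy input values are illustrative, not taken from the analysis.

```python
def substitution_length_summary(counts, dwell):
    """Mean substitution lengths and their variance-covariance matrix.

    counts[c][j] is the summed count N_abj for sampled history c, and
    dwell[c][j] is the summed dwell proportion phi_aj for that history.
    """
    C, B = len(counts), len(counts[0])
    # Equation (9): per-history maximum-likelihood estimates.
    lam = [[counts[c][j] / dwell[c][j] for j in range(B)] for c in range(C)]
    # Equation (11): average over the C sampled histories.
    mean = [sum(lam[c][j] for c in range(C)) / C for j in range(B)]
    cov = [[0.0] * B for _ in range(B)]
    for j in range(B):
        s_j = sum(lam[c][j] for c in range(C))
        for k in range(B):
            s_k = sum(lam[c][k] for c in range(C))
            s_jk = sum(lam[c][j] * lam[c][k] for c in range(C))
            # Equation (15); for j == k this is the sample variance.
            cov[j][k] = (s_jk - s_j * s_k / C) / (C - 1)
        # Equation (14): add the average inverse Fisher information.
        cov[j][j] += sum(lam[c][j] / dwell[c][j] for c in range(C)) / C
    return mean, cov


counts = [[2, 4], [4, 4]]          # two sampled histories, two branches
dwell = [[2.0, 4.0], [2.0, 2.0]]
mean, cov = substitution_length_summary(counts, dwell)
print(mean)
print(cov)
```

With this toy input the per-history estimates are 1 and 2 on both branches, so each mean is 1.5, the between-history covariance is 0.5, and the diagonal entries gain the averaged inverse Fisher information.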
Note that some sources of uncertainty are unfortunately
not included in the estimates. For instance, we do not
include uncertainty due to using substitution histories
that are not generated from a context-dependent model.
Divergence Time Estimation
Multidivtime is a Bayesian MCMC program for estimating
divergence times on a known rooted phylogeny with a relaxed autocorrelated clock model (Thorne et al. 1998; Kishino
et al. 2001; Thorne and Kishino 2002). Data sets consisting of
multiple genes can be analyzed in Multidivtime by assuming a
common set of divergence times but allowing independent
rate trajectories for each gene. Multidivtime takes a two-step
procedure to estimate species divergence times from multigene data sets. First, it estimates branch lengths through maximum likelihood and uses the curvature of the log-likelihood
surface to estimate a variance–covariance matrix between the
branch length estimates for each gene. Second, it adopts an
MCMC procedure to sample divergence times and rates by
approximating the likelihood surface with a multivariate
normal distribution for each gene, which is determined by
the branch length estimates and the variance–covariance
matrix obtained in the first step.
In this study, we use Multidivtime for data sets where
“substitution lengths” vary among substitution types rather
than data sets where branch lengths vary among genes. By
sampling the substitution histories from PhyloBayes-MPI, the
substitution lengths and the associated variance–covariance
matrix can be estimated as described above for each kind of
nucleotide substitution. Different kinds of nucleotide substitutions are treated by Multidivtime in the same way as it
treats different genes that share the same set of divergence
times. Just as Multidivtime allows the rate trajectories of different genes to independently vary over the phylogeny and
just as it allows some genes to change rate in a more clock-like
fashion than others, our analyses with Multidivtime have different substitution types change rate independently and
allow some substitution types to be more clock-like than
others.
A weakness of Multidivtime that exists for multigene analyses is that it ignores the possibility of correlated rate changes
among genes. Likewise, a weakness of our analyses of changing
substitution rates over time is that correlated changes in rate
among substitution types are biologically plausible but
Multidivtime assumes the rates change independently. An
additional shortcoming of using Multidivtime for studying
substitution rate change is the treatment of variance and
covariance structure of estimated substitution lengths.
Although Multidivtime can account for covariances in estimation error of substitution lengths among branches for each
substitution type, its current implementation assumes no
covariance in estimation error among substitution types.
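The normal approximation described above can be sketched as follows. This is an illustration under stated assumptions rather than Multidivtime code: the caller is assumed to supply, for each substitution type, the substitution lengths predicted by the current rates and times, the estimated lengths, and their variance–covariance matrix. Function names are hypothetical, additive constants are dropped, and types are summed as independent.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def approx_log_likelihood(predicted, estimated, cov):
    """Multivariate normal log-density, up to an additive constant."""
    diff = [p - e for p, e in zip(predicted, estimated)]
    weighted = solve(cov, diff)  # cov^{-1} (predicted - estimated)
    return -0.5 * sum(d * w for d, w in zip(diff, weighted))


def joint_log_likelihood(per_type):
    """Sum over substitution types, which are treated as independent."""
    return sum(approx_log_likelihood(p, e, c) for p, e, c in per_type)


type_a = ([1.0, -1.0], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
type_b = ([2.0, 2.0], [1.0, 1.0], [[2.0, 0.0], [0.0, 0.5]])
print(joint_log_likelihood([type_a, type_b]))
```

Because the per-type terms are simply added, any covariance in estimation error among substitution types is ignored, which mirrors the limitation noted in the text.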
Results
As described in the Materials and Methods section, the DNA
sequences for all divergence time analyses consisted of approximately 0.15 Mb from nine primates, one of which was
treated as the outgroup. The assumed topology is shown in
figure 2.
[Figure 2 image. Tip labels: human, orangutan, gibbon, rhesus, baboon, greenMonkey, marmoset, squirrelMonkey (eight ingroup taxa) and bushbaby (one outgroup taxon). Time axis: 70.0 to 0.0 Myr.]
FIG. 2. Phylogenetic tree for nine primate species used in the analyses. There are eight ingroup taxa with an outgroup species (bushbaby) to root the
tree. Nodes are labeled from 0 to 14. The depicted divergence times are obtained from TimeTree (Hedges et al. 2006; Kumar and Hedges 2011) and
indicated by the time line with time units of one million years.
FIG. 3. Normalized root-to-tip substitution lengths for each type of context-independent single nucleotide substitution. Types are defined in table 1.
Root-to-tip substitution lengths are normalized so that within-type average substitution length is 1. The normalized root-to-tip substitution lengths are
labeled with different colors for each tip. The average root-to-tip substitution length for each type before normalization is reported as well as the
variance after normalization.
One difficulty with divergence time studies is that the
truth tends to be unknown. We decided to compare the
results of our analyses with the divergence time estimates
reported by the TimeTree database (Hedges et al. 2006;
Kumar and Hedges 2011). The divergence times reported
by the TimeTree database should certainly not be considered
true, but we expect them to be comparatively reliable because
the estimates emerge from studies that surpass ours in terms
of the inclusion and treatment of fossil evidence as well as
number of taxa.
Substitution Lengths
We investigate the degree of deviation from clock-like behavior of each type of substitution by first checking the substitution lengths of each type. If the rate of a substitution type is
constant over time, then the root-to-tip substitution lengths
for that type are expected to be similar for all lineages. The
substitution lengths for context-independent and context-dependent substitution types were estimated from a
sample of substitution histories from PhyloBayes-MPI.
The root-to-tip substitution lengths for all lineages were computed and normalized so that the average root-to-tip
substitution length within the same substitution type is 1.
The normalized root-to-tip substitution lengths are summarized in figures 3 and 4. These figures were inspired by corresponding figures from Hwang and Green (2004).
Among all lineages, the root-to-marmoset and root-to-squirrel monkey substitution lengths are always the greatest,
regardless of substitution type. Also, the root-to-human substitution lengths tend to be among the smallest. Species
within the same clade share branches in their root-to-tip
paths, and therefore their root-to-tip substitution lengths
are correlated. In addition, these species with similar root-to-tip
substitution lengths have similar generation times and
body sizes. This observation is consistent with the
generation-time effect such that rates of lineages with relatively shorter generation times are elevated. Even though this
phenomenon is relatively minor for CpG transitions (Type 9),
it is consistent with some transitions affecting CpG sites not
being associated with methylation (e.g., replication errors).
The average root-to-tip substitution lengths for each substitution type before normalization are related to the rate of
each type of substitution. For context-independent substitutions, transition rates (Type 5 and Type 6) are higher than
transversion rates (Type 1–Type 4), and Type 5 (G → A
and C → T) is the highest of all (see fig. 3). This observation is
consistent with the hypothesis that mutation is biased
toward A + T content (Sueoka 1988, 1992). For context-dependent substitutions, Type 9 (CpG transitions) has the
highest rate, whereas Type 3 (non-CpG T → A and A → T)
and Type 4 (non-CpG T → G and A → C) have the lowest
rates (see fig. 4). Moreover, CpG sites are substitution hotspots so that the rates at CpG sites are accelerated. The CpG
transition rate (Type 9) is much higher than non-CpG transition rates (Type 5 and Type 6). Furthermore, CpG transversion rates (Type 7 and Type 8) are also much higher than non-CpG transversion rates (Type 1 and Type 2). Kong et al. (2012)
suggested that the high CpG transversion rate stems not only
from hypermutable CpG sites but also from mutational bias
favoring changes that decrease G + C content. In fact, our
estimated CpG transversion rates are comparable to non-CpG transition rates.
FIG. 4. Normalized root-to-tip substitution lengths for each type of context-dependent substitution. Types are defined in table 2. Root-to-tip substitution lengths are normalized so that within-type average substitution length is 1. The normalized root-to-tip substitution lengths are indicated with
different colors for different tips. The average root-to-tip substitution length for each type before normalization is reported as well as the variance after
normalization.
The spread of the root-to-tip substitution lengths (quantified by the variance after normalization)
reflects the degree of deviation from clock-like behavior. Among all context-dependent
substitution types, Type 9 (CpG transitions) has the smallest variance after normalization,
suggesting that it is the most clock-like (see fig. 4). Type 8 (CpG G → T and C → A) has the
second smallest variance after normalization. Despite these patterns, one should not conclude
that CpG sites are more clock-like than non-CpG sites, because the other transversion type
for CpG sites, Type 7 (CpG G → C and C → G), has a large normalized variance for the
root-to-tip substitution lengths. In addition, Type 1 (non-CpG G → C and C → G) and Type 6
(non-CpG A → G and T → C) have relatively small variances, meaning that they are more
clock-like than other types. In fact, Type 1 (G → C and C → G) and Type 6 (A → G and T → C)
are also the most clock-like among all context-independent single-nucleotide substitutions.
This result does not agree with the observations of Kim et al. (2006) that transversions
exhibit less generation-time effect and are more clock-like than transitions.
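The clock-likeness summary used above can be illustrated with a minimal sketch. It rescales a set of root-to-tip substitution lengths so that their average is 1 and then reports the variance of the rescaled values. The function name, the use of the population variance, and the two example data sets are assumptions made for illustration only; they are not values or code from this study.

```python
from statistics import mean, pvariance


def normalized_variance(root_to_tip):
    """Return (average length, variance after scaling the average to 1)."""
    avg = mean(root_to_tip)
    normalized = [x / avg for x in root_to_tip]
    # Population variance is an assumption; the sample variance could be used instead.
    return avg, pvariance(normalized)


# Hypothetical root-to-tip substitution lengths, one value per tip.
clock_like = [0.050, 0.052, 0.049, 0.051]
less_clock_like = [0.010, 0.018, 0.007, 0.013]

avg_a, var_a = normalized_variance(clock_like)
avg_b, var_b = normalized_variance(less_clock_like)
print(f"clock-like:      mean={avg_a:.4f}, normalized variance={var_a:.4f}")
print(f"less clock-like: mean={avg_b:.4f}, normalized variance={var_b:.4f}")
```

Because the lengths are divided by their own average, substitution types with very different absolute rates can be compared on the same scale; a smaller normalized variance corresponds to more clock-like behavior.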
Divergence Time and Substitution Rate Estimates
We performed a variety of divergence time analyses with the
primate data and the Multidivtime software. For each of the
six context-independent substitution types and for each of
the nine context-dependent types, we inferred divergence
times using only that type. We also estimated times when
the six substitution types were jointly considered with each
type having its own relaxed clock. Likewise, a joint analysis of
the nine context-dependent types was performed. To constrain the time to the value suggested by the TimeTree database, all analyses adopted a gamma prior with mean 44.2 My
and standard deviation 0.1 My for the ingroup root.
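To see how tight this root prior is, the mean and standard deviation can be converted to the shape and rate of a gamma distribution by moment matching. The sketch below is a small illustration of that arithmetic, assuming the usual shape/rate parameterization (mean = shape/rate, variance = shape/rate^2); the function name is hypothetical.

```python
def gamma_shape_rate(mean, sd):
    """Moment-match a gamma distribution: mean = shape/rate, var = shape/rate**2."""
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate


# Ingroup root prior from the text: mean 44.2 My, standard deviation 0.1 My.
shape, rate = gamma_shape_rate(44.2, 0.1)
coefficient_of_variation = 0.1 / 44.2
print(f"shape = {shape:.0f}, rate = {rate:.0f} per My")
print(f"coefficient of variation = {coefficient_of_variation:.4f}")
```

The very large shape value reflects a prior that is nearly a point mass at 44.2 My, which is why the root time behaves as an essentially fixed calibration.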
Divergence time estimates and 95% credible intervals
for the other ingroup internal nodes are listed in table 3
(context-independent substitutions) and table 4 (context-dependent substitutions). Because the ingroup root time
had a tight prior, the estimated times for it are not reported.
Except for the external information represented by the tight
prior on the ingroup root time, no other calibration information was employed in our analyses.
Substitution types that are relatively clock-like are expected to give better divergence time
estimates because rates on different branches will be highly correlated and therefore
information about rates can be shared across branches. This prediction is consistent with the
divergence time estimates obtained from CpG transitions alone, which, judged against the
divergence times reported by TimeTree, outperform most other analyses in terms of precision
and accuracy. On the other hand, substitution types that occur less frequently (e.g., CpG
transversions) have greater uncertainty associated with the estimated substitution lengths,
and thus their divergence time estimates have greater uncertainty as well.
The divergence time estimates from the joint analysis for context-dependent substitutions are
dominated by less clock-like non-CpG substitutions, because non-CpG sites represent about 99%
of the data. If the proportion of clock-like substitution types in the sequence alignment
were higher, the divergence time estimates from this joint analysis would presumably improve.
We also performed analyses with the BEAST software (see table 3). However, BEAST and
Multidivtime implement quite different treatments for the prior distributions of rates and
times, and therefore the differences in the resulting inferences could have a variety of
causes. The BEAST analysis with a strict clock results in tight 95% credible intervals for all
times, but the posterior means depart from the times according to TimeTree (Hedges et al.
2006; Kumar and Hedges 2011). The posterior means of divergence times from the BEAST
lognormal uncorrelated relaxed clock model analysis are similar to those from the strict clock
analysis, but allowing rate variation among branches makes the 95% credible intervals wider.
On the other hand, the posterior means of divergence times with a strict clock from BEAST and
from Multidivtime are similar. This is not the case when an autocorrelated lognormal relaxed
clock is considered (“Shared Relaxed Clock” in table 3).
Table 3. Divergence Time Estimates for Single Nucleotide Substitutions.
Node (Time)             8 (18.8)            9 (8.8)             10 (11.5)           11 (15.1)           12 (18.8)           13 (29.6)
Multidivtime analyses
  Prior                 21.9 (1.1, 43.1)    11.0 (0.4, 31.4)    22.1 (4.0, 40.1)    11.0 (0.3, 31.2)    22.0 (4.1, 40.1)    33.1 (12.3, 43.8)
  Type 1                20.2 (16.4, 24.5)   6.2 (4.6, 8.3)      8.9 (6.9, 11.6)     17.5 (14.8, 20.9)   19.4 (16.7, 23.0)   29.9 (27.2, 33.1)
  Type 2                19.0 (15.0, 24.6)   6.2 (3.8, 10.9)     9.1 (6.0, 15.2)     17.1 (12.9, 23.4)   18.3 (13.8, 24.8)   29.6 (25.6, 34.3)
  Type 3                18.8 (15.4, 23.1)   6.6 (4.5, 10.2)     10.0 (7.2, 14.7)    16.6 (12.8, 21.7)   18.7 (14.6, 23.9)   29.8 (26.1, 33.8)
  Type 4                21.9 (17.1, 28.6)   5.8 (4.1, 8.5)      10.0 (7.4, 13.9)    18.1 (14.2, 23.3)   20.3 (16.2, 25.7)   31.1 (27.6, 35.0)
  Type 5                25.2 (20.0, 31.6)   8.1 (6.6, 10.1)     12.3 (10.3, 14.9)   17.0 (14.6, 19.9)   18.6 (16.1, 21.5)   31.0 (28.4, 33.8)
  Type 6                25.1 (20.0, 31.8)   6.7 (5.5, 8.4)      11.0 (9.1, 13.3)    18.8 (16.3, 21.7)   20.8 (18.2, 23.8)   31.5 (29.0, 34.1)
  Joint relaxed clocks  19.8 (18.0, 21.7)   6.2 (5.5, 6.9)      9.7 (8.8, 10.8)     17.7 (16.4, 19.1)   19.5 (18.1, 20.9)   30.2 (29.0, 31.4)
  Strict clock          22.8 (22.3, 23.3)   5.8 (5.5, 6.0)      9.1 (8.8, 9.4)      14.1 (13.8, 14.4)   15.9 (15.5, 16.2)   26.3 (25.8, 26.7)
  Shared relaxed clock  17.7 (15.8, 20.1)   6.4 (5.0, 8.5)      10.0 (8.0, 13.0)    19.3 (16.5, 22.8)   21.4 (18.4, 24.9)   31.0 (28.5, 33.7)
BEAST analyses
  Prior                 13.5 (0.0, 37.2)    5.6 (0.0, 17.7)     14.1 (0.2, 32.2)    5.8 (0.0, 18.2)     14.4 (0.4, 32.3)    27.9 (10.1, 44.2)
  Clock                 23.4 (22.9, 23.9)   5.4 (5.2, 5.6)      8.5 (8.2, 8.8)      13.1 (12.8, 13.5)   14.7 (14.4, 15.1)   24.4 (23.9, 24.8)
  UCLD                  23.2 (18.3, 27.9)   5.6 (4.3, 7.0)      8.8 (7.2, 10.6)     13.9 (10.9, 16.8)   15.4 (12.3, 18.4)   24.4 (21.0, 28.0)
NOTE.—Posterior means and 95% credible intervals of divergence times in units of millions of years. The nodes are labeled as in figure 2 and the divergence times in parentheses
next to the node labels are obtained from TimeTree (Hedges et al. 2006; Kumar and Hedges 2011). The types of the single nucleotide substitutions are defined in table 1. Rows for
types 1–6 are the results from Multidivtime where each type of single nucleotide substitution was analyzed separately. The “Joint Relaxed Clocks” row has the results from
Multidivtime with all substitution types analyzed jointly, each type having its own relaxed clock. The “Strict Clock” and “Shared Relaxed Clock” rows represent conventional analyses
with an HKY model plus discrete-gamma heterogeneity with four categories in Multidivtime. The last three rows are the results from BEAST (with a strict clock and with an
uncorrelated lognormal relaxed clock using a GTR model plus discrete-gamma heterogeneity with four categories) when substitution types were not allowed to have different
clocks.
Table 4. Divergence Time Estimates for Context-Dependent Substitutions.
Node (Time)             8 (18.8)            9 (8.8)             10 (11.5)           11 (15.1)           12 (18.8)           13 (29.6)
Prior                   21.9 (1.1, 43.1)    11.0 (0.4, 31.4)    22.1 (4.0, 40.1)    11.0 (0.3, 31.2)    22.0 (4.1, 40.1)    33.1 (12.3, 43.8)
Type 1                  20.5 (16.6, 25.1)   6.2 (4.6, 8.5)      9.1 (7.1, 12.0)     17.8 (15.1, 21.4)   19.8 (17.0, 23.5)   30.5 (27.6, 33.7)
Type 2                  18.8 (14.9, 24.4)   6.3 (3.9, 10.7)     9.5 (6.3, 15.3)     17.6 (13.4, 23.5)   18.9 (14.4, 24.8)   30.2 (26.2, 34.64)
Type 3                  18.7 (15.4, 23.0)   6.6 (4.5, 9.9)      10.0 (7.2, 14.5)    16.6 (12.8, 21.7)   18.7 (14.6, 23.9)   29.8 (26.2, 33.7)
Type 4                  22.0 (17.1, 28.4)   5.8 (4.1, 8.5)      10.0 (7.4, 13.7)    18.0 (14.1, 23.0)   20.2 (16.2, 25.3)   31.1 (27.6, 34.8)
Type 5                  26.1 (19.9, 33.9)   7.8 (6.2, 10.4)     12.0 (9.8, 15.1)    16.9 (14.2, 20.3)   18.4 (15.6, 21.9)   30.9 (28.1, 34.1)
Type 6                  25.1 (20.0, 31.6)   6.7 (5.5, 8.4)      11.0 (9.1, 13.3)    18.8 (16.3, 21.7)   20.8 (18.2, 23.9)   31.5 (29.0, 34.1)
Type 7                  16.5 (4.1, 33.8)    6.0 (0.8, 16.0)     8.8 (2.4, 21.6)     14.2 (5.1, 27.0)    16.1 (6.8, 29.5)    24.3 (12.8, 39.7)
Type 8                  16.9 (5.7, 31.1)    5.0 (0.8, 13.5)     6.6 (1.7, 16.9)     11.0 (3.2, 21.8)    12.9 (5.1, 24.7)    22.9 (12.0, 37.2)
Type 9                  20.7 (16.8, 28.0)   8.6 (6.0, 12.7)     13.0 (9.5, 18.3)    16.9 (13.1, 21.2)   18.9 (15.0, 23.3)   31.1 (27.0, 36.1)
Joint Relaxed Clocks    19.7 (18.1, 21.5)   6.2 (5.6, 6.9)      9.7 (8.8, 10.6)     17.5 (16.3, 18.7)   19.2 (18.0, 20.6)   30.1 (29.0, 31.2)
NOTE.—Posterior means and 95% credible intervals of divergence times in units of millions of years. The nodes are labeled as in figure 2 and the divergence times in parentheses
next to the node labels are obtained from TimeTree (Hedges et al. 2006; Kumar and Hedges 2011) and listed in table 3. The types of the context-dependent substitutions are
defined in table 2. Rows for types 1–9 are the results from Multidivtime where each type of context-dependent substitution was analyzed separately. The last row has results
from Multidivtime with context-dependent substitutions treated jointly, each type having its own relaxed clock.
Both the conventional analyses, which do not assign separate clocks to separate substitution
types, and the analyses that combined all substitution types but let each have its own
relaxed clock and its own independent rate trajectory show a disquieting tendency to produce
narrow credible intervals that do not include the divergence times reported by TimeTree. This
arises for analyses performed with Multidivtime and with BEAST. We do not attribute these
issues to the estimates from TimeTree. It would probably be desirable to allow different
substitution types to have different relaxed clocks but to permit these relaxed clocks to
change in a correlated way. However, it is unclear what the primary cause of the misleading
and overly narrow credible intervals is.
In order to exploit the best available estimates of divergence times so that rate variation
over time among substitution types could be examined, we performed additional analyses where
tight constraints were placed around the divergence times reported by TimeTree for all
internal nodes. Using these tight constraints, we reestimated the divergence times and
substitution rates from Multidivtime by separately analyzing each type of context-dependent
substitution. The averages among nodes of the posterior means of chronological substitution
rates and the averages among nodes of the posterior standard deviations of substitution rates
are reported in table 5. The chronological substitution rates for CpG transitions (Type 9)
are much higher than rates for other types of substitutions. In fact, the magnitudes of
chronological substitution rates for each type of context-dependent substitution estimated by
Multidivtime closely correspond to the average root-to-tip substitution lengths reported in
figure 4. To better visualize the pattern of rate change over time, the inferred substitution
rates for each node and each substitution type were normalized so that the within-type average
among nodes is 1. Doing this shows that the inferred rates among nodes are the most constant
for CpG transitions (fig. 5; see also supplementary fig. S1, Supplementary Material online,
for the corresponding plot of context-independent substitutions). Figure 5 is consistent with
the variance of the normalized root-to-tip substitution lengths in figure 4. Because CpG sites
are rare in the sequence alignment (around 1%) and the rates of transversions are relatively
low compared with transitions, the inferred normalized substitution rates for Type 7 and
Type 8 are associated with large uncertainty.
Table 5. Average Estimated Substitution Rates among Nodes for Each Type of Context-Dependent Substitution.
Type    Mean    SD
1       2.0     0.3
2       2.7     0.5
3       1.6     0.2
4       1.6     0.3
5       7.8     1.1
6       5.9     1.0
7       3.8     1.6
8       5.3     2.3
9       51.3    6.9
NOTE.—The posterior distributions of substitution rate per site per 10^10 years were
estimated with Multidivtime for each type of context-dependent substitution for the
case where all node times are tightly constrained to the values reported by
the TimeTree database. The average among nodes of the posterior mean rate
and the average among nodes of the posterior standard deviation of the rate are
reported for each type of context-dependent substitution.
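The relative uncertainty of the rate estimates can be read off table 5 by dividing each average posterior standard deviation by the corresponding average posterior mean. The short sketch below does this arithmetic with the values listed in table 5; the variable names and the choice of SD/mean as a summary are assumptions made for illustration, not part of the original analysis.

```python
# (mean, SD) per substitution type, copied from table 5
# (rates per site per 10^10 years).
table5 = {
    1: (2.0, 0.3), 2: (2.7, 0.5), 3: (1.6, 0.2),
    4: (1.6, 0.3), 5: (7.8, 1.1), 6: (5.9, 1.0),
    7: (3.8, 1.6), 8: (5.3, 2.3), 9: (51.3, 6.9),
}

# Relative uncertainty: posterior SD divided by posterior mean.
relative_sd = {t: sd / m for t, (m, sd) in table5.items()}

# Types ordered from most to least uncertain.
ranked = sorted(relative_sd, key=relative_sd.get, reverse=True)

for t in ranked:
    print(f"Type {t}: SD/mean = {relative_sd[t]:.2f}")

# How much faster CpG transitions (Type 9) are than non-CpG transitions.
print(f"Type 9 / Type 5 = {table5[9][0] / table5[5][0]:.1f}")
print(f"Type 9 / Type 6 = {table5[9][0] / table5[6][0]:.1f}")
```

Under this summary the two CpG transversion types (Type 7 and Type 8) stand apart from the rest, in line with the large uncertainty noted for them in the text.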
Simulation
To evaluate our method, especially the assumption that the
distribution of substitution histories is robust to substitution
model specification, we simulated ten data sets with a program that we wrote (details in Materials and Methods section). Each simulated data set used the observed human
sequence as the root and the same tree topology as assumed
for the real data. Sequences were simulated with the estimated substitution lengths for each context-dependent substitution type on each branch. The simulated data sets have
about the same proportions of CpG, non-CpG C + G, and
non-CpG A + T sites as the actual data. Simulated data
were analyzed in the same way as the observed data.
To compare simulated and actual data, we estimated
branch lengths from a sample of substitution histories by
counting the number of substitutions on each branch. The
branch lengths estimated from the actual and simulated data
are similar and are listed in table 6. We also used the PAML
software (Yang 2007) to estimate branch lengths. Those estimates are similar to our estimates from sampling substitution
histories (results not shown).
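The counting estimate of branch lengths described above can be sketched as follows. Each sampled substitution history is summarized by the number of substitutions it places on each branch, and the branch length is taken as the average count per site across the sample. The data layout, the function name, and the example counts are hypothetical; this is an illustration of the idea rather than the program used in the study.

```python
def branch_lengths_from_histories(histories, n_sites):
    """Average substitutions per site on each branch across sampled histories.

    histories: list of samples, each a list with one substitution count per branch.
    n_sites:   number of alignment positions contributing to the counts.
    """
    n_branches = len(histories[0])
    totals = [0] * n_branches
    for counts in histories:
        for branch, k in enumerate(counts):
            totals[branch] += k
    return [total / (len(histories) * n_sites) for total in totals]


# Hypothetical counts for three sampled histories on a three-branch tree.
histories = [
    [18, 21, 9],
    [19, 20, 8],
    [17, 19, 10],
]
lengths = branch_lengths_from_histories(histories, n_sites=1000)
print(lengths)
```

Averaging over many sampled histories smooths out the variation among individual mappings, which is why a sizeable sample of histories per site is useful.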
Root-to-tip substitution lengths for each context-dependent substitution type are compared between the actual and
an arbitrarily selected simulated data set in figure 6. The substitution lengths are generally consistent between the real and
the simulated data, except that the root-to-tip substitution
lengths for CpG transitions from the simulated data are
slightly less than those from the actual data. As the CpG
transition rate is high, these changes may tend to occur earlier
on branches than would be indicated by a sample of substitution histories obtained with a model that did not permit
special treatment of CpG transitions. This could lead to CpG
transition substitution lengths being underestimated. In general, the simulations suggest that substitution histories are
relatively insensitive to misspecification of the substitution
model.
Discussion
Endpoint-conditioned sampling of substitution histories has
received increasing attention in statistical and molecular evolutionary literature (Nielsen 2002; Hobolth and Jensen 2005,
2011; Rodrigue et al. 2005, 2006, 2008; Lartillot 2006; Mateiu
and Rannala 2006; Hobolth 2008; Minin and Suchard 2008;
Hobolth and Stone 2009; Tataru and Hobolth 2011). This
approach can be viewed as a data augmentation technique
and has been utilized to improve MCMC mixing for sampling
from the full posterior distribution. Alternatively, sampling
from the posterior distribution of substitution histories can
sometimes be employed to obtain maximum-likelihood estimates (e.g., Rodrigue et al. 2007).
Our somewhat crude approximation is simply to sample mappings $M^{(1)}, \ldots, M^{(C)}$ with
a context-independent model and to then assume that these histories are actually sampled
from the context-dependent distribution $\Pr(M \mid D; \lambda_{ab})$. We then estimate
$\lambda_{abj}$ as $(1/C) \sum_{c=1}^{C} N_{abj}^{(c)} / a_j^{(c)}$. Although this estimator
may be less promising than $(\sum_c N_{abj}^{(c)}) / (\sum_c a_j^{(c)})$, the sampling
distribution of our estimator is asymptotically normal with its variance approximated by the
law of total variance. Our approach also has the advantages of computational feasibility and
ease of implementation. In addition, our approach allows other types of context-dependent
rates not included in this study to be investigated.
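The two estimators contrasted above, the average of per-history ratios and the ratio of sums, can be compared with a small sketch. Each sampled history contributes a substitution count and a summed dwell proportion for one substitution type; the numbers below are hypothetical and the function names are introduced only for this illustration.

```python
def mean_of_ratios(counts, dwell):
    """Average over histories of (count / dwell proportion)."""
    return sum(n / a for n, a in zip(counts, dwell)) / len(counts)


def ratio_of_sums(counts, dwell):
    """Total count across histories divided by total dwell proportion."""
    return sum(counts) / sum(dwell)


# Hypothetical values for one substitution type across four sampled histories.
counts = [12, 15, 9, 14]
dwell = [0.40, 0.50, 0.30, 0.45]

estimate_a = mean_of_ratios(counts, dwell)
estimate_b = ratio_of_sums(counts, dwell)
print(f"mean of ratios: {estimate_a:.3f}")
print(f"ratio of sums:  {estimate_b:.3f}")
```

When the dwell proportions vary little among sampled histories, as in this example, the two estimators give nearly the same value; they diverge more when some histories have unusually small dwell proportions.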
A shortcoming of our approach is that the independentsites model for sampling substitution histories differs from the
context-dependent substitution model for which rates are
estimated from the substitution histories. For the primate
data analyzed here, this shortcoming does not appear to
have serious consequences (e.g., see table 6 and fig. 6).
However, problems would be more severe if branches were
longer so that sampled substitution histories would be more
sensitive to the model employed for sampling substitution
histories.
A future direction that might overcome the shortcoming would be to apply importance sampling
(e.g., Liu 2001) to the substitution histories. The idea is that some of the sampled
substitution histories would become relatively more likely and some would become relatively
less likely when evaluated by the context-dependent substitution rates that we have
estimated. By evaluating the ratio of the probability density of each sampled substitution
history according to the context-dependent rates versus according to the independent-sites
model used for sampling, an importance weight could be assigned to each history in the
sample. The importance weights associated with each sampled history could then be employed to
derive new estimates of the summed dwell proportions $a_j$ and the substitution counts
$N_{abj}$. These would lead to new values for the inferred context-dependent substitution
rates. The iterative process of reweighting substitution histories and then estimating new
context-dependent substitution rates could continue until the rate estimates converged.
An advantage of this envisioned importance sampling procedure is that a computationally
feasible independent-sites model could be employed for obtaining a single sample of
substitution histories. This would presumably be more computationally tractable than
implementing a context-dependent model directly into an MCMC procedure that lets the clocks
of different substitution types relax to different degrees. Although importance sampling
tends to be less successful when the distribution used for sampling differs enough from the
distribution of interest as to make all or almost all samples particularly improbable
according to the distribution of interest, the envisioned approach could be configured to
exploit the fact that the substitution history of a sequence consists of the substitution
histories of its sites. In other applications, sampling substitution histories of individual
sites from independent-sites models has been successfully employed for inference with
dependent-sites models (reviewed in Hobolth and Thorne 2014).
In accord with earlier studies, our results indicate that CpG transitions occurred at a higher
rate and are more clock-like than other types of substitutions in primates. Based on the
TimeTree values, the CpG transition analysis performs relatively well in estimating the
primate divergence times compared with other analyses. In addition, the degree to which other
substitution types change rates over time can be diverse. This suggests that divergence time
analyses should not assume that all substitution types change rates over time in the same way.
Our strategy is to define different substitution types and then to assume that each
substitution type has a relaxed clock that changes rate independently of all other
substitution types. Based on our “joint relaxed clocks” result for context-dependent
substitutions (see table 4), we suspect this would be a poor strategy if the selected number
of substitution types is too large. There are parallel issues in multilocus divergence time
analyses. Some factors affecting change in evolutionary rates are lineage-specific whereas
others are gene-specific. Ideally, relaxed clock analyses would accommodate both
gene-specific and lineage-specific tendencies to change rates. Failure to consider both
gene-specific and lineage-specific tendencies may yield multilocus divergence time estimates
that are overly precise (Thorne and Kishino 2002; Zhu et al. 2014). Similarly, relaxed clocks
of different substitution types may change in a correlated way due to lineage-specific
factors. Although our technique has the desirable feature of letting substitution types vary
rates over time in different ways, a presumably better treatment would facilitate correlated
rate changes over time among substitution types.
FIG. 5. Estimated substitution rates at nodes for each type of context-dependent substitution with node times tightly constrained to values reported by
TimeTree. Rates were normalized so that within-type average among nodes is 1. Vertical bars indicate 95% credible intervals.
(Panels: Type 1 to Type 9. Horizontal axis: Node. Vertical axis: Normalized substitution rate at node.)
Table 6. Comparison of the Branch Lengths from the Actual and the Average of Ten Simulated Data Sets.
Branch      0       1       2       3       4       5       6
Actual      0.0181  0.0199  0.0226  0.0086  0.0088  0.0142  0.0379
Simulated   0.0186  0.0202  0.0230  0.0088  0.0089  0.0145  0.0396
SD          0.0005  0.0005  0.0005  0.0003  0.0002  0.0007  0.0008
Branch      7       8       9       10      11      12      13
Actual      0.0371  0.0461  0.0049  0.0297  0.0020  0.0122  0.0183
Simulated   0.0383  0.0482  0.0050  0.0312  0.0027  0.0131  0.0185
SD          0.0006  0.0016  0.0002  0.0016  0.0021  0.0018  0.0011
NOTE.—The branches are labeled in figure 2. The branch lengths are estimated from
a sample of substitution histories by counting the number of substitutions for each
branch.
FIG. 6. Comparison between real and simulated data of context-dependent substitution lengths. Substitution types are defined in table 2 and labeled
with different symbols. Each point represents the log root-to-tip substitution length in the real and the simulated data set for a particular substitution
type and a particular tip.
(Horizontal axis: log root-to-tip substitution lengths from the real data. Vertical axis: log root-to-tip substitution lengths from the simulated data.)
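The importance-sampling reweighting envisioned earlier in this Discussion can be sketched in a few lines. The sketch assumes that, for each sampled history, the log probability density under the context-dependent rates and under the independent-sites sampling model is already available; how those densities would be computed is not addressed here. All function names and numbers are hypothetical, and the effective sample size is included only as a common diagnostic for weight degeneracy, not as part of the proposal in the text.

```python
import math


def importance_weights(log_target, log_sampling):
    """Self-normalized weights from log densities, computed stably in log space."""
    log_w = [t - s for t, s in zip(log_target, log_sampling)]
    largest = max(log_w)
    unnormalized = [math.exp(x - largest) for x in log_w]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def reweighted_rate(weights, counts, dwell):
    """Weighted substitution count divided by weighted dwell proportion."""
    numerator = sum(w * n for w, n in zip(weights, counts))
    denominator = sum(w * a for w, a in zip(weights, dwell))
    return numerator / denominator


def effective_sample_size(weights):
    """Small values signal that a few histories dominate the sample."""
    return 1.0 / sum(w * w for w in weights)


# Hypothetical log densities and per-history summaries for three histories.
log_target = [-10.0, -11.0, -9.5]
log_sampling = [-10.0, -10.0, -10.0]
counts = [12, 15, 9]
dwell = [0.40, 0.50, 0.30]

weights = importance_weights(log_target, log_sampling)
print("weights:", [round(w, 3) for w in weights])
print("reweighted rate:", round(reweighted_rate(weights, counts, dwell), 3))
print("effective sample size:", round(effective_sample_size(weights), 2))
```

In an iterative scheme, the reweighted rates would be used to recompute the target densities, and the two steps would repeat until the rate estimates stop changing.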
Materials and Methods
Data
We explored the clock-like behavior of CpG transitions using a
subset of data previously studied by Kim et al. (2006). We
intentionally concentrated on genomic data with little evidence for important biological function so that rate variation
over time among substitution types was less apt to be
influenced by natural selection. We obtained sequence
data of humans and eight other primate species for
a region that is orthologous to human Chromosome 16
(hg19.chr16:60842337-61130970) according to the Multiz
alignment of 100 vertebrates in the UCSC (University of
California, Santa Cruz) genome browser (Kent et al. 2002;
Blanchette et al. 2004).
Throughout, the nine species of primates analyzed here are
assumed related through the phylogenetic tree in figure 2
with bushbaby as the outgroup species. These nine species
yield a phylogeny for which branches are long enough to yield
substantial information about substitution lengths, but short
enough to lessen the chance of multiple substitutions per site
and avoid serious concerns about alignment uncertainty.
CpG islands are often free of methylation (Bird 1986) and
not hypermutable. Therefore, the substitution process in
these regions differs from that at CpG dinucleotides in the rest
of the genome. For this reason, CpG islands were removed
from the analysis. Removed stretches had GC content of 50%
or greater along 200 or more base pairs with an observed/
expected CpG content greater than 0.6. Similarly, positions in
the exons and repetitive elements identified by RepeatMasker
(Smit et al. 1996–2010) were excluded from the analyses. The
final data set contains approximately 115 kb for each species
and the proportion of CpG sites is around 1%.
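The CpG island criteria stated above (GC content of 50% or greater, 200 or more base pairs, observed/expected CpG greater than 0.6) can be checked for a candidate stretch with a sketch like the one below. The observed/expected ratio is computed here as (number of CpG × length) / (number of C × number of G), which is a common convention but is an assumption, since the text does not give the formula. The function names are hypothetical, and the sketch tests a single stretch rather than scanning a genome.

```python
def cpg_island_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Assumed convention: expected CpG count is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three thresholds quoted in the text to one stretch."""
    gc_content, obs_exp = cpg_island_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp > min_obs_exp


# Hypothetical 200-bp stretches.
cpg_rich = "CG" * 100              # GC-rich and CpG-rich
gc_rich_cpg_poor = "CCCCGGGG" * 25  # GC-rich but few CpG dinucleotides
at_rich = "AT" * 100               # no C or G at all

for name, seq in [("cpg_rich", cpg_rich),
                  ("gc_rich_cpg_poor", gc_rich_cpg_poor),
                  ("at_rich", at_rich)]:
    print(name, cpg_island_stats(seq), looks_like_cpg_island(seq))
```

The second example shows why the observed/expected criterion matters: a stretch can be entirely G and C yet fail the test because CpG dinucleotides are depleted relative to its base composition.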
Settings in PhyloBayes-MPI
Both the observed and the simulated DNA sequence data
were analyzed by having 312 substitution histories sampled
for every site. To generate the substitution histories with a
modified version of the PhyloBayes-MPI 1.5a software, we
used the CAT–GTR (general time reversible) model (see
Tavare 1986; Lartillot and Philippe 2004), with a four-category
discretized gamma distribution to accommodate rate heterogeneity among sites (Yang 1994). The discrete-gamma distribution of rate variation among sites was parameterized by a
shape parameter with an exponential prior of mean 1. All
prior distribution settings for branch lengths and for the
CAT–GTR model were the PhyloBayes-MPI default values.
We also obtained substitution histories through the GTR
model with four discrete-gamma categories and obtained
results similar to those from the CAT–GTR model (results
not shown).
Divergence Time Estimation in Multidivtime and BEAST
The divergence times and substitution rates for the phylogeny in figure 2 were estimated by
considering one substitution type per analysis and by jointly considering the collection of
strand-symmetric context-dependent substitution types. The divergence times were also
estimated from Multidivtime and BEAST (Drummond et al. 2006, 2012) with conventional
analyses that have all substitution types change rate in the same way on each branch. For the
conventional Multidivtime analyses, the Hasegawa–Kishino–Yano (HKY) model (Hasegawa et al.
1984) and discrete-gamma rate heterogeneity among sites with four categories (Yang 1994)
were adopted. For BEAST analyses, the nucleotide substitution model was the GTR model
(Tavare 1986) instead.
The Multidivtime and BEAST analyses were run for between 2 × 10^7 and 10^9 iterations in order
to achieve convergence. Each case was run twice to check convergence. The first 10% of the
samples were treated as burn-in and not included in the posterior approximations.
To enhance MCMC convergence, we removed the outgroup species in the BEAST analyses and placed a prior on the
ingroup root time. The tree topology was considered known
and fixed for all Multidivtime and BEAST analyses. For the
purposes of comparison, the root time prior for all analyses
was tight. It was gamma distributed with mean 44.2 My and
standard deviation 0.1 My.
For all Multidivtime analyses except the “strict clock,” the
lognormal autocorrelated relaxed clock model (Kishino et al.
2001) was employed. The root rate had a gamma prior with
mean and standard deviation 0.0012. The actual marginal
priors on divergence times were formed by the combination
of the root time prior and the generalized Dirichlet distribution of Kishino et al. (2001). The autocorrelation between
rates on adjacent branches is controlled by the parameter $\nu$ ($\nu \geq 0$). When $\nu$ is 0, rates are forced to be the same for all
branches and a strict clock results. Rates are less correlated
with larger $\nu$. The autocorrelation parameter had a gamma
prior with mean and standard deviation of 0.023, which is the
inverse of the prior mean of the root time.
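The role of the autocorrelation parameter can be illustrated with a minimal sketch of one step of lognormal autocorrelated rate change. It assumes that the logarithm of a descendant rate is normally distributed with variance equal to the autocorrelation parameter times the elapsed time, and with a mean shifted so that the expected descendant rate equals the ancestral rate. That parameterization, the function name, and the numerical values are assumptions for illustration; the exact formulation used by Multidivtime should be taken from Kishino et al. (2001).

```python
import math
import random


def evolve_rate(parent_rate, nu, elapsed_time, rng):
    """Draw a descendant rate under an assumed autocorrelated lognormal model."""
    variance = nu * elapsed_time
    if variance == 0.0:
        # nu = 0 (or no elapsed time) forces the rate to stay constant: a strict clock.
        return parent_rate
    # Shift the mean of the log rate so that E[descendant rate] = parent_rate.
    mean_log = math.log(parent_rate) - variance / 2.0
    return math.exp(rng.gauss(mean_log, math.sqrt(variance)))


rng = random.Random(42)
parent = 0.0012  # hypothetical ancestral rate
for nu in (0.0, 0.023, 0.23):
    draws = [evolve_rate(parent, nu, 10.0, rng) for _ in range(5)]
    print(nu, [f"{r:.5f}" for r in draws])
```

With the parameter at 0 every draw equals the ancestral rate, and larger values let descendant rates wander further from it, which matches the description of a strict clock at 0 and weaker correlation at larger values.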
For BEAST analyses, either the strict clock or the uncorrelated lognormal relaxed clock was selected. For the strict clock
case, the single rate of the phylogeny had the same prior as
the root rate prior in Multidivtime analyses. For the uncorrelated BEAST case, the mean rate (ucld.mean) had the root
rate prior of Multidivtime and the standard deviation of the
mean rate (ucld.stdev) had an exponential prior with mean
0.33. A birth–death process was used for the priors on divergence times in BEAST analyses. The mean growth rate had a
uniform (0, 100,000) prior and the relative death rate had a
uniform (0, 1) prior. Rate heterogeneity among sites was
modeled by a four-category discrete-gamma rate model
with an exponential prior of mean 0.5 for the gamma shape
parameter.
Simulation
The program for simulating sequences takes a root sequence,
a fixed tree topology, and rate matrices for different branches
as inputs. There are two rate matrices (one rate matrix for
non-CpG sites and one for CpG sites) for each branch. We
determined the rate matrices to use for each branch from the
substitution lengths that were estimated for each branch
from the real data analyses. In essence, our program simulates
each branch on the phylogenetic tree by alternating between
randomly sampling exponentially distributed waiting times
between events and then randomly determining the
sequence position and substitution type that occurs at
each event. Further details for using a starting sequence
and rate matrices to simulate waiting times of a Markov
chain can be found in Yang (2014).
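The simulation scheme described above, alternating between exponentially distributed waiting times and a random choice of position and substitution, can be sketched for a single branch as follows. This is not the program used in the study: the data structures, the function names, the definition of a CpG site as the C or G of a CG dinucleotide, and the uniform example rate matrices are all assumptions introduced for illustration.

```python
import random

BASES = "ACGT"


def in_cpg(seq, i):
    """True if position i is the C or the G of a CpG dinucleotide."""
    if seq[i] == "C" and i + 1 < len(seq) and seq[i + 1] == "G":
        return True
    return seq[i] == "G" and i > 0 and seq[i - 1] == "C"


def site_rates(seq, i, q_non_cpg, q_cpg):
    """Rates of each possible change at position i given its current context."""
    q = q_cpg if in_cpg(seq, i) else q_non_cpg
    return {b: q[seq[i]][b] for b in BASES if b != seq[i]}


def simulate_branch(seq, length, q_non_cpg, q_cpg, rng):
    """Evolve seq along one branch; return (descendant sequence, number of events)."""
    seq = list(seq)
    elapsed = 0.0
    n_events = 0
    while True:
        # Rates are recomputed after every event because contexts can change.
        rates = [site_rates(seq, i, q_non_cpg, q_cpg) for i in range(len(seq))]
        exit_rates = [sum(r.values()) for r in rates]
        total = sum(exit_rates)
        if total <= 0.0:
            break
        elapsed += rng.expovariate(total)  # waiting time to the next event
        if elapsed > length:
            break
        # Choose the position, then the new base, in proportion to their rates.
        i = rng.choices(range(len(seq)), weights=exit_rates)[0]
        targets = list(rates[i])
        seq[i] = rng.choices(targets, weights=[rates[i][b] for b in targets])[0]
        n_events += 1
    return "".join(seq), n_events


def uniform_matrix(rate, rows=BASES):
    """Hypothetical rate matrix with the same rate for every change."""
    return {a: {b: rate for b in BASES if b != a} for a in rows}


q_non_cpg = uniform_matrix(0.01)
q_cpg = uniform_matrix(0.10, rows="CG")  # only C and G can be CpG sites
rng = random.Random(1)
child, n_events = simulate_branch("ACGTACGTAC", 1.0, q_non_cpg, q_cpg, rng)
print(child, n_events)
```

Recomputing all site rates after each event is simple but slow for long sequences; an implementation meant for real data would update only the positions whose context changed.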
Supplementary Material
Supplementary figure S1 is available at Molecular Biology and
Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors thank two anonymous reviewers for their help.
H.-J.L. and J.L.T. were supported by the National Institutes of
Health (NIH) grants GM090201 and GM070806. N.R. was supported by the Natural Sciences and Engineering Research
Council of Canada. The software developed for this research
is freely available on https://github.com/HuiJieLee.
References
Aris-Brosou S, Yang Z. 2003. Bayesian models of episodic evolution
support a late Precambrian explosive diversification of the
Metazoa Mol Biol Evol. 20(12):1947–1954.
Bird A. 1986. CpG-rich islands and the function of DNA methylation
Nature 321(6067):209–213.
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM,
Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004.
Aligning multiple genomic sequences with the threaded blockset
aligner Genome Res. 14(4):708–715.
Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. 2006. Relaxed phylogenetics and dating with confidence PLoS Biol. 4(5):699–710.
Drummond AJ, Suchard MA, Xie D, Rambaut A. 2012. Bayesian
phylogenetics with BEAUti and the BEAST 1.7 Mol Biol Evol.
29(8):1969–1973.
Hasegawa M, Yano T, Kishino H. 1984. A new molecular clock of mitochondrial DNA and the evolution of hominoids Proc Jpn Acad.
B(60):95–98.
Hedges SB, Dudley J, Kumar S. 2006. TimeTree: a public knowledgebase of divergence times among organisms Bioinformatics
22(23):2971–2972.
Hobolth A. 2008. A Markov chain Monte Carlo expectation maximization algorithm for statistical analysis of DNA sequence evolution
with neighbor-dependent substitution rates J Comput Graph Stat.
17(1):138–162.
Hobolth A, Jensen J. 2005. Statistical inference in evolutionary
models of DNA sequences via the EM algorithm Stat Appl Genet
Mol Biol. 4, Article 18.
Hobolth A, Jensen JL. 2011. Summary statistics for endpoint-conditioned
continuous-time Markov chains J Appl Probab. 48(4):911–924.
Hobolth A, Stone EA. 2009. Simulation from endpoint-conditioned,
continuous-time Markov chains on a finite state space,
with applications to molecular evolution Ann Appl Stat.
3(3):1204–1231.
Hobolth A, Thorne JL. 2014. Sampling and summary statistics of endpoint-conditioned paths in DNA sequence evolution. In: Chen M-H,
Kuo L, Lewis PO, editors. Bayesian phylogenetics: methods, algorithms, and applications. Chapman and Hall/CRC.
Huelsenbeck J, Larget B, Swofford D. 2000. A compound Poisson process
for relaxing the molecular clock Genetics 154(4):1879–1892.
Hwang D, Green P. 2004. Bayesian Markov chain Monte Carlo sequence
analysis reveals varying neutral substitution patterns in mammalian
evolution Proc Natl Acad Sci U S A. 101(39):13994–14001.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM,
Haussler D. 2002. The human genome browser at UCSC Genome
Res. 12(6):996–1006.
Kim S-H, Elango N, Warden C, Vigoda E, Yi SV. 2006. Heterogeneous
genomic molecular clocks in primates PLoS Genet. 2(10):1527–1534.
Kimura M. 1968. Evolutionary rate at the molecular level Nature
217(5129):624–626.
Kimura M. 1983. The neutral theory of molecular evolution. Cambridge
University Press.
Kishino H, Thorne JL, Bruno WJ. 2001. Performance of a divergence
time estimation method under a probabilistic model of rate evolution Mol Biol Evol. 18(3):352–361.
Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G,
Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, et al.
2012. Rate of de novo mutations and the importance of father’s age
to disease risk Nature 488(7412):471–475.
Kumar S, Hedges SB. 2011. TimeTree2: species divergence times on the
iPhone Bioinformatics 27(14):2023–2024.
Lartillot N. 2006. Conjugate Gibbs sampling for Bayesian phylogenetic
models J Comput Biol. 13(10):1701–1722.
Lartillot N, Delsuc F. 2012. Joint reconstruction of divergence times and
life-history evolution in placental mammals using a phylogenetic
covariance model Evolution 66(6):1773–1787.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site
heterogeneities in the amino-acid replacement process Mol Biol
Evol. 21(6):1095–1109.
Lartillot N, Rodrigue N, Stubbs D, Richer J. 2013. PhyloBayes MPI:
phylogenetic reconstruction with infinite mixtures of profiles in a
parallel environment Syst Biol. 62(4):611–615.
Liu JS. 2001. Monte Carlo strategies in scientific computing. Springer.
Mateiu L, Rannala B. 2006. Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation
Syst Biol. 55(2):259–269.
Minin VN, Suchard MA. 2008. Counting labeled transitions in continuous-time Markov models of evolution J Math Biol. 56(3):391–412.
Nielsen R. 2002. Mapping mutations on phylogenies Syst Biol.
51(5):729–739.
Ohta T. 1973. Slightly deleterious mutant substitutions in evolution
Nature 246(5428):96–98.
Peifer M, Karro JE, von Gruenberg HH. 2008. Is there an acceleration of
the CpG transition rate during the mammalian radiation?
Bioinformatics 24(19):2157–2164.
Rannala B, Yang Z. 2007. Inferring speciation times under an episodic
molecular clock Syst Biol. 56(3):453–466.
Rodrigue N, Lartillot N, Bryant D, Philippe H. 2005. Site interdependence
attributed to tertiary structure in amino acid sequence evolution
Gene 347(2):207–217.
Rodrigue N, Philippe H, Lartillot N. 2006. Assessing site-interdependent
phylogenetic models of sequence evolution Mol Biol Evol.
23(9):1762–1775.
Rodrigue N, Philippe H, Lartillot N. 2007. Exploring fast computational
strategies for probabilistic phylogenetic analysis Syst Biol. 56(5):711–726.
Rodrigue N, Philippe H, Lartillot N. 2008. Uniformization for sampling
realizations of Markov processes: applications to Bayesian implementations of codon substitution models Bioinformatics 24(1):56–62.
Rodrigue N, Philippe H, Lartillot N. 2010. Mutation-selection models
of coding sequence evolution with site-heterogeneous amino acid
fitness profiles Proc Natl Acad Sci U S A. 107(10):4629–4634.
Sanderson M. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy Mol Biol Evol.
14(12):1218–1231.
Sanderson M. 2002. Estimating absolute rates of molecular evolution
and divergence times: a penalized likelihood approach Mol Biol Evol.
19(1):101–109.
Smallwood S, Kelsey G. 2012. De novo DNA methylation: a germ cell
perspective Trends Genet. 28(1):33–42.
Smit AFA, Hubley R, Green P. 1996–2010. RepeatMasker Open-3.0.
Sueoka N. 1988. Directional mutation pressure and neutral molecular
evolution Proc Natl Acad Sci U S A. 85(8):2653–2657.
Sueoka N. 1992. Directional mutation pressure, selective constraints, and
genetic equilibria J Mol Evol. 34(2):95–114.
Tataru P, Hobolth A. 2011. Comparison of methods for calculating
conditional expectations of sufficient statistics for continuous
time Markov chains BMC Bioinformatics 12:465.
Tavaré S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. American Mathematical Society: Lectures
on Mathematics in the Life Sciences, Vol.17. p. 57–86.
Thorne J, Kishino H. 2002. Divergence time and evolutionary rate estimation with multilocus data Syst Biol. 51(5):689–702.
Thorne JL, Kishino H, Painter IS. 1998. Estimating the rate of evolution of the rate of molecular evolution Mol Biol Evol. 15(12):1647.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA
sequences with variable rates over sites: approximate methods J Mol
Evol. 39(3):306–314.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood Mol
Biol Evol. 24(8):1586–1591.
Yang Z. 2014. Simulating molecular evolution. In: Molecular evolution: a
statistical approach. Oxford University Press.
Yang Z, Nielsen R. 2008. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage
Mol Biol Evol. 25(3):568–579.
Yoder A, Yang Z. 2000. Estimation of primate speciation dates using
local molecular clocks Mol Biol Evol. 17(7):1081–1090.
Zhu T, dos Reis M, Yang Z. 2014. Characterization of the uncertainty of
divergence time estimation under relaxed molecular clock models
using multiple loci Syst Biol. 64(2):267–280.