Relaxing the Molecular Clock to Different Degrees for Different Substitution Types

Hui-Jie Lee,*,1 Nicolas Rodrigue,2 and Jeffrey L. Thorne1,3
1 Department of Statistics, North Carolina State University
2 Department of Biology, Carleton University, Ottawa, ON, Canada
3 Department of Biological Sciences, North Carolina State University
*Corresponding author: E-mail: [email protected].
Associate editor: Hideki Innan

Abstract
Rates of molecular evolution can vary over time. Diverse statistical techniques for divergence time estimation have been developed to accommodate this variation. These typically require that all sequence (or codon) positions at a locus change independently of one another. They also generally assume that the rates of different types of nucleotide substitutions vary across a phylogeny in the same way. This permits divergence time estimation procedures to employ an instantaneous rate matrix with relative rates that do not differ among branches. However, previous studies have suggested that some substitution types (e.g., CpG to TpG changes in mammals) are more clock-like than others. As has been previously noted, this is biologically plausible given the mutational mechanism of CpG to TpG changes. Through stochastic mapping of sequence histories from context-independent substitution models, our approach allows for context-dependent nucleotide substitutions to change their relative rates over time. We apply our approach to the analysis of a 0.15 Mb intergenic region from eight primates. In accord with previous findings, we find comparatively little rate variation over time for CpG to TpG substitutions but we find more for other substitution types. We conclude by discussing the limitations and prospects of our approach.

Key words: CpG transition rate, context-dependent substitution, relaxed molecular clock, divergence time estimation.
Introduction
Molecular sequence data provide information about the amount of evolution, or the branch lengths, between species, but they do not suffice for disentangling evolutionary rates and times. When substitution rates change over time, divergence time estimation becomes more challenging because fossil evidence that separates rates and times on one part of a phylogeny does not guarantee that rates and times can be well-separated on other parts of the tree. Just as improved analysis of fossil data can lead to better divergence time estimates, improved treatment of evolutionary rates can yield more successful divergence time estimates. A wide variety of relaxed clock methods have been proposed and studied for facilitating the separation of rates and times (Sanderson 1997, 2002; Thorne et al. 1998; Huelsenbeck et al. 2000; Yoder and Yang 2000; Kishino et al. 2001; Aris-Brosou and Yang 2003; Drummond et al. 2006; Rannala and Yang 2007). One promising direction is to identify factors that cause rate variation over time or that at least covary with this rate variation. For example, Lartillot and Delsuc (2012) have considered how life-history traits such as body size might covary with nucleotide substitution rates. By combining life history trait information of extant species with the sequence data from these species, Lartillot and Delsuc are able to partially separate evolutionary rates and times. They do this by exploiting covariation across a phylogeny between body size and evolutionary rates. Other potential ways to partially or completely separate rates and times are motivated by other factors that affect substitution rates. These include the possibility that natural selection induces a correlation between substitution rate and effective population size (Ohta 1973) and the ability to better understand substitution rates by incorporating mutation data.
© The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
Mol. Biol. Evol. 32(8):1948–1961 doi:10.1093/molbev/msv099 Advance Access publication April 29, 2015

Although diverse treatments of substitution rate variation over time are available, a typical assumption is that the relative rates of different substitution types vary among branches on a phylogenetic tree in relatively simple ways. For example, the absolute rates of all substitution types might change over time but the relative rates among types might be invariant. Alternatively, effective population size might vary among branches of a tree so that mutation–selection balance approaches inspired by population genetics can assist in the determination of how rates of different substitution types vary among branches (e.g., see Yang and Nielsen 2008; Rodrigue et al. 2010). Beyond the effects of generation time and natural selection, there are compelling biological reasons to believe that additional flexibility in substitution rate variation over time is warranted. Even for a neutrally evolving sequence, variation across the phylogeny in generation length might be insufficient to completely explain variation across the phylogeny in substitution rates. Although mutation rate equals substitution rate with neutral evolution (Kimura 1968, 1983), different mutational mechanisms underlie different types of point mutations. If mutation occurred only in meiosis, then mutation rate variation over time might be exclusively attributable to changes in generation length. But this is not the situation.
Some mutational mechanisms seem prone to yielding almost clock-like substitution behavior because they can operate nearly at any time during a life cycle whereas other mutational mechanisms are meiotic and therefore prone to affecting a sequence site only once per generation. In addition, the relative importance of mutational mechanisms can change across a phylogeny because the genetic and environmental components affecting them can change. Despite the fact that the molecular clock seldom holds in empirical analyses, it has been suggested that CpG dinucleotides (i.e., cytosines immediately followed in sequence by guanines) may evolve in a comparatively clock-like fashion in mammals and primates (Hwang and Green 2004; Kim et al. 2006; Peifer et al. 2008). As CpG methylation exists during much of the life cycle of germ-line cells (Smallwood and Kelsey 2012), mutations associated with DNA methylation can accumulate throughout much of the duration of each generation (Kim et al. 2006). Hwang and Green (2004) implemented a pioneering Bayesian Markov chain Monte Carlo (MCMC) method that allows substitution rates to depend on neighboring nucleotides. They observed that CpG transitions have a relatively clock-like behavior in mammals. Kim et al. (2006) noted that divergence times between the human and chimpanzee species pair and the macaque and baboon species pair are thought to be similar, but the human–chimpanzee pair has longer generation times. They pointed out that the macaque–baboon pair had accumulated significantly more transitions at non-CpG sites than the other pair, probably by virtue of the generation-time effect. However, inferred amounts of transition substitutions at CpG sites were similar between pairs. As emphasized by these studies, methylation-origin mutations at CpG sites may be dependent on chronological time rather than on time measured in generations.
In other words, CpG sites might serve as the basis of a relatively accurate molecular clock even for a phylogeny relating lineages with diverse generation times. Inspired by this earlier work and also by the realization that there are a variety of other reasons why substitution rates may vary over time differently for different substitution types, we aim to investigate the clock-like natures of different kinds of (possibly context-dependent) nucleotide substitutions and we aim to simultaneously improve divergence time estimation.

New Approaches
Prior to describing its details, we provide an overview of our approach that begins with considering the history of sequence changes on each branch of the phylogeny. Unfortunately, complete substitution histories are not directly observed when phylogenetically related interspecific data sets are collected. Instead, only the sequences at the tips of a rooted evolutionary tree are observed. This lack of complete information makes the inference problem more challenging. As allowing different substitution types to have different clocks is computationally daunting with context-dependent substitution, we provide another nonideal solution. In this study, we collect substitution histories of homologous sequences according to their posterior distribution using a context-independent substitution model through a slightly modified version of the PhyloBayes-MPI software (Lartillot et al. 2013). These substitution histories are then the basis for inferring context-dependent rates on each branch. Instead of different genes having different branch lengths as is the situation for multigene divergence time estimation (Thorne and Kishino 2002), now we have different kinds of nucleotide substitutions with different “substitution lengths.” The substitution types share the same set of divergence times, but each substitution type has its own rate trajectory on the phylogeny.
The substitution lengths and the associated uncertainty for each substitution type can be approximated given the substitution histories. Based upon these approximations for the different substitution types, the divergence times and chronological substitution rates of each kind of substitution are estimated with a relaxed molecular clock through MCMC (Thorne et al. 1998; Kishino et al. 2001; Thorne and Kishino 2002).

Augmented Data Likelihood
The observed data are the homologous sequence alignment $D$ of length $N$ from $S$ species. The species are related through a rooted phylogenetic tree with a topology that is assumed known. The root has index $2S-2$, tips have indices $j = 0, \ldots, S-1$, and nonroot internal nodes have indices $j = S, \ldots, 2S-3$. A branch is given the same index as the node at its end. Consider a specific branch $j$ on a tree and a specific site $i$ in a DNA sequence. Assume the full substitution history of the DNA sequence on branch $j$ is observed. This history includes the number of substitutions undergone for every site on branch $j$, the timing of the substitution events, and the states before and after each substitution. The substitution rate per site on branch $j$ from site context $a$ to nucleotide $b$ will be $\lambda_{abj}$. Here, “context” represents both the state of the site of interest and (possibly) states at other neighboring sites. The vector $\boldsymbol{\lambda}_j$ will represent the entire collection of $\lambda_{abj}$ for all possible contexts $a$ and nucleotides $b$ on branch $j$. We are interested in the product of $\lambda_{abj}$ and time for each branch. We refer to this product as the “substitution length” of branch $j$ from context $a$ to nucleotide $b$. Although the product can be estimated from sequence data, rates and the branch time are confounded. We will set the branch time at 1 for all branches. This means that the substitution length for change from $a$ to $b$ is known on branch $j$ if $\lambda_{abj}$ is known.
Let $u_{aij}$ ($u_{aij} \in [0, 1]$) be the proportion of time site $i$ has context $a$ on branch $j$ with $\mathbf{U}_{ij}$ being a vector representing values of the dwell proportions $u_{aij}$ for all $a$, and let $n_{abij}$ be the number of changes from context $a$ to nucleotide $b$ at site $i$ on branch $j$. We have $\mathbf{n}_{ij}$ be the vector of substitution counts $n_{abij}$ for all $a$ and $b$. As an example, consider the substitution history illustrated in figure 1. If $j$ represents the branch being depicted and context $a$ represents a CpG dinucleotide, then the proportion of time that site $i = 2$ is a CpG site is $u_{a2j} = 0.4/1.0 = 0.4$. Because there was one substitution from C to $b = \mathrm{T}$ for site $i = 2$ in the CpG dinucleotide context, $n_{ab2j} = 1$. This will contribute to the CpG transition length. Site 3 in figure 1 also begins branch $j$ as part of a CpG dinucleotide. Because there was a substitution from C to T at time 0.4 for site $i = 2$, the proportion of time that site $i = 3$ is a CpG site is $u_{a3j} = 0.4$. Also, because no substitution occurred at site $i = 3$ for the CpG dinucleotide context, $n_{ab3j} = 0$.

FIG. 1. An example substitution history on a branch for four sites.

If the nucleotide substitution process is treated as a continuous time Markov chain, then the likelihood for the observed history of site $i$ on branch $j$ is

$$p(\mathbf{n}_{ij}, \mathbf{U}_{ij} \mid \boldsymbol{\lambda}_j, s_{ij0}) = \prod_a \prod_b \lambda_{abj}^{n_{abij}} \exp\{-u_{aij}\lambda_{abj}\}, \tag{1}$$

where $s_{ij0}$ is the initial state of site $i$ on branch $j$. The likelihood without conditioning on the initial state $s_{ij0}$ would be

$$p(\mathbf{n}_{ij}, \mathbf{U}_{ij} \mid \boldsymbol{\lambda}_j, s_{ij0})\, p(s_{ij0} \mid \boldsymbol{\lambda}_j) = p(\mathbf{n}_{ij}, \mathbf{U}_{ij}, s_{ij0} \mid \boldsymbol{\lambda}_j).$$

One simplification that is often made when modeling molecular evolution is to assume stationarity of the substitution process. The stationarity assumption means that the initial state $s_{ij0}$ has some information pertaining to the rate parameters. We treat $p(s_{ij0} \mid \boldsymbol{\lambda}_j)$ as not being a function of $\boldsymbol{\lambda}_j$. Therefore, we are only focusing on the transient part of the likelihood. This means that equation (1) summarizes the information about $\boldsymbol{\lambda}_j$ in an observed history. Suppose the complete substitution history of every site is available. The likelihood for the observed history of all sites on branch $j$ is then

$$p(\mathbf{n}_j, \mathbf{U}_j \mid \boldsymbol{\lambda}_j, \mathbf{s}_{j0}) = \prod_i \prod_a \prod_b \lambda_{abj}^{n_{abij}} \exp\{-u_{aij}\lambda_{abj}\} = \prod_a \prod_b \lambda_{abj}^{\sum_i n_{abij}} \exp\Big\{-\lambda_{abj} \sum_i u_{aij}\Big\}, \tag{2}$$

where $\mathbf{n}_j$, $\mathbf{U}_j$, and $\mathbf{s}_{j0}$ are, respectively, vectors of the substitution counts $\mathbf{n}_{ij}$, the dwell proportions $\mathbf{U}_{ij}$, and the initial states $s_{ij0}$. The log-likelihood is then

$$\log p(\mathbf{n}_j, \mathbf{U}_j \mid \boldsymbol{\lambda}_j, \mathbf{s}_{j0}) = \sum_i \sum_a \sum_b \big( n_{abij} \log \lambda_{abj} - u_{aij}\lambda_{abj} \big) = \sum_a \sum_b \big( N_{abj} \log \lambda_{abj} - u_{aj}\lambda_{abj} \big), \tag{3}$$

where the substitution counts $N_{abj} = \sum_i n_{abij}$ and the summed dwell proportions $u_{aj} = \sum_i u_{aij}$. Therefore, the sufficient statistics of $\lambda_{abj}$ on branch $j$ consist of the summed dwell proportions spent in context $a$ on branch $j$, $u_{aj}$, and the number of changes from context $a$ to nucleotide $b$ on branch $j$, $N_{abj}$. By setting the first derivative to zero and solving, we have the maximum-likelihood estimates

$$\hat{\lambda}_{abj} = \frac{N_{abj}}{u_{aj}}.$$

We can consider $\boldsymbol{\lambda}_{ab}$ as a vector of substitution lengths of changes from context $a$ to nucleotide $b$ on the phylogenetic tree. We have

$$\boldsymbol{\lambda}_{ab} = \begin{pmatrix} \lambda_{ab1} \\ \vdots \\ \lambda_{ab(2S-2)} \end{pmatrix}. \tag{4}$$

Let $M = \{M_{ij}\}_{1 \le i \le N,\ 1 \le j \le 2S-2}$ represent the substitution history (stochastic mapping) that generates the molecular sequence data $D$ such that $M_{ij}$ is the substitution history at site $i$ on branch $j$. The substitution history $M_{ij}$ can be summarized by the sufficient statistics for $\lambda_{abj}$. These are $n_{abij}$ and $u_{aij}$.

$$M_{ij} = (n_{abij}, u_{aij}), \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, 2S-2. \tag{5}$$

With a context-independent model of sequence change, there are four possible states (A, T, C, and G) and 4 × 3 = 12 possible types of single-nucleotide substitutions (e.g., A → T, A → C, etc.). If strand-symmetry is assumed so that types have the same rate as their complement (e.g., A → C and T →
G have an identical rate), context-independence yields six types of single nucleotide substitutions: four transversions and two transitions. These six types of single nucleotide substitutions are listed in table 1. In contrast, there are nine types of changes (pooling together complementary changes) and three states (contexts) when classifying a site according to whether it or its complement might have a methylated C in a CpG dinucleotide. These nine types of changes are listed in table 2. They include four transversions and two transitions for non-CpG sites, and two transversions and one transition for CpG sites. For instance, if the substitution history is fully observed, the substitution length of CpG transitions (Type 9 in table 2) on a branch can be inferred by the number of CpG transitions divided by the sum over sites of the proportions of time that CpG sites were located on that branch. Similarly, the substitution length of non-CpG G → C and C → G substitutions (Type 1 in table 2) is computed as the number of G → C and C → G substitutions at non-CpG sites divided by the summed proportions of time that non-CpG C or G sites existed on that branch. Note that the way we define substitution lengths can be applied to other types of context-dependent substitutions not discussed here. It is possible to derive the substitution lengths for contexts considering any combinations of 5′ and 3′ neighboring nucleotides.

Table 1. The Six Single Nucleotide Substitution Types.
Type    Substitutions
1       G → C & C → G
2       G → T & C → A
3       T → A & A → T
4       T → G & A → C
5       G → A & C → T
6       A → G & T → C
NOTE: Complementary substitutions are considered as the same type.

Table 2. The Nine Context-Dependent CpG Substitution Types.
Type    Substitutions
1       Non-CpG G → C & C → G
2       Non-CpG G → T & C → A
3       Non-CpG T → A & A → T
4       Non-CpG T → G & A → C
5       Non-CpG G → A & C → T
6       Non-CpG A → G & T → C
7       CpG G → C & C → G
8       CpG G → T & C → A
9       CpG G → A & C → T
NOTE: Complementary substitutions are considered as the same type.

Sampling Substitution Histories
The “substitution lengths” of different types of substitutions provide a basis for inferring the rates of each kind of change. This quantity is hard to estimate directly from the molecular sequence alignment, but the estimation becomes rather straightforward if the substitution histories (mappings) can be fully observed. We therefore employ data augmentation to help estimate the “substitution lengths” of each kind of change. That is, the augmented data (unobserved substitution histories) are constructed from the observed sequence alignment. Diverse strategies have been developed for sampling substitution histories $M$ conditional upon the sequence alignment $D$ and a vector of parameters $\theta$ that represents the tree topology, branch lengths, and the parameters of the substitution process. Sampling histories $M$ from the distribution $\Pr_{\theta}(M \mid D)$ is categorized as endpoint-conditioned sampling because the sequence data $D$ are observed at the endpoints (tips) of the tree. Endpoint-conditioned sampling strategies for molecular sequence data have been evaluated by Hobolth and Stone (2009) and reviewed by Hobolth and Thorne (2014). The probability of a mapping conditioned on the molecular sequence alignment can be written as

$$\Pr_{\theta}(M \mid D) = \Pr(M \mid D, \theta) = \frac{\Pr(M, D, \theta)}{\Pr(D, \theta)}. \tag{6}$$

However, the true values of the parameters in $\theta$ are usually unknown. A mapping sampled from the marginal distribution $\Pr(M \mid D)$ by integrating out all possible parameter values in $\theta$ has density

$$\Pr(M \mid D) = \int \Pr(M \mid D, \theta)\, \Pr(\theta \mid D)\, d\theta. \tag{7}$$

Lartillot (2006) proposed a sampling algorithm that includes data augmentation and conjugate Gibbs sampling to obtain samples from the joint posterior probability distribution $\Pr(M, \theta \mid D)$ and implemented it in the PhyloBayes-MPI software (Lartillot et al. 2013). The algorithm proceeds in two alternating steps.
First, draw a substitution history conditional on the parameters in $\theta$ in a way that is similar to Nielsen’s algorithm (Nielsen 2002). Second, resample the parameters in $\theta$ through a Gibbs sampler that is conditional on the substitution history. A Gibbs sequence is generated after repeating this process many times, where a subset of samples of $\theta$ and substitution histories are taken as draws from the full joint posterior distribution of all parameters. The Monte Carlo estimate of any moments (e.g., mean) for the marginal distribution of the substitution histories can be directly computed from the realizations of the Gibbs sequence. We use a slightly modified version of the PhyloBayes-MPI software with an independent-site time-reversible model to generate the substitution histories. We then use these sampled histories to make inferences about parameters in a richer context-dependent substitution model. We do this by relying upon the assumption that the distribution of substitution histories is robust to substitution model specification. In other words, we assume that the endpoint-conditioned sample from PhyloBayes-MPI for our data set can be treated as an endpoint-conditioned sample according to our substitution model of interest. For data sets with long branches that generate high probabilities of multiple changes per site or high probabilities of changes at consecutive sites, this assumption will be problematic. For data sets with short branches and little sequence divergence, the assumption should be more appropriate. Further implications of this as well as potential improvements to it are detailed in the Discussion section.

Substitution Length Estimation
The PhyloBayes-MPI software samples endpoint-conditioned substitution histories $M$ and a vector of parameters $\theta$ that specifies the substitution processes, tree topology, and branch lengths from the joint posterior distribution $\Pr(M, \theta \mid D)$.
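The idea of endpoint-conditioned sampling can be illustrated for a single site on a single branch. The following is a minimal sketch and not the PhyloBayes-MPI algorithm: it assumes an equal-rates nucleotide model, a branch time of 1, and naive rejection sampling (simulate forward from the starting state and keep the history only if it ends in the required state). The function names, the rate value, and the random seed are hypothetical choices made for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def simulate_branch(start, rate, rng, branch_time=1.0):
    """Simulate one site forward in time under an equal-rates model.

    `rate` is the total rate of leaving the current state (an assumption of
    this sketch). Returns a list of (state, dwell_time) segments whose dwell
    times sum to `branch_time`.
    """
    segments = []
    state, elapsed = start, 0.0
    while True:
        wait = rng.expovariate(rate)  # exponential waiting time to next change
        if elapsed + wait >= branch_time:
            segments.append((state, branch_time - elapsed))
            return segments
        segments.append((state, wait))
        elapsed += wait
        # All three alternative nucleotides are equally likely.
        state = rng.choice([x for x in NUCLEOTIDES if x != state])


def sample_endpoint_conditioned(start, end, rate, rng, max_tries=100000):
    """Rejection sampling: keep the first simulated history ending in `end`."""
    for _ in range(max_tries):
        segments = simulate_branch(start, rate, rng)
        if segments[-1][0] == end:
            return segments
    raise RuntimeError("no history accepted within max_tries")


def summarize(segments):
    """Return dwell time per state and counts per (from, to) change."""
    dwell = {x: 0.0 for x in NUCLEOTIDES}
    counts = {}
    for k, (state, time_in_state) in enumerate(segments):
        dwell[state] += time_in_state
        if k + 1 < len(segments):
            change = (state, segments[k + 1][0])
            counts[change] = counts.get(change, 0) + 1
    return dwell, counts


rng = random.Random(1)  # fixed seed so the sketch is reproducible
history = sample_endpoint_conditioned("C", "T", rate=0.5, rng=rng)
dwell, counts = summarize(history)
```

Because the branch time is set to 1, the dwell times are already dwell proportions, and together with the substitution counts they are the kind of per-site summaries on which substitution length estimation is based. Rejection sampling becomes inefficient when the required endpoint is improbable, which is one reason that alternative endpoint-conditioned strategies have been studied (Hobolth and Stone 2009).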
Further details about the PhyloBayes-MPI settings that we used are in the Materials and Methods section. By only keeping a set of widely spaced realizations of mappings and parameters, we can generate $C$ approximately independent and identically distributed samples of $M$ and $\theta$ from $\Pr(M, \theta \mid D)$. That is,

$$\big\{ (M^{(c)}, \theta^{(c)}) \big\}_{c = 1, \ldots, C} \overset{\text{iid}}{\sim} \Pr(M, \theta \mid D),$$

where $(M^{(c)}, \theta^{(c)})$ is the sample from the $c$th iteration. In this notation, $M^{(c)} = \{M^{(c)}_{ij}\}$ ($1 \le i \le N$; $1 \le j \le 2S-2$; $1 \le c \le C$) such that $M^{(c)}_{ij}$ is the substitution history of site $i$ on branch $j$ in the $c$th iteration and has sufficient statistics that are $n^{(c)}_{abij}$ and $u^{(c)}_{aij}$. Note that $M^{(c)}_{ij}$ includes the number of substitutions from context $a$ to nucleotide $b$ at site $i$ on branch $j$ and the proportion of time site $i$ on branch $j$ has context $a$ in the $c$th iteration. The “substitution lengths” of changes from context $a$ to nucleotide $b$ on the phylogeny $\boldsymbol{\lambda}_{ab}$ can be estimated by maximum likelihood from $M^{(c)}$. The maximum-likelihood estimate of $\boldsymbol{\lambda}_{ab}$ for iteration $c$ is

$$\hat{\boldsymbol{\lambda}}^{(c)}_{ab} = \begin{pmatrix} \hat{\lambda}^{(c)}_{ab1} \\ \vdots \\ \hat{\lambda}^{(c)}_{ab(2S-2)} \end{pmatrix}, \tag{8}$$

where

$$\hat{\lambda}^{(c)}_{abj} = \frac{\sum_i n^{(c)}_{abij}}{\sum_i u^{(c)}_{aij}} = \frac{N^{(c)}_{abj}}{u^{(c)}_{aj}}. \tag{9}$$

We can estimate $\boldsymbol{\lambda}_{ab}$ by

$$\hat{\boldsymbol{\lambda}}_{ab} = \frac{1}{C} \sum_{c=1}^{C} \hat{\boldsymbol{\lambda}}^{(c)}_{ab}, \tag{10}$$

so that

$$\hat{\lambda}_{abj} = \frac{1}{C} \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj}. \tag{11}$$

By assuming the maximum-likelihood estimates are asymptotically normally distributed, we approximate the variance of $\hat{\lambda}^{(c)}_{abj}$ for iteration $c$ by the inverse Fisher information. This yields the variance estimate

$$\left[ -\frac{d^2 \log p(\mathbf{n}_j, \mathbf{U}_j \mid \boldsymbol{\lambda}_j, \mathbf{s}_{j0})}{d\lambda_{abj}^2} \bigg|_{\hat{\lambda}^{(c)}_{abj}} \right]^{-1} = \frac{\big(\hat{\lambda}^{(c)}_{abj}\big)^2}{N^{(c)}_{abj}} = \frac{N^{(c)}_{abj}}{\big(u^{(c)}_{aj}\big)^2} = \frac{\hat{\lambda}^{(c)}_{abj}}{u^{(c)}_{aj}}. \tag{12}$$

By the Law of Total Variance, the variance of $\hat{\lambda}_{abj}$ is

$$\mathrm{Var}(\hat{\lambda}_{abj}) = \mathrm{Var}\big[ E\big( \hat{\lambda}_{abj} \mid M^{(c)} \big) \big] + E\big[ \mathrm{Var}\big( \hat{\lambda}_{abj} \mid M^{(c)} \big) \big]. \tag{13}$$

The first term can be estimated from the sample variance of $\hat{\lambda}^{(c)}_{abj}$ and the second term can be estimated from the sample average of the inverse Fisher information, $\hat{\lambda}^{(c)}_{abj} / u^{(c)}_{aj}$.
We can estimate the variance–covariance matrix of $\hat{\boldsymbol{\lambda}}_{ab}$ with the diagonal elements $\widehat{\mathrm{Var}}(\hat{\lambda}_{abj})$ and off-diagonal elements $\widehat{\mathrm{Cov}}(\hat{\lambda}_{abj}, \hat{\lambda}_{abj'})$, where

$$\widehat{\mathrm{Var}}(\hat{\lambda}_{abj}) = \frac{1}{C-1} \sum_{c=1}^{C} \big( \hat{\lambda}^{(c)}_{abj} - \hat{\lambda}_{abj} \big)^2 + \frac{1}{C} \sum_{c=1}^{C} \frac{\hat{\lambda}^{(c)}_{abj}}{u^{(c)}_{aj}}, \tag{14}$$

and

$$\widehat{\mathrm{Cov}}(\hat{\lambda}_{abj}, \hat{\lambda}_{abj'}) = \frac{1}{C-1} \left[ \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj} \hat{\lambda}^{(c)}_{abj'} - \frac{1}{C} \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj} \sum_{c=1}^{C} \hat{\lambda}^{(c)}_{abj'} \right]. \tag{15}$$

Note that some sources of uncertainty are unfortunately not included in the estimates. For instance, we do not include uncertainty due to using substitution histories that are not generated from a context-dependent model.

Divergence Time Estimation
Multidivtime is a Bayesian MCMC program for estimating divergence times on a known rooted phylogeny with a relaxed autocorrelated clock model (Thorne et al. 1998; Kishino et al. 2001; Thorne and Kishino 2002). Data sets consisting of multiple genes can be analyzed in Multidivtime by assuming a common set of divergence times but allowing independent rate trajectories for each gene. Multidivtime takes a two-step procedure to estimate species divergence times from multigene data sets. First, it estimates branch lengths through maximum likelihood and uses the curvature of the log-likelihood surface to estimate a variance–covariance matrix between the branch length estimates for each gene. Second, it adopts an MCMC procedure to sample divergence times and rates by approximating the likelihood surface with a multivariate normal distribution for each gene, which is determined by the branch length estimates and the variance–covariance matrix obtained in the first step. In this study, we use Multidivtime for data sets where “substitution lengths” vary among substitution types rather than data sets where branch lengths vary among genes. By sampling the substitution histories from PhyloBayes-MPI, the substitution lengths and the associated variance–covariance matrix can be estimated as described above for each kind of nucleotide substitution.
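The substitution length estimates and their variance–covariance matrix, as given in equations (9), (11), (14), and (15), can be sketched for a single substitution type as follows. This is an illustrative sketch rather than the code used in the study: the function name, the input layout (one row per sampled history, one column per branch), and the toy numbers are all assumptions.

```python
def substitution_length_summary(counts, dwell):
    """Summarize sampled histories for one substitution type.

    counts[c][j]: number of changes of this type on branch j in sample c.
    dwell[c][j]:  summed dwell proportion of the relevant context on branch j
                  in sample c (must be positive).
    Returns (estimates, cov), where estimates[j] averages the per-sample
    maximum-likelihood estimates and cov is the variance-covariance matrix.
    Requires at least two samples.
    """
    n_samples = len(counts)
    n_branches = len(counts[0])

    # Per-sample maximum-likelihood estimates: count / summed dwell proportion.
    lam = [[counts[c][j] / dwell[c][j] for j in range(n_branches)]
           for c in range(n_samples)]

    # Average of the per-sample estimates for each branch.
    estimates = [sum(lam[c][j] for c in range(n_samples)) / n_samples
                 for j in range(n_branches)]

    cov = [[0.0] * n_branches for _ in range(n_branches)]
    for j in range(n_branches):
        sum_j = sum(lam[c][j] for c in range(n_samples))
        for k in range(n_branches):
            sum_k = sum(lam[c][k] for c in range(n_samples))
            cross = sum(lam[c][j] * lam[c][k] for c in range(n_samples))
            # Sample covariance across samples (sample variance when j == k).
            cov[j][k] = (cross - sum_j * sum_k / n_samples) / (n_samples - 1)
        # Diagonal only: add the average inverse Fisher information.
        cov[j][j] += sum(lam[c][j] / dwell[c][j]
                         for c in range(n_samples)) / n_samples
    return estimates, cov


# Toy input (hypothetical numbers): three sampled histories, two branches.
toy_counts = [[4, 10], [5, 8], [6, 12]]
toy_dwell = [[40.0, 50.0], [50.0, 40.0], [40.0, 60.0]]
estimates, cov = substitution_length_summary(toy_counts, toy_dwell)

# Single history of figure 1, CpG transitions: one change, and a summed CpG
# dwell proportion of 0.4 + 0.4 (assuming sites 1 and 4 are never CpG sites).
fig1_length = 1 / (0.4 + 0.4)
```

For the single history in figure 1, this gives a CpG transition length of 1/0.8 = 1.25 on the depicted branch, under the stated assumption about sites 1 and 4. In the toy input, the second branch has the same per-sample estimate in every sample, so its estimated variance comes entirely from the inverse Fisher information term.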
Different kinds of nucleotide substitutions are treated by Multidivtime in the same way as it treats different genes that share the same set of divergence times. Just as Multidivtime allows the rate trajectories of different genes to independently vary over the phylogeny and just as it allows some genes to change rate in a more clock-like fashion than others, our analyses with Multidivtime have different substitution types change rate independently and allow some substitution types to be more clock-like than others. A weakness of Multidivtime that exists for multigene analyses is that it ignores the possibility of correlated rate changes among genes. Likewise, a weakness of our analyses of changing substitution rates over time is that correlated changes in rate among substitution types are biologically plausible but Multidivtime assumes the rates change independently. An additional shortcoming of using Multidivtime for studying substitution rate change is the treatment of variance and covariance structure of estimated substitution lengths. Although Multidivtime can account for covariances in estimation error of substitution lengths among branches for each substitution type, its current implementation assumes no covariance in estimation error among substitution types.

Results
As described in the Materials and Methods section, the DNA sequences for all divergence time analyses consisted of approximately 0.15 Mb from nine primates, one of which was treated as the outgroup. The assumed topology is shown in figure 2. One difficulty with divergence time studies is that the truth tends to be unknown. We decided to compare the results of our analyses with the divergence time estimates reported by the TimeTree database (Hedges et al. 2006; Kumar and Hedges 2011). The divergence times reported by the TimeTree database should certainly not be considered true, but we expect them to be comparatively reliable because the estimates emerge from studies that surpass ours in terms of the inclusion and treatment of fossil evidence as well as number of taxa.

FIG. 2. Phylogenetic tree for nine primate species used in the analyses. There are eight ingroup taxa with an outgroup species (bushbaby) to root the tree. Nodes are labeled from 0 to 14. The depicted divergence times are obtained from TimeTree (Hedges et al. 2006; Kumar and Hedges 2011) and indicated by the time line with time units of one million years.
[Figure 2 tip labels: human, orangutan, gibbon, rhesus, baboon, greenMonkey, marmoset, squirrelMonkey, bushbaby; time axis from 70.0 to 0.0 Myr.]

FIG. 3. Normalized root-to-tip substitution lengths for each type of context-independent single nucleotide substitution. Types are defined in table 1. Root-to-tip substitution lengths are normalized so that within-type average substitution length is 1. The normalized root-to-tip substitution lengths are labeled with different colors for each tip. The average root-to-tip substitution length for each type before normalization is reported as well as the variance after normalization.

Substitution Lengths
We investigate the degree of deviation from clock-like behavior of each type of substitution by first checking the substitution lengths of each type. If the rate of a substitution type is constant over time, then the root-to-tip substitution lengths for that type are expected to be similar for all lineages. The substitution lengths for context-independent and context-dependent substitution types were estimated from a sample of substitution histories from PhyloBayes-MPI. The root-to-tip substitution lengths for all lineages were computed and normalized so that the average root-to-tip substitution length within the same substitution type is 1.
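The normalization of root-to-tip substitution lengths can be sketched as follows for one substitution type. This is a minimal illustration under stated assumptions: the three-tip tree, its node numbering, and the branch values are hypothetical, and the use of the sample variance (rather than the population variance) for the spread after normalization is an assumption of the sketch. Each branch is indexed by the node at its end, as in the text.

```python
from statistics import mean, variance


def root_to_tip_lengths(branch_length, parent, tips, root):
    """Sum substitution lengths along the path from each tip up to the root.

    branch_length[node]: substitution length of the branch ending at `node`.
    parent[node]:        parent of `node` (the root has no entry).
    """
    totals = {}
    for tip in tips:
        node, total = tip, 0.0
        while node != root:
            total += branch_length[node]
            node = parent[node]
        totals[tip] = total
    return totals


def normalize(totals):
    """Rescale so that the average root-to-tip length is 1."""
    average = mean(totals.values())
    return {tip: value / average for tip, value in totals.items()}


# Hypothetical toy tree: tips 0, 1, 2; internal node 3 joins tips 0 and 1;
# node 4 is the root.
toy_parent = {0: 3, 1: 3, 2: 4, 3: 4}
toy_length = {0: 0.1, 1: 0.3, 2: 0.5, 3: 0.2}

totals = root_to_tip_lengths(toy_length, toy_parent, tips=[0, 1, 2], root=4)
normalized = normalize(totals)
spread = variance(normalized.values())  # sample variance after normalization
```

Under a strict clock, all normalized values would equal 1 and the spread would be 0, so larger values of the spread indicate stronger departures from clock-like behavior for that substitution type.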
The normalized root-to-tip substitution lengths are summarized in figures 3 and 4. These figures were inspired by corresponding figures from Hwang and Green (2004). Among all lineages, the root-to-marmoset and root-to-squirrel monkey substitution lengths are always the greatest, regardless of substitution type. Also, the root-to-human substitution lengths tend to be among the smallest. Species within the same clade share branches in their root-to-tip paths, and therefore their root-to-tip substitution lengths are correlated. In addition, these species with similar root-to-tip substitution lengths have similar generation times and body sizes. This observation is consistent with the generation-time effect such that rates of lineages with relatively shorter generation times are elevated. Even though this phenomenon is relatively minor for CpG transitions (Type 9), it is consistent with some transitions affecting CpG sites not being associated with methylation (e.g., replication errors).

The average root-to-tip substitution lengths for each substitution type before normalization are related to the rate of each type of substitution. For context-independent substitutions, transition rates (Type 5 and Type 6) are higher than transversion rates (Type 1–Type 4), whereas Type 5 (G → A and C → T) is the highest of all (see fig. 3). This observation is consistent with the hypothesis that mutation is biased toward A + T content (Sueoka 1988, 1992). For context-dependent substitutions, Type 9 (CpG transitions) has the highest rate, whereas Type 3 (non-CpG T → A and A → T) and Type 4 (non-CpG T → G and A → C) have the lowest rates (see fig. 4). Moreover, CpG sites are substitution hotspots so that the rates at CpG sites are accelerated. The CpG transition rate (Type 9) is much higher than non-CpG transition rates (Type 5 and Type 6). Furthermore, CpG transversion rates (Type 7 and Type 8) are also much higher than non-CpG transversion rates (Type 1 and Type 2). Kong et al. (2012) suggested that the high CpG transversion rate stems not only from hypermutable CpG sites but also from mutational bias favoring changes that decrease G + C content. In fact, our estimated CpG transversion rates are comparable to non-CpG transition rates.

FIG. 4. Normalized root-to-tip substitution lengths for each type of context-dependent substitution. Types are defined in table 2. Root-to-tip substitution lengths are normalized so that within-type average substitution length is 1. The normalized root-to-tip substitution lengths are indicated with different colors for different tips. The average root-to-tip substitution length for each type before normalization is reported as well as the variance after normalization.

The spread of the root-to-tip substitution lengths (quantified by the variance after normalization) reflects the degree of deviation from clock-like behavior. Among all context-dependent substitution types, Type 9 (CpG transitions) has the smallest variance after normalization, suggesting that it is the most clock-like (see fig. 4). Following CpG transitions, Type 8 (CpG G → T and C → A) has the second smallest variance after normalization. Despite these patterns, one should not conclude that CpG sites are more clock-like than non-CpG sites because the other transversion type for CpG sites, Type 7 (CpG G → C and C → G), has a large normalized variance for the root-to-tip substitution lengths. In addition, Type 1 (non-CpG G → C and C → G) and Type 6 (non-CpG A → G and T → C) have relatively small variances, meaning that they are more clock-like than other types. In fact, Type 1 (G → C and C → G) and Type 6 (A → G and T → C) are also the most clock-like among all context-independent single-nucleotide substitutions. This result does not agree with the observations of Kim et al.
(2006) that transversions exhibit less generation-time effect and are more clock-like than transitions.

Divergence Time and Substitution Rate Estimates

We performed a variety of divergence time analyses with the primate data and the Multidivtime software. For each of the six context-independent substitution types and for each of the nine context-dependent types, we inferred divergence times using only that type. We also estimated times when the six substitution types were jointly considered with each type having its own relaxed clock. Likewise, a joint analysis of the nine context-dependent types was performed. To constrain the ingroup root time to the value suggested by the TimeTree database, all analyses adopted a gamma prior with mean 44.2 My and standard deviation 0.1 My for the ingroup root. Divergence time estimates and 95% credible intervals for the other ingroup internal nodes are listed in table 3 (context-independent substitutions) and table 4 (context-dependent substitutions). Because the ingroup root time had a tight prior, the estimated times for it are not reported. Except for the external information represented by the tight prior on the ingroup root time, no other calibration information was employed in our analyses.

Substitution types that are relatively clock-like are expected to give better divergence time estimates because rates on different branches will be highly correlated and therefore information about rates can be shared across branches. This prediction is consistent with the divergence time estimates obtained from CpG transitions only, which, judged against the divergence times reported by TimeTree, outperform most other analyses in terms of precision and accuracy. On the other hand, substitution types that occurred less frequently (e.g., CpG transversions) would have greater uncertainty associated with the estimated substitution lengths, and thus their divergence time estimates have greater uncertainty as well.
The divergence time estimates from the joint analysis for context-dependent substitutions are dominated by less clock-like non-CpG substitutions, because non-CpG sites represent about 99% of the data. If the proportion of clock-like substitution types in the sequence alignment were to increase, the divergence time estimates from this joint analysis would presumably improve.

Table 3. Divergence Time Estimates for Single Nucleotide Substitutions.

Analysis                Node 8 (18.8)       Node 9 (8.8)        Node 10 (11.5)      Node 11 (15.1)      Node 12 (18.8)      Node 13 (29.6)
Multidivtime analyses
  Prior                 21.9 (1.1, 43.1)    11.0 (0.4, 31.4)    22.1 (4.0, 40.1)    11.0 (0.3, 31.2)    22.0 (4.1, 40.1)    33.1 (12.3, 43.8)
  Type 1                20.2 (16.4, 24.5)   6.2 (4.6, 8.3)      8.9 (6.9, 11.6)     17.5 (14.8, 20.9)   19.4 (16.7, 23.0)   29.9 (27.2, 33.1)
  Type 2                19.0 (15.0, 24.6)   6.2 (3.8, 10.9)     9.1 (6.0, 15.2)     17.1 (12.9, 23.4)   18.3 (13.8, 24.8)   29.6 (25.6, 34.3)
  Type 3                18.8 (15.4, 23.1)   6.6 (4.5, 10.2)     10.0 (7.2, 14.7)    16.6 (12.8, 21.7)   18.7 (14.6, 23.9)   29.8 (26.1, 33.8)
  Type 4                21.9 (17.1, 28.6)   5.8 (4.1, 8.5)      10.0 (7.4, 13.9)    18.1 (14.2, 23.3)   20.3 (16.2, 25.7)   31.1 (27.6, 35.0)
  Type 5                25.2 (20.0, 31.6)   8.1 (6.6, 10.1)     12.3 (10.3, 14.9)   17.0 (14.6, 19.9)   18.6 (16.1, 21.5)   31.0 (28.4, 33.8)
  Type 6                25.1 (20.0, 31.8)   6.7 (5.5, 8.4)      11.0 (9.1, 13.3)    18.8 (16.3, 21.7)   20.8 (18.2, 23.8)   31.5 (29.0, 34.1)
  Joint relaxed clocks  19.8 (18.0, 21.7)   6.2 (5.5, 6.9)      9.7 (8.8, 10.8)     17.7 (16.4, 19.1)   19.5 (18.1, 20.9)   30.2 (29.0, 31.4)
  Strict clock          22.8 (22.3, 23.3)   5.8 (5.5, 6.0)      9.1 (8.8, 9.4)      14.1 (13.8, 14.4)   15.9 (15.5, 16.2)   26.3 (25.8, 26.7)
  Shared relaxed clock  17.7 (15.8, 20.1)   6.4 (5.0, 8.5)      10.0 (8.0, 13.0)    19.3 (16.5, 22.8)   21.4 (18.4, 24.9)   31.0 (28.5, 33.7)
BEAST analyses
  Prior                 13.5 (0.0, 37.2)    5.6 (0.0, 17.7)     14.1 (0.2, 32.2)    5.8 (0.0, 18.2)     14.4 (0.4, 32.3)    27.9 (10.1, 44.2)
  Strict clock          23.4 (22.9, 23.9)   5.4 (5.2, 5.6)      8.5 (8.2, 8.8)      13.1 (12.8, 13.5)   14.7 (14.4, 15.1)   24.4 (23.9, 24.8)
  UCLD                  23.2 (18.3, 27.9)   5.6 (4.3, 7.0)      8.8 (7.2, 10.6)     13.9 (10.9, 16.8)   15.4 (12.3, 18.4)   24.4 (21.0, 28.0)

NOTE: Posterior means and 95% credible intervals of divergence times in units of millions of years. The nodes are labeled as in figure 2 and the divergence times in parentheses next to the node labels are obtained from TimeTree (Hedges et al. 2006; Kumar and Hedges 2011). The types of the single nucleotide substitutions are defined in table 1. Rows for types 1–6 are the results from Multidivtime where each type of single nucleotide substitution was analyzed separately. The “Joint Relaxed Clocks” row has the results from Multidivtime with all substitution types analyzed jointly, each type having its own relaxed clock. The “Strict Clock” and “Shared Relaxed Clock” rows represent conventional analyses with an HKY model plus discrete-gamma heterogeneity with four categories in Multidivtime. The last three rows are the results from BEAST (with a strict clock and with an uncorrelated lognormal relaxed clock using a GTR model plus discrete-gamma heterogeneity with four categories) when substitution types were not allowed to have different clocks.

Table 4. Divergence Time Estimates for Context-Dependent Substitutions.

Analysis                Node 8 (18.8)       Node 9 (8.8)        Node 10 (11.5)      Node 11 (15.1)      Node 12 (18.8)      Node 13 (29.6)
Prior                   21.9 (1.1, 43.1)    11.0 (0.4, 31.4)    22.1 (4.0, 40.1)    11.0 (0.3, 31.2)    22.0 (4.1, 40.1)    33.1 (12.3, 43.8)
Type 1                  20.5 (16.6, 25.1)   6.2 (4.6, 8.5)      9.1 (7.1, 12.0)     17.8 (15.1, 21.4)   19.8 (17.0, 23.5)   30.5 (27.6, 33.7)
Type 2                  18.8 (14.9, 24.4)   6.3 (3.9, 10.7)     9.5 (6.3, 15.3)     17.6 (13.4, 23.5)   18.9 (14.4, 24.8)   30.2 (26.2, 34.64)
Type 3                  18.7 (15.4, 23.0)   6.6 (4.5, 9.9)      10.0 (7.2, 14.5)    16.6 (12.8, 21.7)   18.7 (14.6, 23.9)   29.8 (26.2, 33.7)
Type 4                  22.0 (17.1, 28.4)   5.8 (4.1, 8.5)      10.0 (7.4, 13.7)    18.0 (14.1, 23.0)   20.2 (16.2, 25.3)   31.1 (27.6, 34.8)
Type 5                  26.1 (19.9, 33.9)   7.8 (6.2, 10.4)     12.0 (9.8, 15.1)    16.9 (14.2, 20.3)   18.4 (15.6, 21.9)   30.9 (28.1, 34.1)
Type 6                  25.1 (20.0, 31.6)   6.7 (5.5, 8.4)      11.0 (9.1, 13.3)    18.8 (16.3, 21.7)   20.8 (18.2, 23.9)   31.5 (29.0, 34.1)
Type 7                  16.5 (4.1, 33.8)    6.0 (0.8, 16.0)     8.8 (2.4, 21.6)     14.2 (5.1, 27.0)    16.1 (6.8, 29.5)    24.3 (12.8, 39.7)
Type 8                  16.9 (5.7, 31.1)    5.0 (0.8, 13.5)     6.6 (1.7, 16.9)     11.0 (3.2, 21.8)    12.9 (5.1, 24.7)    22.9 (12.0, 37.2)
Type 9                  20.7 (16.8, 28.0)   8.6 (6.0, 12.7)     13.0 (9.5, 18.3)    16.9 (13.1, 21.2)   18.9 (15.0, 23.3)   31.1 (27.0, 36.1)
Joint relaxed clocks    19.7 (18.1, 21.5)   6.2 (5.6, 6.9)      9.7 (8.8, 10.6)     17.5 (16.3, 18.7)   19.2 (18.0, 20.6)   30.1 (29.0, 31.2)

NOTE: Posterior means and 95% credible intervals of divergence times in units of millions of years. The nodes are labeled as in figure 2 and the divergence times in parentheses next to the node labels are obtained from TimeTree (Hedges et al. 2006; Kumar and Hedges 2011) and listed in table 3. The types of the context-dependent substitutions are defined in table 2. Rows for types 1–9 are the results from Multidivtime where each type of context-dependent substitution was analyzed separately. The last row has results from Multidivtime with context-dependent substitutions treated jointly, each type having its own relaxed clock.

We also performed analyses with the BEAST software (see table 3). However, BEAST and Multidivtime implement quite different treatments for the prior distributions of rates and times, and therefore the differences in the resulting inferences could have a variety of causes.
The BEAST analysis with a strict clock results in tight 95% credible intervals for all times, but the posterior means depart from the times according to TimeTree (Hedges et al. 2006; Kumar and Hedges 2011). The posterior means of divergence times from the BEAST lognormal uncorrelated relaxed clock model analysis are similar to those from the strict clock analysis, but allowing rate variation among branches makes the 95% credible intervals wider. On the other hand, the posterior means of divergence times with a strict clock from BEAST and from Multidivtime are similar. This is not the case when an autocorrelated lognormal relaxed clock is considered (“Shared Relaxed Clock” in table 3).

Considering the analyses that were done in the conventional way by not having separate clocks for separate substitution types and also considering the analyses that combined all substitution types but let each have its own relaxed clock and its own independent rate trajectory, there is a disquieting tendency to produce narrow credible intervals that do not include the divergence times reported by TimeTree. This arises both for analyses performed by Multidivtime and by BEAST. We do not attribute these issues to the estimates from TimeTree. It would probably be desirable to allow different substitution types to have different relaxed clocks but to permit these relaxed clocks to change in a correlated way. However, it is unclear what the primary cause of the misleading and overly narrow credible intervals is.

Table 5. Average Estimated Substitution Rates among Nodes for Each Type of Context-Dependent Substitution.

Type    Mean    SD
1       2.0     0.3
2       2.7     0.5
3       1.6     0.2
4       1.6     0.3
5       7.8     1.1
6       5.9     1.0
7       3.8     1.6
8       5.3     2.3
9       51.3    6.9

NOTE: The posterior distributions of substitution rate per site per 10^10 years were estimated with Multidivtime for each type of context-dependent substitution for the case where all node times are tightly constrained to the values reported by the TimeTree database. The average among nodes of the posterior mean rate and the average among nodes of the posterior standard deviation of the rate are reported for each type of context-dependent substitution.

In order to exploit the best available estimates of divergence times so that rate variation over time among substitution types could be examined, we performed additional analyses where tight constraints were placed around the divergence times reported by TimeTree for all internal nodes. Using these tight constraints, we reestimated the divergence times and substitution rates from Multidivtime by separately analyzing each type of context-dependent substitution. The averages among nodes of the posterior means of chronological substitution rates and the averages among nodes of the posterior standard deviations of substitution rates are reported in table 5. The chronological substitution rates for CpG transitions (Type 9) are much higher than rates for other types of substitutions. In fact, the magnitudes of chronological substitution rates for each type of context-dependent substitution estimated by Multidivtime closely correspond to the average root-to-tip substitution lengths reported in figure 4. To better visualize the pattern of rate change over time, the inferred substitution rates for each node and each substitution type were normalized so that the within-type average among nodes is 1. Doing this shows that the inferred rates among nodes are the most constant for CpG transitions (fig. 5; see also supplementary fig. S1, Supplementary Material online, for the corresponding plot of context-independent substitutions). Figure 5 is consistent with the variance of the normalized root-to-tip substitution lengths in figure 4.
Because CpG sites are rare in the sequence alignment (around 1%) and the rates of transversions are relatively low compared with transitions, the inferred normalized substitution rates for Type 7 and Type 8 are associated with large uncertainty.

Simulation

To evaluate our method, especially the assumption that the distribution of substitution histories is robust to substitution model specification, we simulated ten data sets with a program that we wrote (details in Materials and Methods section). Each simulated data set used the observed human sequence as the root and the same tree topology as assumed for the real data. Sequences were simulated with the estimated substitution lengths for each context-dependent substitution type on each branch. The simulated data sets have about the same proportions of CpG, non-CpG C + G, and non-CpG A + T sites as the actual data. Simulated data were analyzed in the same way as the observed data.

To compare simulated and actual data, we estimated branch lengths from a sample of substitution histories by counting the number of substitutions on each branch. The branch lengths estimated from the actual and simulated data are similar and are listed in table 6. We also used the PAML software (Yang 2007) to estimate branch lengths. Those estimates are similar to our estimates from sampling substitution histories (results not shown). Root-to-tip substitution lengths for each context-dependent substitution type are compared between the actual and an arbitrarily selected simulated data set in figure 6. The substitution lengths are generally consistent between the real and the simulated data, except that the root-to-tip substitution lengths for CpG transitions from the simulated data are slightly less than those from the actual data.
As the CpG transition rate is high, these changes may tend to occur earlier on branches than would be indicated by a sample of substitution histories obtained with a model that did not permit special treatment of CpG transitions. This could lead to CpG transition substitution lengths being underestimated. In general, the simulations suggest that substitution histories are relatively insensitive to misspecification of the substitution model.

Discussion

Endpoint-conditioned sampling of substitution histories has received increasing attention in statistical and molecular evolutionary literature (Nielsen 2002; Hobolth and Jensen 2005, 2011; Rodrigue et al. 2005, 2006, 2008; Lartillot 2006; Mateiu and Rannala 2006; Hobolth 2008; Minin and Suchard 2008; Hobolth and Stone 2009; Tataru and Hobolth 2011). This approach can be viewed as a data augmentation technique and has been utilized to improve MCMC mixing for sampling from the full posterior distribution. Alternatively, sampling from the posterior distribution of substitution histories can sometimes be employed to obtain maximum-likelihood estimates (e.g., Rodrigue et al. 2007).

Our somewhat crude approximation is simply to sample mappings $M^{(1)}, \ldots, M^{(C)}$ with a context-independent model and to then assume that these histories are actually sampled from the context-dependent distribution $\Pr(M \mid D, \lambda_{ab})$. We then estimate $\lambda_{abj}$ as $(1/C) \sum_{c=1}^{C} N_{abj}^{(c)} / a_{j}^{(c)}$. Although this estimator may be less promising than $\left(\sum_{c} N_{abj}^{(c)}\right) / \left(\sum_{c} a_{j}^{(c)}\right)$, the sampling distribution of our estimator is asymptotically normal with its variance approximated by the law of total variance. Our approach also has the advantages of computational feasibility and ease of implementation. In addition, our approach allows other types of context-dependent rates not included in this study to be investigated.
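The two estimators contrasted above can be compared in a minimal Python sketch. For a single substitution type, each sampled mapping contributes a substitution count and a summed dwell proportion; the first function averages the per-mapping ratios and the second divides the pooled count by the pooled dwell proportion. The function names and the input values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def mean_of_ratios(counts, dwell):
    """Average of per-mapping ratios N^(c) / a^(c) over the C sampled mappings."""
    if not counts or len(counts) != len(dwell):
        raise ValueError("counts and dwell must be non-empty and equally long")
    return sum(n / a for n, a in zip(counts, dwell)) / len(counts)


def ratio_of_sums(counts, dwell):
    """Pooled alternative: total count divided by total dwell proportion."""
    if not counts or len(counts) != len(dwell):
        raise ValueError("counts and dwell must be non-empty and equally long")
    return sum(counts) / sum(dwell)


# Hypothetical values for C = 2 sampled mappings (illustrative only).
counts = [2, 4]        # substitution counts in each mapping
dwell = [1.0, 4.0]     # summed dwell proportions in each mapping

estimate_a = mean_of_ratios(counts, dwell)   # (2/1 + 4/4) / 2 = 1.5
estimate_b = ratio_of_sums(counts, dwell)    # (2 + 4) / (1 + 4) = 1.2
```

The two estimators agree when the dwell proportions are identical across mappings and diverge as the dwell proportions vary, which is one way to gauge how much the choice between them matters for a given sample.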
A shortcoming of our approach is that the independent-sites model for sampling substitution histories differs from the context-dependent substitution model for which rates are estimated from the substitution histories. For the primate data analyzed here, this shortcoming does not appear to have serious consequences (e.g., see table 6 and fig. 6). However, problems would be more severe if branches were longer so that sampled substitution histories would be more sensitive to the model employed for sampling substitution histories.

A future direction that might overcome the shortcoming would be to apply importance sampling (e.g., Liu 2001) to the substitution histories. The idea is that some of the sampled substitution histories would become relatively more likely and some would become relatively less likely when evaluated by the context-dependent substitution rates that we have estimated. By evaluating the ratio of the probability density of each sampled substitution history according to the context-dependent rates versus according to the independent-sites model used for sampling, an importance weight could be assigned to each history in the sample. The importance weights associated with each sampled history could then be employed to derive new estimates of the summed dwell proportions $a_j$ and the substitution counts $N_{abj}$. These would lead to new values for the inferred context-dependent substitution rates. The iterative process of reweighting substitution histories and then estimating new context-dependent substitution rates could continue until the rate estimates converged.

An advantage of this envisioned importance sampling procedure is that a computationally feasible independent-sites model could be employed for obtaining a single sample of substitution histories. This would presumably be more computationally tractable than implementing a context-dependent model directly into an MCMC procedure that lets the clocks of different substitution types relax to different degrees. Although importance sampling tends to be less successful when the distribution used for sampling differs enough from the distribution of interest as to make all or almost all samples particularly improbable according to the distribution of interest, the envisioned approach could be configured to exploit the fact that the substitution history of a sequence consists of the substitution histories of its sites. In other applications, sampling substitution histories of individual sites from independent-sites models has been successfully employed for inference with dependent-sites models (reviewed in Hobolth and Thorne 2014).

FIG. 5. Estimated substitution rates at nodes for each type of context-dependent substitution with node times tightly constrained to values reported by TimeTree. Rates were normalized so that within-type average among nodes is 1. Vertical bars indicate 95% credible intervals. [Panels: Type 1 to Type 9. x-axis: Node; y-axis: Normalized substitution rate at node.]

Table 6. Comparison of the Branch Lengths from the Actual and the Average of Ten Simulated Data Sets.

Branch    Actual    Simulated    SD
0         0.0181    0.0186       0.0005
1         0.0199    0.0202       0.0005
2         0.0226    0.0230       0.0005
3         0.0086    0.0088       0.0003
4         0.0088    0.0089       0.0002
5         0.0142    0.0145       0.0007
6         0.0379    0.0396       0.0008
7         0.0371    0.0383       0.0006
8         0.0461    0.0482       0.0016
9         0.0049    0.0050       0.0002
10        0.0297    0.0312       0.0016
11        0.0020    0.0027       0.0021
12        0.0122    0.0131       0.0018
13        0.0183    0.0185       0.0011

NOTE: The branches are labeled in figure 2. The branch lengths are estimated from a sample of substitution histories by counting the number of substitutions for each branch.

FIG. 6. Comparison between real and simulated data of context-dependent substitution lengths. Substitution types are defined in table 2 and labeled with different symbols. Each point represents the log root-to-tip substitution length in the real and the simulated data set for a particular substitution type and a particular tip. [x-axis: log root-to-tip substitution lengths from the real data; y-axis: log root-to-tip substitution lengths from the simulated data.]

In accord with earlier studies, our results indicate that CpG transitions occurred at a higher rate and are more clock-like than other types of substitutions in primates. Based on the TimeTree values, the CpG transition analysis performs relatively well in estimating the primate divergence times compared with other analyses. In addition, the degree to which other substitution types change rates over time can be diverse. This suggests that divergence time analyses should not assume that all substitution types change rates over time in the same way.

Our strategy is to define different substitution types and then to assume that each substitution type has a relaxed clock that changes rate independently of all other substitution types. Based on our “joint relaxed clocks” result for context-dependent substitutions (see table 4), we suspect this would be a poor strategy if the selected number of substitution types is too large. There are parallel issues in multilocus divergence time analyses. Some factors affecting change in evolutionary rates are lineage-specific whereas others are gene-specific. Ideally, relaxed clock analyses would accommodate both gene-specific and lineage-specific tendencies to change rates. Failure to consider both gene-specific and lineage-specific tendencies may yield multilocus divergence time estimates that are overly precise (Thorne and Kishino 2002; Zhu et al. 2014). Similarly, relaxed clocks of different substitution types may change in a correlated way due to lineage-specific factors. Although our technique has the desirable feature of letting substitution types vary rates over time in different ways, a presumably better treatment would facilitate correlated rate changes over time among substitution types.

Materials and Methods

Data

We explored the clock-like behavior of CpG transitions using a subset of data previously studied by Kim et al. (2006). We intentionally concentrated on genomic data with little evidence for important biological function so that rate variation over time among substitution types was less apt to be influenced by natural selection. We obtained sequence data of humans and eight other primate species for a region that is orthologous to human Chromosome 16 (hg19.chr16:60842337-61130970) according to the Multiz alignment of 100 vertebrates in the UCSC (University of California, Santa Cruz) genome browser (Kent et al. 2002; Blanchette et al. 2004). Throughout, the nine species of primates analyzed here are assumed related through the phylogenetic tree in figure 2 with bushbaby as the outgroup species. These nine species yield a phylogeny for which branches are long enough to yield substantial information about substitution lengths, but short enough to lessen the chance of multiple substitutions per site and avoid serious concerns about alignment uncertainty.

CpG islands are often free of methylation (Bird 1986) and not hypermutable. Therefore, the substitution process for these regions differs from that of CpG dinucleotides in the rest of the genome. For this reason, CpG islands were removed from the analysis.
Removed stretches had GC content of 50% or greater along 200 or more base pairs with an observed/expected CpG content greater than 0.6. Similarly, positions in the exons and repetitive elements identified by RepeatMasker (Smit et al. 1996–2010) were excluded from the analyses. The final data set contains approximately 115 kb for each species and the proportion of CpG sites is around 1%.

Settings in PhyloBayes-MPI

Both the observed and the simulated DNA sequence data were analyzed by having 312 substitution histories sampled for every site. To generate the substitution histories with a modified version of the PhyloBayes-MPI 1.5a software, we used the CAT–GTR (general time reversible) model (see Tavare 1986; Lartillot and Philippe 2004), with a four-category discretized gamma distribution to accommodate rate heterogeneity among sites (Yang 1994). The discrete-gamma distribution of rate variation among sites was parameterized by a shape parameter with an exponential prior of mean 1. All prior distribution settings for branch lengths and for the CAT–GTR model were the PhyloBayes-MPI default values. We also obtained substitution histories through the GTR model with four discrete-gamma categories and obtained results similar to those from the CAT–GTR model (results not shown).

Divergence Time Estimation in Multidivtime and BEAST

The divergence times and substitution rates for the phylogeny in figure 2 were estimated by considering one substitution type per analysis and by jointly considering the collection of strand-symmetric context-dependent substitution types. The divergence times were also estimated from Multidivtime and BEAST (Drummond et al. 2006, 2012) with conventional analyses that have all substitution types change rate in the same way on each branch. For all conventional analyses, the Hasegawa–Kishino–Yano (HKY) model (Hasegawa et al.
1984) and discrete-gamma rate heterogeneity among sites with four categories (Yang 1994) were adopted for Multidivtime analyses. For BEAST analyses, the nucleotide substitution model was the GTR model (Tavare 1986) instead. The Multidivtime and BEAST analyses were run for between 2 × 10^7 and 10^9 iterations in order to achieve convergence. Each case was run twice to ensure convergence. The first 10% of the samples were treated as burn-in and not included in the posterior approximations. To enhance MCMC convergence, we removed the outgroup species in the BEAST analyses and placed a prior on the ingroup root time. The tree topology was considered known and fixed for all Multidivtime and BEAST analyses. For the purposes of comparison, the root time prior for all analyses was tight. It was gamma distributed with mean 44.2 My and standard deviation 0.1 My.

For all Multidivtime analyses except the “strict clock,” the lognormal autocorrelated relaxed clock model (Kishino et al. 2001) was employed. The root rate had a gamma prior with mean and standard deviation 0.0012. The actual marginal priors on divergence times were formed by the combination of the root time prior and the generalized Dirichlet distribution of Kishino et al. (2001). The autocorrelation between rates on adjacent branches is controlled by the parameter ν (ν ≥ 0). When ν is 0, rates are forced to be the same for all branches and a strict clock results. Rates are less correlated with larger ν. The autocorrelation parameter had a gamma prior with mean and standard deviation of 0.023, which is the inverse of the prior mean of the root time.

For BEAST analyses, either the strict clock or the uncorrelated lognormal relaxed clock was selected. For the strict clock case, the single rate of the phylogeny had the same prior as the root rate prior in Multidivtime analyses.
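The gamma priors above are specified by a mean and a standard deviation. As a worked illustration, the following Python sketch converts such a specification into shape and rate parameters by moment matching (mean = shape/rate, variance = shape/rate^2). The function name is hypothetical, and the shape-and-rate parameterization is an assumption made for illustration rather than a description of how Multidivtime or BEAST store these priors internally.

```python
def gamma_shape_rate(mean, sd):
    """Moment-match a gamma distribution to a given mean and standard deviation.

    Uses mean = shape / rate and variance = shape / rate**2.
    """
    if mean <= 0 or sd <= 0:
        raise ValueError("mean and sd must be positive")
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate


# Ingroup root time prior: mean 44.2 My, standard deviation 0.1 My.
root_time_shape, root_time_rate = gamma_shape_rate(44.2, 0.1)

# Root rate prior: mean and standard deviation both 0.0012.
root_rate_shape, root_rate_rate = gamma_shape_rate(0.0012, 0.0012)

# Autocorrelation parameter prior: mean and standard deviation both 0.023.
nu_shape, nu_rate = gamma_shape_rate(0.023, 0.023)
```

Two points follow from the arithmetic. A gamma prior whose mean equals its standard deviation has shape 1, so the root rate and autocorrelation priors are exponential distributions. The root time prior, by contrast, has a shape near 195,364 and is therefore very tightly concentrated around 44.2 My.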
For the uncorrelated BEAST case, the mean rate (ucld.mean) had the root rate prior of Multidivtime and the standard deviation of the mean rate (ucld.stdev) had an exponential prior with mean 0.33. A birth–death process was used for the priors on divergence times in BEAST analyses. The mean growth rate had a uniform (0, 100,000) prior and the relative death rate had a uniform (0, 1) prior. Rate heterogeneity among sites was modeled by a four-category discrete-gamma rate model with an exponential prior of mean 0.5 for the gamma shape parameter.

Simulation

The program for simulating sequences takes a root sequence, a fixed tree topology, and rate matrices for different branches as inputs. There are two rate matrices (one rate matrix for non-CpG sites and one for CpG sites) for each branch. We determined the rate matrices to use for each branch from the substitution lengths that were estimated for each branch from the real data analyses. In essence, our program simulates each branch on the phylogenetic tree by alternating between randomly sampling exponentially distributed waiting times between events and then randomly determining the sequence position and substitution type that occurs at each event. Further details for using a starting sequence and rate matrices to simulate waiting times of a Markov chain can be found in Yang (2014).

Supplementary Material

Supplementary figure S1 is available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

The authors thank two anonymous reviewers for their help. H.-J.L. and J.L.T. were supported by the National Institutes of Health (NIH) grants GM090201 and GM070806. N.R. was supported by the Natural Sciences and Engineering Research Council of Canada. The software developed for this research is freely available on https://github.com/HuiJieLee.

References

Aris-Brosou S, Yang Z. 2003.
Bayesian models of episodic evolution support a late Precambrian explosive diversification of the Metazoa. Mol Biol Evol. 20(12):1947–1954.
Bird A. 1986. CpG-rich islands and the function of DNA methylation. Nature 321(6067):209–213.
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14(4):708–715.
Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4(5):699–710.
Drummond AJ, Suchard MA, Xie D, Rambaut A. 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 29(8):1969–1973.
Hasegawa M, Yano T, Kishino H. 1984. A new molecular clock of mitochondrial DNA and the evolution of hominoids. Proc Jpn Acad. B(60):95–98.
Hedges SB, Dudley J, Kumar S. 2006. TimeTree: a public knowledgebase of divergence times among organisms. Bioinformatics 22(23):2971–2972.
Hobolth A. 2008. A Markov chain Monte Carlo expectation maximization algorithm for statistical analysis of DNA sequence evolution with neighbor-dependent substitution rates. J Comput Graph Stat. 17(1):138–162.
Hobolth A, Jensen J. 2005. Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Stat Appl Genet Mol Biol. 4, Article 18.
Hobolth A, Jensen JL. 2011. Summary statistics for endpoint-conditioned continuous-time Markov chains. J Appl Probab. 48(4):911–924.
Hobolth A, Stone EA. 2009. Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. Ann Appl Stat. 3(3):1204–1231.
Hobolth A, Thorne JL. 2014. Sampling and summary statistics of endpoint-conditioned paths in DNA sequence evolution. In: Chen M-H, Kuo L, Lewis PO, editors. Bayesian phylogenetics: methods, algorithms, and applications. Chapman and Hall/CRC.
Huelsenbeck J, Larget B, Swofford D. 2000. A compound Poisson process for relaxing the molecular clock. Genetics 154(4):1879–1892.
Hwang D, Green P. 2004. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A. 101(39):13994–14001.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12(6):996–1006.
Kim S-H, Elango N, Warden C, Vigoda E, Yi SV. 2006. Heterogeneous genomic molecular clocks in primates. PLoS Genet. 2(10):1527–1534.
Kimura M. 1968. Evolutionary rate at the molecular level. Nature 217(5129):624–626.
Kimura M. 1983. The neutral theory of molecular evolution. Cambridge University Press.
Kishino H, Thorne JL, Bruno WJ. 2001. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol Biol Evol. 18(3):352–361.
Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, et al. 2012. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488(7412):471–475.
Kumar S, Hedges SB. 2011. TimeTree2: species divergence times on the iPhone. Bioinformatics 27(14):2023–2024.
Lartillot N. 2006. Conjugate Gibbs sampling for Bayesian phylogenetic models. J Comput Biol. 13(10):1701–1722.
Lartillot N, Delsuc F. 2012. Joint reconstruction of divergence times and life-history evolution in placental mammals using a phylogenetic covariance model. Evolution 66(6):1773–1787.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.
Lartillot N, Rodrigue N, Stubbs D, Richer J. 2013. PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst Biol. 62(4):611–615.
Liu JS. 2001. Monte Carlo strategies in scientific computing. Springer.
Mateiu L, Rannala B. 2006. Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst Biol. 55(2):259–269.
Minin VN, Suchard MA. 2008. Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol. 56(3):391–412.
Nielsen R. 2002. Mapping mutations on phylogenies. Syst Biol. 51(5):729–739.
Ohta T. 1973. Slightly deleterious mutant substitutions in evolution. Nature 246(5428):96–98.
Peifer M, Karro JE, von Gruenberg HH. 2008. Is there an acceleration of the CpG transition rate during the mammalian radiation? Bioinformatics 24(19):2157–2164.
Rannala B, Yang Z. 2007. Inferring speciation times under an episodic molecular clock. Syst Biol. 56(3):453–466.
Rodrigue N, Lartillot N, Bryant D, Philippe H. 2005. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347(2):207–217.
Rodrigue N, Philippe H, Lartillot N. 2006. Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol. 23(9):1762–1775.
Rodrigue N, Philippe H, Lartillot N. 2007. Exploring fast computational strategies for probabilistic phylogenetic analysis. Syst Biol. 56(5):711–726.
Rodrigue N, Philippe H, Lartillot N. 2008. Uniformization for sampling realizations of Markov processes: applications to Bayesian implementations of codon substitution models. Bioinformatics 24(1):56–62.
Rodrigue N, Philippe H, Lartillot N. 2010. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci U S A. 107(10):4629–4634.
Sanderson M. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol Biol Evol. 14(12):1218–1231.
Sanderson M. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol Biol Evol. 19(1):101–109.
Smallwood S, Kelsey G. 2012. De novo DNA methylation: a germ cell perspective. Trends Genet. 28(1):33–42.
Smit AFA, Hubley R, Green P. 1996–2010. RepeatMasker Open-3.0.
Sueoka N. 1988. Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci U S A. 85(8):2653–2657.
Sueoka N. 1992. Directional mutation pressure, selective constraints, and genetic equilibria. J Mol Evol. 34(2):95–114.
Tataru P, Hobolth A. 2011. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics 12:465.
Tavare S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. American Mathematical Society: Lectures on Mathematics in the Life Sciences, Vol. 17. p. 57–86.
Thorne J, Kishino H. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst Biol. 51(5):689–702.
Thorne JL, Kishino H, Painter IS. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol. 15(12):1647.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39(3):306–314.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24(8):1586–1591.
Yang Z. 2014. Simulating molecular evolution. In: Molecular evolution: a statistical approach. Oxford University Press.
Yang Z, Nielsen R. 2008. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol. 25(3):568–579.
Yoder A, Yang Z. 2000. Estimation of primate speciation dates using local molecular clocks. Mol Biol Evol. 17(7):1081–1090.
Zhu T, dos Reis M, Yang Z. 2014. Characterization of the uncertainty of divergence time estimation under relaxed molecular clock models using multiple loci. Syst Biol. 62(2):267–280.