Substitution Model Adequacy and Assessing the Reliability of Estimates of Virus Evolutionary Rates and Time Scales Sebastian Duch^ene,y,1 Francesca Di Giallonardo,y,1 and Edward C. Holmes*,1 1 Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia y These authors contributed equally to this work. Associate editor: Sergei Kosakovsky Pond *Corresponding author: E-mail: [email protected]. Abstract Determining the time scale of virus evolution is central to understanding their origins and emergence. The phylogenetic methods commonly used for this purpose can be misleading if the substitution model makes incorrect assumptions about the data. Empirical studies consider a pool of models and select that with the highest statistical fit. However, this does not allow the rejection of all models, even if they poorly describe the data. An alternative is to use model adequacy methods that evaluate the ability of a model to predict hypothetical future observations. This can be done by comparing the empirical data with data generated under the model in question. We conducted simulations to evaluate the sensitivity of such methods with nucleotide, amino acid, and codon data. These effectively detected underparameterized models, but failed to detect mutational saturation and some instances of nonstationary base composition, which can lead to biases in estimates of tree topology and length. To test the applicability of these methods with real data, we analyzed nucleotide and amino acid data sets from the genus Flavivirus of RNA viruses. In most cases these models were inadequate, with the exception of a data set of relatively closely related sequences of Dengue virus, for which the GTR+G nucleotide and LG+G amino acid substitution models were adequate. Our results partly explain the lack of consensus over estimates of the long-term evolutionary time scale of these viruses, and indicate that assessing the adequacy of substitution models should be routinely used to determine whether estimates are reliable. Key words: substitution model, model adequacy, parametric bootstrap, posterior predictive simulation, Bayesian model averaging, virus evolution Introduction substitution as a reversible Markovian process. Second, the sampling time of the sequences, or calibration, must capture a sufficient number of substitutions. Errors in any of these two factors will lead to spurious estimates of rates and evolutionary time scales (recently reviewed by Ho and Duch^ene 2014). Despite the ongoing development of molecular clock methods and substitution models, their use in RNA viruses has sometimes proven contentious and there is substantial debate over the evolutionary time scales of many viruses (Holmes 2003; Jorba et al. 2008; Wertheim and Pond 2011; Duch^ene et al. 2014). For example, it has been estimated that viruses of the family Coronaviridae originated around 10,100 years before present using standard phylogenetic methods (Woo et al. 2012). However, subsequent studies using improved substitution models suggested a much older origin for these viruses, of the order of millions of years (Wertheim et al. 2013). A major reason for these discrepancies is that viruses accumulate mutations much faster than cellular organisms, such that mutational saturation can become apparent after a few years and which adversely affects analyses of divergence times (Holmes 2003; Duch^ene et al. 2014; Duch^ene, Ho, et al. 2015). Moreover, natural selection operates at different levels, such as within an infected host, between transmissions to different hosts, and during codivergence between host species (Belshaw et al. 2011; ß The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected] Mol. Biol. Evol. 33(1):255–267 doi:10.1093/molbev/msv207 Advance Access publication September 28, 2015 255 Article Estimating evolutionary rates and time scales in viruses is central to understanding their emergence and spread. This endeavor has been facilitated by the increasing availability of molecular sequence data, which can be combined with information about time to infer divergence times, effective population sizes (Ne), and key epidemiological parameters such as the basic reproductive number (R0). However, all of these estimates rely on molecular clock methods, the simplest of which posits that the accumulation of nucleotide or amino acid substitutions is constant through time (Zuckerkandl 1976; Fitch et al. 1991; Korber et al. 2000), known as the “strict” molecular clock. Under this assumption, substitution rates can be estimated simply by dividing the sampling times of a pair of virus samples by their genetic divergence. Similarly, the divergence time between any pair of samples can be estimated by dividing their genetic divergence by the evolutionary rate. More flexible approaches, known as “relaxed” molecular clocks, allow variation in the rate of evolution among lineages according to a statistical model (e.g., Sanderson 1997; Thorne et al. 1998; Drummond et al. 2006; Yang and Rannala 2006). Importantly, all molecular clock methods have some common requirements. First, genetic divergence must be accurately estimated, which is typically done using models that describe nucleotide or amino acid Duch^ene et al. . doi:10.1093/molbev/msv207 Sanjuan 2012), which also impacts estimates of divergence times. Statistical modeling of these processes is challenging, although there are ongoing efforts to describe these processes more accurately (Volz et al. 2009; K€uhnert et al. 2011; Wertheim and Pond 2011; Volz et al. 2013). The difficulty in analyzing the long-term evolutionary processes in viruses, particularly their divergence times, can to some extent be alleviated by choosing amino acid rather than nucleotide sequences. The lower rate of evolution in amino acid data is an attractive property that can allow resolution of deep time scales, where nucleotide sequences are largely saturated. However, amino acid sequence evolution is more difficult to model because the number of possible substitutions is far greater than that of nucleotide sequences. For example, the general-time reversible (GTR) nucleotide substitution model has nine parameters, whereas the equivalent for amino acids has 208. In practice, using amino acid models with free parameters is hindered by the information content in most data sets and the considerable computational burden. Therefore, it is common to use empirically determined amino acid models, in which the transitions rates and amino acid frequencies have fixed values. An obvious drawback of this approach is that the fixed parameter values of all available models might be incorrect for a given data sets, leading to potential biases in phylogenetic inference. Huelsenbeck et al. (2008) proposed some alternatives to using fixed parameter values for amino acid models in Bayesian analyses. One of these methods is Bayesian model averaging, in which the analysis samples a range of substitution matrices in proportion to their posterior probability. As a result, the estimates of tree topology and branch lengths are averaged over different fixed rate matrices, without the increased computational burden of using a large number of parameters. Another option is to use informative prior distributions on the parameters of the transition matrix such as a Dirichlet distribution, as a means of guiding their estimation. Codon models can also be used in place of nucleotide substitution models. These typically infer selective constraints by estimating the proportion of nonsynonymous (dN) to synonymous (dS) substitutions per site, known as the dN/dS ratio. One of the most widely used codon models, known as M3, specifies three dN/dS categories for codons, corresponding to those under purifying, neutral, and positive selection (Yang et al. 2000). Parametric codon models can be computationally demanding because they have even more parameters than amino acid models, although empirical or semiparametric methods have also been implemented (e.g., Schneider et al. 2005; Kosiol et al. 2007). Developing more realistic evolutionary models and selecting different sources of data will clearly be beneficial to resolve discrepancies in phylogenetic estimates in viruses. An arguably equally important task is to develop methods to assess the reliability of the estimates obtained from these models (Gatesy 2007; Gelman and Shalizi 2013). The most common approach is to consider a pool of models and to select that with the highest statistical fit for subsequent analysis. This is commonly done through likelihood ratio tests or information criteria under maximum-likelihood frameworks, or by 256 MBE estimating marginal likelihoods in Bayesian analyses (Posada and Buckley 2004; Sullivan and Joyce 2005; Xie et al. 2011). These methods have the limitation that they assume that the set of models considered contains the true model, or at least a model that adequately describes the data. However, even if none of the models considered are adequate descriptions of the data, they will still select the model with the highest statistical fit. In contrast, model adequacy methods can reject all of the models if they are inadequate for the data at hand (Bollback 2002; Ripplinger and Sullivan 2010; Brown 2014b). Substitution model adequacy methods for phylogenetics have implementations within maximum likelihood (Goldman 1993a, 1993b) and Bayesian frameworks (Bollback 2002). Most of these are similar to a parametric bootstrap in that they consist of simulating sequence data under the model in question and comparing the simulations with the empirical data using a summary statistic. Goldman (1993a) proposed using the -test statistic for maximum-likelihood analyses, which is the difference between the multinomial loglikelihood and the log-likelihood of the model in question. For Bayesian inference, the multinomial log-likelihood appears to perform well as the test statistic (Huelsenbeck et al. 2001; Bollback 2002), but it is often advisable to also use a 2 test statistic (Foster 2004). These methods have been shown to detect some conditions of nucleotide substitution model underparameterization (Whelan et al. 2001; Bollback 2002; Ripplinger and Sullivan 2010) and violations of model assumptions (Foster 2004). However, their performance has not been thoroughly evaluated under some conditions that might be particularly difficult to account for in virus data sets, such as base heterogeneity and mutational saturation (but see Foster 2004). Moreover, model adequacy methods have largely focused on nucleotide data, although they are critical for analyses of amino acid sequences for which frequently used models are more restrictive. In this study, we investigated the usefulness of model adequacy methods to detect misleading phylogenetic estimates in virus data sets with nucleotide, amino acid, and codon sequences. To that end we analyzed four publically available complete genome data sets of flaviviruses (genus Flavivirus, family Flaviviridae of positive-sense single-stranded RNA viruses), referred to herein as (a), (b), (c), and (d) (table 1 and supplementary material, Supplementary Material online). Data set (a) includes only tick-borne flaviviruses (Heinze et al. 2012), whereas data sets (b) (Moureau et al. 2015) and (c) (Pettersson and Fiz-Palacios 2014) include sequences from the four main viral groups (mosquito-borne, tick-borne, insect specific, and no-known vector). For comparison, we utilized a data set (d) that only includes sequences from one virus species infecting humans, Dengue virus type 2 (DENV-2), that is presumed to have emerged more recently than the viral diversity contained within data sets (a)–(c). Estimating the evolutionary time scale of the Flavivirus genus has attracted considerable attention due to the propensity of these viruses to cause human disease (e.g., dengue, yellow fever, and West Nile viruses) and their broad host range (vertebrates and invertebrates) (Gould et al. 2001). MBE Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207 Table 1. Flavivirus Data Sets Used in This Study, Nucleotide (NT) and Amino Acid (AA) Sequence Lengths in Alignments after Using Gblocks, Substitution Models Chosen According to the BIC, and the Range of Sampling Years. Data Set Reference Number of sequences AA model AA sequence length NT model NT sequence length Sampling year range (a) Heinze et al. (2012) 58 LG+ 3,324 GTR+ 10,056 1937–2008 (b) Moureau et al. (2015) 100 LG+ 2,463 GTR+ 8,856 1937–2010 However, there are wide discrepancies in the estimates of the time of origin of this genus, with recent estimates ranging from 20,000 (Moureau et al. 2015) to about 300,000 (Pettersson and Fiz-Palacios 2014) years before present. Resolving these discrepancies is difficult, especially in the light of evidence supporting different patterns of base composition bias among the major flavivirus taxa (Jenkins et al. 2001), which violate the assumption of stationarity implicit in most substitution models, and adversely affecting the use of many phylogenetic methods. For this reason, model adequacy methods can provide useful guidelines to understand potential sources of bias in some of these estimates. To illustrate the effect of substitution model violation, we also simulated data under different conditions and show how the estimates of tree topology and tree length were affected. These simulations also reveal the cases in which model adequacy methods can successfully detect misleading estimates, and specific situations under which they fail. Results Nucleotide Sequence Simulations and Sensitivity of the Goldman–Cox Test We simulated the evolution of nucleotide sequences along phylogenetic trees with 50 tips under different scenarios: 1) The Jukes–Cantor (JC) model; 2) the GTR model; 3) the GTR model with a G distribution for among-site rate variation (GTR+G); 4) a scenario of trees with the GTR+G model and mean branch lengths of 0.5 substitutions per site (subs/site), with a tree length of approximately 42 subs/site, resulting in considerable mutational saturation (referred to here as analyses with “long trees”); and 5) a scenario in which GC content varies by several fold along the tree, representing a nonstationary process (referred to here as “heterogeneous”). We analyzed these data under the JC, GTR, and the GTR+G models (fig. 1). To evaluate whether these scenarios resulted in biased estimates, we compared the tree topology and the tree length with those used to generate the data. We also conducted the Goldman–Cox test (Goldman 1993a, 1993b; Whelan et al. 2001) of substitution model adequacy (referred to here as the GC test) to evaluate its power under these conditions. We generated 100 data sets for each of these simulation scenarios, and we report the mean statistics for each scenario. (c) Pettersson and Fiz-Palacios (2014) 83 JTT+ 2,016 GTR+ 7,647 1937–2011 (d) This study 110 JTT+ 3,391 GTR+ 10,173 1944–2013 The GC test correctly identified instances in which the substitution model was underparameterized, but it was not sensitive to overparameterization (fig. 1). When the data were generated under the JC model, the mean P value over 100 simulations was 0.62, 0.71, and 0.58 for analyses under the JC, GTR, and GTR+G models, respectively. In contrast, for the data simulated under the GTR+G model, the P values were 0 for JC and GTR, and 0.65 for the GTR+G, such that underparameterized models were considered inadequate. The GC test was also able to detect compositional heterogeneity, with P < 0.05 in all cases. We found that the GC test had poor performance for the data generated with mutational saturation and the GTR+G model, with P values that ranged between 0.3 and 0.71, incorrectly suggesting that all three models used to analyze the data (JC, GTR, and GTR+G) were adequate. We quantified the impact of using incorrect substitution models in the estimates of tree topology and tree length by comparing the trees used to generate the data and those estimated using maximum likelihood. To do this, we calculated the PH83 tree topology distance (Penny and Hendy 1985), and the percentage of error in the estimate of the total tree length. Again, we report the mean values over 100 simulations of each scenario. The errors in these two quantities were smaller for our data generated under the GTR+G submodels (i.e., models that are nested within the GTR+G; JC, GTR, and GTR+G), compared with those using long trees and heterogeneous base composition. For the GTR+G submodels, the PH85 distances ranged from 0.02 for data simulated and analyzed using JC to 0.18 for the data simulated using GTR+G and analyzed with JC. The corresponding errors in the estimates of tree length were between 0.2% for the data simulated and analyzed using JC, and 19.6% for the data simulated using GTR+G and analyzed with JC (fig. 1). In the analyses of the GTR+G submodels, the largest errors were caused by underparameterization, particularly from the data generated using GTR+G and analyzed using JC or GTR. These results are similar to previous studies on substitution model adequacy suggesting that underparameterization is a large source of inaccuracy, and that the G distribution is an important component of these models (Bollback 2002; Ripplinger and Sullivan 2010). Our simulations with long trees (i.e., mutational saturation) and with heterogeneous base composition produced errors that were several fold higher than those of the 257 MBE Duch^ene et al. . doi:10.1093/molbev/msv207 (i) GC P-values (ii) PH85 tree topology distance (iii) Difference in tree length (%) 59.9 9.4 GTR 0.71 0.64 0.00 0.30 0.00 0.03 0.06 0.16 12.2 10.1 0.39 0.30 10.8 51.8 13.3 GTR+Γ 0.58 0.59 0.65 0.71 0.00 0.04 0.04 0.1 9.2 8 0.32 1.50 8.0 11.5 16.6 G TR +Γ lo ng he tre te es ro ge ne ou s 19.6 G TR 6.60 JC 0.20 G TR +Γ lo ng he tre te es ro ge ne ou s 6.66 G TR 20.2 JC 0.18 G TR +Γ lo ng he tre te es ro ge ne ou s 0.04 G TR 0.02 JC JC 0.62 0.01 0.00 0.54 0.04 FIG. 1. Simulation results for nucleotide substitution models. The columns represent the simulation conditions, whereas the rows correspond to the model used to analyze the data. The boxes are colored according to the mean values of 100 simulations as described below (i) P value of the Goldman– Cox test, (ii) PH85 tree topology distance to the true tree, and (iii) percent difference between the length of estimated and true trees. The colors are used to represent the different statistics. Darker shades correspond to poor model performance. (i) (ii) (iii) GC P-values PH85 tree topology distance Difference in tree length (%) JTT 0.59 0.00 0.65 0.00 0.09 0.85 0.02 0.06 0.00 0.02 2 0.02 1.6 10.7 4.1 12.9 33.1 6.0 1.6 0.02 1.5 6.4 5.4 6.6 2.9 8.0 LG 0.08 0.00 0.60 0.00 0.00 0.00 0.00 0.12 0.02 0.12 1.4 0.05 1.8 11.3 1.9 7.5 30.5 8.9 LG+Γ 0.90 0.80 0.60 0.62 0.01 0.00 0.02 0.02 0.02 0.02 1.6 0.04 1.6 10.0 2.3 1.7 14.3 10.1 JT T+ Γ LG LG + lo ng Γ he tre te es ro ge ne ou s 0.06 JT T JT T+ Γ 0.02 0.02 0.04 LG LG + lo ng Γ he tre te es ro ge ne ou s 0.63 0.33 0.82 JT T 0.64 0.67 LG LG + lo ng Γ he tre te es ro ge ne ou s 0.60 JT T JT T+ Γ JTT+Γ FIG. 2. Simulation results for amino acid substitution models. The rows, columns, and color pattern correspond to those of figure 1. GTR+G submodels. The mean PH85 was between 6.66 for the data simulated using heterogeneous base composition and analyzed using JC, and 20.2 for those simulated using long trees and analyzed with JC. The mean percentage difference in tree length ranged from 9.4% for the heterogeneous data analyzed using JC to 59.9% for the simulations with long trees analyzed using JC (fig. 1). Although the errors in these simulations were much larger than those using GTR+G submodels, using the GTR+G appeared to reduce the extent of the error in the data generated under long trees. Amino Acid Sequence Simulations and Sensitivity of the Goldman–Cox Test Our approach to assess the performance of the GC test in amino acid sequences was similar to that described above for nucleotide sequences. We simulated sequences along phylogenetic trees under the following conditions: 1) the JTT model (Jones et al. 1992); 2) the JTT model with a G distribution for among-site rate variation (JTT+G); 3) the LG model (Le and Gascuel 2008); 4) the LG+G model; 5) a scenario with mean branch lengths of 0.7 subs/site, with a tree length of about 70 subs/site, and the JTT+G model; and 6) a nonstationary process, in which we simulate sequences under the JTT model 258 but a portion of the sequences within the tree have a different amino acid composition (heterogeneous; fig. 2). Our simulation of amino acid sequences showed that the GC test is also useful for these data, although it is less powerful than for nucleotides. All the substitution models appeared adequate for the data generated under the JTT and LG models with no among-site rate heterogeneity, with P values between 0.59 for the data simulated and analyzed with the JTT model, and 0.92 for the simulations under the JTT and analyzed with LG (fig. 2). Similar to our nucleotide analyses, the inclusion of the G distribution was an important aspect of the adequacy of the models, such that only models that included the G distribution were adequate for the data simulated under JTT+G and LG+G. Also, consistent with our nucleotide analyses, the test was not sensitive to mutational saturation. For the data simulated with long trees and the JTT+G model, it suggested that the JTT and JTT+G models were adequate, with P values of 0.09 and 0.33, respectively. In contrast to our nucleotide simulation analyses, the test did not detect heterogeneity in amino acid composition. Rather, it suggested that the JTT and JTT+G models were adequate for these data, with P values of 0.85 and 0.82, respectively. The impact of substitution model misspecification on the estimates of tree topology and tree length was very small for MBE Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207 the data simulated under the JTT, JTT+G, LG, and LG+G models. The mean PH85 distance to the true tree ranged from 0 for the simulations under JTT and LG, and analyzed with matching models, to 0.12 for those simulated under JTT+G and LG+G and analyzed with LG. For these simulations, the mean percentage difference in tree length ranged from 1.6% for the data simulated and analyzed under the JTT model to 12.9% for those simulated using LG+G and analyzed using JTT (fig. 2). The largest errors were for the data simulated with long trees. In this case, the mean PH85 distance to the true tree ranged between 1.6 for the analyses using LG+G or JTT+G and 2 for those using JTT. The mean percentage difference in tree length was between 2.9% for the analyses using JTT+G, which matches the model used to generate the data, and 30.5% for LG. These errors were smaller than those for our nucleotide simulations with long trees, which is likely because amino acid sequences have more character states than nucleotide sequences, such that it is possible to estimate larger genetic distances. Contrary to our nucleotide analyses, heterogeneity in amino acid composition did not lead to large errors, with mean PH85 distances between 0.02 for analyses with JTT and JTT+G, and 0.05 for those using LG. The mean percentage difference in tree length for these data was between 0.6% and 10%. Simulation of Codon Sequences and Posterior Predictive Simulations We conducted a set of simulations of protein coding sequences to assess the performance of codon models. Accordingly, we simulated phylogenetic trees of 50 taxa and mean branch lengths of 0.1 subs/site to generate sequences of 990 nt using the Python module Pyvolve (Spielman and Wilke 2015a, 2015b). The dN/dS ratio followed the model described by Goldman and Yang (1994) with three dN/dS categories; 0.5, 1, 2.5, that is compatible with the empirical values observed in some Flavivirus data sets (Duch^ene, Ho, et al. 2015). For this set of simulations, we generated ten data sets because their analysis is more computationally intensive than that for nucleotide and amino acid models. To assess the adequacy of substitution models in these data we used posterior predictive simulations, instead of the GC test. This consists of analyzing the data within a Bayesian framework to estimate the posterior distribution, and simulating data using parameters GTR+Γ M3 sampled form the posterior. In turn, the simulations are used to generate a null distribution of the multinomial loglikelihood and the 2 test statistics, as described by Huelsenbeck et al. (2001) and Foster (2004). We chose a Bayesian approach because the GC test requires additional maximum-likelihood optimization for every parametric bootstrap sample, which is computationally demanding for codon models. In contrast, in posterior predictive simulations the test statistics can be calculated directly from the simulated data (supplementary material, Supplementary Material online). A disadvantage of this method is that it is necessary to compute two test statistics (the multinomial log-likelihood and the 2), instead of one, , as in the case of the GC test (Huelsenbeck et al. 2001; Foster 2004). We analyzed the codon simulations using the M3 codon model implemented in MrBayes v3.2 (Ronquist et al. 2012). This model allows three dN/dS categories, such that it closely matches that used to generate the data. For comparison, we analyzed the data using the GTR+G nucleotide substitution model. We computed the mean of the test statistics described above and we assessed the performance in the estimates of tree topology and tree length. We considered that the estimate of the tree topology was accurate if the tree used to generate the data was contained within the 95% posterior credible interval of the posterior trees. To measure the accuracy in the estimate of the tree length, we calculated the percentage difference between the length of the tree used to generate the data and the mean posterior tree length. Similarly to our previous analyses, we report the mean values across our simulations. The multinomial log-likelihood test statistic correctly identified the M3 model as being correct, while it always rejected the GTR+G model, with mean P-values of 0.38 and 0, respectively (fig. 3). However, the 2 suggested that both models were adequate, at 0.5 for M3 and 0.62 for GTR+G. This discrepancy likely occurs because the site patterns under both models are different, which is detected by the multinomial log-likelihood, but they result in similar base composition, producing in similar 2 values. In this particular case, using an incorrect model did not result in poor estimates of tree topology and tree length. In all cases, the true tree was within the posterior samples. The performance in the estimates of the tree length was very similar, ranging from 0.6 and 10.9 for the GTR+G model, and from 0.2 and 11.9 for the M3 model. multinominal likelihood P-values x2 P-values tree topology in posterior Difference in tree length % (min, max) 0 0.62 1 5.10 (0.6, 10.9) 0.38 0.50 1 6.2 (0.2, 11.9) FIG. 3. Results of the codon model simulations. The data were generated using a codon model with three dN/dS categories and analyzed under the M3 codon model and GTR+G nucleotide substitution model. The statistics correspond to the mean over ten simulation replicates. The column labeled “tree topology in posterior” corresponds to the proportion of replicates for which the tree used to generate the data was found within the 95% posterior credible interval. The difference in tree length is the percentage of difference between the length of the tree used to generate the data and the mean of the posterior, with the minimum and maximum values, respectively. 259 MBE Duch^ene et al. . doi:10.1093/molbev/msv207 Nucleotide Substitution Model Adequacy in Flavivirus Data Sets We used the Bayesian information criterion (BIC) to select the best-fit substitution model for the four Flavivirus data sets, which was found to be GTR+G in all cases (table 1). We then conducted the GC test using GTR+G, GTR, and JC. The test suggested that data sets (a), (b) and (c) were not adequately described by the GTR+G or any submodels, with P values < 0.05 in all cases (table 2). In contrast, data set (d), appeared to be adequately described by the GTR+G model only, with a P value of 0.08. Although our simulations, described above, showed that the GC test could detect compositional heterogeneity in nucleotide sequences, we also conducted a 2 homogeneity test to detect the percentage of sequences with heterogeneous base composition. This 2 homogeneity test is different to the 2 test statistic described above for posterior predictive simulations. Its purpose is to determine whether any sequences have base compositions that deviate substantially from the rest of the data. Data set (a) only had 7.70% of heterogeneous sequences, whereas data sets (b) and (c) had 73.00% and 89.00%, respectively. Only data set (d) displayed stationarity, with no heterogeneous sequences detected by the test. Although these values of heterogeneity are highly variable between data sets, even the 7.70% of sequences with heterogeneous base composition in data set (a) can be sufficient to contribute to substitution model inadequacy (Huelsenbeck et al. 2001; Foster 2004). With empirical data it is not possible to assess whether the estimates of tree topology and tree length are biased. However, the estimates of tree length for the different models show that increasing the number of parameters resulted in longer trees (table 2). In all data sets, increasing the number of parameters increased the estimate of the tree length, sometimes by more than 2-fold. For example, in data set (c) the total tree length using GTR+G was 33.29 subs/site, compared with 18.70 subs/ site under the JC model. This trend was less pronounced for data set (d), for which the tree length estimate was 0.88 for GTR+G and 0.85 for JC, a difference of only 0.03 subs/site. Amino Acid Sequence Analyses in Flavivirus Data Sets: Bayesian Model Averaging, Informative Priors on Transition Rates, and Model Adequacy We conducted the GC test with four empirically determined substitution models: JTT, JTT+G, LG, and LG+G. These models included those with the highest statistical fit for each data set, according to the BIC (tables 1 and 3). The GC test suggested that none of these was adequate for data sets (a), (b), and (c), with P values of 0.00. However, the LG+G model was adequate for data set (d). The 2 homogeneity test suggested that the proportion of sequences with heterogeneous amino acid composition for each data set were 2.00% for (a), 14.22% for (b), 10.02% for (c), and 0.00% for (d). This contrasts with the results from nucleotide sequences, which violate the assumption of stationarity more severely in most data sets. Commonly used amino acid models, including those used here, employ fixed parameter values for the rate matrix, such that exchange rates are fixed. We relaxed this constraint by using two methods. First, we analyzed the data using Bayesian model averaging with ten substitution rate matrices and a G distribution for among-site rate variation. Second, we used a Dirichlet prior distribution on the transition rates and amino acid frequencies, as proposed by Huelsenbeck et al. (2008). In Bayesian model averaging the sampling proportion of each model is equivalent to their posterior probability, so it can be used to select the maximum a posteriori model, known as the MAP (Huelsenbeck et al. 2004). For data sets (a) and (d) the MAP corresponded to the Poisson model, whereas for data sets (b) and (c) the MAP corresponded to the WAG model (Whelan and Goldman 2001). These models were overwhelmingly sampled, with posterior probabilities of 0.99, 0.96 0.97, and 0.97 for the four data sets, respectively (table 3). We assessed model adequacy for Bayesian model averaging and the informative Dirichlet priors by using posterior predictive simulations as described for our codon model simulations. The multinomial log-likelihood suggested that Bayesian model averaging and using an informative Dirichlet prior were inadequate for data sets (a)–(c), with P < 0.05. For data set (d), both methods appeared adequate, with P values of 0.13 and 0.25, respectively. The 2 test statistic rejected both methods for data sets (b), (c) and (d) (P = 0.00), but not for data set (a), with a P value of 0.02 (table 3). The combination of the two test statistics indicates that none of the data sets was accurately described by either of the methods. This result appears surprising for data set (d), for which the LG+G model is adequate. However, it is important to note that this model is not included in the model averaging method in MrBayes, although it can be specified manually. Similarly, if the data were sufficiently informative Table 2. Nucleotide Substitution Model Adequacy Using the GC Test, Estimates of Tree Length (subs/site), and the Proportion of Sequences with Significant Differences in Base Composition for Different Flavivirus Data Sets (% het. sequences). Data Set P-value JC P-value GTR P-value GTR+ Tree length JC Tree length GTR Tree length GTR+ % het. sequences 260 (a) 0.00 0.00 0.00 4.34 4.50 9.86 7.70 (b) 0.004 0.00 0.00 17.27 15.90 30.83 73.00 (c) 0.00 0.00 0.00 18.70 17.07 33.29 89.00 (d) 0.00 0.00 0.08 0.85 0.87 0.88 0.00 MBE Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207 Table 3. Substitution Model Adequacy, Homogeneity in Amino Acid Composition, and Estimates of Tree Length for Amino Acid Models in Different Flavivirus Data Sets. Data Set P-value JTT P-value JTT+ P-value LG P-value LG+ P-value ML BMA MAP (posterior) P-value ML DP P-value.v2 BMA P-value.v2 DP Tree length JTT Tree length JTT+ Tree length LG Tree length LG+ Tree length BMA Tree length DP % het. sequences (a) 0.00 0.00 0.00 0.00 0.00 Poisson (0.99) 0.02 0.00 0.20 3.04 3.80 3.10 3.98 3.74 4.41 2.00 (b) 0.00 0.00 0.00 0.00 0.00 WAG (0.96) 0.00 0.00 0.00 12.05 14.95 12.21 15.97 14.94 17.98 14.22 (c) 0.00 0.00 0.00 0.00 0.00 WAG (0.97) 0.03 0.00 0.00 14.31 19.35 14.54 20.89 19.57 23.47 10.02 (d) 0.00 0.00 0.00 0.58 0.13 Poisson (0.99) 0.26 0.00 0.00 0.29 0.29 0.29 0.31 0.32 0.37 0.00 Table 4. Codon Model Adequacy Statistics Using Posterior Predictive Simulations for the Different Empirical Data Sets of Flavivirus Sequences. Data set P-value ML M3 P-value v2 M3 Tree length M3 (a) 0.003 0.01 9.81 the amino acid transition matrix in the method using Dirichlet priors could collapse to the LG+G model, but this does not appear to be the case for data set (d). The different approaches to analyze amino acid data produced marked differences in the estimates of tree length. Among the empirically determined models (JTT, JTT+G, LG, and LG+G), the LG+G model produced the longest trees. However, in all cases the method of using informative Dirichlet priors resulted in even longer trees (table 3). For example, the estimate of the tree length for data set (c) was 14.3 subs/site for the JTT model, 19.57 subs/site for model averaging, 20.89 subs/site for LG+G, and 23.47 subs/ site for that with informative priors (table 3). Codon Model Adequacy in Flavivirus Data Sets We analyzed our Flavivirus data sets using the M3 codon model in MrBayes. The procedure to assess the adequacy of this model also consisted of posterior predictive simulations. The multinomial log-likelihood suggested that this model was adequate for data set (d) only, with a P value of 0.07. However, the 2 test rejected this model for all data sets, with P < 0.05 in all cases. The tree length estimates with this model can be compared with those from the nucleotide sequences. For data sets (b) and (c), the tree lengths with the M3 model were 40.30 and 43.90 subs/site (table 4), which are about 25% longer than those with GTR+G, at 30.83 and 33.29 subs/site, (b) 0.00 0.00 40.3 (c) 0.01 0.00 43.9 (d) 0.07 0.00 0.89 respectively. The difference was much smaller for data sets (a) and (d), at 9.81 and 0.89 subs/site with the M3 model, and 9.86 and 0.88 subs/site with GTR+G, respectively. Temporal Signal in Flavivirus Data Sets In empirical data, it is impossible to quantify the impact of using inadequate models on phylogenetic estimates. If model violation is mild, such as in our analyses with data simulated under GTR and analyzed with the JC model, then the errors in the estimates might be small. In contrast, if the assumptions of the model are more severely violated, such as in our analyses of data generated under GTR+G and analyzed under the JC model, the errors are much larger. Therefore, it might be tempting to estimate evolutionary time scales on the grounds that model inadequacy might lead to small errors in a particular data set. However, an additional consideration is that the data should have sufficient temporal structure, which refers to whether the number of substitutions that accumulate over the sampling time is sufficient to obtain reliable estimates of the evolutionary rates and time scales. We assessed the temporal structure in the four Flavivirus data sets by conducting a date-randomization test (Ramsden et al. 2009). The motivation behind this test is to determine whether estimates of evolutionary rates and time scales are simply driven by the prior in Bayesian phylogenetic analyses. It consists of estimating the rate several times, after 261 MBE Duch^ene et al. . doi:10.1093/molbev/msv207 (c) (d) nucleotides (b) true tip randomisations true tip randomisations true tip randomisations true tip randomisations true tip randomisations true tip randomisations true tip randomisations true tip randomisations amino acids rate (substitutions/site/year) (a) FIG. 4. Substitution rate estimates for nucleotide and amino acid data sets of flaviviruses (a–d), and ten replicates of the date-randomization test. The circles correspond to the mean estimate, whereas the error bars represent the 95% posterior credible interval. The first estimate was obtained with the correct sampling times and the remaining ten were obtained with randomized sampling times. amino acid subs/site nucleotide subs/site (a) (b) (c) (d) p 0.6173 p 0.1666 p 0.8813 p <0.0001 p 0.5866 p 0.6207 p 0.6207 p <0.0001 FIG. 5. Regressions of the number of substitutions from the root to the tips as a function of the sampling time for nucleotide and amino acid data of the Flavivirus data sets (a)–(d). Each point corresponds to a sequence. The central line corresponds to the regression line, and the curved lines are the 95% confidence intervals. randomizing the sampling times. The data are considered to have sufficient temporal structure (i.e., they pass the test) if the rate estimates and their associated uncertainty with the correct sampling times do not overlap with those of the randomizations (Firth et al. 2010; Duch^ene S, Duch^ene DA, et al. 2015). We conducted this test for the nucleotide and amino acid sequences, but we did not use the M3 codon model because it was inadequate for all data sets. Data sets (a)–(c) failed the date-randomization test for at least one of ten replicates for the nucleotide and amino acid sequences. In contrast, for data set (d), both nucleotide and amino acid data passed the test. For data sets (a)–(c), there were some clear differences between the results for the different data sets (fig. 4). For data set (a) the rate estimate overlapped with one of the randomizations for the nucleotide data, and with four randomizations for the amino acid data. The estimate for data set (b) overlapped with three randomizations in the nucleotide data, and with nine for the amino acid data. Data set (c) had the least temporal structure, with 262 an estimate that overlapped with those from all the randomizations in the nucleotide and amino acid data. This result was consistent with our regressions of the rootto-tip distance (in subs/site) as a function of the sampling time (years). For data sets (a)–(c) the P values were between 0.17 and 0.88 for nucleotide sequences, and between 0.58 and 0.62 for amino acid sequences. For data set (d), the P values were less than 0.0001 for both nucleotide and amino acid sequences (fig. 5). In sum, this means that neither type of data had temporal structure for data sets (a)–(c), such that estimates of evolutionary rates and time scales for these data are likely to be spurious. Discussion Violation of substitution model assumptions can be an important source of error in phylogenetic analyses. Our simulations show the extent to which estimates of tree topology and tree length can be misled. Accurate estimates of these parameters are important to infer evolutionary rates and time scales. Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207 For example, using underparameterized substitution models can result in an underestimation of evolutionary time scales, and a corresponding overestimation of the rate of evolution (Phillips 2009). Our simulations illustrate three potential sources of error: Substitution model misspecification, mutational saturation, and compositional heterogeneity. The latter is a prevalent feature of RNA viruses (Mooers and Holmes 2000; Jenkins and Holmes 2003), including those of the Flavivirus genus, where differences in base composition seem to fit the host and vector in question (Jenkins et al. 2001). Some of these sources of error can be detected through model adequacy methods. For example, the GC test is useful to detect model misspecification and compositional heterogeneity (Foster 2004), but it is not able to distinguish whether the model correctly accounts for mutational saturation. In our simulations with mutational saturation the GC test incorrectly suggested that overly simple models were adequate, even if they did not include among-site rate heterogeneity. One reason for this finding is that mutational saturation does not constitute a form of model violation, but it can obscure information about the number of substitutions at individual sites, such that among-site rate heterogeneity is underestimated. This pattern is consistent with a previous study, in which the estimate of the parameter of the G distribution was overestimated for data sets with mutational saturation, indicating that among-site rate heterogeneity was underestimated (Duch^ene, Ho, et al. 2015). The analyses of our amino acid simulations were largely similar to those of nucleotides. However, an important difference is that compositional heterogeneity does not lead to errors of the magnitude that we observed in nucleotide sequences. Using the model and amino acid frequencies that described most of the data generally lead to accurate estimates of the tree topology and tree length. This is a probable reason for why the GC test was unable to detect compositional heterogeneity in amino acid data (fig. 2). Importantly, however, the advantages of using amino acid data are hindered by the restrictive nature of fixed parameter models, except in cases in which the model is adequate such as for the LG+G model for our empirical data set (d). Bayesian frameworks allow the implementation of potentially effective solutions to this problem. The methods used here—model averaging and using informative prior distributions—estimated substantially longer trees than the models with fixed parameters for data sets (a)–(c), for which models with fixed rate matrices were inadequate. We consider this to be evidence that these approaches mitigate this problem because using incorrect substitution models tends to produce under-, rather than overestimation of the tree length, given that the estimate of the tree topology is not highly biased (Lemmon and Moriarty 2004). Importantly, these methods were inadequate for our empirical data, implying that they are still not ideal for many data sets. Our codon sequence analyses showed that although model adequacy methods are readily applicable to this kind of data, it is difficult to determine a priori the benefits of using these models. The difference in the estimates of tree length between the GTR+G model and the M3 model was very small MBE for our simulations, but very large for our empirical data sets (b) and (c). Our simulations were necessarily limited to a simple scenario using a widely used mechanistic codon model, although more realistic models are also available, such as mutation–selection models (Halpern and Bruno 1998) and empirical models (Kosiol et al. 2007). A recent simulation study found that inferences of selective constraints using the best-fitting codon model, according to the Akaike Information Criterion and the BIC, could produce biased estimates of the parameters typically used to detect adaptive evolution (Spielman and Wilke 2015b). Clearly, further research is necessary to improve the selection and assessment of codon models, for example, using different test statistics or performance-based metrics (e.g., Minin et al. 2003; Abdo et al. 2005). The test statistics used here to assess model adequacy are sensitive to certain properties of the models and the data. Underparameterization can be detected using the statistic in the GC test, and the multinomial log-likelihood when conducing posterior predictive simulations. Compositional heterogeneity can also be detected using the statistic in nucleotide sequences, or using the 2 statistic in Bayesian analyses of amino acid data. However, in the context of model adequacy, measuring the extent to which mutational saturation is accounted for by the model is still difficult. A promising solution is to develop model adequacy statistics based on currently existing methods to quantify saturation, such as entropy indices (Xia et al. 2003), homoplasy statistics (Nielsen 2002; Bollback 2006; Lartillot et al. 2007), and changes in the ratio of transitions to transversions over time (Duch^ene, Ho, et al. 2015). Our results support the view that accurate estimation of deep evolutionary time scales in viruses is probably not possible using nucleotide sequence data alone (Holmes 2003). Although amino acid sequences appear beneficial for this purpose there is clearly ample room to improve the analyses of these data. Importantly, we find that Bayesian methods that circumvent the need to estimate a large number of parameters are a fruitful avenue of research. The development of these models and algorithms should be informed by test statistics in a model adequacy framework, including those used here to measure site patterns, compositional heterogeneity, and others based on mutational saturation. Even in cases in which the substitution models are adequate, inferring evolutionary rates and time scales in virus data sets requires that the data have sufficient temporal structure. This means that the number of substitutions that have accumulated over the sampling time should be sufficient to estimate the rate of evolution, which is especially problematic for Bayesian analyses, in which the prior can unduly influence the estimates. Simulation studies show that analyses of data with insufficient temporal structure tend overestimate the rate, and consequently underestimate the evolutionary time scale (Duch^ene S, Duch^ene DA, et al. 2015), which in part offers an explanation for why the estimates of the time scales for some viruses are much younger than suggested by other lines of evidence, such as possible codivergence with host species. 263 MBE Duch^ene et al. . doi:10.1093/molbev/msv207 Strikingly, three of the four Flavivirus data sets analyzed here have no temporal structure. This result might appear surprising, given that these viruses can have a short-term evolutionary rate of the order of 104–103 subs/site/year (e.g., Sall et al. 2010; Volk et al. 2010; Carrington and Auguste 2013; Nunes et al. 2014, 2015), such that it is expected that they should rapidly accumulate a large number of substitutions. This might occur because the bias in branch length estimates due to model inadequacy might have a stronger effect in long branches than in those that are short (Duch^ene DA, Duch^ene S, and Ho 2015). Therefore, the relationship between substitutions and sampling time appears overly dispersed, which is also consistent with our regressions of rootto-tip distance as a function of sampling time for data sets (a)–(c) (supplementary material, Supplementary Material online). We also found substantial overlap in the estimates of amino acid and nucleotide substitution rates for our Flavivirus data sets (a)–(c) (fig. 4). Although these estimates are unreliable, we attribute this pattern to a combination of overestimation of the amino acid substitution rate and strong mutational saturation in the nucleotide data. The sum of our results motivates further effort in the development of substitution models and methods for model checking. Model adequacy methods are a critical for this purpose because they can reveal whether important features of the data are accounted for, given that relevant test statistics are used (Brown 2014a; Doyle et al. 2015; Duch^ene DA, Duch^ene S, Holmes, et al. 2015). In this respect, our observation that our genus-wide Flavivirus data sets are poorly described by commonly used models suggests that current estimates of their long-term evolutionary time scale are likely unreliable, except for very closely related lineages, such as those of DENV-2 that comprise data set (d). Additional sources of information, such as ancient samples, will provide invaluable information to elucidate their time of origin. Materials and Methods Nucleotide Simulations We simulated phylogenetic trees with 50 tips and a sequence length of 2,000 sites using the packages Ape v3.0 (Popescu et al. 2012) and Phangorn v1.99 (Schliep 2011). For most simulations, the branch lengths were drawn from a lognormal distribution with mean 0.2 subs/site and standard deviation of 30% of the mean. These settings are similar to those employed in other simulation studies (e.g., Lemmon and Moriarty 2004). We simulated nucleotide sequences along these trees as implemented in Phangorn using three nucleotide substitution models; JC, GTR, and GTR+G. For the GTR and GTR+G models, we specified the Q matrix as used in a previous simulation study (Duch^ene, Ho, et al. 2015). The corresponding stationary frequencies were: 0.49, 0.1, 0.1, and 0.49, for a, c, g, and t, respectively. For the G distribution we specified four discrete categories and a shape parameter, , of 1. To generate the data for our simulations with compositional heterogeneity, we used the GTR+G model as parameterized above, although for a set of ten taxa within the tree we specified the model with the 264 following base composition: 0.1, 0.49, 0.49, and 0.1, for a, c, g, and t, respectively. We chose these simulation settings to represent the most extreme conditions in our empirical data, particularly data sets (b) and (c) which had high variation in base frequencies among sequences. We included a set of simulations with considerable mutational saturation by using the GTR+G model but under trees with mean branch lengths of 0.5 subs/site. We generated 100 data sets under each of these conditions. We performed the GC test for the simulated data sets as described by Goldman (1993a, 1993b). To do this, we used a maximum-likelihood method to estimate the phylogenetic tree, branch lengths, and substitution model parameters in Phangorn. From this we used the maximum-likelihood parameter estimates to simulate 1,000 data sets, which is equivalent to a parametric bootstrap. For each of these data sets we calculated the statistic, which represents the difference between the multinomial log-likelihood and the log-likelihood under the model in question. We used the statistics from the simulated data to generate a null distribution and calculate a P value. We performed the test under three nucleotide substitution models: JC, GTR, and GTR+G. Amino Acid Simulations We simulated the evolution of amino acid sequences using trees of 50 tips and a sequence length of 1,000 sites. The branch lengths were drawn from a lognormal distribution with mean 0.4 subs/site and a standard deviation of 30% of the mean. We chose the mean branch length to be twice that used for nucleotides to represent the fact that amino acid sequences are effective at estimating longer evolutionary distances. We simulated the data under four commonly used amino acid substitution models: JTT, JTT+G, LG, and LG+G. Our parameterization of the G distribution was the same as that of the nucleotide simulations. We simulated compositionally heterogeneous data by specifying the JTT+G model for most of the tree, with the exception of ten sequences for which we specified the amino acid frequencies to those of the LG model. To simulate data under mutational saturation, we sampled the branch lengths from a lognormal distribution with mean 0.7 subs/site and a standard deviation of 30% of the mean. Using the GC test on amino acid data is different to that for nucleotides because the model parameters are usually not optimized. Instead, the simulations were conducted using the maximum-likelihood estimates of the branch lengths and the tree topology. The rest of the test remains the same. Codon Sequence Simulation For our codon sequence simulations, we generated phylogenetic trees with 50 tips and a mean branch length of 0.1 subs/ site. We simulated sequences of 990 nt, resulting in 330 codon sites, under the GY model (Goldman and Yang 1994), implemented in Pyvolve (Spielman and Wilke 2015a). We specified three dN/dS categories for codon sites: 0.5, 1, and 2.5, which include the range of values estimated for some Flavivirus data sets (Duch^ene, Ho, et al. 2015), and which represents sites Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207 under considerably different selective constraints. We simulated ten data sets and analyzed them in MrBayes using the M3 codon model and the GTR+G nucleotide model. In both cases, we ran a single cold chain with a chain length of 1 108 steps and we verified that the effective sample size for all parameters was at least 200 using the package CODA v0.16 (Plummer et al. 2006). To test the adequacy of these models we conducted posterior predictive simulations, which consisted of simulating data using samples from the posterior under each of the models. We conducted 1,000 posterior predictive simulations for each of the ten data sets. We then obtained null distributions for the multinomial loglikelihood and the 2 by calculating these test statistics for each posterior predictive data set. To calculate a P value, we compared the test statistic from the “true” data (i.e., that analyzed in MrBayes) with the corresponding null distribution under both models. We compared the estimates from the M3 and GTR+G models by considering whether the true tree topology was contained within the 95% posterior credible interval of trees, and by calculating the percentage of difference between the true tree length and the mean posterior. We determined that the true topology was within the posterior if the tree topology distance between the true tree and any of those in the posterior was zero. The percentage difference in tree length was similar to that used in our nucleotide and amino acid sequence simulations, except that we used the mean posterior tree length, instead of the maximum-likelihood estimate. Flavivirus Data Sets We analyzed four data sets, three of which were used in previous studies of Flavivirus evolution. All data sets consisted in complete genome nucleotide sequences downloaded from GenBank. We aligned the nucleotide and the translated amino acid data using MAFFT v7 (Katoh et al. 2002). Before our phylogenetic analyses we removed ambiguous or misaligned sites using Gblocks v0.91 (Castresana 2000; Talavera and Castresana 2007). This is important for model adequacy methods because they rely on site patterns so they can be misled by ambiguous or misaligned regions. The resulting data sets were (a) 58 sequences of different tick-borne flaviviruses sampled between the years 1937 and 2008, with sequence lengths of 3,324 amino acids and 10,056 nt; (b) 100 sequences sampled from 1937 to 2010, with sequence lengths of 2,463 amino acids and 8,856 nt; (c) 83 sequences sampled from 1937 to 2011, with sequence lengths of 2,016 amino acids and 7,647 nt; (d) 110 sequences of DENV-2 infecting humans sampled between 1944 and 2013, with sequence length of 3,391 amino acids and 10,713 nt (table 1). For our codon model analyses, we removed ambiguous sites while maintaining the correct reading frame. MBE that we used for our simulations. We verified that the nucleotide and amino acid models with the greatest statistical fit for all these data sets were included in those that we considered for the GC test (table 1). This is an important aspect of our study because it allows us to determine whether models that have high statistical fit are also adequate. To do this, we chose the models according to the BIC as implemented in Modelgenerator v0.85 (Keane et al. 2006). Although the GC test can detect compositional heterogeneity, it cannot determine whether this is the reason why a model is inadequate. For example, if an empirical data set fails the test using a particular model, it might be because the model is underparameterized, the data have substantial compositional heterogeneity, or both. To disentangle these factors, we included a 2 compositional homogeneity test as implemented in TREEPUZZLE v5.2 (Schmidt et al. 2002). This test determines the proportion of sequences for which the base composition deviates significantly from the average and it serves a different purpose to that of the 2 test statistic used in our posterior predictive simulations. For the amino acid sequences, we used two techniques to relax the strong assumptions made by typical amino acid rate matrices. The first was Bayesian model averaging, where a range of amino acid matrices are considered as part of the parameter space. The Markov chain proposes different models throughout the analysis, and they are accepted or rejected according to their posterior probability (Huelsenbeck et al. 2004). We conducted these analyses in MrBayes, which includes ten amino acid matrices. The second method we used consists of specifying informative priors in the form of a Dirichlet distribution for the parameters of the rate matrix, as described by Huelsenbeck et al. (2008), and also implemented in MrBayes. For these analyses we specified a G distribution with four discrete categories, and ran the analyses with two cold chains with lengths of 5 107 steps. We used CODA to assess sufficient sampling from the stationary distribution by ensuring that the effective-sample size of all parameters was at least 200, and by visually inspecting the trace. We also analyzed the Flavivirus data sets under the M3 codon model in MrBayes, which specifies three dN/dS categories for codon positions. The settings in MrBayes and assessment of sufficient sampling were the same as those for our amino acid sequence analyses. Because we used a Bayesian framework for our amino acid and codon sequence analyses, we assessed the adequacy of the models through posterior predictive simulations. We simulated 1,000 posterior predictive data sets for each analysis, and calculated the P value of the multinomial log-likelihood and the 2 test statistics, as described for our codon model simulations. The computer code and empirical data sets used in this study are available in GitHub (https://github.com/sebastianduchene/virus_model_ adequacy, last accessed October 7, 2015). Flavivirus Sequence Analyses: Model Adequacy, Bayesian Model Averaging, and Informative Priors on the Substitution Rate Matrix Assessing Temporal Structure in the Flavivirus Data Sets We performed the GC test on the four Flavivirus data sets utilizing the nucleotide, amino acid, and the codon model We assessed whether the Flavivirus data sets had sufficient temporal structure to perform tip-date based molecular clock 265 Duch^ene et al. . doi:10.1093/molbev/msv207 dating by conducting a date-randomization test with ten replicates. We utilized the best-fit nucleotide and amino acid models in all cases (table 1), as well as an uncorrelated lognormal clock model (Drummond et al. 2006) and a constant-size tree prior. We conducted these analyses in BEAST v1.8 (Drummond et al. 2012) with a chain length of 1 108 steps, and assessed sufficient sampling from the stationary distribution by ensuring that the effective sample size of all parameters was at least 200. For comparison, we also fit a linear regression of the root-to-tip distance as a function of the sampling time. The slope of the line corresponds to the rate, the R2 value is the extent to which evolution has been clock-like, and the P value is used to determine whether the estimate of the rate is significantly different from 0. To do this, we estimated maximum-likelihood trees in PhyML (Guindon et al. 2010) and used the optimal root found in Path-O-Gen v1.4 (tree.bio.ed.ac.uk/software/pathogen/; Drummond et al. 2003). We then extracted the root-to-tip distances using NELSI v0.2 (Ho et al. 2015) and fit the regressions using R v3.1 (Ihaka and Gentleman 1996). Supplementary Material Supplementary data S1, table S1, and figure S1 are available at Molecular Biology and Evolution online (http://www.mbe. oxfordjournals.org/). Acknowledgments The study was supported by the Swiss National Science Foundation grant number P2ZHP3_151594 to F.D.G. and an NHMRC Australia Fellowship (AF30) to E.C.H. References Abdo Z, Minin VN, Joyce P, Sullivan J. 2005. Accounting for uncertainty in the tree topology has little effect on the decision-theoretic approach to model selection in phylogeny estimation. Mol Biol Evol. 22:691–703. Belshaw R, Sanjuan R, Pybus OG. 2011. Viral mutation and substitution: units and levels. Curr Opin Virol. 1:430–435. Bollback JP. 2002. Bayesian model adequacy and choice in phylogenetics. Mol Biol Evol. 19:1171–1180. Bollback JP. 2006. SIMMAP: stochastic character mapping of discrete traits on phylogenies. BMC Bioinformatics 7:88. Brown JM. 2014a. Detection of implausible phylogenetic inferences using posterior predictive assessment of model fit. Syst Biol. 63:334–348. Brown JM. 2014b. Predictive approaches to assessing the fit of evolutionary models. Syst Biol. 63:289–292. Carrington CVF, Auguste AJ. 2013. Evolutionary and ecological factors underlying the tempo and distribution of yellow fever virus activity. Infect. Genet Evol. 13:198–210. Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 17:540–552. Doyle VP, Young RE, Naylor GJP, Brown JM. 2015. Can we identify genes with increased phylogenetic reliability? Syst Biol 64:824–837. Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4:699–710. Drummond AJ, Pybus OG, Rambaut A. 2003. Inference of viral evolutionary rates from molecular sequences. Adv Parasitol. 54:331–358. Drummond AJ, Suchard MA, Xie D, Rambaut A. 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 29:1969–1973. 266 MBE Duch^ene DA, Duch^ene S, Ho SYW. 2015. Tree imbalance causes a bias in phylogenetic estimation of evolutionary time scales using heterochronous sequences. Mol Ecol Resour. 15:785–794. Duch^ene DA, Duch^ene S, Holmes EC, Ho SYW. 2015. Evaluating the adequacy of molecular clock models using posterior predictive simulations. Mol Biol Evol. 32(11):2986–2995. Duch^ene S, Duch^ene DA, Holmes EC, Ho SYW. 2015. The performance of the date-randomization test in phylogenetic analyses of timestructured virus data. Mol Biol Evol. 32:1895–1906. Duch^ene S, Ho SYW, Holmes EC. 2015. Declining transition/transversion ratios through time reveal limitations to the accuracy of nucleotide substitution models. BMC Evol Biol. 15:36. Duch^ene S, Holmes EC, Ho SYW. 2014. Analyses of evolutionary dynamics in viruses are hindered by a time-dependent bias in rate estimates. Proc R Soc Lond B Biol Sci. 281:20140732. Firth C, Kitchen A, Shapiro B, Suchard MA, Holmes EC, Rambaut A. 2010. Using time-structured data to estimate evolutionary rates of double-stranded DNA viruses. Mol Biol Evol. 27:2038–2051. Fitch WM, Leiter JM, Li XQ, Palese P. 1991. Positive Darwinian evolution in human influenza A viruses. Proc Natl Acad Sci U S A. 88:4270– 4274. Foster PG. 2004. Modeling compositional heterogeneity. Syst Biol. 53:485–495. Gatesy J. 2007. A tenth crucial question regarding model use in phylogenetics. Trends Ecol Evol. 274:3–14. Gelman A, Shalizi CR. 2013. Philosophy and the practice of Bayesian statistics. Br J Math Stat Psychol. 66:8–38. Goldman N. 1993a. Simple diagnostic statistical tests of models for DNA substitution. J Mol Evol. 37:650–661. Goldman N. 1993b. Statistical tests of models of DNA substitution. J Mol Evol. 36:182–198. Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 11:725–736. Gould EA, de Lamballerie X, Zanotto PM, Holmes EC. 2001. Evolution, epidemiology, and dispersal of flaviviruses revealed by molecular phylogenies. Adv Virus Res. 57:71–103. Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. 2010. New algorithms and methods to estimate maximum likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 59:307–321. Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15:910–917. Heinze DM, Gould EA, Forrester NL. 2012. Revisiting the clinal concept of evolution and dispersal for the tick-borne flaviviruses by using phylogenetic and biogeographic analyses. J Virol. 86:8663–8671. Ho SYW, Duch^ene S. 2014. Molecular-clock methods for estimating evolutionary rates and time scales. Mol Ecol. 23:5947–5975. Ho SYW, Duch^ene S, Duch^ene DA. 2015. Simulating and detecting autocorrelation of molecular evolutionary rates among lineages. Mol Ecol Resour. 15:688–696. Holmes EC. 2003. Molecular clocks and the puzzle of RNA virus origins. J Virol. 77:3893–3897. Huelsenbeck JP, Joyce P, Lakner C, Ronquist F. 2008. Bayesian analysis of amino acid substitution models. Philos Trans R Soc Lond B Biol Sci. 363:3941–3953. Huelsenbeck JP, Larget B, Alfaro ME. 2004. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol Biol Evol. 21:1123–1133. Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310–2314. Ihaka R, Gentleman R. 1996. R: a language for data analysis and graphics. J Comput Graph Stat. 5:299–314. Jenkins GM, Holmes EC. 2003. The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res. 92:1–7. Jenkins GM, Pagel M, Gould EA, de A Zanotto PM, Holmes EC. 2001. Evolution of base composition and codon usage bias in the genus Flavivirus. J Mol Evol. 52:383–390. Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207 Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 8:275–282. Jorba J, Campagnoli R, De L, Kew O. 2008. Calibration of multiple poliovirus molecular clocks covering an extended evolutionary range. J Virol. 82:4429–4440. Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066. Keane TM, Creevey CJ, Pentony MM, Naughton TJ, Mclnerney JO. 2006. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol Biol. 6:29. Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. 2000. Timing the ancestor of the HIV-1 pandemic strains. Science 288:1789–1796. Kosiol C, Holmes I, Goldman N. 2007. An empirical codon model for protein sequence evolution. Mol Biol Evol. 24:1464–1479. K€uhnert D, Wu CH, Drummond AJ. 2011. Phylogenetic and epidemic modeling of rapidly evolving infectious diseases. Infect Genet Evol. 11:1825–1841. Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol. 7:S4. Le SQ, Gascuel O. 2008. An improved general amino acid replacement matrix. Mol Biol Evol. 25:1307–1320. Lemmon AR, Moriarty EC. 2004. The importance of proper model assumption in Bayesian phylogenetics. Syst Biol. 53:265–277. Minin V, Abdo Z, Joyce P, Sullivan J. 2003. Performance-based selection of likelihood models for phylogeny estimation. Syst Biol. 52:674–683. Mooers A, Holmes EC. 2000. The evolution of base composition and phylogenetic inference. Trends Ecol Evol. 15:365–369. Moureau G, Cook S, Lemey P, Nougairede A, Forrester NL, Khasnatinov M, Charrel RN, Firth AE, Gould EA, de Lamballerie X. 2015. New insights into flavivirus evolution, taxonomy and biogeographic history, extended by analysis of canonical and alternative coding sequences. PLoS One 10:e0117849. Nielsen R. 2002. Mapping mutations on phylogenies. Syst Biol. 51:729– 739. Nunes MRT, Faria NR, de Vasconcelos JM, Golding N, Kraemer MUG, de Oliveira LF, Azevedo RSS, da Silva DEA, da Silva EVP, da Silva SP, et al. 2015. Emergence and potential for spread of Chikungunya virus in Brazil. BMC Med. 13:102. Nunes MRT, Palacios G, Faria NR, Sousa EC Jr, Pantoja JA, Rodrigues SG, Carvalho VL, Medeiros DBA, Savji N, Baele G. 2014. Air travel is associated with intracontinental spread of dengue virus serotypes 1–3 in Brazil. PLoS Negl Trop Dis. 8:e2769. Penny D, Hendy M. 1985. The use of tree comparison metrics. Syst Zool. 34:75–82. Pettersson JH-O, Fiz-Palacios O. 2014. Dating the origin of the genus Flavivirus in the light of Beringian biogeography. J Gen Virol. 95:1969–1982. Phillips MJ. 2009. Branch-length estimation bias misleads molecular dating for a vertebrate mitochondrial phylogeny. Gene 441:132–140. Plummer M, Best N, Cowles K, Vines K. 2006. CODA: convergence diagnosis and output analysis for MCMC. R News 6:7–11. Popescu A-A, Huber KT, Paradis E. 2012. ape 3.0: new tools for distancebased phylogenetics and evolutionary analysis in R. Bioinformatics 28:1536–1537. Posada D, Buckley TR. 2004. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol. 53:793–808. Ramsden C, Holmes EC, Charleston MA. 2009. Hantavirus evolution in relation to its rodent and insectivore hosts: no evidence for codivergence. Mol Biol Evol. 26:143–153. Ripplinger J, Sullivan J. 2010. Assessment of substitution model adequacy using Frequentist and Bayesian methods. Mol Biol Evol. 27:2790– 2803. MBE Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, H€ohna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. 2012. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 61:539–542. Sall AA, Faye O, Diallo M, Firth C, Kitchen A, Holmes EC. 2010. Yellow fever virus exhibits slower evolutionary dynamics than dengue virus. J Virol. 84:765–772. Sanderson MJ. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol Biol Evol. 14:1218– 1231. Sanjuan R. 2012. From molecular genetics to phylodynamics: evolutionary relevance of mutation rates across viruses. PLoS Pathog. 8:e1002685. Schliep KP. 2011. phangorn: phylogenetic analysis in R. Bioinformatics 27:592–593. Schmidt HA, Strimmer K, Vingron M, von Haeseler A. 2002. TREEPUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502–504. Schneider A, Cannarozzi GM, Gonnet GH. 2005. Empirical codon substitution matrix. BMC Bioinformatics 6:134. Spielman SJ, Wilke CO. 2015a. Pyvolve: a flexible Python module for simulating sequences along phylogenies. PLoS One 10:e0139047. Spielman SJ, Wilke CO. 2015b. The relationship between dN/dS and scaled selection coefficients. Mol Biol Evol. 32:1097–1108. Sullivan J, Joyce P. 2005. Model selection in phylogenetics. Annu Rev Ecol Evol Syst. 36:445–466. Talavera G, Castresana J. 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 56:564–577. Thorne JL, Kishino H, Painter IS. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol. 15:1647–1657. Volk SM, Chen R, Tsetsarkin KA, Adams AP, Garcia TI, Sall AA, Nasar F, Schuh AJ, Holmes EC, Higgs S, et al. 2010. Genome-scale phylogenetic analyses of chikungunya virus reveal independent emergences of recent epidemics and various evolutionary rates. J Virol. 84:6497–6504. Volz EM, Koelle K, Bedford T. 2013. Viral phylodynamics. PLoS Comput Biol. 9:e1002947. Volz EM, Pond SLK, Ward MJ, Brown AJL, Frost SDW. 2009. Phylodynamics of infectious disease epidemics. Genetics 183:1421–1430. Wertheim JO, Chu DKW, Peiris JSM, Pond SLK, Poon LLM. 2013. A case for the ancient origin of coronaviruses. J Virol. 87:7039–7045. Wertheim JO, Pond SLK. 2011. Purifying selection can obscure the ancient age of viral lineages. Mol Biol Evol. 28:3355–3365. Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol. 18:691–699. Whelan S, Lio P, Goldman N. 2001. Molecular phylogenetics: state-ofthe-art methods for looking into the past. Trends Genet. 17:262–272. Woo PCY, Lau SKP, Lam CSF, Lau CCY, Tsang AKL, Lau JHN, Bai R, Teng JLL, Tsang CCC, Wang M. 2012. Discovery of seven novel mammalian and avian coronaviruses in Deltacoronavirus supports bat coronaviruses as the gene source of Alphacoronavirus and Betacoronavirus and avian coronaviruses as the gene source of Gammacoronavirus and Deltacoronavirus. J Virol. 86:3995–4008. Xia X, Xie Z, Salemi M, Chen L, Wang Y. 2003. An index of substitution saturation and its application. Mol Phylogenet Evol. 26:1–7. Xie W, Lewis PO, Fan Y, Kuo L, Chen M-H. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst Biol. 60:150–160. Yang Z, Nielsen R, Goldman N, Pedersen A-MK. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449. Yang Z, Rannala B. 2006. Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Mol Biol Evol. 23:212–226. Zuckerkandl E. 1976. Evolutionary processes and evolutionary noise at the molecular level. J Mol Evol. 7:269–311. 267
© Copyright 2026 Paperzz