Article Substitution Model Adequacy and

Substitution Model Adequacy and Assessing the Reliability
of Estimates of Virus Evolutionary Rates and Time Scales
Sebastian Duch^ene,y,1 Francesca Di Giallonardo,y,1 and Edward C. Holmes*,1
1
Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney
Medical School, The University of Sydney, Sydney, NSW, Australia
y
These authors contributed equally to this work.
Associate editor: Sergei Kosakovsky Pond
*Corresponding author: E-mail: [email protected].
Abstract
Determining the time scale of virus evolution is central to understanding their origins and emergence. The phylogenetic
methods commonly used for this purpose can be misleading if the substitution model makes incorrect assumptions about
the data. Empirical studies consider a pool of models and select that with the highest statistical fit. However, this does not
allow the rejection of all models, even if they poorly describe the data. An alternative is to use model adequacy methods
that evaluate the ability of a model to predict hypothetical future observations. This can be done by comparing the
empirical data with data generated under the model in question. We conducted simulations to evaluate the sensitivity of
such methods with nucleotide, amino acid, and codon data. These effectively detected underparameterized models, but
failed to detect mutational saturation and some instances of nonstationary base composition, which can lead to biases in
estimates of tree topology and length. To test the applicability of these methods with real data, we analyzed nucleotide
and amino acid data sets from the genus Flavivirus of RNA viruses. In most cases these models were inadequate, with the
exception of a data set of relatively closely related sequences of Dengue virus, for which the GTR+G nucleotide and LG+G
amino acid substitution models were adequate. Our results partly explain the lack of consensus over estimates of the
long-term evolutionary time scale of these viruses, and indicate that assessing the adequacy of substitution models should
be routinely used to determine whether estimates are reliable.
Key words: substitution model, model adequacy, parametric bootstrap, posterior predictive simulation, Bayesian model
averaging, virus evolution
Introduction
substitution as a reversible Markovian process. Second, the
sampling time of the sequences, or calibration, must capture a
sufficient number of substitutions. Errors in any of these two
factors will lead to spurious estimates of rates and evolutionary time scales (recently reviewed by Ho and Duch^ene 2014).
Despite the ongoing development of molecular clock
methods and substitution models, their use in RNA viruses
has sometimes proven contentious and there is substantial
debate over the evolutionary time scales of many viruses
(Holmes 2003; Jorba et al. 2008; Wertheim and Pond 2011;
Duch^ene et al. 2014). For example, it has been estimated that
viruses of the family Coronaviridae originated around 10,100
years before present using standard phylogenetic methods
(Woo et al. 2012). However, subsequent studies using improved substitution models suggested a much older origin
for these viruses, of the order of millions of years (Wertheim
et al. 2013). A major reason for these discrepancies is that
viruses accumulate mutations much faster than cellular organisms, such that mutational saturation can become apparent after a few years and which adversely affects analyses of
divergence times (Holmes 2003; Duch^ene et al. 2014;
Duch^ene, Ho, et al. 2015). Moreover, natural selection operates at different levels, such as within an infected host, between transmissions to different hosts, and during
codivergence between host species (Belshaw et al. 2011;
ß The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: [email protected]
Mol. Biol. Evol. 33(1):255–267 doi:10.1093/molbev/msv207 Advance Access publication September 28, 2015
255
Article
Estimating evolutionary rates and time scales in viruses is
central to understanding their emergence and spread. This
endeavor has been facilitated by the increasing availability of
molecular sequence data, which can be combined with information about time to infer divergence times, effective population sizes (Ne), and key epidemiological parameters such as
the basic reproductive number (R0). However, all of these
estimates rely on molecular clock methods, the simplest of
which posits that the accumulation of nucleotide or amino
acid substitutions is constant through time (Zuckerkandl
1976; Fitch et al. 1991; Korber et al. 2000), known as the
“strict” molecular clock. Under this assumption, substitution
rates can be estimated simply by dividing the sampling times
of a pair of virus samples by their genetic divergence. Similarly,
the divergence time between any pair of samples can be estimated by dividing their genetic divergence by the evolutionary rate. More flexible approaches, known as “relaxed”
molecular clocks, allow variation in the rate of evolution
among lineages according to a statistical model (e.g.,
Sanderson 1997; Thorne et al. 1998; Drummond et al. 2006;
Yang and Rannala 2006). Importantly, all molecular clock
methods have some common requirements. First, genetic
divergence must be accurately estimated, which is typically
done using models that describe nucleotide or amino acid
Duch^ene et al. . doi:10.1093/molbev/msv207
Sanjuan 2012), which also impacts estimates of divergence
times. Statistical modeling of these processes is challenging,
although there are ongoing efforts to describe these processes
more accurately (Volz et al. 2009; K€uhnert et al. 2011;
Wertheim and Pond 2011; Volz et al. 2013).
The difficulty in analyzing the long-term evolutionary processes in viruses, particularly their divergence times, can to
some extent be alleviated by choosing amino acid rather than
nucleotide sequences. The lower rate of evolution in amino
acid data is an attractive property that can allow resolution of
deep time scales, where nucleotide sequences are largely saturated. However, amino acid sequence evolution is more difficult to model because the number of possible substitutions
is far greater than that of nucleotide sequences. For example,
the general-time reversible (GTR) nucleotide substitution
model has nine parameters, whereas the equivalent for
amino acids has 208. In practice, using amino acid models
with free parameters is hindered by the information content
in most data sets and the considerable computational
burden. Therefore, it is common to use empirically determined amino acid models, in which the transitions rates
and amino acid frequencies have fixed values. An obvious
drawback of this approach is that the fixed parameter
values of all available models might be incorrect for a given
data sets, leading to potential biases in phylogenetic inference.
Huelsenbeck et al. (2008) proposed some alternatives to using
fixed parameter values for amino acid models in Bayesian
analyses. One of these methods is Bayesian model averaging,
in which the analysis samples a range of substitution matrices
in proportion to their posterior probability. As a result, the
estimates of tree topology and branch lengths are averaged
over different fixed rate matrices, without the increased computational burden of using a large number of parameters.
Another option is to use informative prior distributions on
the parameters of the transition matrix such as a Dirichlet
distribution, as a means of guiding their estimation.
Codon models can also be used in place of nucleotide
substitution models. These typically infer selective constraints
by estimating the proportion of nonsynonymous (dN) to synonymous (dS) substitutions per site, known as the dN/dS ratio.
One of the most widely used codon models, known as M3,
specifies three dN/dS categories for codons, corresponding to
those under purifying, neutral, and positive selection (Yang
et al. 2000). Parametric codon models can be computationally
demanding because they have even more parameters than
amino acid models, although empirical or semiparametric
methods have also been implemented (e.g., Schneider et al.
2005; Kosiol et al. 2007).
Developing more realistic evolutionary models and selecting different sources of data will clearly be beneficial to resolve
discrepancies in phylogenetic estimates in viruses. An arguably equally important task is to develop methods to assess
the reliability of the estimates obtained from these models
(Gatesy 2007; Gelman and Shalizi 2013). The most common
approach is to consider a pool of models and to select that
with the highest statistical fit for subsequent analysis. This is
commonly done through likelihood ratio tests or information
criteria under maximum-likelihood frameworks, or by
256
MBE
estimating marginal likelihoods in Bayesian analyses (Posada
and Buckley 2004; Sullivan and Joyce 2005; Xie et al. 2011).
These methods have the limitation that they assume that the
set of models considered contains the true model, or at least a
model that adequately describes the data. However, even if
none of the models considered are adequate descriptions of
the data, they will still select the model with the highest
statistical fit. In contrast, model adequacy methods can
reject all of the models if they are inadequate for the data
at hand (Bollback 2002; Ripplinger and Sullivan 2010; Brown
2014b).
Substitution model adequacy methods for phylogenetics
have implementations within maximum likelihood (Goldman
1993a, 1993b) and Bayesian frameworks (Bollback 2002).
Most of these are similar to a parametric bootstrap in that
they consist of simulating sequence data under the model in
question and comparing the simulations with the empirical
data using a summary statistic. Goldman (1993a) proposed
using the -test statistic for maximum-likelihood analyses,
which is the difference between the multinomial loglikelihood and the log-likelihood of the model in question.
For Bayesian inference, the multinomial log-likelihood appears to perform well as the test statistic (Huelsenbeck
et al. 2001; Bollback 2002), but it is often advisable to also
use a 2 test statistic (Foster 2004). These methods have been
shown to detect some conditions of nucleotide substitution
model underparameterization (Whelan et al. 2001; Bollback
2002; Ripplinger and Sullivan 2010) and violations of model
assumptions (Foster 2004). However, their performance has
not been thoroughly evaluated under some conditions that
might be particularly difficult to account for in virus data sets,
such as base heterogeneity and mutational saturation (but
see Foster 2004). Moreover, model adequacy methods have
largely focused on nucleotide data, although they are critical
for analyses of amino acid sequences for which frequently
used models are more restrictive.
In this study, we investigated the usefulness of model adequacy methods to detect misleading phylogenetic estimates
in virus data sets with nucleotide, amino acid, and codon
sequences. To that end we analyzed four publically available
complete genome data sets of flaviviruses (genus Flavivirus,
family Flaviviridae of positive-sense single-stranded RNA viruses), referred to herein as (a), (b), (c), and (d) (table 1 and
supplementary material, Supplementary Material online).
Data set (a) includes only tick-borne flaviviruses (Heinze
et al. 2012), whereas data sets (b) (Moureau et al. 2015)
and (c) (Pettersson and Fiz-Palacios 2014) include sequences
from the four main viral groups (mosquito-borne, tick-borne,
insect specific, and no-known vector). For comparison, we
utilized a data set (d) that only includes sequences from
one virus species infecting humans, Dengue virus type 2
(DENV-2), that is presumed to have emerged more recently
than the viral diversity contained within data sets (a)–(c).
Estimating the evolutionary time scale of the Flavivirus
genus has attracted considerable attention due to the propensity of these viruses to cause human disease (e.g., dengue,
yellow fever, and West Nile viruses) and their broad host
range (vertebrates and invertebrates) (Gould et al. 2001).
MBE
Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207
Table 1. Flavivirus Data Sets Used in This Study, Nucleotide (NT) and Amino Acid (AA) Sequence Lengths in Alignments after Using Gblocks,
Substitution Models Chosen According to the BIC, and the Range of Sampling Years.
Data Set
Reference
Number of sequences
AA model
AA sequence length
NT model
NT sequence length
Sampling year range
(a)
Heinze et al. (2012)
58
LG+
3,324
GTR+
10,056
1937–2008
(b)
Moureau et al. (2015)
100
LG+
2,463
GTR+
8,856
1937–2010
However, there are wide discrepancies in the estimates of the
time of origin of this genus, with recent estimates ranging
from 20,000 (Moureau et al. 2015) to about 300,000
(Pettersson and Fiz-Palacios 2014) years before present.
Resolving these discrepancies is difficult, especially in the
light of evidence supporting different patterns of base composition bias among the major flavivirus taxa (Jenkins et al.
2001), which violate the assumption of stationarity implicit in
most substitution models, and adversely affecting the use of
many phylogenetic methods. For this reason, model adequacy
methods can provide useful guidelines to understand potential sources of bias in some of these estimates. To illustrate the
effect of substitution model violation, we also simulated data
under different conditions and show how the estimates of
tree topology and tree length were affected. These simulations also reveal the cases in which model adequacy methods
can successfully detect misleading estimates, and specific situations under which they fail.
Results
Nucleotide Sequence Simulations and Sensitivity of
the Goldman–Cox Test
We simulated the evolution of nucleotide sequences along
phylogenetic trees with 50 tips under different scenarios: 1)
The Jukes–Cantor (JC) model; 2) the GTR model; 3) the GTR
model with a G distribution for among-site rate variation
(GTR+G); 4) a scenario of trees with the GTR+G model
and mean branch lengths of 0.5 substitutions per site
(subs/site), with a tree length of approximately 42 subs/site,
resulting in considerable mutational saturation (referred to
here as analyses with “long trees”); and 5) a scenario in which
GC content varies by several fold along the tree, representing
a nonstationary process (referred to here as “heterogeneous”).
We analyzed these data under the JC, GTR, and the GTR+G
models (fig. 1). To evaluate whether these scenarios resulted
in biased estimates, we compared the tree topology and the
tree length with those used to generate the data. We also
conducted the Goldman–Cox test (Goldman 1993a, 1993b;
Whelan et al. 2001) of substitution model adequacy (referred
to here as the GC test) to evaluate its power under these
conditions. We generated 100 data sets for each of these
simulation scenarios, and we report the mean statistics for
each scenario.
(c)
Pettersson and Fiz-Palacios (2014)
83
JTT+
2,016
GTR+
7,647
1937–2011
(d)
This study
110
JTT+
3,391
GTR+
10,173
1944–2013
The GC test correctly identified instances in which the
substitution model was underparameterized, but it was not
sensitive to overparameterization (fig. 1). When the data were
generated under the JC model, the mean P value over 100
simulations was 0.62, 0.71, and 0.58 for analyses under the JC,
GTR, and GTR+G models, respectively. In contrast, for the
data simulated under the GTR+G model, the P values were 0
for JC and GTR, and 0.65 for the GTR+G, such that underparameterized models were considered inadequate. The GC test
was also able to detect compositional heterogeneity, with
P < 0.05 in all cases. We found that the GC test had poor
performance for the data generated with mutational saturation and the GTR+G model, with P values that ranged between 0.3 and 0.71, incorrectly suggesting that all three
models used to analyze the data (JC, GTR, and GTR+G)
were adequate.
We quantified the impact of using incorrect substitution
models in the estimates of tree topology and tree length by
comparing the trees used to generate the data and those
estimated using maximum likelihood. To do this, we calculated the PH83 tree topology distance (Penny and Hendy
1985), and the percentage of error in the estimate of the
total tree length. Again, we report the mean values over
100 simulations of each scenario. The errors in these two
quantities were smaller for our data generated under the
GTR+G submodels (i.e., models that are nested within the
GTR+G; JC, GTR, and GTR+G), compared with those using
long trees and heterogeneous base composition. For the
GTR+G submodels, the PH85 distances ranged from 0.02
for data simulated and analyzed using JC to 0.18 for the
data simulated using GTR+G and analyzed with JC. The corresponding errors in the estimates of tree length were between 0.2% for the data simulated and analyzed using JC, and
19.6% for the data simulated using GTR+G and analyzed with
JC (fig. 1). In the analyses of the GTR+G submodels, the largest
errors were caused by underparameterization, particularly
from the data generated using GTR+G and analyzed using
JC or GTR. These results are similar to previous studies on
substitution model adequacy suggesting that underparameterization is a large source of inaccuracy, and that the G distribution is an important component of these models
(Bollback 2002; Ripplinger and Sullivan 2010).
Our simulations with long trees (i.e., mutational saturation) and with heterogeneous base composition produced
errors that were several fold higher than those of the
257
MBE
Duch^ene et al. . doi:10.1093/molbev/msv207
(i)
GC P-values
(ii)
PH85 tree topology distance
(iii)
Difference in tree length (%)
59.9
9.4
GTR 0.71 0.64 0.00 0.30 0.00
0.03
0.06
0.16
12.2
10.1
0.39
0.30
10.8
51.8
13.3
GTR+Γ 0.58 0.59 0.65 0.71 0.00
0.04
0.04
0.1
9.2
8
0.32
1.50
8.0
11.5
16.6
G
TR
+Γ
lo
ng
he
tre
te
es
ro
ge
ne
ou
s
19.6
G
TR
6.60
JC
0.20
G
TR
+Γ
lo
ng
he
tre
te
es
ro
ge
ne
ou
s
6.66
G
TR
20.2
JC
0.18
G
TR
+Γ
lo
ng
he
tre
te
es
ro
ge
ne
ou
s
0.04
G
TR
0.02
JC
JC 0.62 0.01 0.00 0.54 0.04
FIG. 1. Simulation results for nucleotide substitution models. The columns represent the simulation conditions, whereas the rows correspond to the
model used to analyze the data. The boxes are colored according to the mean values of 100 simulations as described below (i) P value of the Goldman–
Cox test, (ii) PH85 tree topology distance to the true tree, and (iii) percent difference between the length of estimated and true trees. The colors are used
to represent the different statistics. Darker shades correspond to poor model performance.
(i)
(ii)
(iii)
GC P-values
PH85 tree topology distance
Difference in tree length (%)
JTT 0.59 0.00 0.65 0.00 0.09 0.85
0.02 0.06 0.00
0.02
2
0.02
1.6
10.7
4.1
12.9 33.1
6.0
1.6
0.02
1.5
6.4
5.4
6.6
2.9
8.0
LG
0.08
0.00 0.60
0.00 0.00 0.00
0.00 0.12 0.02
0.12
1.4
0.05
1.8
11.3
1.9
7.5
30.5
8.9
LG+Γ
0.90
0.80 0.60
0.62 0.01 0.00
0.02 0.02 0.02
0.02
1.6
0.04
1.6
10.0
2.3
1.7
14.3 10.1
JT
T+
Γ
LG
LG
+
lo
ng Γ
he
tre
te
es
ro
ge
ne
ou
s
0.06
JT
T
JT
T+
Γ
0.02 0.02 0.04
LG
LG
+
lo
ng Γ
he
tre
te
es
ro
ge
ne
ou
s
0.63 0.33 0.82
JT
T
0.64 0.67
LG
LG
+
lo
ng Γ
he
tre
te
es
ro
ge
ne
ou
s
0.60
JT
T
JT
T+
Γ
JTT+Γ
FIG. 2. Simulation results for amino acid substitution models. The rows, columns, and color pattern correspond to those of figure 1.
GTR+G submodels. The mean PH85 was between 6.66 for the
data simulated using heterogeneous base composition and
analyzed using JC, and 20.2 for those simulated using long
trees and analyzed with JC. The mean percentage difference in
tree length ranged from 9.4% for the heterogeneous data
analyzed using JC to 59.9% for the simulations with long
trees analyzed using JC (fig. 1). Although the errors in these
simulations were much larger than those using GTR+G submodels, using the GTR+G appeared to reduce the extent of
the error in the data generated under long trees.
Amino Acid Sequence Simulations and Sensitivity of
the Goldman–Cox Test
Our approach to assess the performance of the GC test in
amino acid sequences was similar to that described above for
nucleotide sequences. We simulated sequences along phylogenetic trees under the following conditions: 1) the JTT model
(Jones et al. 1992); 2) the JTT model with a G distribution for
among-site rate variation (JTT+G); 3) the LG model (Le and
Gascuel 2008); 4) the LG+G model; 5) a scenario with mean
branch lengths of 0.7 subs/site, with a tree length of about 70
subs/site, and the JTT+G model; and 6) a nonstationary process, in which we simulate sequences under the JTT model
258
but a portion of the sequences within the tree have a different
amino acid composition (heterogeneous; fig. 2).
Our simulation of amino acid sequences showed that the
GC test is also useful for these data, although it is less powerful
than for nucleotides. All the substitution models appeared
adequate for the data generated under the JTT and LG
models with no among-site rate heterogeneity, with P values
between 0.59 for the data simulated and analyzed with the
JTT model, and 0.92 for the simulations under the JTT and
analyzed with LG (fig. 2). Similar to our nucleotide analyses,
the inclusion of the G distribution was an important aspect of
the adequacy of the models, such that only models that included the G distribution were adequate for the data simulated under JTT+G and LG+G. Also, consistent with our
nucleotide analyses, the test was not sensitive to mutational
saturation. For the data simulated with long trees and the
JTT+G model, it suggested that the JTT and JTT+G models
were adequate, with P values of 0.09 and 0.33, respectively. In
contrast to our nucleotide simulation analyses, the test did
not detect heterogeneity in amino acid composition. Rather,
it suggested that the JTT and JTT+G models were adequate
for these data, with P values of 0.85 and 0.82, respectively.
The impact of substitution model misspecification on the
estimates of tree topology and tree length was very small for
MBE
Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207
the data simulated under the JTT, JTT+G, LG, and LG+G
models. The mean PH85 distance to the true tree ranged
from 0 for the simulations under JTT and LG, and analyzed
with matching models, to 0.12 for those simulated under
JTT+G and LG+G and analyzed with LG. For these simulations, the mean percentage difference in tree length ranged
from 1.6% for the data simulated and analyzed under the JTT
model to 12.9% for those simulated using LG+G and analyzed
using JTT (fig. 2). The largest errors were for the data simulated with long trees. In this case, the mean PH85 distance to
the true tree ranged between 1.6 for the analyses using LG+G
or JTT+G and 2 for those using JTT. The mean percentage
difference in tree length was between 2.9% for the analyses
using JTT+G, which matches the model used to generate the
data, and 30.5% for LG. These errors were smaller than those
for our nucleotide simulations with long trees, which is likely
because amino acid sequences have more character states
than nucleotide sequences, such that it is possible to estimate
larger genetic distances. Contrary to our nucleotide analyses,
heterogeneity in amino acid composition did not lead to large
errors, with mean PH85 distances between 0.02 for analyses
with JTT and JTT+G, and 0.05 for those using LG. The mean
percentage difference in tree length for these data was between 0.6% and 10%.
Simulation of Codon Sequences and Posterior
Predictive Simulations
We conducted a set of simulations of protein coding sequences to assess the performance of codon models.
Accordingly, we simulated phylogenetic trees of 50 taxa and
mean branch lengths of 0.1 subs/site to generate sequences of
990 nt using the Python module Pyvolve (Spielman and Wilke
2015a, 2015b). The dN/dS ratio followed the model described
by Goldman and Yang (1994) with three dN/dS categories; 0.5,
1, 2.5, that is compatible with the empirical values observed in
some Flavivirus data sets (Duch^ene, Ho, et al. 2015). For this
set of simulations, we generated ten data sets because their
analysis is more computationally intensive than that for nucleotide and amino acid models. To assess the adequacy of
substitution models in these data we used posterior predictive simulations, instead of the GC test. This consists of analyzing the data within a Bayesian framework to estimate the
posterior distribution, and simulating data using parameters
GTR+Γ
M3
sampled form the posterior. In turn, the simulations are
used to generate a null distribution of the multinomial loglikelihood and the 2 test statistics, as described by
Huelsenbeck et al. (2001) and Foster (2004). We chose a
Bayesian approach because the GC test requires additional
maximum-likelihood optimization for every parametric bootstrap sample, which is computationally demanding for codon
models. In contrast, in posterior predictive simulations the
test statistics can be calculated directly from the simulated
data (supplementary material, Supplementary Material
online). A disadvantage of this method is that it is necessary
to compute two test statistics (the multinomial log-likelihood
and the 2), instead of one, , as in the case of the GC test
(Huelsenbeck et al. 2001; Foster 2004).
We analyzed the codon simulations using the M3 codon
model implemented in MrBayes v3.2 (Ronquist et al. 2012).
This model allows three dN/dS categories, such that it closely
matches that used to generate the data. For comparison, we
analyzed the data using the GTR+G nucleotide substitution
model. We computed the mean of the test statistics described
above and we assessed the performance in the estimates of
tree topology and tree length. We considered that the estimate of the tree topology was accurate if the tree used to
generate the data was contained within the 95% posterior
credible interval of the posterior trees. To measure the accuracy in the estimate of the tree length, we calculated the
percentage difference between the length of the tree used
to generate the data and the mean posterior tree length.
Similarly to our previous analyses, we report the mean
values across our simulations.
The multinomial log-likelihood test statistic correctly identified the M3 model as being correct, while it always rejected
the GTR+G model, with mean P-values of 0.38 and 0, respectively (fig. 3). However, the 2 suggested that both models
were adequate, at 0.5 for M3 and 0.62 for GTR+G. This discrepancy likely occurs because the site patterns under both
models are different, which is detected by the multinomial
log-likelihood, but they result in similar base composition,
producing in similar 2 values. In this particular case, using
an incorrect model did not result in poor estimates of tree
topology and tree length. In all cases, the true tree was within
the posterior samples. The performance in the estimates of
the tree length was very similar, ranging from 0.6 and 10.9 for
the GTR+G model, and from 0.2 and 11.9 for the M3 model.
multinominal
likelihood P-values
x2 P-values
tree topology
in posterior
Difference in tree length
% (min, max)
0
0.62
1
5.10 (0.6, 10.9)
0.38
0.50
1
6.2 (0.2, 11.9)
FIG. 3. Results of the codon model simulations. The data were generated using a codon model with three dN/dS categories and analyzed under the M3
codon model and GTR+G nucleotide substitution model. The statistics correspond to the mean over ten simulation replicates. The column labeled
“tree topology in posterior” corresponds to the proportion of replicates for which the tree used to generate the data was found within the 95% posterior
credible interval. The difference in tree length is the percentage of difference between the length of the tree used to generate the data and the mean of
the posterior, with the minimum and maximum values, respectively.
259
MBE
Duch^ene et al. . doi:10.1093/molbev/msv207
Nucleotide Substitution Model Adequacy in Flavivirus
Data Sets
We used the Bayesian information criterion (BIC) to select the
best-fit substitution model for the four Flavivirus data sets,
which was found to be GTR+G in all cases (table 1). We
then conducted the GC test using GTR+G, GTR, and JC. The
test suggested that data sets (a), (b) and (c) were not adequately described by the GTR+G or any submodels, with P
values < 0.05 in all cases (table 2). In contrast, data set (d),
appeared to be adequately described by the GTR+G model
only, with a P value of 0.08. Although our simulations, described
above, showed that the GC test could detect compositional
heterogeneity in nucleotide sequences, we also conducted a 2
homogeneity test to detect the percentage of sequences with
heterogeneous base composition. This 2 homogeneity test is
different to the 2 test statistic described above for posterior
predictive simulations. Its purpose is to determine whether any
sequences have base compositions that deviate substantially
from the rest of the data. Data set (a) only had 7.70% of heterogeneous sequences, whereas data sets (b) and (c) had
73.00% and 89.00%, respectively. Only data set (d) displayed
stationarity, with no heterogeneous sequences detected by the
test. Although these values of heterogeneity are highly variable
between data sets, even the 7.70% of sequences with heterogeneous base composition in data set (a) can be sufficient to
contribute to substitution model inadequacy (Huelsenbeck
et al. 2001; Foster 2004).
With empirical data it is not possible to assess whether the
estimates of tree topology and tree length are biased. However,
the estimates of tree length for the different models show that
increasing the number of parameters resulted in longer trees
(table 2). In all data sets, increasing the number of parameters
increased the estimate of the tree length, sometimes by more
than 2-fold. For example, in data set (c) the total tree length
using GTR+G was 33.29 subs/site, compared with 18.70 subs/
site under the JC model. This trend was less pronounced for
data set (d), for which the tree length estimate was 0.88 for
GTR+G and 0.85 for JC, a difference of only 0.03 subs/site.
Amino Acid Sequence Analyses in Flavivirus Data
Sets: Bayesian Model Averaging, Informative Priors on
Transition Rates, and Model Adequacy
We conducted the GC test with four empirically determined
substitution models: JTT, JTT+G, LG, and LG+G. These
models included those with the highest statistical fit for
each data set, according to the BIC (tables 1 and 3). The
GC test suggested that none of these was adequate for
data sets (a), (b), and (c), with P values of 0.00. However,
the LG+G model was adequate for data set (d). The 2 homogeneity test suggested that the proportion of sequences
with heterogeneous amino acid composition for each data set
were 2.00% for (a), 14.22% for (b), 10.02% for (c), and 0.00% for
(d). This contrasts with the results from nucleotide sequences,
which violate the assumption of stationarity more severely in
most data sets.
Commonly used amino acid models, including those used
here, employ fixed parameter values for the rate matrix, such
that exchange rates are fixed. We relaxed this constraint by
using two methods. First, we analyzed the data using Bayesian
model averaging with ten substitution rate matrices and a G
distribution for among-site rate variation. Second, we used a
Dirichlet prior distribution on the transition rates and amino
acid frequencies, as proposed by Huelsenbeck et al. (2008). In
Bayesian model averaging the sampling proportion of each
model is equivalent to their posterior probability, so it can be
used to select the maximum a posteriori model, known as the
MAP (Huelsenbeck et al. 2004). For data sets (a) and (d) the
MAP corresponded to the Poisson model, whereas for data
sets (b) and (c) the MAP corresponded to the WAG model
(Whelan and Goldman 2001). These models were overwhelmingly sampled, with posterior probabilities of 0.99, 0.96 0.97,
and 0.97 for the four data sets, respectively (table 3).
We assessed model adequacy for Bayesian model averaging
and the informative Dirichlet priors by using posterior predictive simulations as described for our codon model simulations. The multinomial log-likelihood suggested that
Bayesian model averaging and using an informative
Dirichlet prior were inadequate for data sets (a)–(c), with
P < 0.05. For data set (d), both methods appeared adequate,
with P values of 0.13 and 0.25, respectively. The 2 test statistic rejected both methods for data sets (b), (c) and (d)
(P = 0.00), but not for data set (a), with a P value of 0.02
(table 3). The combination of the two test statistics indicates
that none of the data sets was accurately described by either
of the methods. This result appears surprising for data set (d),
for which the LG+G model is adequate. However, it is important to note that this model is not included in the model
averaging method in MrBayes, although it can be specified
manually. Similarly, if the data were sufficiently informative
Table 2. Nucleotide Substitution Model Adequacy Using the GC Test, Estimates of Tree Length (subs/site), and the Proportion of Sequences with
Significant Differences in Base Composition for Different Flavivirus Data Sets (% het. sequences).
Data Set
P-value JC
P-value GTR
P-value GTR+
Tree length JC
Tree length GTR
Tree length GTR+
% het. sequences
260
(a)
0.00
0.00
0.00
4.34
4.50
9.86
7.70
(b)
0.004
0.00
0.00
17.27
15.90
30.83
73.00
(c)
0.00
0.00
0.00
18.70
17.07
33.29
89.00
(d)
0.00
0.00
0.08
0.85
0.87
0.88
0.00
MBE
Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207
Table 3. Substitution Model Adequacy, Homogeneity in Amino Acid Composition, and Estimates of Tree Length for Amino Acid Models in
Different Flavivirus Data Sets.
Data Set
P-value JTT
P-value JTT+
P-value LG
P-value LG+
P-value ML BMA
MAP (posterior)
P-value ML DP
P-value.v2 BMA
P-value.v2 DP
Tree length JTT
Tree length JTT+
Tree length LG
Tree length LG+
Tree length BMA
Tree length DP
% het. sequences
(a)
0.00
0.00
0.00
0.00
0.00
Poisson (0.99)
0.02
0.00
0.20
3.04
3.80
3.10
3.98
3.74
4.41
2.00
(b)
0.00
0.00
0.00
0.00
0.00
WAG (0.96)
0.00
0.00
0.00
12.05
14.95
12.21
15.97
14.94
17.98
14.22
(c)
0.00
0.00
0.00
0.00
0.00
WAG (0.97)
0.03
0.00
0.00
14.31
19.35
14.54
20.89
19.57
23.47
10.02
(d)
0.00
0.00
0.00
0.58
0.13
Poisson (0.99)
0.26
0.00
0.00
0.29
0.29
0.29
0.31
0.32
0.37
0.00
Table 4. Codon Model Adequacy Statistics Using Posterior Predictive Simulations for the Different Empirical Data Sets of Flavivirus Sequences.
Data set
P-value ML M3
P-value v2 M3
Tree length M3
(a)
0.003
0.01
9.81
the amino acid transition matrix in the method using
Dirichlet priors could collapse to the LG+G model, but this
does not appear to be the case for data set (d).
The different approaches to analyze amino acid data produced marked differences in the estimates of tree length.
Among the empirically determined models (JTT, JTT+G,
LG, and LG+G), the LG+G model produced the longest
trees. However, in all cases the method of using informative
Dirichlet priors resulted in even longer trees (table 3). For
example, the estimate of the tree length for data set (c)
was 14.3 subs/site for the JTT model, 19.57 subs/site for
model averaging, 20.89 subs/site for LG+G, and 23.47 subs/
site for that with informative priors (table 3).
Codon Model Adequacy in Flavivirus Data Sets
We analyzed our Flavivirus data sets using the M3 codon
model in MrBayes. The procedure to assess the adequacy of
this model also consisted of posterior predictive simulations.
The multinomial log-likelihood suggested that this model was
adequate for data set (d) only, with a P value of 0.07. However,
the 2 test rejected this model for all data sets, with P < 0.05
in all cases. The tree length estimates with this model can be
compared with those from the nucleotide sequences. For
data sets (b) and (c), the tree lengths with the M3 model
were 40.30 and 43.90 subs/site (table 4), which are about 25%
longer than those with GTR+G, at 30.83 and 33.29 subs/site,
(b)
0.00
0.00
40.3
(c)
0.01
0.00
43.9
(d)
0.07
0.00
0.89
respectively. The difference was much smaller for data sets (a)
and (d), at 9.81 and 0.89 subs/site with the M3 model, and
9.86 and 0.88 subs/site with GTR+G, respectively.
Temporal Signal in Flavivirus Data Sets
In empirical data, it is impossible to quantify the impact of
using inadequate models on phylogenetic estimates. If model
violation is mild, such as in our analyses with data simulated
under GTR and analyzed with the JC model, then the errors in
the estimates might be small. In contrast, if the assumptions
of the model are more severely violated, such as in our analyses of data generated under GTR+G and analyzed under the
JC model, the errors are much larger. Therefore, it might be
tempting to estimate evolutionary time scales on the grounds
that model inadequacy might lead to small errors in a particular data set. However, an additional consideration is that
the data should have sufficient temporal structure, which
refers to whether the number of substitutions that accumulate over the sampling time is sufficient to obtain reliable
estimates of the evolutionary rates and time scales.
We assessed the temporal structure in the four Flavivirus
data sets by conducting a date-randomization test (Ramsden
et al. 2009). The motivation behind this test is to determine
whether estimates of evolutionary rates and time scales are
simply driven by the prior in Bayesian phylogenetic analyses.
It consists of estimating the rate several times, after
261
MBE
Duch^ene et al. . doi:10.1093/molbev/msv207
(c)
(d)
nucleotides
(b)
true
tip randomisations
true
tip randomisations
true
tip randomisations
true
tip randomisations
true
tip randomisations
true
tip randomisations
true
tip randomisations
true
tip randomisations
amino acids
rate (substitutions/site/year)
(a)
FIG. 4. Substitution rate estimates for nucleotide and amino acid data sets of flaviviruses (a–d), and ten replicates of the date-randomization test. The
circles correspond to the mean estimate, whereas the error bars represent the 95% posterior credible interval. The first estimate was obtained with the
correct sampling times and the remaining ten were obtained with randomized sampling times.
amino acid
subs/site
nucleotide
subs/site
(a)
(b)
(c)
(d)
p 0.6173
p 0.1666
p 0.8813
p <0.0001
p 0.5866
p 0.6207
p 0.6207
p <0.0001
FIG. 5. Regressions of the number of substitutions from the root to the tips as a function of the sampling time for nucleotide and amino acid data of the
Flavivirus data sets (a)–(d). Each point corresponds to a sequence. The central line corresponds to the regression line, and the curved lines are the 95%
confidence intervals.
randomizing the sampling times. The data are considered to
have sufficient temporal structure (i.e., they pass the test) if
the rate estimates and their associated uncertainty with the
correct sampling times do not overlap with those of the
randomizations (Firth et al. 2010; Duch^ene S, Duch^ene DA,
et al. 2015). We conducted this test for the nucleotide and
amino acid sequences, but we did not use the M3 codon
model because it was inadequate for all data sets.
Data sets (a)–(c) failed the date-randomization test for at
least one of ten replicates for the nucleotide and amino acid
sequences. In contrast, for data set (d), both nucleotide and
amino acid data passed the test. For data sets (a)–(c), there
were some clear differences between the results for the different data sets (fig. 4). For data set (a) the rate estimate
overlapped with one of the randomizations for the nucleotide
data, and with four randomizations for the amino acid data.
The estimate for data set (b) overlapped with three randomizations in the nucleotide data, and with nine for the amino
acid data. Data set (c) had the least temporal structure, with
262
an estimate that overlapped with those from all the randomizations in the nucleotide and amino acid data.
This result was consistent with our regressions of the rootto-tip distance (in subs/site) as a function of the sampling
time (years). For data sets (a)–(c) the P values were between
0.17 and 0.88 for nucleotide sequences, and between 0.58 and
0.62 for amino acid sequences. For data set (d), the P values
were less than 0.0001 for both nucleotide and amino acid
sequences (fig. 5). In sum, this means that neither type of
data had temporal structure for data sets (a)–(c), such that
estimates of evolutionary rates and time scales for these data
are likely to be spurious.
Discussion
Violation of substitution model assumptions can be an important source of error in phylogenetic analyses. Our simulations show the extent to which estimates of tree topology and
tree length can be misled. Accurate estimates of these parameters are important to infer evolutionary rates and time scales.
Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207
For example, using underparameterized substitution models
can result in an underestimation of evolutionary time scales,
and a corresponding overestimation of the rate of evolution
(Phillips 2009). Our simulations illustrate three potential
sources of error: Substitution model misspecification, mutational saturation, and compositional heterogeneity. The latter
is a prevalent feature of RNA viruses (Mooers and Holmes
2000; Jenkins and Holmes 2003), including those of the
Flavivirus genus, where differences in base composition
seem to fit the host and vector in question (Jenkins et al.
2001). Some of these sources of error can be detected through
model adequacy methods. For example, the GC test is useful
to detect model misspecification and compositional heterogeneity (Foster 2004), but it is not able to distinguish whether
the model correctly accounts for mutational saturation. In
our simulations with mutational saturation the GC test incorrectly suggested that overly simple models were adequate,
even if they did not include among-site rate heterogeneity.
One reason for this finding is that mutational saturation does
not constitute a form of model violation, but it can obscure
information about the number of substitutions at individual
sites, such that among-site rate heterogeneity is underestimated. This pattern is consistent with a previous study, in
which the estimate of the parameter of the G distribution
was overestimated for data sets with mutational saturation,
indicating that among-site rate heterogeneity was underestimated (Duch^ene, Ho, et al. 2015).
The analyses of our amino acid simulations were largely
similar to those of nucleotides. However, an important difference is that compositional heterogeneity does not lead to
errors of the magnitude that we observed in nucleotide sequences. Using the model and amino acid frequencies that
described most of the data generally lead to accurate estimates of the tree topology and tree length. This is a probable
reason for why the GC test was unable to detect compositional heterogeneity in amino acid data (fig. 2).
Importantly, however, the advantages of using amino acid
data are hindered by the restrictive nature of fixed parameter
models, except in cases in which the model is adequate such
as for the LG+G model for our empirical data set (d). Bayesian
frameworks allow the implementation of potentially effective
solutions to this problem. The methods used here—model
averaging and using informative prior distributions—estimated substantially longer trees than the models with fixed
parameters for data sets (a)–(c), for which models with fixed
rate matrices were inadequate. We consider this to be evidence that these approaches mitigate this problem because
using incorrect substitution models tends to produce under-,
rather than overestimation of the tree length, given that the
estimate of the tree topology is not highly biased (Lemmon
and Moriarty 2004). Importantly, these methods were inadequate for our empirical data, implying that they are still not
ideal for many data sets.
Our codon sequence analyses showed that although
model adequacy methods are readily applicable to this kind
of data, it is difficult to determine a priori the benefits of using
these models. The difference in the estimates of tree length
between the GTR+G model and the M3 model was very small
MBE
for our simulations, but very large for our empirical data sets
(b) and (c). Our simulations were necessarily limited to a
simple scenario using a widely used mechanistic codon
model, although more realistic models are also available,
such as mutation–selection models (Halpern and Bruno
1998) and empirical models (Kosiol et al. 2007). A recent
simulation study found that inferences of selective constraints
using the best-fitting codon model, according to the Akaike
Information Criterion and the BIC, could produce biased estimates of the parameters typically used to detect adaptive
evolution (Spielman and Wilke 2015b). Clearly, further research is necessary to improve the selection and assessment
of codon models, for example, using different test statistics or
performance-based metrics (e.g., Minin et al. 2003; Abdo et al.
2005).
The test statistics used here to assess model adequacy are
sensitive to certain properties of the models and the data.
Underparameterization can be detected using the statistic
in the GC test, and the multinomial log-likelihood when conducing posterior predictive simulations. Compositional heterogeneity can also be detected using the statistic in
nucleotide sequences, or using the 2 statistic in Bayesian
analyses of amino acid data. However, in the context of
model adequacy, measuring the extent to which mutational
saturation is accounted for by the model is still difficult. A
promising solution is to develop model adequacy statistics
based on currently existing methods to quantify saturation,
such as entropy indices (Xia et al. 2003), homoplasy statistics
(Nielsen 2002; Bollback 2006; Lartillot et al. 2007), and changes
in the ratio of transitions to transversions over time
(Duch^ene, Ho, et al. 2015).
Our results support the view that accurate estimation of
deep evolutionary time scales in viruses is probably not possible using nucleotide sequence data alone (Holmes 2003).
Although amino acid sequences appear beneficial for this
purpose there is clearly ample room to improve the analyses
of these data. Importantly, we find that Bayesian methods
that circumvent the need to estimate a large number of parameters are a fruitful avenue of research. The development
of these models and algorithms should be informed by test
statistics in a model adequacy framework, including those
used here to measure site patterns, compositional heterogeneity, and others based on mutational saturation.
Even in cases in which the substitution models are adequate, inferring evolutionary rates and time scales in virus
data sets requires that the data have sufficient temporal structure. This means that the number of substitutions that have
accumulated over the sampling time should be sufficient to
estimate the rate of evolution, which is especially problematic
for Bayesian analyses, in which the prior can unduly influence
the estimates. Simulation studies show that analyses of data
with insufficient temporal structure tend overestimate the
rate, and consequently underestimate the evolutionary time
scale (Duch^ene S, Duch^ene DA, et al. 2015), which in part
offers an explanation for why the estimates of the time scales
for some viruses are much younger than suggested by other
lines of evidence, such as possible codivergence with host
species.
263
MBE
Duch^ene et al. . doi:10.1093/molbev/msv207
Strikingly, three of the four Flavivirus data sets analyzed
here have no temporal structure. This result might appear
surprising, given that these viruses can have a short-term
evolutionary rate of the order of 104–103 subs/site/year
(e.g., Sall et al. 2010; Volk et al. 2010; Carrington and Auguste
2013; Nunes et al. 2014, 2015), such that it is expected that
they should rapidly accumulate a large number of substitutions. This might occur because the bias in branch length
estimates due to model inadequacy might have a stronger
effect in long branches than in those that are short (Duch^ene
DA, Duch^ene S, and Ho 2015). Therefore, the relationship
between substitutions and sampling time appears overly dispersed, which is also consistent with our regressions of rootto-tip distance as a function of sampling time for data sets
(a)–(c) (supplementary material, Supplementary Material
online). We also found substantial overlap in the estimates
of amino acid and nucleotide substitution rates for our
Flavivirus data sets (a)–(c) (fig. 4). Although these estimates
are unreliable, we attribute this pattern to a combination of
overestimation of the amino acid substitution rate and strong
mutational saturation in the nucleotide data.
The sum of our results motivates further effort in the development of substitution models and methods for model
checking. Model adequacy methods are a critical for this purpose because they can reveal whether important features of
the data are accounted for, given that relevant test statistics
are used (Brown 2014a; Doyle et al. 2015; Duch^ene DA,
Duch^ene S, Holmes, et al. 2015). In this respect, our observation that our genus-wide Flavivirus data sets are poorly described by commonly used models suggests that current
estimates of their long-term evolutionary time scale are
likely unreliable, except for very closely related lineages,
such as those of DENV-2 that comprise data set (d).
Additional sources of information, such as ancient samples,
will provide invaluable information to elucidate their time of
origin.
Materials and Methods
Nucleotide Simulations
We simulated phylogenetic trees with 50 tips and a sequence
length of 2,000 sites using the packages Ape v3.0 (Popescu et al.
2012) and Phangorn v1.99 (Schliep 2011). For most simulations,
the branch lengths were drawn from a lognormal distribution
with mean 0.2 subs/site and standard deviation of 30% of the
mean. These settings are similar to those employed in other
simulation studies (e.g., Lemmon and Moriarty 2004). We simulated nucleotide sequences along these trees as implemented
in Phangorn using three nucleotide substitution models; JC,
GTR, and GTR+G. For the GTR and GTR+G models, we specified the Q matrix as used in a previous simulation study
(Duch^ene, Ho, et al. 2015). The corresponding stationary frequencies were: 0.49, 0.1, 0.1, and 0.49, for a, c, g, and t,
respectively. For the G distribution we specified four discrete
categories and a shape parameter, , of 1. To generate the data
for our simulations with compositional heterogeneity, we used
the GTR+G model as parameterized above, although for a set
of ten taxa within the tree we specified the model with the
264
following base composition: 0.1, 0.49, 0.49, and 0.1, for a, c, g,
and t, respectively. We chose these simulation settings to
represent the most extreme conditions in our empirical data,
particularly data sets (b) and (c) which had high variation in
base frequencies among sequences. We included a set of simulations with considerable mutational saturation by using the
GTR+G model but under trees with mean branch lengths of
0.5 subs/site. We generated 100 data sets under each of these
conditions.
We performed the GC test for the simulated data sets as
described by Goldman (1993a, 1993b). To do this, we used a
maximum-likelihood method to estimate the phylogenetic
tree, branch lengths, and substitution model parameters in
Phangorn. From this we used the maximum-likelihood parameter estimates to simulate 1,000 data sets, which is equivalent to a parametric bootstrap. For each of these data sets we
calculated the statistic, which represents the difference between the multinomial log-likelihood and the log-likelihood
under the model in question. We used the statistics from
the simulated data to generate a null distribution and calculate a P value. We performed the test under three nucleotide
substitution models: JC, GTR, and GTR+G.
Amino Acid Simulations
We simulated the evolution of amino acid sequences using
trees of 50 tips and a sequence length of 1,000 sites. The
branch lengths were drawn from a lognormal distribution
with mean 0.4 subs/site and a standard deviation of 30% of
the mean. We chose the mean branch length to be twice that
used for nucleotides to represent the fact that amino acid
sequences are effective at estimating longer evolutionary distances. We simulated the data under four commonly used
amino acid substitution models: JTT, JTT+G, LG, and LG+G.
Our parameterization of the G distribution was the same as
that of the nucleotide simulations. We simulated compositionally heterogeneous data by specifying the JTT+G model
for most of the tree, with the exception of ten sequences for
which we specified the amino acid frequencies to those of the
LG model. To simulate data under mutational saturation, we
sampled the branch lengths from a lognormal distribution
with mean 0.7 subs/site and a standard deviation of 30% of
the mean.
Using the GC test on amino acid data is different to that
for nucleotides because the model parameters are usually not
optimized. Instead, the simulations were conducted using the
maximum-likelihood estimates of the branch lengths and the
tree topology. The rest of the test remains the same.
Codon Sequence Simulation
For our codon sequence simulations, we generated phylogenetic trees with 50 tips and a mean branch length of 0.1 subs/
site. We simulated sequences of 990 nt, resulting in 330 codon
sites, under the GY model (Goldman and Yang 1994), implemented in Pyvolve (Spielman and Wilke 2015a). We specified
three dN/dS categories for codon sites: 0.5, 1, and 2.5, which
include the range of values estimated for some Flavivirus data
sets (Duch^ene, Ho, et al. 2015), and which represents sites
Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207
under considerably different selective constraints. We simulated ten data sets and analyzed them in MrBayes using the
M3 codon model and the GTR+G nucleotide model. In both
cases, we ran a single cold chain with a chain length of 1 108
steps and we verified that the effective sample size for all
parameters was at least 200 using the package CODA v0.16
(Plummer et al. 2006). To test the adequacy of these models
we conducted posterior predictive simulations, which consisted of simulating data using samples from the posterior
under each of the models. We conducted 1,000 posterior
predictive simulations for each of the ten data sets. We
then obtained null distributions for the multinomial loglikelihood and the 2 by calculating these test statistics for
each posterior predictive data set. To calculate a P value, we
compared the test statistic from the “true” data (i.e., that
analyzed in MrBayes) with the corresponding null distribution
under both models.
We compared the estimates from the M3 and GTR+G
models by considering whether the true tree topology was
contained within the 95% posterior credible interval of trees,
and by calculating the percentage of difference between the
true tree length and the mean posterior. We determined that
the true topology was within the posterior if the tree topology
distance between the true tree and any of those in the posterior was zero. The percentage difference in tree length was
similar to that used in our nucleotide and amino acid sequence simulations, except that we used the mean posterior
tree length, instead of the maximum-likelihood estimate.
Flavivirus Data Sets
We analyzed four data sets, three of which were used in previous studies of Flavivirus evolution. All data sets consisted in
complete genome nucleotide sequences downloaded from
GenBank. We aligned the nucleotide and the translated
amino acid data using MAFFT v7 (Katoh et al. 2002). Before
our phylogenetic analyses we removed ambiguous or misaligned sites using Gblocks v0.91 (Castresana 2000; Talavera
and Castresana 2007). This is important for model adequacy
methods because they rely on site patterns so they can be
misled by ambiguous or misaligned regions. The resulting
data sets were (a) 58 sequences of different tick-borne flaviviruses sampled between the years 1937 and 2008, with sequence lengths of 3,324 amino acids and 10,056 nt; (b) 100
sequences sampled from 1937 to 2010, with sequence lengths
of 2,463 amino acids and 8,856 nt; (c) 83 sequences sampled
from 1937 to 2011, with sequence lengths of 2,016 amino
acids and 7,647 nt; (d) 110 sequences of DENV-2 infecting
humans sampled between 1944 and 2013, with sequence
length of 3,391 amino acids and 10,713 nt (table 1). For our
codon model analyses, we removed ambiguous sites while
maintaining the correct reading frame.
MBE
that we used for our simulations. We verified that the nucleotide and amino acid models with the greatest statistical fit
for all these data sets were included in those that we considered for the GC test (table 1). This is an important aspect of
our study because it allows us to determine whether models
that have high statistical fit are also adequate. To do this, we
chose the models according to the BIC as implemented in
Modelgenerator v0.85 (Keane et al. 2006). Although the GC
test can detect compositional heterogeneity, it cannot determine whether this is the reason why a model is inadequate.
For example, if an empirical data set fails the test using a
particular model, it might be because the model is underparameterized, the data have substantial compositional heterogeneity, or both. To disentangle these factors, we included a
2 compositional homogeneity test as implemented in TREEPUZZLE v5.2 (Schmidt et al. 2002). This test determines the
proportion of sequences for which the base composition deviates significantly from the average and it serves a different
purpose to that of the 2 test statistic used in our posterior
predictive simulations.
For the amino acid sequences, we used two techniques to
relax the strong assumptions made by typical amino acid rate
matrices. The first was Bayesian model averaging, where a
range of amino acid matrices are considered as part of the
parameter space. The Markov chain proposes different
models throughout the analysis, and they are accepted or
rejected according to their posterior probability
(Huelsenbeck et al. 2004). We conducted these analyses in
MrBayes, which includes ten amino acid matrices. The second
method we used consists of specifying informative priors in
the form of a Dirichlet distribution for the parameters of the
rate matrix, as described by Huelsenbeck et al. (2008), and also
implemented in MrBayes. For these analyses we specified a G
distribution with four discrete categories, and ran the analyses
with two cold chains with lengths of 5 107 steps. We used
CODA to assess sufficient sampling from the stationary distribution by ensuring that the effective-sample size of all parameters was at least 200, and by visually inspecting the trace.
We also analyzed the Flavivirus data sets under the M3
codon model in MrBayes, which specifies three dN/dS categories for codon positions. The settings in MrBayes and assessment of sufficient sampling were the same as those for our
amino acid sequence analyses. Because we used a Bayesian
framework for our amino acid and codon sequence analyses,
we assessed the adequacy of the models through posterior
predictive simulations. We simulated 1,000 posterior predictive data sets for each analysis, and calculated the P value of
the multinomial log-likelihood and the 2 test statistics, as
described for our codon model simulations. The computer
code and empirical data sets used in this study are available in
GitHub (https://github.com/sebastianduchene/virus_model_
adequacy, last accessed October 7, 2015).
Flavivirus Sequence Analyses: Model Adequacy,
Bayesian Model Averaging, and Informative Priors on
the Substitution Rate Matrix
Assessing Temporal Structure in the Flavivirus Data
Sets
We performed the GC test on the four Flavivirus data sets
utilizing the nucleotide, amino acid, and the codon model
We assessed whether the Flavivirus data sets had sufficient
temporal structure to perform tip-date based molecular clock
265
Duch^ene et al. . doi:10.1093/molbev/msv207
dating by conducting a date-randomization test with ten
replicates. We utilized the best-fit nucleotide and amino
acid models in all cases (table 1), as well as an uncorrelated
lognormal clock model (Drummond et al. 2006) and a
constant-size tree prior. We conducted these analyses in
BEAST v1.8 (Drummond et al. 2012) with a chain length of
1 108 steps, and assessed sufficient sampling from the stationary distribution by ensuring that the effective sample size
of all parameters was at least 200. For comparison, we also fit a
linear regression of the root-to-tip distance as a function of
the sampling time. The slope of the line corresponds to the
rate, the R2 value is the extent to which evolution has been
clock-like, and the P value is used to determine whether the
estimate of the rate is significantly different from 0. To do this,
we estimated maximum-likelihood trees in PhyML (Guindon
et al. 2010) and used the optimal root found in Path-O-Gen
v1.4 (tree.bio.ed.ac.uk/software/pathogen/; Drummond et al.
2003). We then extracted the root-to-tip distances using
NELSI v0.2 (Ho et al. 2015) and fit the regressions using R
v3.1 (Ihaka and Gentleman 1996).
Supplementary Material
Supplementary data S1, table S1, and figure S1 are available at
Molecular Biology and Evolution online (http://www.mbe.
oxfordjournals.org/).
Acknowledgments
The study was supported by the Swiss National Science
Foundation grant number P2ZHP3_151594 to F.D.G. and
an NHMRC Australia Fellowship (AF30) to E.C.H.
References
Abdo Z, Minin VN, Joyce P, Sullivan J. 2005. Accounting for uncertainty
in the tree topology has little effect on the decision-theoretic approach to model selection in phylogeny estimation. Mol Biol Evol.
22:691–703.
Belshaw R, Sanjuan R, Pybus OG. 2011. Viral mutation and substitution:
units and levels. Curr Opin Virol. 1:430–435.
Bollback JP. 2002. Bayesian model adequacy and choice in phylogenetics.
Mol Biol Evol. 19:1171–1180.
Bollback JP. 2006. SIMMAP: stochastic character mapping of discrete
traits on phylogenies. BMC Bioinformatics 7:88.
Brown JM. 2014a. Detection of implausible phylogenetic inferences
using posterior predictive assessment of model fit. Syst Biol.
63:334–348.
Brown JM. 2014b. Predictive approaches to assessing the fit of evolutionary models. Syst Biol. 63:289–292.
Carrington CVF, Auguste AJ. 2013. Evolutionary and ecological factors
underlying the tempo and distribution of yellow fever virus activity.
Infect. Genet Evol. 13:198–210.
Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol.
17:540–552.
Doyle VP, Young RE, Naylor GJP, Brown JM. 2015. Can we identify genes
with increased phylogenetic reliability? Syst Biol 64:824–837.
Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4:699–710.
Drummond AJ, Pybus OG, Rambaut A. 2003. Inference of viral evolutionary rates from molecular sequences. Adv Parasitol. 54:331–358.
Drummond AJ, Suchard MA, Xie D, Rambaut A. 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol.
29:1969–1973.
266
MBE
Duch^ene DA, Duch^ene S, Ho SYW. 2015. Tree imbalance causes a bias in
phylogenetic estimation of evolutionary time scales using heterochronous sequences. Mol Ecol Resour. 15:785–794.
Duch^ene DA, Duch^ene S, Holmes EC, Ho SYW. 2015. Evaluating the
adequacy of molecular clock models using posterior predictive simulations. Mol Biol Evol. 32(11):2986–2995.
Duch^ene S, Duch^ene DA, Holmes EC, Ho SYW. 2015. The performance
of the date-randomization test in phylogenetic analyses of timestructured virus data. Mol Biol Evol. 32:1895–1906.
Duch^ene S, Ho SYW, Holmes EC. 2015. Declining transition/transversion
ratios through time reveal limitations to the accuracy of nucleotide
substitution models. BMC Evol Biol. 15:36.
Duch^ene S, Holmes EC, Ho SYW. 2014. Analyses of evolutionary dynamics in viruses are hindered by a time-dependent bias in rate estimates. Proc R Soc Lond B Biol Sci. 281:20140732.
Firth C, Kitchen A, Shapiro B, Suchard MA, Holmes EC, Rambaut A.
2010. Using time-structured data to estimate evolutionary rates of
double-stranded DNA viruses. Mol Biol Evol. 27:2038–2051.
Fitch WM, Leiter JM, Li XQ, Palese P. 1991. Positive Darwinian evolution
in human influenza A viruses. Proc Natl Acad Sci U S A. 88:4270–
4274.
Foster PG. 2004. Modeling compositional heterogeneity. Syst Biol.
53:485–495.
Gatesy J. 2007. A tenth crucial question regarding model use in phylogenetics. Trends Ecol Evol. 274:3–14.
Gelman A, Shalizi CR. 2013. Philosophy and the practice of Bayesian
statistics. Br J Math Stat Psychol. 66:8–38.
Goldman N. 1993a. Simple diagnostic statistical tests of models for DNA
substitution. J Mol Evol. 37:650–661.
Goldman N. 1993b. Statistical tests of models of DNA substitution. J Mol
Evol. 36:182–198.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 11:725–736.
Gould EA, de Lamballerie X, Zanotto PM, Holmes EC. 2001. Evolution,
epidemiology, and dispersal of flaviviruses revealed by molecular
phylogenies. Adv Virus Res. 57:71–103.
Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O.
2010. New algorithms and methods to estimate maximum likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol.
59:307–321.
Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding
sequences: modeling site-specific residue frequencies. Mol Biol Evol.
15:910–917.
Heinze DM, Gould EA, Forrester NL. 2012. Revisiting the clinal concept
of evolution and dispersal for the tick-borne flaviviruses by using
phylogenetic and biogeographic analyses. J Virol. 86:8663–8671.
Ho SYW, Duch^ene S. 2014. Molecular-clock methods for estimating
evolutionary rates and time scales. Mol Ecol. 23:5947–5975.
Ho SYW, Duch^ene S, Duch^ene DA. 2015. Simulating and detecting
autocorrelation of molecular evolutionary rates among lineages.
Mol Ecol Resour. 15:688–696.
Holmes EC. 2003. Molecular clocks and the puzzle of RNA virus origins.
J Virol. 77:3893–3897.
Huelsenbeck JP, Joyce P, Lakner C, Ronquist F. 2008. Bayesian analysis of
amino acid substitution models. Philos Trans R Soc Lond B Biol Sci.
363:3941–3953.
Huelsenbeck JP, Larget B, Alfaro ME. 2004. Bayesian phylogenetic model
selection using reversible jump Markov chain Monte Carlo. Mol Biol
Evol. 21:1123–1133.
Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science
294:2310–2314.
Ihaka R, Gentleman R. 1996. R: a language for data analysis and graphics.
J Comput Graph Stat. 5:299–314.
Jenkins GM, Holmes EC. 2003. The extent of codon usage bias in human
RNA viruses and its evolutionary origin. Virus Res. 92:1–7.
Jenkins GM, Pagel M, Gould EA, de A Zanotto PM, Holmes EC. 2001.
Evolution of base composition and codon usage bias in the genus
Flavivirus. J Mol Evol. 52:383–390.
Substitution Model Adequacy in Viruses . doi:10.1093/molbev/msv207
Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci.
8:275–282.
Jorba J, Campagnoli R, De L, Kew O. 2008. Calibration of multiple poliovirus molecular clocks covering an extended evolutionary range.
J Virol. 82:4429–4440.
Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method
for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059–3066.
Keane TM, Creevey CJ, Pentony MM, Naughton TJ, Mclnerney JO. 2006.
Assessment of methods for amino acid matrix selection and their
use on empirical data shows that ad hoc assumptions for choice of
matrix are not justified. BMC Evol Biol. 6:29.
Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH,
Wolinsky S, Bhattacharya T. 2000. Timing the ancestor of the HIV-1
pandemic strains. Science 288:1789–1796.
Kosiol C, Holmes I, Goldman N. 2007. An empirical codon model for
protein sequence evolution. Mol Biol Evol. 24:1464–1479.
K€uhnert D, Wu CH, Drummond AJ. 2011. Phylogenetic and epidemic
modeling of rapidly evolving infectious diseases. Infect Genet Evol.
11:1825–1841.
Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch
attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol. 7:S4.
Le SQ, Gascuel O. 2008. An improved general amino acid replacement
matrix. Mol Biol Evol. 25:1307–1320.
Lemmon AR, Moriarty EC. 2004. The importance of proper model assumption in Bayesian phylogenetics. Syst Biol. 53:265–277.
Minin V, Abdo Z, Joyce P, Sullivan J. 2003. Performance-based selection
of likelihood models for phylogeny estimation. Syst Biol. 52:674–683.
Mooers A, Holmes EC. 2000. The evolution of base composition and
phylogenetic inference. Trends Ecol Evol. 15:365–369.
Moureau G, Cook S, Lemey P, Nougairede A, Forrester NL, Khasnatinov
M, Charrel RN, Firth AE, Gould EA, de Lamballerie X. 2015. New
insights into flavivirus evolution, taxonomy and biogeographic history, extended by analysis of canonical and alternative coding sequences. PLoS One 10:e0117849.
Nielsen R. 2002. Mapping mutations on phylogenies. Syst Biol. 51:729–
739.
Nunes MRT, Faria NR, de Vasconcelos JM, Golding N, Kraemer MUG, de
Oliveira LF, Azevedo RSS, da Silva DEA, da Silva EVP, da Silva SP, et al.
2015. Emergence and potential for spread of Chikungunya virus in
Brazil. BMC Med. 13:102.
Nunes MRT, Palacios G, Faria NR, Sousa EC Jr, Pantoja JA, Rodrigues SG,
Carvalho VL, Medeiros DBA, Savji N, Baele G. 2014. Air travel is
associated with intracontinental spread of dengue virus serotypes
1–3 in Brazil. PLoS Negl Trop Dis. 8:e2769.
Penny D, Hendy M. 1985. The use of tree comparison metrics. Syst Zool.
34:75–82.
Pettersson JH-O, Fiz-Palacios O. 2014. Dating the origin of the genus
Flavivirus in the light of Beringian biogeography. J Gen Virol.
95:1969–1982.
Phillips MJ. 2009. Branch-length estimation bias misleads molecular
dating for a vertebrate mitochondrial phylogeny. Gene 441:132–140.
Plummer M, Best N, Cowles K, Vines K. 2006. CODA: convergence
diagnosis and output analysis for MCMC. R News 6:7–11.
Popescu A-A, Huber KT, Paradis E. 2012. ape 3.0: new tools for distancebased phylogenetics and evolutionary analysis in R. Bioinformatics
28:1536–1537.
Posada D, Buckley TR. 2004. Model selection and model averaging in
phylogenetics: advantages of Akaike information criterion and
Bayesian approaches over likelihood ratio tests. Syst Biol. 53:793–808.
Ramsden C, Holmes EC, Charleston MA. 2009. Hantavirus evolution in
relation to its rodent and insectivore hosts: no evidence for codivergence. Mol Biol Evol. 26:143–153.
Ripplinger J, Sullivan J. 2010. Assessment of substitution model adequacy
using Frequentist and Bayesian methods. Mol Biol Evol. 27:2790–
2803.
MBE
Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, H€ohna S,
Larget B, Liu L, Suchard MA, Huelsenbeck JP. 2012. MrBayes 3.2:
efficient Bayesian phylogenetic inference and model choice across
a large model space. Syst Biol. 61:539–542.
Sall AA, Faye O, Diallo M, Firth C, Kitchen A, Holmes EC. 2010. Yellow
fever virus exhibits slower evolutionary dynamics than dengue virus.
J Virol. 84:765–772.
Sanderson MJ. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol Biol Evol. 14:1218–
1231.
Sanjuan R. 2012. From molecular genetics to phylodynamics: evolutionary relevance of mutation rates across viruses. PLoS Pathog.
8:e1002685.
Schliep KP. 2011. phangorn: phylogenetic analysis in R. Bioinformatics
27:592–593.
Schmidt HA, Strimmer K, Vingron M, von Haeseler A. 2002. TREEPUZZLE: maximum likelihood phylogenetic analysis using quartets
and parallel computing. Bioinformatics 18:502–504.
Schneider A, Cannarozzi GM, Gonnet GH. 2005. Empirical codon substitution matrix. BMC Bioinformatics 6:134.
Spielman SJ, Wilke CO. 2015a. Pyvolve: a flexible Python module for
simulating sequences along phylogenies. PLoS One 10:e0139047.
Spielman SJ, Wilke CO. 2015b. The relationship between dN/dS and
scaled selection coefficients. Mol Biol Evol. 32:1097–1108.
Sullivan J, Joyce P. 2005. Model selection in phylogenetics. Annu Rev Ecol
Evol Syst. 36:445–466.
Talavera G, Castresana J. 2007. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein
sequence alignments. Syst Biol. 56:564–577.
Thorne JL, Kishino H, Painter IS. 1998. Estimating the rate of evolution of
the rate of molecular evolution. Mol Biol Evol. 15:1647–1657.
Volk SM, Chen R, Tsetsarkin KA, Adams AP, Garcia TI, Sall AA, Nasar F,
Schuh AJ, Holmes EC, Higgs S, et al. 2010. Genome-scale phylogenetic analyses of chikungunya virus reveal independent emergences of recent epidemics and various evolutionary rates. J
Virol. 84:6497–6504.
Volz EM, Koelle K, Bedford T. 2013. Viral phylodynamics. PLoS Comput
Biol. 9:e1002947.
Volz EM, Pond SLK, Ward MJ, Brown AJL, Frost SDW. 2009. Phylodynamics
of infectious disease epidemics. Genetics 183:1421–1430.
Wertheim JO, Chu DKW, Peiris JSM, Pond SLK, Poon LLM. 2013. A case
for the ancient origin of coronaviruses. J Virol. 87:7039–7045.
Wertheim JO, Pond SLK. 2011. Purifying selection can obscure the ancient age of viral lineages. Mol Biol Evol. 28:3355–3365.
Whelan S, Goldman N. 2001. A general empirical model of protein
evolution derived from multiple protein families using a maximum
likelihood approach. Mol Biol Evol. 18:691–699.
Whelan S, Lio P, Goldman N. 2001. Molecular phylogenetics: state-ofthe-art methods for looking into the past. Trends Genet. 17:262–272.
Woo PCY, Lau SKP, Lam CSF, Lau CCY, Tsang AKL, Lau JHN, Bai R, Teng
JLL, Tsang CCC, Wang M. 2012. Discovery of seven novel mammalian and avian coronaviruses in Deltacoronavirus supports bat coronaviruses as the gene source of Alphacoronavirus and
Betacoronavirus and avian coronaviruses as the gene source of
Gammacoronavirus and Deltacoronavirus. J Virol. 86:3995–4008.
Xia X, Xie Z, Salemi M, Chen L, Wang Y. 2003. An index of substitution
saturation and its application. Mol Phylogenet Evol. 26:1–7.
Xie W, Lewis PO, Fan Y, Kuo L, Chen M-H. 2011. Improving marginal
likelihood estimation for Bayesian phylogenetic model selection. Syst
Biol. 60:150–160.
Yang Z, Nielsen R, Goldman N, Pedersen A-MK. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid
sites. Genetics 155:431–449.
Yang Z, Rannala B. 2006. Bayesian estimation of species divergence times
under a molecular clock using multiple fossil calibrations with soft
bounds. Mol Biol Evol. 23:212–226.
Zuckerkandl E. 1976. Evolutionary processes and evolutionary noise at
the molecular level. J Mol Evol. 7:269–311.
267