Variation in Evolutionary Processes at Different Codon Positions

Lee Bofkin and Nick Goldman
European Molecular Biology Laboratory–European Bioinformatics Institute, Hinxton, United Kingdom
Evolutionary studies commonly model single nucleotide substitutions and assume that they occur as independent draws
from a unique probability distribution across the sequence studied. This assumption is violated for protein-coding sequences, and we consider modeling approaches where codon positions (CPs) are treated as separate categories of sites because
within each category the assumption is more reasonable. Such ‘‘codon-position’’ models have been shown to explain the
evolution of codon data better than homogeneous models in previous studies. This paper examines the ways in which codon-position models outperform homogeneous models and characterizes the differences in estimates of model parameters
across CPs. Using the PANDIT database of multiple species DNA sequence alignments, we quantify the differences
in the evolutionary processes at the 3 CPs in a systematic and comprehensive manner, characterizing previously undescribed features of protein evolution. We relate our findings to the functional constraints imposed by the genetic code,
protein function, and the types of mutation that cause synonymous and nonsynonymous codon changes. The results increase our understanding of selective constraints and could be incorporated into phylogenetic analyses or gene-finding
techniques in the future. The methods used are extended to an overlapping reading frame data set, and we discover that
overlapping reading frames do not necessarily cause more stringent evolutionary constraints.
Introduction
The occurrence of point (single nucleotide) mutations
is common in nature. Although larger scale mutations certainly exist (such as doublet and triplet mutations or gene
conversion: see Whelan and Goldman 2004 and references
therein), the majority of phylogenetic reconstruction methods in the parsimony, distance matrix, and maximum likelihood frameworks utilize information from point mutations.
Such mutations also underpin single nucleotide polymorphism analyses (The International HapMap Consortium
2003). However, the fact that most studies work at the
single nucleotide level does not mean that all sites
evolve in a homogeneous pattern.
It is a common practice to treat different positions in
a multiple sequence DNA alignment as if they evolve under
identical mutational and selective pressures and can be described using the same mathematical model. For example,
the program Modeltest (Posada and Crandall 1998) that is
widely used to determine the model that should be used to
analyze a DNA sequence multiple alignment only considers
models where the nucleotide frequencies and transition:
transversion (ts:tv) biases are assumed to be identical across
all sites (models with heterogeneity in evolutionary rates
across sites are allowed, but such models still consider sites
to be independent and identically distributed). Mutational
and selective pressures on a sequence alignment are measured
as estimates of explicit model parameters in mechanistic
models of evolution, which estimate the process of sequence
evolution using the data itself. In models that assume a homogeneous evolutionary process across sites, model parameters
are estimated only once, as average values over all sites.
Sites in a DNA multiple alignment data set may not
have evolved according to a single common evolutionary
pattern. We may expect sites in coding sequences, overlapping reading frames, isochores, regions of a gene that encode part of the same structural domain of a protein, and
genes within the same chromosome to evolve more similarly to other sites in the same ‘‘site category,’’ as defined
by a biological expectation, than to sites in other site categories. We may even expect variation in mutational processes across nonfunctional ‘‘junk’’ DNA. If we ignore
differences in the evolutionary processes between heterogeneously evolving sites, then the parameter estimates of our
mechanistic models will be poor explanations of the evolution of the data (inappropriate for all positions), which
jeopardizes the accuracy of our inferences.
Key words: adaptive evolution, codon positions, phylogenetic inference, protein-coding sequences, sequence evolution.
E-mail: [email protected].
Mol. Biol. Evol. 24(2):513–521. 2007
doi:10.1093/molbev/msl178
Advance Access publication November 21, 2006
Here, we use the systematic variation that is specific to
coding DNA sequences (CDSs) to investigate differences
in evolutionary rates, heterogeneity of evolutionary rates,
ts:tv biases, and nucleotide frequencies between the 3 CPs
of protein-coding genes. We expect first-codon positions
within a gene to evolve more similarly to other first-codon
positions than to either second- or third-codon positions,
second-codon positions to evolve more similarly to other
second-codon positions than to first- or third-codon positions, and likewise for third-codon positions.
Variation in mutation and selection may cause differences in evolutionary patterns across a sequence. For protein-coding DNA sequences, although the actual pattern of
mutation may vary across sites, the systematic difference in
fixation rates of mutations across sites is more likely to be
due to differences in natural selection as a consequence of
the structure of the genetic code. Even if the probabilities of
a given mutation occurring at different CPs were identical,
mutations will be fixed in the species according to patterns
specific to each CP, due to differences in functional constraints (and thus natural selection). In the case of codons,
different evolutionary constraints at different CPs result
from the functional constraints imposed by the genetic code
and the physicochemical properties of encoded amino acids.
© 2006 The Authors
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The evolutionary constraints at different CPs are best
considered in light of the genetic code. The 61 sense codons
code for 20 amino acids in the universal genetic code, leading to some redundancy. The second-codon position is the
most functionally constrained; any change to the second-codon position causes a nonsynonymous change in the
coding sequence. The third-codon position is the least functionally constrained; indeed, with particular combinations
of first- and second-codon position sequences, the third-codon position may be 4-fold degenerate, in which case
any base at the third-codon position will code for the same
amino acid.
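The constraints described above can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original analysis: the function name `synonymous_fraction` and the choice to ignore changes to stop codons are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [codon for codon, aa in CODE.items() if aa != "*"]


def synonymous_fraction(pos):
    """Fraction of single-base changes at codon position `pos` (0, 1, or 2)
    between sense codons that leave the encoded amino acid unchanged."""
    syn = total = 0
    for codon in SENSE:
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == "*":
                continue  # assumption: changes to stop codons are ignored
            total += 1
            syn += CODE[mutant] == CODE[codon]
    return syn / total


# First/second-base pairs for which the third position is 4-fold degenerate.
FOURFOLD = [a + b for a, b in product(BASES, repeat=2)
            if len({CODE[a + b + c] for c in BASES}) == 1]

for pos in range(3):
    print(f"CP {pos + 1}: synonymous fraction = {synonymous_fraction(pos):.3f}")
print("4-fold degenerate codon families:", len(FOURFOLD))
```

Under these assumptions no second-position change between sense codons is synonymous, whereas 8 of the 16 first/second-base combinations leave the third position 4-fold degenerate.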
Thus, the approach we adopt assumes that there are
differences in the evolutionary patterns between the 3
CPs and that all sites at the same position within different
codons evolve according to the same evolutionary pattern.
Therefore, this investigation uses nucleotide-based models
that can estimate model parameters independently for each
of the 3 CPs in a multiple alignment. Previous investigations that have used large data sets have not investigated
how a wide range of estimates of model parameters may
vary between CPs. Conversely, studies that have used wider
ranges of models have used very small data sets, which may
be prone to poorly defined biases. The models we use are
similar to the CP models used by Shapiro et al. (2006).
Yang (1996) has shown that estimates of certain model
parameters, such as evolutionary rates, rate heterogeneity,
and ts:tv biases, may vary significantly between different
CPs for genes in a small mitochondrial data set. Kumar
(1996) demonstrated that rate heterogeneity, ts:tv biases,
and nucleotide frequencies may differ significantly between
CPs in a small data set of mitochondrial genes. Huelsenbeck
and Nielsen (1999) demonstrated that the ts:tv bias may differ significantly between CPs on a small data set of protein-coding genes. Shapiro et al. (2006) have shown that models
accounting for differences between the 3 CPs explain the
evolution of 283 protein-encoding multiple alignment data
sets from yeast and RNA virus genes better than models
assuming a homogeneous evolutionary pattern.
We do not expect CP models to fully explain protein-coding sequence evolution because variation in selection
between codons is not accounted for, but they do provide
a more reasonable model than homogeneous nucleotide
evolutionary models. Furthermore, unlike protein domains,
which have different positions in different data sets, identification of different CP categories in different multiple
alignments is trivial.
Although modeling CDS evolution using codons, not
nucleotides, as the smallest unit of sequence evolution
might be beneficial, there are reasons why we do not pursue
this approach. Published codon models, such as those of
Goldman and Yang (1994), Muse and Gaut (1994), and
Yang et al. (2000), cannot readily be used to estimate
the differences in estimates of certain parameters at the different CPs: certain parameters are not estimated independently at the different CPs. Additionally, codon models
require more computation, which may be prohibitive for
such a large data set as is used here (see Ren et al.
2005). The CP models that we use capture some of the
complexity of codon sequence evolution and can be
implemented with little more computational effort than
standard homogeneous nucleotide models.
Although it is known that CPs evolve differently, the
nature of these differences has not been fully characterized
with sizeable data sets. In addition, some biological factors
have not previously been investigated for variation between
CPs. This study uses large amounts of data and a wider
range of models than any previous study of its kind, to characterize differences in the evolutionary properties at the
different CPs. We investigate the differences in evolutionary rates, heterogeneity of evolutionary rates, ts:tv biases
(denoted by κ), and nucleotide frequencies (denoted by
π) between the CPs. Results are discussed in terms of
the functional constraints of the genetic code and the effects
that mutations have on coding sequences. We discuss how
the results demonstrate uses for CP models, how parameter
estimates may be useful in Bayesian studies, and how simple models could be used to identify CDSs in multiple
alignments.
Materials and Methods
Models
Following standard procedures in probabilistic modeling of DNA sequence evolution (see, e.g., Felsenstein 2004),
we assume that sites evolve independently, each according
to a Markov process: the probability of a site changing to
any other given state (i.e., nucleotide) in any small time
interval depends only on its current state and not on previous states (the process is memoryless). The evolutionary
process is assumed to be time homogeneous, reversible,
and at equilibrium, as is standard practice within the field
of maximum likelihood phylogenetics (Felsenstein 2004).
A range of evolutionary models that explicitly describe
the process of substitution between nucleotide states was applied to the data. Some models include allowance for 3 categories of sites, where each category contains all the sites of
a specific CP. Here, the process of evolution follows an identical distribution across all sites in the same category: models
with multiple-site categories have common estimates for
model parameters within each category, but parameter estimates may differ between categories. Thus, specific model
parameters may be estimated independently for different categories of sites in a sequence alignment. We used maximum
likelihood to fit free parameters to the data sets studied.
Parameter value estimates were examined to observe differences between site categories; statistical tests (see below)
determined when these differences were significant.
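As an illustration of the site categories described above, the sketch below splits an in-frame coding alignment into its three CP categories and tabulates observed nucleotide frequencies in each. It is only a sketch: the function names and the toy alignment are hypothetical, and the observed frequencies are simple counts rather than the maximum likelihood estimates used in this study.

```python
from collections import Counter


def split_by_codon_position(alignment):
    """Return three lists of alignment columns, one per codon position.

    `alignment` is a list of equal-length, in-frame coding sequences.
    """
    length = len(alignment[0])
    if length % 3 != 0 or any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be in frame and of equal length")
    columns = ["".join(seq[i] for seq in alignment) for i in range(length)]
    return [columns[cp::3] for cp in range(3)]


def nucleotide_frequencies(columns):
    """Observed A/C/G/T frequencies over a list of columns (gaps ignored)."""
    counts = Counter(base for col in columns for base in col if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical toy alignment of two sequences, three codons each.
toy = ["ATGGCCAAA",
       "ATGGCTAAG"]
categories = split_by_codon_position(toy)
for cp, cols in enumerate(categories, start=1):
    print(f"CP {cp}:", cols, nucleotide_frequencies(cols))
```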
The HKY model of DNA substitution (Hasegawa et al.
1985) is the basic model from which our more complex CP
models are developed. Under the HKY model, the probability of a nucleotide changing to a different given nucleotide depends on the frequency of the target nucleotide and
whether or not the mutation is a transition or a transversion.
The HKY model is complex enough to describe the features
that we wish to model but simple enough to produce results
that can be easily interpreted.
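The structure of the HKY model can be sketched as an instantaneous rate matrix. The code below is an illustrative construction, not the BASEML implementation: the function names are hypothetical, the matrix is scaled to a mean rate of 1 by assumption, and the example values of κ and π are the CP 3 estimates from table 2, used purely for illustration.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True if x -> y is purine<->purine or pyrimidine<->pyrimidine."""
    return (x in PURINES) == (y in PURINES)


def hky_rate_matrix(pi, kappa):
    """Build the HKY rate matrix Q as a nested dict.

    pi    -- dict mapping each base to its equilibrium frequency
    kappa -- ts:tv rate ratio
    The rate x -> y is proportional to pi[y], multiplied by kappa for
    transitions; Q is scaled so that the mean substitution rate is 1.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = (kappa if is_transition(x, y) else 1.0) * pi[y]
        q[x][x] = -sum(q[x].values())  # rows of Q sum to zero
    mean_rate = -sum(pi[x] * q[x][x] for x in BASES)
    return {x: {y: q[x][y] / mean_rate for y in BASES} for x in BASES}


pi = {"A": 0.24, "C": 0.20, "G": 0.21, "T": 0.35}
Q = hky_rate_matrix(pi, kappa=4.34)
for x in BASES:
    print(x, [round(Q[x][y], 3) for y in BASES])
```

Because every off-diagonal rate is proportional to the frequency of the target nucleotide, the process is reversible: π_x Q_xy = π_y Q_yx for all pairs of bases.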
Rate heterogeneity across sites is usually modeled
using a distribution of rates, commonly a discretized γ-distribution whose shape is governed by a single parameter,
α (Yang 1993, 1994b). The rate at each site is taken to be a
random draw from this distribution. When α is low,
there is extreme variation in the rate at which different positions in the alignment have evolved, with most sites having
evolved at very low rates and relatively few having evolved
at much higher rates. As α increases, the rate variation
decreases and the distribution of evolutionary rates
becomes bell shaped, and as α tends to infinity, all sites tend
toward evolving at the same rate.
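The effect of the shape parameter can be illustrated with a small sketch of a discretized γ-distribution with mean 1. The number of categories (4) and the use of category means as representative rates are assumptions made here for illustration; the text does not state which discretization was used. The helper functions are written with the standard library only and are hypothetical names.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(a, p):
    """Value x with P(a, x) = p, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(a, hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(a, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean rate in each of k equal-probability categories of a gamma
    distribution with shape alpha and mean 1 (rate parameter = alpha)."""
    # Cut points are expressed on the scale x = alpha * rate.
    cuts = [0.0] + [gamma_quantile(alpha, i / k) for i in range(1, k)]
    upper = [reg_lower_gamma(alpha + 1.0, c) for c in cuts] + [1.0]
    return [k * (upper[i + 1] - upper[i]) for i in range(k)]


for alpha in (0.2, 1.0, 5.0):
    print(alpha, [round(r, 4) for r in discrete_gamma_rates(alpha)])
```

With a low α most categories have rates close to zero and one category has a high rate; with a high α the category rates cluster around 1.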
In the more complex models used in this investigation,
we investigated evolutionary rates (models
denoted +R), rate heterogeneity using a γ-distribution
(denoted +G), ts:tv biases (+T), and nucleotide frequencies (+N) in various combinations. Where these estimates
were allowed to be independent for each category of CPs,
models are denoted +3R, +3G, etc. Thus, for example,
model HKY + 3R is based on the HKY model and estimates only the evolutionary rates independently for each
of the 3 categories of CPs. The most complex model used
estimates the evolutionary rates, αs, κs, and πs independently for the 3 CP categories; we write this model as
HKY + 3R + 3N + 3T + 3G. The models used in this
investigation are presented in table 1. All models were applied to the data using the program BASEML in the PAML
suite of programs (Yang 1997).
Table 1
Models and Statistical Tests Used and the Percentage of PANDIT Families that Are Significant for Various Tests
Test   Null (A) and Alternate (B) Models                                   Degrees of Freedom   Percentage of PANDIT Families with Alternate Model Significantly Better
T-1    A: HKY + G; B: HKY + 3R + G                                         2                    97
T-2    A: HKY + 3R + 3N + 3T + G; B: HKY + 3R + 3N + 3T + 3G               2                    30
T-3    A: HKY + 3R + 3N + 3G; B: HKY + 3R + 3N + 3T + 3G                   2                    75
T-4    A: HKY + 3R + 3T + 3G; B: HKY + 3R + 3N + 3T + 3G                   6                    95
NOTE: Notation is such that, for example, for the model HKY + 3R + 3N + 3T + 3G, separate rates (R), nucleotide frequencies (N), ts:tv biases (T), and rate
heterogeneities (G) were estimated for each category (+3) of CP across the sites in a multiple alignment. A simple ‘‘+G,’’ as in the model HKY + G, means that a single
γ-distribution is applied across all sites to account for heterogeneity in evolutionary rates.
Statistical Tests
We can compare how well different models explain
the evolution of the same data set by comparing the maximum likelihood values of the models (Goldman 1993;
Yang et al. 1994; Whelan et al. 2001). All tests used in this
paper compare null and alternate models where the null hypothesis model is a special case of the alternate hypothesis
model, that is, with some of its parameters fixed to certain
values (the simpler null hypothesis model is ‘‘nested’’ in the
more complex alternate hypothesis model). In this situation,
the model comparison can be performed using a likelihood
ratio test (LRT), where the distribution of twice the difference in maximum log likelihoods of the models given the
data (the LRT statistic) is χ² if the null hypothesis model is
correct (Yang et al. 1994; Whelan and Goldman 1999). The
χ² distribution has its number of degrees of freedom equal
to the difference in numbers of free parameters between the
models tested. Where the LRT statistic exceeds the 95%
mark of the χ² distribution, we consider the alternate
hypothesis to be a significantly better explanation of the
evolution of the sequences studied.
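The LRT described above can be sketched in a few lines. This is an illustrative implementation rather than the code used in the study: the function names are hypothetical, the log-likelihood values in the example are invented, and the closed-form χ² tail probability used here is valid only for an even number of degrees of freedom (which covers the 2 and 6 degrees of freedom in table 1).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate, even df only.

    For df = 2m the tail probability has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


def likelihood_ratio_test(lnl_null, lnl_alt, df, level=0.05):
    """Return (LRT statistic, P value, significant?) for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat < 0:
        raise ValueError("negative LRT statistic: suspect failed optimization")
    p_value = chi2_sf_even_df(stat, df)
    return stat, p_value, p_value < level


# Hypothetical maximum log likelihoods for a null and an alternate model.
stat, p_value, significant = likelihood_ratio_test(-10234.6, -10180.2, df=2)
print(f"LRT statistic = {stat:.1f}, P = {p_value:.3g}, significant = {significant}")
```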
The tests that were constructed from the models applied to the data are presented in table 1. Test T-1 (HKY +
G vs. HKY + 3R + G, i.e., models differing by the +3R
component) is a test of whether making an allowance for
different evolutionary rates at each of the 3 CPs is significant. T-2 tests whether there are significant differences
between CPs in the levels of among-site rate heterogeneity
(+G cf. +3G). T-3 and T-4 test for significant differences
in ts:tv biases and nucleotide frequencies at different CPs,
respectively.
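The degrees of freedom listed in table 1 can be recovered by counting free parameters in each model. The sketch below assumes a conventional parameterization that is not spelled out in the text: three free nucleotide frequencies per set (they sum to 1), one κ and one α per set, and the rate of CP 1 fixed at 1 so that +3R adds two free rates. Branch lengths are omitted because they are common to both models in each test. The function name and the model strings are hypothetical.

```python
N_CP = 3  # number of codon-position categories

# For each component: (free parameters when shared across all sites,
#                      free parameters when estimated per CP category).
PARAMS = {
    "R": (0, N_CP - 1),   # relative rates; CP 1 is fixed at 1
    "N": (3, 3 * N_CP),   # nucleotide frequencies
    "T": (1, N_CP),       # ts:tv bias
    "G": (1, N_CP),       # gamma shape parameter
}


def free_parameters(model):
    """Count substitution-model parameters for, e.g., 'HKY+3R+3N+3T+3G'."""
    components = model.replace(" ", "").split("+")[1:]  # drop "HKY"
    count = 0
    for name, (shared, per_cp) in PARAMS.items():
        if "3" + name in components:
            count += per_cp
        elif name in components or name in "NT":
            # HKY always carries one shared set of frequencies and one kappa.
            count += shared
    return count


TESTS = {
    "T-1": ("HKY+G", "HKY+3R+G"),
    "T-2": ("HKY+3R+3N+3T+G", "HKY+3R+3N+3T+3G"),
    "T-3": ("HKY+3R+3N+3G", "HKY+3R+3N+3T+3G"),
    "T-4": ("HKY+3R+3T+3G", "HKY+3R+3N+3T+3G"),
}
for test, (null, alt) in TESTS.items():
    print(test, "df =", free_parameters(alt) - free_parameters(null))
```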
Limits to the models that can be applied using
BASEML affected how some of the tests were devised.
For example, BASEML does not permit testing for differences in nucleotide frequencies or ts:tv biases between CPs
without also estimating differences in evolutionary rates
between CPs. Furthermore, it is not possible to test for
differences in evolutionary rates between CPs while
simultaneously accounting for possible differences between
nucleotide frequencies and ts:tv biases at the different CPs.
Data
We used the PANDIT database (release 17), which
contains 7,738 multiple alignments of protein-coding
DNA sequences, each with a phylogenetic tree (Whelan
et al. 2006). The alignments and tree topologies were used
as provided and branch lengths were reestimated for each
analysis. Although there may be some errors in the alignments and phylogenetic trees, note that using reasonable
estimates of phylogenetic trees should give reasonable parameter estimates (Yang 1994a, 1994b; Yang et al. 1994,
1998; Yang, Goldman, et al. 1995; Sullivan et al. 1996;
Adachi et al. 2000). We assume that this is also true for
multiple alignments. These alignments represent a considerably greater number of data sets than has previously been
studied in this manner, spanning a broader range of genes
and species.
Results
Successful Optimizations
The program BASEML performs parameter estimation by likelihood maximization and sometimes fails when
the algorithm fails to find the optimal set of parameter values for a data set. Results for any given PANDIT family
were retained only when all models shown in table 1 optimized successfully for that family. Negative values for the
test statistics of nested hypotheses are indicative of failed
optimizations because such values are theoretically impossible. If one test fails for a given family, we consider it more
likely that optimization is challenging for other models, and
the best guarantee of accuracy is to discard the results for
the entire family.
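The retention criterion can be sketched as a simple filter. The data layout (a dictionary of maximum log likelihoods per model), the function name, and the example values are hypothetical; the `tol` argument is an added assumption that allows for numerical noise and defaults to the strict criterion described above.

```python
# Null and alternate model pairs for tests T-1 to T-4 (see table 1).
NESTED_TESTS = [
    ("HKY+G", "HKY+3R+G"),
    ("HKY+3R+3N+3T+G", "HKY+3R+3N+3T+3G"),
    ("HKY+3R+3N+3G", "HKY+3R+3N+3T+3G"),
    ("HKY+3R+3T+3G", "HKY+3R+3N+3T+3G"),
]


def successfully_optimized(lnl_by_model, tests=NESTED_TESTS, tol=0.0):
    """True if no nested test yields a negative LRT statistic.

    A negative statistic means the more general model reached a lower
    likelihood than its special case, which is theoretically impossible
    and therefore indicates a failed optimization.
    """
    return all(2.0 * (lnl_by_model[alt] - lnl_by_model[null]) >= -tol
               for null, alt in tests)


# Hypothetical log likelihoods for one family.
family = {
    "HKY+G": -5120.4,
    "HKY+3R+G": -5043.9,
    "HKY+3R+3N+3T+G": -4990.2,
    "HKY+3R+3N+3G": -4995.7,
    "HKY+3R+3T+3G": -5010.3,
    "HKY+3R+3N+3T+3G": -4988.8,
}
print(successfully_optimized(family))
```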
Applying this criterion, optimizations were successful
for 7,158 of the 7,738 alignments (93%). The families
where optimizations failed tended to have fewer sequences
in the alignment (median 106 sequences per family for families where optimization failed vs. 309 sequences per family
for successful optimizations), which suggests that small
amounts of data may make optimization of parameter
estimates more challenging. The percentages of the 7,158
successfully optimized families for which results for each
test were significant are presented in table 1.
FIG. 1. Distributions of evolutionary rates at second- and third-codon positions (relative to rate 1 for CP 1). Note the logarithmic scaling on the
y-axis. [Histogram not reproduced; x-axis: evolutionary rate; y-axis: number of families; series: codon position 2 and codon position 3.]
CP Differences
Test T-1 investigates the effect of making an allowance for a difference in evolutionary rates at different
CPs. We find that evolutionary rates vary significantly between CPs in 97% of families. Figure 1 shows the estimates
of evolutionary rates at the second- and third-codon positions, relative to the estimated evolutionary rate at the first-codon position, which is arbitrarily set equal to 1 (results
only reported when test T-1 was significant). Second-codon
positions tend to evolve more slowly than first-codon positions, which in turn tend to evolve more slowly than third-codon positions. The width of the distributions is also worth
noting; the rates of the third-codon positions of the different
PANDIT families have a broader distribution than the rates
of the second-codon positions. Presumably, these results reflect differences in the functional constraints that the genetic
code places on the different CPs.
The results for test T-2 indicate that heterogeneity in
evolutionary rates varies significantly between CPs in 30%
of families. This is the weakest effect studied, but still
affects a large number of families. The low percentage
may be because there is relatively little such variation,
but is probably because a single γ-distribution applied
across all 3 CPs can already model much of the rate variation between CPs as well as within each CP category. Distributions of estimates of the parameter α at the different
CPs are shown in figure 2 (where test T-2 was significant).
The second-codon position shows the most rate heterogeneity (lowest value of α). Third-codon positions have the
least rate heterogeneity; such sites tend to evolve at more
similar rates to each other than other CPs do. It is interesting
that most estimated values of α exceed 1, which means that
rate heterogeneity within CP categories is not strong. Indeed, the many high α values at third-codon positions, although shown only when test T-2 was significant, suggest
that it might be interesting in future to consider models that
allow rate heterogeneity at only the first- and second-codon
positions.
The results for test T-3 indicate that ts:tv biases vary
significantly between CPs in 75% of families. First- and
second-codon positions have virtually identical distributions of estimates of κ across families in the PANDIT
database. The third-codon position tends to have a much
higher ts:tv bias than the first- or second-codon position.
Some values for the estimates of κ are very high for the
third-codon position, which may occur when the evolutionary distance between sequences is small and almost all
changes inferred are transitions; these estimates will have
a very high variance. The distributions of estimates of κ
at the 3 CPs inferred using model HKY + 3R + 3N +
3T + G are shown in figure 3 for each PANDIT family
where test T-3 was significant.
The results from T-4 indicate that nucleotide frequencies vary significantly between CPs in 95% of families. The
distributions of the frequencies of the four nucleotides at
each CP are shown in figure 4A–C. CP 2 has a bias against
nucleotides G and C, which are more mutagenic than nucleotides A and T (Costantini et al. 2006). CP 1 often has a bias
toward purines, and CP 3 has a slight bias toward pyrimidines.
Consistency of Results Under Different Models
Different model choices are available for testing
the between-codon position variation of evolutionary
parameters; for example, differences in ts:tv biases (test
T-3) could instead have been studied by comparing models
HKY + 3R and HKY + 3R + 3T. We have generally presented results from the most complex models appropriate
for the investigation of each feature. Because more complex
models were preferred for most data sets, these are generally expected to give superior parameter estimates.
FIG. 2. Distributions of rate heterogeneity parameters α at different CPs.
The use of less complex models made little difference
to the number of families that were significant when testing
for differences between CPs for all parameters except for
rate heterogeneity (results not shown). Other versions of
T-2 using alternate hypothesis models less complex than
HKY + 3R + 3N + 3T + 3G led to considerable increases
in the percentage of families that gave significant test results
when considering differences in rate heterogeneity between
CPs. Failure to account for other differences (in evolutionary rate, ts:tv bias, etc.) between CPs presumably causes an
increase in the number of families where rate heterogeneity
was significantly different between the CPs due to interactions between parameters: the ‘‘+3G’’ model of CP
differences in rate heterogeneity can generate likelihood
improvements in the absence of this form of variation when
other genuine effects are not explicitly modeled. Thus, our
results are clearer when considering the models and tests
that we have presented.
FIG. 3. Distributions of transition:transversion biases at different CPs.
FIG. 4. Distributions of nucleotide frequencies at the (A) first-, (B) second-, and (C) third-codon position. [Histograms not reproduced; x-axis: nucleotide frequencies at codon position 1, 2, or 3; y-axis: number of families; series: A, C, G, and T.]
Discussion
Of the parameters studied in this paper, the evolutionary rate and nucleotide frequencies differ significantly most
often between CPs. Alongside the ts:tv biases, modeling the
differences in these factors between CPs is very important
in our modeling approaches. Rate heterogeneity is the
weakest of the factors tested, with the fewest
families showing significant differences between the
CPs. However, 30% of 7,158 data sets represents enough
cases to suggest it is important to test for rate heterogeneity
differences between CPs, even though models permitting
these differences may be found unnecessary for a majority
of studies.
Table 2
Parameter Estimates for the HBV Data Set
CP Category   Rate          α             κ             π
1             1             0.27 ± 0.07   1.56 ± 0.28   A:0.25, C:0.24, G:0.25, T:0.26
2             0.33 ± 0.07   0.05 ± 0.00   1.97 ± 0.57   A:0.27, C:0.25, G:0.18, T:0.29
3             3.92 ± 0.52   1.02 ± 0.18   4.34 ± 0.45   A:0.24, C:0.20, G:0.21, T:0.35
4 (1 + 3)     1.87 ± 0.28   0.33 ± 0.06   3.18 ± 0.45   A:0.21, C:0.30, G:0.24, T:0.26
5 (1 + 2)     0.66 ± 0.12   0.14 ± 0.05   1.73 ± 0.39   A:0.21, C:0.31, G:0.23, T:0.26
6 (2 + 3)     0.94 ± 0.16   0.21 ± 0.05   2.92 ± 0.55   A:0.18, C:0.32, G:0.21, T:0.29
NOTE: For rate, α, and κ, standard errors of parameter estimates are shown after each point estimate.
The estimates of model parameters at the different
CPs, and their distributions over many data sets, can be related to our knowledge of the functional constraints of the
genetic code and the probability of different mutations
causing synonymous and nonsynonymous changes to the
coding sequence. Indeed, the parameter estimates (including the fact that ts:tv biases are present) support the hypothesis that the genetic code is adaptive (Freeland and Hurst
1998; Freeland et al. 2000).
The functional constraints that result from the genetic
code may explain the observed variation in evolutionary
rates at the different CPs. Similarly, the most extreme rate
heterogeneity at CP 2, which means that the majority of
second-codon positions evolve slowly and a small proportion evolve more rapidly, may be explained by the strong
functional constraint on the majority of such sites. More
rapidly evolving sites may be under less purifying selection
than average.
Interestingly, first- and second-codon positions have
virtually identical distributions of estimates of κ, which
may be because the proportion of transitions and transversions that cause synonymous or nonsynonymous changes
in the amino acid encoded by a codon is virtually identical
for first- and second-codon positions (Rambaut A, personal
communication). The third-codon position tends to have
a much higher ts:tv bias than the first- or second-codon position, and this is likely to be because the majority of transversions are nonsynonymous for the third-codon position.
Thus, transversions are disproportionately selected against at
the third-codon position, elevating the ts:tv bias. We therefore expect that values of κ estimated for third-codon positions will overestimate underlying mutation bias toward
transitions, which may be better described by the lower values found at first- and second-codon positions.
The results regarding κ (fig. 3) might suggest that first- and second-codon positions could be grouped together in
a single category in future investigations. Although this
might be appropriate for the κ parameter, figures
1, 2, and 4A and B suggest that significant differences in
rates, rate heterogeneity, and nucleotide frequencies between CPs 1 and 2 are common.
The bias against nucleotides G and C that is observed
for CP 2 can be viewed in terms of the evolutionary constraints for the preservation of codon function at the second-codon position and selection against rapid change. G and C
are more mutagenic than nucleotides A and T (Costantini
et al. 2006), and reducing the GC frequency at CP 2 presumably helps to reduce the mutation rate. It is interesting to
note that CP 1 has a purine bias (A and G) and CP 3 has a
slight pyrimidine bias (T and C). The biological reasons
for this are not entirely clear but nucleotide size may
affect mRNA properties and transcription or translation
efficiency.
In an extension to the study described above, a data set
of overlapping reading frames was analyzed in much the
same way as the PANDIT families. There are 6 categories
of sites in an overlapping reading frame data set (3 categories of sites correspond to the 3 CPs in nonoverlapping
regions, and 3 categories of sites correspond to overlaps between CPs 1 and 2, 1 and 3, and 2 and 3, respectively).
Yang, Lauder, et al. (1995) published a hepatitis B virus
(HBV) data set of 13 aligned sequences and analyzed
the rate differences only between the site categories. In order to achieve statistical significance when comparing models, they concatenated all of the HBV genes in the multiple
alignments into a single meta-data set with 6 site categories.
We analyzed this data set in the same way, using the greater
variety of models as presented in table 1. After adjusting the
degrees of freedom in the tests in table 1 for the greater
number of site categories, each of tests T-1 to T-4 is significant for the concatenated HBV data set; parameter estimates are presented in table 2.
Notice that the evolutionary rates, rate heterogeneities,
and ts:tv biases of the overlapping reading frame positions
(categories 4–6 in table 2) are intermediate between the values of their component separate CPs. This is in contrast to
the naïve expectation, with respect to evolutionary rates,
that ‘‘considering the double roles performed by sites in
these [overlapping] classes, we should expect these 3 rate
parameters to be less than 1.’’ (Yang, Lauder, et al. 1995,
p. 591), where ‘‘1’’ refers to the rate of the first-codon position against which the rates for the other site categories are
compared. Yang, Lauder, et al. (1995) interpreted this result, again for rates only, as a consequence of concatenating
the separate genes. We consider these results to be a reflection of the evolutionary processes occurring in the data set
and not a modeling artifact. Because the order of evolutionary rates at the different CPs (3 . 1 . 2) is not the same as
the order of ts:tv biases (3 . 1 ’ 2) for the PANDIT studies, the intermediate values of such parameters in the HBV
data set are unlikely to be a consequence of interactions in
the rate parameters of different genes. Overlapping reading
frames may evolve in the least constrained parts of the HBV
genome (Rambaut A, personal communication), allowing
lower levels of constraint than we might expect. This may
lead to intermediate evolutionary parameters for estimates
of rate, ts:tv biases, and rate heterogeneity. Increased evolutionary constraint is not a necessary consequence of
overlapping reading frame positions; indeed, more rapid
evolution may help the virus to evade the host immune
system. Care should be taken in extrapolating our results
to other overlapping reading frames as only a single data
set has been used.
Conclusions
There are clear systematic differences in the evolutionary patterns of the 3 CPs of protein-coding DNA. This variation is not accounted for when we use simple nucleotide
models of evolution that do not explicitly incorporate CP
effects. Despite this, overly simplistic models are used in
many current studies.
The effects of using overly simplistic models of evolution will vary from data set to data set. Simple
models tend to underestimate the evolutionary distance between sequences by inadequately estimating the number of
multiple substitutions that have occurred at any given site
(e.g., Gojobori et al. 1982; Yang et al. 1994). Incorrect
models may misestimate the phylogeny (e.g., Philippe
and Germot 2000; Phillips et al. 2004) or inflate the confidence in any given maximum likelihood topology (Yang,
Goldman, et al. 1995). Additionally, parameter estimates of
simplistic models may not have biological validity; they are
biased averages of the parameter estimates of different categories of sites and may also be confounded by other factors
that have not been modeled explicitly. Thus, in agreement
with Shapiro et al. (2006), CP models provide better
explanations of the evolution of codon data sets for the vast
majority of such multiple alignments, compared with simple models.
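The underestimation of distance by homogeneous models can be illustrated with a small calculation. The sketch below is an assumption-laden toy and not an analysis from this study: it uses the Jukes-Cantor (JC69) model in place of the richer models considered here, treats the 3 CPs as equally sized site categories, and uses hypothetical relative rates. Expected proportions of differing sites are computed per category, pooled, and then corrected as if all sites shared one rate.

```python
import math


def jc69_p(d):
    """Expected proportion of differing sites at JC69 distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """JC69 distance implied by a proportion p of differing sites (p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def homogeneous_estimate(true_distance, relative_rates):
    """Distance recovered by a single-rate JC69 correction.

    relative_rates holds one hypothetical rate per equally sized site
    category; rates are rescaled to mean 1 so that true_distance is the
    average number of substitutions per site across categories.
    """
    mean_rate = sum(relative_rates) / len(relative_rates)
    ps = [jc69_p(true_distance * r / mean_rate) for r in relative_rates]
    pooled_p = sum(ps) / len(ps)
    return jc69_distance(pooled_p)


if __name__ == "__main__":
    # Hypothetical relative rates for CPs 1, 2, and 3 (not estimates from the paper).
    print(homogeneous_estimate(1.0, [0.5, 0.2, 2.3]))
```

Because the expected proportion of differences is a concave function of distance, pooling categories with unequal rates gives a corrected distance below the true average, whereas equal rates recover the true distance.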
It may in future be possible to relate differences in
parameter estimates between CPs to different gene functions (annotation that is not yet available in PANDIT),
the degree of purifying selection (ascertained using codon
models: Yang and Bielawski 2000), or both (see Aris-Brosou 2005). We have not yet attempted these analyses.
Our findings are relevant to the design of gene-finding
algorithms that use multiple sequence alignments to identify candidate CDSs. Because the evolutionary rate, rate
heterogeneity, ts:tv biases, and nucleotide frequencies
may all differ between CPs, a model incorporating allowances for such periodic variation in evolutionary patterns
may add power in the identification of CDSs. Even the most
complex current methods (e.g., McAuliffe et al. 2004) do
not take advantage of this evolutionary information. Our
research also suggests, in agreement with Shapiro et al.
(2006), that models accounting for differences in evolutionary patterns across CPs should be used to model CDS evolution more accurately, instead of the homogeneous models
that are currently widespread. Additionally, parameter
distributions obtained in this investigation can be used as
appropriate prior distributions in Bayesian studies.
This investigation has examined many large codon
sequence multiple alignments and described in detail the
variation in evolutionary properties of the CPs. Parameter
values have not previously been quantified as
comprehensively. The evolutionary patterns
detected are consistent with our knowledge of functional
constraints of the genetic code. Furthermore, an increase
in evolutionary constraint is not an inevitable consequence
of overlapping reading frames, although we caution against
extrapolating these results to all overlapping reading
frames. The types of models discussed can be implemented
using existing software, such as PAML (Yang 1997); they
should be used in preference to homogeneous models to
reflect the biology of protein-coding sequences better
and to improve our inferences regarding protein sequence
evolution. This research provides insights into our developing understanding of the effects of selection on sequence
evolution.
Acknowledgments
L.B. was supported by a Wellcome Trust Prize Studentship and the European Molecular Biology Laboratory
and was a member of Darwin College, University of
Cambridge. N.G. was supported by the Wellcome Trust.
We thank Adrian Friday, Andrew Rambaut, Alexei Drummond, and an anonymous reviewer for helpful suggestions.
Funding for the Open Access publication charges was
provided by the Wellcome Trust.
Literature Cited
Adachi J, Waddell PJ, Martin W, Hasegawa M. 2000. Model of
amino acid substitution in proteins encoded by mitochondrial
DNA. J Mol Evol. 42:459–468.
Aris-Brosou S. 2005. Determinants of adaptive evolution at the
molecular level: the extended complexity hypothesis. Mol Biol
Evol. 22:200–209.
Costantini M, Clay O, Auletta F, Bernardi G. 2006. An isochore
map of human chromosomes. Genome Res. 16:536–541.
Felsenstein J. 2004. Inferring phylogenies. Sunderland (MA):
Sinauer Associates.
Freeland S, Hurst L. 1998. The genetic code is one in a million.
J Mol Evol. 47:238–248.
Freeland S, Knight R, Landweber L, Hurst L. 2000. Early fixation
of an optimal genetic code. Mol Biol Evol. 17:511–518.
Gojobori T, Ishii K, Nei M. 1982. Estimation of average number
of nucleotide substitutions when the rate of substitution varies
with nucleotide. J Mol Evol. 18:414–422.
Goldman N. 1993. Statistical tests of models of DNA substitution.
J Mol Evol. 37:650–661.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide
substitution for protein-coding DNA sequences. Mol Biol
Evol. 11:725–736.
Hasegawa M, Kishino H, Yano T. 1985. Dating the human-ape
splitting by a molecular clock of mitochondrial DNA. J Mol
Evol. 22:160–174.
Huelsenbeck J, Nielsen R. 1999. Variation in the pattern of nucleotide substitution across sites. J Mol Evol. 48:86–93.
Kumar S. 1996. Patterns of nucleotide substitution in mitochondrial protein coding genes of vertebrates. Genetics. 143:537–548.
McAuliffe JD, Pachter L, Jordan MI. 2004. Multiple-sequence
functional annotation and the generalized hidden Markov phylogeny. Bioinformatics. 20:1850–1860.
Muse S, Gaut B. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates,
with applications to the chloroplast genome. Mol Biol Evol.
11:715–724.
Philippe H, Germot A. 2000. Phylogeny of eukaryotes based on
ribosomal RNA: long-branch attraction and models of sequence evolution. Mol Biol Evol. 17:830–834.
Phillips M, Delsuc F, Penny D. 2004. Genome-scale phylogeny
and the detection of systematic biases. Mol Biol Evol.
21:1455–1458.
Posada D, Crandall K. 1998. Modeltest: testing the model of DNA
substitution. Bioinformatics. 14:817–818.
Ren F, Tanaka H, Yang Z. 2005. An empirical examination of the
utility of codon-substitution models in phylogeny reconstruction. Syst Biol. 54:808–818.
Shapiro B, Rambaut A, Drummond A. 2006. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol Biol Evol. 23:7–9.
Sullivan J, Holsinger K, Simon C. 1996. The effect of topology
on estimates of among-site variation. J Mol Evol. 42:308–312.
The International HapMap Consortium. 2003. The international
HapMap project. Nature. 426:789–796.
Whelan S, de Bakker PIW, Quevillon E, Rodriguez N, Goldman
N. 2006. PANDIT: an evolution-centric database of protein
and associated nucleotide domains with inferred trees. Nucleic
Acids Res. 34:D327–D331.
Whelan S, Goldman N. 1999. Distributions of statistics used for
the comparison of models of sequence evolution in phylogenetics. Mol Biol Evol. 16:1292–1299.
Whelan S, Goldman N. 2004. Estimating the frequency of events that
cause multiple-nucleotide changes. Genetics. 167:2027–2043.
Whelan S, Liò P, Goldman N. 2001. Molecular phylogenetics:
state-of-the-art methods for looking into the past. Trends
Genet. 17:261–272.
Yang Z. 1993. Maximum likelihood estimation of phylogeny from
DNA sequences when substitution rates differ over sites. Mol
Biol Evol. 10:1396–1401.
Yang Z. 1994a. Estimating the pattern of nucleotide substitution.
J Mol Evol. 39:105–111.
Yang Z. 1994b. Maximum likelihood phylogenetic estimation
from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39:306–314.
Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J Mol Evol. 42:587–596.
Yang Z. 1997. PAML: a program package for phylogenetic
analysis by maximum likelihood. Comput Appl Biosci. 13:555–556.
Yang Z, Bielawski JP. 2000. Statistical methods for detecting
molecular adaptation. Trends Ecol Evol. 15:496–503.
Yang Z, Goldman N, Friday A. 1994. Comparison of
models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol. 11:316–324.
Yang Z, Goldman N, Friday A. 1995. Maximum likelihood trees
from DNA sequences: a peculiar statistical estimation problem.
Syst Biol. 44:384–399.
Yang Z, Lauder IJ, Lin HJ. 1995. Molecular evolution of the hepatitis B virus genome. J Mol Evol. 41:587–596.
Yang Z, Nielsen R, Goldman N, Pedersen A-M. 2000. Codon-substitution models for the heterogeneous selection pressure
at amino acid sites. Genetics. 155:431–449.
Yang Z, Nielsen R, Hasegawa M. 1998. Models of amino acid
substitution and applications to mitochondrial protein evolution. Mol Biol Evol. 15:1600–1611.
Martin Embley, Associate Editor
Accepted November 15, 2006