Variation in Evolutionary Processes at Different Codon Positions

Lee Bofkin and Nick Goldman
European Molecular Biology Laboratory–European Bioinformatics Institute, Hinxton, United Kingdom

Evolutionary studies commonly model single nucleotide substitutions and assume that they occur as independent draws from a unique probability distribution across the sequence studied. This assumption is violated for protein-coding sequences, and we consider modeling approaches where codon positions (CPs) are treated as separate categories of sites because within each category the assumption is more reasonable. Such “codon-position” models have been shown to explain the evolution of codon data better than homogeneous models in previous studies. This paper examines the ways in which codon-position models outperform homogeneous models and characterizes the differences in estimates of model parameters across CPs. Using the PANDIT database of multiple species DNA sequence alignments, we quantify the differences in the evolutionary processes at the 3 CPs in a systematic and comprehensive manner, characterizing previously undescribed features of protein evolution. We relate our findings to the functional constraints imposed by the genetic code, protein function, and the types of mutation that cause synonymous and nonsynonymous codon changes. The results increase our understanding of selective constraints and could be incorporated into phylogenetic analyses or gene-finding techniques in the future. The methods used are extended to an overlapping reading frame data set, and we discover that overlapping reading frames do not necessarily cause more stringent evolutionary constraints.

Introduction

The occurrence of point (single nucleotide) mutations is common in nature.
Although larger scale mutations certainly exist (such as doublet and triplet mutations or gene conversion: see Whelan and Goldman 2004 and references therein), the majority of phylogenetic reconstruction methods in the parsimony, distance matrix, and maximum likelihood frameworks utilize information from point mutations. Such mutations also underpin single nucleotide polymorphism analyses (The International HapMap Consortium 2003). However, the fact that most studies work at the single nucleotide level does not mean that all sites evolve in a homogeneous pattern. It is common practice to treat different positions in a multiple sequence DNA alignment as if they evolve under identical mutational and selective pressures and can be described using the same mathematical model. For example, the program Modeltest (Posada and Crandall 1998), which is widely used to choose the model for analyzing a DNA sequence multiple alignment, only considers models where the nucleotide frequencies and transition:transversion (ts:tv) biases are assumed to be identical across all sites (models with heterogeneity in evolutionary rates across sites are allowed, but such models still consider sites to be independent and identically distributed). Mutational and selective pressures on a sequence alignment are measured as estimates of explicit model parameters in mechanistic models of evolution, which estimate the process of sequence evolution using the data themselves. In models that assume a homogeneous evolutionary process across sites, model parameters are estimated only once, as average values over all sites. Sites in a DNA multiple alignment data set may not have evolved according to a single common evolutionary pattern.
We may expect sites in coding sequences, overlapping reading frames, isochores, regions of a gene that encode part of the same structural domain of a protein, and genes within the same chromosome to evolve more similarly to other sites in the same “site category,” as defined by a biological expectation, than to sites in other site categories. We may even expect variation in mutational processes across nonfunctional “junk” DNA. If we ignore differences in the evolutionary processes between heterogeneously evolving sites, then the parameter estimates of our mechanistic models will be poor explanations of the evolution of the data (inappropriate for all positions), which jeopardizes the accuracy of our inferences.

Key words: adaptive evolution, codon positions, phylogenetic inference, protein-coding sequences, sequence evolution.
Mol. Biol. Evol. 24(2):513–521. 2007. doi:10.1093/molbev/msl178. Advance Access publication November 21, 2006.

Here, we use the systematic variation that is specific to coding DNA sequences (CDSs) to investigate differences in evolutionary rates, heterogeneity of evolutionary rates, ts:tv biases, and nucleotide frequencies between the 3 CPs of protein-coding genes. We expect first-codon positions within a gene to evolve more similarly to other first-codon positions than to either second- or third-codon positions, second-codon positions to evolve more similarly to other second-codon positions than to first- or third-codon positions, and likewise for third-codon positions. Variation in mutation and selection may cause differences in evolutionary patterns across a sequence. For protein-coding DNA sequences, although the actual pattern of mutation may vary across sites, the systematic difference in fixation rates of mutations across sites is more likely to be due to differences in natural selection as a consequence of the structure of the genetic code.
Even if the probabilities of a given mutation occurring at different CPs were identical, mutations will be fixed in the species according to patterns specific to each CP, due to differences in functional constraints (and thus natural selection). In the case of codons, different evolutionary constraints at different CPs result from the functional constraints imposed by the genetic code and the physicochemical properties of encoded amino acids. The evolutionary constraints at different CPs are best considered in light of the genetic code. The 61 sense codons code for 20 amino acids in the universal genetic code, leading to some redundancy. The second-codon position is the most functionally constrained; any change to the second-codon position causes a nonsynonymous change in the coding sequence. The third-codon position is the least functionally constrained; indeed, with particular combinations of first- and second-codon position sequences, the third-codon position may be 4-fold degenerate, in which case any base at the third-codon position will code for the same amino acid.

© 2006 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Thus, the approach we adopt assumes that there are differences in the evolutionary patterns between the 3 CPs and that all sites at the same position within different codons evolve according to the same evolutionary pattern. Therefore, this investigation uses nucleotide-based models that can estimate model parameters independently for each of the 3 CPs in a multiple alignment. Previous investigations that have used large data sets have not investigated how a wide range of estimates of model parameters may vary between CPs.
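The statements above about constraint at each CP can be checked by enumerating single-nucleotide changes in the standard genetic code. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, only the standard (universal) code is considered, and changes to or from stop codons are excluded by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}


def count_changes(position):
    """Tally single-nucleotide changes at one codon position (0, 1, or 2).

    Only changes between sense codons are counted (ordered pairs), split by
    transition/transversion and by synonymous/nonsynonymous effect.
    """
    counts = {(k, e): 0 for k in ("ts", "tv") for e in ("syn", "nonsyn")}
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            if CODE[mutant] == "*":
                continue
            same_class = (codon[position] in PURINES) == (base in PURINES)
            kind = "ts" if same_class else "tv"
            effect = "syn" if CODE[mutant] == aa else "nonsyn"
            counts[(kind, effect)] += 1
    return counts


def fourfold_codons():
    """Sense codons whose third position can hold any base without changing the amino acid."""
    return [c for c, aa in CODE.items()
            if aa != "*" and all(CODE[c[:2] + b] == aa for b in BASES)]


for pos in range(3):
    print("codon position", pos + 1, count_changes(pos))
print("4-fold degenerate codons:", len(fourfold_codons()))
```

Under these assumptions, no change at the second-codon position is synonymous, whereas most changes at the third-codon position are. Codon usage and alternative genetic codes, which would change the weighting, are ignored here.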
Conversely, studies that have used wider ranges of models have used very small data sets, which may be prone to poorly defined biases. The models we use are similar to the CP models used by Shapiro et al. (2006). Yang (1996) has shown that estimates of certain model parameters, such as evolutionary rates, rate heterogeneity, and ts:tv biases, may vary significantly between different CPs for genes in a small mitochondrial data set. Kumar (1996) demonstrated that rate heterogeneity, ts:tv biases, and nucleotide frequencies may differ significantly between CPs in a small data set of mitochondrial genes. Huelsenbeck and Nielsen (1999) demonstrated that the ts:tv bias may differ significantly between CPs on a small data set of protein-coding genes. Shapiro et al. (2006) have shown that models accounting for differences between the 3 CPs explain the evolution of 283 protein-encoding multiple alignment data sets from yeast and RNA virus genes better than models assuming a homogeneous evolutionary pattern. We do not expect CP models to fully explain protein-coding sequence evolution because variation in selection between codons is not accounted for, but they do provide a more reasonable model than homogeneous nucleotide evolutionary models. Furthermore, unlike protein domains, which have different positions in different data sets, identification of different CP categories in different multiple alignments is trivial. Although modeling CDS evolution using codons, not nucleotides, as the smallest unit of sequence evolution might be beneficial, there are reasons why we do not pursue this approach. Published codon models, such as those of Goldman and Yang (1994), Muse and Gaut (1994), and Yang et al. (2000), cannot readily be used to estimate the differences in estimates of certain parameters at the different CPs: certain parameters are not estimated independently at the different CPs.
Additionally, codon models require more computation, which may be prohibitive for such a large data set as is used here (see Ren et al. 2005). The CP models that we use capture some of the complexity of codon sequence evolution and can be implemented with little more computational effort than standard homogeneous nucleotide models. Although it is known that CPs evolve differently, the nature of these differences has not been fully characterized with sizeable data sets. In addition, some biological factors have not previously been investigated for variation between CPs. This study uses large amounts of data and a wider range of models than any previous study of its kind, to characterize differences in the evolutionary properties at the different CPs. We investigate the differences in evolutionary rates, heterogeneity of evolutionary rates, ts:tv biases (denoted by κ), and nucleotide frequencies (denoted by π) between the CPs. Results are discussed in terms of the functional constraints of the genetic code and the effects that mutations have on coding sequences. We discuss how the results demonstrate uses for CP models, how parameter estimates may be useful in Bayesian studies, and how simple models could be used to identify CDSs in multiple alignments.

Materials and Methods

Models

Following standard procedures in probabilistic modeling of DNA sequence evolution (see, e.g., Felsenstein 2004), we assume that sites evolve independently, each according to a Markov process: the probability of a site changing to any other given state (i.e., nucleotide) in any small time interval depends only on its current state and not on previous states (the process is memoryless). The evolutionary process is assumed to be time homogeneous, reversible, and at equilibrium, as is standard practice within the field of maximum likelihood phylogenetics (Felsenstein 2004).
A range of evolutionary models that explicitly describe the process of substitution between nucleotide states was applied to the data. Some models include allowance for 3 categories of sites, where each category contains all the sites of a specific CP. Here, the process of evolution follows an identical distribution across all sites in the same category: models with multiple-site categories have common estimates for model parameters within each category, but parameter estimates may differ between categories. Thus, specific model parameters may be estimated independently for different categories of sites in a sequence alignment. We used maximum likelihood to fit free parameters to the data sets studied. Parameter value estimates were examined to observe differences between site categories; statistical tests (see below) determined when these differences were significant. The HKY model of DNA substitution (Hasegawa et al. 1985) is the basic model from which our more complex CP models are developed. Under the HKY model, the probability of a nucleotide changing to a different given nucleotide depends on the frequency of the target nucleotide and whether or not the mutation is a transition or a transversion. The HKY model is complex enough to describe the features that we wish to model but simple enough to produce results that can be easily interpreted. Rate heterogeneity across sites is usually modeled using a distribution of rates, commonly a discretized γ-distribution whose shape is governed by a single parameter, α (Yang 1993, 1994b). The rate at each site is taken to be a random draw from this distribution. When α is low then there is extreme variation in the rate at which different positions in the alignment have evolved, with most sites having evolved at very low rates and relatively few having evolved at much higher rates.
As α increases, the rate variation decreases and the distribution of evolutionary rates becomes bell shaped, and as α tends to infinity, all sites tend toward evolving at the same rate. In the more complex models used in this investigation, we investigated evolutionary rates (models denoted +R), rate heterogeneity using a γ-distribution (denoted +G), ts:tv biases (+T), and nucleotide frequencies (+N) in various combinations. Where these estimates were allowed to be independent for each category of CPs, models are denoted +3R, +3G, etc. Thus, for example, model HKY + 3R is based on the HKY model and estimates only the evolutionary rates independently for each of the 3 categories of CPs. The most complex model used estimates the evolutionary rates, αs, κs, and πs independently for the 3 CP categories; we write this model as HKY + 3R + 3N + 3T + 3G. The models used in this investigation are presented in table 1. All models were applied to the data using the program BASEML in the PAML suite of programs (Yang 1997).

Table 1
Models and Statistical Tests Used and the Percentage of PANDIT Families that Are Significant for Various Tests

Test   Null (A) and Alternate (B) Models                             Degrees of Freedom   Percentage of PANDIT Families with Alternate Model Significantly Better
T-1    A: HKY + G; B: HKY + 3R + G                                   2                    97
T-2    A: HKY + 3R + 3N + 3T + G; B: HKY + 3R + 3N + 3T + 3G         2                    30
T-3    A: HKY + 3R + 3N + 3G; B: HKY + 3R + 3N + 3T + 3G             2                    75
T-4    A: HKY + 3R + 3T + 3G; B: HKY + 3R + 3N + 3T + 3G             6                    95

NOTE.—Notation is such that, for example, for the model HKY + 3R + 3N + 3T + 3G, separate rates (R), nucleotide frequencies (N), ts:tv biases (T), and rate heterogeneities (G) were estimated for each category (+3) of CP across the sites in a multiple alignment. A simple “+G,” as in the model HKY + G, means that a single γ-distribution is applied across all sites to account for heterogeneity in evolutionary rates.
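The HKY-based CP models described in this section can be illustrated with a short sketch that builds one HKY rate matrix per CP and converts it to substitution probabilities along a branch. This is an illustration only and not BASEML's implementation: the function names, the normalization of each rate matrix to one expected substitution per unit time, and the way the relative CP rate multiplies the branch length are assumptions. The parameter values are the HBV estimates for categories 1 to 3 from table 2, with frequencies renormalized to remove rounding error. Rate heterogeneity within a CP (+G) is omitted.

```python
NUCS = "TCAG"
PURINES = {"A", "G"}


def hky_rate_matrix(kappa, pi):
    """HKY rate matrix: rate x -> y is kappa * pi[y] for transitions, pi[y] for transversions.

    The matrix is scaled so that the expected substitution rate at equilibrium is 1.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(NUCS):
        for j, y in enumerate(NUCS):
            if i != j:
                transition = (x in PURINES) == (y in PURINES)
                q[i][j] = (kappa if transition else 1.0) * pi[y]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this is minus the off-diagonal sum
    mean_rate = -sum(pi[x] * q[i][i] for i, x in enumerate(NUCS))
    return [[v / mean_rate for v in row] for row in q]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t): truncated Taylor series on Q t / 2**squarings, then repeated squaring."""
    step = t / 2 ** squarings
    p = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in p]
    for n in range(1, terms + 1):
        term = mat_mul(term, [[v * step / n for v in row] for row in q])
        p = [[p[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p


def normalize(freqs):
    total = sum(freqs.values())
    return {b: f / total for b, f in freqs.items()}


# HBV estimates for CP categories 1-3 (table 2); frequencies renormalized for rounding.
CP_PARAMS = {
    1: {"rate": 1.00, "kappa": 1.56, "pi": normalize({"A": 0.25, "C": 0.24, "G": 0.25, "T": 0.26})},
    2: {"rate": 0.33, "kappa": 1.97, "pi": normalize({"A": 0.27, "C": 0.25, "G": 0.18, "T": 0.29})},
    3: {"rate": 3.92, "kappa": 4.34, "pi": normalize({"A": 0.24, "C": 0.20, "G": 0.21, "T": 0.35})},
}


def cp_transition_probs(cp, branch_length):
    """Substitution probabilities for one CP category (assumed scaling: rate * branch length)."""
    params = CP_PARAMS[cp]
    q = hky_rate_matrix(params["kappa"], params["pi"])
    return transition_probs(q, params["rate"] * branch_length)


for cp in (1, 2, 3):
    p = cp_transition_probs(cp, 0.1)
    print("CP", cp, "probability of no change (T, C, A, G):", [round(p[i][i], 4) for i in range(4)])
```

A homogeneous model corresponds to using one parameter set for all three CPs; the CP models differ only in which of the rate, κ, and π entries are allowed to vary between categories.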
Statistical Tests

We can compare how well different models explain the evolution of the same data set by comparing the maximum likelihood values of the models (Goldman 1993; Yang et al. 1994; Whelan et al. 2001). All tests used in this paper compare null and alternate models where the null hypothesis model is a special case of the alternate hypothesis model, that is, with some of its parameters fixed to certain values (the simpler null hypothesis model is “nested” in the more complex alternate hypothesis model). In this situation, the model comparison can be performed using a likelihood ratio test (LRT), where the distribution of twice the difference in maximum log likelihoods of the models given the data (the LRT statistic) is χ² if the null hypothesis model is correct (Yang et al. 1994; Whelan and Goldman 1999). The χ² distribution has its number of degrees of freedom equal to the difference in numbers of free parameters between the models tested. Where the LRT statistic exceeds the 95% point of the χ² distribution, we consider the alternate hypothesis to be a significantly better explanation of the evolution of the sequences studied. The tests that were constructed from the models applied to the data are presented in table 1. Test T-1 (HKY + G vs. HKY + 3R + G, i.e., models differing by the +3R component) is a test of whether making an allowance for different evolutionary rates at each of the 3 CPs is significant. T-2 tests whether there are significant differences between CPs in the levels of among-site rate heterogeneity (+G cf. +3G). T-3 and T-4 test for significant differences in ts:tv biases and nucleotide frequencies at different CPs, respectively. Limits to the models that can be applied using BASEML affected how some of the tests were devised. For example, BASEML does not permit testing for differences in nucleotide frequencies or ts:tv biases between CPs without also estimating differences in evolutionary rates between CPs.
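The LRT procedure described above can be sketched in a few lines. This is a minimal illustration and not the code used in the study: the function names are hypothetical, the χ² tail probability is computed from the standard series and continued-fraction expansions of the incomplete gamma function, and the log-likelihood values in the usage lines are invented. The helper `extra_parameters` encodes the assumption that a parameter made CP specific adds its free parameters once for each category beyond the first.

```python
import math


def chi2_sf(x, df, eps=1e-14, max_iter=10000):
    """Upper tail probability of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Series for the lower regularized incomplete gamma function P(a, z).
        term = total = 1.0 / a
        n = a
        for _ in range(max_iter):
            n += 1.0
            term *= z / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz) for the upper function Q(a, z).
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return math.exp(log_prefactor) * h


def extra_parameters(free_per_category, n_categories=3):
    """Additional free parameters when a parameter set becomes category specific (assumption)."""
    return free_per_category * (n_categories - 1)


def likelihood_ratio_test(lnl_null, lnl_alt, df, level=0.05):
    """Return (LRT statistic, p-value, significant?) for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat < 0:
        raise ValueError("negative LRT statistic: at least one optimization probably failed")
    p_value = chi2_sf(stat, df)
    return stat, p_value, p_value < level


# Invented log likelihoods, for illustration only.
df_t4 = extra_parameters(3)  # 3 free nucleotide frequencies per category
print(likelihood_ratio_test(-12345.6, -12330.1, df_t4))
```

With 3 CP categories, `extra_parameters(1)` and `extra_parameters(3)` give 2 and 6, matching the degrees of freedom listed in table 1.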
Furthermore, it is not possible to test for differences in evolutionary rates between CPs while simultaneously accounting for possible differences between nucleotide frequencies and ts:tv biases at the different CPs.

Data

We used the PANDIT database (release 17), which contains 7,738 multiple alignments of protein-coding DNA sequences, each with a phylogenetic tree (Whelan et al. 2006). The alignments and tree topologies were used as provided and branch lengths were reestimated for each analysis. Although there may be some errors in the alignments and phylogenetic trees, note that using reasonable estimates of phylogenetic trees should give reasonable parameter estimates (Yang 1994a, 1994b; Yang et al. 1994, 1998; Yang, Goldman, et al. 1995; Sullivan et al. 1996; Adachi et al. 2000). We assume that this is also true for multiple alignments. These alignments represent a considerably greater number of data sets than has previously been studied in this manner, spanning a broader range of genes and species.

Results

Successful Optimizations

The program BASEML performs parameter estimation by likelihood maximization, and the algorithm sometimes fails to find the optimal set of parameter values for a data set. Results for any given PANDIT family were retained only when all models shown in table 1 optimized successfully for that family. Negative values for the test statistics of nested hypotheses are indicative of failed optimizations because such values are theoretically impossible. If one test fails for a given family, we consider it more likely that optimization is challenging for other models, and the best guarantee of accuracy is to discard the results for the entire family. Applying this criterion, optimizations were successful for 7,158 of the 7,738 alignments (93%). The families where optimizations failed tended to have fewer sequences in the alignment (median 106 sequences per family for families where optimization failed vs.
309 sequences per family for successful optimizations), which suggests that small amounts of data may make optimization of parameter estimates more challenging. The percentages of the 7,158 successfully optimized families for which results for each test were significant are presented in table 1.

FIG. 1.—Distributions of evolutionary rates at second- and third-codon positions (relative to rate 1 for CP 1). Note the logarithmic scaling on the y-axis. [x-axis: evolutionary rate; y-axis: number of families; series: codon position 2, codon position 3.]

CP Differences

Test T-1 investigates the effect of making an allowance for a difference in evolutionary rates at different CPs. We find that evolutionary rates vary significantly between CPs in 97% of families. Figure 1 shows the estimates of evolutionary rates at the second- and third-codon positions, relative to the estimated evolutionary rate at the first-codon position, which is arbitrarily set equal to 1 (results only reported when test T-1 was significant). Second-codon positions tend to evolve more slowly than first-codon positions, which in turn tend to evolve more slowly than third-codon positions. The width of the distributions is also worth noting; the rates of the third-codon positions of the different PANDIT families have a broader distribution than the rates of the second-codon positions. Presumably, these results reflect differences in the functional constraints that the genetic code places on the different CPs. The results for test T-2 indicate that heterogeneity in evolutionary rates varies significantly between CPs in 30% of families. This is the weakest effect studied, but still affects a large number of families.
The low percentage may be because there is relatively little such variation, but is probably because a single γ-distribution applied across all 3 CPs can already model much of the rate variation between CPs as well as within each CP category. Distributions of estimates of the parameter α at the different CPs are shown in figure 2 (where test T-2 was significant). The second-codon position shows the most rate heterogeneity (lowest value of α). Third-codon positions have the least rate heterogeneity; such sites tend to evolve at more similar rates to each other than sites at other CPs do. It is interesting that most estimated values of α exceed 1, which means that rate heterogeneity within CP categories is not strong. Indeed, the many high α values at third-codon positions, although shown only when test T-2 was significant, suggest that it might be interesting in future to consider models that allow rate heterogeneity at only the first- and second-codon positions. The results for test T-3 indicate that ts:tv biases vary significantly between CPs in 75% of families. First- and second-codon positions have virtually identical distributions of estimates of κ across families in the PANDIT database. The third-codon position tends to have a much higher ts:tv bias than the first- or second-codon position. Some values for the estimates of κ are very high for the third-codon position, which may occur when the evolutionary distance between sequences is small and almost all changes inferred are transitions; these estimates will have a very high variance. The distributions of estimates of κ at the 3 CPs inferred using model HKY + 3R + 3N + 3T + G are shown in figure 3 for each PANDIT family where test T-3 was significant. The results from T-4 indicate that nucleotide frequencies vary significantly between CPs in 95% of families. The distributions of the frequencies of the four nucleotides at each CP are shown in figure 4A–C.
CP 2 has a bias against nucleotides G and C, which are more mutagenic than nucleotides A and T (Costantini et al. 2006). CP 1 often has a bias toward purines, and CP 3 has a slight bias toward pyrimidines.

Consistency of Results Under Different Models

Different model choices are available for testing the between-codon position variation of evolutionary parameters; for example, differences in ts:tv biases (test T-3) could instead have been studied by comparing models HKY + 3R and HKY + 3R + 3T. We have generally presented results from the most complex models appropriate for the investigation of each feature. Because more complex models were preferred for most data sets, these are generally expected to give superior parameter estimates. The use of less complex models made little difference to the number of families that were significant when testing for differences between CPs for all parameters except for rate heterogeneity (results not shown). Other versions of T-2 using alternate hypothesis models less complex than HKY + 3R + 3N + 3T + 3G led to considerable increases in the percentage of families that gave significant test results when considering differences in rate heterogeneity between CPs. Failure to account for other differences (in evolutionary rate, ts:tv bias, etc.) between CPs presumably causes an increase in the number of families where rate heterogeneity was significantly different between the CPs due to interactions between parameters: the “+3G” model of CP differences in rate heterogeneity can generate likelihood improvements in the absence of this form of variation when other genuine effects are not explicitly modeled. Thus, our results are clearer when considering the models and tests that we have presented.

FIG. 2.—Distributions of rate heterogeneity parameters α at different CPs.

FIG. 3.—Distributions of transition:transversion biases at different CPs.
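The nucleotide frequencies reported in this section are maximum likelihood estimates from BASEML. A rough empirical analogue, obtained simply by counting bases at each CP of an in-frame alignment, can be sketched as follows. The function names and the toy alignment are hypothetical, the sequences are assumed to start at a first-codon position, and gaps or ambiguous characters are skipped.

```python
from collections import Counter


def split_by_codon_position(seq):
    """Return the first-, second-, and third-codon-position sites of an in-frame sequence."""
    return [seq[i::3] for i in range(3)]


def nucleotide_frequencies(alignment):
    """Observed base frequencies at each CP, pooled over all sequences in the alignment."""
    freqs = []
    for pos in range(3):
        counts = Counter()
        for seq in alignment:
            sites = split_by_codon_position(seq.upper())[pos]
            counts.update(b for b in sites if b in "ACGT")
        total = sum(counts.values())
        freqs.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return freqs


def purine_fraction(freq):
    """Combined frequency of A and G."""
    return freq["A"] + freq["G"]


# Toy alignment, invented for illustration only.
toy_alignment = ["ATGGCC", "ATGGCT"]
for cp, freq in enumerate(nucleotide_frequencies(toy_alignment), start=1):
    print("CP", cp, freq, "purines:", purine_fraction(freq))
```

Counts of this kind ignore the phylogeny and so are not equivalent to the model-based estimates of π, but they show how trivially sites can be assigned to CP categories.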
FIG. 4.—Distributions of nucleotide frequencies at the (A) first-, (B) second-, and (C) third-codon position. [Each panel: x-axis, nucleotide frequency (0 to 0.8); y-axis, number of families; series: A, C, G, T.]

Discussion

Of the parameters studied in this paper, the evolutionary rate and nucleotide frequencies differ significantly most often between CPs. Alongside the ts:tv biases, modeling the differences in these factors between CPs is very important in our modeling approaches. Rate heterogeneity is the weakest of the factors tested, with the fewest families showing significant differences between the CPs. However, 30% of 7,158 data sets represents enough cases to suggest it is important to test for rate heterogeneity differences between CPs, even though models permitting these differences may be found unnecessary for a majority of studies.

Table 2
Parameter Estimates for the HBV Data Set

CP Category   Rate          α             κ             π
1             1             0.27 ± 0.07   1.56 ± 0.28   A:0.25, C:0.24, G:0.25, T:0.26
2             0.33 ± 0.07   0.05 ± 0.00   1.97 ± 0.57   A:0.27, C:0.25, G:0.18, T:0.29
3             3.92 ± 0.52   1.02 ± 0.18   4.34 ± 0.45   A:0.24, C:0.20, G:0.21, T:0.35
4 (1 + 3)     1.87 ± 0.28   0.33 ± 0.06   3.18 ± 0.45   A:0.21, C:0.30, G:0.24, T:0.26
5 (1 + 2)     0.66 ± 0.12   0.14 ± 0.05   1.73 ± 0.39   A:0.21, C:0.31, G:0.23, T:0.26
6 (2 + 3)     0.94 ± 0.16   0.21 ± 0.05   2.92 ± 0.55   A:0.18, C:0.32, G:0.21, T:0.29

NOTE.—For category rate, α, and κ, standard errors of parameter estimates are shown after each point estimate.
The estimates of model parameters at the different CPs, and their distributions over many data sets, can be related to our knowledge of the functional constraints of the genetic code and the probability of different mutations causing synonymous and nonsynonymous changes to the coding sequence. Indeed, the parameter estimates (including the fact that ts:tv biases are present) support the hypothesis that the genetic code is adaptive (Freeland and Hurst 1998; Freeland et al. 2000). The functional constraints that result from the genetic code may explain the observed variation in evolutionary rates at the different CPs. Similarly, the most extreme rate heterogeneity at CP 2, which means that the majority of second-codon positions evolve slowly and a small proportion evolve more rapidly, may be explained by the strong functional constraint on the majority of such sites. More rapidly evolving sites may be under less purifying selection than average. Interestingly, first- and second-codon positions have virtually identical distributions of estimates of κ, which may be because the proportion of transitions and transversions that cause synonymous or nonsynonymous changes in the amino acid encoded by a codon is virtually identical for first- and second-codon positions (Rambaut A, personal communication). The third-codon position tends to have a much higher ts:tv bias than the first- or second-codon position, and this is likely to be because nearly all transitions at the third-codon position are synonymous, whereas a much larger proportion of third-position transversions are nonsynonymous. Thus, transversions are disproportionately selected against at the third-codon position, elevating the ts:tv bias. We therefore expect that values of κ estimated for third-codon positions will overestimate underlying mutation bias toward transitions, which may be better described by the lower values found at first- and second-codon positions. The results regarding κ (fig.
3) might suggest that first- and second-codon positions could be grouped together in a single category in future investigations. Although this might be appropriate for the κ parameter, figures 1, 2, and 4A and B suggest that significant differences in rates, rate heterogeneity, and nucleotide frequencies between CPs 1 and 2 are common. The bias against nucleotides G and C that is observed for CP 2 can be viewed in terms of the evolutionary constraints for the preservation of codon function at the second-codon position and selection against rapid change. G and C are more mutagenic than nucleotides A and T (Costantini et al. 2006), and reducing the GC frequency at CP 2 presumably helps to reduce the mutation rate. It is interesting to note that CP 1 has a purine bias (A and G) and CP 3 has a slight pyrimidine bias (T and C). The biological reasons for this are not entirely clear but nucleotide size may affect mRNA properties and transcription or translation efficiency. In an extension to the study described above, a data set of overlapping reading frames was analyzed in much the same way as the PANDIT families. There are 6 categories of sites in an overlapping reading frame data set (3 categories of sites correspond to the 3 CPs in nonoverlapping regions, and 3 categories of sites correspond to overlaps between CPs 1 and 2, 1 and 3, and 2 and 3, respectively). Yang, Lauder, et al. (1995) published a hepatitis B virus (HBV) data set of 13 aligned sequences and analyzed only the rate differences between the site categories. In order to achieve statistical significance when comparing models, they concatenated all of the HBV genes in the multiple alignments into a single meta-data set with 6 site categories. We analyzed this data set in the same way, using the greater variety of models as presented in table 1.
After adjusting the degrees of freedom in the tests in table 1 for the greater number of site categories, each of tests T-1 to T-4 is significant for the concatenated HBV data set; parameter estimates are presented in table 2. Notice that the evolutionary rates, rate heterogeneities, and ts:tv biases of the overlapping reading frame positions (categories 4–6 in table 2) are intermediate between the values of their component separate CPs. This is in contrast to the naïve expectation, with respect to evolutionary rates, that “considering the double roles performed by sites in these [overlapping] classes, we should expect these 3 rate parameters to be less than 1.” (Yang, Lauder, et al. 1995, p. 591), where “1” refers to the rate of the first-codon position against which the rates for the other site categories are compared. Yang, Lauder, et al. (1995) interpreted this result, again for rates only, as a consequence of concatenating the separate genes. We consider these results to be a reflection of the evolutionary processes occurring in the data set and not a modeling artifact. Because the order of evolutionary rates at the different CPs (3 > 1 > 2) is not the same as the order of ts:tv biases (3 > 1 ≈ 2) for the PANDIT studies, the intermediate values of such parameters in the HBV data set are unlikely to be a consequence of interactions in the rate parameters of different genes. Overlapping reading frames may evolve in the least constrained parts of the HBV genome (Rambaut A, personal communication), allowing lower levels of constraint than we might expect. This may lead to intermediate evolutionary parameters for estimates of rate, ts:tv biases, and rate heterogeneity. Increased evolutionary constraint is not a necessary consequence of overlapping reading frame positions; indeed, more rapid evolution may help the virus to evade the host immune system.
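The observation that the overlapping categories take intermediate values can be checked directly against the point estimates in table 2. The sketch below is only an arithmetic check: the data structure and function name are hypothetical, and the standard errors reported in table 2 are ignored.

```python
# Point estimates from table 2 (standard errors omitted).
TABLE2 = {
    1: {"rate": 1.00, "alpha": 0.27, "kappa": 1.56},
    2: {"rate": 0.33, "alpha": 0.05, "kappa": 1.97},
    3: {"rate": 3.92, "alpha": 1.02, "kappa": 4.34},
    4: {"rate": 1.87, "alpha": 0.33, "kappa": 3.18},
    5: {"rate": 0.66, "alpha": 0.14, "kappa": 1.73},
    6: {"rate": 0.94, "alpha": 0.21, "kappa": 2.92},
}
# Overlapping categories and the two CPs that each one combines.
OVERLAPS = {4: (1, 3), 5: (1, 2), 6: (2, 3)}


def is_intermediate(category, parameter):
    """True if the overlap estimate lies between the estimates of its two component CPs."""
    first, second = OVERLAPS[category]
    low, high = sorted((TABLE2[first][parameter], TABLE2[second][parameter]))
    return low <= TABLE2[category][parameter] <= high


for category in OVERLAPS:
    for parameter in ("rate", "alpha", "kappa"):
        print(category, parameter, is_intermediate(category, parameter))
```

All nine comparisons hold for the point estimates, which agrees with the description above; whether each difference is statistically meaningful would require taking the standard errors into account.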
Care should be taken in extrapolating our results to other overlapping reading frames as only a single data set has been used.

Conclusions

There are clear systematic differences in the evolutionary patterns of the 3 CPs of protein-coding DNA. This variation is not accounted for when we use simple nucleotide models of evolution that do not explicitly incorporate CP effects. Despite this, overly simplistic models are used in many current studies. The effects of using overly simplistic models of evolution will vary from data set to data set. Firstly, simple models tend to underestimate the evolutionary distance between sequences by inadequately estimating the number of multiple substitutions that have occurred at any given site (e.g., Gojobori et al. 1982; Yang et al. 1994). Incorrect models may misestimate the phylogeny (e.g., Philippe and Germot 2000; Phillips et al. 2004) or inflate the confidence in any given maximum likelihood topology (Yang, Goldman, et al. 1995). Additionally, parameter estimates of simplistic models may not have biological validity; they are biased averages of the parameter estimates of different categories of sites and may also be confounded by other factors that have not been modeled explicitly. Thus, in agreement with Shapiro et al. (2006), CP models provide better explanations of the evolution of codon data sets for the vast majority of such multiple alignments, compared with simple models. It may in future be possible to relate differences in parameter estimates between CPs to different gene functions (annotation that is not yet available in PANDIT), the degree of purifying selection (ascertained using codon models: Yang and Bielawski 2000), or both (see Aris-Brosou 2005). We have not yet attempted these analyses. Our findings are relevant to the design of gene-finding algorithms that use multiple sequence alignments to identify candidate CDSs.
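As a rough illustration of the kind of period-3 signal that such an algorithm could exploit, the following sketch tallies nucleotide frequencies separately for each CP of an in-frame alignment. The function name and the toy alignment are hypothetical and are not data from this study; the sketch also assumes that the alignment starts at CP 1.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def cp_frequencies(alignment):
    """Nucleotide frequencies at codon positions 1-3 of an in-frame alignment.

    Gaps and ambiguity codes are ignored. The result maps each codon
    position to a dict of nucleotide -> frequency.
    """
    freqs = {}
    for cp in range(3):
        counts = Counter()
        for seq in alignment:
            # Every third character, starting at this codon position.
            counts.update(b for b in seq[cp::3].upper() if b in NUCLEOTIDES)
        total = sum(counts.values())
        freqs[cp + 1] = {
            n: (counts[n] / total if total else 0.0) for n in NUCLEOTIDES
        }
    return freqs


# Hypothetical toy alignment (not real data), assumed to begin at CP 1.
toy_alignment = ["ATGGCTAAA", "ATGGCAAAG", "ATGGTTAAA"]
freqs = cp_frequencies(toy_alignment)
```

Applying the same tally in each of the three possible frames of an unannotated alignment, and comparing the resulting profiles, is one simple way in which CP-specific composition could contribute evidence about the reading frame.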
Because the evolutionary rate, rate heterogeneity, ts:tv biases, and nucleotide frequencies may all differ between CPs, a model incorporating allowances for such periodic variation in evolutionary patterns may add power in the identification of CDSs. Even the most complex current methods (e.g., McAuliffe et al. 2004) do not take advantage of this evolutionary information. Our research also suggests, in agreement with Shapiro et al. (2006), that models accounting for differences in evolutionary patterns across CPs should be used to model CDS evolution more accurately, instead of the homogeneous models that are currently widespread. Additionally, parameter distributions obtained in this investigation can be used as appropriate prior distributions in Bayesian studies. This investigation has examined many large codon sequence multiple alignments and described in detail the variation in evolutionary properties of the CPs. The quantification of parameter values has not previously been studied as comprehensively. The evolutionary patterns detected are consistent with our knowledge of functional constraints of the genetic code. Furthermore, an increase in evolutionary constraint is not an inevitable consequence of overlapping reading frames, although we caution against extrapolating these results to all overlapping reading frames. The types of models discussed can be implemented using existing software, such as PAML (Yang 1997); they should be used in preference to homogeneous models to reflect the biology of protein-coding sequences better and to improve our inferences regarding protein sequence evolution. This research provides insights into our developing understanding of the effects of selection on sequence evolution.

Acknowledgments

L.B. was supported by a Wellcome Trust Prize Studentship and the European Molecular Biology Laboratory and was a member of Darwin College, University of Cambridge. N.G. was supported by the Wellcome Trust.
We thank Adrian Friday, Andrew Rambaut, Alexei Drummond, and an anonymous reviewer for helpful suggestions. Funding for the Open Access publication charges was provided by the Wellcome Trust.

Literature Cited

Adachi J, Waddell PJ, Martin W, Hasegawa M. 2000. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 42:459–468.
Aris-Brosou S. 2005. Determinants of adaptive evolution at the molecular level: the extended complexity hypothesis. Mol Biol Evol. 22:200–209.
Costantini M, Clay O, Auletta F, Bernardi G. 2006. An isochore map of human chromosomes. Genome Res. 16:536–541.
Felsenstein J. 2004. Inferring phylogenies. Sunderland (MA): Sinauer Associates.
Freeland S, Hurst L. 1998. The genetic code is one in a million. J Mol Evol. 47:238–248.
Freeland S, Knight R, Landweber L, Hurst L. 2000. Early fixation of an optimal genetic code. Mol Biol Evol. 17:511–518.
Gojobori T, Ishii K, Nei M. 1982. Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J Mol Evol. 18:414–422.
Goldman N. 1993. Statistical tests of models of DNA substitution. J Mol Evol. 37:650–661.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 11:725–736.
Hasegawa M, Kishino H, Yano T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 22:160–174.
Huelsenbeck J, Nielsen R. 1999. Variation in the pattern of nucleotide substitution across sites. J Mol Evol. 48:86–93.
Kumar S. 1996. Patterns of nucleotide substitution in mitochondrial protein coding genes of vertebrates. Genetics. 143:537–548.
McAuliffe JD, Pachter L, Jordan MI. 2004. Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. Bioinformatics. 20:1850–1860.
Muse S, Gaut B. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with applications to the chloroplast genome. Mol Biol Evol. 11:715–724.
Philippe H, Germot A. 2000. Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Mol Biol Evol. 17:830–834.
Phillips M, Delsuc F, Penny D. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 21:1455–1458.
Posada D, Crandall K. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics. 14:817–818.
Ren F, Tanaka H, Yang Z. 2005. An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol. 54:808–818.
Shapiro B, Rambaut A, Drummond A. 2006. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol Biol Evol. 23:7–9.
Sullivan J, Holsinger K, Simon C. 1996. The effect of topology on estimates of among-site variation. J Mol Evol. 42:308–312.
The International HapMap Consortium. 2003. The international HapMap project. Nature. 426:789–796.
Whelan S, de Bakker PIW, Quevillon E, Rodriguez N, Goldman N. 2006. PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. 34:D327–D331.
Whelan S, Goldman N. 1999. Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol Biol Evol. 16:1292–1299.
Whelan S, Goldman N. 2004. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics. 167:2027–2043.
Whelan S, Liò P, Goldman N. 2001. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. 17:261–272.
Yang Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol. 10:1396–1401.
Yang Z. 1994a. Estimating the pattern of nucleotide substitution. J Mol Evol. 39:105–111.
Yang Z. 1994b. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39:306–314.
Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J Mol Evol. 42:587–596.
Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556.
Yang Z, Bielawski JP. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 15:496–503.
Yang Z, Goldman N, Friday A. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol. 11:316–324.
Yang Z, Goldman N, Friday A. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst Biol. 44:384–399.
Yang Z, Lauder IJ, Lin HJ. 1995. Molecular evolution of the hepatitis B virus genome. J Mol Evol. 41:587–596.
Yang Z, Nielsen R, Goldman N, Pedersen A-M. 2000. Codon-substitution models for the heterogeneous selection pressure at amino acid sites. Genetics. 155:431–449.
Yang Z, Nielsen R, Hasegawa M. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol. 15:1600–1611.

Martin Embley, Associate Editor
Accepted November 15, 2006