Syst. Biol. 46(2):346-353, 1997

SITE-BY-SITE ESTIMATION OF THE RATE OF SUBSTITUTION AND THE CORRELATION OF RATES IN MITOCHONDRIAL DNA

RASMUS NIELSEN

Department of Integrative Biology, University of California, Berkeley, California 94720, USA; E-mail: [email protected]. berkeley.edu

Abstract.—A maximum likelihood method for independently estimating the relative rate of substitution at different nucleotide sites is presented. With this method, the evolution of DNA sequences can be analyzed without assuming a specific distribution of rates among sites. To investigate the pattern of correlation of rates among sites, the method was applied to a data set consisting of the protein-coding regions of the mitochondrial genome from 10 vertebrate species. Rates appear to be strongly correlated at distances up to 40 codons apart. Furthermore, there appears to be some higher order correlation of sites approximately 75 codons apart. The method of site-by-site estimation of the rate of substitution may also be applied to examine other aspects of rate variation along a DNA sequence and to assess the difference in the support of a tree along the sequence. [Correlation among sites; maximum likelihood; mtDNA; rate variation; site-by-site estimation.]

The number of data used in phylogenetic studies has increased dramatically in recent years. Commonly, several different genes are sequenced for the same set of species. For example, the mitochondrial genome has been fully sequenced in many vertebrate species (Anderson et al., 1981, 1982; Bibb et al., 1981; Roe et al., 1985; Gadaleta et al., 1989; Desjardins and Morais, 1990; Arnason et al., 1991; Arnason and Johnson, 1992; Tzeng et al., 1992; Chang et al., 1994). Such large data sets probably will become more common in the future. Therefore, new methods that take advantage of the estimates of topology and branch lengths obtained from large data sets should be developed.
In this paper, I describe a maximum likelihood method for independently estimating the relative rate of substitution at each site when such estimates have been obtained. The method was used to estimate the degree of correlation of rates in mitochondrial DNA (mtDNA).

A decade ago, most models of DNA sequence evolution (e.g., Jukes and Cantor, 1969; Kimura, 1980; Felsenstein, 1981; Hasegawa et al., 1985) assumed constancy of rates among sites. This assumption is clearly unrealistic. For example, the observation that coding sequences contain sites with predominantly replacement substitutions and sites with predominantly silent substitutions shows that a model assuming constancy in rate among sites is not satisfactory. Other causes of rate variation may include differences in the biological mutation rate along the sequence and varying degrees of selection in different regions or domains.

Several models of sequence evolution have recently been proposed that allow for among-site rate variation (Yang, 1993; Kelly and Rice, 1996). Particularly popular are models based on gamma-distributed rates (Uzzell and Corbin, 1971; Jin and Nei, 1990; Yang, 1993, 1994; see Yang, 1996, for a review). Yang (1995) devised a method for estimating the rate at each site and the degree of correlation of rates among sites assuming gamma-distributed rates and a stationary autocorrelation function where only adjacent sites are correlated. The aim of the present study was to devise an estimator of the rate at each site and the degree of correlation that does not rely on these assumptions. The disadvantage is that the topology and branch lengths of the underlying tree must be known prior to the analysis. The relative rates can then be estimated independently for each site. Because such estimates are independent, further analysis such as estimation of the autocorrelation function may be performed directly on the obtained estimates.
Estimates of the rate of substitution at individual sites have already been obtained in various cases by the parsimony method (see Horovitz and Meyer, 1995). This parsimony-based method may be reliable if all branches are short. However, when some branches are long, parsimony may provide unreliable estimates of the rate of substitution because several substitutions may occur on the same branch. In such cases, the maximum likelihood method should provide better estimates. In this paper, I explore the properties of the maximum likelihood estimator of the rate of substitution. The maximum likelihood method assumes, as does the parsimony method, that the underlying topology is known with great certainty. In addition, the maximum likelihood estimator requires that reasonable estimates of the branch lengths have been obtained.

ESTIMATING THE SUBSTITUTION RATE

For the maximum likelihood estimator of the relative substitution rate (u) at individual sites of a DNA sequence, u refers to the composite parameter $\lambda t$, where $\lambda$ is the rate of change at each site and t is the total tree length. Reliable estimates of u at each site cannot be obtained while estimating the topology and branch lengths of a tree. However, if the topology and relative branch lengths of a given tree are already known, an estimate can easily be obtained for each site by maximizing the likelihood of observing the data for different values of the substitution rate. That is, an estimate ($\hat{u}$) of the relative rate of substitution at a site is obtained by maximizing

$$L(u) = P(\text{site pattern} \mid u). \tag{1}$$

The likelihood for a given u can be calculated using standard methods (Felsenstein, 1981). For example, for three sequences

$$L(u) = \sum_{i=a,c,t,g} \pi_i \, P_{iB_1}(u, t_1) \, P_{iB_2}(u, t_2) \, P_{iB_3}(u, t_3), \tag{2}$$

where $\pi_i$ is the stationary frequency of base i and $P_{iB_1}(u, t_1)$ is the probability of observing base $B_1$ at time $t_1$ (the relative length of the particular branch) when base i is observed at time 0. The function $P_{ij}(u, t)$ is specified by the particular model of evolution assumed. Any desired substitution matrix can thereby in principle be used. The actual maximization procedure is trivial because, given the topology and relative branch lengths, only one parameter (u) is unknown.

FIGURE 1. The likelihood functions (L) of u for two different site patterns (a and b). An unrooted three-taxon tree with relative branch lengths 2, 1, and 1 was assumed. For site a with site pattern {G, A, A}, the likelihood function has a maximum at 0.5. However, for site b with site pattern {G, A, C}, the likelihood is a strictly increasing function of u. In that case, $\hat{u} = \infty$.

Clearly, estimates of $\hat{u} = 0$ will be obtained when the particular site considered is not variable. However, when sites are highly variable, estimates for which $\hat{u} = \infty$ may be obtained. To understand this phenomenon, consider a tree with three taxa. In such a tree, any site pattern {a, b, c}, where $a \neq b$, $a \neq c$, and $b \neq c$, will be more likely to occur as the substitution rate increases. Therefore, the estimate of the relative rate goes to infinity for sites with such patterns (Fig. 1). Although the probability of observing site patterns with this property in examples with moderate or low substitution rates decreases rapidly as more taxa are included, these site patterns may still occur. If they are ignored in the analysis, the estimator $\hat{u}$ will be biased toward smaller values. There may be three solutions to this problem: (1) sites with estimates of $\hat{u} = \infty$ may simply be excluded, (2) an arbitrary maximum value of u may be predefined, and (3) a modified likelihood approach may be applied. The aim of the modified likelihood approach is in this case to force the likelihood function to be centered around smaller values, which can be done by using any decreasing function as a prior distribution of the rate of substitution. The modified likelihood function becomes

$$L(u_m) = P(\text{site pattern} \mid u) \, P(u), \tag{3}$$

which can be maximized for u. The function P(u) can be specified by, for example, a gamma or exponential distribution. Using this approach is not equivalent to forcing the distribution to follow some predefined function. The distribution of rates obtained by this approach may be very different from the prior distribution. However, this approach solves the problem of obtaining estimates of $\hat{u} = \infty$.

SIMULATIONS

Simulations were performed to evaluate the performance of $\hat{u}$. These simulations were performed using the topology and branch lengths estimated from a data set consisting of the mitochondrial genome from 10 different vertebrates. This data set, and the topology of the tree, was discussed in more detail by Cummings et al. (1995). One of the goals of the simulations was to compare the maximum likelihood estimator with the commonly applied parsimony estimator. Because the parsimony method cannot easily take mutational biases or unequal base frequencies into account, the Jukes and Cantor (1969) model was chosen for a fair comparison. The evolution in one site was simulated under a given mutation rate, and estimates using the two different estimators ($\hat{u}$ and the parsimony estimator) were obtained. For each value of the true substitution rate (u), 10,000 replicates were generated. These 10,000 replicates were then used to estimate the expectation of the estimators under a given value of the true mutation rate. Observations for which $\hat{u} = \infty$ were ignored.

FIGURE 2. The mean of the estimators $\hat{u}$ (Likelihood), $\hat{u}_m$ (Likelihood + prior), and $\hat{u}_{pars}$ (Parsimony) in 10,000 simulations under a Jukes and Cantor (1969) model of evolution, plotted against the true rate (u). The tree estimated from the mtDNA of 10 vertebrate species was assumed. Notice that u (the true relative substitution rate) is equivalent to the expected number of substitutions on the tree in one site.

Because site patterns for which $\hat{u} = \infty$ are ignored, $\hat{u}$ will be biased (Fig. 2). The downward bias increases as the rate increases simply because more sites have estimates of $\hat{u} = \infty$. The result is that $\hat{u}$ converges to a finite value (Fig. 2). This asymptotic behavior is expected for any estimator because there is only a finite number of possible different site patterns. The estimates of the rate must converge to a finite value equal to or less than max{$\hat{u}$}.

Another source of bias is small sample size (i.e., few taxa). This bias results in larger values. Figure 2 illustrates that this source of bias is important when u is small or moderate. To illustrate the effect of sample size, simulations were performed in which the number of taxa was varied. Taxa were added to the sample by simply joining together identical 10-taxon phylogenies at the root (Fig. 3). The bias is greatly reduced for larger sample sizes, suggesting that the estimator is consistent (i.e., the expectation of the estimate will converge to the true value as more taxa are added) and that the bias observed is a small sample size effect.

FIGURE 3. The bias in $\hat{u}$ as a function of the number of taxa for u = 6.0 under the Jukes and Cantor model of evolution. The bias is defined as the true value minus the expectation of the estimator.

The expectation of a modified likelihood estimator with a gamma distribution with mean 10.0 and shape parameter 0.5 as a prior distribution was also explored (Fig. 2). The use of such a prior distribution implies that the estimator will underestimate the true rate. However, the bias is not substantially larger than the bias for the normal likelihood approach. In the 10-taxon case, $\hat{u}$ increases approximately linearly with u except for extremely high values of u. Therefore, although $\hat{u}$ cannot be applied as an unbiased estimate in all cases, it may still be applied to distinguish between rates at different sites even when the sequences apparently are very saturated.

The maximum likelihood estimators $\hat{u}$ and $\hat{u}_m$ were compared with the commonly applied parsimony estimator ($\hat{u}_{pars}$). Simulation results for $\hat{u}_{pars}$ (Fig. 2) indicate that $\hat{u}_{pars}$ will strongly underestimate the true mutation rate. Furthermore, $\hat{u}_{pars}$ converges quickly to its asymptotic value and therefore does not behave approximately linearly for nearly as long as do $\hat{u}$ and $\hat{u}_m$.

FIGURE 4. The mean squared error (MSE) of the estimators $\hat{u}$ (Likelihood), $\hat{u}_m$ (w/ prior), and $\hat{u}_{pars}$ (Parsimony), plotted against the true rate (u), as determined by simulations of the Jukes and Cantor (1969) model of evolution on the 10-taxon tree; 10,000 simulations were performed. For $\hat{u}_m$, a gamma distribution with a mean of 10.0 and shape parameter equal to 0.5 was chosen as the prior distribution.

A standard method for comparing different estimators is by the mean squared error (MSE) of the estimators (see Rice, 1995). The MSEs of $\hat{u}$ and $\hat{u}_{pars}$ are compared in Figure 4. At low levels of divergence, the parsimony estimator performs better than or as well as the maximum likelihood estimators. The parsimony estimator performs well for the low levels of divergence because the probability of two or more substitutions occurring in the same lineage is very low. However, the methods differ substantially for high levels of divergence. In these cases, the maximum likelihood estimator ($\hat{u}$) performs substantially better than the parsimony estimator.
This result is also expected because the parsimony estimator of the rate of substitution by definition will perform poorly when the probability of several substitutions occurring on the same branch is high. The main difference between the two methods is that the maximum likelihood estimator can distinguish between different rates even when the overall level of divergence is very high, whereas the parsimony estimator cannot. Also, $\hat{u}_m$ (w/ prior in Fig. 4) has a much lower MSE over most of the parameter space than does $\hat{u}$. In fact, the use of a prior distribution has in this case a clear variance-reducing effect. It therefore appears that $\hat{u}_m$ should be preferred over $\hat{u}$ and $\hat{u}_{pars}$ in most cases.

CORRELATION OF RATES IN MTDNA

With this method, the correlation of rates among sites can be directly examined by estimating the autocorrelation function along the sequence. The autocorrelation function provides information about the degree to which the rate in one site can be predicted from the rate in other sites a certain distance away. If the autocorrelation function is estimated along the sequence without taking the position in the codon into account, a strong periodicity at distances of 3 bp will naturally occur in the autocorrelation function because of the nature of the genetic code. Instead, the autocorrelation for each codon position (first, second, third) should be examined separately. In this manner the correlation among codons can be explored.

The estimator $\hat{u}_m$ was used to analyze the protein-coding regions of the mitochondrial genome for 10 vertebrate taxa (Cummings et al., 1995). The maximum likelihood estimator is particularly appropriate for this use because it provides an estimate of the rate at each site and is able to deal with the very high rates of evolution encountered in vertebrate mtDNA.
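The per-site estimators defined by Eqs. 1 and 3 can be illustrated on the three-taxon example of Figure 1. The following is a minimal sketch, not the implementation used in the paper. It assumes a Jukes and Cantor model, scales u as the expected number of substitutions per site over the whole tree, and maximizes by a simple grid search; the function names, the grid bounds, and the rule that treats a maximum at the upper grid boundary as an infinite estimate are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, d):
    """JC69 probability of base j given base i after d expected substitutions."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def site_likelihood(u, pattern, rel_lengths):
    """Eq. 1 on a three-taxon (star) tree: sum over the base at the internal node."""
    total = sum(rel_lengths)
    lik = 0.0
    for centre in BASES:
        term = 0.25  # stationary base frequency under JC69
        for tip, t in zip(pattern, rel_lengths):
            # Assumed scaling: each branch receives the fraction t/total of u.
            term *= jc_prob(centre, tip, u * t / total)
        lik += term
    return lik

def log_gamma_prior(u, mean=10.0, shape=0.5):
    """Unnormalised log gamma density; the constant does not change the argmax."""
    scale = mean / shape
    return (shape - 1.0) * math.log(u) - u / scale

def estimate_rate(pattern, rel_lengths, use_prior=False, u_max=20.0, steps=2000):
    """Grid-search maximiser of Eq. 1, or of Eq. 3 when use_prior is True."""
    def objective(u):
        value = math.log(site_likelihood(u, pattern, rel_lengths))
        return value + log_gamma_prior(u) if use_prior else value

    grid = [u_max * k / steps for k in range(1, steps + 1)]
    best = max(grid, key=objective)
    if best == grid[0]:
        return 0.0       # maximum at the lower boundary (invariant site)
    if best == grid[-1]:
        return math.inf  # still increasing at u_max (heuristic cut-off)
    return best

if __name__ == "__main__":
    tree = (2, 1, 1)  # relative branch lengths, as in Figure 1
    print(estimate_rate("GAA", tree))                  # finite interior maximum
    print(estimate_rate("GAC", tree))                  # inf: strictly increasing
    print(estimate_rate("GAC", tree, use_prior=True))  # finite once a prior is used
```

Under these assumptions the pattern {G, A, A} gives an interior maximum and the pattern {G, A, C} gives an infinite estimate, which becomes finite once the gamma prior (mean 10.0, shape 0.5) is applied. The numerical location of the maximum depends on how u and the branch lengths are scaled, so it need not coincide with the axis values of Figure 1.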
Estimates of topology and branch lengths of the phylogeny were obtained by the maximum likelihood method used on a data set of 14,416 protein-coding sites of mtDNA (Cummings et al., 1995). The HKY model of DNA substitution with gamma-distributed rates (Hasegawa et al., 1985; Yang, 1993) was used to estimate the topology and branch lengths. The estimates of these parameters were assumed to be accurate under the model. This assumption is justified by the large number of sites included in the analysis and is also supported by the fact that other methods of tree reconstruction provide identical topologies (see Cummings et al., 1995). Subsequently, the transition/transversion (Ts/Tv) bias, the mean number of substitutions, and the shape parameter of the gamma distribution were estimated independently for the first, second, and third codon positions (Table 1). Applying these estimated parameter values, $\hat{u}_m$ was obtained for each site using the HKY model (Hasegawa et al., 1985) with a prior distribution given by a gamma distribution with shape parameter identical to the empirically estimated value. The ratio of transitions to transversions was assumed to be the same for each site within each codon position.

TABLE 1. Estimated parameter values for correlation of substitution rates in mtDNA.
Column headings: Codon position; Ts/Tv bias; Gamma shape parameter; Mean no. substitutions/site
Ts/Tv bias: 1.38 1.19 2.47
Gamma shape parameter, mean no. substitutions/site: 0.45 0.27 1.90 0.80 6.16

After estimation of the relevant parameters of the mutation model, the autocorrelation function was estimated using standard methods (see Box and Jenkins, 1976) separately for each codon position (Fig. 5). There is a strong correlation of rates for distances up to 40 codons apart in the first and second positions and a peak in the autocorrelation function at distances of 75 codons. However, in the third codon position there is little or no correlation of rates among sites.
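The separate estimation of the autocorrelation function for each codon position can be sketched as follows. This is an illustration under stated assumptions rather than the code used for Figure 5: it uses the standard sample autocorrelation (the lag-k autocovariance divided by the lag-0 autocovariance, as in time-series texts such as Box and Jenkins, 1976), it assumes that the list of per-site rate estimates is finite-valued and starts at a first codon position, and the function names are hypothetical. No smoothing is applied.

```python
def autocorrelation(x, max_lag):
    """Sample autocorrelation r_0..r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

def autocorrelation_by_codon_position(rates, max_lag):
    """Split per-site rates by codon position; lags are then measured in codons."""
    return {
        pos + 1: autocorrelation(rates[pos::3], max_lag)
        for pos in range(3)
    }

if __name__ == "__main__":
    # Toy rates only: three made-up series interleaved as codon positions 1, 2, 3.
    first = [0.0, 1.0] * 50
    second = [1.0, 1.0, 0.0, 0.0] * 25
    third = [float(i % 5) for i in range(100)]
    rates = [r for codon in zip(first, second, third) for r in codon]
    acf = autocorrelation_by_codon_position(rates, max_lag=2)
    print(acf[1])  # first positions alternate, so lag 1 is strongly negative
```

Splitting the sequence with a stride of three before computing the autocorrelation is what removes the 3-bp periodicity caused by the genetic code, so that any remaining structure reflects correlation among codons.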
None of the same data were used when estimating the autocorrelation function in the first and in the second position. The close similarity of the two functions therefore strongly suggests that the pattern observed (Fig. 5) is not related to sampling.

FIGURE 5. The autocorrelation function along the sequence of mtDNA for the first 200 codons (first, second, and third codon positions; distance in codons) estimated applying the HKY model with a gamma distribution as the prior distribution. The values of the Ts/Tv bias, the gamma shape parameter, the mean number of substitutions, and the base frequencies were first estimated for each codon position separately. These values were then used when the rate was estimated at each site. The curves were obtained by binomial smoothing.

DISCUSSION

There may be several explanations for the correlation in rates in second and first positions among codons close to each other: the rate of mutation may vary among regions, mutations may stretch over several codons (e.g., if they arise as gene conversions), or selection may act differently in different regions. The existence of separate functional domains predicts such differences in the intensity of selection between different regions. If the lack of correlation in rates between codons in the third position is caused by constancy and/or independence in the rate among third positions, the third explanation probably is correct. However, because third positions are strongly saturated (Table 1), it is difficult to distinguish among these three explanations. A high degree of saturation will increase the variance in the estimate and will tend to accentuate any deficiencies of the model of DNA evolution. As more sequences of the mitochondrial genome become available, this problem may be alleviated by including less divergent sequences in the analysis.
The pattern observed in the first 40 codons at the first and second codon positions is not surprising. This type of first-order correlation is expected if the rate at one site is positively correlated with the rate at neighboring sites (this pattern was also observed by Yang, 1995). The general observation that most proteins are organized into different functional domains predicts such correlation. Nucleotides specifying amino acids within functional domains are expected to have lower rates than nucleotides located between domains because these supposedly are more free to vary. Therefore, one does not need to invoke correlated changes, such as expected under the covarion hypothesis (Fitch, 1971) or when gene conversions are frequent, to explain this pattern of correlated rates. However, the second peak in the autocorrelation function (at a distance of 75 codons) cannot in itself be explained by the observation that nucleotides close to each other have similar rates. Instead, it may be an effect of the organization of the secondary structure of the mitochondrially encoded proteins or it may be caused by variation in the GC content in the mitochondrial genome.

There are several important consequences of the observation that rates are strongly spatially correlated. First, the variance in the estimate of parameters obtained using phylogenetic methods may be greater than that originally reported. Second, tests aimed at detecting gene conversion, recombination, or conflicting phylogenetic signal should be interpreted with caution. Such tests typically examine whether the distribution of certain site patterns along the sequence is clustered. For example, some tests for gene conversion search for runs of certain site patterns (see Takahata, 1994, for a review). However, some degree of clustering is expected simply because rates are spatially correlated.
If such tests implicitly or explicitly assume that the rates at different sites are independent, the tests may not be valid when rates are correlated. Maximum likelihood estimate of the rate at each site in a DNA sequence is an obvious method for analyzing rate variation and the factors underlying molecular evolution. This method performs much better than the commonly applied parsimony estimator when the levels of divergence are high. However, the method assumes that certain parameters, such as topology and branch lengths of the underlying phylogenetic tree, are known. There may be three situations where this assumption is reasonable. First, when the number of sites is very high one may trust the large sample properties of the maximum likelihood estimator and conclude that under the model the estimates obtained are equal to the true values. Second, one may perform robustness analyses (e.g., by simulations) to assess to what degree the conclusions of the study are affected by the assumption of knowledge about the true parameter values. Third, in some cases it may be established a priori that the conclusions are not sensitive to any bias introduced by assuming a wrong topology or wrong branch lengths. For example, in the present case, assuming a wrong set of branch lengths could not possibly in itself introduce any spurious correlation where no correlation existed. However, if the method is heuristically applied to data sets containing many fewer sites than in the present example or to examine other problems, the robustness of the method should be carefully analyzed. In such cases, it may often be more appropriate to apply the method of Yang (1995). The maximum likelihood approach described here is applicable for purposes other than the study of the correlation of rates. For example, the likelihood value at each site provides a measure of the support of the tree and the model at that site given the estimated mutation rate. 
In this way, $\hat{u}$ and the corresponding likelihood value may be used jointly for assessing the difference in the support of a tree along the sequence.

The potential dependence of the site-by-site rate estimates on the assumed branch lengths and topology should not prevent researchers from using this method. As more sequence data become available, estimates of branch lengths will gradually improve. Methods such as the one presented here, which take advantage of the information obtained in estimates of the topology and branch lengths of phylogenetic trees, are another means of examining many interesting evolutionary questions.

ACKNOWLEDGMENTS

I thank M. Slatkin, J. Huelsenbeck, Z. Yang, M. Hasegawa, C. Simon, T. Speed, and an anonymous reviewer for comments and discussion. I am grateful to Mike Cummings for providing the aligned nucleotide sequences. This work was supported in part by NIH grant GM40282 to M. Slatkin and by personal grants to R.N. from the Danish Research Academy and the Danish Research Council.

REFERENCES

ANDERSON, S., A. T. BANKIER, B. G. BARRELL, M. H. DE BRUIJN, A. R. COULSON, J. DROUIN, I. C. EPERON, D. P. NIERLICH, B. A. ROE, AND F. SANGER. 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457-465.
ANDERSON, S., M. H. L. DE BRUIJN, A. R. COULSON, I. C. EPERON, F. SANGER, AND I. G. YOUNG. 1982. Complete sequence of bovine mitochondrial DNA. Conserved features of the mammalian mitochondrial genome. J. Mol. Biol. 156:683-717.
ARNASON, U., A. GULLBERG, AND B. WIDEGREN. 1991. The complete nucleotide sequence of the fin whale, Balaenoptera physalus. J. Mol. Evol. 33:556-568.
ARNASON, U., AND E. JOHNSON. 1992. The complete nucleotide sequence of the mitochondrial DNA of the harbor seal, Phoca vitulina. J. Mol. Evol. 34:493-505.
BIBB, M. J., R. A. VAN ETTEN, C. T. WRIGHT, M. W. WALBERG, AND D. A. CLAYTON. 1981. Sequence and gene organization of mouse mitochondrial DNA. Cell 26:167-180.
BOX, G. E. P., AND G. M. JENKINS. 1976. Time series analysis: forecasting and control. Holden-Day, San Francisco.
CHANG, Y.-S., F.-L. HUANG, AND T.-B. LO. 1994. The complete nucleotide sequence and gene organization of carp (Cyprinus carpio) mitochondrial genome. J. Mol. Evol. 38:138-155.
CUMMINGS, M., J. WAKELEY, AND S. P. OTTO. 1995. Sampling DNA sequence data in phylogenetic analysis. Mol. Biol. Evol. 12:814-822.
DESJARDINS, P., AND R. MORAIS. 1990. Sequence and gene organization of the chicken mitochondrial genome, a novel gene order in higher vertebrates. J. Mol. Biol. 121:599-634.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368-376.
FITCH, W. M. 1971. Rate of change of concomitantly variable codons. J. Mol. Evol. 1:84-96.
GADALETA, G., G. PEPE, C. DE CANDIA, C. QUAGLIARIELLO, E. SBIS, AND C. SACCONE. 1989. The complete nucleotide sequence of the Rattus norvegicus mitochondrial genome: Signals revealed by comparative analysis between vertebrates. J. Mol. Evol. 28:497-516.
HASEGAWA, M., H. KISHINO, AND T. YANO. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.
HOROVITZ, I., AND A. MEYER. 1995. Systematics of New World monkeys (Platyrrhini, Primates) based on 16S mitochondrial DNA sequences: A comparative analysis of different weighting methods in cladistic analysis. Mol. Phylogenet. Evol. 4:448-456.
JIN, L., AND M. NEI. 1990. Limitations of the evolutionary parsimony of phylogenetic analysis. Mol. Biol. Evol. 7:82-102.
JUKES, T. H., AND C. R. CANTOR. 1969. Evolution of protein molecules. Pages 21-132 in Mammalian protein metabolism III (H. N. Munro, ed.). Academic Press, New York.
KELLY, C., AND J. RICE. 1996. Modeling nucleotide evolution: A heterogeneous rate analysis. Math. Biosci. 133:85-109.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
RICE, J. A. 1995. Mathematical statistics and data analysis. Duxbury Press, Belmont, California.
ROE, B. A., D.-P. M. WILSON, AND J. F.-H. WONG. 1985. The complete nucleotide sequence of the Xenopus laevis mitochondrial genome. J. Biol. Chem. 260:9759-9774.
TAKAHATA, N. 1994. Comments on the detection of reciprocal recombination or gene conversion. Immunogenetics 39:146-149.
TZENG, C. S., C.-F. HU, S.-C. SHEN, AND P. C. HUANG. 1992. The complete nucleotide sequence of the Crossostoma lacustre mitochondrial genome: Conservation and variations among vertebrates. Nucleic Acids Res. 20:4853-4858.
UZZELL, T., AND K. W. CORBIN. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096.
YANG, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 39:105-111.
YANG, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306-314.
YANG, Z. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005.
YANG, Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.

Received 2 April 1996; accepted 15 January 1997
Associate Editor: Chris Simon