Syst. Biol. 58(2):199–210, 2009 c Society of Systematic Biologists Copyright DOI:10.1093/sysbio/syp015 Advance Access publication on June 29, 2009 Statistical Comparison of Nucleotide, Amino Acid, and Codon Substitution Models for Evolutionary Analysis of Protein-Coding Sequences TAE -K UN S EO1,∗ AND H IROHISA K ISHINO 2 1 Professional Programme for Agricultural Bioinformatics; 2 Laboratory of Biometrics and Bioinformatics, Graduate School of Agricultural and Life Sciences, University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo 113-8657, Japan; ∗ Correspondence to be sent to: Professional Programme for Agricultural Bioinformatics, Graduate School of Agricultural and Life Sciences, University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo 113-8657, Japan; E-mail: [email protected]. Abstract.—Statistical models for the evolution of molecular sequences play an important role in the study of evolutionary processes. For the evolutionary analysis of protein-coding sequences, 3 types of evolutionary models are available: 1) nucleotide, 2) amino acid, and 3) codon substitution models. Selecting appropriate models can greatly improve the estimation of phylogenies and divergence times and the detection of positive selection. Although much attention has been paid to the comparisons among the same types of models, relatively little attention has been paid to the comparisons among the different types of models. Additionally, because such models have different data structures, comparison of those models using conventional model selection criteria such as Akaike information criterion (AIC) or Bayesian information criterion (BIC) is not straightforward. Here, we suggest new procedures to convert models of the above-mentioned 3 types to 64dimensional models with nucleotide triplet substitution. These conversion procedures render it possible to statistically compare the models of these 3 types by using AIC or BIC. By analyzing divergent and conserved interspecific mammalian sequences and intraspecific human population data, we show the superiority of the codon substitution models and discuss the advantages and disadvantages of the models of the 3 types. [AIC; amino acid model; BIC; codon model; likelihood ratio test; model comparison; nucleotide model.] The statistical models for the evolution of molecular sequences are designed to approximate the simplified forms of the complex aspects of evolutionary processes. In these simplified forms, evolutionary processes are summarized using a certain number of parameters. This simplification enables us to easily understand evolutionary processes and identify the major driving forces behind them. For the evolutionary analysis of nucleotide sequences, 3 types of evolutionary models are used: 1) nucleotide, 2) amino acid, and 3) codon substitution models. Depending on the type of data, all or some of these models are applicable. This study focuses on the analysis of protein-coding DNA sequences in which all 3 types of models are applicable. Nucleotide substitution models define the relative occurrence of substitutions among the 4 nucleotides; hence, we refer to them as “4-dimensional” substitution models. Here, the “dimension” represents the number of states and the number of rows and columns of the transition rate matrix of the models. The simple Jukes– Cantor model (Jukes and Cantor 1969) assumes the equal occurrence of transitions and transversions and equal nucleotide frequencies, whereas the improved models relax either the former (Kimura 1980) or the latter assumptions (Felsenstein 1981). In addition, both assumptions can be relaxed simultaneously (Hasegawa et al. 1985), and the type of transitions can be distinguished (Tamura and Nei 1993). All these models are generalized by the general time-reversible model (GTR; Tavar é 1986; Yang 1994) in which the 4 nucleotide frequencies and 12 possible types of substitutions can be different. Amino acid substitution models such as JTT (Jones et al. 1992), Dayhoff (Dayhoff et al. 1978), mtREV (Adachi and Hasegawa 1996), WAG (Whelan and Goldman 2001), and so on, define the relative occurrence of substitutions among the 20 amino acids; thus, we refer to them as “20dimensional” substitution models. Database-derived information for substitution patterns is obtained by using parsimony-based methods (e.g., Dayhoff et al. 1978; Jones et al. 1992) or likelihood-based methods (e.g., Adachi and Hasegawa 1996; Whelan and Goldman 2001). In the simple “proportional amino acid model” (Cao et al. 1994), all types of amino acid substitutions are equally probable, and the mechanistic amino acid models that consider the mutational biases of nucleotides (Yang et al. 1998) are also available. The amino acid frequencies can be derived either from a database or from the analysis of sequence data. In case the latter technique is used, the models have been found to be considerably improved; the names of such models are assigned the suffix “+F” (Cao et al. 1994). Codon substitution models define the relative occurrence of substitutions among sense codons (Goldman and Yang, 1994; Muse and Gaut, 1994). For simplicity, we describe codon models using the universal codon table; hence, we refer to them as “61-dimensional” substitution models. However, the specific number of dimensions of a codon model is not crucial in our study, and our descriptions are equally valid in codon models that use different codon tables. Using the codon model developed by Goldman and Yang (1994; GY94), the instantaneous transition rate from codon i to codon j is defined as follows: 199 (61) Qij ⎧ 0, more than one DNA substitution ⎪ ⎪ ⎪ ⎪ , synonymous transversion π ⎪ ⎨ j synonymous transition = κπj , , (1) ⎪ ⎪ ⎪ ωπ , nonsynonymous transversion ⎪ j ⎪ ⎩ ωκπj , nonsynonymous transition 200 SYSTEMATIC BIOLOGY where πj and κ represent the frequency of codon j and the relative occurrence of transitions with respect to transversions, respectively, and ω represents the relative occurrence of nonsynonymous substitutions with respect to synonymous substitutions and is used to detect deviation from neutral evolution. The superscript “(61)” indicates that the transition is among 61 sense codons. The codon model in equation 1 can be improved by allowing instantaneous multiple nucleotide substitutions (Whelan and Goldman 2004) or by incorporating the physical properties of proteins (Robinson et al. 2003) and the empirical substitution pattern derived from a database (Doron-Faigenboim and Pupko 2007; Kosiol et al. 2007; Seo and Kishino 2008). Because statistical models are “approximations” of natural phenomena, all statistical models are fundamentally inaccurate. Thus, it is very crucial to select an appropriate model that minimizes the discrepancy between the model and the true evolutionary process. It has been noted that the selection of inappropriate models may result in the incorrect estimation of phylogenetic relationships and the incorrect estimation of uncertainty of tree topologies (e.g., Gaut and Lewis 1995; Sullivan and Swofford 1997; Bruno and Halpern 1999; Buckley 2002; Anderson and Swofford 2004). In the Bayesian framework, incorrect branch-length estimation due to the adoption of inappropriate models may cause incorrect estimation of divergence time (Thorne et al. 1998) or incorrect estimation of the trend in selective pressure (Seo et al. 2004). Typically, likelihood-based models are compared using both frequentist and Bayesian inference methods. When one model is nested within another, that is, when one model is a special case of the other model, the likelihood ratio test (LRT; Stuart and Ord 1996) can be adopted to test the significance of the advantages of nesting models. For various nucleotide substitution models, hierarchical LRT can be adopted to select the set of appropriate models (e.g., Frati et al. 1997; Sullivan et al. 1997; Posada and Crandall 2001). For models that are not nesting or nested, LRT is not applicable, and the Akaike information criterion (AIC; Akaike 1974) or the Bayesian information criterion (BIC; Schwarz 1974) can be adopted for model selection. For a given data set (X) comprising n independent data items (X: = {x1 , · · · , xn }) and model i, the AIC and BIC are defined as i |X) + 2K AIC(i): = −2l(θ (2) i |X) + K log n, BIC(i): = −2l(θ (3) i is the vector of the maximum likelihood estiwhere θ i , and l(θ i |X) is mates of model i, K is the dimension of θ the log-likelihood score for θ i and X. The model whose AIC (or BIC) score is the minimum score is regarded as the best model. Models can also be compared using a decision-theoretic approach (Minin et al. 2003; Abdo et al. 2005) and the Bayes factor (e.g., Suchard et al. 2001; Aris-Brosou and Yang 2002). In these comparisons, appropriate loss functions and posterior distributions have VOL. 58 to be determined by using the likelihood calculated using substitution models. Although considerable attention has been paid to the comparisons among the same types of models (e.g., Nielsen and Yang 1998; Posada and Crandall 1998; Yang and Nielsen 1998, 2002; Yang et al. 2000; Abascal et al., 2005), relatively little attention has been paid to the comparisons among the different types of models (Shapiro et al. 2006; Seo and Kishino 2008). The comparisons among different types of models are difficult because these models have different data structures, and hence, the AIC or BIC cannot be applied in a straightforward manner in these comparisons. For calculating the AIC (or BIC) scores for different models, the data X in equation 2 (or equation 3) should be identical. Although the entire data set of aligned sequences is identical, individual data items are different for different types of models. When applying nucleotide models, we separate each codon column into 3 nucleotide columns. This “codon separation” yields different data X for nucleotide and codon models, and the number of individual data items for nucleotide models becomes 3n. For applying amino acid models, we translate codons into amino acids. Even a “translation” yields different data X for amino acid and codon models, although the number of individual data items is identical in this case. Recently, we have developed a new procedure to statistically compare amino acid models with codon models (Seo and Kishino 2008). By converting the 20dimensional amino acid models into 61-dimensional codon models and applying amino acid models, it is possible to measure the log-likelihood scores corresponding to translations. Further, amino acid models can be regarded as special cases of codon models, and both types of models with the same 61-dimensional data structure can be compared (see Theory). Although amino acid models can be compared with codon models, the technique to compare these models with 4-dimensional nucleotide models is still unclear. A comparison of the AIC scores calculated using the nucleotide and codon models has been attempted (Shapiro et al. 2006); however, no theoretical justification has been provided for the comparison. Here, we solve the problem of comparing nucleotide and codon models despite their different data structures. We propose new procedures to convert both 4-dimensional nucleotide models and 61-dimensional codon models into 64-dimensional codon models. In these 64-dimensional codon models, stop codons as well as sense codons are taken into consideration. Thus, 20-dimensional amino acid models can be converted initially into 61-dimensional codon models via our previous procedure (Seo and Kishino 2008) and then into 64-dimensional codon models using our new procedure. The resulting 64-dimensional codon models are mathematically equivalent to 4dimensional nucleotide, 20-dimensional amino acid, and 61-dimensional codon models. Furthermore, due to this mathematical equivalence, it is possible to obtain identical data X in equations 2 and 3 and to compare the 3 types of models using the same 64-dimensional data 2009 SEO AND KISHINO—COMPARISON OF 3 TYPES OF EVOLUTIONARY MODELS structure. We have found that it is always possible to define a codon model, which is better than the given nucleotide model, by converting 4-dimensional nucleotide models into 64-dimensional codon models. It is always possible to define a codon model, which is better than the given amino acid model, by converting 20-dimensional amino acid models into 61-dimensional codon models (Seo and Kishino 2008). These theoretical findings exhibit the potential advantages and applications of codon models. We have applied our procedures to the analysis of 1) highly divergent interspecific mammalian mitochondrial, 2) highly conserved interspecific mammalian nuclear, and 3) polymorphic intraspecific human mitochondrial protein–coding sequences. Using empirical data analysis, we demonstrate and discuss the advantages of codon models over the other 2 types of models. T HEORY To begin with, we provide an overview of our previous procedure (Seo and Kishino 2008) used to convert a 20-dimensional amino acid model into 61-dimensional codon models and then explain the new procedures used to convert nucleotide and codon models into 64dimensional models. We demonstrate our conversion procedures using the HKY (Hasegawa et al. 1985) and GY94 (Goldman and Yang 1994) models, but these conversion procedures can be easily applied to any type of nucleotide or codon model. 201 transition probabilities during time t in the SK-P1 model in equation 5 can be analytically obtained using the transition probabilities and eigenvalues of the amino acid model in equation 4. When ρ = ∞ in equation 5, synonymous substitutions are completely saturated, and the SK-P1 model is reduced to an “SK-P0” model, which is mathematically equivalent to the amino acid model in equation 4. The empirical sai aj values obtained from equation 4 can be used to improve the conventional codon model in equation 1; the improved model is referred to as the “SK-P2” model (Seo and Kishino 2008). For the SK-P2 model, the instantaneous transition rate from codon i to codon j is defined as ⎧ 0, more than one DNA substitution ⎪ ⎪ ⎪ ⎪ ⎪ synonymous transversion ⎪ ⎨ρπj , synonymous transition . (6) Rij = ρκπj , ⎪ ⎪ ⎪ s π , nonsynonymous transversion ⎪ ai aj j ⎪ ⎪ ⎩s κπ , nonsynonymous transition ai aj j When sai aj ≡ 1 and ρ = 1/ω, the SK-P2 model becomes identical to the GY94 model. That is, the GY94 model is an SK-P2 model in which the simple proportional amino acid model (sai aj ≡ 1; Cao et al. 1994) is incorporated. Because any amino acid model can be incorporated into the SK-P2 model, the SK-P2 model is more flexible than the GY94 model. Converting 20-Dimensional Amino Acid Models into 61-Dimensional Codon Models: Overview Let us consider an amino acid model in which the instantaneous transition rate is defined as sai aj πaj , if ai = aj Rai aj = , (4) − ak :ak =ai Rai ak , if ai = aj Converting 4-Dimensional HKY Model into 64-Dimensional Model In the HKY model (Hasegawa et al. 1985), the instantaneous transition rate from nucleotide a to nucleotide b is defined as follows: κπb , transition (4) Rab = , (7) πb , transversion where πaj is the frequency of the amino acid aj and sai aj represents the relative occurrence of substitutions from ai to aj during an infinitesimal time period. The sai aj values are typically estimated by parsimony (Dayhoff et al. 1978; Jones et al. 1992) or likelihood (Adachi and Hasegawa 1996; Whelan and Goldman 2001) methods. From the amino acid model in equation 4, we derive a 61-dimensional SK-P1 model (Seo and Kishino 2008) in which the instantaneous transition rate from codon i to codon j is defined as sai aj πj , nonsynonymous change Rij = , (5) synonymous change ρπj , where πb represents the frequency of nucleotide b. We use the superscript “(4)” to indicate that equation 7 defines the instantaneous transition rates among the 4 nucleotides. Let us denote the 4 × 4 rate matrix derived from equation 7 as R(4) . Then, the transition probability from nucleotide a to nucleotide b during time t is given by the element in the ath row and bth column of P(4) (t): = exp{tR(4) }. For simplicity, we do not consider the normalizing factors of rate matrices in our study. Similar to the definition in equation 7, we define a 64-dimensional codon model in which the instantaneous transition rate from codon r to codon s is given as follows: ⎧ more than one DNA substitution ⎪ ⎨0, (64) Rrs = κπsp , transition occurs at pth site , (8) ⎪ ⎩π , transversion occurs at pth site sp where aj is the amino acid that codon j encodes and πj is the frequency of codon j. The sai aj values are obtained from the amino acid model in equation 4, and the free parameter ρ represents the relative occurrence of synonymous substitutions with respect to sai aj . The where p ∈ {1, 2, 3} and sp is the nucleotide of codon s at the pth site (refer Whelan and Goldman [2004] for a 202 VOL. 58 SYSTEMATIC BIOLOGY similar conversion procedure within 61 dimensions). Let us denote the 64 × 64 rate matrix derived from equation 8 as R(64) . Then, the transition probability from codon r to codon s during time t is given by the element in the rth row and sth column of P(64) (t): = exp{tR(64) }. We have found that the transition probabilities of the R(64) model can be represented by multiplication of the transition probabilities of the R(4) model at each nucleotide site (see Appendix). That is, (4) (4) (4) p(64) rs (t) = pr1 s1 (t) · pr2 s2 (t) · pr3 s3 (t). (9) Further, we have found that the likelihood of the phylogeny obtained using the R(64) model is identical to that obtained using the R(4) model. Let us denote the protein-coding sequence data as S: = s1 s2 · · · s3n , where si is the ith column of nucleotide alignment and n is the number of codon columns. Let θ denote the unknown parameters of the rate matrix and the time duration along the branches of phylogeny. Then, for the ith column of codon alignment, we can show the following relationship (see Appendix): θ|s3i−2 s3i−1 s3i ) = L(4) (θ θ|s3i−2 ) · L(4) (θ θ|s3i−1 ) · L(64) (θ θ|s3i ), L(4) (θ (10) θ|·) and L(4) (θ θ|·) represent the likelihoods where L(64) (θ (64) calculated using the R and R(4) models, respectively. Using equation 10, we obtain the following relationship: θ|S) = L(64) (θ n θ|s3i−2 s3i−1 s3i ) L(64) (θ i=1 n θ|s3i−2 ) · L(4) (θ θ|s3i−1 ) · L(4) (θ θ|s3i )} = {L(4) (θ i=1 = 3n (4) θ|si ) = L(4) (θ θ|S). L (θ In general, the third-codon sites in protein-coding sequences evolve faster than the first- or second-codon sites because most nucleotide substitutions at the thirdcodon sites are synonymous. Thus, typically, an additional free parameter is assigned to the third-codon sites to allow a high evolutionary rate in the analysis using 4dimensional nucleotide models. This type of nucleotide model can also be converted into a 64-dimensional codon model. More generally, we can assign different evolutionary rates for the 3 codon sites. In equation 8, if we multiply α, β, and γ with the instantaneous rates during nucleotide substitutions at the first-, second-, and third-codon sites, respectively, then equation 9 changes into (4) (4) (4) p(64) rs (t) = pr1 s1 (αt) · pr2 s2 (βt) · pr3 s3 (γ t), which can be easily proven by modifying the proof in the Appendix. By setting α = β = 1 and γ as a free parameter, we can define an R(64) model equivalent to an R(4) model in which the third-codon sites are fastevolving sites. Furthermore, by assuming that α, β, and γ follow the gamma distribution, we can convert the nucleotide model with gamma rate heterogeneity (Yang 1993) into a 64-dimensional model (see Results and Discussion). Converting a 61-Dimensional Codon Model into a 64-Dimensional Codon Model This conversion procedure is rather straightforward. For a given 61-dimensional codon model, we consider a 64-dimensional codon model in which the instantaneous transition rate from codon r to codon s is (61) Qrs , if r and s are sense codons , (12) = Q(64) rs 0, otherwise (61) (11) i=1 That is, the likelihood calculated using the nucleotide model R(4) is identical to that calculated using the codon model R(64) . This implies that we can regard a 4-dimensional nucleotide model as a 64-dimensional codon model in which the nucleotides at the first-, second-, and third-codon sites are grouped together as a single unit in the evolutionary process. The instantaneous transition rate given by equation 8 can be nonzero even when r and/or s are stop codons. This allows nucleotide substitutions to occur unrestrictedly and independently at the 3 codon sites; this independence is consistent with the factorization of equations 9 and 10. Because of the independence among the 3 codon sites in the R(64) model, the stationary frequency πr is given by πr1 πr2 πr3 , even when r is a stop codon. The nonzero instantaneous transition rates when a stop codon is (or stop codons are) involved and the nonzero frequencies of stop codons are the 2 main causes of the poor performance of nucleotide models (see Results and Discussion). where Qrs is the instantaneous transition rate from codon r to codon s in the given 61-dimensional codon model, as in equations 1, 5, and 6. Then, the 64 × 64 rate matrix derived from equation 12 is represented as ⎛ (61) ⎞ Q 0 0 0 ⎜ 0T 0 0 0⎟ ⎟, Q(64) = ⎜ (13) ⎝ 0T 0 0 0⎠ 0 0 0 0T where 0 represents a (61 × 1) zero vector and the last three rows and columns correspond to the positions of stop codons. Because Q(64) is a diagonal block matrix, the transition probability matrix during time t is obtained by the exponentiation of the diagonal blocks as follows: ⎛ ⎞ exp{tQ(61) } 0 0 0 ⎜ 1 0 0⎟ 0T ⎟. exp{tQ(64) } = ⎜ (14) T ⎝ 0 1 0⎠ 0 0T 0 0 1 The Q(64) model has 2 important features. First, the transition probabilities among sense codons obtained 2009 SEO AND KISHINO—COMPARISON OF 3 TYPES OF EVOLUTIONARY MODELS using the Q(64) model are identical to those obtained using the Q(61) model. Second, stop codons are explicitly considered, but transitions from a sense codon (or from a stop codon) to a stop codon (or to a sense codon) never occur, and stop codons, if observed in protein-coding sequences, remain constant. In general, stop codons are not observed in protein-coding sequences, and we can adopt the sense codon frequencies of the Q(61) model for the Q(64) model. Then, the likelihoods of the Q(61) and Q(64) models are identical, and the Q(61) model can be regarded as the Q(64) model. In terms of the likelihood scores, there is no change in the Q(64) model by explicitly considering stop codons. Statistical Comparison of 3 Types of Models Using our conversion procedures, we have theoretically shown that 4-dimensional nucleotide models and 61-dimensional codon models can be regarded as 64dimensional codon models, and their likelihoods can be directly compared. In our previous study (Seo and Kishino 2008), we showed that the likelihoods of 20dimensional amino acid models and 61-dimensional codon models are not directly comparable. Because the likelihood corresponding to translations is explicitly considered in the SK-P0 model, amino acid models are comparable to codon models only through SK-P0 models. Therefore, the likelihoods of nucleotide, SK-P0, and codon models are comparable, and the corresponding AIC or BIC scores can be adopted to determine the best model among the 3 types. Our procedure for the conversion of a 4-dimensional nucleotide model into a 64-dimensional codon model clearly implies that the BIC scores should be calculated carefully within the type of nucleotide models. For the sequence data of n aligned codon columns, the number of samples in the simple nucleotide models is 3n; in these models, the 3 codon sites are assumed to be independently and identically distributed. Therefore, instead of equation 3, the following equation should be adopted for calculating the BIC scores: i |X) + K log{3n}. BIC(i): = −2l(θ However, when the 3 codon sites are assumed to evolve differently, the 3 consecutive nucleotide sites should be grouped together and regarded as a single unit of sequence evolution. Therefore, equation 3 should be adopted for calculating the BIC scores even for the comparison of nucleotide models. R ESULTS AND D ISCUSSION Sequence Data For the comparisons of the 3 types of models, we analyzed 3 sequence data sets comprising 1) fast-evolving interspecific, 2) slowly evolving interspecific, and 3) highly polymorphic intraspecific protein-coding sequences. The first data set consisted of 12 mitochondrial genes obtained from 69 mammalian species (Nikaido et al. 203 2003). The data sequences were downloaded from the GenBank sequence database (for accession number, see Nikaido et al. 2003), aligned using ClustalW (Chenna et al. 2003) with the default settings, and manually edited. The log-likelihood scores for various models were calculated using “Tree 6” estimated by Nikaido et al. (2003). For the second data set, we selected the 10 most slowly evolving genes from the 2789 nuclear genes of 10 mammals that were analyzed by Nishihara et al. (2007) to investigate the phylogenetic relationship among the Boreoeutheria, Xenarthra, and Afrotheria groups. Using the phylogenies estimated by Nishihara et al. (2007), we calculated the likelihood scores of 10 genes using various models. The third data set consisted of 12 mitochondrial genes of humans. We excluded individuals from the same location and selected 33 human mitochondrial genomes from the data set obtained by Ingman et al. (2000) for the investigation of human origin. The mitochondrial genomes were separated into 12 protein-coding sequences, which were aligned using ClustalW (Chenna et al. 2003) with the default settings and then manually edited. The log-likelihood scores for various models were calculated using the neighbor-joining (Saitou and Nei 1987) tree topology estimated by Ingman et al. (2000). In the analysis of the 3 data sets, we have not considered the uncertainty in the tree topologies and have assumed that the adopted topologies represent the true phylogenetic relationships. However, we expect that our results are robust for different tree topologies because model selection has been known to be rarely affected by tree topologies if the adopted topologies are not considerably different from the true topologies (Sullivan and Swofford 1997; Posada and Crandall 2001; Abdo et al. 2005). Three Types of Models and Their Conversion to 64-Dimensional Models DNA model and conversion.—Although our conversion procedure is generally applicable for any type of nucleotide model, we demonstrate it here using the HKY model (Hasegawa et al. 1985). To explicitly denote that equation 8 is derived from the HKY model in equation 7, we will refer to the 64-dimensional model in equation 8 as the HKY(64) model in our analysis. We can make the HKY(64) model more realistic by assigning an additional free parameter in the rate matrix given by equation 8 during the substitutions at the third-codon sites and we refer to this new model as the HKY (64) /3rd model. Further, the HKY(64) /3rd model can be made more realistic by assigning zero instantaneous transition rates when stop codons are involved and refer to this new model as the HKY(61) /3rd model. While there exist 4-dimensional nucleotide models equivalent to the HKY(64) and HKY(64) /3rd models, there exist no nucleotide models equivalent to the HKY (61) /3rd model. Therefore, the HKY(61) /3rd model should be 204 −9141.1 −8952.0 −8812.9 −9430.7 −8838.7 −8931.2 −8887.5 ATP6 −26825.4 −24276.1 −24120.3 −23801.3 −22612.8 −22118.3 −22437.4 CYTB NADH4L −12159.7 −11235.6 −11164.1 −10795.0 −10392.6 −10154.6 −10342.8 NADH4 −57511.1 −52951.2 −52437.4 −51378.4 −49401.6 −48520.2 −49213.5 Notes. The number of parameters in the rate matrix is represented by p. NADH5 Codon models and their conversion.—As we noted, the 61(or 60) dimensional codon models can be regarded as 64-dimensional codon models. For a statistical comparison of the 3 types of models, we considered 3 low-dimensional codon models: 1) GY94 (Goldman and Yang 1994), 2) SK-P1 (equation 5), and 3) SK-P2 (equation 6) models. The sai aj values of the mtREV (Adachi and Hasegawa, 1996) and WAG (Whelan and Goldman 2001) models were incorporated into the SK-P1 and SK-P2 models for the analysis of the mitochondrial and nuclear protein–coding sequences, respectively. −14010.6 −12981.8 −12863.0 −12337.8 −11867.5 −11675.2 −11953.6 NADH3 NADH2 Amino acid models and their conversion.—Different amino acid models were adopted for different data sets. For the mammalian and human mitochondrial genes, we adopted the mtREV + F (Adachi and Hasegawa 1996) amino acid model. For the mammalian nuclear genes, we adopted the WAG + F (Whelan and Goldman 2001) amino acid model. These amino acid models were converted into SK-P0 and SK-P1 models, respectively. −46635.7 −44023.7 −43565.5 −41987.8 −40517.9 −40368.1 −41067.9 −35967.5 −32514.8 −32264.3 −31088.0 −29645.3 −29074.5 −29768.1 NADH1 COX3 −27950.8 −24375.7 −24157.2 −23518.7 −22217.1 −21690.8 −21995.5 COX2 −24415.9 −21217.2 −21018.4 −20566.4 −19285.8 −18675.4 −18869.3 COX1 −53522.1 −44378.9 −44077.4 −42549.5 −39670.9 −38678.4 −38934.3 Gene Model HKY(64) (p = 1) HKY(64) /3rd (p = 2) HKY(61) /3rd (p = 2) SK-P0 (p = 0) SK-P1 (p = 1) SK-P2 (p = 2) GY94 (p = 2) TABLE 1. Maximum log-likelihood scores of the 7 different models for 12 mitochondrial genes from 69 mammalian species VOL. 58 categorized as a 61-dimensional codon model. The HKY(61) /3rd model is a 61-dimensional codon model that is closest to the HKY model. −77578.4 −72380.9 −71711.4 −70434.8 −67707.4 −66941.1 −67861.4 −42043.7 −37944.8 −37636.8 −36112.2 −34286.2 −33725.8 −34399.1 ATP8 SYSTEMATIC BIOLOGY Comparing Log-Likelihoods of 7 Different Models Because our selection of 7 different models and 3 data sets is rather arbitrary, some results of model comparison might be highly specific for the compared models and the analyzed data. However, these results are very useful in elucidating the advantages, disadvantages, and performance of the 3 types of models. Other results of the model comparison are so fundamental and general that they will be robust for selections of candidate models and sequence data. These results are discussed in detail in this section. Tables 1–3 show the log-likelihood scores of the 7 different models for the 1) interspecific mammalian mitochondrial, 2) interspecific mammalian nuclear, and 3) intraspecific human mitochondrial protein–coding sequences, respectively. The number of free parameters in the rate matrices varies between 0 and 2. This range is considerably smaller than the range of differences between the log-likelihood scores of the 7 different models. For example, the longest gene in Table 1 is NADH5, whose length is 616 codons including gaps. The difference in K log n between the models with K = 2 and K = 0 is 12.8; this difference is relatively smaller than the difference in the log-likelihood scores and has negligible effect on the AIC and BIC scores. Therefore, we list the log-likelihood scores in Tables 1–3 instead of the AIC or BIC scores. The consistency in the results in Tables 1–3 indicates that codon models generally show better performances than amino acid (SK-P0) and nucleotide models (HKY (64) and HKY(64) /3rd). The SK-P2 and GY94 models are always better than the other models, and the HKY (61) /3rd model is always better than the HKY(64) and HKY(64)/3rd models. 2009 −488.55 −487.76 −481.75 −876.50 −475.94 −453.47 −453.46 −1288.26 −1267.10 −1255.54 −2742.18 −1272.55 −1206.40 −1202.34 −784.17 −741.95 −735.79 −1455.67 −690.13 −672.10 −674.28 −531.10 −491.03 −488.16 −637.61 −439.69 −421.17 −421.32 −373.66 −361.33 −357.49 −667.73 −330.04 −318.00 −317.88 −535.16 −529.42 −525.85 −1202.75 −507.91 −493.96 −497.53 −784.37 −761.00 −755.01 −1651.47 −719.05 −706.83 −707.40 −1117.39 −1098.63 −1087.09 −2741.88 −1047.34 −1031.61 −1029.38 −408.00 −401.84 −397.57 −886.46 −371.79 −362.78 −361.77 −543.33 −529.28 −525.82 −1225.90 −513.75 −491.12 −490.71 FLJ20309 PPP1R12A C10ORF30 LHX9 FOXP2 MEF2C MLR2 HOXD10 Gene Model HKY(64) (p = 1) HKY(64) /3rd (p = 2) HKY(61) /3rd (p = 2) SK-P0 (p = 0) SK-P1 (p = 1) SK-P2 (p = 2) GY94 (p = 2) TABLE 2. Maximum log-likelihood scores of the 7 different models for the 10 most slowly evolving nuclear genes from 10 mammalian species EPC2 GBA SEO AND KISHINO—COMPARISON OF 3 TYPES OF EVOLUTIONARY MODELS 205 In all cases, in Tables 1–3, the log-likelihood scores of the HKY(64) /3rd model are greater than those of the HKY(64) model. This observation is consistent with the fact that the third-codon sites are mostly synonymous sites, and synonymous substitutions generally occur more often than nonsynonymous substitutions. The advantages of the HKY(61) /3rd model over the HKY(64) /3rd model can be explained by the effect of stop codons. As we noted in Theory, nonzero frequencies of the stop codons and nonzero instantaneous transition rates when a stop codon is (or stop codons are) involved are explicitly assumed in the HKY(64) /3rd model. These unrealistic assumptions for protein-coding sequences are intentionally excluded in the HKY(61) /3rd model, which leads to the improvement of the log-likelihood scores. This provides strong evidence of the superiority of codon models, and the superiority of codon models can be generalized for different nucleotide models. For any given 4-dimensional nucleotide model, we can always define a 61-dimensional codon model in which the abovementioned unrealistic assumptions are excluded. That is, there exists at least one codon model that is better than the given nucleotide model. Similarly, consider the GTR(61) /3rd, GTR(64) /3rd, and GTR(64) models, which are derived from 4-dimensional GTR models (Tavaré 1986; Yang 1994). Like the HKY(61) /3rd model, the GTR(61) /3rd model explicitly excludes stop codons and is expected to be better than the GTR (64)/ 3rd and GTR(64) models. Because there are 4 additional free parameters in the rate matrix, the GTR (61) /3rd model requires more computation than the HKY(61) /3rd model and may not be practical for the analysis of a large number of genes and taxa. However, our fundamental and general conclusion will still hold in this case; it is always possible to define a codon model that might be computationally impractical but will be better than the given nucleotide model. In Table 1, the log-likelihood scores of the SK-P0 model are generally greater than those of the HKY (61) / 3rd model, except in the case of ATP8, whose length is 70 codons including gaps. Due to its short sequence length, ATP8 may not provide sufficient resolution for model comparison. Divergent protein-coding sequences contain a reasonable number of amino acid substitutions, and the substitution pattern will be consistent with the sai aj values in equation 4. In Table 1, the additional information provided by the sai aj values indicates the superiority of the SK-P0 model over the HKY (61) /3rd model. However, when the sequences are not divergent (Tables 2 and 3), the observed pattern of amino acid substitutions will be different from the empirical sai aj values. In this case, the HKY(61) /3rd model in which only mechanistical nucleotide substitutions are considered is better than the SK-P0 model. The advantages of the SK-P1 model over the SKP0 model have been theoretically investigated in our previous study and graphically demonstrated for the 206 −289.08 −288.63 −285.05 −1977.12 −276.38 −267.31 −267.99 −1070.17 −1065.17 −1054.65 −7409.50 −999.52 −973.26 −977.36 −1790.69 −1784.98 −1767.42 −11379.30 −1683.94 −1639.58 −1643.94 −2831.03 −2808.72 −2780.79 −19508.40 −2688.78 −2638.96 −2641.14 −443.99 −438.29 −433.69 −2768.65 −385.84 −375.38 −375.78 −2208.82 −2179.66 −2159.00 −14696.50 −2086.34 −2025.27 −2023.87 −525.39 −524.52 −518.55 −3211.39 −473.49 −458.27 −460.04 −1571.05 −1563.62 −1547.12 −11832.90 −1478.55 −1444.37 −1447.55 −1250.11 −1236.58 −1223.87 −8337.56 −1162.47 −1138.88 −1142.48 −2379.08 −2360.42 −2333.39 −16670.50 −2189.37 −2161.77 −2166.12 −1062.11 −1055.10 −1043.27 −7578.14 −999.29 −982.86 −985.60 −1459.92 −1392.18 −1378.71 −9600.59 −1266.47 −1251.83 −1255.37 NADH4 NADH3 NADH2 NADH1 COX3 COX2 COX1 Gene Model HKY(64) (p = 1) HKY(64) /3rd (p = 2) HKY(61) /3rd (p = 2) SK-P0 (p = 0) SK-P1 (p = 1) SK-P2 (p = 2) GY94 (p = 2) TABLE 3. Maximum log-likelihood scores of the 7 different models for 12 mitochondrial genes from 33 humans NADH4L NADH5 CYTB ATP6 ATP8 SYSTEMATIC BIOLOGY VOL. 58 sequences listed in Table 1 (Seo and Kishino 2008). We have observed the same results for protein-coding sequences that are not divergent (Tables 2 and 3). The superiority of the SK-P1 model over the SK-P0 model can be generalized, and it is always possible to define a 61-dimensional SK-P1 model that is better than the given amino acid model (Seo and Kishino 2008). In Tables 2 and 3, the SK-P0 model shows very poor performance, and its log-likelihood scores are remarkably lower than those of the HKY(64) model. When sequences are not divergent, the SK-P0 model (equivalent amino acid model) is not applicable because of the small number of amino acid substitutions. Except for ATP8 in Table 1 and EPC2 in Table 2, the log-likelihood scores of the SK-P1 model are greater than those of the HKY(61) /3rd model, implying that the SK-P1 model is generally better than the HKY (61) /3rd model. Because the length of EPC2 is 191 codons, this exception may not be caused by the small length. The value of κ for EPC2 estimated using the HKY(61) /3rd model is 4.11. Thus, for EPC2, it appears that separating transitions and transversions is more realistic than incorporating the empirical information of the WAG + F (Whelan and Goldman 2001) amino acid model. Multiple nucleotide substitutions are allowed in the SK-P1 model during an infinitesimal time period, whereas only a single nucleotide substitution is allowed in the SK-P2 model. Further, transitions and transversions are distinguished in the SK-P2 model. A single transition and a single transversion during an infinitesimal time period appear to be consistent with the real evolutionary mechanism; this consistency explains the reason behind the log-likelihood scores of the SK-P2 model being greater than those of the SK-P1 models in all cases in Tables 1–3. In Table 1, the SK-P2 model is better than the GY94 model, except for the case of ATP8, which is due to the empirical information provided by the sai aj values. Although the GY94 and SKP2 models show overall superiority in Tables 2 and 3, respectively, the differences in log-likelihood scores are relatively small. Hence, we expect that these two models may be equally applicable for conserved protein-coding sequences. Advantages and Disadvantages of 3 Types of Models and Concluding Remarks Using our conversion procedures, it is possible to statistically compare 4-dimensional nucleotide models with 61-dimensional codon models. We have shown that there exists at least one 61-dimensional codon model that is better than the given nucleotide model. Using our previous procedure to convert a 20-dimensional amino acid model into a 61-dimensional codon model (Seo and Kishino 2008), we can find at least one 61-dimensional codon model that is better than the given amino acid model. Therefore, we note that a codon model is always the best model among evolutionary models. That is, we can always define at least one codon model that is better than the given nucleotide and amino acid models. 2009 SEO AND KISHINO—COMPARISON OF 3 TYPES OF EVOLUTIONARY MODELS Although codon models are in general better than the other 2 types of models, they have a disadvantage— they require intensive computation. When optimizing the free parameters of a rate matrix, we must recalculate the eigenvalues and eigenvectors of a 61-dimensional rate matrix for updated free parameters. Furthermore, codon models require intensive computation for the pruning algorithm (Felsenstein 1981) because of their high dimensions. In contrast, 4-dimensional nucleotide models require considerably less computation. The eigenvalues, eigenvectors, and transition probabilities are analytically available in many cases, and the pruning algorithm is quite fast. The computational advantage of 4-dimensional nucleotide models over the other 2 types of modes is more apparent when we consider the rate heterogeneity among sites (RHAS; Yang 1993). However, nucleotide models are expected to show poor performance, as represented in Tables 1–3. For divergent protein-coding sequences, we note that highly sophisticated nucleotide models that consider RHAS can be worse than the simple SK-P1 model that does not consider RHAS. For the 12 protein-coding sequences listed in Table 1, we have calculated the maximum log-likelihood scores by using PAML (Yang 2007) and the GTR(64) /3rd model with gamma distribution of RHAS, which is referred to as the GTR(64) /3rd+ model. Except NADH5 and ATP8, the other 10 genes show better performance by the application of the SKP1 or SK-P2 model as compared with the application of the GTR(64) /3rd+ model (data not shown). The behavior of ATP8 can be explained by the lack of resolution caused by its small sequence length, and the behavior of NADH5 can be explained by 2 factors: consideration of RHAS and inclusion of additional free parameters in the rate matrix of the GTR(64) /3rd+ model. Because these 2 factors require intensive computation, only one of them may be practically incorporated into codon models. As an example, we have calculated the maximum log-likelihood score of NADH5 using the GY94 model and RHAS (GY94 + ) using PAML (Yang 2007). We have observed that the GY94 + model performs much better than the GTR(64) /3rd+ model in the analysis of NADH5 (data not shown), which is considered to be due to the explicit exclusion of stop codons by using the GY94 + model. Amino acid models are intermediate between nucleotide models and codon models in terms of their computational load. The computational load for pruning algorithms is heavier than that of nucleotide models, but lighter than that of codon models. Further, amino acid models do not have free parameters in the rate matrix, and the recalculation of the eigenvalues and eigenvectors is not required. However, amino acid models are expected to show poor performance when the sequences are not divergent (Tables 2 and 3). When a large number of genes and taxa have to be analyzed and the computational load is an important issue, our new codon model, the SK-P1 model, can be a good alternative to nucleotide models and other 207 codon models. Although the computational load for the pruning algorithm is the same as that of a conventional 61-dimensional codon model, the eigenvalues and eigenvectors need not be recalculated for optimizing the ρ parameter. Further, a 61-dimensional transition probability matrix can be analytically derived from the 20-dimensional transition probability matrix of an amino acid model (Seo and Kishino 2008), drastically reducing the computational time. Furthermore, the SK-P1 model retains the most important merit of codon models—synonymous and nonsynonymous substitutions can be separated and compared to detect the deviation from neutral evolution. When a small number of genes and taxa have to be analyzed or the computational load is not a major concern, our SK-P2 model can be a good alternative to the GY94 model because it has more flexibility for incorporating the empirical information of amino acid models. In this paper, we have presented the theoretical foundation for the statistical comparison of nucleotide and codon models. In conjunction with our previous results (Seo and Kishino 2008), we conclude that codon models should be adopted for the evolutionary analysis of protein-coding sequences so long as the computational load is not a major concern. Further, with continuous improvements in computational performance in the postgenomic era, the development of realistic models of sequence evolution is becoming increasingly more important. Our conclusion on the superiority of codon models will provide a useful guide for future research on improving evolutionary models. F UNDING T.-K.S. was supported by JSPS (EYS B-18770216), and H.K. was supported by JSPS (SR B-16300086). A CKNOWLEDGMENTS We thank Jack Sullivan, Marc Suchard, David Posada, and an anonymous reviewer for providing valuable comments and Hidenori Nishihara for providing the sequence data. R EFERENCES Abascal F., Zardoya R., Posada D. 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 21:2104–2105. Abdo Z., Minin V.N., Joyce P., Sullivan J. 2005. Accounting for uncertainty in the tree topology has little effect on the decision-theoretic approach to model selection in phylogeny estimation. Mol. Biol. Evol. 22:691–703. Adachi J., Hasegawa M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42:459–468. Akaike H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716–723. Anderson F.E., Swofford D.L. 2004. Should we be worried about longbranch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol. Phylogenet. Evol. 33:440–451. Aris-Brosou S., Yang Z. 2002. Effects of models of rate evolution on estimation of divergence dates with special reference to the metazoan 18S ribosomal RNA phylogeny. Syst. Biol. 51:703–714. Bruno W.J., Halpern A.L. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16: 564–566. 208 VOL. 58 SYSTEMATIC BIOLOGY Buckley T.R. 2002. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst. Biol. 51:509–523. Cao Y., Adachi J., Janke A., Paabo S., Hasegawa M. 1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519–527. Chenna R., Sugawara H., Koike T., Lopez R., Gibson T.J., Higgins D.G., Thompson J.D. 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31:3497–3500. Dayhoff M.O., Schwartz R.M., Orcutt B.C. 1978. A model of evolutionary change in proteins. In: Dayhoff M.O., editor. Atlas of protein sequence and structure. Washington (DC): National Biomedical Research Foundation. p. 345–352. Doron-Faigenboim, A., Pupko T. 2007. A combined empirical and mechanistic codon model. Mol. Biol. Evol. 24:388–397. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376. Frati F., Simon C., Sullivan J., Swofford D.L. 1997. Evolution of the mitochondrial cytochrome oxidase II gene in Collembola. J. Mol. Evol. 44:145–158. Gaut B.S., Lewis P.O. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152–162. Goldman N., Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725– 736. Golub G.H., Van Loan C.F. 1996. Matrix computations (Johns Hopkins studies in mathematical sciences). 3rd ed. Baltimore (MD): Johns Hopkins University Press. Hasegawa M., Kishino, H., Yano T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160–174. Ingman M., Kaessmann H., Paabo S., Gyllensten U. 2000. Mitochondrial genome variation and the origin of modern humans. Nature 408:708–713. Jones D.T., Taylor W.R., Thornton J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282. Jukes T.H., Cantor C.R. 1969. Evolution of protein molecules. In: Munro H.N., editor. Mammalian protein metabolism. New York: Academic Press. pp. 21–132. Kimura M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120. Kosiol C., Holmes I., Goldman N. 2007. An empirical codon model for protein sequence evolution. Mol. Biol. Evol. 24:1464–1479. Minin V., Abdo Z., Joyce P., Sullivan J. 2003. Performance-based selection of likelihood models for phylogeny estimation. Syst. Biol. 52:674–683. Muse S.V., Gaut B.S. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724. Nielsen R., Yang Z. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics. 148:929–936. Nikaido M., Cao Y., Harada M., Okada N., Hasegawa M. 2003. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. 28:276–284. Nishihara H., Okada N., Hasegawa M. 2007. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol. 8:R199.1–R199.10. Posada D., Crandall K.A. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics. 14:817–818. Posada D., Crandall K.A. 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50:580–601. Robinson D.M., Jones D.T., Kishino H., Goldman N., Thorne J.L. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. 20:1692–1704. Saitou N., Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425. Schwarz G. 1974. Estimating the dimension of a model. Ann. Stat. 6:461–464. Seo T.-K., Kishino H. 2008. Synonymous substitutions substantially improve evolutionary inference from highly diverged proteins. Syst. Biol. 57:367–377. Seo T.-K., Kishino H., Thorne J.L. 2004. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol. Biol. Evol. 21:1201–1213. Shapiro B., Rambaut A., Drummond A.J. 2006. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol. Biol. Evol. 23:7–9. Stuart A., Ord K. 1996. Likelihood ratio tests and the general linear hypothesis. In: Kendall’s advanced theory of statistics. 5th ed. Vol. 2. London: Edward Arnold. Chap. 23. Suchard M.A., Weiss R.E., Sinsheimer J.S. 2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18:1001–1013. Sullivan J., Markert J.A., Kilpatrick C.W. 1997. Phylogeography and molecular systematics of the Peromyscus aztecus species group (Rodentia: Muridae) inferred using parsimony and likelihood. Syst. Biol. 46:426–440. Sullivan J., Swofford D.L. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mammal. Evol. 4:77–86. Tamura K., Nei M. 1993. Estimating of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512–526. Tavaré S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Lect. Math. Life Sci. 17:57–86. Thorne J.L., Kishino H., Painter I.S. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647– 1657. Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691–699. Whelan S., Goldman N. 2004. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics. 167:2027–2043. Yang Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401. Yang Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586–1591. Yang Z., Nielsen R. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409–418. Yang Z., Nielsen R. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908–917. Yang Z., Nielsen R., Goldman N., Pedersen A.K. 2000. Codonsubstitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449. Yang Z., Nielsen R., Hasegawa M. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600–1611. First submitted 7 June 2008; reviews returned 26 August 2008; final acceptance 5 January 2009 Associate Editor: Marc Suchard A PPENDIX Proof of Equation 9. The matrix R(64) of equation 8 can be separated into three matrices: R(64,1) , R(64,2) , and R(64,3) . For p (= 1, 2, 3), the components of R(64,p) are (64,p) Rrs ⎧ κπsp , ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨πs , transition occurs only at pth site transversion occurs only p , = at pth site ⎪ ⎪ ⎪ (64,p) ⎪ − h=r Rrh , r = s ⎪ ⎪ ⎩ 0, otherwise 2009 209 SEO AND KISHINO—COMPARISON OF 3 TYPES OF EVOLUTIONARY MODELS applying Ta,b [·] again is identical to the exponentiation of R(64,p) : Then, R(64) = R(64,1) + R(64,2) + R(64,3) . We found that the three matrices commute with each other. That is, for i and j (i = j and i, j ∈ {1, 2, 3}), R(64,i) R(64,j) = R(64,j) R(64,i) . This leads to (Golub and Van Loan, p. 577) exp{tR (64) } = exp{t{R (64,1) (64,1) = exp{tR (64,2) +R (64,3) +R (64,2) } exp{tR }} } exp{tR (64,3) Tab [exp{Tab [tR(64,p) ]}] = exp{tR(64,p) }. In a similar way, applying Ta(p) b(p) ◦ · · · ◦ Ta(p) b(p) [·] to 1 = Ta(p) b(p) ◦ · · · ◦ Ta(p) b(p) 1 }. 2 ⎛ n 2 n 0 ··· 0 0 .. . tR(4) ··· .. . .. 0 .. . 0 0 ··· ⎜ ⎜ ⎜ =⎜ ⎜ ⎝ tR(4) . n 1 1 1 n 1 ⎡⎛ (64,p) Prs (t) = tR(4) n 0 ··· 0 0 .. . P(4) (t) .. . ··· 0 .. . 0 0 ··· ⎢⎜ ⎢⎜ ⎢⎜ × ⎢⎜ ⎢⎜ ⎣⎝ where (15) n P(4) (t) (64,p) ⎟ ⎟ ⎟ ⎟, ⎟ ⎠ n 1 = Ta(p) b(p) ◦ · · · ◦ Ta(p) b(p) = [Prs ⎞ n × exp Ta(p) b(p) ◦ · · · ◦ Ta(p) b(p) [tR(64,p) ] Ta(p) b(p) ◦ Ta(p) b(p) ◦ · · · ◦ Ta(p) b(p) [tR(64,p) ] 1 n P(64,p) (t): = exp{tR(64,p) } To calculate exp{tR(64) }, we separately calculate exp{R(64,p) }(p = 1, 2, 3). For R(64,p) , we consider the operation of exchanging the ath and bth rows and columns, which is denoted by the matrix function Tab [·]. Tab [·] simply changes the order of codons and does not affect the validity of the rate matrix. We found that iteratively exchanging the rows and columns transforms R(64,p) into a diagonal block matrix. That is, for each p (= 1, 2, 3), we can find a set of integer pairs (p) (p) (p) (p) (p) (p) {(a1 , b1 ), (a2 b2 ), . . . , (an , bn )} such that 1 n 1 equation 16 is identical to the exponentiation of R(64,p) : .. . ⎞⎤ ⎟⎥ ⎟⎥ ⎟⎥ ⎟⎥ ⎟⎥ ⎠⎦ P(4) (t) (t)]64×64 , ⎧ (4) ⎪ Prp sp (t), ⎪ ⎪ ⎪ ⎨ (4) ⎪ Prp rp (t), ⎪ ⎪ ⎪ ⎩ 0, substitution occurs only at pth site r=s . otherwise where R(4) is the rate matrix of equation 7 and 0 is a 4×4 null matrix. The operator ◦ denotes the composition of functions, that is, Ta(p) b(p) ◦Ta(p) b(p) [·]: = Ta(p) b(p) [Ta(p) b(p) [·]]. Let us denote exp{tR(64,1) } exp{tR(64,2) } as P(64,1×2) (t). Then, the elements of P(64,1×2) (t) are The exponentiation of equation 15 can be obtained by the exponentiation of diagonal blocks: exp Ta(p) b(p) ◦ · · · ◦ Ta(p) b(p) [tR(64,p) ] P(64,1×2) (t) rs 1 1 2 1 n 1 ⎛ 2 1 2 1 0 ··· 0 0 .. . exp{tR(4) } ··· .. . .. 0 .. . 0 0 ··· ⎛ ⎜ ⎜ ⎜ =⎜ ⎜ ⎝ P(4) (t) 0 ··· 0 0 .. . P(4) (t) .. . ··· 0 .. . 0 0 = .. . ··· . ⎞ = ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ exp{tR(4) } P(64,1) (t)P(64,2) (t) rv vs ⎧ (4) (4) ⎪ Pr1 s1 (t)Pr2 s2 (t), ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ (4) (4) Pr1 r1 (t)Pr2 r2 (t), ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0, ⎪ ⎩ substitution occurs at either the 1st or 2nd site, or at both sites , r=s substitution occurs at the 3rd site The transition probability matrix derived from equation 8—P(64) (t)—can be obtained with exp{tR(64,1×2) } × exp{tR(64,3) }, whose elements are ⎞ ⎟ ⎟ ⎟ ⎟. ⎟ ⎠ 64 v=1 n exp{tR(4) } ⎜ ⎜ ⎜ =⎜ ⎜ ⎝ 2 (16) P(4) (t) Because Ta,b [·] simply changes the order of codons, applying Ta,b [·] to R(64,p) followed by exponentiation and P(64) rs (t) = 64 P(64,1×2) (t)P(64,3) (t) rv vs v=1 (4) (4) = P(4) r1 s1 (t)Pr2 s2 (t)Pr3 s3 (t). Proof of equation 11. We prove equation 11 for a phylogeny of 4 taxa using 1 codon site. The proof can 210 straightforwardly be extended to the case with more than 4 taxa and the case of n codons. Suppose that we observe 4 terminal codons p: = (p1 , p2 , p3 ), q: = (q1 , q2 , q3 ), r: = (r1 , r2 , r3 ), and s: = (s1 , s2 , s3 ), where pk ,qk ,rk , and sk (k = 1, 2, 3) are nucleotides. The unknown ancestral codon of p and q (r and s) is x: = (x1 , x2 , x3 ) (y: = (y1 , y2 , y3 )). The time duration from x to p (q) is t1 (t2 ) and the time duration from y to r (s) is t3 (t4 ). The time duration between x and y is t5 . Then, the likelihood with a DNA model is L (4) × = x1 =1 4 4 4 4 (4) (4) πx1 πx2 πx3 P(4) x1 p1 (t1 )Px2 p2 (t1 )Px3 p3 (t1 ) × 4 y1 =1 4 4 4 y1 =1 y2 =1 y3 =1 (4) (4) (4) P(4) x1 y1 (t5 )Px2 y2 (t5 )Px3 y3 (t5 )Py1 r1 (t3 ) (4) × P(4) y2 r2 (t3 )Py3 r3 (t3 ) (4) (4) (t )P (t )P (t ) × P(4) 4 4 4 y1 s1 y2 s2 y3 s3 (4) (4) P(4) x1 y1 (t5 )Py1 r1 (t3 )Py1 s1 (t4 ) 4 (4) (4) × πx2 P(4) (t )P (t ) P(4) x2 p2 2 x2 q2 2 x2 y2 (t5 )Py2 r2 (t3 )Py2 s2 (t4 ) x2 =1 y2 =1 y3 =1 (4) (4) × P(4) x1 q1 (t2 )Px2 q2 (t2 )Px3 q3 (t2 ) θ|p3 , q3 , r3 , s3 ) × L(4) (θ (4) πx1 P(4) x1 p1 (t1 )Px1 q1 (t2 ) 4 (4) (4) (4) πx3 P(4) P(4) x3 p3 (t1 )Px3 q3 (t2 ) x3 y3 (t5 )Py3 r3 (t3 )Py3 s3 (t4 ) x1 =1 x2 =1 x3 =1 θ|p, q, r, s) (θ 4 4 x3 =1 θ|p2 , q2 , r2 , s2 ) = L(4) (±bθ|p1 , q1 , r1 , s1 ) × L(4) (θ = VOL. 58 SYSTEMATIC BIOLOGY = 64 (64) πx P(64) xp (t1 )Pxq (t2 ) x=1 θ|p, q, r, s). = L(64) (θ 64 y=1 (64) (64) P(64) xy (t5 )Pyr (t3 )Pys (t4 )
© Copyright 2026 Paperzz