Syst. Biol. 58(2):199–210, 2009
Copyright © Society of Systematic Biologists
DOI:10.1093/sysbio/syp015
Advance Access publication on June 29, 2009
Statistical Comparison of Nucleotide, Amino Acid, and Codon Substitution Models for
Evolutionary Analysis of Protein-Coding Sequences
TAE-KUN SEO1,∗ AND HIROHISA KISHINO2
1 Professional Programme for Agricultural Bioinformatics;
2 Laboratory of Biometrics and Bioinformatics, Graduate School of Agricultural and Life Sciences,
University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo 113-8657, Japan;
∗ Correspondence to be sent to: Professional Programme for Agricultural Bioinformatics, Graduate School of Agricultural and Life Sciences,
University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo 113-8657, Japan; E-mail: [email protected].
Abstract.—Statistical models for the evolution of molecular sequences play an important role in the study of evolutionary processes. For the evolutionary analysis of protein-coding sequences, 3 types of evolutionary models are available:
1) nucleotide, 2) amino acid, and 3) codon substitution models. Selecting appropriate models can greatly improve the estimation of phylogenies and divergence times and the detection of positive selection. Although much attention has been paid
to the comparisons among the same types of models, relatively little attention has been paid to the comparisons among the
different types of models. Additionally, because such models have different data structures, comparison of those models
using conventional model selection criteria such as Akaike information criterion (AIC) or Bayesian information criterion
(BIC) is not straightforward. Here, we suggest new procedures to convert models of the above-mentioned 3 types to 64-dimensional models with nucleotide triplet substitution. These conversion procedures render it possible to statistically
compare the models of these 3 types by using AIC or BIC. By analyzing divergent and conserved interspecific mammalian
sequences and intraspecific human population data, we show the superiority of the codon substitution models and discuss
the advantages and disadvantages of the models of the 3 types. [AIC; amino acid model; BIC; codon model; likelihood ratio
test; model comparison; nucleotide model.]
The statistical models for the evolution of molecular
sequences are designed to approximate the simplified
forms of the complex aspects of evolutionary processes.
In these simplified forms, evolutionary processes are
summarized using a certain number of parameters. This
simplification enables us to easily understand evolutionary processes and identify the major driving forces
behind them. For the evolutionary analysis of nucleotide sequences, 3 types of evolutionary models are
used: 1) nucleotide, 2) amino acid, and 3) codon substitution models. Depending on the type of data, all
or some of these models are applicable. This study focuses on the analysis of protein-coding DNA sequences
in which all 3 types of models are applicable.
Nucleotide substitution models define the relative
occurrence of substitutions among the 4 nucleotides;
hence, we refer to them as “4-dimensional” substitution
models. Here, the “dimension” represents the number
of states and the number of rows and columns of the
transition rate matrix of the models. The simple Jukes–
Cantor model (Jukes and Cantor 1969) assumes the
equal occurrence of transitions and transversions and
equal nucleotide frequencies, whereas the improved
models relax either the former (Kimura 1980) or the latter assumptions (Felsenstein 1981). In addition, both assumptions can be relaxed simultaneously (Hasegawa et
al. 1985), and the type of transitions can be distinguished
(Tamura and Nei 1993). All these models are generalized by the general time-reversible model (GTR; Tavaré
1986; Yang 1994) in which the 4 nucleotide frequencies
and 12 possible types of substitutions can be different.
Amino acid substitution models such as JTT (Jones et al.
1992), Dayhoff (Dayhoff et al. 1978), mtREV (Adachi and
Hasegawa 1996), WAG (Whelan and Goldman 2001),
and so on, define the relative occurrence of substitutions
among the 20 amino acids; thus, we refer to them as “20-dimensional” substitution models. Database-derived
information for substitution patterns is obtained by using parsimony-based methods (e.g., Dayhoff et al. 1978;
Jones et al. 1992) or likelihood-based methods (e.g.,
Adachi and Hasegawa 1996; Whelan and Goldman
2001). In the simple “proportional amino acid model”
(Cao et al. 1994), all types of amino acid substitutions are
equally probable, and the mechanistic amino acid models that consider the mutational biases of nucleotides
(Yang et al. 1998) are also available. The amino acid frequencies can be derived either from a database or from
the analysis of sequence data. In case the latter technique
is used, the models have been found to be considerably
improved; the names of such models are assigned the
suffix “+F” (Cao et al. 1994). Codon substitution models define the relative occurrence of substitutions among
sense codons (Goldman and Yang 1994; Muse and Gaut
1994). For simplicity, we describe codon models using
the universal codon table; hence, we refer to them as
“61-dimensional” substitution models. However, the
specific number of dimensions of a codon model is not
crucial in our study, and our descriptions are equally
valid in codon models that use different codon tables.
Using the codon model developed by Goldman and
Yang (1994; GY94), the instantaneous transition rate
from codon i to codon j is defined as follows:
$$
Q^{(61)}_{ij} =
\begin{cases}
0, & \text{more than one DNA substitution} \\
\pi_j, & \text{synonymous transversion} \\
\kappa\pi_j, & \text{synonymous transition} \\
\omega\pi_j, & \text{nonsynonymous transversion} \\
\omega\kappa\pi_j, & \text{nonsynonymous transition}
\end{cases},
\tag{1}
$$
where πj and κ represent the frequency of codon j and
the relative occurrence of transitions with respect to
transversions, respectively, and ω represents the relative occurrence of nonsynonymous substitutions with
respect to synonymous substitutions and is used to detect deviation from neutral evolution. The superscript
“(61)” indicates that the transition is among 61 sense
codons. The codon model in equation 1 can be improved by allowing instantaneous multiple nucleotide
substitutions (Whelan and Goldman 2004) or by incorporating the physical properties of proteins (Robinson
et al. 2003) and the empirical substitution pattern derived from a database (Doron-Faigenboim and Pupko
2007; Kosiol et al. 2007; Seo and Kishino 2008).
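As an illustration of equation 1, the following is a minimal sketch of how the 61 × 61 GY94 rate matrix could be assembled. It assumes the universal genetic code, a TCAG base ordering, uniform codon frequencies, and diagonal entries set so that each row sums to zero; the function names and parameter values are hypothetical and chosen only for this example.

```python
import itertools
import numpy as np

BASES = "TCAG"
# Universal genetic code, codons enumerated in TCAG order ("*" marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))
SENSE = [c for c in CODONS if CODE[c] != "*"]  # the 61 sense codons
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (x in PURINES) == (y in PURINES)


def gy94_rate_matrix(kappa, omega, pi):
    """Off-diagonal rates follow equation 1; each diagonal is minus its row sum."""
    n = len(SENSE)
    Q = np.zeros((n, n))
    for i, ci in enumerate(SENSE):
        for j, cj in enumerate(SENSE):
            diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
            if len(diffs) != 1:
                continue  # same codon, or more than one DNA substitution
            rate = pi[j]
            if is_transition(*diffs[0]):
                rate *= kappa
            if CODE[ci] != CODE[cj]:
                rate *= omega  # nonsynonymous change
            Q[i, j] = rate
    Q[np.diag_indices(n)] = -Q.sum(axis=1)
    return Q


# Illustrative parameter values only (not estimates from the paper).
pi = np.full(len(SENSE), 1.0 / len(SENSE))
Q = gy94_rate_matrix(kappa=2.0, omega=0.5, pi=pi)
```

With these inputs, the rate from TTT to TTC (a synonymous transition) is κπj, whereas the rate from TTT to TTA (a nonsynonymous transversion) is ωπj.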
Because statistical models are “approximations” of
natural phenomena, all statistical models are fundamentally inaccurate. Thus, it is crucial to select
an appropriate model that minimizes the discrepancy
between the model and the true evolutionary process. It
has been noted that the selection of inappropriate models may result in the incorrect estimation of phylogenetic
relationships and the incorrect estimation of uncertainty
of tree topologies (e.g., Gaut and Lewis 1995; Sullivan
and Swofford 1997; Bruno and Halpern 1999; Buckley
2002; Anderson and Swofford 2004). In the Bayesian
framework, incorrect branch-length estimation due to
the adoption of inappropriate models may cause incorrect estimation of divergence time (Thorne et al. 1998)
or incorrect estimation of the trend in selective pressure
(Seo et al. 2004).
Typically, likelihood-based models are compared using both frequentist and Bayesian inference methods.
When one model is nested within another, that is, when
one model is a special case of the other model, the
likelihood ratio test (LRT; Stuart and Ord 1996) can
be adopted to test the significance of the advantages
of nesting models. For various nucleotide substitution
models, hierarchical LRT can be adopted to select the
set of appropriate models (e.g., Frati et al. 1997; Sullivan
et al. 1997; Posada and Crandall 2001). For models that
are not nesting or nested, LRT is not applicable, and the
Akaike information criterion (AIC; Akaike 1974) or the
Bayesian information criterion (BIC; Schwarz 1974) can
be adopted for model selection. For a given data set (X)
comprising n independent data items (X: = {x1 , · · · , xn })
and model i, the AIC and BIC are defined as
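$$
\mathrm{AIC}(i) := -2\,l(\hat{\boldsymbol{\theta}}_i \mid X) + 2K \tag{2}
$$
$$
\mathrm{BIC}(i) := -2\,l(\hat{\boldsymbol{\theta}}_i \mid X) + K \log n, \tag{3}
$$
where $\hat{\boldsymbol{\theta}}_i$ is the vector of the maximum likelihood estimates of model i, K is the dimension of $\hat{\boldsymbol{\theta}}_i$, and $l(\hat{\boldsymbol{\theta}}_i \mid X)$ is
the log-likelihood score for $\hat{\boldsymbol{\theta}}_i$ and X. The model whose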
AIC (or BIC) score is the minimum score is regarded
as the best model. Models can also be compared using
a decision-theoretic approach (Minin et al. 2003; Abdo
et al. 2005) and the Bayes factor (e.g., Suchard et al. 2001;
Aris-Brosou and Yang 2002). In these comparisons, appropriate loss functions and posterior distributions have
to be determined by using the likelihood calculated using substitution models.
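The two criteria in equations 2 and 3 can be sketched in a few lines. The function names, the model labels, and the log-likelihood values below are hypothetical placeholders, and the natural logarithm is assumed for the BIC penalty.

```python
import math


def aic(log_likelihood, k):
    """Equation 2: -2 * maximized log-likelihood + 2 * number of free parameters."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """Equation 3: -2 * maximized log-likelihood + K * log(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


def select_best(scores):
    """Return the model name with the minimum criterion score."""
    return min(scores, key=scores.get)


# Hypothetical fits: (maximized log-likelihood, number of free parameters).
fits = {"model_A": (-1250.0, 3), "model_B": (-1245.0, 10)}
n_items = 200  # number of independent data items, assumed for illustration

aic_scores = {name: aic(l, k) for name, (l, k) in fits.items()}
bic_scores = {name: bic(l, k, n_items) for name, (l, k) in fits.items()}
best_by_aic = select_best(aic_scores)
best_by_bic = select_best(bic_scores)
```

Both criteria are only meaningful when every candidate model is scored on the same data X, which is the difficulty addressed in the remainder of the paper.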
Although considerable attention has been paid to the
comparisons among the same types of models (e.g.,
Nielsen and Yang 1998; Posada and Crandall 1998; Yang
and Nielsen 1998, 2002; Yang et al. 2000; Abascal et al.
2005), relatively little attention has been paid to the comparisons among the different types of models (Shapiro
et al. 2006; Seo and Kishino 2008). The comparisons
among different types of models are difficult because
these models have different data structures, and hence,
the AIC or BIC cannot be applied in a straightforward manner in these comparisons. For calculating the
AIC (or BIC) scores for different models, the data X in
equation 2 (or equation 3) should be identical. Although
the entire data set of aligned sequences is identical,
individual data items are different for different types
of models. When applying nucleotide models, we separate each codon column into 3 nucleotide columns.
This “codon separation” yields different data X for
nucleotide and codon models, and the number of individual data items for nucleotide models becomes 3n.
For applying amino acid models, we translate codons
into amino acids. This “translation” also yields different
data X for amino acid and codon models, although the
number of individual data items is identical in this case.
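The three data structures described above can be made concrete with a toy alignment. This is only a sketch: the two sequences are invented, the universal genetic code is assumed, and the helper names are hypothetical.

```python
import itertools

BASES = "TCAG"
# Universal genetic code, codons enumerated in TCAG order ("*" marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(itertools.product(BASES, repeat=3), AMINO)}


def codon_columns(alignment):
    """Data items for codon models: n columns of nucleotide triplets."""
    length = len(alignment[0])
    return [tuple(seq[i:i + 3] for seq in alignment) for i in range(0, length, 3)]


def nucleotide_columns(alignment):
    """'Codon separation': 3n columns of single nucleotides."""
    return [tuple(col) for col in zip(*alignment)]


def amino_acid_columns(alignment):
    """'Translation': n columns of amino acids."""
    return [tuple(CODE[c] for c in col) for col in codon_columns(alignment)]


# Invented two-sequence alignment with n = 3 codon columns.
alignment = ["ATGGCTTTC", "ATGGCATTT"]
codons = codon_columns(alignment)        # 3 items
nucleotides = nucleotide_columns(alignment)  # 9 items
amino_acids = amino_acid_columns(alignment)  # 3 items
```

In this toy case the two sequences differ only by synonymous changes, so the translated columns are identical across sequences even though the codon and nucleotide columns are not. The three model types therefore see different data X.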
Recently, we have developed a new procedure to
statistically compare amino acid models with codon
models (Seo and Kishino 2008). By converting the 20-dimensional amino acid models into 61-dimensional
codon models and applying amino acid models, it is
possible to measure the log-likelihood scores corresponding to translations. Further, amino acid models
can be regarded as special cases of codon models, and
both types of models with the same 61-dimensional
data structure can be compared (see Theory). Although
amino acid models can be compared with codon models, the technique to compare these models with
4-dimensional nucleotide models is still unclear. A
comparison of the AIC scores calculated using the nucleotide and codon models has been attempted (Shapiro
et al. 2006); however, no theoretical justification has been
provided for the comparison. Here, we solve the problem of comparing nucleotide and codon models despite
their different data structures. We propose new procedures to convert both 4-dimensional nucleotide models
and 61-dimensional codon models into 64-dimensional
codon models. In these 64-dimensional codon models,
stop codons as well as sense codons are taken into consideration. Thus, 20-dimensional amino acid models
can be converted initially into 61-dimensional codon
models via our previous procedure (Seo and Kishino
2008) and then into 64-dimensional codon models using our new procedure. The resulting 64-dimensional
codon models are mathematically equivalent to 4-dimensional nucleotide, 20-dimensional amino acid,
and 61-dimensional codon models. Furthermore, due to
this mathematical equivalence, it is possible to obtain
identical data X in equations 2 and 3 and to compare the
3 types of models using the same 64-dimensional data
structure. We have found that, by converting 4-dimensional nucleotide models into 64-dimensional
codon models, it is always possible to define a codon model that is better than the given
nucleotide model. Similarly, by converting 20-dimensional amino acid models into 61-dimensional
codon models, it is always possible to define a codon model that is better than the given
amino acid model (Seo and Kishino 2008). These theoretical
findings exhibit the potential advantages and applications of codon models. We have applied our procedures
to the analysis of 1) highly divergent interspecific mammalian mitochondrial, 2) highly conserved interspecific
mammalian nuclear, and 3) polymorphic intraspecific
human mitochondrial protein–coding sequences. Using
empirical data analysis, we demonstrate and discuss the
advantages of codon models over the other 2 types of
models.
THEORY
To begin with, we provide an overview of our previous procedure (Seo and Kishino 2008) used to convert a
20-dimensional amino acid model into 61-dimensional
codon models and then explain the new procedures
used to convert nucleotide and codon models into 64-dimensional models. We demonstrate our conversion
procedures using the HKY (Hasegawa et al. 1985) and
GY94 (Goldman and Yang 1994) models, but these conversion procedures can be easily applied to any type of
nucleotide or codon model.
Converting 20-Dimensional Amino Acid Models into
61-Dimensional Codon Models: Overview
Let us consider an amino acid model in which the instantaneous transition rate is defined as
$$
R_{a_i a_j} =
\begin{cases}
s_{a_i a_j}\pi_{a_j}, & \text{if } a_i \neq a_j \\
-\sum_{a_k : a_k \neq a_i} R_{a_i a_k}, & \text{if } a_i = a_j
\end{cases},
\tag{4}
$$
where $\pi_{a_j}$ is the frequency of the amino acid $a_j$ and $s_{a_i a_j}$
represents the relative occurrence of substitutions from
$a_i$ to $a_j$ during an infinitesimal time period. The $s_{a_i a_j}$
values are typically estimated by parsimony (Dayhoff
et al. 1978; Jones et al. 1992) or likelihood (Adachi and
Hasegawa 1996; Whelan and Goldman 2001) methods.
From the amino acid model in equation 4, we derive a
61-dimensional SK-P1 model (Seo and Kishino 2008) in
which the instantaneous transition rate from codon i to
codon j is defined as
$$
R_{ij} =
\begin{cases}
s_{a_i a_j}\pi_j, & \text{nonsynonymous change} \\
\rho\pi_j, & \text{synonymous change}
\end{cases},
\tag{5}
$$
where $a_j$ is the amino acid that codon j encodes and
$\pi_j$ is the frequency of codon j. The $s_{a_i a_j}$ values are obtained from the amino acid model in equation 4, and
the free parameter ρ represents the relative occurrence
of synonymous substitutions with respect to $s_{a_i a_j}$. The
transition probabilities during time t in the SK-P1 model
in equation 5 can be analytically obtained using the
transition probabilities and eigenvalues of the amino
acid model in equation 4. When ρ = ∞ in equation
5, synonymous substitutions are completely saturated,
and the SK-P1 model is reduced to an “SK-P0” model,
which is mathematically equivalent to the amino acid
model in equation 4.
The empirical $s_{a_i a_j}$ values obtained from equation 4
can be used to improve the conventional codon model
in equation 1; the improved model is referred to as the
“SK-P2” model (Seo and Kishino 2008). For the SK-P2
model, the instantaneous transition rate from codon i to
codon j is defined as
$$
R_{ij} =
\begin{cases}
0, & \text{more than one DNA substitution} \\
\rho\pi_j, & \text{synonymous transversion} \\
\rho\kappa\pi_j, & \text{synonymous transition} \\
s_{a_i a_j}\pi_j, & \text{nonsynonymous transversion} \\
s_{a_i a_j}\kappa\pi_j, & \text{nonsynonymous transition}
\end{cases}.
\tag{6}
$$
When $s_{a_i a_j} \equiv 1$ and ρ = 1/ω, the SK-P2 model becomes
identical to the GY94 model. That is, the GY94 model is
an SK-P2 model in which the simple proportional amino
acid model ($s_{a_i a_j} \equiv 1$; Cao et al. 1994) is incorporated.
Because any amino acid model can be incorporated into
the SK-P2 model, the SK-P2 model is more flexible than
the GY94 model.
Converting 4-Dimensional HKY Model into
64-Dimensional Model
In the HKY model (Hasegawa et al. 1985), the instantaneous transition rate from nucleotide a to nucleotide b
is defined as follows:
$$
R^{(4)}_{ab} =
\begin{cases}
\kappa\pi_b, & \text{transition} \\
\pi_b, & \text{transversion}
\end{cases},
\tag{7}
$$
where $\pi_b$ represents the frequency of nucleotide b. We
use the superscript “(4)” to indicate that equation 7 defines the instantaneous transition rates among the 4 nucleotides. Let us denote the 4 × 4 rate matrix derived
from equation 7 as $R^{(4)}$. Then, the transition probability from nucleotide a to nucleotide b during time t is
given by the element in the ath row and bth column of
$P^{(4)}(t) := \exp\{tR^{(4)}\}$. For simplicity, we do not consider
the normalizing factors of rate matrices in our study.
Similar to the definition in equation 7, we define a
64-dimensional codon model in which the instantaneous transition rate from codon r to codon s is given as
follows:
$$
R^{(64)}_{rs} =
\begin{cases}
0, & \text{more than one DNA substitution} \\
\kappa\pi_{s_p}, & \text{transition occurs at } p\text{th site} \\
\pi_{s_p}, & \text{transversion occurs at } p\text{th site}
\end{cases},
\tag{8}
$$
where p ∈ {1, 2, 3} and $s_p$ is the nucleotide of codon s
at the pth site (refer Whelan and Goldman [2004] for a
similar conversion procedure within 61 dimensions).
Let us denote the 64 × 64 rate matrix derived from
equation 8 as $R^{(64)}$. Then, the transition probability from
codon r to codon s during time t is given by the element
in the rth row and sth column of $P^{(64)}(t) := \exp\{tR^{(64)}\}$.
We have found that the transition probabilities of
the $R^{(64)}$ model can be represented by multiplication
of the transition probabilities of the $R^{(4)}$ model at each
nucleotide site (see Appendix). That is,
$$
p^{(64)}_{rs}(t) = p^{(4)}_{r_1 s_1}(t) \cdot p^{(4)}_{r_2 s_2}(t) \cdot p^{(4)}_{r_3 s_3}(t).
\tag{9}
$$
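Equation 9 can be checked numerically. The sketch below builds the HKY rate matrix of equation 7, expands it into the 64 × 64 matrix of equation 8, and compares exp{tR(64)} with the triple Kronecker product of exp{tR(4)}. The base ordering, the parameter values, the helper names, and the small Taylor-series matrix exponential are all assumptions made for this illustration.

```python
import itertools
import numpy as np

BASES = "TCAG"
PURINES = {"A", "G"}


def expm(A, terms=30):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(A, ord=np.inf)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    B = A / (2.0 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ B / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def hky_rate_matrix(kappa, pi):
    """Equation 7, with diagonals set to minus the row sums."""
    R = np.zeros((4, 4))
    for a, x in enumerate(BASES):
        for b, y in enumerate(BASES):
            if a != b:
                transition = (x in PURINES) == (y in PURINES)
                R[a, b] = (kappa if transition else 1.0) * pi[b]
    R[np.diag_indices(4)] = -R.sum(axis=1)
    return R


def triplet_rate_matrix(R4):
    """Equation 8: only single-site changes have nonzero rates, including stop codons."""
    triplets = list(itertools.product(range(4), repeat=3))
    R = np.zeros((64, 64))
    for r, cr in enumerate(triplets):
        for s, cs in enumerate(triplets):
            sites = [p for p in range(3) if cr[p] != cs[p]]
            if len(sites) == 1:
                p = sites[0]
                R[r, s] = R4[cr[p], cs[p]]
    R[np.diag_indices(64)] = -R.sum(axis=1)
    return R


# Illustrative parameter values only.
pi = np.array([0.1, 0.2, 0.3, 0.4])
R4 = hky_rate_matrix(kappa=3.0, pi=pi)
R64 = triplet_rate_matrix(R4)
t = 0.7
P4 = expm(t * R4)
P64 = expm(t * R64)
P_product = np.kron(np.kron(P4, P4), P4)  # right-hand side of equation 9
max_abs_diff = np.max(np.abs(P64 - P_product))
```

The agreement follows because the 64 × 64 matrix built this way is the Kronecker sum of three copies of the 4 × 4 matrix, whose exponential factorizes into a Kronecker product.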
Further, we have found that the likelihood of the phylogeny obtained using the $R^{(64)}$ model is identical to
that obtained using the $R^{(4)}$ model. Let us denote the
protein-coding sequence data as $S := s_1 s_2 \cdots s_{3n}$, where
$s_i$ is the ith column of nucleotide alignment and n is the
number of codon columns. Let $\boldsymbol{\theta}$ denote the unknown
parameters of the rate matrix and the time duration
along the branches of phylogeny. Then, for the ith column of codon alignment, we can show the following
relationship (see Appendix):
$$
L^{(64)}(\boldsymbol{\theta} \mid s_{3i-2} s_{3i-1} s_{3i}) =
L^{(4)}(\boldsymbol{\theta} \mid s_{3i-2}) \cdot L^{(4)}(\boldsymbol{\theta} \mid s_{3i-1}) \cdot L^{(4)}(\boldsymbol{\theta} \mid s_{3i}),
\tag{10}
$$
where $L^{(64)}(\boldsymbol{\theta} \mid \cdot)$ and $L^{(4)}(\boldsymbol{\theta} \mid \cdot)$ represent the likelihoods
calculated using the $R^{(64)}$ and $R^{(4)}$ models, respectively.
Using equation 10, we obtain the following relationship:
$$
\begin{aligned}
L^{(64)}(\boldsymbol{\theta} \mid S)
&= \prod_{i=1}^{n} L^{(64)}(\boldsymbol{\theta} \mid s_{3i-2} s_{3i-1} s_{3i}) \\
&= \prod_{i=1}^{n} \{ L^{(4)}(\boldsymbol{\theta} \mid s_{3i-2}) \cdot L^{(4)}(\boldsymbol{\theta} \mid s_{3i-1}) \cdot L^{(4)}(\boldsymbol{\theta} \mid s_{3i}) \} \\
&= \prod_{i=1}^{3n} L^{(4)}(\boldsymbol{\theta} \mid s_i) = L^{(4)}(\boldsymbol{\theta} \mid S).
\end{aligned}
\tag{11}
$$
That is, the likelihood calculated using the nucleotide
model $R^{(4)}$ is identical to that calculated using the
codon model $R^{(64)}$. This implies that we can regard a
4-dimensional nucleotide model as a 64-dimensional
codon model in which the nucleotides at the first-,
second-, and third-codon sites are grouped together
as a single unit in the evolutionary process.
The instantaneous transition rate given by equation 8
can be nonzero even when r and/or s are stop codons.
This allows nucleotide substitutions to occur unrestrictedly and independently at the 3 codon sites; this
independence is consistent with the factorization of
equations 9 and 10. Because of the independence among
the 3 codon sites in the $R^{(64)}$ model, the stationary
frequency $\pi_r$ is given by $\pi_{r_1}\pi_{r_2}\pi_{r_3}$, even when r is a
stop codon. The nonzero instantaneous transition rates
when a stop codon is (or stop codons are) involved and
the nonzero frequencies of stop codons are the 2 main
causes of the poor performance of nucleotide models
(see Results and Discussion).
In general, the third-codon sites in protein-coding
sequences evolve faster than the first- or second-codon
sites because most nucleotide substitutions at the third-codon sites are synonymous. Thus, typically, an additional free parameter is assigned to the third-codon sites
to allow a high evolutionary rate in the analysis using 4-dimensional nucleotide models. This type of nucleotide
model can also be converted into a 64-dimensional
codon model. More generally, we can assign different
evolutionary rates for the 3 codon sites. In equation 8,
if we multiply α, β, and γ with the instantaneous rates
during nucleotide substitutions at the first-, second-,
and third-codon sites, respectively, then equation 9
changes into
$$
p^{(64)}_{rs}(t) = p^{(4)}_{r_1 s_1}(\alpha t) \cdot p^{(4)}_{r_2 s_2}(\beta t) \cdot p^{(4)}_{r_3 s_3}(\gamma t),
$$
which can be easily proven by modifying the proof in
the Appendix. By setting α = β = 1 and γ as a free
parameter, we can define an $R^{(64)}$ model equivalent to
an $R^{(4)}$ model in which the third-codon sites are fast-evolving sites. Furthermore, by assuming that α, β, and
γ follow the gamma distribution, we can convert the
nucleotide model with gamma rate heterogeneity (Yang
1993) into a 64-dimensional model (see Results and
Discussion).
Converting a 61-Dimensional Codon Model into a
64-Dimensional Codon Model
This conversion procedure is rather straightforward.
For a given 61-dimensional codon model, we consider
a 64-dimensional codon model in which the instantaneous transition rate from codon r to codon s is
$$
Q^{(64)}_{rs} =
\begin{cases}
Q^{(61)}_{rs}, & \text{if } r \text{ and } s \text{ are sense codons} \\
0, & \text{otherwise}
\end{cases},
\tag{12}
$$
where $Q^{(61)}_{rs}$ is the instantaneous transition rate from
codon r to codon s in the given 61-dimensional codon
model, as in equations 1, 5, and 6. Then, the 64 × 64 rate
matrix derived from equation 12 is represented as
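$$
Q^{(64)} =
\begin{pmatrix}
Q^{(61)} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0}^{T} & 0 & 0 & 0 \\
\mathbf{0}^{T} & 0 & 0 & 0 \\
\mathbf{0}^{T} & 0 & 0 & 0
\end{pmatrix},
\tag{13}
$$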
where 0 represents a (61 × 1) zero vector and the last
three rows and columns correspond to the positions of
stop codons. Because Q(64) is a block diagonal matrix,
the transition probability matrix during time t is obtained by the exponentiation of the diagonal blocks as
follows:
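$$
\exp\{tQ^{(64)}\} =
\begin{pmatrix}
\exp\{tQ^{(61)}\} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0}^{T} & 1 & 0 & 0 \\
\mathbf{0}^{T} & 0 & 1 & 0 \\
\mathbf{0}^{T} & 0 & 0 & 1
\end{pmatrix}.
\tag{14}
$$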
The Q(64) model has 2 important features. First, the
transition probabilities among sense codons obtained
using the Q(64) model are identical to those obtained using the Q(61) model. Second, stop codons are explicitly
considered, but transitions from a sense codon (or from
a stop codon) to a stop codon (or to a sense codon) never
occur, and stop codons, if observed in protein-coding
sequences, remain constant. In general, stop codons are
not observed in protein-coding sequences, and we can
adopt the sense codon frequencies of the Q(61) model
for the Q(64) model. Then, the likelihoods of the Q(61)
and Q(64) models are identical, and the Q(61) model can
be regarded as the Q(64) model. In terms of the likelihood scores, there is no change in the Q(64) model by
explicitly considering stop codons.
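The block structure in equations 13 and 14 can be illustrated with a short numerical sketch. A random 61 × 61 rate matrix stands in for an actual codon model here, the stop codons are placed in the last three positions as in equation 13, and the helper names and the Taylor-series matrix exponential are assumptions of this example.

```python
import numpy as np


def expm(A, terms=30):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(A, ord=np.inf)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    B = A / (2.0 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ B / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def embed_61_in_64(Q61):
    """Equation 13: pad the sense-codon rate matrix with zero rows and columns."""
    Q64 = np.zeros((64, 64))
    Q64[:61, :61] = Q61
    return Q64


# Random placeholder rate matrix (not a fitted codon model).
rng = np.random.default_rng(0)
Q61 = rng.random((61, 61))
np.fill_diagonal(Q61, 0.0)
np.fill_diagonal(Q61, -Q61.sum(axis=1))  # rows sum to zero

t = 0.3
P61 = expm(t * Q61)
P64 = expm(t * embed_61_in_64(Q61))
```

The sense-codon block of `P64` reproduces `P61`, the stop-codon block is the 3 × 3 identity, and the off-diagonal blocks are zero, which is why the likelihoods of the two models coincide when no stop codons are observed.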
Statistical Comparison of 3 Types of Models
Using our conversion procedures, we have theoretically shown that 4-dimensional nucleotide models and
61-dimensional codon models can be regarded as 64-dimensional codon models, and their likelihoods can
be directly compared. In our previous study (Seo and
Kishino 2008), we showed that the likelihoods of 20-dimensional amino acid models and 61-dimensional
codon models are not directly comparable. Because the
likelihood corresponding to translations is explicitly
considered in the SK-P0 model, amino acid models are
comparable to codon models only through SK-P0 models. Therefore, the likelihoods of nucleotide, SK-P0, and
codon models are comparable, and the corresponding
AIC or BIC scores can be adopted to determine the best
model among the 3 types.
Our procedure for the conversion of a 4-dimensional
nucleotide model into a 64-dimensional codon model
clearly implies that the BIC scores should be calculated
carefully within the type of nucleotide models. For the
sequence data of n aligned codon columns, the number of samples in the simple nucleotide models is 3n;
in these models, the 3 codon sites are assumed to be independently and identically distributed. Therefore, instead of equation 3, the following equation should be
adopted for calculating the BIC scores:
$$
\mathrm{BIC}(i) := -2\,l(\hat{\boldsymbol{\theta}}_i \mid X) + K \log\{3n\}.
$$
However, when the 3 codon sites are assumed to evolve
differently, the 3 consecutive nucleotide sites should
be grouped together and regarded as a single unit of
sequence evolution. Therefore, equation 3 should be
adopted for calculating the BIC scores even for the comparison of nucleotide models.
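The choice of sample size described above can be written as a small helper. This is a sketch only: the function name, the boolean flag, and the numbers are hypothetical, and the natural logarithm is assumed.

```python
import math


def bic_nucleotide(log_likelihood, k, n_codon_columns, sites_differ):
    """BIC for a nucleotide model fitted to n aligned codon columns.

    If the 3 codon sites are treated as identically distributed, each
    nucleotide column is one data item (3n items in total). If the sites
    evolve differently, each triplet is one data item (n items).
    """
    n_items = n_codon_columns if sites_differ else 3 * n_codon_columns
    return -2.0 * log_likelihood + k * math.log(n_items)


# Hypothetical values for illustration.
simple = bic_nucleotide(-1000.0, k=1, n_codon_columns=100, sites_differ=False)
site_specific = bic_nucleotide(-1000.0, k=1, n_codon_columns=100, sites_differ=True)
penalty_gap = simple - site_specific  # equals k * log(3)
```

For the same log-likelihood and K, the two conventions differ by K log 3, so the distinction matters most for parameter-rich models.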
RESULTS AND DISCUSSION
Sequence Data
For the comparisons of the 3 types of models, we analyzed 3 sequence data sets comprising 1) fast-evolving
interspecific, 2) slowly evolving interspecific, and
3) highly polymorphic intraspecific protein-coding
sequences.
The first data set consisted of 12 mitochondrial genes
obtained from 69 mammalian species (Nikaido et al.
2003). The data sequences were downloaded from the
GenBank sequence database (for accession number, see
Nikaido et al. 2003), aligned using ClustalW (Chenna
et al. 2003) with the default settings, and manually
edited. The log-likelihood scores for various models
were calculated using “Tree 6” estimated by Nikaido
et al. (2003).
For the second data set, we selected the 10 most
slowly evolving genes from the 2789 nuclear genes of 10
mammals that were analyzed by Nishihara et al. (2007)
to investigate the phylogenetic relationship among the
Boreoeutheria, Xenarthra, and Afrotheria groups. Using
the phylogenies estimated by Nishihara et al. (2007),
we calculated the likelihood scores of 10 genes using
various models.
The third data set consisted of 12 mitochondrial genes
of humans. We excluded individuals from the same location and selected 33 human mitochondrial genomes
from the data set obtained by Ingman et al. (2000) for
the investigation of human origin. The mitochondrial
genomes were separated into 12 protein-coding sequences, which were aligned using ClustalW (Chenna
et al. 2003) with the default settings and then manually edited. The log-likelihood scores for various models were calculated using the neighbor-joining (Saitou
and Nei 1987) tree topology estimated by Ingman et al.
(2000).
In the analysis of the 3 data sets, we have not considered the uncertainty in the tree topologies and have
assumed that the adopted topologies represent the true
phylogenetic relationships. However, we expect that our
results are robust for different tree topologies because
model selection has been known to be rarely affected by
tree topologies if the adopted topologies are not considerably different from the true topologies (Sullivan and
Swofford 1997; Posada and Crandall 2001; Abdo et al.
2005).
Three Types of Models and Their Conversion to
64-Dimensional Models
DNA model and conversion.—Although our conversion
procedure is generally applicable for any type of nucleotide model, we demonstrate it here using the HKY
model (Hasegawa et al. 1985). To explicitly denote that
equation 8 is derived from the HKY model in equation
7, we will refer to the 64-dimensional model in equation
8 as the HKY(64) model in our analysis. We can make
the HKY(64) model more realistic by assigning an additional free parameter in the rate matrix given by
equation 8 to the substitutions at the third-codon
sites; we refer to this new model as the HKY(64)/3rd
model. Further, the HKY(64)/3rd model can be made
more realistic by assigning zero instantaneous transition rates when stop codons are involved; we refer
to this new model as the HKY(61)/3rd model. While
there exist 4-dimensional nucleotide models equivalent
to the HKY(64) and HKY(64)/3rd models, there exist
no nucleotide models equivalent to the HKY(61)/3rd
model. Therefore, the HKY(61)/3rd model should be
categorized as a 61-dimensional codon model. The
HKY(61)/3rd model is a 61-dimensional codon model
that is closest to the HKY model.
Amino acid models and their conversion.—Different amino
acid models were adopted for different data sets. For
the mammalian and human mitochondrial genes, we
adopted the mtREV + F (Adachi and Hasegawa 1996)
amino acid model. For the mammalian nuclear genes,
we adopted the WAG + F (Whelan and Goldman 2001)
amino acid model. These amino acid models were converted into SK-P0 and SK-P1 models, respectively.
Codon models and their conversion.—As we noted, the 61-
(or 60-) dimensional codon models can be regarded as
64-dimensional codon models. For a statistical comparison of the 3 types of models, we considered 3
low-dimensional codon models: 1) GY94 (Goldman
and Yang 1994), 2) SK-P1 (equation 5), and 3) SK-P2
(equation 6) models. The sai aj values of the mtREV
(Adachi and Hasegawa 1996) and WAG (Whelan and
Goldman 2001) models were incorporated into the
SK-P1 and SK-P2 models for the analysis of the mitochondrial and nuclear protein–coding sequences,
respectively.
TABLE 1. Maximum log-likelihood scores of the 7 different models for 12 mitochondrial genes from 69 mammalian species
Gene / Model | HKY(64) (p = 1) | HKY(64)/3rd (p = 2) | HKY(61)/3rd (p = 2) | SK-P0 (p = 0) | SK-P1 (p = 1) | SK-P2 (p = 2) | GY94 (p = 2)
−9141.1 | −8952.0 | −8812.9 | −9430.7 | −8838.7 | −8931.2 | −8887.5
ATP6
−26825.4 | −24276.1 | −24120.3 | −23801.3 | −22612.8 | −22118.3 | −22437.4
CYTB
NADH4L
−12159.7 | −11235.6 | −11164.1 | −10795.0 | −10392.6 | −10154.6 | −10342.8
NADH4
−57511.1 | −52951.2 | −52437.4 | −51378.4 | −49401.6 | −48520.2 | −49213.5
NADH5
−14010.6 | −12981.8 | −12863.0 | −12337.8 | −11867.5 | −11675.2 | −11953.6
NADH3
NADH2
−46635.7 | −44023.7 | −43565.5 | −41987.8 | −40517.9 | −40368.1 | −41067.9
−35967.5 | −32514.8 | −32264.3 | −31088.0 | −29645.3 | −29074.5 | −29768.1
NADH1
COX3
−27950.8 | −24375.7 | −24157.2 | −23518.7 | −22217.1 | −21690.8 | −21995.5
COX2
−24415.9 | −21217.2 | −21018.4 | −20566.4 | −19285.8 | −18675.4 | −18869.3
COX1
−53522.1 | −44378.9 | −44077.4 | −42549.5 | −39670.9 | −38678.4 | −38934.3
−77578.4 | −72380.9 | −71711.4 | −70434.8 | −67707.4 | −66941.1 | −67861.4
−42043.7 | −37944.8 | −37636.8 | −36112.2 | −34286.2 | −33725.8 | −34399.1
ATP8
Notes. The number of parameters in the rate matrix is represented by p.
Comparing Log-Likelihoods of 7 Different Models
Because our selection of 7 different models and 3 data
sets is rather arbitrary, some results of model comparison might be highly specific for the compared models
and the analyzed data. However, these results are very
useful in elucidating the advantages, disadvantages,
and performance of the 3 types of models. Other results
of the model comparison are so fundamental and general that they will be robust for selections of candidate
models and sequence data. These results are discussed
in detail in this section.
Tables 1–3 show the log-likelihood scores of the 7
different models for the 1) interspecific mammalian mitochondrial, 2) interspecific mammalian nuclear, and
3) intraspecific human mitochondrial protein–coding
sequences, respectively. The number of free parameters
in the rate matrices varies between 0 and 2. This range
is considerably smaller than the range of differences
between the log-likelihood scores of the 7 different
models. For example, the longest gene in Table 1 is
NADH5, whose length is 616 codons including gaps.
The difference in K log n between the models with K = 2
and K = 0 is 12.8; this difference is relatively smaller
than the difference in the log-likelihood scores and has
negligible effect on the AIC and BIC scores. Therefore,
we list the log-likelihood scores in Tables 1–3 instead of
the AIC or BIC scores.
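As a quick check of the figure quoted above, assuming natural logarithms and n = 616 codon columns for NADH5, the difference in the BIC penalty between K = 2 and K = 0, and the corresponding difference in the AIC penalty 2K, work out as:

```latex
\[
(2 - 0)\,\log 616 \approx 2 \times 6.423 \approx 12.8,
\qquad
2 \cdot 2 - 2 \cdot 0 = 4 .
\]
```

Both penalties are small next to log-likelihood differences of hundreds or thousands of units, which is the sense in which the ranking by log-likelihood and the ranking by AIC or BIC agree for these data.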
The consistency in the results in Tables 1–3 indicates
that codon models generally show better performances
than amino acid (SK-P0) and nucleotide models (HKY(64)
and HKY(64)/3rd). The SK-P2 and GY94 models are always better than the other models, and the HKY(61)/3rd
model is always better than the HKY(64) and HKY(64)/3rd
models.
TABLE 2. Maximum log-likelihood scores of the 7 different models for the 10 most slowly evolving nuclear genes from 10 mammalian species
Gene / Model | HKY(64) (p = 1) | HKY(64)/3rd (p = 2) | HKY(61)/3rd (p = 2) | SK-P0 (p = 0) | SK-P1 (p = 1) | SK-P2 (p = 2) | GY94 (p = 2)
−488.55 | −487.76 | −481.75 | −876.50 | −475.94 | −453.47 | −453.46
−1288.26 | −1267.10 | −1255.54 | −2742.18 | −1272.55 | −1206.40 | −1202.34
−784.17 | −741.95 | −735.79 | −1455.67 | −690.13 | −672.10 | −674.28
−531.10 | −491.03 | −488.16 | −637.61 | −439.69 | −421.17 | −421.32
−373.66 | −361.33 | −357.49 | −667.73 | −330.04 | −318.00 | −317.88
−535.16 | −529.42 | −525.85 | −1202.75 | −507.91 | −493.96 | −497.53
−784.37 | −761.00 | −755.01 | −1651.47 | −719.05 | −706.83 | −707.40
−1117.39 | −1098.63 | −1087.09 | −2741.88 | −1047.34 | −1031.61 | −1029.38
−408.00 | −401.84 | −397.57 | −886.46 | −371.79 | −362.78 | −361.77
−543.33 | −529.28 | −525.82 | −1225.90 | −513.75 | −491.12 | −490.71
FLJ20309
PPP1R12A
C10ORF30
LHX9
FOXP2
MEF2C
MLR2
HOXD10
EPC2
GBA
In all cases, in Tables 1–3, the log-likelihood scores
of the HKY(64)/3rd model are greater than those of
the HKY(64) model. This observation is consistent with
the fact that the third-codon sites are mostly synonymous sites, and synonymous substitutions generally
occur more often than nonsynonymous substitutions.
The advantages of the HKY(61)/3rd model over the
HKY(64)/3rd model can be explained by the effect
of stop codons. As we noted in Theory, nonzero frequencies of the stop codons and nonzero instantaneous transition rates when a stop codon is (or stop
codons are) involved are explicitly assumed in the
HKY(64)/3rd model. These unrealistic assumptions for
protein-coding sequences are intentionally excluded
in the HKY(61)/3rd model, which leads to the improvement of the log-likelihood scores. This provides
strong evidence of the superiority of codon models,
and the superiority of codon models can be generalized for different nucleotide models. For any given
4-dimensional nucleotide model, we can always define
a 61-dimensional codon model in which the above-mentioned unrealistic assumptions are excluded. That
is, there exists at least one codon model that is better
than the given nucleotide model. Similarly, consider
the GTR(61)/3rd, GTR(64)/3rd, and GTR(64) models,
which are derived from 4-dimensional GTR models
(Tavaré 1986; Yang 1994). Like the HKY(61)/3rd model,
the GTR(61)/3rd model explicitly excludes stop codons
and is expected to be better than the GTR(64)/3rd and
GTR(64) models. Because there are 4 additional free
parameters in the rate matrix, the GTR(61)/3rd model
requires more computation than the HKY(61)/3rd model
and may not be practical for the analysis of a large number of genes and taxa. However, our fundamental and
general conclusion will still hold in this case; it is always
possible to define a codon model that might be computationally impractical but will be better than the given
nucleotide model.
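The step from a 64-state triplet model to a 61-state model that excludes stop codons can be illustrated with a small sketch. It is not the paper's exact HKY(61)/3rd definition, which is given in Theory and also involves a separate third-position rate. It only assumes HKY-style rates (κπ for transitions, π for transversions, single-nucleotide changes only) and, as a labeled assumption, the universal-code stop codons TAA, TAG, and TGA. The function name `triplet_rate_matrix` is hypothetical.

```python
import itertools
import numpy as np

NUC = "TCAG"
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}
STOPS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code

def triplet_rate_matrix(pi, kappa, exclude_stops):
    """HKY-style rate matrix on nucleotide triplets (one change at a time).

    With exclude_stops=True, stop codons are removed from the state space,
    so no rate into or out of a stop codon is ever defined.
    """
    codons = ["".join(c) for c in itertools.product(NUC, repeat=3)]
    if exclude_stops:
        codons = [c for c in codons if c not in STOPS]
    n = len(codons)
    R = np.zeros((n, n))
    for i, r in enumerate(codons):
        for j, s in enumerate(codons):
            diff = [p for p in range(3) if r[p] != s[p]]
            if len(diff) != 1:
                continue  # identical codons or multiple changes: rate 0
            p = diff[0]
            rate = pi[NUC.index(s[p])]
            if (r[p], s[p]) in TRANSITIONS:
                rate *= kappa
            R[i, j] = rate
        R[i, i] = -R[i].sum()  # rows of a rate matrix sum to zero
    return codons, R

pi = np.array([0.3, 0.2, 0.35, 0.15])  # illustrative frequencies for T, C, A, G
codons64, R64 = triplet_rate_matrix(pi, kappa=4.0, exclude_stops=False)
codons61, R61 = triplet_rate_matrix(pi, kappa=4.0, exclude_stops=True)
print(R64.shape, R61.shape)
```

The same construction applies to any 4-dimensional nucleotide model by replacing the HKY rates with those of the chosen model.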
In Table 1, the log-likelihood scores of the SK-P0
model are generally greater than those of the HKY(61)/3rd
model, except in the case of ATP8, whose length is 70
codons including gaps. Due to its short sequence length,
ATP8 may not provide sufficient resolution for model
comparison. Divergent protein-coding sequences contain a reasonable number of amino acid substitutions,
and the substitution pattern will be consistent with
the $s_{a_i a_j}$ values in equation 4. In Table 1, the additional
information provided by the $s_{a_i a_j}$ values explains the
superiority of the SK-P0 model over the HKY(61)/3rd
model. However, when the sequences are not divergent
(Tables 2 and 3), the observed pattern of amino acid
substitutions will be different from the empirical $s_{a_i a_j}$
values. In this case, the HKY(61)/3rd model, in which
only mechanistic nucleotide substitutions are considered, is better than the SK-P0 model.
The advantages of the SK-P1 model over the SK-P0
model have been theoretically investigated in our
previous study and graphically demonstrated for the
sequences listed in Table 1 (Seo and Kishino 2008). We
have observed the same results for protein-coding sequences that are not divergent (Tables 2 and 3). The
superiority of the SK-P1 model over the SK-P0 model
can be generalized, and it is always possible to define
a 61-dimensional SK-P1 model that is better than the
given amino acid model (Seo and Kishino 2008). In
Tables 2 and 3, the SK-P0 model shows very poor performance, and its log-likelihood scores are remarkably
lower than those of the HKY(64) model. When sequences
are not divergent, the SK-P0 model (equivalent amino
acid model) is not applicable because of the small number of amino acid substitutions.

TABLE 3. Maximum log-likelihood scores of the 7 different models for 12 mitochondrial genes from 33 humans
Genes (labels as printed; not aligned to the rows below): NADH4, NADH3, NADH2, NADH1, COX3, COX2, COX1, NADH4L, NADH5, CYTB, ATP6, ATP8
Each row gives the scores for one gene.

HKY(64)     HKY(64)/3rd  HKY(61)/3rd  SK-P0        SK-P1       SK-P2       GY94
(p = 1)     (p = 2)      (p = 2)      (p = 0)      (p = 1)     (p = 2)     (p = 2)
−289.08     −288.63      −285.05      −1977.12     −276.38     −267.31     −267.99
−1070.17    −1065.17     −1054.65     −7409.50     −999.52     −973.26     −977.36
−1790.69    −1784.98     −1767.42     −11379.30    −1683.94    −1639.58    −1643.94
−2831.03    −2808.72     −2780.79     −19508.40    −2688.78    −2638.96    −2641.14
−443.99     −438.29      −433.69      −2768.65     −385.84     −375.38     −375.78
−2208.82    −2179.66     −2159.00     −14696.50    −2086.34    −2025.27    −2023.87
−525.39     −524.52      −518.55      −3211.39     −473.49     −458.27     −460.04
−1571.05    −1563.62     −1547.12     −11832.90    −1478.55    −1444.37    −1447.55
−1250.11    −1236.58     −1223.87     −8337.56     −1162.47    −1138.88    −1142.48
−2379.08    −2360.42     −2333.39     −16670.50    −2189.37    −2161.77    −2166.12
−1062.11    −1055.10     −1043.27     −7578.14     −999.29     −982.86     −985.60
−1459.92    −1392.18     −1378.71     −9600.59     −1266.47    −1251.83    −1255.37
Except for ATP8 in Table 1 and EPC2 in Table 2, the
log-likelihood scores of the SK-P1 model are greater
than those of the HKY(61)/3rd model, implying that the
SK-P1 model is generally better than the HKY(61)/3rd
model. Because the length of EPC2 is 191 codons, this
exception may not be attributable to short sequence length. The
value of κ for EPC2 estimated using the HKY(61)/3rd
model is 4.11. Thus, for EPC2, it appears that separating
transitions and transversions is more realistic than incorporating the empirical information of the WAG + F
(Whelan and Goldman 2001) amino acid model.
Multiple nucleotide substitutions are allowed in
the SK-P1 model during an infinitesimal time period,
whereas only a single nucleotide substitution is allowed
in the SK-P2 model. Further, transitions and transversions are distinguished in the SK-P2 model. A single
transition or a single transversion during an infinitesimal time period appears to be consistent with the real
evolutionary mechanism; this consistency explains
why the log-likelihood scores of the SK-P2
model are greater than those of the SK-P1 model
in all cases in Tables 1–3. In Table 1, the SK-P2 model
is better than the GY94 model, except for the case of
ATP8, which is due to the empirical information provided by the $s_{a_i a_j}$ values. Although the GY94 and SK-P2
models show overall superiority in Tables 2 and 3,
respectively, the differences in log-likelihood scores are
relatively small. Hence, we expect that these two models
may be equally applicable for conserved protein-coding
sequences.
Advantages and Disadvantages of 3 Types of Models and
Concluding Remarks
Using our conversion procedures, it is possible to
statistically compare 4-dimensional nucleotide models with 61-dimensional codon models. We have shown
that there exists at least one 61-dimensional codon model
that is better than the given nucleotide model. Using our
previous procedure to convert a 20-dimensional amino
acid model into a 61-dimensional codon model (Seo and
Kishino 2008), we can find at least one 61-dimensional
codon model that is better than the given amino acid
model. Therefore, we note that a codon model is always
the best model among evolutionary models. That is, we
can always define at least one codon model that is better
than the given nucleotide and amino acid models.
Although codon models are in general better than the
other 2 types of models, they have a disadvantage—
they require intensive computation. When optimizing
the free parameters of a rate matrix, we must recalculate
the eigenvalues and eigenvectors of a 61-dimensional
rate matrix for updated free parameters. Furthermore,
codon models require intensive computation for the
pruning algorithm (Felsenstein 1981) because of their
high dimensions. In contrast, 4-dimensional nucleotide
models require considerably less computation. The
eigenvalues, eigenvectors, and transition probabilities are analytically available in many cases, and the
pruning algorithm is quite fast. The computational advantage of 4-dimensional nucleotide models over the
other 2 types of models is more apparent when we
consider the rate heterogeneity among sites (RHAS;
Yang 1993). However, nucleotide models are expected
to show poor performance, as represented in Tables
1–3. For divergent protein-coding sequences, we note
that highly sophisticated nucleotide models that consider RHAS can be worse than the simple SK-P1 model
that does not consider RHAS. For the 12 protein-coding
sequences listed in Table 1, we have calculated the maximum log-likelihood scores by using PAML (Yang 2007)
and the GTR(64)/3rd model with gamma distribution
of RHAS, which is referred to as the GTR(64)/3rd+Γ
model. Except for NADH5 and ATP8, the other 10 genes
show better performance by the application of the SK-P1
or SK-P2 model as compared with the application of
the GTR(64)/3rd+Γ model (data not shown). The behavior of ATP8 can be explained by the lack of resolution
caused by its short sequence length, and the behavior of
NADH5 can be explained by 2 factors: consideration of
RHAS and inclusion of additional free parameters in the
rate matrix of the GTR(64)/3rd+Γ model. Because these
2 factors require intensive computation, only one of
them may be practically incorporated into codon models. As an example, we have calculated the maximum
log-likelihood score of NADH5 using the GY94 model
and RHAS (GY94+Γ) using PAML (Yang 2007). We
have observed that the GY94+Γ model performs much
better than the GTR(64)/3rd+Γ model in the analysis
of NADH5 (data not shown), which is considered to be
due to the explicit exclusion of stop codons by using the
GY94+Γ model.
Amino acid models are intermediate between nucleotide models and codon models in terms of their
computational load. The computational load for pruning algorithms is heavier than that of nucleotide models,
but lighter than that of codon models. Further, amino
acid models do not have free parameters in the rate matrix, and the recalculation of the eigenvalues and eigenvectors is not required. However, amino acid models
are expected to show poor performance when the sequences are not divergent (Tables 2 and 3).
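The ordering of computational loads described above can be made concrete with a rough operation count. The sketch assumes that the dominant cost of the pruning algorithm, per branch and per site, is a dense matrix-vector product, so that it grows with the square of the number of states; this scaling is a standard property of the algorithm rather than a figure reported in the paper, and the function name is illustrative.

```python
def relative_pruning_cost(n_states, baseline_states=4):
    """Cost of one dense matrix-vector product relative to a 4-state model."""
    return (n_states ** 2) / (baseline_states ** 2)

costs = {
    name: relative_pruning_cost(n_states)
    for name, n_states in [("nucleotide", 4), ("amino acid", 20), ("codon", 61)]
}
for name, cost in costs.items():
    print(f"{name}: {cost:.1f}x")
```

Under this assumption an amino acid model costs about 25 times as much per site as a nucleotide model and a 61-state codon model roughly 230 times as much, although a codon site covers three nucleotide sites.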
When a large number of genes and taxa have to be
analyzed and the computational load is an important
issue, our new codon model, the SK-P1 model, can
be a good alternative to nucleotide models and other
codon models. Although the computational load for
the pruning algorithm is the same as that of a conventional 61-dimensional codon model, the eigenvalues
and eigenvectors need not be recalculated for optimizing the ρ parameter. Further, a 61-dimensional transition probability matrix can be analytically derived
from the 20-dimensional transition probability matrix
of an amino acid model (Seo and Kishino 2008), drastically reducing the computational time. Furthermore,
the SK-P1 model retains the most important merit of
codon models—synonymous and nonsynonymous substitutions can be separated and compared to detect the
deviation from neutral evolution. When a small number
of genes and taxa have to be analyzed or the computational load is not a major concern, our SK-P2 model
can be a good alternative to the GY94 model because
it has more flexibility for incorporating the empirical
information of amino acid models.
In this paper, we have presented the theoretical foundation for the statistical comparison of nucleotide and
codon models. In conjunction with our previous results
(Seo and Kishino 2008), we conclude that codon models should be adopted for the evolutionary analysis of
protein-coding sequences so long as the computational
load is not a major concern. Further, with continuous
improvements in computational performance in the
postgenomic era, the development of realistic models
of sequence evolution is becoming increasingly more
important. Our conclusion on the superiority of codon
models will provide a useful guide for future research
on improving evolutionary models.
FUNDING
T.-K.S. was supported by JSPS (EYS B-18770216), and
H.K. was supported by JSPS (SR B-16300086).
ACKNOWLEDGMENTS
We thank Jack Sullivan, Marc Suchard, David Posada,
and an anonymous reviewer for providing valuable
comments and Hidenori Nishihara for providing the
sequence data.
REFERENCES
Abascal F., Zardoya R., Posada D. 2005. ProtTest: selection of best-fit
models of protein evolution. Bioinformatics. 21:2104–2105.
Abdo Z., Minin V.N., Joyce P., Sullivan J. 2005. Accounting for uncertainty in the tree topology has little effect on the decision-theoretic
approach to model selection in phylogeny estimation. Mol. Biol.
Evol. 22:691–703.
Adachi J., Hasegawa M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42:459–468.
Akaike H. 1974. A new look at the statistical model identification. IEEE
Trans. Autom. Contr. 19:716–723.
Anderson F.E., Swofford D.L. 2004. Should we be worried about long-branch attraction in real data sets? Investigations using metazoan
18S rDNA. Mol. Phylogenet. Evol. 33:440–451.
Aris-Brosou S., Yang Z. 2002. Effects of models of rate evolution on estimation of divergence dates with special reference to the metazoan
18S ribosomal RNA phylogeny. Syst. Biol. 51:703–714.
Bruno W.J., Halpern A.L. 1999. Topological bias and inconsistency
of maximum likelihood using wrong models. Mol. Biol. Evol. 16:
564–566.
Buckley T.R. 2002. Model misspecification and probabilistic tests of
topology: evidence from empirical data sets. Syst. Biol. 51:509–523.
Cao Y., Adachi J., Janke A., Paabo S., Hasegawa M. 1994. Phylogenetic
relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a
single gene. J. Mol. Evol. 39:519–527.
Chenna R., Sugawara H., Koike T., Lopez R., Gibson T.J., Higgins D.G.,
Thompson J.D. 2003. Multiple sequence alignment with the Clustal
series of programs. Nucleic Acids Res. 31:3497–3500.
Dayhoff M.O., Schwartz R.M., Orcutt B.C. 1978. A model of evolutionary change in proteins. In: Dayhoff M.O., editor. Atlas of protein
sequence and structure. Washington (DC): National Biomedical
Research Foundation. p. 345–352.
Doron-Faigenboim A., Pupko T. 2007. A combined empirical and
mechanistic codon model. Mol. Biol. Evol. 24:388–397.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
Frati F., Simon C., Sullivan J., Swofford D.L. 1997. Evolution of the
mitochondrial cytochrome oxidase II gene in Collembola. J. Mol.
Evol. 44:145–158.
Gaut B.S., Lewis P.O. 1995. Success of maximum likelihood phylogeny
inference in the four-taxon case. Mol. Biol. Evol. 12:152–162.
Goldman N., Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–
736.
Golub G.H., Van Loan C.F. 1996. Matrix computations (Johns Hopkins
studies in mathematical sciences). 3rd ed. Baltimore (MD): Johns
Hopkins University Press.
Hasegawa M., Kishino H., Yano T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:
160–174.
Ingman M., Kaessmann H., Paabo S., Gyllensten U. 2000. Mitochondrial genome variation and the origin of modern humans. Nature
408:708–713.
Jones D.T., Taylor W.R., Thornton J.M. 1992. The rapid generation
of mutation data matrices from protein sequences. Comput. Appl.
Biosci. 8:275–282.
Jukes T.H., Cantor C.R. 1969. Evolution of protein molecules. In:
Munro H.N., editor. Mammalian protein metabolism. New York:
Academic Press. p. 21–132.
Kimura M. 1980. A simple method for estimating evolutionary rate
of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
Kosiol C., Holmes I., Goldman N. 2007. An empirical codon model for
protein sequence evolution. Mol. Biol. Evol. 24:1464–1479.
Minin V., Abdo Z., Joyce P., Sullivan J. 2003. Performance-based selection of likelihood models for phylogeny estimation. Syst. Biol.
52:674–683.
Muse S.V., Gaut B.S. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with
application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.
Nielsen R., Yang Z. 1998. Likelihood models for detecting positively
selected amino acid sites and applications to the HIV-1 envelope
gene. Genetics. 148:929–936.
Nikaido M., Cao Y., Harada M., Okada N., Hasegawa M. 2003. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. 28:276–284.
Nishihara H., Okada N., Hasegawa M. 2007. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol.
8:R199.1–R199.10.
Posada D., Crandall K.A. 1998. Modeltest: testing the model of DNA
substitution. Bioinformatics. 14:817–818.
Posada D., Crandall K.A. 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50:580–601.
Robinson D.M., Jones D.T., Kishino H., Goldman N., Thorne J.L. 2003.
Protein evolution with dependence among codons due to tertiary
structure. Mol. Biol. Evol. 20:1692–1704.
Saitou N., Nei M. 1987. The neighbor-joining method: a new method
for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Schwarz G. 1974. Estimating the dimension of a model. Ann. Stat.
6:461–464.
Seo T.-K., Kishino H. 2008. Synonymous substitutions substantially
improve evolutionary inference from highly diverged proteins.
Syst. Biol. 57:367–377.
Seo T.-K., Kishino H., Thorne J.L. 2004. Estimating absolute rates of
synonymous and nonsynonymous nucleotide substitution in order
to characterize natural selection and date species divergences. Mol.
Biol. Evol. 21:1201–1213.
Shapiro B., Rambaut A., Drummond A.J. 2006. Choosing appropriate
substitution models for the phylogenetic analysis of protein-coding
sequences. Mol. Biol. Evol. 23:7–9.
Stuart A., Ord K. 1996. Likelihood ratio tests and the general linear
hypothesis. In: Kendall’s advanced theory of statistics. 5th ed. Vol.
2. London: Edward Arnold. Chap. 23.
Suchard M.A., Weiss R.E., Sinsheimer J.S. 2001. Bayesian selection
of continuous-time Markov chain evolutionary models. Mol. Biol.
Evol. 18:1001–1013.
Sullivan J., Markert J.A., Kilpatrick C.W. 1997. Phylogeography and
molecular systematics of the Peromyscus aztecus species group
(Rodentia: Muridae) inferred using parsimony and likelihood. Syst.
Biol. 46:426–440.
Sullivan J., Swofford D.L. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mammal.
Evol. 4:77–86.
Tamura K., Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans
and chimpanzees. Mol. Biol. Evol. 10:512–526.
Tavaré S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Lect. Math. Life Sci. 17:57–86.
Thorne J.L., Kishino H., Painter I.S. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647–
1657.
Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a
maximum-likelihood approach. Mol. Biol. Evol. 18:691–699.
Whelan S., Goldman N. 2004. Estimating the frequency of events that
cause multiple-nucleotide changes. Genetics. 167:2027–2043.
Yang Z. 1993. Maximum likelihood estimation of phylogeny from
DNA sequences when substitution rates differ over sites. Mol. Biol.
Evol. 10:1396–1401.
Yang Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol.
Evol. 39:105–111.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood.
Mol. Biol. Evol. 24:1586–1591.
Yang Z., Nielsen R. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409–418.
Yang Z., Nielsen R. 2002. Codon-substitution models for detecting
molecular adaptation at individual sites along specific lineages.
Mol. Biol. Evol. 19:908–917.
Yang Z., Nielsen R., Goldman N., Pedersen A.K. 2000. Codon-substitution models for heterogeneous selection pressure at amino
acid sites. Genetics 155:431–449.
Yang Z., Nielsen R., Hasegawa M. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol.
Biol. Evol. 15:1600–1611.
First submitted 7 June 2008; reviews returned 26 August 2008;
final acceptance 5 January 2009
Associate Editor: Marc Suchard
APPENDIX
Proof of Equation 9. The matrix $R^{(64)}$ of equation 8 can be
separated into three matrices: $R^{(64,1)}$, $R^{(64,2)}$, and $R^{(64,3)}$.
For $p$ (= 1, 2, 3), the components of $R^{(64,p)}$ are

\[
R^{(64,p)}_{rs} =
\begin{cases}
\kappa \pi_{s_p}, & \text{transition occurs only at the $p$th site} \\
\pi_{s_p}, & \text{transversion occurs only at the $p$th site} \\
-\sum_{h \neq r} R^{(64,p)}_{rh}, & r = s \\
0, & \text{otherwise.}
\end{cases}
\]

Then,

\[
R^{(64)} = R^{(64,1)} + R^{(64,2)} + R^{(64,3)}.
\]

We found that the three matrices commute with each
other. That is, for $i$ and $j$ ($i \neq j$ and $i, j \in \{1, 2, 3\}$),
$R^{(64,i)} R^{(64,j)} = R^{(64,j)} R^{(64,i)}$. This leads to (Golub and
Van Loan, p. 577)

\[
\exp\{tR^{(64)}\} = \exp\{t\{R^{(64,1)} + R^{(64,2)} + R^{(64,3)}\}\}
= \exp\{tR^{(64,1)}\} \exp\{tR^{(64,2)}\} \exp\{tR^{(64,3)}\}.
\]

To calculate $\exp\{tR^{(64)}\}$, we separately calculate
$\exp\{tR^{(64,p)}\}$ ($p$ = 1, 2, 3). For $R^{(64,p)}$, we consider the
operation of exchanging the $a$th and $b$th rows and
columns, which is denoted by the matrix function $T_{ab}[\cdot]$.
$T_{ab}[\cdot]$ simply changes the order of codons and does not
affect the validity of the rate matrix. We found that
iteratively exchanging the rows and columns transforms $R^{(64,p)}$ into a diagonal block matrix. That is, for
each $p$ (= 1, 2, 3), we can find a set of integer pairs
$\{(a^{(p)}_1, b^{(p)}_1), (a^{(p)}_2, b^{(p)}_2), \ldots, (a^{(p)}_n, b^{(p)}_n)\}$ such that

\[
T_{a^{(p)}_1 b^{(p)}_1} \circ T_{a^{(p)}_2 b^{(p)}_2} \circ \cdots \circ T_{a^{(p)}_n b^{(p)}_n} [tR^{(64,p)}]
=
\begin{pmatrix}
tR^{(4)} & 0 & \cdots & 0 \\
0 & tR^{(4)} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & tR^{(4)}
\end{pmatrix},
\tag{15}
\]

where $R^{(4)}$ is the rate matrix of equation 7 and 0 is a 4×4
null matrix. The operator $\circ$ denotes the composition of
functions, that is, $T_{a^{(p)}_1 b^{(p)}_1} \circ T_{a^{(p)}_2 b^{(p)}_2}[\cdot] := T_{a^{(p)}_1 b^{(p)}_1}[T_{a^{(p)}_2 b^{(p)}_2}[\cdot]]$.
The exponentiation of equation 15 can be obtained by
the exponentiation of diagonal blocks:

\[
\exp\left\{ T_{a^{(p)}_1 b^{(p)}_1} \circ \cdots \circ T_{a^{(p)}_n b^{(p)}_n} [tR^{(64,p)}] \right\}
=
\begin{pmatrix}
\exp\{tR^{(4)}\} & 0 & \cdots & 0 \\
0 & \exp\{tR^{(4)}\} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \exp\{tR^{(4)}\}
\end{pmatrix}
=
\begin{pmatrix}
P^{(4)}(t) & 0 & \cdots & 0 \\
0 & P^{(4)}(t) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P^{(4)}(t)
\end{pmatrix}.
\tag{16}
\]

Because $T_{ab}[\cdot]$ simply changes the order of codons, applying $T_{ab}[\cdot]$ to $R^{(64,p)}$ followed by exponentiation and
applying $T_{ab}[\cdot]$ again is identical to the exponentiation
of $R^{(64,p)}$:

\[
T_{ab}[\exp\{T_{ab}[tR^{(64,p)}]\}] = \exp\{tR^{(64,p)}\}.
\]

In a similar way, applying $T_{a^{(p)}_n b^{(p)}_n} \circ \cdots \circ T_{a^{(p)}_1 b^{(p)}_1}[\cdot]$ to
equation 16 is identical to the exponentiation of $R^{(64,p)}$:

\[
\begin{aligned}
P^{(64,p)}(t) &:= \exp\{tR^{(64,p)}\} \\
&= T_{a^{(p)}_n b^{(p)}_n} \circ \cdots \circ T_{a^{(p)}_1 b^{(p)}_1}
\left[ \exp\left\{ T_{a^{(p)}_1 b^{(p)}_1} \circ \cdots \circ T_{a^{(p)}_n b^{(p)}_n} [tR^{(64,p)}] \right\} \right] \\
&= T_{a^{(p)}_n b^{(p)}_n} \circ \cdots \circ T_{a^{(p)}_1 b^{(p)}_1}
\left[
\begin{pmatrix}
P^{(4)}(t) & 0 & \cdots & 0 \\
0 & P^{(4)}(t) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P^{(4)}(t)
\end{pmatrix}
\right] \\
&= [P^{(64,p)}_{rs}(t)]_{64 \times 64},
\end{aligned}
\]

where

\[
P^{(64,p)}_{rs}(t) =
\begin{cases}
P^{(4)}_{r_p s_p}(t), & \text{substitution occurs only at the $p$th site} \\
P^{(4)}_{r_p r_p}(t), & r = s \\
0, & \text{otherwise.}
\end{cases}
\]

Let us denote $\exp\{tR^{(64,1)}\} \exp\{tR^{(64,2)}\}$ as $P^{(64,1\times2)}(t)$.
Then, the elements of $P^{(64,1\times2)}(t)$ are

\[
P^{(64,1\times2)}_{rs}(t) = \sum_{v=1}^{64} P^{(64,1)}_{rv}(t) P^{(64,2)}_{vs}(t)
=
\begin{cases}
P^{(4)}_{r_1 s_1}(t) P^{(4)}_{r_2 s_2}(t), & \text{substitution occurs at either the 1st or 2nd site, or at both sites} \\
P^{(4)}_{r_1 r_1}(t) P^{(4)}_{r_2 r_2}(t), & r = s \\
0, & \text{substitution occurs at the 3rd site.}
\end{cases}
\]

The transition probability matrix derived from equation
8, $P^{(64)}(t)$, can be obtained with $\exp\{tR^{(64,1\times2)}\} \times \exp\{tR^{(64,3)}\}$, whose elements are

\[
P^{(64)}_{rs}(t) = \sum_{v=1}^{64} P^{(64,1\times2)}_{rv}(t) P^{(64,3)}_{vs}(t)
= P^{(4)}_{r_1 s_1}(t) P^{(4)}_{r_2 s_2}(t) P^{(4)}_{r_3 s_3}(t).
\]
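The factorization proved above can also be checked numerically. The following sketch is an illustration under stated assumptions, not the authors' implementation. It indexes codons as 16·r1 + 4·r2 + r3, so that each $R^{(64,p)}$ is a Kronecker product of $R^{(4)}$ with identity matrices; it uses arbitrary HKY parameters; and it computes matrix exponentials with a simple scaling-and-squaring Taylor series written for this example. With this ordering $R^{(64,3)}$ is already block diagonal, so no row and column exchanges are needed.

```python
import numpy as np

def hky_rate_matrix(pi, kappa):
    """4x4 HKY rate matrix, nucleotide order T, C, A, G."""
    R = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                is_transition = (i // 2) == (j // 2)  # T<->C or A<->G
                R[i, j] = (kappa if is_transition else 1.0) * pi[j]
        R[i, i] = -R[i].sum()
    return R

def expm(A, terms=30, squarings=10):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    B = A / (2 ** squarings)
    term = np.eye(len(A))
    result = np.eye(len(A))
    for k in range(1, terms):
        term = term @ B / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

pi = np.array([0.3, 0.2, 0.35, 0.15])  # illustrative values
kappa, t = 4.0, 0.7                    # illustrative values
R4, I4 = hky_rate_matrix(pi, kappa), np.eye(4)

# R^(64,p): substitutions allowed only at the p-th codon position.
R1 = np.kron(np.kron(R4, I4), I4)
R2 = np.kron(np.kron(I4, R4), I4)
R3 = np.kron(np.kron(I4, I4), R4)

commute = (np.allclose(R1 @ R2, R2 @ R1)
           and np.allclose(R1 @ R3, R3 @ R1)
           and np.allclose(R2 @ R3, R3 @ R2))

P4 = expm(t * R4)
P64 = expm(t * (R1 + R2 + R3))
factorized = np.kron(np.kron(P4, P4), P4)  # entries P4[r1,s1]*P4[r2,s2]*P4[r3,s3]

print(commute, np.allclose(P64, factorized))
```

Both checks are expected to hold up to floating-point error for any valid choice of frequencies, κ, and t.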
Proof of equation 11. We prove equation 11 for a phylogeny of 4 taxa using 1 codon site. The proof can
straightforwardly be extended to the case with more
than 4 taxa and the case of n codons. Suppose that
we observe 4 terminal codons $p := (p_1, p_2, p_3)$, $q := (q_1, q_2, q_3)$,
$r := (r_1, r_2, r_3)$, and $s := (s_1, s_2, s_3)$, where
$p_k$, $q_k$, $r_k$, and $s_k$ ($k$ = 1, 2, 3) are nucleotides. The unknown
ancestral codon of $p$ and $q$ ($r$ and $s$) is $x := (x_1, x_2, x_3)$
($y := (y_1, y_2, y_3)$). The time duration from $x$ to $p$ ($q$) is $t_1$
($t_2$) and the time duration from $y$ to $r$ ($s$) is $t_3$ ($t_4$). The
time duration between $x$ and $y$ is $t_5$. Then, the likelihood
with a DNA model is

\[
\begin{aligned}
L^{(4)}(\boldsymbol{\theta} \mid p, q, r, s)
&= L^{(4)}(\boldsymbol{\theta} \mid p_1, q_1, r_1, s_1) \times L^{(4)}(\boldsymbol{\theta} \mid p_2, q_2, r_2, s_2) \times L^{(4)}(\boldsymbol{\theta} \mid p_3, q_3, r_3, s_3) \\
&= \sum_{x_1=1}^{4} \pi_{x_1} P^{(4)}_{x_1 p_1}(t_1) P^{(4)}_{x_1 q_1}(t_2) \sum_{y_1=1}^{4} P^{(4)}_{x_1 y_1}(t_5) P^{(4)}_{y_1 r_1}(t_3) P^{(4)}_{y_1 s_1}(t_4) \\
&\quad \times \sum_{x_2=1}^{4} \pi_{x_2} P^{(4)}_{x_2 p_2}(t_1) P^{(4)}_{x_2 q_2}(t_2) \sum_{y_2=1}^{4} P^{(4)}_{x_2 y_2}(t_5) P^{(4)}_{y_2 r_2}(t_3) P^{(4)}_{y_2 s_2}(t_4) \\
&\quad \times \sum_{x_3=1}^{4} \pi_{x_3} P^{(4)}_{x_3 p_3}(t_1) P^{(4)}_{x_3 q_3}(t_2) \sum_{y_3=1}^{4} P^{(4)}_{x_3 y_3}(t_5) P^{(4)}_{y_3 r_3}(t_3) P^{(4)}_{y_3 s_3}(t_4) \\
&= \sum_{x_1=1}^{4} \sum_{x_2=1}^{4} \sum_{x_3=1}^{4} \pi_{x_1} \pi_{x_2} \pi_{x_3}
P^{(4)}_{x_1 p_1}(t_1) P^{(4)}_{x_2 p_2}(t_1) P^{(4)}_{x_3 p_3}(t_1) \\
&\quad \times P^{(4)}_{x_1 q_1}(t_2) P^{(4)}_{x_2 q_2}(t_2) P^{(4)}_{x_3 q_3}(t_2) \\
&\quad \times \sum_{y_1=1}^{4} \sum_{y_2=1}^{4} \sum_{y_3=1}^{4}
P^{(4)}_{x_1 y_1}(t_5) P^{(4)}_{x_2 y_2}(t_5) P^{(4)}_{x_3 y_3}(t_5) P^{(4)}_{y_1 r_1}(t_3) \\
&\quad \times P^{(4)}_{y_2 r_2}(t_3) P^{(4)}_{y_3 r_3}(t_3) \\
&\quad \times P^{(4)}_{y_1 s_1}(t_4) P^{(4)}_{y_2 s_2}(t_4) P^{(4)}_{y_3 s_3}(t_4) \\
&= \sum_{x=1}^{64} \pi_x P^{(64)}_{xp}(t_1) P^{(64)}_{xq}(t_2) \sum_{y=1}^{64} P^{(64)}_{xy}(t_5) P^{(64)}_{yr}(t_3) P^{(64)}_{ys}(t_4) \\
&= L^{(64)}(\boldsymbol{\theta} \mid p, q, r, s).
\end{aligned}
\]
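The equality of the two likelihoods can be illustrated numerically for the 4-taxon tree used in the proof. The sketch below is a minimal check under stated assumptions: the HKY parameters, branch lengths, and tip codons are arbitrary illustrative values; codons are indexed as 16·n1 + 4·n2 + n3; the 64-state frequencies and transition probabilities are taken as Kronecker products of their 4-state counterparts; and the helper names are hypothetical.

```python
import math
import numpy as np

def hky_rate_matrix(pi, kappa):
    """4x4 HKY rate matrix, nucleotide order T, C, A, G."""
    R = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                R[i, j] = (kappa if i // 2 == j // 2 else 1.0) * pi[j]
        R[i, i] = -R[i].sum()
    return R

def expm(A, terms=30, squarings=10):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    B = A / (2 ** squarings)
    term, result = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ B / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

def four_taxon_likelihood(pi, P, tips, times):
    """Sum over ancestral states x and y for the tree ((p,q)x,(r,s)y)."""
    p, q, r, s = tips
    t1, t2, t3, t4, t5 = times
    at_x = pi * P(t1)[:, p] * P(t2)[:, q]  # indexed by x
    at_y = P(t3)[:, r] * P(t4)[:, s]       # indexed by y
    return at_x @ P(t5) @ at_y

pi4 = np.array([0.3, 0.2, 0.35, 0.15])     # illustrative values
R4 = hky_rate_matrix(pi4, kappa=4.0)
P4 = lambda t: expm(t * R4)
P64 = lambda t: np.kron(np.kron(P4(t), P4(t)), P4(t))
pi64 = np.kron(np.kron(pi4, pi4), pi4)

# Nucleotides (p_k, q_k, r_k, s_k) at codon positions k = 1, 2, 3.
tips_by_position = [(0, 1, 0, 0), (2, 2, 3, 2), (1, 1, 1, 0)]
times = (0.1, 0.2, 0.15, 0.3, 0.05)

lik_nucleotide = math.prod(
    four_taxon_likelihood(pi4, P4, tips, times) for tips in tips_by_position
)
tips_codon = tuple(
    16 * tips_by_position[0][k] + 4 * tips_by_position[1][k] + tips_by_position[2][k]
    for k in range(4)
)
lik_triplet = four_taxon_likelihood(pi64, P64, tips_codon, times)

print(lik_nucleotide, lik_triplet)
```

The two printed values should agree up to floating-point error, which is the content of equation 11 for this small tree.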