
Bayesian Analysis of Elapsed Times in Continuous-Time
Markov Chains
Marco A. R. Ferreira1 , Marc A. Suchard2,3,4
1 Department of Statistics, University of Missouri at Columbia, USA
2 Department of Biomathematics and 3 Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, USA
4 Department of Biostatistics, School of Public Health, University of California, Los Angeles, USA
Summary. We explore Bayesian analysis for continuous-time Markov chain (CTMC) models
based on a conditional reference prior. For CTMC models, inference of the elapsed time between chain observations depends heavily on the rate of decay of the prior as the elapsed
time increases. Moreover, improper priors on the elapsed time may lead to improper posterior
distributions. In addition to the elapsed time, an infinitesimal rate matrix also characterizes the
CTMC. Usually, experts have good prior knowledge about the parameters of the infinitesimal
rate matrix, and thus can provide well-informed priors. We show that the use of a proper prior
for the rate matrix parameters together with the conditional reference prior for the elapsed time
yields a proper posterior distribution. Finally, we demonstrate that, when compared to analyses based on priors previously proposed in the literature, Bayesian analysis of the elapsed time based on the conditional reference prior possesses better frequentist properties. The
conditional reference prior therefore represents a better default prior choice for widely-used
estimation software.
Keywords: Conditional reference prior; phylogenetic reconstruction; prior for branch length;
frequentist coverage; mean square error.
1. Introduction
Continuous-time Markov chains (CTMCs) are ubiquitous modeling tools. The chains define a stochastic process {Z(t) : t ≥ 0}, where Z(t) realizes one of s discrete values from a state-space set S as elapsed time t proceeds. The chains also satisfy Markovian behavior, such
that the probability distribution of Z(t3 ) is independent of Z(t1 ) conditional on Z(t2 ) for
t3 > t2 > t1 . Given this conditionally memoryless property, an infinitesimal rate matrix
Q completely governs the process, where Q = {qij } is an s × s matrix with non-negative off-diagonal elements and rows summing to 0. The conditional probability distribution of
off-diagonal elements and rows summing to 0. The conditional probability distribution of
Z(t) naturally follows
\[
\Pr\left( Z(t) = j \mid Z(0) = i \right) = \left\{ e^{tQ} \right\}_{ij}, \qquad (1)
\]
where {·}ij represents the ij-th matrix element, matrix exponentiation takes the form \(e^{A} = I + \sum_{k=1}^{\infty} A^{k}/k!\) and I is the s × s identity matrix.
To gain intuition into the process, one can consider a graph composed of vertices, one
for each state in S, and edges with weights qij connecting vertices for i ≠ j. Random
variable Z(t) then transforms into the location indicator of a particle on this graph at time
t as the particle drifts from vertex to vertex. At any given time, the particle first waits an
Exponential amount of time, with rate equal to the sum of the edge weights leaving the
particle’s current state. After this waiting-time, the particle jumps to its next location with
probability proportional to the edge weight connecting the current and destination vertices.
Here we consider the case, often found in biology and linguistics, when we observe several
replications of the pair {Z(0), Z(t)}, and the primary interest lies in inferring the elapsed
time t.
Biology and linguistics are two seemingly disparate fields that make considerable use of
CTMCs. Both fields exploit CTMCs to infer evolutionary histories. In the case of biology,
researchers reconstruct the histories relating molecular sequences, such as short segments
of DNA, genes or entire genomes; while glottochronology aims to infer the ancestral relationships between languages as an approach to understand pre-historical human migration.
The sequences in this latter example are strings of presence/absence indicators of critical
words in a language. Often, analysis requires inferring only the times separating sequences
on a pairwise-basis and not the entire underlying history.
The first and most important step in any reconstruction is a description of how a character in one sequence relates to a possibly different character in the corresponding site of
a second sequence (Figure 1). Here, CTMCs come to the rescue. Let S equal the set of
possible characters at a single site. For nucleotide sequences, s = 4, containing adenine (A), guanine (G), cytosine (C) and thymine (T). For amino-acid sequences, s = 20; codon-based models have 64 states; and s = 2 naturally describes the glottochronology state-space.
Then, the two related characters at a site form realizations from a single CTMC observed
at two different moments, and one typically assumes the chains at different sites are independent and identically distributed. Statistical inference reduces to estimating the elapsed
time t between the observed sequences and, potentially, the infinitesimal rate matrix Q.
To attack this statistical problem, let the data Y = (n11 , . . . , n1s , n21 , . . . , n2s , . . . ,
ns1 , . . . , nss ) count the observed number of transitions between pairs of states nij for i, j ∈ S.
While up to s(s − 1) free parameters may characterize Q, one often employs structured matrices that are biologically-motivated and contain far fewer free parameters φ ∈ Φ. This
yields the complete parameter vector θ = (t, φ). Usually, experts have good prior knowledge
about the parameters of the infinitesimal rate matrix. Experts either fix φ to empirically
estimated quantities derived from large databases, such as the PAM (Dayhoff et al., 1972),
JTT (Jones et al., 1992), and WAG (Whelan and Goldman, 2001) models for amino acid
chains, or can provide well-informed priors. Such is not the case for the elapsed time t a
priori. Frequently, t is the most important quantity to be estimated as, for example, when
the expert wishes to estimate divergence times between molecular sequences or languages.
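To make the data structure concrete, the short sketch below tabulates the counts nij from a pair of aligned sequences; the two strings, the state ordering and the helper names are made up for illustration and are not taken from the paper or from Figure 1.

```python
from collections import Counter

# Two made-up aligned nucleotide sequences (not the alignment of Figure 1).
seq1 = "AGCTAGGA"
seq2 = "ACAGAGGA"
states = "AGCT"

# n_ij counts sites whose character is i in sequence 1 and j in sequence 2.
counts = Counter(zip(seq1, seq2))
n = {(i, j): counts.get((i, j), 0) for i in states for j in states}

N = len(seq1)                                          # total number of sites
changes = sum(c for (i, j), c in n.items() if i != j)  # sites that differ
print(n[("A", "A")], changes, N)
```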
Viewing inference of t as paramount differentiates CTMC use in reconstruction from
CTMCs to analyze panel data in the social sciences and econometrics (Kalbfleisch and
Lawless, 1985; Geweke et al., 1986). In these latter models, the elapsed time between
observations is known and inference focuses on the infinitesimal rate matrix parameters φ.
Kalbfleisch and Lawless (1985) introduce maximum likelihood estimators for φ and Geweke
et al. (1986) furnish Bayesian estimators under a uniform prior on rate matrices.
Neither maximum likelihood nor Bayesian estimation of the elapsed time t is trivial.
Assuming the chain is irreducible, as t → ∞ the probability distribution on Z(t) converges
to the stationary distribution of Q. As a consequence, the data likelihood function f (Y|θ)
of the CTMC converges to a constant greater than zero. Thus, a Bayesian analysis with a
marginal improper prior for t leads to a useless improper posterior distribution. Moreover,
f (Y|θ) may be strictly increasing in t. Consequently, with positive probability, the maximum
likelihood estimate of t may not exist and prior choice for t can impart substantial influence
on estimates.
To briefly demonstrate the poor tail behavior of the likelihood function, we consider a
subset of glottochronology language data analyzed in Gray and Atkinson (2003). These
data are binary characters indicating the presence/absence of cognates in Indo-European
languages. Cognates are words that share a common origin in a predecessor language. One
difficulty in these data is estimating the divergence time of 16 Romance languages from
Germanic languages, represented by Modern German. As German is the outgroup among
the data, this distance becomes the evolutionary tree height. Figure 2 plots posterior
histograms of the tree height under six different prior choices (note that the scales are
different for each plot). Both Exponential and Uniform priors are standard in MrBayes
(Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003), a popular Bayesian
sampler for evolutionary reconstruction problems. As the likelihood function flattens out,
little information is available in the data and the prior dictates the right-tail behavior of
the tree height. This seriously affects Bayesian estimation and it can lead to disastrous
consequences in model selection problems (Suchard et al., 2001).
Even in moderate-sized phylogenetic problems, the elicitation of a joint prior distribution
on the parameters θ is practically infeasible. Thus, automatic or semi-automatic methods
may find use. An attractive strategy to overcome these difficulties and one that has seen
significant development in recent years is to use an ‘objective Bayesian analysis’, by which we mean the use of default priors derived by formal rules that use the structure of the
problem at hand but do not require subjective prior elicitation. Two of the most popular
of these methods are the Jeffreys prior (Jeffreys, 1961) and the reference prior (Bernardo,
1979; Berger and Bernardo, 1989, 1992).
In this paper, we develop a default prior for the elapsed time t of CTMC models for use
when little or no prior information is available on t but there is prior information on the parameters of the infinitesimal rate matrix. Specifically, we derive an explicit expression
for the conditional reference prior (Sun and Berger, 1998) on t and establish the propriety
of the resulting posterior distribution. We also study the small sample frequentist properties of Bayesian procedures based on the conditional reference prior and compare these
properties to results obtained under alternative priors currently exploited by evolutionary
reconstruction practitioners.
2. General Time Model
We begin the development of a default prior by considering the CTMC model in its most
general, irreducible form in which φ contains s(s − 1) parameters. We refer to this as the
general time (GT) model. Later we return to more restricted models, such as the Jukes-Cantor (Jukes and Cantor, 1969, JC69) and Kimura (Kimura, 1980, K80) models commonly
employed in evolutionary reconstruction. To proceed, we first require an understanding of
the tail-behavior of the data likelihood f (Y|θ).
Consider the spectral decomposition Q = B(φ)Λ(φ)B−1 (φ), where Λ(φ) is the diagonal
matrix of eigenvalues and the columns of B(φ) are the corresponding right-eigenvectors of
Q. To simplify notation and ease exposition, we drop the implicit dependence of Λ and B
on φ, order the eigenvalues in decreasing order, such that
\[
0 = \lambda_1 > \lambda_2 \geq \ldots \geq \lambda_s, \qquad (2)
\]
and write B = {rik } and B−1 = {ckj }. Interestingly, the largest eigenvalue λ1 is equal
to 0 because the row-sums of Q are all 0. The largest eigenvalue’s corresponding right-
4
Marco A. R. Ferreira and Marc A. Suchard
eigenvector is (r11 , . . . , rs1 )′ = (1, . . . , 1)′ and we refer to the largest eigenvalue's left-eigenvector (c11 , . . . , c1s ) as a stationary distribution π of the CTMC. CTMCs are trivially
aperiodic and considering only irreducible chains enforces that π is unique through the
Ergodic theorem. Since π is unique, the remaining eigenvalues are strictly less than λ1 , i.e.
negative.
Using the spectral decomposition, we re-write the conditional probability distribution
on Z(t) given in Equation (1), such that the probability of transitioning from state i to
state j in time t is
\[
P_{ij}(t) = \left\{ B e^{\Lambda t} B^{-1} \right\}_{ij} = \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} = \sum_{k=1}^{s} d_{ijk}. \qquad (3)
\]
Given the structure of the eigenvalues in Equation (2) and taking t → ∞, Pij (t) converges to the stationary distribution c1j = πj that does not depend on the starting state i. Note that π implicitly depends on φ.
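As a small numerical check of Equations (1) and (3), the following sketch computes P(t) both by direct matrix exponentiation and through the spectral decomposition, and shows every row approaching the stationary distribution for large t; the equal-rates matrix Q and the value α = 1 are illustrative assumptions, not settings from the paper.

```python
import numpy as np
from scipy.linalg import expm

alpha = 1.0                                      # illustrative rate
Q = alpha * (np.ones((4, 4)) - 4.0 * np.eye(4))  # off-diagonals alpha, rows sum to zero

def transition_matrix(Q, t):
    """P(t) = exp(tQ), as in Equation (1)."""
    return expm(t * Q)

def transition_matrix_spectral(Q, t):
    """P(t) via the spectral decomposition Q = B Lambda B^{-1}, as in Equation (3)."""
    lam, B = np.linalg.eig(Q)
    return (B @ np.diag(np.exp(lam * t)) @ np.linalg.inv(B)).real

t = 0.3
assert np.allclose(transition_matrix(Q, t), transition_matrix_spectral(Q, t))

# For large t each row approaches the stationary distribution pi = (1/4, ..., 1/4).
print(transition_matrix(Q, 50.0).round(3))
```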
Returning to the data Y, the log-likelihood function becomes
\[
\log f(Y|\theta) = \sum_{i,j} n_{ij} \log\left( \sum_{k} d_{ijk} \right). \qquad (4)
\]
Proposition 2.1. For the GT model, the log-likelihood function (4) is a continuous function on [0, ∞) satisfying:

(i)
\[
\lim_{t \to 0^{+}} f(Y|\theta) = \begin{cases} 1, & \text{if } n_{ij} = 0 \ \forall\, i \neq j, \\ 0, & \text{otherwise;} \end{cases} \qquad (5)
\]

(ii)
\[
\lim_{t \to \infty} \log f(Y|\theta) = \sum_{i,j} n_{ij} \left( \log \pi_i + \log \pi_j \right); \text{ and} \qquad (6)
\]

(iii) for \(\sum_{i \neq j} n_{ij} > 0\) and small t,
\[
f(Y|\theta) \sim t^{\sum_{i \neq j} n_{ij}} \prod_{i \neq j} q_{ij}^{n_{ij}}. \qquad (7)
\]

Proof. See Appendix. □
3. Conditional reference prior
There are at least two possible forms of noninformative priors for the CTMC parameters θ = (t, φ). The first is the joint Jeffreys-rule prior given by π(θ) ∝ √(det I(θ)), where I(θ) is the Fisher information matrix. The ij-th entry of the Fisher information matrix is
\[
\{ I(\theta) \}_{ij} = \mathrm{E}\left[ -\frac{\partial^{2}}{\partial \theta_i \, \partial \theta_j} \log f(Y|\theta) \right], \qquad \theta_1 = t, \ \theta_2 = \phi, \qquad (8)
\]
where E[·] refers to an expectation with respect to the conditional distribution of Y given
θ. The joint Jeffreys prior for θ could be a reasonable choice if prior information were
available on neither t nor φ, but usually there exists prior biological information on φ.
Typically, no prior information on t is available, while expert opinion remains quite strong regarding reasonable values for φ. As discussed in Section 1 and illustrated in Figure 2, posterior analysis inferring t can be highly influenced by the prior on t. Consequently, we wish to incorporate prior information about φ and, at the same time, use
a noninformative prior for t. This leads us to a conditional reference prior for t (Sun and
Berger, 1998).
Let π(φ), with ∫Φ π(φ)dφ = 1, characterize the prior information on the infinitesimal rate matrix parameter φ. Following Sun and Berger (1998), the conditional reference prior for t is π r (t|φ) ∝ √I(t|φ), where
\[
I(t|\phi) = \mathrm{E}\left[ -\frac{\partial^{2}}{\partial t^{2}} \log f(Y|\theta) \right]. \qquad (9)
\]
Then, the joint prior density for θ becomes
\[
\pi(\phi)\, \pi^{r}(t|\phi). \qquad (10)
\]
Theorem 3.1. Starting from the GT model with log-likelihood shown in Equation (4), the conditional reference prior for the elapsed time t given φ is
\[
\pi^{r}(t|\phi) \propto g(t|\phi) = \sqrt{ \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} \lambda_k r_{ik} e^{\lambda_k t} c_{kj} \right)^{2} - \left( \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} \right) \left( \sum_{k=2}^{s} \lambda_k^{2} r_{ik} e^{\lambda_k t} c_{kj} \right) }{ \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} } }, \qquad (11)
\]
where we remind readers that λk , rik and ckj all implicitly depend on φ.
Proof. See Appendix. □
An interesting feature of the unnormalized conditional reference prior g(t|φ) given in
Equation (11) is that its behavior close to zero is independent of the infinitesimal rate
matrix Q and its behavior for large t depends only on the second largest eigenvalue of Q.
The following corollary describes the tail behavior of g(t|φ).
Corollary 3.1. The unnormalized conditional reference prior of t given φ shown in Equation (11) is a non-negative, continuous function on [0, ∞) that satisfies:

(i)
\[
g(t|\phi) = O\!\left( \tfrac{1}{\sqrt{t}} \right) \text{ as } t \to 0, \text{ and} \qquad (12)
\]

(ii)
\[
g(t|\phi) = O\!\left( e^{\lambda_2 t} \right) \text{ as } t \to \infty. \qquad (13)
\]

Proof. See Appendix. □
As a consequence of Corollary 3.1 and recalling that λ2 < 0, the conditional reference
prior π r (t|φ) is proper. Therefore, there are no normalization concerns (Sun and Berger,
1998), and the conditional reference prior can be written as
\[
\pi^{r}(t|\phi) = K(\phi)\, g(t|\phi), \qquad (14)
\]
where the normalizing constant \(K(\phi) = \left( \int_{0}^{\infty} g(t|\phi)\, dt \right)^{-1}\) can be easily computed using standard one-dimensional numerical integration.
Corollary 3.2. The joint prior distribution for θ induced by the product of the informative prior distribution π(φ) and the conditional reference prior π r (t|φ) given in Equation (11) yields a proper posterior density.
Proof. This follows directly from the fact that π(φ) and π r (t|φ) are proper probability measures. □
Given the tail-behavior described in Corollary 3.1, a Gamma(1/2, −λ2 ) can serve as
a reasonable approximation of π r (t|φ) for implementation in software that allows only
standard distributions. A comparison of the performance of Bayesian procedures based
on the conditional reference prior and its Gamma(1/2, −λ2 ) approximation is presented in
Section 5.
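A minimal numerical sketch of Equations (11) and (14), and of the Gamma approximation, follows. It assumes an irreducible rate matrix with a real spectrum (as for the reversible chains used here), takes a JC69-type Q with α = 1 purely for illustration, and parameterizes Gamma(1/2, −λ2) by its rate, i.e. scale −1/λ2.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

def reference_prior(Q):
    """Unnormalized conditional reference prior g(t|phi) of Equation (11) and the
    second-largest eigenvalue lambda_2; assumes Q is irreducible with a real spectrum."""
    lam, B = np.linalg.eig(Q)
    lam, B = lam.real, B.real
    C = np.linalg.inv(B)

    def g(t):
        E = np.exp(lam * t)
        P = B @ np.diag(E) @ C               # sum_k d_ijk = P_ij(t)
        P1 = B @ np.diag(lam * E) @ C        # sum_k lambda_k d_ijk (the k = 1 term is zero)
        P2 = B @ np.diag(lam ** 2 * E) @ C   # sum_k lambda_k^2 d_ijk
        val = np.sum((P1 ** 2 - P * P2) / P)
        return np.sqrt(max(val, 0.0))        # guard against tiny negative rounding error

    return g, np.sort(lam)[-2]

# Illustrative equal-rates (JC69-type) rate matrix with alpha = 1.
Q = np.ones((4, 4)) - 4.0 * np.eye(4)
g, lam2 = reference_prior(Q)

K = 1.0 / quad(g, 0.0, np.inf)[0]            # normalizing constant of Equation (14)
approx = gamma(a=0.5, scale=-1.0 / lam2)     # Gamma(1/2, -lambda_2) approximation
for t in (0.05, 0.5, 2.0):
    print(t, K * g(t), approx.pdf(t))
```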
4. Commonly Employed CTMC Models
We consider two restricted cases of the GT model that find general use across binary,
nucleotide and amino acid sequences. The first parameterization of Q considers that all
transitions of the CTMC occur with the same infinitesimal rate α, such that qij = α ∀ i 6= j.
Although Jukes and Cantor (1969) first endorse this model for nucleotide substitution processes, the mathematical properties that we derive are shared with all standard models for
amino acid sequences. Amino acid CTMC models follow empirically-estimated infinitesimal
rate matrices, resulting in the same number of free parameters as the JC69 model. We also
consider the K80 (Kimura, 1980) CTMC for nucleotide sequences. This model assumes
that the states of the chain divide into two disjoint sets and that rates of transition within
and between sets differ. Strong expert opinion exists about the relative rates of within and
between events, suggesting use of a conditional reference prior.
4.1. Reference prior for the JC69 model
Under an equal rates model, sufficient statistics of the data Y are the total number of sites N and the number of observed changes \(n = \sum_{i \neq j} n_{ij}\). Using these sufficient statistics, the likelihood function under the JC69 model becomes
\[
f(N, n \mid t, \alpha) = \left( \tfrac{1}{4} + \tfrac{3}{4} e^{-\alpha t} \right)^{N-n} \left[ 3 \left( \tfrac{1}{4} - \tfrac{1}{4} e^{-\alpha t} \right) \right]^{n}. \qquad (15)
\]
As the likelihood function provides information only for the product αt, the infinitesimal
rate α is fixed a priori. Different choices of α lead to differing scalings of t. For example,
if α = 1/3 (the usual choice among phylogeneticists) then t scales in terms of the expected
number of changes per site given the chain starts at stationarity. To appreciate this, we count the expected number of changes that occur between t ∈ [0, 1), \(\sum_i \pi_i (-q_{ii}) = 1\), where πi = 1/4 under the JC69 model.
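For the equal-rates matrix, the zero row sums give −qii = 3α, so this calculation is simply
\[
\sum_{i} \pi_i \left( -q_{ii} \right) = 4 \times \tfrac{1}{4} \times 3\alpha = 3\alpha,
\]
which equals 1 exactly when α = 1/3.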
Under the JC69 model, a maximum likelihood estimate of t does not exist for n/N >
3/4 and a Bayesian approach with a poor prior choice may not fare any better. As t →
∞, the distribution of the number of observed changes between sequences converges to a
Binomial distribution with sample size N and probability of success 3/4. Likewise, the
data likelihood converges to a positive constant. This behavior causes major problems for
Bayesian estimation of t as the inference will depend heavily on the tail behavior of the prior
for t. In particular, improper priors will lead to useless improper posterior distributions and
inference based on truncated uniform priors depends heavily on where the prior is truncated.
From Theorem 3.1, the reference prior for the elapsed time t under the JC69 model is
\[
\pi^{r}(t|\phi) \propto \sqrt{I(t|\phi)} \propto \frac{e^{-\alpha t}}{\sqrt{\left( 1 + 3e^{-\alpha t} \right)\left( 1 - e^{-\alpha t} \right)}}. \qquad (16)
\]
As the second largest eigenvalue of the infinitesimal rate matrix Q of the JC69 model is −α, the prior above behaves as e^{−αt} for large t.
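A toy, grid-based sketch of posterior inference for t under this prior appears below; the rate α = 1, the grid, and the example counts (N, n) are illustrative choices rather than settings from the paper, and in practice a sampler would replace the grid.

```python
import numpy as np

alpha = 1.0  # illustrative; the paper fixes alpha a priori

def jc69_loglik(t, N, n):
    """Log-likelihood of Equation (15): N sites, n of which show a change."""
    p_same = 0.25 + 0.75 * np.exp(-alpha * t)
    p_diff = 3.0 * (0.25 - 0.25 * np.exp(-alpha * t))
    return (N - n) * np.log(p_same) + n * np.log(p_diff)

def jc69_reference_log_prior(t):
    """Log of the unnormalized conditional reference prior of Equation (16)."""
    e = np.exp(-alpha * t)
    return -alpha * t - 0.5 * np.log((1.0 + 3.0 * e) * (1.0 - e))

def posterior_summary(N, n, grid=np.linspace(1e-6, 30.0, 200001)):
    """Posterior mean and equal-tailed 95% credible interval for t on a grid."""
    lp = jc69_loglik(grid, N, n) + jc69_reference_log_prior(grid)
    w = np.exp(lp - lp.max())
    w /= w.sum()
    cdf = np.cumsum(w)
    return np.sum(grid * w), (grid[np.searchsorted(cdf, 0.025)],
                              grid[np.searchsorted(cdf, 0.975)])

print(posterior_summary(N=100, n=30))
# Even when n/N > 3/4, where no maximum likelihood estimate exists,
# the posterior under the reference prior remains proper and usable.
print(posterior_summary(N=100, n=80))
```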
4.2. Reference prior for the Kimura Model
The infinitesimal rate matrix Q under the K80 model for nucleotides is
\[
\alpha \begin{pmatrix}
-(\kappa + 2) & \kappa & 1 & 1 \\
\kappa & -(\kappa + 2) & 1 & 1 \\
1 & 1 & -(\kappa + 2) & \kappa \\
1 & 1 & \kappa & -(\kappa + 2)
\end{pmatrix}, \qquad (17)
\]
where we arbitrarily have ordered the states in Matrix (17) as {A, G, C, T }. Nucleotides
A and G contain purine side-groups, while C and T contain pyrimidines. Purines and
pyrimidines differ in the size of their aromatic hetero-cycles. Due to steric differences,
CTMC jumps within groups, confusingly called “transitions” by evolutionary biologists,
occur with infinitesimal rate κ × α. This rate may differ from changes across groups, called
“transversions”, at rate α. Following the JC69 model formulation, one fixes α such that t
scales in terms of the expected number of changes per site. This choice implies α = (κ + 2)^{-1}.
Sufficient statistics of the data are, again, the total number of sites N and the numbers of observed "transitions" ns and "transversions" nv . The data likelihood function under the K80 model reduces to
\[
f(N, n_s, n_v \mid t, \alpha, \kappa) = \left( \tfrac{1}{4} + \tfrac{1}{4} e^{-\alpha\kappa t} + \tfrac{1}{2} e^{-\frac{\alpha(\kappa+1)t}{2}} \right)^{N - n_s - n_v} \left( \tfrac{1}{2} - \tfrac{1}{2} e^{-\alpha\kappa t} \right)^{n_v} \times \left( \tfrac{1}{4} + \tfrac{1}{4} e^{-\alpha\kappa t} - \tfrac{1}{2} e^{-\frac{\alpha(\kappa+1)t}{2}} \right)^{n_s}. \qquad (18)
\]
By Theorem 3.1, the conditional reference prior for t under the K80 model is
\[
\pi^{r}(t|\phi) \propto e^{\frac{1}{2}\alpha\kappa t} \sqrt{ \frac{ 2\kappa^{2} e^{\alpha(\kappa+1)t} + (\kappa+1)^{2} e^{2\alpha\kappa t} - 4\kappa^{2} e^{\alpha\kappa t} - 4\kappa\, e^{\alpha\kappa t} + 2\kappa^{2} e^{\alpha t} - (\kappa-1)^{2} }{ \left( e^{\alpha\kappa t} - 1 \right) \left( e^{\alpha\frac{3\kappa+1}{2}t} + e^{\alpha\frac{\kappa+1}{2}t} - 2e^{\alpha\kappa t} \right) \left( e^{\alpha\frac{3\kappa+1}{2}t} + e^{\alpha\frac{\kappa+1}{2}t} + 2e^{\alpha\kappa t} \right) } }. \qquad (19)
\]
From Corollary 3.1 and as the second largest eigenvalue of Q under the K80 model equals max{−4α, −2α(κ + 1)}, π r (t|φ) is approximately proportional to exp[max{−4α, −2α(κ + 1)}t] for large t.
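As a quick numerical check of this eigenvalue statement, the sketch below builds the K80 rate matrix of Equation (17) with α = (κ + 2)^{-1} and compares its second-largest eigenvalue with max{−4α, −2α(κ + 1)}; the value κ = 2 is only an illustrative choice.

```python
import numpy as np

def k80_rate_matrix(kappa):
    """K80 rate matrix of Equation (17), states ordered (A, G, C, T),
    with alpha = 1/(kappa + 2) so that t counts expected changes per site."""
    alpha = 1.0 / (kappa + 2.0)
    Q = alpha * np.array([
        [-(kappa + 2.0), kappa,          1.0,            1.0],
        [kappa,          -(kappa + 2.0), 1.0,            1.0],
        [1.0,            1.0,            -(kappa + 2.0), kappa],
        [1.0,            1.0,            kappa,          -(kappa + 2.0)],
    ])
    return Q, alpha

kappa = 2.0                                     # illustrative; kappa = 2 is a common fixed value
Q, alpha = k80_rate_matrix(kappa)
eigvals = np.sort(np.linalg.eigvalsh(Q))[::-1]  # Q is symmetric, so eigvalsh applies
print(eigvals)                                  # 0, then the negative eigenvalues
print(max(-4.0 * alpha, -2.0 * alpha * (kappa + 1.0)))  # equals the second-largest eigenvalue
```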
Usually, there is strong expert opinion about κ. For example, fixing κ = 2 regularly occurs in phylogenetic software (Felsenstein, 1995), while estimates of κ range from as low as 1.4 in regions of the human immunodeficiency virus (Leitner et al., 1997) to a median of approximately 4 with variance of 10 across mammalian gene sequences (Rosenberg et al., 2003). Taking π(κ) as a log-normal density fits these observations well.
For richer infinitesimal rate matrix parameterizations, the most popular CTMC for nucleotides is arguably the HKY85 (Hasegawa et al., 1985) model. This model extends the
K80 chain by allowing for a non-uniform stationary distribution π. While varying π affects
the eigenvectors in B, it does not change the eigenvalues. Commonly one fixes π to their
empirically observed estimates (Li et al., 2000) as their maximum likelihood estimates rarely
differ by an appreciable amount. Then κ remains the only free parameter and all properties
for the K80 model hold under HKY85 assumptions.
5. Frequentist properties
The study of frequentist properties of Bayesian procedures has been proposed as one way
to evaluate default priors (Berger et al., 2001, and references therein). In this section, we
carry out a simulation study to examine the frequentist properties of Bayesian procedures
based on the conditional reference prior and on previous priors proposed in the literature
for the elapsed time t.
The frequentist properties considered here are mean squared error (MSE) of parameter
estimates and frequentist coverage of 95% credible intervals for the elapsed time t. We
compare the analyses based on the conditional reference prior and its Gamma(1/2, −λ2 )
approximation with analyses under two priors previously proposed in the literature: a
Uniform(0,10) as implemented in MrBayes version 2 (Huelsenbeck and Ronquist, 2001) and
an Exponential(10) with mean 10^{-1} as implemented in MrBayes version 3 (Ronquist and
Huelsenbeck, 2003).
We simulate data under the JC69 model discussed in Section 4.1 with the parameter t
equal to one of 100 values ranging from 0.01 to 1. For each parameter value, we simulate
1000 datasets and from each dataset we compute the posterior mean and the 95% credible
interval using the reference, Gamma, Uniform and Exponential priors. We then compute
the (estimated) MSE of the Bayesian estimators and the (estimated) frequentist coverage
of the equal-tailed 95% credible intervals.
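A compressed sketch of this type of simulation follows; the number of sites, the single true value of t, the rate α = 1 and the grid-based posterior summaries are illustrative stand-ins for the settings actually used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, N, t_true, reps = 1.0, 100, 0.5, 1000   # illustrative settings

def p_change(t):
    """Probability that a site differs between the two sequences under JC69."""
    return 0.75 - 0.75 * np.exp(-alpha * t)

t_grid = np.linspace(1e-4, 10.0, 20001)
log_priors = {                                  # log densities up to additive constants
    "Reference (Eq. 16)": -alpha * t_grid
        - 0.5 * np.log((1 + 3 * np.exp(-alpha * t_grid)) * (1 - np.exp(-alpha * t_grid))),
    "Gamma(1/2, alpha)": -0.5 * np.log(t_grid) - alpha * t_grid,
    "Exponential(10)": -10.0 * t_grid,
    "Uniform(0,10)": np.zeros_like(t_grid),
}

def summarize(lp):
    w = np.exp(lp - lp.max()); w /= w.sum()
    cdf = np.cumsum(w)
    return (np.sum(t_grid * w),
            t_grid[np.searchsorted(cdf, 0.025)], t_grid[np.searchsorted(cdf, 0.975)])

for name, log_prior in log_priors.items():
    estimates, covered = [], 0
    for _ in range(reps):
        n = rng.binomial(N, p_change(t_true))
        loglik = n * np.log(p_change(t_grid)) + (N - n) * np.log1p(-p_change(t_grid))
        mean, lo, hi = summarize(loglik + log_prior)
        estimates.append(mean)
        covered += int(lo <= t_true <= hi)
    mse = np.mean((np.array(estimates) - t_true) ** 2)
    print(f"{name}: MSE = {mse:.4f}, coverage = {covered / reps:.3f}")
```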
Figure 3 shows the relative MSE of the Bayesian estimates and the frequentist coverage
of 95% credible intervals as a function of the true value of t. In the range of values considered
for t, the estimation based on the Exponential(10) prior is slightly better in terms of relative
MSE than the estimation based on the reference prior. The performance of estimation based
on the Uniform(0,10) prior deteriorates very fast for values of t larger than 0.7. In terms
of frequentist coverage, for all considered values of t the reference prior yields credible intervals with coverage close to the nominal value. This nominal coverage is consistent with Welch and Peers (1963), who demonstrate that a univariate Jeffreys prior can serve as a first-order probability matching prior. The Uniform prior yields credible intervals with coverage
slightly below nominal value, whereas the coverage of the Exponential prior induced credible
intervals quickly drops below the nominal value as the true value of t increases.
A decomposition of the relative MSE in terms of variance and squared bias (Figure 4)
sheds light on the different behaviors of analyses based on the Exponential(10) and reference
priors. For the Exponential(10) prior, the frequentist variance of the posterior mean is
fairly insensitive to the true value of t; whereas, bias increases fairly fast as a function of t.
Conversely, for the reference prior, the frequentist variance of the estimate increases with
the true value of t; whereas, the bias remains close to zero. When variance and squared
bias are combined into the MSE, both analyses have similar performance. Nevertheless, the
poor frequentist coverage of the Exponential-prior-based analyses results from the fact that
the prior strongly favors small values of t. Thus, when no prior information is available
we recommend against using the Exponential(10) prior. Overall, the reference-prior-based
analyses are more robust to the true value of t, when compared with the Uniform and
Exponential prior analyses.
Finally, Figures 3 and 4 show that the conditional reference prior for t and the corresponding Gamma(1/2, −λ2 ) approximation yield analyses with similar frequentist properties. Therefore, the Gamma(1/2, −λ2 ) prior is a good alternative for implementation in
evolutionary reconstruction packages.
6. Discussion
In this work, we describe Bayesian analyses for CTMCs based on default priors. For the
default prior, we focus on a conditional reference prior for the elapsed time t coupled with
a proper prior for the parameters φ of the infinitesimal rate matrix Q that characterizes
the CTMC. We have derived a general explicit expression for the conditional reference prior
and have shown that the resulting posterior distribution is proper. We also investigate
the frequentist properties of analyses based on the conditional reference prior and on two
priors previously proposed in the literature through simulation under the JC69 model. In
terms of MSE, parameter estimates based on the conditional reference prior are comparable
to estimates based on the Exponential prior and are much better than estimates based
on the Uniform prior. In terms of frequentist coverage, the credible interval based on the
conditional reference prior is comparable to the credible interval based on the Uniform prior
and is much better than the credible interval based on the Exponential prior. The lower than
nominal coverage of the credible interval based on the Exponential prior exposes a severe
underestimation of the posterior uncertainty. Therefore, when there is no prior information
on t, we recommend the use of the conditional reference prior. Many Bayesian estimation
packages allow for only standard distributions; in these situations, the Gamma(1/2, −λ2 )
should suffice.
While considering frequentist properties of Bayesian procedures may smell of heresy,
those properties are indeed relevant in the evaluation of default priors. Default priors
purposefully find themselves implemented in standard estimation software. Amongst evolutionary biologists, these standard programs are often employed without much consideration
of their underlying modeling assumptions. Each use of the program then represents an independent experimental replication. Over many different replications, it is reassuring that
the estimators possess good frequentist properties.
In our default prior construction, we have started with relatively simple CTMC models.
Several extensions of the CTMCs find considerable use and warrant exploration. Amongst
these directions, two are notable. Although pair-wise distances are most popular in molecular biology studies, phylogenetic reconstruction of the evolutionary histories relating 3 or
more sequences dominates in evolutionary biology.
Such a history consists of multiple,
correlated branch lengths. Here, Yang and Rannala (2005) explore independent Exponential
priors with two fixed rates, one for internal and one for external branches; while Suchard
et al. (2001) introduce a hierarchical Exponential prior by simultaneously estimating the
hyperprior rate. Neither are noninformative nor consider the correlation across branches,
necessitating the development of a default prior over their joint distribution. Finally, introducing rate variation relaxes the assumption of identical distributions across sites (Yang,
1996), leading to CTMC mixture models; an open question remains surrounding how to
construct appropriate default priors for these mixtures.
Acknowledgements
We thank the generous support of the Conselho Nacional de Desenvolvimento Científico
e Tecnológico, Brazil, (grant 402010/2003-5) in fostering this collaboration. M.A.S. is an
Alfred P. Sloan Research Fellow.
Appendix
Proof of Theorem 3.1
Note that
\[
\frac{\partial}{\partial t}\, d_{ijk} = \begin{cases} 0, & k = 1, \\ \lambda_k\, d_{ijk}, & k > 1. \end{cases} \qquad (20)
\]
Then, the first derivative of the log-likelihood function becomes
\[
\frac{\partial}{\partial t} \log f(Y|\theta) = \sum_{i,j} n_{ij}\, \frac{\sum_{k=2}^{s} \lambda_k d_{ijk}}{\sum_{k=1}^{s} d_{ijk}}, \qquad (21)
\]
and the second derivative takes the form
\[
\frac{\partial^{2}}{\partial t^{2}} \log f(Y|\theta) = \sum_{i,j} n_{ij}\, \frac{\left( \sum_{k=1}^{s} d_{ijk} \right) \sum_{k=2}^{s} \lambda_k^{2} d_{ijk} - \left( \sum_{k=2}^{s} \lambda_k d_{ijk} \right)^{2}}{\left( \sum_{k=1}^{s} d_{ijk} \right)^{2}}. \qquad (22)
\]
Using the fact that \(\mathrm{E}(n_{ij}) = N \sum_{k=1}^{s} d_{ijk}\), the expected Fisher information is
\[
I(t|\phi) = N \sum_{i,j} \frac{\left( \sum_{k=2}^{s} \lambda_k d_{ijk} \right)^{2} - \left( \sum_{k=1}^{s} d_{ijk} \right) \sum_{k=2}^{s} \lambda_k^{2} d_{ijk}}{\sum_{k=1}^{s} d_{ijk}}. \qquad (23)
\]
Therefore, the conditional reference prior of the elapsed time t for the GT model is
\[
\pi^{r}(t|\phi) \propto \sqrt{ \sum_{i,j} \frac{\left( \sum_{k=2}^{s} \lambda_k r_{ik} e^{\lambda_k t} c_{kj} \right)^{2} - \left( \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} \right) \sum_{k=2}^{s} \lambda_k^{2} r_{ik} e^{\lambda_k t} c_{kj}}{\sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj}} }. \qquad (24)
\]
Proof of Proposition 2.1
We may write the data likelihood function as
\[
f(Y|\theta) = \prod_{i,j} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{kj} \right)^{n_{ij}}. \qquad (25)
\]
(i) Recall that B = {rik }, B−1 = {ckj }, and BB−1 is an identity matrix. Then, if nij = 0 ∀ i ≠ j,
\[
\lim_{t \to 0} f(Y|\theta) = \lim_{t \to 0} \prod_{i} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{ki} \right)^{n_{ii}} = \prod_{i} \left( \sum_{k} r_{ik} c_{ki} \right)^{n_{ii}} = 1. \qquad (26)
\]
Conversely, if \(\sum_{i \neq j} n_{ij} > 0\), then
\[
\lim_{t \to 0} f(Y|\theta) = \lim_{t \to 0} \prod_{i \neq j} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{kj} \right)^{n_{ij}} = \prod_{i \neq j} \left( \sum_{k} r_{ik} c_{kj} \right)^{n_{ij}} = 0. \qquad (27)
\]
(ii) The tail-behavior as t → ∞ follows directly from the continuity of Pij (t) in Equation (3) and by considering the joint distribution of the data at stationarity, i.e. that the chain's initial state draws from π.
(iii) If \(\sum_{i \neq j} n_{ij} > 0\), then as t → 0
\[
f(Y|\theta) \sim \prod_{i \neq j} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{kj} \right)^{n_{ij}} = \prod_{i \neq j} \left( \sum_{k} r_{ik} \left[ 1 + \lambda_k t + O(t^{2}) \right] c_{kj} \right)^{n_{ij}} = \prod_{i \neq j} \left( q_{ij} t + O(t^{2}) \right)^{n_{ij}} \sim t^{\sum_{i \neq j} n_{ij}} \prod_{i \neq j} q_{ij}^{n_{ij}}. \qquad (28)
\]
Proof of Corollary 3.1
From Proposition 2.1(iii), for small t, the likelihood function behaves as \(t^{\sum_{i \neq j} n_{ij}}\) as a function of t. Therefore, when t → 0, the conditional reference prior of the elapsed time t for the GT model will behave as \(1/\sqrt{t}\).

We now consider the behavior of the conditional reference prior for t → ∞. In this case, the first two terms in \(\sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj}\) will dominate the sum, so we approximate it by \(\pi_j + r_{i2} e^{\lambda_2 t} c_{2j}\). Applying this approximation, the conditional reference prior
\[
\begin{aligned}
\pi^{r}(t|\phi) \propto g(t|\phi)
&\approx \sqrt{ \sum_{i,j} \frac{ \left( \lambda_2 r_{i2} e^{\lambda_2 t} c_{2j} \right)^{2} - \left( \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \right) \lambda_2^{2} r_{i2} e^{\lambda_2 t} c_{2j} }{ \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} } } \\
&\propto \sqrt{ \sum_{i,j} \frac{ -\pi_j r_{i2} e^{\lambda_2 t} c_{2j} }{ \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} } } \\
&= \sqrt{ - \frac{ \sum_{i,j} \pi_j r_{i2} e^{\lambda_2 t} c_{2j} \prod_{(l,m) \neq (i,j)} \left( \pi_m + r_{l2} e^{\lambda_2 t} c_{2m} \right) }{ \prod_{i,j} \left( \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \right) } } \\
&\approx \sqrt{ - \frac{ \sum_{i,j} \left[ r_{i2} c_{2j} e^{\lambda_2 t} \prod_{(l,m)} \pi_m + r_{i2} c_{2j} e^{2\lambda_2 t} \sum_{(l,m) \neq (i,j)} r_{l2} c_{2m} \prod_{(l',m') \neq (l,m)} \pi_{m'} \right] }{ \prod_{i,j} \left( \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \right) } } \\
&= O\!\left( e^{\lambda_2 t} \right), \qquad (29)
\end{aligned}
\]
where the last step substitutes \(\pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \approx \pi_j\) for large t in the denominator and eliminates the leading \(e^{\lambda_2 t}\) term through \(\sum_{i,j} r_{i2} c_{2j} = 0\). The latter equality follows from \(\sum_{j} r_{i2} c_{2j} = r_{i2} (c_{21}, \ldots, c_{2s}) (1, \ldots, 1)' = r_{i2} (c_{21}, \ldots, c_{2s}) (r_{11}, \ldots, r_{s1})' = 0\) as B−1 B = I.
References
Berger, J. O. and Bernardo, J. M. (1989). “Estimating a product of means: Bayesian
analysis with reference priors.” Journal of the American Statistical Association, 84, 200–
207.
— (1992). “On the development of the reference prior method.” In Bayesian Statistics
IV, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, 35–60. Oxford:
Oxford University Press.
Berger, J. O., de Oliveira, V., and Sansó, B. (2001). “Objective Bayesian analysis of spatially
correlated data.” Journal of the American Statistical Association, 96, 456, 1361–1374.
Bernardo, J. M. (1979). “Reference posterior distributions for Bayesian inference (with
discussion).” Journal of the Royal Statistical Society B , 41, 113–147.
Dayhoff, M., Eck, R., and Park, C. (1972). “A model of evolutionary change in proteins.”
In Atlas of protein sequence and structure, vol. 5, 89–99. Washington, DC: National
Biomedical Research Foundation.
Felsenstein, J. (1995). PHYLIP (Phylogenetic Inference Package), Version 3.57 . Seattle,
WA: Distributed by the author. Department of Genetics, University of Washington.
Geweke, J., Marshall, R., and Zarkin, G. (1986). “Exact inference for continuous time
Markov chain models.” Review of Economic Studies, 53, 653–669.
Gray, R. and Atkinson, Q. (2003). “Language-tree divergence times support the Anatolian
theory of Indo-European origin.” Nature, 426, 435–439.
Hasegawa, M., Kishino, H., and Yano, T. (1985). “Dating the human-ape splitting by a
molecular clock of mitochondrial DNA.” Journal of Molecular Evolution, 22, 160–174.
Huelsenbeck, J. and Ronquist, F. (2001). “MRBAYES: Bayesian inference of phylogeny.”
Bioinformatics, 17, 754–755.
Jeffreys, H. (1961). Theory of Probability, 3rd edition. London: Oxford University Press.
Jones, D., Taylor, W., and Thornton, J. (1992). “The rapid generation of mutation data
matrices from protein sequences.” CABIOS , 8, 275–282.
Jukes, T. and Cantor, C. (1969). “Evolution of protein molecules.” In Mammalian Protein
Metabolism, ed. H. Munro, 21–132. New York: Academic Press.
Kalbfleisch, J. and Lawless, J. (1985). “The analysis of panel data under a Markov assumption.” Journal of the American Statistical Association, 80, 863–871.
Kimura, M. (1980). “A simple model for estimating evolutionary rates of base substitutions
through comparative studies of nucleotide sequences.” Journal of Molecular Evolution,
16, 111–120.
Leitner, T., Kumar, S., and Albert, J. (1997). “Tempo and mode of nucleotide substitutions
in gag and env gene fragments in human immunodeficiency virus type 1 populations with
a known transmission history.” Journal of Virology, 71, 4761–4770.
Li, S., Pearl, D., and Doss, H. (2000). “Phylogenetic tree construction using Markov chain
Monte Carlo.” Journal of the American Statistical Association, 95, 493–508.
Ronquist, F. and Huelsenbeck, J. (2003). “MRBAYES 3: Bayesian phylogenetic inference
under mixed models.” Bioinformatics, 19, 1572–1574.
Rosenberg, M., Subramanian, S., and Kumar, S. (2003). “Patterns of transitional mutation
biases within and among mammalian genomes.” Molecular Biology and Evolution, 20,
988–993.
Suchard, M., Weiss, R., and Sinsheimer, J. (2001). “Bayesian selection of continuous-time
Markov chain evolutionary models.” Molecular Biology and Evolution, 18, 1001–1013.
Sun, D. and Berger, J. O. (1998). “Reference priors with partial information.” Biometrika,
85, 55–71.
Welch, B. and Peers, H. W. (1963). “On formulae for confidence points based on integrals
of weighted likelihoods.” Journal of the Royal Statistical Society, Series B , 25, 318–329.
Whelan, S. and Goldman, N. (2001). “A general empirical model of protein evolution
derived from multiple protein families using a maximum-likelihood approach.” Molecular
Biology and Evolution, 18, 691–699.
Yang, Z. (1996). “Among-site rate variation and its impact on phylogenetic analyses.”
Trends in Ecology and Evolution, 11, 367–372.
Yang, Z. and Rannala, B. (2005). “Branch-length prior influences Bayesian posterior probability of phylogeny.” Systematic Biology, 54, 455–470.
[Figure 1: pairwise alignment of Sequence 1 (AGCT…) above Sequence 2 (ACAG…), with one homologous site highlighted.]
Fig. 1. Pairwise alignment of two nucleotide sequences with continuous-time Markov chain state
space S = {A, G, C, T }. One homologous site is illustrated by a shaded box; sites are independent
and identically distributed along the entire alignment. The observed data Y consist of the counts nij
of the character i ∈ S in a homologous site in sequence #1 ending as character j ∈ S in sequence
#2. For example, nAA = 2 for the shown sites.
[Figure 2: posterior histograms of the branch length under Exponential and Uniform priors of varying scale (panel titles include Exponential(10), Uniform(0,1), Exponential(1), Uniform(0,2), Exponential(0.5), Uniform(0,10), Exponential(0.1), Uniform(0,100)); each panel plots density against branch length.]
Fig. 2. Estimates of the elapsed time between German and the most recent common ancestor of Romance languages under six standard priors. Histograms summarize the posterior, while we overlay the priors (dashed lines).
[Figure 3: three panels plotting log prior density, relative MSE, and frequentist coverage of 95% credible intervals against the true value of t on [0, 1].]
Fig. 3. Analyses under the Jukes-Cantor model based on the conditional reference prior (solid
black line), a Gamma approximation to the conditional reference prior (dotted-dashed green line), an
Exponential(10) prior (dashed red line) and a Uniform(0,10) prior (dotted blue line). The first panel
compares the log prior densities. The second panel plots the relative mean square error (MSE) of
the Bayesian estimates. The final panel describes the frequentist coverage of the Bayesian 95%
credible intervals.
[Figure 4: two panels plotting Var/t² and (Bias)²/t² against the true value of t.]
Fig. 4. Decomposition of the relative mean square error (MSE) under the Jukes-Cantor model for the
conditional reference prior (solid black line), the Gamma approximation to the conditional reference
prior (dotted-dashed green line), the Exponential(10) prior (dashed red line) and the Uniform(0,10)
prior (dotted blue line). The first panel traces out the posterior estimate variance scaled by the squared true elapsed time t² and the second panel represents the posterior estimate squared bias scaled by t².