Syst. Biol. 55(1):116-121, 2006
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500481648
Fundamental Differences Between the Methods of Maximum Likelihood
and Maximum Posterior Probability in Phylogenetics
BODIL SVENNBLAD,1 PER ERIXON,2 BENGT OXELMAN,2 AND TOM BRITTON3
1 Department of Mathematics, Uppsala University, Box 480, SE-751 06 Uppsala, Sweden; E-mail: [email protected]
2 Department of Systematic Botany, Uppsala University, Norbyvägen 18D, SE-752 36 Uppsala, Sweden; E-mail: [email protected] (P.E.); [email protected] (B.O.)
3 Department of Mathematics, Stockholm University, SE-106 91 Stockholm, Sweden; E-mail: [email protected]
Abstract.— Using a four-taxon example under a simple model of evolution, we show that the methods of maximum likelihood
and maximum posterior probability (which is a Bayesian method of inference) may not arrive at the same optimal tree
topology. Some patterns that are separately uninformative under the maximum likelihood method are separately informative
under the Bayesian method. We also show that this difference has impact on the bootstrap frequencies and the posterior
probabilities of topologies, which therefore are not necessarily approximately equal. Efron et al. (Proc. Natl. Acad. Sci.
USA 93:13429-13434,1996) stated that bootstrap frequencies can, under certain circumstances, be interpreted as posterior
probabilities. This is true only if one includes a noninformative prior distribution of the possible data patterns, and most
often the prior distributions are instead specified in terms of topology and branch lengths. [Bayesian inference; maximum
likelihood method; Phylogeny; support.]
Phylogenetic methods based on likelihood aim to find the best topology by maximizing the likelihood function with respect to topology and branch lengths (maximum likelihood method, e.g., Felsenstein, 1981) or by comparing posterior probabilities for the different possible topologies (Bayesian inference, e.g., Rannala and Yang, 1996). Regardless of the method of inference, a measure of confidence, or support, is often desired for the estimated topology. Felsenstein (1985) suggested a nonparametric bootstrap procedure to obtain such a measure, a procedure that is commonly used with the maximum likelihood method. In Bayesian inference, the posterior probabilities are the support values.

Efron et al. (1996) noted that bootstrap frequencies can be interpreted as posterior probabilities under certain circumstances. That remark has been frequently cited (e.g., Larget and Simon, 1999; Cummings et al., 2003; Simmons et al., 2004) in support of the claim that bootstrap support values and Bayesian posterior probabilities are approximately equal. Other authors have noticed empirically that Bayesian posterior probabilities for the best supported clades are significantly higher than corresponding nonparametric bootstrap frequencies (e.g., Suzuki et al., 2002; Wilcox et al., 2002; Alfaro et al., 2003; Douady et al., 2003; Erixon et al., 2003). We will, in Appendix 1, show that the statement of Efron et al. (1996) is true under the conditions therein given. In this paper, however, we will show that those conditions are violated in general phylogenetic inference.

METHODOLOGY FOR ESTIMATING PHYLOGENETIC TREES

We are interested in finding the phylogenetic tree, $\tau$, of $s$ species from aligned DNA sequences of length $N$. That is, we have an $s \times N$ data matrix, $x$, where each row represents the DNA sequence for one of the species. The columns in $x$ are assumed to be independent and identically distributed (iid), where each column is one of $k = 4^s$ possible ones. Denote the possible columns by $X_1, X_2, \ldots, X_k$ where, e.g., $X_1 = \{A, A, \ldots, A\}$, $X_2 = \{C, C, \ldots, C\}$, etc. The proportions of $X_1, X_2, \ldots, X_k$ generated from the true phylogeny $\tau$ with branch lengths $b^{(\tau)}$ are $p = (p_1, p_2, \ldots, p_k)$, where $\sum p_i = 1$. Each $\{\tau, b^{(\tau)}\}$ induces a different $p$ vector. The $p$-space is a continuous space that can be divided into different subspaces corresponding to the different topologies.

The data matrix $x$ is a sample from the true phylogeny. When assuming sites to be iid, the data matrix $x$ can instead be represented by a vector $n = (n_1, n_2, \ldots, n_k)$, where $\sum n_i = N$ and each $n_i$ is the number of columns in $x$ that equal $X_i$. The vector $p$ can then be estimated by $\hat{p} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_k)$, which depends on $n$. Depending on the method used for phylogenetic inference and the choice of model of evolution, each $n$ vector gives an estimate of the topology. The discrete $n$-space is therefore divided into different regions, $R_i$, giving $\hat{\tau} = \tau_i$, and there might also be regions where the estimate is not unique.

In likelihood-based methods the information in data is contained in the likelihood function, $L(\tau_i, b^{(\tau_i)}, \theta \mid n)$. The likelihood is the probability of data, $n$, $f(n \mid \tau_i, b^{(\tau_i)}, \theta)$, given the topology, $\tau_i$, corresponding branch lengths, $b^{(\tau_i)}$, and model of evolution (with parameters $\theta$), but seen as a function of $\tau_i$, $b^{(\tau_i)}$, and $\theta$.

In the maximum likelihood approach, $L(\tau_i, b^{(\tau_i)}, \theta \mid n)$ is maximized for each $\tau_i$ with respect to $b^{(\tau_i)}$ and $\theta$. The estimated topology $\tau_{ML}$ is the topology $\tau_i$ with the largest maximized likelihood.

In the Bayesian approach, prior distributions for $\tau_i$, $b^{(\tau_i)}$, and $\theta$ have to be specified. The posterior probability can then be calculated by Bayes' theorem. In the method of maximum posterior probability the estimate of the topology, $\tau_{MPP}$, is the topology $\tau_i$ with the largest posterior probability, $\pi(\tau_i \mid n)$ (not the majority-rule consensus tree as given by, e.g., Huelsenbeck and Ronquist, 2001). To summarize, the estimate of the true phylogeny is, depending on the method of inference,

$$ \tau_{ML} = \arg\max_i \left\{ \max_{b,\theta} L(\tau_i, b^{(\tau_i)}, \theta \mid n) \right\}, \qquad \tau_{MPP} = \arg\max_i \, \pi(\tau_i \mid n). \tag{1} $$
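The reduction of the data matrix $x$ to the count vector $n$ described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything used in the paper: the function name `count_columns` and the toy alignment are assumptions introduced here.

```python
from collections import Counter
from itertools import product

def count_columns(alignment):
    """Reduce an s x N alignment (one string per taxon) to column counts.

    The returned Counter plays the role of the vector n: it maps each
    observed column to the number of sites showing it, and returns 0 for
    any column that does not occur in the data.
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    columns = ("".join(col) for col in zip(*alignment))
    return Counter(columns)

# Hypothetical toy alignment: s = 4 taxa, N = 6 sites.
alignment = ["AACGTA", "AACGTC", "CACTTA", "CGCTTA"]
n = count_columns(alignment)
N = sum(n.values())                                 # number of sites
p_hat = {col: cnt / N for col, cnt in n.items()}    # estimate of p from n
k = len(list(product("ACGT", repeat=len(alignment))))  # 4**s possible columns
```

Storing only the observed columns keeps the representation small even though $k = 4^s$ grows quickly with the number of taxa.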
DIFFERENCES BETWEEN MAXIMUM LIKELIHOOD AND MAXIMUM POSTERIOR PROBABILITY

Efron et al. (1996) stated that, "The bootstrap probability that $\hat{\tau}^* = \hat{\tau}$ is almost the same as the aposteriori probability that $\tau = \hat{\tau}$ starting from an uninformative prior density on $p$", where $\hat{\tau}^*$ is the estimate of the topology for a bootstrap replicate. In Appendix 1 this statement is proven. Note, however, that the parameter used is the $4^s$-dimensional vector $p$, where $s$ is the number of taxa. As explained earlier, the method of maximum posterior probability does not use the $p$ vector as a parameter. The parameters used with this method are the topology $\tau_i$, corresponding branch lengths $b^{(\tau_i)}$, and the parameters of the model of evolution used, $\theta$. The posterior probability for the topology is

$$ \pi(\tau_i \mid n) = \frac{\int\!\!\int f(n \mid \tau_i, b^{(\tau_i)}, \theta)\, \pi(\tau_i, b^{(\tau_i)}, \theta)\, db^{(\tau_i)}\, d\theta}{\pi(n)}, \tag{2} $$

where $\pi(\tau_i, b^{(\tau_i)}, \theta)$ is the prior for those parameters. Even though flat priors for topology and branch lengths are used, that is not equivalent to a flat prior for $p$, which simply is not considered in current implementations of Bayesian phylogenetic inference.

Denote the regions in the $n$-space where the method of maximum likelihood gives estimates $\hat{\tau} = \tau_i$ with $R_i$ (see Appendix 1). Denote the corresponding regions where the method of maximum posterior probability gives estimates $\hat{\tau} = \tau_i$ with $R_i'$. In Efron et al. (1996), a distance-based method was used to obtain the regions $R_i$. When it was stated that bootstrap support values could be interpreted as posterior probabilities, those regions remained unchanged. Hence, in their example, $R_i \approx R_i'$. For a comparison of the maximum likelihood method and the method of maximum posterior probability, this means that $\tau_{ML}$ must be equal to $\tau_{MPP}$ for almost all $n$ vectors with $\sum n_i = N$.

To examine whether $\tau_{ML} = \tau_{MPP}$, a situation as simple as possible but where more than one topology is possible has been used; i.e., a four-taxon case under the Jukes-Cantor model (Jukes and Cantor, 1969). Many of the 256 possible columns contribute to the likelihood for $\tau$ and $b^{(\tau)}$ in exactly the same way, so we say they have the same "pattern." For example, AACC contributes to the likelihood in the same way as CCTT (under the Jukes-Cantor model), whereas ACAC is another pattern contributing differently. For the Jukes-Cantor model and four taxa, there are 15 different patterns. The possible different patterns of data at a site for four taxa are summarized in Table 1. If a specific pattern on its own gives a unique estimate of the topology we say it is separately informative.

TABLE 1. With the Jukes-Cantor model of evolution and four taxa there are 15 different patterns contributing to the likelihood in different ways.

Pattern no.   Pattern   Description
1             XXYY      Groups of two nucleotides equal within the
2             XYXY      group but not equal between groups.
3             XYYX
4             XXXY      One nucleotide differs from the rest.
5             XXYX
6             XYXX
7             YXXX
8             XXYZ      One group with two equal nucleotides, the
9             YZXX      other two differ from the group and from
10            XYXZ      each other.
11            XYZX
12            YXXZ
13            YXZX
14            XYZU      All nucleotides different.
15            XXXX      All nucleotides equal.

Assume data for taxa $\{a, b, c, d\}$ consist of one single column of pattern 8, XXYZ. In Appendix 2 it is shown analytically that for this pattern the maximized likelihoods for $\tau_1 = \{(a, b), (c, d)\}$, $\tau_2 = \{(a, c), (b, d)\}$, and $\tau_3 = \{(a, d), (b, c)\}$ are indistinguishable.

Now consider the method of maximum posterior probability for pattern XXYZ. The priors for topology and branch lengths are considered independent; that is, $\pi(\tau_i, b^{(\tau_i)}) = \pi(\tau_i)\, \pi(b^{(\tau_i)})$. Assume flat priors for topology, $\pi(\tau_i) = 1/3$, as well as for branch lengths, $\pi(b^{(\tau_i)}) = 1/M$; that is, the prior for a branch length is uniform on the interval $[0, M]$ with expected value $M/2$. The priors will then only be scaling factors and the main contribution to the posterior distribution is the integral of the likelihood. Those integrals can be calculated analytically:

$$ I_1 = \int_0^M \!\!\cdots\! \int_0^M L(\tau_1, b^{(\tau_1)} \mid n)\, db^{(\tau_1)} = \frac{M^5 + 2M^3(1 - e^{-M})^2 - 4M^2(1 - e^{-M})^3 - 3M(1 - e^{-M})^4 + 4(1 - e^{-M})^5}{4^4}, \tag{3} $$

$$ I_2 = \int_0^M \!\!\cdots\! \int_0^M L(\tau_2, b^{(\tau_2)} \mid n)\, db^{(\tau_2)} = \frac{M^5 - 2M^3(1 - e^{-M})^2 + M(1 - e^{-M})^4}{4^4}. \tag{4} $$

For $M > 0$ the difference between Equations (3) and (4) is

$$ \frac{4M^3(1 - e^{-M})^2 - 4M^2(1 - e^{-M})^3 - 4M(1 - e^{-M})^4 + 4(1 - e^{-M})^5}{4^4} = \frac{(1 - e^{-M})^2 \left(M - (1 - e^{-M})\right)^2 \left(M + (1 - e^{-M})\right)}{4^3}. \tag{5} $$
All the factors of (5) are strictly positive, and hence the difference is strictly positive and $I_1 > I_2$.

The posterior probability of $\tau_i$, $\pi(\tau_i \mid n)$, is obtained from (2). To calculate the denominator $\pi(n)$ in (2) one has to integrate over the branch lengths $b$ and the parameter(s) $\theta$ for all possible topologies and sum those, a complicated numerical task. However, if it is possible to calculate the numerator in (2) for all possible topologies, the posterior probability for $\tau_i$ can be obtained by dividing the numerator in (2) for $\tau_i$ by the sum of the numerators for all topologies. Then $\pi(\tau_i \mid n)$ can be calculated analytically from (3) and (4) as

$$ \pi(\tau_i \mid n) = \frac{I_i}{I_1 + I_2 + I_3}, $$

where $I_3 = I_2$. Because $I_1 > I_2$, the posterior probability for $\tau_1$ will be larger than for $\tau_2$ and $\tau_3$ and hence, using flat priors, $\tau_{MPP} = \tau_1$ for pattern XXYZ. The posterior probability will tend to 1/3 as $M$, the length of the interval of the prior for branch lengths, increases. For any finite $M$, though, $\tau_1$ will be the unique estimate. Using an exponential distribution with mean $1/\lambda$ as prior for branch lengths but still using a uniform distribution as prior for topology, the posterior probabilities can be written as
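$\pi(\tau_i \mid n) = A_i/(A_1 + A_2 + A_3)$, where

$$ A_1 = \frac{1}{4^4}\left[\frac{1}{\lambda^5} + \frac{2}{\lambda^3\gamma^2} - \frac{4}{\lambda^2\gamma^3} - \frac{3}{\lambda\gamma^4} + \frac{4}{\gamma^5}\right], \qquad A_2 = A_3 = \frac{1}{4^4}\left[\frac{1}{\lambda^5} - \frac{2}{\lambda^3\gamma^2} + \frac{1}{\lambda\gamma^4}\right], $$

where $\gamma = (\lambda + 1)$. Pattern XXYZ is separately informative for the maximum posterior probability method with those priors too, because $A_1 > A_2$ for all $\lambda > 0$.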
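The posterior probabilities for pattern XXYZ follow directly from the closed forms above, and a short Python sketch makes the dependence on the prior easy to explore. It evaluates Equations (3) and (4) for the uniform prior and, for the exponential prior, the terms obtained by replacing $M$ with $1/\lambda$ and $1 - e^{-M}$ with $1/(\lambda + 1)$; that substitution is an assumption of this sketch (it follows from integrating the same likelihood against $e^{-\lambda t}$ on $[0, \infty)$), and all function names are hypothetical.

```python
import math

def uniform_integrals(M):
    """Closed forms (3) and (4): likelihood integrated over [0, M]^5."""
    a = 1.0 - math.exp(-M)
    I1 = (M**5 + 2*M**3*a**2 - 4*M**2*a**3 - 3*M*a**4 + 4*a**5) / 4**4
    I2 = (M**5 - 2*M**3*a**2 + M*a**4) / 4**4
    return I1, I2

def exponential_integrals(lam):
    """Analogous terms for an exponential prior with intensity lam.

    Assumption of this sketch: M is replaced by 1/lam and (1 - exp(-M))
    by 1/(lam + 1) in the closed forms (3) and (4).
    """
    g = lam + 1.0
    A1 = (1/lam**5 + 2/(lam**3*g**2) - 4/(lam**2*g**3)
          - 3/(lam*g**4) + 4/g**5) / 4**4
    A2 = (1/lam**5 - 2/(lam**3*g**2) + 1/(lam*g**4)) / 4**4
    return A1, A2

def posterior_tau1(w1, w2):
    """Posterior of tau_1 under a flat topology prior, using I_3 = I_2."""
    return w1 / (w1 + 2.0 * w2)

for M in (0.5, 1.0, 10.0, 1000.0):
    I1, I2 = uniform_integrals(M)
    print(f"uniform prior on [0, {M:g}]: pi(tau_1 | n) = {posterior_tau1(I1, I2):.6f}")

for lam in (1.0, 10.0):
    A1, A2 = exponential_integrals(lam)
    print(f"exponential prior, intensity {lam:g}: pi(tau_1 | n) = {posterior_tau1(A1, A2):.6f}")
```

Because only ratios enter the posterior, the common factor $1/4^4$ and the normalizing constants of the priors cancel and could be dropped.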
Analogous to the analysis of pattern XXYZ it can be
shown that the two methods of inference differ for many
of the 15 patterns. For the maximum likelihood method
only patterns 1, 2, and 3 are separately informative. These
are separately informative for the maximum posterior
probability method also, but so are patterns 8 to 13 (see
Table 2).
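The grouping of the 256 possible four-taxon columns into the 15 patterns of Table 1 can be reproduced mechanically. The sketch below is illustrative only: it relabels each column by order of first appearance and looks up the pattern number, and the names `CANONICAL_TO_PATTERN` and `pattern_number` are assumptions introduced here.

```python
from collections import Counter
from itertools import product

# Canonical form (letters assigned in order of first appearance) mapped to
# the pattern number of Table 1. Table 1 writes some patterns with other
# letters; e.g., pattern 7 (YXXX) has the canonical form XYYY.
CANONICAL_TO_PATTERN = {
    "XXYY": 1, "XYXY": 2, "XYYX": 3,
    "XXXY": 4, "XXYX": 5, "XYXX": 6, "XYYY": 7,
    "XXYZ": 8, "XYZZ": 9, "XYXZ": 10, "XYZX": 11, "XYYZ": 12, "XYZY": 13,
    "XYZU": 14, "XXXX": 15,
}

def pattern_number(column):
    """Map a four-taxon column such as 'AACC' to its pattern number (1-15)."""
    labels = {}
    for base in column:
        if base not in labels:
            labels[base] = "XYZU"[len(labels)]
    return CANONICAL_TO_PATTERN["".join(labels[base] for base in column)]

# Tally how many of the 4**4 = 256 possible columns fall into each pattern.
columns_per_pattern = Counter(
    pattern_number("".join(col)) for col in product("ACGT", repeat=4)
)
```

With such a function, a count vector over columns can be collapsed to counts $n_1, \ldots, n_{15}$ over patterns, which is the representation used in the example that follows.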
Even though a site is separately uninformative, it contributes to the likelihood function and therefore cannot be ignored. The fundamental differences between the two methods of inference, summarized in Table 2, have impact on the regions $R_i$ and $R_i'$, as the next example will show.
TABLE 2. Separately informative patterns favoring topologies $\tau_1 = \{(a, b), (c, d)\}$, $\tau_2 = \{(a, c), (b, d)\}$, and $\tau_3 = \{(a, d), (b, c)\}$ for the methods of maximum likelihood (ML) and maximum posterior probability (MPP), respectively. The enumeration of the patterns follows Table 1.

           Estimate of topology
Method     tau_1       tau_2        tau_3
ML         1           2            3
MPP        1, 8, 9     2, 10, 13    3, 11, 12
[Figure 1 panel titles: "ML-method"; "MPP, exp(10) prior for t".]

FIGURE 1. Regions $R_i$, where $\hat{\tau} = \tau_i$, for the maximum likelihood method (a) and Bayesian inference (b) with $n_1 + n_2 + n_8 = 100$. The number of columns in the data matrix that are equal to pattern 1 is on the x-axis, the number of pattern 2 on the y-axis, and for each dot in the triangle bounded by the axes and $n_1 + n_2 = 100$ we have $n_8 = 100 - n_1 - n_2$.
Example

Suppose $N = 100$ and data only consist of patterns 1, 2, and 8 (e.g., $n_1 = 45$, $n_2 = 35$, $n_8 = 20$, $n_i = 0$ for $i = 3, \ldots, 7, 9, \ldots, 15$). For this data set $\tau_{ML} = \tau_{MPP} = \tau_1$. From Table 2 we see that patterns 1 and 2 are separately informative, favoring topologies $\tau_1$ and $\tau_2$, respectively, for both methods of inference. Pattern 8 is, as we have shown analytically, separately informative for the maximum posterior probability method, favoring topology $\tau_1$, but not for the maximum likelihood method. Columns of this pattern still contribute to the likelihood and are therefore taken into account even in the maximum likelihood method. A consequence of this is the region $R_0$ in Figure 1, where no unique estimate of topology can be found. The regions in Figure 1 have been determined by using PAUP* v4.0b10 (Swofford, 2002) for all possible $n$-vectors with $n_i = 0$ for all $i$ except $n_1 \geq 0$, $n_2 \geq 0$, and $n_8 \geq 0$ with $\sum n_i = 100$ for the maximum likelihood method. For the Bayesian inference, $R_i'$ was determined by using MrBayes 3.04 (Huelsenbeck and Ronquist, 2001) with an exponential prior with intensity 10 (mean 1/10) for branch lengths and a uniform prior for topology. Figure 1 shows that $R_i \neq R_i'$, and the bootstrap support value for $\tau_{ML}$ is smaller (0.8715 when approximated by bootstrap frequencies from PAUP*) than the posterior probability for $\tau_{MPP}$ (which is approximated by MrBayes to 1.0).

With an original data set far away from the borders between different regions $\{R_i\}$ and $\{R_i'\}$, respectively (e.g., $n_1 = 70$, $n_2 = 10$, and $n_8 = 20$ in Figure 1), the bootstrap support value and the posterior probabilities will both be high, tending to 1, even though $R_i \neq R_i'$. This is due to the fact that the probability of a bootstrap replicate ending up in a region other than $R_i$ is then very small.
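How the position of $n$ relative to a region border drives the bootstrap support can be illustrated with a small simulation in the $n$-space. The sketch below resamples columns with replacement and reports the fraction of replicates that receive the same estimate as the original data. The rule `toy_estimate` is a deliberately simple, hypothetical stand-in for a tree-estimation method; it is neither the maximum likelihood nor the maximum posterior probability method, so the resulting numbers are not those of Figure 1.

```python
import random

def bootstrap_support(n, estimate, replicates=2000, seed=0):
    """Fraction of bootstrap replicates n* with the same estimate as n.

    n        : dict mapping pattern number -> count
    estimate : function mapping such a dict to a topology label
    """
    rng = random.Random(seed)
    patterns = list(n)
    weights = [n[p] for p in patterns]
    N = sum(weights)
    original = estimate(n)
    hits = 0
    for _ in range(replicates):
        # Draw N columns with replacement: n* ~ Multinomial(N, n / N).
        draw = rng.choices(patterns, weights=weights, k=N)
        n_star = {p: 0 for p in patterns}
        for p in draw:
            n_star[p] += 1
        hits += estimate(n_star) == original
    return hits / replicates

def toy_estimate(n):
    """HYPOTHETICAL region rule, for illustration only: compare the counts
    of patterns 1 and 2 and ignore pattern 8."""
    if n[1] > n[2]:
        return "tau_1"
    if n[2] > n[1]:
        return "tau_2"
    return "not unique"

far = bootstrap_support({1: 70, 2: 10, 8: 20}, toy_estimate)   # far from the border
near = bootstrap_support({1: 45, 2: 35, 8: 20}, toy_estimate)  # closer to the border
```

Replacing `toy_estimate` with a call to an actual inference program would turn the same loop into the usual nonparametric bootstrap.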
CONCLUDING REMARKS
We have shown that the often cited statement of Efron
et al. (1996) about approximate equivalence between
bootstrap support values and posterior probabilities is
true when the only parameter of interest is $p$ and a noninformative prior is used. This parameter is not considered in the current implementations of Bayesian inference of phylogenies, where the parameters topology, $\tau$, branch lengths, $b^{(\tau)}$, and model parameters, $\theta$, are considered, nor in the maximum likelihood method.
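The approximate equivalence for $p$ can be checked numerically by comparing the first two moments of the bootstrap proportions with those of the Dirichlet posterior, as derived in Appendix 1 (Equations 8 to 13). The following Python sketch does this for a hypothetical count vector over only three categories; the counts and function names are assumptions chosen for illustration.

```python
def bootstrap_moments(n, i, j):
    """Mean and variance of p*_i and covariance of (p*_i, p*_j),
    where p* = n*/N is the bootstrap replicate of the proportions."""
    N = sum(n)
    mean = n[i] / N
    var = n[i] * (N - n[i]) / N**3
    cov = -n[i] * n[j] / N**3
    return mean, var, cov

def dirichlet_posterior_moments(n, i, j, theta):
    """Same moments under the posterior D(p | theta + n) obtained from a
    symmetric Dirichlet prior with parameter theta."""
    N, k = sum(n), len(n)
    a0 = N + theta * k
    ai, aj = n[i] + theta, n[j] + theta
    scale = a0**2 * (a0 + 1)
    return ai / a0, ai * (a0 - ai) / scale, -ai * aj / scale

# Hypothetical counts: large N, small k.
n = [4500, 3500, 2000]
for theta in (1.0, 0.5, 1e-9):   # flat, Jeffreys, and close to "prior ignorance"
    print(theta, bootstrap_moments(n, 0, 1),
          dirichlet_posterior_moments(n, 0, 1, theta))
```

The agreement degrades when $N$ is small relative to $k$, which is the "very large $N$ and $k$ small" condition stated in Appendix 1.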
For the four-taxon case with Jukes-Cantor model of
evolution, the 256 different possible data columns can
be reduced to 15 different patterns contributing to the
likelihood function in different ways. We have shown
analytically that some of those patterns are separately
informative for the method of maximum posterior probability but not for the method of maximum likelihood.
The differences between the methods remain when
more taxa are considered. In the five-taxon case, $\{a, b, c, d, e\}$, there are $4^5 = 1024$ different possible data columns that can, under the Jukes-Cantor model, be reduced to 51 different data patterns. For example, for pattern XXYYZ, seven topologies have the same maximized likelihood, all of them obtained on a border where at least one internal branch is of length 0; i.e., the tree collapses (see Appendix 2 for an analogous case with four taxa).
Some of the topologies collapse to the same, not fully resolved, tree. For this specific pattern, four different collapsed trees are obtained. The groups {a, b} and {c,d}
exist in three of them, respectively. The tree where the
group is absent does not contradict the group. Hence,
with the method of maximum likelihood, this pattern
is not informative for topology but is informative for
some groups. The method of maximum posterior probability, on the other hand, gives a unique estimate of
the topology, (a, b, (d, (e, /))), and hence also of groups.
Numerically, by using PAUP* v4.0b10 and MrBayes 3.04
we have found that none of the 51 patterns gives a
unique estimate of the topology with maximum likelihood but 15 patterns give fully resolved trees with maximum posterior probability. This is also the case for the
187 different data patterns of the six-taxon case. The
posterior probabilities of groups are, for six taxa and
more, also strongly influenced by unequal priors for
groups that follow a flat prior for topologies (Pickett and
Randle, 2005), which may further explain the observed
disparities between bootstrap values and posterior
probabilities.
The fundamental differences between the methods of
maximum likelihood and maximum posterior probabilities that we have demonstrated influence the regions of
the n space where the estimates equal a specific topology.
Those regions influence the bootstrap support values and
posterior probabilities differently, and they should therefore not be expected to be approximately equal. How much of the observed difference between maximum likelihood bootstrap values and posterior probabilities can be attributed to this factor is not clear and needs to be studied further.
ACKNOWLEDGEMENTS
We are grateful to Roderic D. M. Page, Jeff Thorne, and an anonymous reviewer for constructive criticism of an earlier version of this
manuscript. This work was supported by the Linnaeus Centre for Bioinformatics, Uppsala University, and the Swedish Research Council.
REFERENCES
Alfaro, M. E., S. Zoller, and F. Lutzoni. 2003. Mol. Biol. Evol. 20:255-266.
Berger, J. O. 1980. Statistical decision theory, foundations, concepts and methods. Springer-Verlag, New York.
Bernardo, J. M., and A. F. M. Smith. 1994. Bayesian theory. John Wiley & Sons, New York.
Box, G. E., and G. C. Tiao. 1973. Bayesian inference in statistical analysis. Addison-Wesley Publishing Company.
Cummings, M. P., S. A. Handley, D. S. Myers, D. L. Reed, A. Rokas, and K. Winka. 2003. Syst. Biol. 52:477-487.
Douady, C. J., F. Delsuc, Y. Boucher, W. F. Doolittle, and E. J. P. Douzery. 2003. Mol. Biol. Evol. 20:248-254.
Efron, B. 1982. SIAM CBMS-NSF Monograph 38.
Efron, B., E. Halloran, and S. Holmes. 1996. Proc. Natl. Acad. Sci. USA 93:13429-13434.
Erixon, P., B. Svennblad, T. Britton, and B. Oxelman. 2003. Syst. Biol. 52:665-673.
Felsenstein, J. 1981. J. Mol. Evol. 17:368-376.
Felsenstein, J. 1985. Evolution 39:783-791.
Huelsenbeck, J. P., and F. Ronquist. 2001. Bioinformatics 17:754-755.
Jukes, T. H., and C. R. Cantor. 1969. Pages 21-132 in Mammalian protein metabolism (H. N. Munro, ed.). Academic Press, New York.
Larget, B., and D. L. Simon. 1999. Mol. Biol. Evol. 16:750-759.
Pickett, K. M., and C. P. Randle. 2005. Mol. Phylogenet. Evol. 34:203-211.
Rannala, B., and Z. Yang. 1996. J. Mol. Evol. 43:304-311.
Simmons, M. P., K. M. Pickett, and M. Miya. 2004. Mol. Biol. Evol. 21:188-199.
Steel, M. 1994. Syst. Biol. 43:560-564.
Suzuki, Y., G. V. Glazko, and M. Nei. 2002. Proc. Natl. Acad. Sci. USA 99:16138-16143.
Swofford, D. L. 2002. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer Associates, Sunderland, Massachusetts.
Wilcox, T. P., D. J. Zwickl, T. A. Heath, and D. M. Hillis. 2002. Mol. Phylogenet. Evol. 25:361-371.
First submitted 1 February 2005; reviews returned 6 April 2005;
final acceptance 11 August 2005
Associate Editor: Rod Page
APPENDIX 1
In Efron et al. (1996), an example is used where the data $x$ are a two-dimensional normal vector with expectation vector $\mu = (\mu_1, \mu_2)$ and with identity covariance matrix. It is noted that if $\mu$ a priori could be anywhere in the plane with equal probability, the a posteriori distribution of $\mu$, given $\hat{\mu} = x$, is also normally distributed with expectation vector $\hat{\mu}$ and identity covariance matrix, exactly the same as the bootstrap distribution of $\hat{\mu}^*$, using parametric bootstrap. Referring to this example it is further stated that "The bootstrap probability that $\hat{\tau}^* = \hat{\tau}$ is almost the same as the aposteriori probability that $\tau = \hat{\tau}$ starting from an uninformative prior density on $p$", where $\hat{\tau}^*$ is the estimate of the topology for a bootstrap replicate and $p$ is the vector of proportions of the possible data patterns $X_1, X_2, \ldots, X_{4^s}$, where $s$ is the number of taxa and, e.g., $X_1 = \{A, A, \ldots, A\}$, $X_2 = \{C, C, \ldots, C\}$, etc. We will now show that the statement is true if a noninformative prior is used for $p$ under the assumption that the method of inference gives the correct tree.

The data, $x$, is an $s \times N$ data matrix of aligned DNA sequences. Obtaining this data matrix is equivalent to drawing $N$ columns from the true phylogeny where each possible column $X_1, X_2, \ldots, X_k$ has probability $p_1, p_2, \ldots, p_k$ of being drawn, where $k = 4^s$. The data matrix $x$ can then be represented by a vector $n = (n_1, \ldots, n_k)$ where $\sum n_i = N$ and each $n_i$ is the number of columns in $x$ that equal $X_i$. The probability of data or, equivalently, the probability of $n$, the likelihood, then follows a multinomial distribution.

In the nonparametric bootstrap method used in phylogenetic inference (Felsenstein, 1985), a bootstrap sample of the same size as the original data is obtained by drawing columns from the matrix $x$ with replacement. The bootstrap replicate $n^* = (n_1^*, n_2^*, \ldots, n_k^*)$ follows, regardless of the method of inference used, a multinomial distribution
with probability function $f$ defined by

$$ f(n_1^*, n_2^*, \ldots, n_k^*) = \frac{N!}{n_1^*!\, n_2^*! \cdots n_k^*!}\; \hat{p}_1^{\,n_1^*} \hat{p}_2^{\,n_2^*} \cdots \hat{p}_k^{\,n_k^*}, \tag{6} $$

where $\hat{p}_i = n_i/N$, with means $E(n_i^*) = N\hat{p}_i = n_i$, variances $V(n_i^*) = N\hat{p}_i(1 - \hat{p}_i) = n_i(N - n_i)/N$, and covariances $C(n_i^*, n_j^*) = -N\hat{p}_i\hat{p}_j = -n_i n_j/N$.

Now consider the Bayesian framework. The prior for $p$ should, according to Efron et al. (1996), be noninformative. This term is not uniquely defined in the literature. For example, Berger (1980) says "a non-informative prior, by which is meant a prior which contains no information about the parameter (or more crudely which favors no possible values of the parameter over others)." Bernardo and Smith (1994) describe the idea of a non-informative prior as representing ignorance. Jeffreys' rule for multiparameter problems, described, e.g., in Box and Tiao (1973), states that the prior distribution for a set of parameters is approximately noninformative if "the prior distribution for a set of parameters is taken to be proportional to the square root of the determinant of the information matrix," which relies on invariance under parameter transformation.

The three different "definitions" of non-informative priors for $p$ given above can all be represented by a symmetric Dirichlet distribution

$$ D(p \mid \theta) = \frac{\Gamma(k\theta)}{\Gamma(\theta)^k} \prod_{i=1}^{k} p_i^{\theta - 1}, \tag{7} $$

where $\theta = (1, \ldots, 1)$ represents a flat (uniform) prior, $\theta = (\tfrac{1}{2}, \ldots, \tfrac{1}{2})$ represents Jeffreys' prior, and letting $\theta \to (0, \ldots, 0)$, as in Efron (1982), represents "prior ignorance." Because the Dirichlet distribution is a conjugate prior to the multinomial distribution, the posterior distribution of $p$ is $D(p \mid \theta + n)$ with means

$$ E(p_i) = \frac{n_i + \theta}{N + \theta k}, \tag{8} $$

variances

$$ V(p_i) = \frac{(n_i + \theta)(N + \theta k - n_i - \theta)}{(N + \theta k)^2 (N + \theta k + 1)}, \tag{9} $$

and covariances

$$ C(p_i, p_j) = -\frac{(n_i + \theta)(n_j + \theta)}{(N + \theta k)^2 (N + \theta k + 1)}. \tag{10} $$

From the means, variances, and covariances following (6) it is seen that for the bootstrap replicate, where $p^* = n^*/N$,

$$ E(p_i^*) = E(n_i^*/N) = n_i/N, \tag{11} $$

$$ V(p_i^*) = V(n_i^*/N) = n_i(N - n_i)/N^3, \tag{12} $$

$$ C(p_i^*, p_j^*) = C(n_i^*/N, n_j^*/N) = -n_i n_j/N^3. \tag{13} $$

For very large $N$ and $k$ small, that is for a lot of data on only a few taxa, (8) $\approx$ (11), (9) $\approx$ (12), and (10) $\approx$ (13) for each of the priors suggested. As we assume $N$ to be large, the multivariate central limit theorem makes the bootstrap distribution and the posterior distribution for $p$ approximately equal.

Recall that for the data, $n$, $\sum n_i = N$. Each possible $n$-vector satisfying $\sum n_i = N$ gives an estimate of the topology. The discrete $n$-space is therefore divided into different regions $R_i$, with $\hat{\tau} = \tau_i$. There might also be regions where the estimate is not unique. Denote these regions with $R_0$. The bootstrap support value $\alpha$ is the probability that the bootstrap replicate $n^*$ ends up in the same region as $n$; that is, the probability that the bootstrap replicate gives the same estimate of the topology as the original data set. If the division of the $n$-space into regions $R_i$ (giving estimate $\hat{\tau} = \tau_i$) were known explicitly, then the bootstrap support value $\alpha$ could be calculated exactly by (6) as

$$ \alpha = \sum_{n^* \in R_i} f(n_1^*, n_2^*, \ldots, n_k^*). \tag{14} $$

Usually, the $n$-space is very complex and the division of it is therefore practically impossible to determine. The bootstrap support value is then estimated by the fraction of bootstrap replicates giving the same estimate of topology as the original data when drawing bootstrap replicates many times (Fig. 2).

FIGURE 2. The continuous $p$-space is $k$-dimensional where $k = 4^s$ and $s$ is the number of taxa. Here this is illustrated to the left in two dimensions with a hypothetical division of the $p$-space giving subspaces $R_i'$, where $\tau = \tau_i$. To the right the corresponding $\hat{p}$-space is illustrated where each dot represents a possible $\hat{p}$-vector. The regions $R_i$ and $R_i'$ may differ as $R_i'$ is the true division of the $p$-space, whereas $R_i$ is the region where the estimate is equal to $\tau_i$. In the region $R_0$ there is no unique estimate.

We have seen that theoretically the posterior distribution and the bootstrap distribution are approximately equal. Let the continuous region where $\tau = \tau_i$ be denoted $R_i'$. To get the posterior distribution of $\tau_i$, the posterior distribution of $p$ is integrated over the region $R_i'$. If $R_i \approx R_i'$ and $N$ is large, an integral can be approximated by a sum; i.e.,

$$ \int_{R_i'} \pi(p \mid n)\, dp \approx \sum_{n^* \in R_i} f(n^* \mid n) = \alpha. \tag{15} $$

Assuming that the method of inference gives the correct tree, (15) makes the bootstrap support value and the Bayesian posterior probability for a given topology theoretically approximately equal.

APPENDIX 2

In this section we show that there is no unique estimate for topology for the maximum likelihood method when data consist of one column of pattern 8, XXYZ, and the Jukes-Cantor model of evolution is used. The three possible topologies are then $\tau_1$, $\tau_2$, and $\tau_3$ (see Fig. 3) with branch lengths $t$, $b$, and $v$, respectively.

Assume the Jukes-Cantor model of evolution, with $\alpha = 0.25$. The likelihood functions for $\tau_2$ and $\tau_3$ will be identical (for $\tau_3$ replace $b_i$ with $v_i$ in the likelihood function for $\tau_2$). Denote the likelihood function for $\tau_1$ with $L_1$ and for $\tau_2$ (and $\tau_3$) with $L_2$.
We want, in the semiclosed five-dimensional continuous space bounded by $0 \leq t_1, 0 \leq t_2, \ldots, 0 \leq t_5$, to find a point $t$ such that $L_1(t)$ is maximized. If we had a closed five-dimensional space, such that $0 \leq t_i \leq M$, $i = 1, \ldots, 5$, for some positive $M$, there would be four possibilities for a maximum point $t$ (shown in Fig. 4 for a three-dimensional case):
1. $t$ is in the interior of the closed space,
2. $t$ is in a boundary region of the space,
3. $t$ is on a border between the different boundary regions,
4. $t$ is in a corner of the closed space.

FIGURE 3. The possible topologies $\tau_1$, $\tau_2$, and $\tau_3$ with branches $t$, $b$, and $v$, respectively, when data consist of one site: {XXYZ}.
To find the maximum value of $L_1$ in a closed space, find, to begin with, possible extreme points in the interior of the closed space (case 1). There is no need to further investigate the kind of possible extreme points; simply evaluate $L_1$ in those points. Then investigate the boundary regions (case 2) and evaluate $L_1$ in the possible extreme points of these regions. Continue in the same way with case 3 and finally evaluate $L_1$ in the corners of the closed space (case 4). The maximum value of $L_1$ in the closed continuous space is the maximum value of the evaluated points.

If, in the five-dimensional closed space, $t$ is an interior point, the tree will be fully resolved. If $t$ is a boundary point with some $t_i = 0$, the tree will collapse.

If there exists an interior $t$ that is extreme, i.e., maximizes or minimizes $L_1$ or is a so-called saddlepoint, the gradient $\nabla L_1 = (L'_{1(1)}, \ldots, L'_{1(5)}) = (0, \ldots, 0)$. From $L'_{1(1)} - L'_{1(2)} = 0$ and $L'_{1(3)} - L'_{1(4)} = 0$ it is found that $t_1 = t_2$ and $t_3 = t_4$. From $L'_{1(1)} - L'_{1(5)} = 0$, and by using $t_1 = t_2$ and $t_3 = t_4$, it also follows that $e^{-t_5} = \left(-3e^{-t_1}(1 + e^{-t_3})\right)/\left(2e^{-t_3}\right) < 0$, which is impossible because $0 < e^{-t_5} < 1$ in the interior of the compact space. Therefore, $L_1$ has no interior extreme point (case 1).
Continue with cases 2 and 3. In the closed five-dimensional space the boundary regions to be investigated are obtained by letting combinations of variables be fixed, 0 or $M$. At least one variable must not be fixed as otherwise a corner is investigated (case 4). When a variable is fixed to 0, or $M$, the likelihood function looks slightly different and an extreme point is searched for in the usual way. In the five-dimensional space there are 210 boundary regions and borders to be investigated. By investigating all of them it is found that only three of the regions have a possible interior extreme point, no matter how large $M$ is. Evaluating $L_1$ at those possible extreme points gives $L_1 < 4^{-4}$ for all three of them.

The last possibility for a maximum point is at the corners of the closed space (case 4); i.e., where each $t_i$ is equal to 0 or $M$. For 14 of the 32 possible corners, $L_1 = 0$; for the others, $L_1 < 4^{-3}$. As long as we consider finite $M$, the maximum point for $L_1$ is in the corner $(0, 0, M, M, M)$ for which $L_1 = [1 - 3e^{-2M} + 2e^{-3M}]/4^3$. As $M$ increases, the value of $L_1$ in that corner tends to $4^{-3}$, but there are other corners with the same limit value. Hence the maximum value of the likelihood will not be a unique point in the space of branch lengths. The fact that this can happen was pointed out by Steel (1994).

FIGURE 4. The possible locations of a maximum point $\{x, y, z\}$ in a closed three-dimensional space: 1 an interior point, 2 a point on a boundary region, 3 a point on the border between different boundary regions, and 4 a corner, where each variable is either 0 or $M$.
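Now, consider topologies $\tau_2$ (and $\tau_3$). The likelihood $L_2$ is

$$ \begin{aligned} L_2(b) = \frac{1}{4^4}\big[\, & 1 - e^{-(b_1+b_2)} - e^{-(b_3+b_4)} + e^{-(b_1+b_2+b_3+b_4)} + 3e^{-(b_1+b_3+b_5)} - e^{-(b_1+b_4+b_5)} - e^{-(b_2+b_3+b_5)} - e^{-(b_2+b_4+b_5)} \\ & - 2e^{-(b_1+b_2+b_3+b_5)} + 2e^{-(b_1+b_2+b_4+b_5)} - 2e^{-(b_1+b_3+b_4+b_5)} + 2e^{-(b_2+b_3+b_4+b_5)} \,\big]. \end{aligned} $$

Analyzing $L_2$ in the same way as $L_1$ gives a unique maximum at the corner $b = (0, M, 0, M, 0)$, where $L_2 = (1 - 2e^{-M} + e^{-2M})/4^3$, giving $\max L_2(b) = 4^{-3}$ as $M \to \infty$, which is the same as the maximum value of $L_1$. Hence, pattern XXYZ is not separately informative for the maximum likelihood method.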
In Figure 5 the maximized likelihoods for $\tau_1$ and $\tau_2$ are shown as a function of $M$. As can be seen, the maximized likelihoods are indistinguishable for large but finite $M$.
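The corner values quoted in this appendix can be checked by brute force, summing over the unobserved states at the two internal nodes of the four-taxon tree. The Python sketch below assumes the Jukes-Cantor transition probabilities $1/4 + (3/4)e^{-t}$ (same state) and $1/4 - (1/4)e^{-t}$ (different state), which correspond to $\alpha = 0.25$; the function names and the ordering of the branch lengths (two branches of the first pair, two of the second pair, then the internal branch) are assumptions of this sketch.

```python
import math
from itertools import product

STATES = "ACGT"

def p_jc(x, y, t):
    """Jukes-Cantor transition probability with alpha = 0.25,
    so that exp(-4 * alpha * t) = exp(-t)."""
    e = math.exp(-t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def likelihood(pair1, pair2, lengths):
    """Site likelihood for a four-taxon tree joining two pairs of taxa.

    pair1, pair2 : observed states of the two taxa in each pair
    lengths      : (t1, t2, t3, t4, t5); t1, t2 lead to pair1, t3, t4 to
                   pair2, and t5 is the internal branch
    """
    t1, t2, t3, t4, t5 = lengths
    total = 0.0
    for u, v in product(STATES, repeat=2):   # states at the internal nodes
        total += (0.25 * p_jc(u, pair1[0], t1) * p_jc(u, pair1[1], t2)
                  * p_jc(u, v, t5)
                  * p_jc(v, pair2[0], t3) * p_jc(v, pair2[1], t4))
    return total

# Pattern XXYZ for taxa (a, b, c, d), written with the bases A, A, C, G.
for M in (1.0, 5.0, 20.0):
    L1 = likelihood("AA", "CG", (0, 0, M, M, M))   # tau_1 = {(a, b), (c, d)}
    L2 = likelihood("AC", "AG", (0, M, 0, M, 0))   # tau_2 = {(a, c), (b, d)}
    print(f"M = {M:g}: L1 = {L1:.8f}, L2 = {L2:.8f}, 4**-3 = {4**-3:.8f}")
```

Both values approach $4^{-3}$ from below as $M$ grows, which is the behavior summarized in Figure 5.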
FIGURE 5. The maximized likelihood (y-axis) for topology $\tau_1$ and $\tau_2$ (which has the same likelihood function as $\tau_3$) for pattern XXYZ. The likelihood for $\tau_1$ is maximized for branch lengths $(0, 0, M, M, M)$ whereas the likelihood for $\tau_2$ is maximized for branch lengths $(0, M, 0, M, 0)$.