Quartet-Mapping, a Generalization of the Likelihood-Mapping Procedure
Kay Nieselt-Struwe* and Arndt von Haeseler†
*Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany; and †Max-Planck-Institut für evolutionäre
Anthropologie, Leipzig, Germany
Likelihood-mapping (LM) was suggested as a method of displaying the phylogenetic content of an alignment.
However, statistical properties of the method have not been studied. Here we analyze the special case of a four-species tree generated under a range of evolution models and compare the results with those of a natural extension
of the likelihood-mapping approach, geometry-mapping (GM), which is based on the method of statistical geometry
in sequence space. The methods are compared in their abilities to indicate the correct topology. The performance
of both methods in detecting the star topology is especially explored. Our results show that LM tends to reject a
star tree more often than GM. When assumptions about the evolutionary model of the maximum-likelihood reconstruction are not matched by the true process of evolution, then LM shows a tendency to favor one tree, whereas
GM correctly detects the star tree except for very short outer branch lengths with a statistical significance of >0.95
for all models. LM, on the other hand, reconstructs the correct bifurcating tree with a probability of >0.95 for most
branch length combinations even under models with varying substitution rates. The parameter domain for which
GM recovers the true tree is much smaller. When the exterior branch lengths are larger than a (analytically derived)
threshold value depending on the tree shape (rather than the evolutionary model), GM reconstructs a star tree rather
than the true tree. We suggest a combined approach of LM and GM for the evaluation of starlike trees. This
approach offers the possibility of testing for significant positive interior branch lengths without extensive statistical
and computational efforts.
Introduction
Recently, likelihood-mapping (LM), a method of
visualizing the phylogenetic signal of a sequence alignment in a single graph, was proposed (Strimmer and von
Haeseler 1997). With this method, an a priori assessment
of the phylogenetic content, similar to that of the method
of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988), of either the whole set
of sequences or a predefined partition into four subsets
is possible. In addition, LM can be used to test a posteriori the confidence of an inner branch in a phylogenetic tree. The procedure of LM is based on the three
maximum likelihoods for the three possible quartet trees
of aligned sequences, which are transformed into barycentric coordinates. The position of the point with these
coordinates in a simplex visualizes which of the seven
possible treelike and nontreelike quartet graphs represents the underlying topology of divergence of the
sequences.
Key words: likelihood-mapping, maximum likelihood, statistical geometry, phylogenetic content, edge test.
Address for correspondence and reprints: Arndt von Haeseler, Max-Planck-Institut für evolutionäre Anthropologie, Inselstrasse 22, D-04103 Leipzig, Germany. E-mail: [email protected].
Mol. Biol. Evol. 18(7):1204–1219. 2001
© 2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
In general, any quartet-based evaluation method of
phylogenetic signal is implementable in the setup of
LM. We coin the general principle quartet-mapping. In
this paper, we extend the LM procedure by combining
its principle with the method of statistical geometry in
sequence space (Eigen, Winkler-Oswatitsch, and Dress
1988). Both LM and its extension to statistical geometry
return the coordinates of a point which is mapped into
the same graph. Applied to the same sequences, the two
methods can therefore be used to directly compare the
positions of the points and consequently the proposed
topologies of the sequences. One should, however, notice a subtle difference between the two approaches. LM
is built on the foundation of a stochastic model of sequence evolution, and thus the barycentric coordinates
have the statistical interpretation of posterior probabilities. With the extension of this approach to statistical
geometry in sequence space or maximum parsimony, for
example, a clear statistical interpretation of the coordinates is lacking. However, the statistical properties of
the two quartet-mapping procedures with respect to their
ability to reconstruct the true evolutionary history have
not been investigated so far. Although in numerous investigations it has been shown that maximum likelihood
(ML) is generally robust even when evolution assumptions are violated in the reconstruction model (Gaut and
Lewis 1995; Huelsenbeck 1995; Yang 1996), its performance in the LM framework is not clear. One explicit
question is that of support for the correct topology, especially for the star tree. In principle, a likelihood ratio
test can be constructed for the star tree versus a tree
phylogeny from the ML estimates (Churchill, von Haeseler, and Navidi 1992; Strimmer, Goldman, and von
Haeseler 1997; Yang and Rannala 1997; Whelan and
Goldman 1999; Ota et al. 2000). However, it has been
shown that under the null hypothesis, the likelihood ratio test’s statistics deviate from the usual χ² approximation (Goldman 1993; Gaut and Lewis 1995), and thus
the test is unreliable. The bootstrap test as introduced
into the phylogenetic world by Felsenstein (1985) tends
to support the tree favored by the tree-making method
(Zharkikh and Li 1992). Another test, called the interior-branch test, computes the standard error of estimates of
edge lengths (Nei, Stephens, and Saitou 1985; Li 1989;
Rzhetsky and Nei 1992). Here, the emphasis of the test
lies on the assessment of the significance of a positive
branch length by means of a normal deviate test. Both
the bootstrap test and the interior branch test have been
shown to become conservative for starlike trees (Sitnikova 1996).
In this paper, we use the quartet-mapping approach
of ML and statistical geometry to investigate the following two questions: (1) Which tree shapes occur for
which evolution models? (2) Given that the tree shape
is correct, when is an interior branch length reliably positive? For these investigations, we use two approaches:
We simulate sequence data evolving along a wide range
of tree models under a broad range of parameter settings, including differences in the ratio of transition to
transversion substitutions, discrete gamma-distributed
site variation, unequal base composition, and unequal
rate matrix entries. From the sequences, the simplex
points are computed using the ML estimates as well as
the statistical geometry parameters. We compare the positions of the points in the simplex relative to each other,
and we compare the points with the true points as obtained from the tree models.
In addition, we compute the expected coordinates
of the simplex point obtained from the statistical geometry parameters using the Jukes-Cantor (Jukes and
Cantor 1969) and the Kimura (1980) two-parameter substitution models.
We investigate the behavior of LM and its extension to statistical geometry under violation of the evolutionary model. We also compare the efficiencies of
both methods in distinguishing between starlike trees
and fully resolved trees. We investigate under which circumstances a test for a positive interior branch length
can be constructed without extensive statistical and
computational efforts.
Materials and Methods
Quartet-Mapping
In this section, we introduce an extension of the
LM procedure (Strimmer and von Haeseler 1997). The
main feature of LM is the display of the underlying
mode of evolution of quartets of aligned sequences in a
single graph. There are seven graphs connecting four
taxa (Eigen, Winkler-Oswatitsch, and Dress 1988): three
fully resolved unrooted trees, the completely unresolved
star tree, and the three partially resolved ‘‘trees’’ accounting for the cases in which it is difficult to distinguish between two of the three perfect trees. We denote
the three perfect trees by T1, T2, and T3, the star tree by T*, and the three amalgamated trees by T12, T13, and
T23. LM computes for a quartet of aligned sequences the
three MLs L1, L2, and L3 belonging to T1, T2, and T3,
respectively, using any model for the evolution of the
sequences. The three likelihoods are transformed into
posterior probabilities $p_i = L_i/(L_1 + L_2 + L_3)$, $i = 1, 2, 3$,
by applying Bayes’ theorem assuming a uniform prior
for all three trees. The three probabilities can be viewed
as the barycentric coordinates of a vector $p = (p_1, p_2, p_3)$,
and the corresponding vector is mapped onto a two-dimensional simplex
$$S_2 = \{\, p_1 e_1 + p_2 e_2 + p_3 e_3 \mid p_1 + p_2 + p_3 = 1,\ 0 \le p_1, p_2, p_3 \le 1 \,\}, \tag{1}$$
where the $e_i$ are the unit vectors. In the simplex, seven
points are distinguished, corresponding to the seven
quartet ‘‘trees’’ (see fig. 1). These are the three p vectors
(1, 0, 0), (0, 1, 0), and (0, 0, 1) of the perfect tree topologies T1, T2, and T3, respectively, the center point
with vector p = (1/3, 1/3, 1/3) of the star topology T*,
and the three points of the partially resolved tree topologies T12, T13, and T23, whose corresponding p vectors
are (1/2, 1/2, 0), (1/2, 0, 1/2), and (0, 1/2, 1/2). Using
the Euclidean distance, the simplex can be subdivided
into seven areas (so-called Voronoi cells), where each
point in one region has the smallest Euclidean distance
to its corresponding point representing one of the seven
quartet ‘‘trees.’’ For example, the region R(T*) with the
points closest to p = (1/3, 1/3, 1/3) of the star topology
T* is characterized as
$$R(T_*) = \{(p_1, p_2, p_3) : \min(p_1, p_2, p_3) > 1/6\}.$$
Similarly,
$$R(T_1) = \{(p_1, p_2, p_3) : p_1 - \max(p_2, p_3) > 1/2 \ \text{and}\ 5/6 < p_1 + \min(p_2, p_3) \le 1\}$$
and
$$R(T_{12}) = \{(p_1, p_2, p_3) : 5/6 < p_1 + p_2 \le 1 \ \text{and}\ \min(p_1, p_2) > 1/6\}.$$
Of course, the Euclidean distance can be substituted by
any other reasonable metric.
FIG. 1.—The two-dimensional simplex graph of the quartet-mapping procedure. The graph shows the attractors (marked with dots),
along with corresponding regions (Voronoi cells), of the seven possible
graphs connecting four taxa (A, B, C, D).
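The mapping from three quartet likelihoods to a simplex region can be sketched in a few lines. This is a minimal illustration, not the PUZZLE implementation: the function names are hypothetical, the likelihoods are assumed to be supplied as log-likelihoods (subtracting the maximum before exponentiating avoids numerical underflow), and the region is assigned by the smallest Euclidean distance to one of the seven attractors rather than by the closed-form region definitions.

```python
import math

# The seven attractors of the simplex in barycentric coordinates.
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T12": (0.5, 0.5, 0.0),
    "T13": (0.5, 0.0, 0.5),
    "T23": (0.0, 0.5, 0.5),
    "T*": (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0),
}


def posterior_from_log_likelihoods(log_l1, log_l2, log_l3):
    """Return (p1, p2, p3) with p_i = L_i / (L1 + L2 + L3)."""
    logs = (log_l1, log_l2, log_l3)
    shift = max(logs)  # rescaling leaves the ratios unchanged
    weights = [math.exp(x - shift) for x in logs]
    total = sum(weights)
    return tuple(w / total for w in weights)


def squared_distance(p, q):
    """Squared Euclidean distance between two coordinate triples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def nearest_attractor(p):
    """Name of the attractor closest to the simplex point p."""
    return min(ATTRACTORS, key=lambda name: squared_distance(p, ATTRACTORS[name]))
```

For example, `nearest_attractor(posterior_from_log_likelihoods(-1000.0, -1010.0, -1012.0))` assigns the quartet to the region of T1, because almost all of the posterior mass falls on the first tree.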
By this approach, a one-to-one relationship between the topology of the quartet and a single point in
the simplex is established. As will now be shown, this
procedure is not limited to ML estimators. In fact, any
quartet-based approach that evaluates some objective
function is implementable in the setup of the LM procedure. The generalized method is accordingly coined
‘‘quartet-mapping.’’ In general, each quartet method
computes the support of a quartet of sequences for each
of the three possible perfect trees. Let $\sigma_1$ be the support
of tree T1, $\sigma_2$ that of tree T2, and $\sigma_3$ that of tree T3 as
computed by an arbitrary quartet method. We define the
relative support
$$s_i = \frac{\sigma_i}{\sigma_1 + \sigma_2 + \sigma_3} \tag{2}$$
of tree $T_i$. Then $0 \le s_i \le 1$ and $\sum_i s_i = 1$. Thus the $s_i$ can
be seen, like the $p_i$ of the LM method, as barycentric coordinates of a vector $s$, whose domain is the simplex $S_2$.
Let us exemplify the quartet-mapping approach in
more detail for the method of statistical geometry in
sequence space (Eigen, Winkler-Oswatitsch, and Dress
1988). This method computes the sequence space graph
of quartets of aligned sequences by analyzing the positionwise entries of the sequences. Let A, B, C, and D be
a quartet of aligned sequences of length L, whose characters are drawn from an alphabet $\mathcal{A}$ of finite length. For
nucleotide sequences, there are 15 possible position categories $C_j$:
     C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11  C12  C13  C14  C15
A    x    y    x    x    x    x    x    x    x    x    x    y    y    y    x
B    x    x    y    x    x    x    y    y    x    y    y    x    x    z    y
C    x    x    x    y    x    y    x    y    y    x    z    x    z    x    z
D    x    x    x    x    y    y    y    x    z    z    x    z    x    x    w      (3)
where w, x, y, and z are different nucleotides.
For a given quartet of sequences, one sums the
number of positions $d_j$ that belong to $C_j$. Note that the
sum of the 15 $d_j$ values is equal to L, the length of the
alignment. The underlying topology of divergence of the
quartet is obtained from the ‘‘informative’’ categories.
A position category is considered informative if it helps
to distinguish between different tree shapes. For statistical geometry, the configurations C6, . . . , C14 are informative. Configurations C6, C9, and C14 support tree T1;
C7, C10, and C13 support tree T2; and C8, C11, and C12
support tree T3.
We define the supports $\sigma_1^{GM}$, $\sigma_2^{GM}$, and $\sigma_3^{GM}$ of the
three perfect tree topologies T1, T2, and T3, respectively,
as
$$\sigma_1^{GM} = d_6 + \tfrac{1}{2}(d_9 + d_{14}) \tag{4}$$
$$\sigma_2^{GM} = d_7 + \tfrac{1}{2}(d_{10} + d_{13}) \tag{5}$$
$$\sigma_3^{GM} = d_8 + \tfrac{1}{2}(d_{11} + d_{12}). \tag{6}$$
Then the relative supports $s_i^{GM}$ of each of the three perfect trees are given by equation (2). This implementation
of the quartet-mapping procedure is called geometry-mapping (GM).
An extension of the method of statistical geometry
takes a weight matrix M of the alphabet into account
(unpublished data). Each entry in the matrix M gives a
dissimilarity measure between the two respective symbols of the alphabet. Each position in the alignment of
the four sequences A, B, C, and D is then analyzed separately as follows: Let $A_l$, $B_l$, $C_l$, and $D_l$ be the entries
at position l of A, B, C, and D, respectively. Then the
four entries define six pairwise distances, $M(A_l, B_l)$,
$M(A_l, C_l)$, $M(A_l, D_l)$, $M(B_l, C_l)$, $M(B_l, D_l)$, and $M(C_l, D_l)$,
and three distance sums, $M(A_l, B_l) + M(C_l, D_l)$,
$M(A_l, C_l) + M(B_l, D_l)$, and $M(A_l, D_l) + M(B_l, C_l)$. We
order these three distance sums by magnitude and label
them $\mathrm{max}_l$, $\mathrm{med}_l$, and $\mathrm{min}_l$, with $\mathrm{max}_l \ge \mathrm{med}_l \ge \mathrm{min}_l$.
The six pairwise distances together with $\mathrm{max}_l$ are used
to compute
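$$
\begin{aligned}
\gamma_l(A|BCD) &:= \tfrac{1}{2}\bigl(M(A_l, B_l) + M(A_l, C_l) + M(A_l, D_l) - \mathrm{max}_l\bigr) \\
\gamma_l(B|ACD) &:= \tfrac{1}{2}\bigl(M(A_l, B_l) + M(B_l, C_l) + M(B_l, D_l) - \mathrm{max}_l\bigr) \\
\gamma_l(C|ABD) &:= \tfrac{1}{2}\bigl(M(A_l, C_l) + M(B_l, C_l) + M(C_l, D_l) - \mathrm{max}_l\bigr) \\
\gamma_l(D|ABC) &:= \tfrac{1}{2}\bigl(M(A_l, D_l) + M(B_l, D_l) + M(C_l, D_l) - \mathrm{max}_l\bigr) \\
\gamma_l(AB|CD) &:= \tfrac{1}{2}\bigl(\mathrm{max}_l - M(A_l, B_l) - M(C_l, D_l)\bigr) \\
\gamma_l(AC|BD) &:= \tfrac{1}{2}\bigl(\mathrm{max}_l - M(A_l, C_l) - M(B_l, D_l)\bigr) \\
\gamma_l(AD|BC) &:= \tfrac{1}{2}\bigl(\mathrm{max}_l - M(A_l, D_l) - M(B_l, C_l)\bigr).
\end{aligned}
\tag{7}
$$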
Note that one of the three values $\gamma_l(AB|CD)$,
$\gamma_l(AC|BD)$, $\gamma_l(AD|BC)$ is always equal to zero. The
nonzero values define a two-dimensional diagram as
shown in figure 2a whose distance segments are given
exactly by these values. For each position l in the alignment, the parameters $\gamma_l(\cdot|\cdot)$ are then computed, and the
appropriate distance segments are summed up, yielding
$\gamma(\cdot|\cdot) = \sum_l \gamma_l(\cdot|\cdot)$.
FIG. 2.—a, Given four sequences A, B, C, D, the four entries at each position l define six pairwise distances and three distance sums,
which are used to solve an equation system in the seven unknowns $\gamma_l(A|BCD)$, $\gamma_l(B|ACD)$, $\gamma_l(C|ABD)$, $\gamma_l(D|ABC)$, $\gamma_l(AB|CD)$, $\gamma_l(AC|BD)$,
and $\gamma_l(AD|BC)$, of which one of the latter three is always equal to zero. The nonzero values can be matched to two-dimensional graphs with
six distance segments. b, Summing all parameters $\gamma_l(\cdot|\cdot)$ over all positions l yields seven parameters that can be matched to the distance segments
of a three-dimensional box graph.
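The two ways of obtaining GM supports, counting informative position categories as in equations (4) to (6) and summing the inner segments of equation (7), can be compared in a short sketch. The function names below are hypothetical and are not taken from STATGEOM; the weight function is passed in as a parameter, and only the unweighted case (0 on the diagonal, 1 elsewhere) is shown. Sequences are assumed to be gap-free strings of equal length.

```python
def unit_weight(x, y):
    """Unweighted alphabet metric: 0 for identical symbols, 1 otherwise."""
    return 0.0 if x == y else 1.0


def gamma_position(a, b, c, d, weight=unit_weight):
    """Inner segments (AB|CD, AC|BD, AD|BC) of equation (7) for one site."""
    sums = (
        weight(a, b) + weight(c, d),
        weight(a, c) + weight(b, d),
        weight(a, d) + weight(b, c),
    )
    largest = max(sums)
    return tuple(0.5 * (largest - s) for s in sums)


def weighted_supports(seq_a, seq_b, seq_c, seq_d, weight=unit_weight):
    """Sum the per-site inner segments over all alignment positions."""
    totals = [0.0, 0.0, 0.0]
    for column in zip(seq_a, seq_b, seq_c, seq_d):
        for i, g in enumerate(gamma_position(*column, weight=weight)):
            totals[i] += g
    return tuple(totals)


def category_supports(seq_a, seq_b, seq_c, seq_d):
    """Supports of equations (4)-(6) from the informative categories."""
    s = [0.0, 0.0, 0.0]
    for a, b, c, d in zip(seq_a, seq_b, seq_c, seq_d):
        kinds = len({a, b, c, d})
        if kinds == 2:
            # Two-two splits (C6, C7, C8) count fully; 3:1 patterns match nothing.
            if a == b and c == d:
                s[0] += 1.0
            elif a == c and b == d:
                s[1] += 1.0
            elif a == d and b == c:
                s[2] += 1.0
        elif kinds == 3:
            # Exactly one identical pair (C9, ..., C14) counts one half.
            if a == b or c == d:
                s[0] += 0.5
            elif a == c or b == d:
                s[1] += 0.5
            else:
                s[2] += 0.5
    return tuple(s)


def relative(supports):
    """Normalize three supports so that they sum to one."""
    total = sum(supports)
    if total == 0:
        raise ValueError("no informative positions")
    return tuple(x / total for x in supports)
```

On a small alignment, both routes give the same supports when the unit weight is used, which is the statement proved in appendix A.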
The segments γ(A|BCD), γ(B|ACD), γ(C|ABD),
and γ(D|ABC) refer to the lengths of the edges connecting the sequences A, B, C, and D, respectively, with
a three-dimensional box whose edge lengths are determined by the three segments γ(AB|CD), γ(AC|BD), and
γ(AD|BC) (see fig. 2b). These three segments again determine the overall treelikeness of the quartet, as well
as the support of each of the three perfect tree topologies. The box dimensions are used to visualize the treelikeness of the quartet in the two-dimensional simplex
$S_2$. Again, we define the vector $s^M = (s_1^M, s_2^M, s_3^M)$, where
now,
$$
\begin{aligned}
s_1^M &= \frac{\gamma(AB|CD)}{\gamma(AB|CD) + \gamma(AC|BD) + \gamma(AD|BC)} \\
s_2^M &= \frac{\gamma(AC|BD)}{\gamma(AB|CD) + \gamma(AC|BD) + \gamma(AD|BC)} \\
s_3^M &= \frac{\gamma(AD|BC)}{\gamma(AB|CD) + \gamma(AC|BD) + \gamma(AD|BC)}.
\end{aligned}
\tag{8}
$$
In appendix A, we prove that $s^{GM} = s^M$ if the off-diagonal entries of the matrix M are equal to one and the
diagonal entries are equal to zero.
Since LM and GM are based on the same graph, it
is now possible to directly compare various statistical
properties of these two tree evaluation methods.
Comparison of Likelihood-Mapping and Geometry-Mapping
We conducted both a theoretical analysis and computer simulations of four-taxon trees, in which we sampled a large proportion of the complete parameter space.
The model trees are shown in figure 3. Note that the
branch lengths are expected numbers of nucleotide substitutions per site.
In the theoretical analyses, we computed the expected coordinates of the GM simplex vector given the
tree model and assuming the Jukes-Cantor model (Jukes
and Cantor 1969) and the Kimura (1980) two-parameter
model. In the LM setting, a prediction of the simplex
vector, and therefore of the behavior of the ML method,
is too difficult. Thus, here the statistical properties were
only investigated by means of simulations.
FIG. 3.—The star tree and the two unrooted binary trees used for
the simulations. The arrow in tree I indicates the location of the hypothetical root.
Table 1
Comprehensive List of Model Trees, Evolution Models, and Reconstruction Models
Used in the Simulations
Model Tree       Edge Lengths
Star tree        a = 0.05, 0.1, . . . , 0.75; b = 0.05, 0.1, . . . , 0.75
Clock tree       a = 0.05, 0.1, . . . , 0.75; c = 0.05, 0.1, . . . , 0.75
Nonclock tree    a = 0.05, 0.1, . . . , 0.75; b = 0.05, 0.1, . . . , 0.75; c = a

Model of Sequence Evolution    Model for Quartet Reconstruction
JC                             LM + JC, GM
F81                            LM + HKY, GM
K2P                            LM + K2P, GM + M
HKY                            LM + HKY, GM + M
JC + Γ(0.8)                    LM + JC, GM
JC + Γ(0.2)                    LM + JC, GM
K2P + Γ(0.8)                   LM + K2P, GM + M
K2P + Γ(0.2)                   LM + K2P, GM + M
REV                            LM + HKY, GM
NOTE.—JC = Jukes-Cantor model with the transition/transversion parameter κ = 0.5 and equal base frequencies; F81
= Felsenstein model with κ = 0.5 and unequal base frequencies (here, A = 10%, G = 30%, C = 40%, and T = 20% were
used); K2P = Kimura two-parameter model with κ = 4 and equal base frequencies; HKY = Hasegawa-Kishino-Yano
model with κ = 4.0 and unequal base frequencies (same as for the F81 model); JC + Γ(α) and K2P + Γ(α) = Jukes-Cantor and Kimura two-parameter models in addition to a discrete gamma distribution Γ(α) with α = 0.8 (medium rate
variation) and α = 0.2 (severe rate variation); REV = fully reversible model with substitution rate matrix Q as shown in
the text.
For the simulations, various cases of the HKY
model (Hasegawa, Kishino, and Yano 1985) and a fully
reversible REV model (Yang 1994) were assumed for
the sequence evolution. The cases used, together with
the model trees, the branch lengths, and the models assumed in the quartet reconstruction, are listed in table
1. For the fully reversible model, we used the following
‘‘instantaneous nucleotide substitution rate’’ matrix
Q:
$$
Q \;=\;
\begin{array}{c|rrrr}
  & A & G & C & T \\ \hline
A & -16 & 5 & 1 & 10 \\
G & 5 & -8 & 2 & 1 \\
C & 1 & 2 & -4 & 1 \\
T & 10 & 1 & 1 & -12
\end{array}
$$
Since Q is symmetric, the stationary base composition
is uniform.
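The step from a symmetric Q to a uniform stationary base composition can be checked numerically. The sketch below assumes the entries of Q read row by row in the order A, G, C, T; it verifies that every row sums to zero (as required for a rate matrix), that Q is symmetric, and that the uniform distribution π satisfies πQ = 0. The variable names are illustrative only.

```python
BASES = ("A", "G", "C", "T")

# Rate matrix Q of the REV model, rows and columns ordered A, G, C, T.
Q = [
    [-16, 5, 1, 10],
    [5, -8, 2, 1],
    [1, 2, -4, 1],
    [10, 1, 1, -12],
]

n = len(BASES)

# Each row of a rate matrix must sum to zero.
row_sums = [sum(row) for row in Q]

# Symmetry: the rate from i to j equals the rate from j to i.
is_symmetric = all(Q[i][j] == Q[j][i] for i in range(n) for j in range(n))

# A stationary distribution pi satisfies pi * Q = 0 (componentwise).
pi = [1.0 / n] * n
stationarity_residual = [
    sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)
]
```

Because the rows sum to zero and Q equals its transpose, the columns sum to zero as well, which is exactly the condition πQ = 0 for a uniform π.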
In all model trees, the branch lengths were increased in steps of 0.05 covering the range from 0.05 to
0.75 (fig. 4, top); thus, for each tree and each evolution
model, a total of 225 parameter combinations were
analyzed.
In all simulations, the sequence length was set to
1,000, and for each branch length combination, 200 repetitions were carried out for each set of parameters.
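The simulation design can be illustrated for the simplest case, the Jukes-Cantor model. This is a minimal sketch under stated assumptions and not the program used for the study: the quartet is assumed to have the topology ((A,B),(C,D)), the assignment of the four outer branch lengths to the taxa is left to the caller, and all function names are hypothetical. Under Jukes-Cantor, a site differs after a branch of length t (expected substitutions per site) with probability 3/4 (1 − exp(−4t/3)).

```python
import math
import random

NUCLEOTIDES = "ACGT"


def jc_change_probability(t):
    """Probability that a site differs after a branch of length t (JC)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def evolve(sequence, t, rng):
    """Evolve a sequence along one branch of length t under Jukes-Cantor."""
    p = jc_change_probability(t)
    out = []
    for base in sequence:
        if rng.random() < p:
            # Each of the three other nucleotides is equally likely.
            out.append(rng.choice([n for n in NUCLEOTIDES if n != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_quartet(outer, inner, n_sites=1000, rng=None):
    """Simulate sequences A, B, C, D on the tree ((A,B),(C,D)).

    outer: branch lengths leading to A, B, C, D; inner: interior branch.
    """
    rng = rng or random.Random()
    left = "".join(rng.choice(NUCLEOTIDES) for _ in range(n_sites))
    right = evolve(left, inner, rng)
    return (
        evolve(left, outer[0], rng),
        evolve(left, outer[1], rng),
        evolve(right, outer[2], rng),
        evolve(right, outer[3], rng),
    )


def branch_length_grid(start=0.05, stop=0.75, step=0.05):
    """All pairs of branch lengths on the grid start, start + step, ..., stop."""
    n = round((stop - start) / step) + 1
    values = [round(start + i * step, 2) for i in range(n)]
    return [(x, y) for x in values for y in values]
```

The grid has 15 values per axis and therefore 225 pairs; setting `inner=0.0` yields data from a star tree, since the two interior nodes then carry identical sequences.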
LM was used as implemented in PUZZLE, version
3.0 (Strimmer and von Haeseler 1996). For the analyses
of the sequences, the HKY model with the default option of no site variation was used as the reconstruction
model for all simulations. Also prespecified was the
transition/transversion parameter κ according to the
model used. Thus, in the presence of a gamma distribution of the sites, the reconstruction model leads to a
violation of the assumption. Also, when sequences were
generated under a fully reversible model, REV, the assumption that all transitions and all transversions had
equal substitution rates was violated in the reconstruction process.
GM was used as implemented in STATGEOM, version 2.0. For all simulation models with a transition/transversion parameter κ = 4.0 (where κ = 2s/v; see
appendix C) the weighted GM method with a weight
matrix M of the alphabet taking this k ratio into account
was used. In the other cases, the unweighted GM method was applied.
In the star topology simulations, the percentage of
simplex points placed (correctly) into the star area, as
well as the percentage of points mapped (incorrectly)
into either of the three tree areas, was recorded.
For the clock and the nonclock trees the percentage
of simplex points mapped into the correct tree area was
computed.
Results
For GM, the relative frequencies of different position categories in the alignment determine the coordinates of the simplex point and therefore the topology
of the four-taxon graph. For four nucleotide sequences,
there are 15 different position categories, as listed in
expression (3). When the evolutionary tree and the model of sequence evolution are known, the probabilities of
the different categories can easily be computed (Saitou
and Nei 1986). Appendix B shows the probabilities of
each category for the nonclock tree (right graph of fig.
3) assuming the Jukes-Cantor model. From the ‘‘informative’’ categories C6, . . . , C14, it is then straightforward to compute the expected coordinates of the simplex point $s^{GM} = (s_1^{GM}, s_2^{GM}, s_3^{GM})$ (see appendix B). Similar calculations can be performed for the Kimura two-parameter model. In appendix C, we compute the
supports $s_i^M$ in the GM case when different weights of
transitions and transversions are considered.
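The calculation of pattern probabilities for a known tree and model can be sketched by brute force. The code below is an illustration under assumptions, not the derivation of appendix B: it assumes the topology ((A,B),(C,D)) and the Jukes-Cantor model, sums over the states of the two interior nodes to obtain the probability of each of the 256 site patterns, and returns the ratio of the expected supports, which approximates the expected simplex point for long alignments. The function names are hypothetical.

```python
import math
from itertools import product


def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing y after time t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def expected_gm_point(outer, inner):
    """Ratio of expected GM supports for the tree ((A,B),(C,D)) under JC.

    outer: branch lengths leading to A, B, C, D; inner: interior branch.
    """
    supports = [0.0, 0.0, 0.0]
    states = range(4)
    for a, b, c, d in product(states, repeat=4):
        # Site pattern probability: sum over both interior node states.
        prob = 0.0
        for u, v in product(states, repeat=2):
            prob += (
                0.25
                * jc_prob(u, v, inner)
                * jc_prob(u, a, outer[0])
                * jc_prob(u, b, outer[1])
                * jc_prob(v, c, outer[2])
                * jc_prob(v, d, outer[3])
            )
        kinds = len({a, b, c, d})
        if kinds == 2:
            # Two-two splits (C6, C7, C8) count fully.
            if a == b and c == d:
                supports[0] += prob
            elif a == c and b == d:
                supports[1] += prob
            elif a == d and b == c:
                supports[2] += prob
        elif kinds == 3:
            # Exactly one identical pair (C9, ..., C14) counts one half.
            if a == b or c == d:
                supports[0] += 0.5 * prob
            elif a == c or b == d:
                supports[1] += 0.5 * prob
            else:
                supports[2] += 0.5 * prob
    total = sum(supports)
    if total == 0:
        raise ValueError("all branch lengths are zero")
    return tuple(s / total for s in supports)
```

With four equal outer branches and `inner=0.0`, the three coordinates coincide by symmetry, so the point lies at the center of the simplex; a positive interior branch moves it toward the T1 corner.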
Given the outer branch lengths a and b, the coordinates of the simplex point can be used to calculate the
area into which the simplex point is mapped as a function of the inner branch length c.
FIG. 4.—The parameter space representing branch length combinations as used in the simulations. The left graph refers to the star tree, the
middle graph to the clock tree, and the right graph to the nonclock tree. The gray scale (at the bottom) is also used in figure 7.
In the following sections, we will separately analyze the star tree, i.e., the case in which the expected
inner branch length c is equal to zero, and the trees with
positive inner branch lengths.
Success in Predicting the Star Tree
Figure 5 shows the results of quartet-mapping for
the star tree when HKY substitution models or a general
reversible model was simulated. The gray scale is shown
in figure 4. In the diagram, the shades in column 1 represent the percentages of quartets that are in the star tree
area, while in columns 2–4, the shades show the percentages of quartets that are in tree areas 1, 2, and 3,
respectively. We note that the probability of detecting
the star tree T* is virtually independent of the choice of
a and b. GM has a better chance of correctly identifying
the star topology. LM, on the other hand, is too liberal
in accepting a binary tree. Table 2 shows the average
percentages of quartets over all 225 parameter combinations of branch lengths a and b that are in the seven
simplex areas T*, T1, T2, T3, T12, T13, and T23, respectively. While GM shows almost constant values for all
evolution models, the values of LM are quite different
for the different evolution models. If the model of sequence evolution is violated in the reconstruction process, then LM has the tendency to favor the wrong topology (%T2 > %T1 and %T3, and %T2 > %T*). This
tendency is more pronounced if a ≪ b or vice versa.
GM suffers from this tendency only for very small values of a or b.
We conclude that LM has less power to detect the
star tree than does GM. This conclusion is especially
valid if the evolutionary model is violated. Then, LM
has a strong tendency to suggest a wrong tree.
Success in Predicting the Binary Tree
We first analytically computed the areas into which
the simplex point of GM was mapped for the parameter
spaces of the clock and the nonclock trees. We used the
Jukes-Cantor and the Kimura two-parameter evolutionary models. In the latter case, we compared the properties of the unweighted GM method with those of the
weighted GM + M method.
The results of figure 6a show that for the clock tree,
depending on the branch length combinations, GM suggests either the correct tree or the star topology. If the
weighted GM method is applied, then the T1 region is
significantly enlarged. For the nonclock tree (fig. 6b),
the set of branch length combinations for the true tree
area is smaller than that for the clock tree. For most
parameter values, GM suggests either the T12 topology
or the star tree. For a very small range of branch length
combinations, the simplex point is even mapped into the
‘‘inconsistent’’ tree area T2 = T(AC|BD).
While figure 6 displays the analytical results for
GM, such calculations seem impossible for the LM situation. Also, for the more complex evolution models,
analytical expressions relating the model to the expected
coordinates of the simplex point are rather difficult.
Consequently, figure 7 displays the results from simulations for the clock tree. For each branch length combination, the percentage of cases in which LM and GM
map the simplex point into the correct tree area is
shown. Figure 7 follows the color shades of figure 4.
The unfavorable bias of LM in the starlike situation is
now clearly an advantage. In a wide range, the probability of recovering the true tree is >95%, even in the
cases when the evolution model assumptions are violated. The parameter range for which GM maps >95% of
the points into the correct tree area is much smaller. The
separating line of the 95% area is almost equal to the
respective analytical curves. The line is shifted to the
left, i.e., to even smaller outer branch lengths, when the
sequences evolve with a severe site variation (see fig. 7f
and i).
FIG. 5.—Density plot of the star tree; complete parameter space (for the parameter domain and the gray scale, see fig. 4). Indicated are
percentages of cases in which likelihood-mapping (top graphs) and geometry-mapping (bottom graphs) placed the simplex point into the star
tree area (first column), tree T1 (second column), tree T2 (third column), and tree T3 (fourth column). a, Jukes-Cantor (JC) model. b, F81 model.
c, Kimura two-parameter (K2P) model with κ = 4.0. d, HKY model with κ = 4.0 and f = (0.1, 0.3, 0.4, 0.2). e, JC model + Γ with gamma
shape parameter α = 0.8. f, JC model + Γ with α = 0.2. g, K2P model + Γ with κ = 4.0 and α = 0.8. h, K2P model + Γ with κ = 4.0 and
α = 0.2. i, REV model.
The results of the simulations of nonclock tree II
reinforce the clear superiority of the ML method in detecting a tree (data not shown). In almost the entire parameter space, >95% of all quartets are mapped into the
correct tree area. The probability was only slightly reduced when a severe site variation was assumed, and
the difference between the two equal branch lengths and
the three equal branch lengths was large. GM, in comparison, is successful in reconstructing the correct tree
in only a very small part of the parameter space.
We conclude that the success and the stability of
LM is virtually independent of the assumed evolution
model. The branch length combinations for which LM
reconstructs the true tree cover a large part of the parameter space. The results of the GM analyses, on the
other hand, show a clear dependency of success in reconstructing the true tree on the tree shape rather than
the evolution model.
Table 2
Average Percentage of Quartets in the Star Area T*, Tree Areas T1, T2, and T3, and
Net Areas T12, T13, T23 of 225 Parameter Combinations for the Star Tree
Evolution Model   Reconstruction Model   %T*     %T1     %T2     %T3     %T12    %T13    %T23
JC                LM + JC                84.4    4.3     3.8     5.2     0.9     0.9     0.5
                  GM                     88.5    0.0     7.2     0.0     2.2     1.8     0.0
F81               LM + HKY               83.5    4.0     4.25    5.2     1.1     1.1     0.5
                  GM                     88.6    0.0     7.0     0.0     2.0     2.0     0.1
K2P               LM + K2P               85.8    3.6     4.2     3.2     1.1     1.1     0.8
                  GM + M                 88.6    0.0     6.2     0.0     2.6     2.4     0.1
HKY               LM + HKY               87.4    3.1     4.1     3.1     0.9     0.9     0.5
                  GM + M                 88.4    0.0     6.5     0.0     2.4     2.5     0.1
JC + Γ(0.8)       LM + JC                11.0    10.8    43.1    12.2    9.6     8.6     4.7
                  GM                     90.1    0.0     6.1     0.0     1.9     1.6     0.0
JC + Γ(0.2)       LM + JC                2.0     14.6    53.7    15.7    5.6     5.9     2.9
                  GM                     92.2    0.0     3.5     0.0     2.1     1.9     0.0
K2P + Γ(0.8)      LM + K2P               7.7     14.6    38.2    17.3    8.4     8.3     5.4
                  GM + M                 89.9    0.0     5.1     0.0     2.5     2.2     0.0
K2P + Γ(0.2)      LM + K2P               1.2     18.2    51.8    17.1    4.4     4.3     3.1
                  GM + M                 90.7    0.0     3.7     0.0     2.6     2.8     0.0
REV               LM + HKY               18.8    13.3    29.6    14.6    9.5     9.2     4.9
                  GM                     89.9    0.0     6.2     0.0     2.0     1.7     0.0
NOTE.—JC = Jukes-Cantor model with the transition/transversion parameter κ = 0.5 and equal base frequencies; F81
= Felsenstein model with κ = 0.5 and unequal base frequencies (here, A = 10%, G = 30%, C = 40%, and T = 20% were
used); K2P = Kimura two-parameter model with κ = 4 and equal base frequencies; HKY = Hasegawa-Kishino-Yano
model with κ = 4.0 and unequal base frequencies (same as for the F81 model); JC + Γ(α) and K2P + Γ(α) = Jukes-Cantor and Kimura two-parameter models in addition to a discrete gamma distribution Γ(α) with α = 0.8 (medium rate
variation) and α = 0.2 (severe rate variation); REV = fully reversible model with substitution rate matrix Q as shown in
the text.
The simulations and the analytical considerations
indicate a relationship between the lengths of the outer
branches, a and b, and that of the inner branch, c, that
determines into which simplex area the GM point is
mapped. From the calculations in appendix D, we find
that for the Jukes-Cantor model, the sum of the two
outer branch lengths has to be smaller than a threshold
value as determined in equation (15) such that there exists a positive minimal inner branch length c, given by
equation (14), for which the corresponding simplex
point is mapped into the correct tree area:
a + b ≲ 0.9377.
This is the threshold line that separates the correct tree
area from the star tree area in the clock tree as shown in
figure 6. In the Jukes-Cantor case, if the sum of a and
b is larger than this threshold value, then it is impossible
to find the true tree based on GM.
A similar threshold for c is, of course, conceivable
for more complex models of sequence evolution. To estimate that minimal value, $c_{\min}$, we computed the location of s as a function of c for three examples with fixed
outer branch lengths: a = 0.1 for clock tree I, and a =
0.01, b = 0.1 and a = 0.005, b = 0.1, respectively, for
nonclock tree II. The inner branch length c was increased in steps of 0.005 starting from c = 0.005. Based on 200 simulations for a
fixed c, the vector s̄(c) was computed, and the minimal
value $c_{\min}$ for which s̄(c) was mapped into the correct
tree area T1 was determined. Moreover, we computed
the minimal inner edge length $c_{\min}^{0.95}$ such that at least
95% of the simulated s vectors were in T1. The results
are shown in table 3.
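The grid scan over the inner branch length can be sketched as follows. This is a hedged illustration of the procedure described above, not the code used for table 3: `simulate_points` stands for any user-supplied function that returns the simulated simplex vectors for a given c, `in_area` for a predicate that tests membership in the correct tree area, and both names (like the defaults) are assumptions.

```python
def mean_vector(points):
    """Componentwise mean of a list of simplex vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def scan_inner_branch(simulate_points, in_area, step=0.005, n_steps=200, level=0.95):
    """Scan c = step, 2 * step, ... and return (c_min, c_min_level).

    c_min: first c whose mean vector lies in the target area.
    c_min_level: first c for which at least `level` of the vectors do.
    Either value is None if the condition is never met on the grid.
    """
    c_min = None
    c_level = None
    for k in range(1, n_steps + 1):
        c = round(k * step, 10)
        points = simulate_points(c)
        if c_min is None and in_area(mean_vector(points)):
            c_min = c
        share = sum(1 for p in points if in_area(p)) / len(points)
        if c_level is None and share >= level:
            c_level = c
        if c_min is not None and c_level is not None:
            break
    return c_min, c_level
```

In practice `simulate_points` would wrap 200 replicate simulations per value of c; the scan stops as soon as both thresholds have been found, which keeps the number of simulated data sets small.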
We first note that the analytically computed values
from equation (14) for the Jukes-Cantor model in the
GM case are identical with the respective values of table
3. Two further results are notable. First, the minimal branch
length $c_{\min}$ is equal to 0.01 ± 0.005 in the LM situation
for all trees and models. This confirms our previous conclusion that LM is robust in detecting the correct tree
even if the model assumptions are violated.
Second, $c_{\min}$ as computed by the GM method is in
all cases larger than $c_{\min}$ of LM and changes considerably with the evolution models. It ranges between 0.02
(Jukes-Cantor) and 0.55 (Kimura two-parameter + Γ(0.2)).
We note too that the differences between $c_{\min}$ and
$c_{\min}^{0.95}$ are larger for the GM method than for the LM
implementation.
We conclude that GM is successful in reconstructing the correct tree when the following three conditions
are fulfilled: (1) the sequences evolve without a severe
site variation; (2) the total sum of the outer branch
lengths is less than 0.9, and thus the mean length of one
outer branch does not exceed 0.225; and (3) the interior
branch length is larger than a minimal value that depends on the lengths of the outer branches.
Applications
Here we illustrate the two methods for two data
sets. The first data set consisted of 24 mitochondrial
sequences (11,133 bp) from three eutherian classes (18
eutheria, 2 marsupialia, and 1 monotremata) and three
other noneutherian sequences (chicken, lungfish, and
frog). The three codon positions were analyzed separately. For all codon positions, LM placed most of the
quartets into the region of tree T1 (fig. 8a), thus suggesting a branching pattern of eutheria and noneutheria
versus marsupialia and monotremata. GM, on the other
hand, mapped all quartets into the star tree area, suggesting an undefined branching pattern (fig. 8b).
FIG. 6. For each parameter combination, the expected area is plotted, and the simplex point of the geometry parameters as obtained from
the analytical equations is mapped into it. In the left graph, the Jukes-Cantor model was assumed, and in the middle and right graphs, the
Kimura two-parameter model was assumed. In the middle graph, the statistical geometry parameters with an unweighted alphabet metric were
computed; in the right graph, a weighted metric with the correct transition/transversion parameter was used. a, Density plot of the clock tree
(I). b, Density plot of the nonclock tree (II).
The second data set was taken from Kuiken et al.
(1994). It consisted of 72 HIV-1 sequences of the hypervariable region V3 of the external envelope of the
virion. The sample covered different risk groups and
different infection years. Kuiken et al. (1994) showed
that the data set had been randomized to a large extent,
thus allowing almost no phylogenetic interpretation.
However, one clustering, associated with risk group,
could be distinguished. Therefore, we divided the data
set into two groups, the group of HIV-1-infected drug
users (IVD), and the group of HIV-1-infected homosexuals as well as HIV-1-infected hemophiliacs (non-IVD),
and applied quartet-mapping to this clustering. Both LM
and GM confirm the result of Kuiken et al. (1994) (fig.
9). Although GM maps a larger number of the quartets
into the star tree area, both mapping methods support
the clustering of the IVD group versus the non-IVD
group, and only a few quartets cannot be
clustered within these risk groups.
Discussion
In this paper, we introduced an extension of the LM
procedure to other quartet-based tree reconstruction
methods. We exemplified this by implementing the
method of statistical geometry in sequence space in the
LM framework. We compared the statistical properties
of both methods by means of analytical, in addition to
simulation, studies. We assessed the success of both
methods in detecting the star tree, as well as their success in reconstructing the correct topology.
Let us point out that the general idea of quartet-mapping is to reduce the probability that a wrong tree is
reconstructed. A tree is recognized only if the simplex
point is mapped into one of the three small trapezia (see
fig. 1). The main conclusion of this paper is that the shape
of the tree, rather than the substitution model, determines
how reliably and successfully ML and statistical geometry map the simplex point into the correct topology area.
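The assignment of a simplex point to one of the seven areas can be sketched as a nearest-attractor search. The Python sketch below assumes the usual likelihood-mapping layout, with three corner attractors for the resolved trees, three edge midpoints for the partly resolved cases, and the centroid for the star tree; the labels and the function name are illustrative.

```python
# Attractor coordinates are an assumption following the seven areas of the
# likelihood-mapping triangle: corners, edge midpoints, and centroid.
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T1/T2": (0.5, 0.5, 0.0),
    "T1/T3": (0.5, 0.0, 0.5),
    "T2/T3": (0.0, 0.5, 0.5),
    "star": (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0),
}


def classify_simplex_point(s):
    """Return the label of the attractor closest to s = (s1, s2, s3)."""
    if len(s) != 3 or abs(sum(s) - 1.0) > 1e-9 or min(s) < 0:
        raise ValueError("s must be three nonnegative coordinates summing to 1")

    def squared_distance(point):
        return sum((x - y) ** 2 for x, y in zip(s, point))

    return min(ATTRACTORS, key=lambda name: squared_distance(ATTRACTORS[name]))


if __name__ == "__main__":
    for point in [(0.8, 0.1, 0.1), (0.4, 0.3, 0.3), (0.5, 0.45, 0.05)]:
        print(point, "->", classify_simplex_point(point))
```

Under this layout a point is closer to (1, 0, 0) than to (1/2, 1/2, 0) exactly when s1 > s2 + 0.5, which is the form of inequality (13) in appendix D.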
FIG. 7. Density plot of the clock tree I parameter space. Indicated are percentages of cases in which likelihood-mapping (left graphs) and
geometry-mapping (right graphs) placed the simplex point into the correct tree area. The gray scale is given in figure 4. a, Jukes-Cantor (JC)
model. b, Kimura two-parameter (K2P) model. c, HKY model. d, F81 model. e, JC + Γ with α = 0.8. f, JC + Γ with α = 0.2. g, REV model.
h, K2P + Γ with α = 0.8. i, K2P + Γ with α = 0.2.

Table 3
Minimal Inner Branch Lengths (cmin) for Which the Mean Simplex Point Computed
from 200 Runs Is Mapped into the Correct Tree Area, and Minimal Inner Branch
Lengths (cmin^0.95) for Which at Least 95% of All Quartets Are Mapped into the Correct
Tree Area

EVOLUTION      RECONSTRUCTION   TREE I              TREE IIA            TREE IIB
MODEL          METHOD           cmin    cmin^0.95   cmin    cmin^0.95   cmin    cmin^0.95
JC             LM + JC          0.01    0.015       0.005   0.0075      0.005   0.0075
               GM               0.045   0.075       0.02    0.0325      0.02    0.035
F81            LM + JC          0.01    0.02        0.005   0.0075      0.005   0.0075
               GM               0.045   0.0825      0.02    0.0375      0.025   0.040
JC + Γ(0.8)    LM + HKY         0.01    0.02        0.005   0.010       0.005   0.01
               GM               0.075   0.125       0.035   0.065       0.04    0.0625
JC + Γ(0.2)    LM + JC          0.015   0.02        0.01    0.02        0.015   0.02
               GM               0.135   0.23        0.06    0.105       0.065   0.1
K2P            LM + K2P         0.01    0.02        0.005   0.01        0.005   0.0075
               GM + M           0.04    0.065       0.02    0.035       0.02    0.035
K2P + Γ(0.8)   LM + K2P         0.01    0.02        0.005   0.015       0.005   0.02
               GM + M           0.06    0.1125      0.04    0.0625      0.035   0.09
K2P + Γ(0.2)   LM + K2P         0.015   0.035       0.01    0.02        0.015   0.02
               GM + M           0.1     0.2         0.045   0.09        0.55    0.09
REV            LM + HKY         0.01    0.0175      0.005   0.0075      0.005   0.0075
               GM               0.06    0.095       0.025   0.05        0.03    0.0475

NOTE. JC = Jukes-Cantor model with the transition/transversion parameter κ = 0.5 and equal base frequencies; F81
= Felsenstein model with κ = 0.5 and unequal base frequencies (here, A = 10%, G = 30%, C = 40%, and T = 20% was
used); K2P = Kimura two-parameter model with κ = 4 and equal base frequencies; HKY = Hasegawa-Kishino-Yano
model with κ = 4.0 and unequal base frequencies (same as for the F81 model); JC + Γ(α) and K2P + Γ(α) = Jukes-Cantor and Kimura two-parameter models in addition to a discrete gamma distribution Γ(α) with α = 0.8 (medium rate
variation) and α = 0.2 (severe rate variation); REV = fully reversible model with substitution rate matrix Q as shown in
the text.

FIG. 8. Four-cluster quartet-mapping of mitochondrial DNA (11,133 bases). Sequences were divided into four disjoint groups: 18 eutheria,
2 marsupialia, 1 monotremata, and 3 noneutheria. The three codon positions were analyzed separately (left: codon position 1; middle: codon
position 2; right: codon position 3, respectively). a, Results of likelihood-mapping. b, Results of quartet-mapping.

FIG. 9. Quartet-mapping graphs of 72 HIV-1 sequences of hypervariable region V3 of the external envelope of the virion. The set was
divided into two groups: the group of HIV-1-infected drug users (IVD), and the group of the others (non-IVD). A random sample of 5,000
quartets was analyzed. a, Results of likelihood-mapping. b, Results of quartet-mapping.

Let us discuss the two methods separately. In all
analyzed cases, LM appears to be too liberal in suggesting a tree when the underlying topology is a star
tree or very close to it. This bias is especially pronounced if the reconstruction model does not take rate
heterogeneity among sites into account. In the heterogeneous case, as well as the REV case, the binary tree
which brings the long edges together is favored more
often than the other two binary trees. The ultimate reason for this phenomenon needs further investigation, but
it has been observed in other studies that when the data
violate the assumptions of the analysis model, the likelihood ratio test provides strong support for the tree separating the long edges from the short edges (Gaut and
Lewis 1995).
LM appears to be robust and successful in reconstructing the correct tree over much of the whole parameter space even when simple evolution models are used
instead of more complex (and correct) ones. This agrees
with robustness results reported elsewhere (Gaut and
Lewis 1995; Huelsenbeck 1995; Yang 1996).
Our results show that GM allows for consistent topology reconstruction if the inner branch length is longer
than a threshold value. This threshold value depends critically on the tree shape and to some degree on the substitution model. However, there exists an upper limit for
the sum of the outer branches for which the interior branch
length has to be infinitely long such that the correct tree
is reconstructed. Thus, in a major part of the parameter
space, GM suggests a star tree. On the other hand, only a
small range of parameters leads to an inconsistent tree reconstruction, i.e., for which GM suggests the incorrect tree.
Thus, by using the quartet-mapping approach, i.e., by considering the seven quartet graphs rather than only the three
binary trees, the Felsenstein zone is significantly reduced.
The large region of the star tree, whose area is equal to
the sum of the three areas of the three perfect trees, is the
reason that GM suggests a star tree in a large part of the
parameter space. GM evaluates the absolute number of
positions, and, in addition, it does not take the information
of the outer branch lengths into account. Thus, the level
of ‘‘noise’’ is quickly approached, such that a tree can no
longer be extracted from the sequence alignment. It is interesting to note here that GM allows for consistent tree
reconstruction even if the molecular clock is violated, as
long as the threshold relation of the branch lengths is satisfied. This is in a way surprising, since statistical geometry, similarly to maximum parsimony, makes no explicit
assumption about the underlying process of evolution.
FIG. 10. Size of the areas into which the simplex point of the original geometry-mapping is mapped as compared with the simplex point of
parsimony-mapping under the assumption of the Jukes-Cantor model. In the respective left plots, the original geometry-mapping is shown; in the right
plots, the parsimony-mapping is shown. a, Clock tree. The dashed lines indicate the threshold value for the outer branch length a. b, Nonclock tree.
One could now ask if the range of branch lengths
for which GM reconstructs the true tree can be enlarged.
One source of noise that destroys the reconstruction of
the true tree in the statistical-geometry approach
is the use of the positions in
which only one pair has equal characters. We therefore
used a second, modified definition of the coordinates of
the simplex point by taking merely the "parsimony-informative"
positions (Swofford and Olsen 1990), i.e.,
those in which two pairs have equal characters. These are
the position categories C6–C8 in expression (3). Then, the
three coordinates of the simplex vector are given by
\[ s_1^{MP} = \frac{d_6}{d_6 + d_7 + d_8}, \qquad s_2^{MP} = \frac{d_7}{d_6 + d_7 + d_8}, \qquad s_3^{MP} = \frac{d_8}{d_6 + d_7 + d_8}. \]
Note that this is the quartet-mapping implementation of the maximum-parsimony
method (Farris, Kluge, and Eckardt 1970; Fitch 1971).
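The parsimony-mapping coordinates can be illustrated for a single quartet of aligned sequences. This Python sketch assumes that d6, d7, and d8 count the alignment columns with patterns xxyy, xyxy, and xyyx (supporting AB|CD, AC|BD, and AD|BC, respectively); the function name and the toy sequences are hypothetical.

```python
def parsimony_mapping_coordinates(seq_a, seq_b, seq_c, seq_d):
    """Return (s1, s2, s3) from the counts of categories C6, C7, and C8.

    Returns None if the quartet has no parsimony-informative column.
    """
    if not (len(seq_a) == len(seq_b) == len(seq_c) == len(seq_d)):
        raise ValueError("sequences must be aligned to the same length")

    counts = [0, 0, 0]  # d6 (AB|CD), d7 (AC|BD), d8 (AD|BC)
    for a, b, c, d in zip(seq_a, seq_b, seq_c, seq_d):
        if a == b and c == d and a != c:
            counts[0] += 1
        elif a == c and b == d and a != b:
            counts[1] += 1
        elif a == d and b == c and a != b:
            counts[2] += 1

    total = sum(counts)
    if total == 0:
        return None
    return tuple(n / total for n in counts)


if __name__ == "__main__":
    # Toy alignment: two xxyy columns, one xyxy, one xyyx, one uninformative.
    print(parsimony_mapping_coordinates("AAAAA", "ACACA", "CACCG", "CCCAT"))
```

Only the two-pair patterns enter the coordinates, so columns in which a single pair has equal characters are ignored.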
Assuming the Jukes-Cantor model, we computed the expected coordinates s_i^MP
for the clock tree and the nonclock
tree. Figure 10 shows the comparison of the predicted
areas using the geometry implementation and the parsimony implementation. Indeed, we find a significantly
larger threshold line for the T1 area for the clock tree and
an enlarged T1 area for the nonclock tree. A similar computation for the threshold relation (cf. eqs. 13, 14, and
15) now gives
\[ a + b \geq \frac{3}{4}\ln(5) \approx 1.2071. \]
We conclude that in the maximum-parsimony case, the
threshold for the sum of the outer branch lengths is
≈1.27 times as large as the one for GM. In addition,
for given branch lengths a and b, the minimal inner
branch length c, for which the simplex point is mapped
into the correct tree area, is smaller.
The effect observed for GM is also observed for
the maximum-parsimony implementation of quartet-mapping. Thus, by considering not only the three binary
trees, the Felsenstein zone can also be reduced for the
maximum-parsimony method. However, it can clearly
be seen that the domain in the parameter space for which
GM or maximum parsimony maps the simplex point
into the star tree area rather than into the binary tree
area is still very large. Future studies will therefore address the problem of reducing the attraction of the star
tree area.
When evaluating a reconstructed tree topology, one
of the biggest difficulties is in deciding when the length
of an interior branch is equal to zero. From the star tree
simulations, we find that ML as implemented in LM
suggests a binary tree too often. On the other hand, the
domain of the parameter space in which GM reconstructs a star tree is too large. We therefore propose the
following procedure to test for a positive interior branch
length using a combination of LM, GM, and ML. For a
quartet of sequences, both LM and GM are applied.
Then, we distinguish four cases:
1. Both LM and GM propose the star topology. Then,
indeed, the interior branch length is equal to zero.
2. LM proposes a star topology, whereas GM proposes
a tree topology. This case does not occur and can
therefore be ignored.
3. Both LM and GM propose the same tree topology.
Then, the interior branch length is indeed positive.
4. LM proposes a tree topology, and GM proposes a
star topology. In this case, we use ML to calculate
the expected branch lengths of the four outer branches, denoted by â1, â2, b̂1, and b̂2, and the expected
inner branch length ĉ. We now assume that two of
the outer branches are short and of similar length and
that the other two are comparatively long. We set
â = (â1 + â2)/2 and b̂ = (b̂1 + b̂2)/2. If these expected
branch lengths â and b̂ are in the area of the parameter space, where GM should propose a tree, then
with a large probability the interior branch length is
indeed equal to zero. If, on the other hand, the values
â, b̂, and ĉ are not in the area in which GM would
propose a tree, then no definite conclusion can be
reached.
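The four cases can be written as a small decision function. The Python sketch below is a minimal illustration under the Jukes-Cantor model: it uses equation (14) to decide whether GM would be expected to propose a tree, and it returns "undetermined" for combinations the procedure does not cover (case 2, or LM and GM proposing different trees), which is an assumption of this sketch. All names are hypothetical.

```python
import math

Z = math.exp(4.0 / 3.0)  # z = exp(4/3), as in equation (14)


def gm_expected_to_find_tree(a, b, c):
    """True if (a, b, c) satisfies equation (14) under Jukes-Cantor."""
    za, zb = Z ** a, Z ** b
    denominator = 34 * za * zb - 9 * (za * zb) ** 2 - 9
    if denominator <= 0:  # a + b at or above the threshold of equation (15)
        return False
    numerator = 12 - 12 * za + 15 * za ** 2 - 12 * zb + 15 * zb ** 2 - 2 * za * zb
    return c > 0.75 * math.log(numerator / denominator)


def interior_branch_verdict(lm_call, gm_call, a_hat, b_hat, c_hat):
    """Combine LM and GM calls ('star' or a tree label) into a verdict."""
    if lm_call == "star" and gm_call == "star":
        return "zero"  # case 1
    if lm_call == "star":
        return "undetermined"  # case 2, reported not to occur
    if gm_call != "star":
        # Case 3 if both agree; disagreement is not covered by the procedure.
        return "positive" if gm_call == lm_call else "undetermined"
    # Case 4: LM proposes a tree, GM proposes a star.
    if gm_expected_to_find_tree(a_hat, b_hat, c_hat):
        return "zero"
    return "undetermined"


if __name__ == "__main__":
    print(interior_branch_verdict("T1", "star", 0.26, 0.305, 0.06))
```

In case 4 the ML estimates serve only to check whether GM had a chance to detect a tree; when it did and still reported a star, the interior branch is taken to be zero.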
Let us look at the example in figure 8. Here, most
of the simplex points of LM are mapped into the tree
area T1, whereas the simplex points of GM are all placed
into the star tree area. From an additional tree reconstruction now applied to all positions using the ML
method as implemented in PUZZLE (Strimmer and von
Haeseler 1996), we computed the five mean branch
lengths: â1 = 0.31, â2 = 0.21, b̂1 = 0.40, and b̂2 = 0.21
for the outer branches and ĉ = 0.06 for the inner branch
length. Then, â = 0.26, b̂ = 0.31, and â + b̂ = 0.57.
The sum of â and b̂ is below the threshold of equation
(15). Therefore, there exists a positive minimal inner
branch length cmin ≈ 0.36 given by equation (14) for
which GM reconstructs a tree. The value ĉ of the inner
edge separating the four groups, however, is smaller than
cmin, the minimal value needed for GM to reconstruct a
tree. Thus, a definite conclusion about the phylogeny of
these four groups cannot be reached.
In this paper, we investigated the statistical properties of LM and GM only for model trees with four
taxa. It has been shown that for trees with more than
four species, LM is also too liberal in accepting a bifurcating tree even if the true topology is a star tree (Nieselt-Struwe 1998).
Nieselt-Struwe (1998) also showed
that this tendency cannot be improved by increasing the
sequence length.
All models of nucleotide substitution used in the
simulations in this paper are Markov models and assume
that evolution is independent and identical at each site
and along each lineage. This assumption, of course, may
be quite wrong. The work of Schöniger and von Haeseler (1995) and others on site-dependent models has
shown that, indeed, ML has an enlarged Felsenstein
zone if sequence sites do not evolve independently
(Muse 1995; Rzhetsky 1995; Schöniger and von Haeseler 1995; Tillier and Collins 1995). Investigations of
models violating these assumptions should provide another insight into the stability of the ML method and
statistical geometry in sequence space.
We conclude that quartet-mapping is a powerful tool
with which to test and compare the success of tree evaluation methods based on quartets of taxa. It is
an especially efficient method for a quick evaluation of zero-length branches in trees. However, we are aware that this
procedure cannot replace bootstrapping or other evaluation methods. The statistical properties that have been
studied in this work can be supplemented by the addition
of a bootstrapping procedure to assess the spreading of
the point cloud. Work on this question is in progress.
Acknowledgments
This work was supported by the Deutsche Forschungsgemeinschaft and the Max-Planck-Society,
which are gratefully acknowledged. We thank Korbinian
Strimmer for supplying the code of PUZZLE, version
3.0. We are also grateful to two anonymous reviewers
for their helpful comments on the manuscript. The extension of likelihood-mapping to geometry-mapping is
incorporated in STATGEOM, version 2.0, which can be
obtained from http://www.gwdg.de/~kniesel.
LITERATURE CITED
CHURCHILL, G., A. VON HAESELER, and W. NAVIDI. 1992. Sample size for a phylogenetic inference. Mol. Biol. Evol. 9:
753–769.
EIGEN, M., R. WINKLER-OSWATITSCH, and A. DRESS. 1988.
Statistical geometry in sequence space: a method of comparative sequence analysis. Proc. Natl. Acad. Sci. USA 85:
5913–5917.
FARRIS, J., A. KLUGE, and M. ECKARDT. 1970. A numerical
approach to phylogenetic systematics. Syst. Zool. 19:172–
191.
FELSENSTEIN, J. 1985. Confidence-limits on phylogenies—an
approach using the bootstrap. Evolution 39:783–791.
FITCH, W. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:
406–416.
GAUT, B., and P. LEWIS. 1995. Success of maximum likelihood
phylogeny inference in the four-taxon case. Mol. Biol. Evol.
12:152–162.
GOLDMAN, N. 1993. Statistical tests for DNA substitution. J.
Mol. Evol. 34:183–198.
HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating the
human-ape splitting by a molecular clock of mitochondrial
DNA. J. Mol. Evol. 22:160–174.
HUELSENBECK, J. 1995. The robustness of two phylogenetic
methods: four-taxon simulations reveal a slight superiority
of maximum likelihood over neighbor joining. Mol. Biol.
Evol. 12:843–849.
JUKES, T., and C. CANTOR. 1969. Evolution of protein molecules. Pp. 21–132 in H. MUNRO, ed. Mammalian protein
metabolism. Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of
nucleotide sequences. J. Mol. Evol. 16:111–120.
KUIKEN, C. L., K. NIESELT-STRUWE, G. F. WEILLER, and J.
GOUDSMIT. 1994. Quasispecies behavior of human immunodeficiency virus type 1: sample analysis of sequence data.
Pp. 100–119 in K. W. ADOLPH, ed. Molecular Virology
Techniques Part A. Academic Press, New York.
LI, W.-H. 1989. A statistical test of phylogenies estimated from
sequence data. Mol. Biol. Evol. 6:424–435.
———. 1997. Molecular Evolution. Sinauer Ass., Inc. Publishers, Sunderland, Mass., USA.
MUSE, S. 1995. Evolutionary analysis of DNA sequences subject to constraint on secondary structure. Genetics 139:
1429–1439.
NEI, M., J. STEPHENS, and N. SAITOU. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from
humans and apes. Mol. Biol. Evol. 2:66–85.
NIESELT-STRUWE, K. 1998. From likelihood-mapping to quartet-mapping, a new sequence analysis tool. Pp. 13–22 in M.
K. UYENOYAMA and A. VON HAESELER, eds. Proceedings
of the Trinational Workshop on Molecular Evolution. Duke
University Publication Group, Duke University, Durham,
N.C.
OTA, R., P. WADDELL, M. HASEGAWA, H. SHIMODAIRA, and H.
KISHINO. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798–803.
RZHETSKY, A. 1995. Estimating substitution rates in ribosomal
RNA genes. Genetics 141:771–783.
RZHETSKY, A., and M. NEI. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol.
Evol. 9:945–967.
SAITOU, N., and M. NEI. 1986. The number of nucleotides required to determine the branching order of three species,
with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24:189–204.
SCHÖNIGER, M., and A. VON HAESELER. 1995. Performance of
the maximum likelihood, neighbor joining and maximum
parsimony methods when sequence sites are not independent. Syst. Biol. 44:533–547.
SITNIKOVA, T. 1996. Bootstrap method of interior-branch test
for phylogenetic trees. Mol. Biol. Evol. 13:605–611.
STRIMMER, K., N. GOLDMAN, and A. VON HAESELER. 1997.
Bayesian probabilities and quartet puzzling. Mol. Biol.
Evol. 14:210–211.
STRIMMER, K., and A. VON HAESELER. 1996. Quartet puzzling:
a quartet maximum-likelihood method for reconstructing
tree topologies. Mol. Biol. Evol. 13:964–969.
———. 1997. Likelihood-mapping: a simple method to visualize phylogenetic content of sequence alignment. Proc.
Natl. Acad. Sci. USA 94:6815–6819.
SWOFFORD, D., and G. OLSEN. 1990. Phylogeny reconstruction.
Pp. 411–501 in D. HILLIS and G. MORITZ, eds. Molecular
systematics. Sinauer, Sunderland, Mass.
TILLIER, E., and R. COLLINS. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:7–15.
WHELAN, S., and N. GOLDMAN. 1999. Distribution of statistics
used for the comparison of models of sequence evolution
in phylogenetics. Mol. Biol. Evol. 16:1292–1299.
YANG, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111.
———. 1996. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294–307.
YANG, Z., and B. RANNALA. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo
method. Mol. Biol. Evol. 14:717–724.
ZHARKIKH, A., and W.-H. LI. 1992. Statistical properties of
bootstrap estimation of phylogenetic variability from nucleotide sequences: II. Four taxa without a molecular clock. J.
Mol. Evol. 35:356–366.
APPENDIX A
Here, we prove that s^GM = s^M
if the diagonal elements of the weight matrix M are
equal to zero and the off-diagonal elements are equal to
one. It suffices to show that then,
\[ s_1^{GM} = d_6 + \frac{1}{2}(d_9 + d_{14}) = s_1^{M} = g(AB|CD). \]
One has
\[ g(AB|CD) = \sum_l g_l(AB|CD) = \frac{1}{2} \sum_l \Bigl( \max\nolimits_l - [M(A_l, B_l) + M(C_l, D_l)] \Bigr). \]
The following table lists the values of g_l(AB|CD) for
each of the 15 possible position categories. Let m_l =
M(A_l, B_l) + M(C_l, D_l); then,

Category   m_l   max_l   g_l(AB|CD)
C1         0     0       0
C2         1     1       0
C3         1     1       0
C4         1     1       0
C5         1     1       0
C6         0     2       1
C7         2     2       0
C8         2     2       0
C9         1     2       1/2
C10        2     2       0
C11        2     2       0
C12        2     2       0
C13        2     2       0
C14        1     2       1/2
C15        2     2       0

Thus, g_l(AB|CD) is not equal to zero if and only if l ∈
C6, C9, or C14. Therefore,
\[ g(AB|CD) = \sum_l g_l(AB|CD) = |(k \in C_6)| + \frac{1}{2}|(k \in C_9)| + \frac{1}{2}|(k \in C_{14})| = d_6 + \frac{1}{2} d_9 + \frac{1}{2} d_{14} = s_1^{GM}. \]
By symmetry, then, s_2^GM = g(AC|BD) and s_3^GM = g(AD|BC).

APPENDIX B
For tree model II in figure 3, the probability that
sequence A, B, C, D, respectively, has nucleotide w, x,
y, z, respectively, at a specific site is given by:
\[ \Pr(w, x, y, z) = \sum_{n=1}^{4} g_n \cdot \Bigl\{ \sum_{u=1}^{4} P_{nu}(c/2) \cdot P_{uy}(a) \cdot P_{uz}(b) \Bigr\} \cdot \Bigl\{ \sum_{v=1}^{4} P_{nv}(c/2) \cdot P_{vw}(a) \cdot P_{vx}(b) \Bigr\}, \qquad (9) \]
where g_n is the probability of nucleotide n occurring at
the specified site. In the Jukes-Cantor model, as well as
in the Kimura two-parameter model, the nucleotide frequencies are assumed to be at equilibrium, and therefore
g_n = 0.25 for all n. P_xx denotes the probability that nucleotide x at time t is identical with that at time 0, and
P_xy is the probability that nucleotide x at time 0 changes
to nucleotide y at time t.
In the Jukes-Cantor model, then,
\[ P_{xx}(t) = \frac{1}{4} + \frac{3}{4} \exp\Bigl(-\frac{4}{3}\lambda t\Bigr) \quad \text{and} \quad P_{xy}(t) = \frac{1}{4} - \frac{1}{4} \exp\Bigl(-\frac{4}{3}\lambda t\Bigr). \]
In the Kimura two-parameter model, transitional
changes are assumed to occur at rate α and transversional changes are assumed to occur at rate β. Then, P_xx
is given by
\[ P_{xx}(t) = \frac{1}{4} + \frac{1}{4}\exp(-4\beta t) + \frac{1}{2}\exp(-2(\alpha + \beta)t). \]
For P_xy, one has to distinguish the two cases of
transitions and transversions. In the first case,
\[ P_{xy}(t) = \frac{1}{4} + \frac{1}{4}\exp(-4\beta t) - \frac{1}{2}\exp(-2(\alpha + \beta)t), \]
and in the latter case,
\[ P_{xy}(t) = \frac{1}{4} - \frac{1}{4}\exp(-4\beta t) \]
(Li 1997). When the expected rate of nucleotide substitutions per site (λt) and κ = α/2β are given, α and β
can easily be determined.
The probabilities p_j = Pr(C_j) of the 15 different
configurations as listed in expression (3) are obtained by
summing the appropriate values of Pr(w, x, y, z) according to equation (9).
Assuming the Jukes-Cantor model and using definitions (4)–(6) of the supports and equation (2), it is then
straightforward to compute the expected coordinates of
the simplex point:
\[ s_1^{GM} = \frac{4 + 3z^{c} - 4z^{a} - 3z^{2a} - 4z^{b} - 3z^{2b} - 6z^{a+b} + 3z^{2a+2b+c} + 10z^{a+b+c}}{12 + 9z^{c} - 12z^{a} - z^{2a} - 12z^{b} - z^{2b} - 2z^{a+b} + 9z^{2a+2b+c} - 2z^{a+b+c}}, \qquad (10) \]
\[ s_2^{GM} = \frac{4 + 3z^{c} - 4z^{a} + 5z^{2a} - 4z^{b} + 5z^{2b} - 6z^{a+b} + 3z^{2a+2b+c} - 6z^{a+b+c}}{12 + 9z^{c} - 12z^{a} - z^{2a} - 12z^{b} - z^{2b} - 2z^{a+b} + 9z^{2a+2b+c} - 2z^{a+b+c}}, \qquad (11) \]
\[ s_3^{GM} = \frac{4 + 3z^{c} - 4z^{a} - 3z^{2a} - 4z^{b} - 3z^{2b} + 10z^{a+b} + 3z^{2a+2b+c} - 6z^{a+b+c}}{12 + 9z^{c} - 12z^{a} - z^{2a} - 12z^{b} - z^{2b} - 2z^{a+b} + 9z^{2a+2b+c} - 2z^{a+b+c}}, \qquad (12) \]
with z = exp(4/3).
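Equations (10)–(12) are straightforward to evaluate numerically. The following Python sketch computes the three expected coordinates for given branch lengths under the Jukes-Cantor model; the function name is illustrative, and the shared terms of the three numerators are collected in one helper variable for readability.

```python
import math

Z = math.exp(4.0 / 3.0)  # z = exp(4/3)


def expected_gm_coordinates(a, b, c):
    """Expected simplex point (s1, s2, s3) of equations (10)-(12)."""
    za, zb, zc = Z ** a, Z ** b, Z ** c
    zab = za * zb  # z^(a+b)

    # Terms shared by all three numerators.
    common = 4 + 3 * zc - 4 * za - 4 * zb + 3 * zab ** 2 * zc

    n1 = common - 3 * za ** 2 - 3 * zb ** 2 - 6 * zab + 10 * zab * zc
    n2 = common + 5 * za ** 2 + 5 * zb ** 2 - 6 * zab - 6 * zab * zc
    n3 = common - 3 * za ** 2 - 3 * zb ** 2 + 10 * zab - 6 * zab * zc

    denominator = (12 + 9 * zc - 12 * za - za ** 2 - 12 * zb - zb ** 2
                   - 2 * zab + 9 * zab ** 2 * zc - 2 * zab * zc)
    return n1 / denominator, n2 / denominator, n3 / denominator


if __name__ == "__main__":
    for c in (0.01, 0.2):
        s1, s2, s3 = expected_gm_coordinates(0.1, 0.1, c)
        print(f"c = {c}: s = ({s1:.4f}, {s2:.4f}, {s3:.4f}), s1 - s2 = {s1 - s2:.4f}")
```

The three numerators add up to the common denominator, so the coordinates sum to one; for a = b = 0.1 the difference s1 − s2 exceeds 0.5 at c = 0.2 but not at c = 0.01, as inequality (13) and equation (14) predict.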
APPENDIX C
Now, assume different weights for transitions (s)
and transversions (v). Consider the matrix
\[ M = \begin{array}{c|cccc} & A & G & C & T \\ \hline A & 0 & s & v & v \\ G & s & 0 & v & v \\ C & v & v & 0 & s \\ T & v & v & s & 0 \end{array}, \]
with s ≤ v. A calculation similar to that in appendix A
shows that g_l(AB|CD) ≠ 0 if l ∈ C6, C9, C14, C15.
a. l ∈ C6: A total of 12 combinatorial possibilities contribute to category C6. We have to distinguish two
cases. Four of these 12 possibilities are of the type
(R1R1R2R2) and/or (Y1Y1Y2Y2), where Ri or Yi denotes
one of the two purines and/or pyrimidines. For these
positions, we compute
\[ g_l(AB|CD) = \frac{1}{2}(2s - 0) = s. \]
Eight of the possibilities in C6 are of the type (RRYY).
For these, we get
\[ g_l(AB|CD) = \frac{1}{2}(2v - 0) = v. \]
Then,
\[ \sum_{l \in C_6} g_l(AB|CD) = \frac{s}{3} + \frac{2v}{3} = \frac{s + 2v}{3}. \]
b. l ∈ C9: A total of 24 combinatorial possibilities contribute to category C9. Again, we have to distinguish
two cases. Eight possibilities are of the type (RRY1Y2)
and/or (YYR1R2). For these positions, we compute
\[ g_l(AB|CD) = \frac{1}{2}(2v - s). \]
Sixteen possibilities are of the type (R1R1R2Y) and/or
(Y1Y1Y2R). For these, we get
\[ g_l(AB|CD) = \frac{1}{2}(s + v - v) = \frac{s}{2}. \]
Then,
\[ \sum_{l \in C_9} g_l(AB|CD) = \frac{1}{3} \cdot \frac{2v - s}{2} + \frac{2}{3} \cdot \frac{s}{2} = \frac{2v + s}{6}. \]
c. l ∈ C14: Identical to case b.
d. l ∈ C15: There are a total of 24 possibilities in category C15, of which 8 are of the type (R1R2Y1Y2) and
contribute to g_l(AB|CD). The other 16 possibilities
have a value of g_l(AB|CD) equal to zero. For the
nonzero values, we get
\[ g_l(AB|CD) = \frac{1}{2}(2v - 2s) = v - s. \]
Then,
\[ \sum_{l \in C_{15}} g_l(AB|CD) = \frac{v - s}{3}. \]
Then, we sum up to obtain
\[ g(AB|CD) = \sum_{l \in C_j} g_l(AB|CD)\, p_j = \sum_{l \in C_6} g_l(AB|CD)\, p_6 + \sum_{l \in C_9} g_l(AB|CD)\, p_9 + \sum_{l \in C_{14}} g_l(AB|CD)\, p_{14} + \sum_{l \in C_{15}} g_l(AB|CD)\, p_{15} \]
\[ = \frac{2v + s}{3}\, p_6 + \frac{2v + s}{6}\, p_9 + \frac{2v + s}{6}\, p_{14} + \frac{v - s}{3}\, p_{15}. \]
Note that if s = v = 1, the equation is identical to
that in appendix A. Suppose κ is equal to the ratio
of transition rate and transversion rate parameters. If
we set κ = 2s/v and rescale the weight matrix accordingly,
\[ g(AB|CD) = \frac{4\kappa + 1}{6\kappa}\, p_6 + \frac{4\kappa + 1}{12\kappa}\, (p_9 + p_{14}) + \frac{2\kappa - 1}{6\kappa}\, p_{15} = s_1^{M}. \]
And similarly,
\[ g(AC|BD) = \frac{4\kappa + 1}{6\kappa}\, p_7 + \frac{4\kappa + 1}{12\kappa}\, (p_{10} + p_{13}) + \frac{2\kappa - 1}{6\kappa}\, p_{15} = s_2^{M}, \]
and
\[ g(AD|BC) = \frac{4\kappa + 1}{6\kappa}\, p_8 + \frac{4\kappa + 1}{12\kappa}\, (p_{11} + p_{12}) + \frac{2\kappa - 1}{6\kappa}\, p_{15} = s_3^{M}. \]

APPENDIX D
In this appendix, we compute a relationship between the outer branch lengths a and b and the inner
branch length c such that the simplex point s^GM is
mapped into a tree area in the simplex. We will exemplify it for the tree area T1 = T(AB|CD). s^GM = (s1, s2,
s3) is mapped into the area of tree T1 if and only if the
distance between s^GM and the attractor (1, 0, 0) of T1 is
shorter than the distances between s^GM and all six of the
other attractors. Thus, s^GM is mapped into the trapezoid
of T1 if and only if
\[ s_1^{GM} > \max\{ s_2^{GM}, s_3^{GM} \}. \]
For the binary tree with branch lengths a, b, and c, the
inequality is fulfilled if and only if
\[ s_1^{GM} > s_2^{GM} + 0.5. \qquad (13) \]
Taking the values of the coordinates from equations (10)
and (11) in appendix B, we find that for the Jukes-Cantor model for given a and b, this inequality is fulfilled
iff
\[ c > \frac{3}{4} \ln\left( \frac{12 - 12z^{a} + 15z^{2a} - 12z^{b} + 15z^{2b} - 2z^{a+b}}{34z^{a+b} - 9z^{2a+2b} - 9} \right). \qquad (14) \]
Since a and b are positive, the numerator of the argument of the log is always positive. However, the denominator is less than or equal to zero if
\[ a + b \geq \frac{3}{4} \ln\left( \frac{17 + 4\sqrt{13}}{9} \right) \approx 0.9377. \qquad (15) \]

NARUYA SAITOU, reviewing editor
Accepted March 9, 2001
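The per-site supports of appendices A and C can be sketched in a few lines of Python. The sketch assumes that max_l denotes the largest of the three pairwise sums m_l over the pairings AB|CD, AC|BD, and AD|BC, which is consistent with the table in appendix A; the function names and the default weights are illustrative, and only the characters A, C, G, and T are handled.

```python
PURINES = {"A", "G"}


def weight(x, y, s=1.0, v=1.0):
    """Entry M(x, y): 0 on the diagonal, s for transitions, v for transversions."""
    if x == y:
        return 0.0
    is_transition = (x in PURINES) == (y in PURINES)
    return s if is_transition else v


def site_supports(a, b, c, d, s=1.0, v=1.0):
    """Return (g_l(AB|CD), g_l(AC|BD), g_l(AD|BC)) for one alignment column."""
    m = (
        weight(a, b, s, v) + weight(c, d, s, v),  # pairing AB|CD
        weight(a, c, s, v) + weight(b, d, s, v),  # pairing AC|BD
        weight(a, d, s, v) + weight(b, c, s, v),  # pairing AD|BC
    )
    largest = max(m)  # assumed meaning of max_l
    return tuple(0.5 * (largest - m_i) for m_i in m)


if __name__ == "__main__":
    print(site_supports("A", "A", "C", "C"))            # category C6, unit weights
    print(site_supports("A", "A", "C", "G"))            # category C9, unit weights
    print(site_supports("A", "G", "C", "T", s=1, v=2))  # category C15, weighted
```

With s = v = 1 the only nonzero supports for AB|CD are 1 for category C6 and 1/2 for categories C9 and C14, and with s < v the weighted values s, v, (2v − s)/2, s/2, and v − s of appendix C are recovered.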