Quartet-Mapping, a Generalization of the Likelihood-Mapping Procedure Kay Nieselt-Struwe* and Arndt von Haeseler† *Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany; and †Max-Planck-Institut für evolutionäre Anthropologie, Leipzig, Germany Likelihood-mapping (LM) was suggested as a method of displaying the phylogenetic content of an alignment. However, statistical properties of the method have not been studied. Here we analyze the special case of a fourspecies tree generated under a range of evolution models and compare the results with those of a natural extension of the likelihood-mapping approach, geometry-mapping (GM), which is based on the method of statistical geometry in sequence space. The methods are compared in their abilities to indicate the correct topology. The performance of both methods in detecting the star topology is especially explored. Our results show that LM tends to reject a star tree more often than GM. When assumptions about the evolutionary model of the maximum-likelihood reconstruction are not matched by the true process of evolution, then LM shows a tendency to favor one tree, whereas GM correctly detects the star tree except for very short outer branch lengths with a statistical significance of .0.95 for all models. LM, on the other hand, reconstructs the correct bifurcating tree with a probability of .0.95 for most branch length combinations even under models with varying substitution rates. The parameter domain for which GM recovers the true tree is much smaller. When the exterior branch lengths are larger than a (analytically derived) threshold value depending on the tree shape (rather than the evolutionary model), GM reconstructs a star tree rather than the true tree. We suggest a combined approach of LM and GM for the evaluation of starlike trees. This approach offers the possibility of testing for significant positive interior branch lengths without extensive statistical and computational efforts. Introduction Recently, likelihood-mapping (LM), a method of visualizing the phylogenetic signal of a sequence alignment in a single graph, was proposed (Strimmer and von Haeseler 1997). With this method, an a priori assessment of the phylogenetic content, similar to that of the method of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988), of either the whole set of sequences or a predefined partition into four subsets is possible. In addition, LM can be used to test a posteriori the confidence of an inner branch in a phylogenetic tree. The procedure of LM is based on the three maximum likelihoods for the three possible quartet trees of aligned sequences, which are transformed into barycentric coordinates. The position of the point with these coordinates in a simplex visualizes which of the seven possible treelike and nontreelike quartet graphs represents the underlying topology of divergence of the sequences. In general, any quartet-based evaluation method of phylogenetic signal is implementable in the setup of LM. We coin the general principle quartet-mapping. In this paper, we extend the LM procedure by combining its principle with the method of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988). Both LM and its extension to statistical geometry return the coordinates of a point which is mapped into the same graph. Applied to the same sequences, the two methods can therefore be used to directly compare the positions of the points and consequently the proposed topologies of the sequences. One should, however, noKey words: likelihood-mapping, maximum likelihood, statistical geometry, phylogenetic content, edge test. Address for correspondence and reprints: Arndt von Haeseler, Max-Planck-Institut für evolutionäre Anthropologie, Inselstrasse 22, D-04103 Leipzig, Germany. E-mail: [email protected]. Mol. Biol. Evol. 18(7):1204–1219. 2001 q 2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038 1204 tice a subtle difference between the two approaches. LM is built on the foundation of a stochastic model of sequence evolution, and thus the barycentric coordinates have the statistical interpretation of posterior probabilities. With the extension of this approach to statistical geometry in sequence space or maximum parsimony, for example, a clear statistical interpretation of the coordinates is lacking. However, the statistical properties of the two quartet-mapping procedures with respect to their ability to reconstruct the true evolutionary history have not been investigated so far. Although in numerous investigations it has been shown that maximum likelihood (ML) is generally robust even when evolution assumptions are violated in the reconstruction model (Gaut and Lewis 1995; Huelsenbeck 1995; Yang 1996), its performance in the LM framework is not clear. One explicit question is that of support for the correct topology, especially for the star tree. In principle, a likelihood ratio test can be constructed for the star tree versus a tree phylogeny from the ML estimates (Churchill, von Haeseler, and Navidi 1992; Strimmer, Goldman, and von Haeseler 1997; Yang and Rannala 1997; Whelan and Goldman 1999; Ota et al. 2000). However, it has been shown that under the null hypothesis, the likelihood ratio test’s statistics deviate from the usual x2 approximation (Goldman 1993; Gaut and Lewis 1995), and thus the test is unreliable. The bootstrap test as introduced into the phylogenetic world by Felsenstein (1985) tends to support the tree favored by the tree-making method (Zharkikh and Li 1992). Another test, called the interiorbranch test, computes the standard error of estimates of edge lengths (Nei, Stephens, and Saitou 1985; Li 1989; Rzhetsky and Nei 1992). Here, the emphasis of the test lies on the assessment of the significance of a positive branch length by means of a normal deviate test. Both the bootstrap test and the interior branch test have been shown to become conservative for starlike trees (Sitnikova 1996). Quartet-Mapping In this paper, we use the quartet-mapping approach of ML and statistical geometry to investigate the following two questions: (1) Which tree shapes occur for which evolution models? (2) Given that the tree shape is correct, when is an interior branch length reliably positive? For these investigations, we use two approaches: We simulate sequence data evolving along a wide range of tree models under a broad range of parameter settings, including differences in the ratio of transition to transversion substitutions, discrete gamma-distributed site variation, unequal base composition, and unequal rate matrix entries. From the sequences, the simplex points are computed using the ML estimates as well as the statistical geometry parameters. We compare the positions of the points in the simplex relative to each other, and we compare the points with the true points as obtained from the tree models. In addition, we compute the expected coordinates of the simplex point obtained from the statistical geometry parameters using the Jukes-Cantor (Jukes and Cantor 1969) and the Kimura (1980) two-parameter substitution models. We investigate the behavior of LM and its extension to statistical geometry under violation of the evolutionary model. We also compare the efficiencies of both methods in distinguishing between starlike trees and fully resolved trees. We investigate under which circumstances a test for a positive interior branch length can be constructed without extensive statistical and computational efforts. Materials and Methods Quartet-Mapping In this section, we introduce an extension of the LM procedure (Strimmer and von Haeseler 1997). The main feature of LM is the display of the underlying mode of evolution of quartets of aligned sequences in a single graph. There are seven graphs connecting four taxa (Eigen, Winkler-Oswatitsch, and Dress 1988): three fully resolved unrooted trees, the completely unresolved star tree, and the three partially resolved ‘‘trees’’ accounting for the cases in which it is difficult to distinguish between two of the three perfect trees. We denote the three perfect trees by T1, T2, and T3, the star tree by T., and the three amalgamated trees by T12, T13, and T23. LM computes for a quartet of aligned sequences the three MLs L1, L2, and L3 belonging to T1, T2, and T3, respectively, using any model for the evolution of the sequences. The three likelihoods are transformed into posterior probabilities pi 5 Li (L1 1 L2 1 L3), i 5 1, 2, 3, by applying Bayes’ theorem assuming a uniform prior for all three trees. The three probabilities can be viewed as the barycentric coordinates of a vector p 5 (p1, p2, p3), and the corresponding vector is mapped onto a twodimensional simplex S2 5 {p1 e1 1 p2 e2 1 p3 e3 z p1 1 p2 1 p3 5 1, 0 # p1 , p2 , p3 # 1}, (1) where the ei are the unit vectors. In the simplex, seven 1205 FIG. 1.—The two-dimensional simplex graph of the quartet-mapping procedure. The graph shows the attractors (marked with dots), along with corresponding regions (Voronoi cells), of the seven possible graphs connecting four taxa (A, B, C, D). points are distinguished, corresponding to the seven quartet ‘‘trees’’ (see fig. 1). These are the three p vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) of the perfect tree topologies T1, T2, and T3, respectively, the center point with vector p 5 (1/3, 1/3, 1/3) of the star topology T., and the three points of the partially resolved tree topologies T12, T13, and T23, whose corresponding p vectors are (1/2, 1/2, 0), (1/2, 0, 1/2), and (0, 1/2, 1/2). Using the Euclidean distance, the simplex can be subdivided into seven areas (so-called Voronoi cells), where each point in one region has the smallest Euclidean distance to its corresponding point representing one of the seven quartet ‘‘trees.’’ For example, the region R(T.) with the points closest to p 5 (1/3, 1/3, 1/3) of the star topology T. is characterized as R(T.) 5 {(p1, p2, p3):min(p1, p2, p3) . 1/6}. Similarly, R(T1 ) 5 {(p1 , p2 , p3 ) : p1 2 max(p2 , p3 ) . 1/2 and 5/6 , p1 1 min(p2 , p3 ) # 1} and R(T12 ) 5 {(p1 , p2 , p3 ) : 5/6 , p1 1 p2 # 1 and min(p1 , p2 ) . 1/6}. Of course, the Euclidean distance can be substituted by any other reasonable metric. By this approach, a one-to-one relationship between the topology of the quartet and a single point in the simplex is established. As will now be shown, this procedure is not limited to ML estimators. In fact, any quartet-based approach that evaluates some objective function is implementable in the setup of the LM procedure. The generalized method is accordingly coined ‘‘quartet-mapping.’’ In general, each quartet method computes the support of a quartet of sequences for each of the three possible perfect trees. Let s1 be the support of tree T1, s2 that of tree T2, and s3 that of tree T3 as 1206 Nieselt-Struwe and von Haeseler computed by an arbitrary quartet method. We define the relative support si 5 si s1 1 s2 1 s3 (2) of tree Ti. Then 0 # si # 1 and Si si 5 1. Thus the si can be seen, like pi of the LM method, as barycentric coordinates of a vector s, whose domain is the simplex S2. A B C D C1 x x x x C2 y x x x C3 x y x x C4 x x y x C5 x x x y C6 x x y y C7 x y x y C8 x y y x where w, x, y, and z are different nucleotides. For a given quartet of sequences, one sums the number of positions dj that belong to Cj. Note that the sum of the 15 dj values is equal to L, the length of the alignment. The underlying topology of divergence of the quartet is obtained from the ‘‘informative’’ categories. A position category is considered informative if it helps to distinguish between different tree shapes. For statistical geometry, the configurations C6, . . . , C14 are informative. Configurations C6, C9, and C14 support tree T1, C7, C10, and C13 support tree T2; and C8, C11, and C12 support tree T3. GM GM of the We define the supports sGM 1 , s2 , and s3 three perfect tree topologies T1, T2, and T3, respectively, as 1 s GM 5 d6 1 (d9 1 d14 ) 1 2 (4) 1 s GM 5 d7 1 (d10 1 d13 ) 2 2 (5) 1 5 d8 1 (d11 1 d12 ). 2 (6) s GM 3 Then the relative supports sGM of each of the three peri fect trees are given by equation (2). This implementation of the quartet-mapping procedure is called geometrymapping (GM). An extension of the method of statistical geometry takes a weight matrix M of the alphabet into account (unpublished data). Each entry in the matrix M gives a dissimilarity measure between the two respective symbols of the alphabet. Each position in the alignment of the four sequences A, B, C, and D is then analyzed separately as follows: Let Al, Bl, Cl, and Dl be the entries at position l of A, B, C, and D, respectively. Then the four entries define six pairwise distances, M(Al, Bl), M(Al, Cl), M(Al, Dl), M(Bl, Cl), M(Bl, Dl), and M(Cl, Let us exemplify the quartet-mapping approach in more detail for the method of statistical geometry in sequence space (Eigen, Winkler-Oswatitsch, and Dress 1988). This method computes the sequence space graph of quartets of aligned sequences by analyzing the positionwise entries of the sequences. Let A, B, C, and D be a quartet of aligned sequences of length L, whose characters are drawn from an alphabet A of finite length. For nucleotide sequences, there are 15 possible position categories Cj: C9 x x y z C 10 x y x z C 11 x y z x C 12 y x x z C 13 y x z x C 14 y z x x C 15 x y z w (3) Dl), and three distance sums, M(Al, Bl) 1 M(Cl, Dl), M(Al, Cl) 1 M(Bl, Dl), and M(Al, Dl) 1 M(Bl, Cl). We order these three distance sums by magnitude and label them maxl, medl, and minl, with maxl $ medl $ minl. The six pairwise distances together with maxl are used to compute 1 g l (A z BCD) :5 (M(Al , Bl ) 1 M(Al , Cl ) 1 M(Al , Dl ) 2 2 maxl ) 1 g l (B z ACD) :5 (M(Al , Bl ) 1 M(Bl , Cl ) 1 M(Bl , Dl ) 2 2 maxl ) 1 g l (C z ABD) :5 (M(Al , Cl ) 1 M(Bl , Cl ) 1 M(Cl , Dl ) 2 2 maxl ) 1 g l (D z ABC) :5 (M(Al , Dl ) 1 M(Bl , Dl ) 1 M(Cl , Dl ) 2 2 maxl ) 1 g l (AB z CD) :5 (maxl 2 M(Al , Bl ) 2 M(Cl , Dl )) 2 1 g l (AC z BD) :5 (maxl 2 M(Al , Cl ) 2 M(Bl , Dl )) 2 1 g l (AD z BC) :5 (maxl 2 M(Al , Dl ) 2 M(Bl , Cl )). (7) 2 Note that one of the three values g l (AB z CD), gl(AC z BD), gl(AD z BC) is always equal to zero. The nonzero values define a two-dimensional diagram as shown in figure 2a whose distance segments are given exactly by these values. For each position l in the align- Quartet-Mapping 1207 FIG. 2.—a, Given four sequences A, B, C, D, the four entries at each position l define six pairwise distances and three distance sums, which are used to solve an equation system in the seven unknowns gl (A z BCD), gl (B z ACD), gl (CzABD), gl (DzABC), gl (AB z CD), gl (AC z BD), and gl (AD z BC), of which one of the latter three is always equal to zero. The nonzero values can be matched to two-dimensional graphs with six distance segments. b, Summing all parameters gl ( z ) over all positions l yields seven parameters that can be matched to the distance segments of a three-dimensional box graph. ment, the parameters gl(· z ·) are then computed, and the appropriate distance segments are summed up, yielding g(· z ·) 5 Sl gl(· z ·). The segments g(A z BCD), g(B z ACD), g(C z ABD), and g(D z ABC) refer to the lengths of the edges connecting the sequences A, B, C, and D, respectively, with a three-dimensional box whose edge lengths are determined by the three segments g(AB z CD), g(AC z BD), and g(AD z BC) (see fig. 2b). These three segments again determine the overall treelikeness of the quartet, as well as the support of each of the three perfect tree topologies. The box dimensions are used to visualize the treelikeness of the quartet in the two-dimensional simplex M M S2. Again, we define the vector sM 5 (sM 1 , s2 , s3 ), where now, s1M 5 g(AB z CD) g(AB z CD) 1 g(AC z BD) 1 g(AD z BC) s2M 5 g(AC z BD) g(AB z CD) 1 g(AC z BD) 1 g(AD z BC) s3M 5 g(AD z BC) . g(AB z CD) 1 g(AC z BD) 1 g(AD z BC) Since LM and GM are based on the same graph, it is now possible to directly compare various statistical properties of these two tree evaluation methods. Comparison of Likelihood-Mapping and GeometryMapping We conducted both a theoretical analysis and computer simulations of four-taxon trees, in which we sampled a large proportion of the complete parameter space. The model trees are shown in figure 3. Note that the branch lengths are expected numbers of nucleotide substitutions per site. In the theoretical analyses, we computed the expected coordinates of the GM simplex vector given the tree model and assuming the Jukes-Cantor model (Jukes and Cantor 1969) and the Kimura (1980) two-parameter (8) In appendix A, we prove that sGM 5 sM if the off-diagonal entries of the matrix M are equal to one and the diagonal entries are equal to zero. FIG. 3.—The star tree and the two unrooted binary trees used for the simulations. The arrow in tree I indicates the location of the hypothetical root. 1208 Nieselt-Struwe and von Haeseler Table 1 Comprehensive List of Model Trees, Evolution Models, and Reconstruction Models Used in the Simulations Model Tree Star tree . . . . . . . . . . . . Clock tree . . . . . . . . . . . Nonclock tree . . . . . . . . Edge Lengths a 5 0.05, 0.1, . . . , 0.75 b 5 0.05, 0.1, . . . , 0.75 a 5 0.05, 0.1, . . . , 0.75 c 5 0.05, 0.1, . . . , 0.75 a 5 0.05, 0.1, . . . , 0.75 b 5 0.05, 0.1, . . . , 0.75 c5a Models of Sequence Evolution JC F81 K2P HKY JC 1 G(0.8) JC 1 G(0.2) K2P 1 G(0.8) K2P 1 G(0.2) REV Model for Quartet Reconstruction LM LM LM LM LM LM LM LM LM 1 1 1 1 1 1 1 1 1 JC, GM HKY, GM K2P, GM 1 M HKY, GM 1 M JC, GM JC, GM K2P, GM 1 M K2P, GM 1 M HKY, GM NOTE.—JC - Jukes-Cantor model with the transition/transversion parameter k 5 0.5 and equal base frequencies; F81 5 Felsenstein model with k 5 0.5 and unequal base frequences (here, A 5 10%, G 5 30%, C 5 40%, and T 5 20% was used); K2P 5 Kimura two-parameter model with k 5 4 and equal base frequencies; HKY 5 Hasegawa-Kishino-Yano model with k 5 4.0 and unequal base frequencies (same as for the F81 model); JC 1 G(a) and K2P 1 G(a) 5 JukesCantor and Kimura two-paramter models in addition to a discrete gamma distribution G(a) with a 5 0.8 (medium rate variation) and a 5 0.2 (severe rate variation); REV 5 fully reversible model with substitution rate matrix Q as shown in the text. model. In the LM setting, a prediction of the simplex vector, and therefore of the behavior of the ML method, is too difficult. Thus, here the statistical properties were only investigated by means of simulations. For the simulations, various cases of the HKY model (Hasegawa, Kishino, and Yano 1985) and a fully reversible REV model (Yang 1994) were assumed for the sequence evolution. The cases used, together with the model trees, the branch lengths, and the models assumed in the quartet reconstruction, are listed in table 1. For the fully reversible model, we used the following the ‘‘instantaneous nucleotide substitution rate’’ matrix Q: Q5 A G C T A G 5 216 5 28 1 2 1 10 C T 1 10 2 1 24 1 1 212 Since Q is symmetric, the stationary base composition is uniform. In all model trees, the branch lengths were increased in steps of 0.05 covering the range from 0.05 to 0.75 (fig. 4, top); thus, for each tree and each evolution model, a total of 225 parameter combinations were analyzed. In all simulations, the sequence length was set to 1,000, and for each branch length combination, 200 repetitions were carried out for each set of parameters. LM was used as implemented in PUZZLE, version 3.0 (Strimmer and von Haeseler 1996). For the analyses of the sequences, the HKY model with the default option of no site variation was used as the reconstruction model for all simulations. Also prespecified was the transition/transversion parameter k according to the model used. Thus, in the presence of a gamma distribution of the sites, the reconstruction model leads to a violation of the assumption. Also, when sequences were generated under a fully reversible model, REV, the assumption that all transitions and all transversions had equal substitution rates was violated in the reconstruction process. GM was used as implemented in STATGEOM, version 2.0. For all simulation models with a transition/ transversion parameter k 5 4.0 (where k 5 2s/v; see appendix C) the weighted GM method with a weight matrix M of the alphabet taking this k ratio into account was used. In the other cases, the unweighted GM method was applied. In the star topology simulations, the percentage of simplex points placed (correctly) into the star area, as well as the percentage of points mapped (incorrectly) into either of the three tree areas, was recorded. For the clock and the nonclock trees the percentage of simplex points mapped into the correct tree area was computed. Results For GM, the relative frequencies of different position categories in the alignment determine the coordinates of the simplex point and therefore the topology of the four-taxon graph. For four nucleotide sequences, there are 15 different position categories, as listed in expression (3). When the evolutionary tree and the model of sequence evolution are known, the probabilities of the different categories can easily be computed (Saitou and Nei 1986). Appendix B shows the probabilities of each category for the nonclock tree (right graph of fig. 3) assuming the Jukes-Cantor model. From the ‘‘informative’’ categories C6, . . . , C14, it is then straightforward to compute the expected coordinates of the simGM GM plex point sGM 5 (sGM 1 , s2 , s3 ) (see appendix B). Similar calculations can be performed for the Kimura twoparameter model. In appendix C, we compute the supports sM i in the GM case when different weights of transitions and transversions are considered. Given the outer branch lengths a and b, the coordinates of the simplex point can be used to calculate the Quartet-Mapping 1209 FIG. 4.—The parameter space representing branch length combinations as used in the simulations. The left graph refers to the star tree, the middle graph to the clock tree, and the right graph to the nonclock tree. The gray scale (at the bottom) is also used in figure 7. area into which the simplex point is mapped as a function of the inner branch length c. In the following sections, we will separately analyze the star tree, i.e., the case in which the expected inner branch length c is equal to zero, and the trees with positive inner branch lengths. Success in Predicting the Star Tree Figure 5 shows the results of quartet-mapping for the star tree when HKY substitution models or a general reversible model was simulated. The gray scale is shown in figure 4. In the diagram, the shades in column 1 represent the percentages of quartets that are in the star tree area, while in columns 2–4, the shades show the percentages of quartets that are in tree areas 1, 2, and 3, respectively. We note that the probability of detecting the star tree T. is virtually independent of the choice of a and b. GM has a better chance of correctly identifying the star topology. LM, on the other hand, is too liberal in accepting a binary tree. Table 2 shows the average percentages of quartets over all 225 parameter combinations of branch lengths a and b that are in the seven simplex areas T., T1, T2, T3, T12, T13, and T23, respectively. While GM shows almost constant values for all evolution models, the values of LM are quite different for the different evolution models. If the model of sequence evolution is violated in the reconstruction process, then LM has the tendency to favor the wrong topology (%T2 . %T1 and %T3, and %T2 . %T.). This tendency is more pronounced if a K b or vice versa. GM suffers from this tendency only for very small values of a or b. We conclude that LM has less power to detect the star tree than does GM. This conclusion is especially valid if the evolutionary model is violated. Then, LM has a strong tendency to suggest a wrong tree. Success in Predicting the Binary Tree We first analytically computed the areas into which the simplex point of GM was mapped for the parameter spaces of the clock and the nonclock trees. We used the Jukes-Cantor and the Kimura two-parameter evolutionary models. In the latter case, we compared the properties of the unweighted GM method with those of the weighted GM 1 M method. The results of figure 6a show that for the clock tree, depending on the branch length combinations, GM suggests either the correct tree or the star topology. If the weighted GM method is applied, then the T1 region is significantly enlarged. For the nonclock tree (fig. 6b), the set of branch length combinations for the true tree area is smaller than that for the clock tree. For most parameter values, GM suggests either the T12 topology or the star tree. For a very small range of branch length combinations, the simplex point is even mapped into the ‘‘inconsistent’’ tree area T2 5 T(AC z BD). While figure 6 displays the analytical results for GM, such calculations seem impossible for the LM situation. Also, for the more complex evolution models, analytical expressions relating the model to the expected coordinates of the simplex point are rather difficult. Consequently, figure 7 displays the results from simulations for the clock tree. For each branch length combination, the percentage of cases in which LM and GM map the simplex point into the correct tree area is shown. Figure 7 follows the color shades of figure 4. The unfavorable bias of LM in the starlike situation is now clearly an advantage. In a wide range, the probability of recovering the true tree is .95%, even in the cases when the evolution model assumptions are violated. The parameter range for which GM maps .95% of the points into the correct tree area is much smaller. The separating line of the 95% area is almost equal to the 1210 Nieselt-Struwe and von Haeseler FIG. 5.—Density plot of the star tree; complete parameter space (for the parameter domain and the gray scale, see fig. 4). Indicated are percentages of cases in which likelihood-mapping (top graphs) and geometry-mapping (bottom graphs) placed the simplex point into the star tree area (first column), tree T1 (second column), tree T2 (third column), and tree T3 (fourth column). a, Jukes-Cantor (JC) model. b, F81 model. c, Kimura two-parameter (K2P) model with k 5 4.0. d, HKY model with k 5 4.0 and f 5 (0.1, 0.3, 0.4, 0.2). e, JC model 1 G with gamma shape parameter a 5 0.8. f, JC model 1 G with a 5 0.2. g, K2P model 1 G with k 5 4.0 and a 5 0.8. h, K2P model 1 G with k 5 4.0 and a 5 0.2. i, REV model. respective analytical curves. The line is shifted to the left, i.e., to even smaller outer branch lengths, when the sequences evolve with a severe site variation (see fig. 7f and i). The results of the simulations of nonclock tree II reinforce the clear superiority of the ML method in detecting a tree (data not shown). In almost the entire parameter space, .95% of all quartets are mapped into the correct tree area. The probability was only slightly reduced when a severe site variation was assumed, and the difference between the two equal branch lengths and the three equal branch lengths was large. GM, in com- parison, is successful in reconstructing the correct tree in only a very small part of the parameter space. We conclude that the success and the stability of LM is virtually independent of the assumed evolution model. The branch length combinations for which LM reconstructs the true tree cover a large part of the parameter space. The results of the GM analyses, on the other hand, show a clear dependency of success in reconstructing the true tree on the tree shape rather than the evolution model. The simulations and the analytical considerations indicate a relationship between the lengths of the outer Quartet-Mapping 1211 Table 2 Average Percentage of Quartets in the Star Areas T., Tree Areas T1, T2, and T3, and Net Areas T12, T13, T23 of 225 Parameter Combinations for the Star Tree Evolution Model JC . . . . . . . . . . . . . . . F81 . . . . . . . . . . . . . . K2P . . . . . . . . . . . . . HKY. . . . . . . . . . . . . JC 1 G(0.8). . . . . . . JC 1 G(0.2). . . . . . . K2P 1 G(0.8) . . . . . K2P 1 G(0.2) . . . . . REV . . . . . . . . . . . . . Reconstruction Model %T. % T1 %T2 %T3 %T12 %T13 %T23 LM 1 JC GM LM 1 HKY GM LM 1 K2P GM 1 M LM 1 HKY GM 1 M LM 1 JC GM LM 1 JC GM LM 1 K2P GM 1 M LM 1 K2P GM 1 M LM 1 HKY GM 84.4 88.5 83.5 88.6 85.8 88.6 87.4 88.4 11.0 90.1 2.0 92.2 7.7 89.9 1.2 90.7 18.8 89.9 4.3 0.0 4.0 0.0 3.6 0.0 3.1 0.0 10.8 0.0 14.6 0.0 14.6 0.0 18.2 0.0 13.3 0.0 3.8 7.2 4.25 7.0 4.2 6.2 4.1 6.5 43.1 6.1 53.7 3.5 38.2 5.1 51.8 3.7 29.6 6.2 5.2 0.0 5.2 0.0 3.2 0.0 3.1 0.0 12.2 0.0 15.7 0.0 17.3 0.0 17.1 0.0 14.6 0.0 0.9 2.2 1.1 2.0 1.1 2.6 0.9 2.4 9.6 1.9 5.6 2.1 8.4 2.5 4.4 2.6 9.5 2.0 0.9 1.8 1.1 2.0 1.1 2.4 0.9 2.5 8.6 1.6 5.9 1.9 8.3 2.2 4.3 2.8 9.2 1.7 0.5 0.0 0.5 0.1 0.8 0.1 0.5 0.1 4.7 0.0 2.9 0.0 5.4 0.0 3.1 0.0 4.9 0.0 NOTE.—JC - Jukes-Cantor model with the transition/transversion parameter k 5 0.5 and equal base frequencies; F81 5 Felsenstein model with k 5 0.5 and unequal base frequences (here, A 5 10%, G 5 30%, C 5 40%, and T 5 20% was used); K2P 5 Kimura two-parameter model with k 5 4 and equal base frequencies; HKY 5 Hasegawa-Kishino-Yano model with k 5 4.0 and unequal base frequencies (same as for the F81 model); JC 1 G(a) and K2P 1 G(a) 5 JukesCantor and Kimura two-paramter models in addition to a discrete gamma distribution G(a) with a 5 0.8 (medium rate variation) and a 5 0.2 (severe rate variation); REV 5 fully reversible model with substitution rate matrix Q as shown in the text. branches, a and b, and that of the inner branch, c, that determines into which simplex area the GM point is mapped. From the calculations in appendix D, we find that for the Jukes-Cantor model, the sum of the two outer branch lengths has to be smaller than a threshold value as determined in equation (15) such that there exists a positive minimal inner branch length c, given by equation (14), for which the corresponding simplex point is mapped into the correct tree area: a 1 b & 0.9377. This is the threshold line that separates the correct tree area from the star tree area in the clocktree as shown in figure 6. In the Jukes-Cantor case, if the sum of a and b is larger than this threshold value, then it is impossible to find the true tree based on GM. A similar threshold for c is, of course, conceivable for more complex models of sequence evolution. To estimate that minimal value, cmin, we computed the location of s as a function of c for three examples with fixed outer branch lengths: a 5 0.1 for clock tree I, and a 5 0.01, b 5 0.1 and a 5 0.005, b 5 0.1, respectively, for nonclock tree II. c was increased in steps of 0.005 starting from c 5 0.005. Based on 200 simulations for a fixed c, the vector s̄(c) was computed, and the minimal value cmin, for which s̄(c) was mapped into the correct tree area T1 was determined. Moreover, we computed the minimal inner edge length c0.95 min such that at least 95% of the simulated s vectors were in T1. The results are shown in table 3. We first note that the analytically computed values from equation (14) for the Jukes-Cantor model in the GM case are identical with the respective values of table 3. Two further results are notable: the minimal branch length cmin is equal to 0.01 6 0.005 in the LM situation for all trees and models. This confirms our previous conclusion that LM is robust in detecting the correct tree even if the model assumptions are violated. Second, cmin as computed by the GM method is in all cases larger than cmin of LM and changes considerably with the evolution models. It ranges between 0.02 (Jukes-Cantor) and 0.55 (Kimura two-parameter 1 G (0.2)). We note too that the differences between cmin and c0.95 min are larger for the GM method than for the LM implementation. We conclude that GM is successful in reconstructing the correct tree when the following three conditions are fulfilled: (1) the sequences evolve without a severe site variation; (2) the total sum of the outer branch lengths is less than 0.9, and thus the mean length of one outer branch does not exceed 0.225; and (3) the interior branch length is larger than a minimal value that depends on the lengths of the outer branches. Applications Here we illustrate the two methods for two data sets. The first data set consisted of 24 mitochondrial sequences (11,133 bp) from three eutherian classes (18 eutheria, 2 marsupialia, and 1 monotremata) and three other noneutherian sequences (chicken, lungfish, and frog). The three codon positions were analyzed separately. For all codon positions, LM placed most of the quartets into the region of tree T1 (fig. 8a), thus suggesting a branching pattern of eutheria and noneutheria 1212 Nieselt-Struwe and von Haeseler FIG. 6.—For each parameter combination, the expected area is plotted, and the simplex point of the geometry parameters as obtained from the analytical equations is mapped into it. In the left graph, the Jukes-Cantor model was assumed, and in the middle and right graphs, the Kimura two-parameter model was assumed. In the middle graph, the statistical geometry parameters with an unweighted alphabet metric were computed; in the right graph, a weighted metric with the correct transition/transversion parameter was used. a, Density plot of the clock tree (I). b, Density plot of the nonclock tree (II). versus marsupialia and monotremata. GM, on the other hand, mapped all quartets into the star tree area, suggesting an undefined branching pattern (fig. 8b). The second data set was taken from Kuiken et al. (1994). It consisted of 72 HIV-1 sequences of the hypervariable region V3 of the external envelope of the virion. The sample covered different risk groups and different infection years. Kuiken et al. (1994) showed that the data set had been randomized to a large extent, thus allowing almost no phylogenetic interpretation. However, one clustering, associated with risk group, could be distinguished. Therefore, we divided the data set into two groups, the group of HIV-1-infected drug users (IVD), and the group of HIV-1-infected homosexuals as well as HIV-1-infected hemophiliacs (non-IVD), and applied quartet-mapping to this clustering. Both LM and GM confirm the result of Kuiken et al. (1994) (fig. 9). Although GM maps a larger amount of the quartets into the star tree area, both mapping methods support the clustering of the IVD group versus the non-IVD group, while there are only few quartets that cannot be clustered within these risk groups. Discussion In this paper, we introduced an extension of the LM procedure to other quartet-based tree reconstruction methods. We exemplified this by implementing the method of statistical geometry in sequence space in the LM framework. We compared the statistical properties of both methods by means of analytical, in addition to simulation, studies. We assessed the success of both methods in detecting the star tree, as well as their success in reconstructing the correct topology. Let us point out that the general idea of quartetmapping is to reduce the probability that a wrong tree is reconstructed. A tree is recognized only if the simplex point is mapped into one of the three small trapezia (see fig. 1). The main conclusion of this paper is that the shape of the tree, rather than the substitution model, determines how reliably and successfully ML and statistical geometry map the simplex point into the correct topology area. Let us discuss the two methods separately. In all analyzed cases, LM appears to be too liberal in suggesting a tree when the underlying topology is a star FIG. 7.—Density plot of the clock tree I parameter space. Indicated are percentages of cases in which likelihood-mapping (left graphs) and geometry-mapping (right graphs) placed the simplex point into the correct tree area. The gray scale is given in figure 4. a, Jukes-Cantor (JC) model. b, Kimura two-parameter (K2P) model. c, HKY model. d, F81 model. e, JC 1 G with a 5 0.8. f, JC 1 G with a 5 0.2. g, REV model. h, K2P 1 G with a 5 0.8. i, K2P 1 G with a 5 0.2. Quartet-Mapping 1213 Table 3 Minimal Inner Branch Lengths (cmin) for Which the Mean Simplex Point Computed from 200 Runs Is Mapped into the Correct Tree Area, and Minimal Inner Branch Lengths (c0.95 min ) for Which at Least 95% of All Quartets Are Mapped into the Correct Tree Area EVOLUTION MODEL JC . . . . . . . . . . . . . . . F81 . . . . . . . . . . . . . . JC 1 G(0.8). . . . . . . JC 1 G(0.2). . . . . . . K2P . . . . . . . . . . . . . K2P 1 G(0.8) . . . . . K2P 1 G(0.2) . . . . . REV . . . . . . . . . . . . . TREE I TREE IIA TREE IIB RECONSTRUCTION METHOD cmin c0.95 min cmin c0.95 min cmin c0.95 min LM 1 JC GM LM 1 JC GM LM 1 HKY GM LM 1 JC GM LM 1 K2P GM 1 M LM 1 K2P GM 1 M LM 1 K2P GM 1 M LM 1 HKY GM 0.01 0.045 0.01 0.045 0.01 0.075 0.015 0.135 0.01 0.04 0.01 0.06 0.015 0.1 0.01 0.06 0.015 0.075 0.02 0.0825 0.02 0.125 0.02 0.23 0.02 0.065 0.02 0.1125 0.035 0.2 0.0175 0.095 0.005 0.02 0.005 0.02 0.005 0.035 0.01 0.06 0.005 0.02 0.005 0.04 0.01 0.045 0.005 0.025 0.0075 0.0325 0.0075 0.0375 0.010 0.065 0.02 0.105 0.01 0.035 0.015 0.0625 0.02 0.09 0.0075 0.05 0.005 0.02 0.005 0.025 0.005 0.04 0.015 0.065 0.005 0.02 0.005 0.035 0.015 0.55 0.005 0.03 0.0075 0.035 0.0075 0.040 0.01 0.0625 0.02 0.1 0.0075 0.035 0.02 0.09 0.02 0.09 0.0075 0.0475 NOTE.—JC - Jukes-Cantor model with the transition/transversion parameter k 5 0.5 and equal base frequencies; F81 5 Felsenstein model with k 5 0.5 and unequal base frequences (here, A 5 10%, G 5 30%, C 5 40%, and T 5 20% was used); K2P 5 Kimura two-parameter model with k 5 4 and equal base frequencies; HKY 5 Hasegawa-Kishino-Yano model with k 5 4.0 and unequal base frequencies (same as for the F81 model); JC 1 G(a) and K2P 1 G(a) 5 JukesCantor and Kimura two-paramter models in addition to a discrete gamma distribution G(a) with a 5 0.8 (medium rate variation) and a 5 0.2 (severe rate variation); REV 5 fully reversible model with substitution rate matrix Q as shown in the text. FIG. 8.—Four-cluster quartet-mapping of mitochondrial DNA (11,133 bases). Sequences were divided into four disjoint groups: 18 eutheria, 2 marsupialia, 1 monotremata, and 3 noneutheria. The three codon positions were analyzed separately (left: codon position 1; middle: codon position 2; right: codon position 3, respectively). a, Results of likelihood-mapping. b, Results of quartet-mapping. 1214 Nieselt-Struwe and von Haeseler FIG. 9.—Quartet-mapping graphs of 72 HIV-1 sequences of hypervariable region V3 of the external envelope of the virion. The set was divided into two groups: the group of HIV-1-infected drug users (IVD), and the group of the others (non-IVD). A random sample of 5,000 quartets were analyzed. a, Results of likelihood-mapping. b, Results of quartet-mapping. tree or very close to it. This bias is especially pronounced if the reconstruction model does not take rate heterogeneity among sites into account. In the heterogeneous case, as well as the REV case, the binary tree which brings the long edges together is favored more often than the other two binary trees. The ultimate reason for this phenomenon needs further investigation, but it has been observed in other studies that when the data violate the assumptions of the analysis model, the likelihood ratio test provides strong support for the tree separating the long edges from the short edges (Gaut and Lewis 1995). LM appears to be robust and successful in reconstructing the correct tree over much of the whole parameter space even when simple evolution models are used instead of more complex (and correct) ones. This agrees with robustness results reported elsewhere (Gaut and Lewis 1995; Huelsenbeck 1995; Yang 1996). Our results show that GM allows for consistent topology reconstruction if the inner branch length is longer than a threshold value. This threshold value depends critically on the tree shape and to some degree on the substitution model. However, there exists an upper limit for the sum of the outer branches for which the interior branch length has to be infinitely long such that the correct tree is reconstructed. Thus, in a major part of the parameter space, GM suggests a star tree. On the other hand, only a small range of parameters leads to an inconsistent tree reconstruction, i.e., for which GM suggests the incorrect tree. Thus, by using the quartet-mapping approach, i.e., by considering the seven quartet graphs rather than only the three binary trees, the Felsenstein zone is significantly reduced. The large region of the star tree, whose area is equal to the sum of the three areas of the three perfect trees, is the reason that GM suggests a star tree in a large part of the parameter space. GM evaluates the absolute number of positions, and, in addition, it does not take the information of the outer branch lengths into account. Thus, the level of ‘‘noise’’ is quickly approached, such that a tree can no longer be extracted from the sequence alignment. It is interesting to note here that GM allows for consistent tree reconstruction even if the molecular clock is violated, as long as the threshold relation of the branch lengths is satisfied. This is in a way surprising, since statistical geometry, similarly to maximum parsimony, makes no explicit assumption about the underlying process of evolution. One could now ask if the range of branch lengths for which GM reconstructs the true tree can be enlarged. One source of noise that destroys the reconstruction of the true tree in the statistical-geometry approach of the sequence space approach is the use of the positions in which only one pair has equal characters. We therefore used a second, modified, definition of the coordinates of the simplex point by taking merely the ‘‘parsimonious informative’’ positions (Swofford and Olsen 1990), i.e., those in which two pairs have equal characters. These are the position categories C6–C8 in expression (3). Then, the three coordinates of the simplex vector are given by Quartet-Mapping 1215 FIG. 10.—Size of the areas into which the simplex point of the original geometry-mapping is mapped as compared with the simplex point of parsimony-mapping under the assumption of the Jukes-Cantor model. In the respective left plots, the original geometry-mapping is shown; in the right plots, the parsimony-mapping is shown. a, Clock tree. The dashed lines indicate the threshold value for the outer branch length a. b, Nonclock tree. sMP 5 d6/(d6 1 d7 1 d8), s2MP 5 d7/(d6 1 d7 1 d8), and 1 MP s3 5 d8/(d6 1 d7 1 d8). Note that this is the quartetmapping implementation of the maximum-parsimony method (Farris, Kluge, and Eckhardt 1970; Fitch 1971). Assuming the Jukes-Cantor model, we computed the expected coordinates sMP for the clock tree and the nonclock i tree. Figure 10 shows the comparison of the predicted areas using the geometry implementation and the parsimony implementation. Indeed, we find a significantly larger threshold line for the T1 area for the clock tree and an enlarged T1 area for the nonclock tree. A similar computation for the threshold relation (cf. eqs. 13, 14, and 15) now gives 3 a 1 b $ ln(5) ø 1.2071. 4 We conclude that in the maximum-parsimony case, the threshold for the sum of the outer branch lengths is ø1.27 times as large as the one for GM. In addition, for given branch lengths a and b, the minimal inner branch length c, for which the simplex point is mapped into the correct tree area, is smaller. The effect observed for GM is also observed for the maximum-parsimony implementation of quartetmapping. Thus, by considering not only the three binary trees, the Felsenstein zone can also be reduced for the maximum-parsimony method. However, it can clearly be seen that the domain in the parameter space for which GM or maximum parsimony maps the simplex point into the star tree area rather than into the binary tree area is still very large. Future studies will therefore address the problem of reducing the attraction of the star tree area. When evaluating a reconstructed tree topology, one of the biggest difficulties is in deciding when the length of an interior branch is equal to zero. From the star tree simulations, we find that ML as implemented in LM suggests a binary tree too often. On the other hand, the domain of the parameter space in which GM reconstructs a star tree is too large. We therefore propose the following procedure to test for a positive interior branch length using a combination of LM, GM, and ML. For a quartet of sequences, both LM and GM are applied. Then, we distinguish four cases: 1. Both LM and GM propose the star topology. Then, indeed, the interior branch length is equal to zero. 2. LM proposes a star topology, whereas GM proposes a tree topology. This case does not occur and can therefore be ignored. 3. Both LM and GM propose the same tree topology. Then, the interior branch length is indeed positive. 4. LM proposes a tree topology, and GM proposes a star topology. In this case, we use ML to calculate the expected branch lengths of the four outer branches, denoted by â1, â2, b̂1, and b̂2, and the expected inner branch length ĉ. We now assume that two of the outer branches are short and of similar length and that the other two are comparatively long. We set â 5 (â1 1 â2)/2 and b̂ 5 (b̂1 1 b̂2)/2. If these expected branch lengths â and b̂ are in the area of the parameter space, where GM should propose a tree, then with a large probability the interior branch length is indeed equal to zero. If, on the other hand, the values â, b̂, and ĉ are not in the area in which GM would propose a tree, then no definite conclusion can be reached. Let us look at the example in figure 8. Here, most of the simplex points of LM are mapped into the tree area T1, whereas the simplex points of GM are all placed into the star tree area. From an additional tree reconstruction now applied to all positions using the ML method as implemented in PUZZLE (Strimmer and von Haeseler 1996), we computed the five mean branch lengths: â1 5 0.31, â2 5 0.21, b̂1 5 0.40, and b̂2 5 0.21 for the outer branches and ĉ 5 0.06 for the inner branch length. Then, â 5 0.26, b̂ 5 0.31, and a 1 b 5 0.57. The sum of â and b̂ is below the threshold of equation (15). Therefore, there exists a positive minimal inner branch length cmin ø 0.36 given by equation (14) for which GM reconstructs a tree. The value ĉ of the inner edge separating the four groups, however, is smaller than cmin, the minimal value needed for GM to reconstruct a tree. Thus, a definite conclusion about the phylogeny of these four groups cannot be reached. In this paper, we investigated the statistical properties of LM and GM only for model trees with four taxa. It has been shown that for trees with more than four species, LM is also too liberal to accept a bifurcating tree even if the true topology is a star tree (Nie- 1216 Nieselt-Struwe and von Haeseler selt-Struwe 1998). Nieselt-Struwe (1998) also showed that this tendency cannot be improved by increasing the sequence length. All models of nucleotide substitution used in the simulations in this paper are Markov models and assume that evolution is independent and identical at each site and along each lineage. This assumption, of course, may be quite wrong. The work of Schöniger and von Haeseler (1995) and others on site-dependent models has shown that, indeed, ML has an enlarged Felsenstein zone if sequence sites do not evolve independently (Muse 1995; Rzhetsky 1995; Schöniger and von Haeseler 1995; Tillier and Collins 1995). Investigations of models violating these assumptions should provide another insight into the stability of the ML method and statistical geometry in sequence space. We conclude that quartet-mapping is a powerful tool with which to test and compare the success of tree evaluation methods based on quartets of taxa. It is especially a very efficient method for a quick evaluation of zerolength branches in trees. However, we are aware that this procedure cannot replace bootstrapping or other evaluation methods. The statistical properties that have been studied in this work can be supplemented by the addition of a bootstrapping procedure to assess the spreading of the point cloud. Work on this question is in progress. Acknowledgments This work was supported by the Deutsche Forschungsgemeinschaft and the Max-Planck-Society, which are gratefully acknowledged. We thank Korbinian Strimmer for supplying the code of PUZZLE, version 3.0. We are also grateful to two anonymous reviewers for their helpful comments on the manuscript. The extension of likelihood-mapping to geometry-mapping is incorporated in STATGEOM, version 2.0, which can be obtained from http://www.gwdg.de/;kniesel. LITERATURE CITED CHURCHILL, G., A. VON HAESELER, and W. NAVIDI. 1992. Sample size for a phylogenetic inference. Mol. Biol. Evol. 9: 753–769. EIGEN, M., R. WINKLER-OSWATITSCH, and A. DRESS. 1988. Statistical geometry in sequence space: a method of comparative sequence analysis. Proc. Natl. Acad. Sci. USA 85: 5913–5917. FARRIS, J., A. KLUGE, and M. ECKARDT. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19:172– 191. FELSENSTEIN, J. 1985. Confidence-limits on phylogenies—an approach using the bootstrap. Evolution 39:783–791. FITCH, W. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20: 406–416. GAUT, B., and P. LEWIS. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152–162. GOLDMAN, N. 1993. Statistical tests for DNA substitution. J. Mol. Evol. 34:183–198. HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174. HUELSENBECK, J. 1995. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 12:843–849. JUKES, T., and C. CANTOR. 1969. Evolution of protein molecules. Pp. 21–132 in H. MUNRO, ed. Mammalian protein metabolism. Academic Press, New York. KIMURA, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120. KUIKEN, C. L., K. NIESELT-STRUWE, G. F. WEILLER, and J. GOUDSMIT. 1994. Quasispecies behavior of human immunodificiency virus type 1: sample analysis of sequence data. Pp. 100–119 in K. W. ADOLPH, ed. Molecular Virology Techniques Part A. Academic Press, New York. LI, W.-H. 1989. A statistical test of phylogenies estimated from sequence data. Mol. Biol. Evol. 6:424–435. ———. 1997. Molecular Evolution. Sinauer Ass., Inc. Publishers, Sunderland, Mass., USA. MUSE, S. 1995. Evolutionary analysis of DNA sequences subject to constraint on secondary structure. Genetics 139: 1429–1439. NEI, M., J. STEPHENS, and N. SAITOU. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2:66–85. NIESELT-STRUWE, K. 1998. From likelihood-mapping to quartet-mapping, a new sequence analysis tool. Pp. 13–22 in M. K. UYENOYAMA and A. VON HAESELER, eds. Proceedings of the Trinational Workshop on Molecular Evolution. Duke University Publication Group, Duke University, Durham, N.C. OTA, R., P. WADDELL, M. HASEGAWA, H. SHIMODAIRA, and H. KISHINO. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798–803. RZHETSKY, A. 1995. Estimating substitution rates in ribosomal RNA genes. Genetics 141:771–783. RZHETSKY, A., and M. NEI. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol. 9:945–967. SAITOU, N., and M. NEI. 1986. The number of nucleotides required to determine the branching order of three species, with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24:189–204. SCHÖNIGER, M., and A. VON HAESELER. 1995. Performance of the maximum likelihood, neighbor joining and maximum parsimony methods when sequence sites are not independent. Syst. Biol. 44:533–547. SITNIKOVA, T. 1996. Bootstrap method of interior-branch test for phylogenetic trees. Mol. Biol. Evol. 13:605–611. STRIMMER, K., N. GOLDMAN, and A. VON HAESELER. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210–211. STRIMMER, K., and A. VON HAESELER. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969. ———. 1997. Likelihood-mapping: a simple method to visualize phylogenetic content of sequence alignment. Proc. Natl. Acad. Sci. USA 94:6815–6819. SWOFFORD, D., and G. OLSEN. 1990. Phylogeny reconstruction. Pp. 411–501 in D. HILLIS and G. MORITZ, eds. Molecular systematics. Sinauer, Sunderland, Mass. TILLIER, E., and R. COLLINS. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:7–15. Quartet-Mapping WHELAN, S., and N. GOLDMAN. 1999. Distribution of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol. Biol. Evol. 16:1292–1299. YANG, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105–111. ———. 1996. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294–307. YANG, Z., and B. RANNALA. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14:717–724. ZHARKIKH, A., and W.-H. LI. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences: II. Four taxa without a molecular clock. J. Mol. Evol. 35:356–366. 1217 if the diagonal elements of the weight matrix M are equal to zero and the off-diagonal elements are equal to one. It suffices to show that then, 1 s GM 5 d6 1 (d9 1 d14 ) 5 s M 1 1 5 g(AB z CD). 2 One has O g (AB z CD) 1 5 1O max 2 [M(A , B ) 1 M(C , D )]2 . 2 g(AB z CD) 5 l l l l l l l l APPENDIX A The following table lists the values of gl(AB z CD) for each of the 15 possible position categories. Let ml 5 M(Al, Bl) 1 M(Cl, Dl); then, Here, we prove that sGM 5 sM ml maxl g l (AB z CD) C1 0 0 0 C2 1 1 0 C3 1 1 0 C4 1 1 0 C5 1 1 0 C6 0 2 1 C7 2 2 0 Thus, gl(AB z CD) is not equal to zero if and only if l ∈ C6, C9, or C14. Therefore, O g (AB z CD) g(AB z CD) 5 l C8 2 2 0 C9 1 2 1/2 C 10 2 2 0 C 11 2 2 0 C 12 2 2 0 C 13 2 2 0 C 14 1 2 1/2 C 15 2 2 0 cleotide x at time t is identical with that at time 0, and Pxy is the probability that nucleotide x at time 0 changes to nucleotide y at time t. In the Jukes-Cantor model, then, l 1 1 5 z(k ∈ C 6 )z 1 z(k ∈ C 9 )z 1 z(k ∈ C 14 )z 2 2 Pxx (t) 5 1 1 5 d6 1 d9 1 d14 5 s 1GM . 2 2 Pxy By symmetry, then, sGM 5 g(AC z BD) and sGM 5 2 3 g(AD z BC). APPENDIX B For tree model II in figure 3, the probability that sequence A, B, C, D, respectively, has nucleotide w, x, y, z, respectively, at a specific site is given by: Pr(w, x, y, z) O g · 5O P (c/2)·P (a)·P (b)6 · 5 O P (c/2)·P (a)·P (b)6 , 4 5 n u51 nu uy nv vw vx Pxx (t) 5 1 1 1 1 exp(24bt) 1 exp(22(a 1 b)t). 4 4 2 uz Pxy (t) 5 4 v51 In the Kimura two-parameter model, transitional changes are assumed to occur at rate a and transversional changes are assumed to occur at rate b. Then, Pxx is given by For Pxy, one has to distinguish the two cases of transitions and transversions. In the first case, 4 n51 1 2 and 1 1 4 (t) 5 2 exp 12 lt2 . 4 4 3 1 3 4 1 exp 2 lt 4 4 3 (9) where gn is the probability of nucleotide n occurring at the specified site. In the Jukes-Cantor model, as well as in the Kimura two-parameter model, the nucleotide frequencies are assumed to be at equilibrium, and therefore gn 5 0.25 for all n. Pxx denotes the probability that nu- 1 1 1 1 exp(24bt) 2 exp(22(a 1 b)t), 4 4 2 and in the latter case, Pxy (t) 5 1 1 2 exp(24bt) 4 4 (Li 1997). When the expected rate of nucleotide substi- 1218 Nieselt-Struwe and von Haeseler tutions per site (lt) and k 5 a/2b are given, a and b can easily be determined. The probabilities pj 5 Pr(Cj) of the 15 different configurations as listed in expression (3) are obtained by summing the appropriate values of Pr(w, x, y, z) according to equation (9). Assuming the Jukes-Cantor model and using definitions (4)–(6) of the supports and equation (2), it is then straightforward to compute the expected coordinates of the simplex point: sGM 5 1 4 1 3z c 2 4z a 2 3z 2a 2 4z b 2 3z 2b 2 6z a1b 1 3z 2a12b1c 1 10z a1b1c 12 1 9z c 2 12z a 2 z 2a 2 12z b 2 z 2b 2 2z a1b 1 9z 2a12b1c 2 2z a1b1c (10) sGM 5 2 4 1 3z c 2 4z a 1 5z 2a 2 4z b 1 5z 2b 2 6z a1b 1 3z 2a12b1c 2 6z a1b1c 12 1 9z c 2 12z a 2 z 2a 2 12z b 2 z 2b 2 2z a1b 1 9z 2a12b1c 2 2z a1b1c (11) sGM 5 3 4 1 3z c 2 4z a 2 3z 2a 2 4z b 2 3z 2b 1 10z a1b 1 3z 2a12b1c 2 6z a1b1c , 12 1 9z c 2 12z a 2 z 2a 2 12z b 2 z 2b 2 2z a1b 1 9z 2a12b1c 2 2z a1b1c (12) with z 5 exp (4/3). two cases. Eight possibilities are of the type (RRY1Y2) and/or (YYR1R2). For these positions, we compute APPENDIX C Now, assume different weights for transitions (s) and transversions (v). Consider the matrix M5 A G C T A 0 s v v G C T s 0 v v v v 0 s v v s 0 1 gl (AB z CD) 5 (2v 2 s). 2 , Sixteen possibilities are of the type (R1R1R2Y) and/or (Y1Y1Y2R). For these, we get 1 s g l (AB z CD) 5 (s 1 v 2 v) 5 . 2 2 with s # v. A calculation similar to that in appendix A shows that gl(AB z CD) ± 0 if l ∈ C6, C9, C14, C15. a. l ∈ C6: A total of 12 combinatorial possibilities contribute to category C6. We have to distinguish two cases. Four of these 12 possibilities are of the type (R1R1R2R2) and/or (Y1Y1Y2Y2), where Ri or Yi denotes one of the two purines and/or pyrimidines. For these positions, we compute 1 gl (AB z CD) 5 (2s 2 0) 5 s. 2 Then, O g (ABCD) 5 31 2v 22 s 1 32 2s 5 2v 61 s . l ∈C 9 l c. l ∈ C14: Identical to case b. d. l ∈ C15: There are a total of 24 possibilities in category C15, of which 8 are of the type (R1R2Y1Y2) and contribute to gl (AB z CD). The other 16 possibilities have a value of gl(AB z CD) equal to zero. For the nonzero values, we get Eight of the possibilities in C6 are of the type (RRYY). For these, we get 1 g l (AB z CD) 5 (2v 2 2s) 5 v 2 s. 2 1 gl (AB z CD) 5 (2v 2 0) 5 v. 2 Then, Then, O g (AB z CD) 5 3s 1 2v3 5 s 13 2v . l ∈C 6 l b. l ∈ C9: A total of 24 combinatorial possibilities contribute to category C9. Again, we have to distinguish O g (AB z CD) 5 v 23 s . l ∈C 15 l Then, we sum up to obtain Quartet-Mapping O g (AB z CD)p 5 O g (AB z CD)p 1 O g (AB z CD)p 1 O g (AB z CD)p 1 O g (AB z CD)p g(AB z CD) 5 l ∈C j l ∈C 6 l l l 6 l ∈C 9 l ∈C 14 l ∈C 15 5 1 l APPENDIX D 9 l 14 l 15 2 1 2 2v 1 s 2v 1 s p6 1 p9 3 6 1 1 2 1 2 2v 1 s v2s p14 1 p15 . 6 3 Note that if s 5 v 5 1, the equation is identical to that in appendix A. Suppose k is equal to the ratio of transition rate and transversion rate parameters. If we set k 5 2s/v and rescale the weight matrix accordingly, g(AB z CD) 5 1 2 1 2 4k 1 1 4k 1 1 p6 1 (p9 1 p14 ) 6k 12k 1 1 2 2k 2 1 p15 5 s 1M . 6k In this appendix, we compute a relationship between the outer branch lengths a and b and the inner branch length c such that the simplex point sGM is mapped into a tree area in the simplex. We will exemplify it for the tree area T1 5 T(ABzCD). sGM 5 (s1, s2, s3) is mapped into the area of tree T1 if and only if the distance between sGM and the attractor (1, 0, 0) of T1 is shorter than the distances between sGM and all six of the other attractors. Thus, sGM is mapped into the trapezoid of T1 if and only if GM sGM . max{sGM 1 2 , s3 }. For the binary tree with branch lengths a, b, and c, the inequality is fulfilled if and only if sGM . s2GM 1 0.5. 1 (13) Taking the values of the coordinates from equations (10) and (11) in appendix B, we find that for the Jukes-Cantor model for given a and b, this inequality is fulfilled iff 1 2 3 12 2 12z a 1 15z 2a 2 12z b 1 15z 2b 2 2z a1b c . ln . 4 34z a1b 2 9z 2a12b 2 9 And similarly, g(AC z BD) 5 1219 (14) 1 2 1 2 4k 1 1 4k 1 1 p7 1 (p10 1 p13 ) 6k 12k 1 1 2 2k 2 1 p15 5 s 2M, 6k 1 2 1 and 2 4k 1 1 4k 1 1 g(AD z BC) 5 p8 1 (p11 1 p12 ) 6k 12k 1 1 2 2k 2 1 p15 5 s 3M . 6k Since a and b are positive, the numerator of the argument of the log is always positive. However, the denominator is less than or equal to zero if 17 1 4Ï13 3 a 1 b $ ln ø 0.9377. 4 9 1 NARUYA SAITOU, reviewing editor Accepted March 9, 2001 2 (15)
© Copyright 2026 Paperzz