NJML: A Hybrid Algorithm for the Neighbor-Joining and Maximum-Likelihood Methods

Satoshi Ota and Wen-Hsiung Li
Department of Ecology and Evolution, University of Chicago

In the reconstruction of a large phylogenetic tree, the most difficult part is usually the problem of how to explore the topology space to find the optimal topology. We have developed a “divide-and-conquer” heuristic algorithm in which an initial neighbor-joining (NJ) tree is divided into subtrees at internal branches having bootstrap values higher than a threshold. The topology search is then conducted by using the maximum-likelihood method to reevaluate all branches with a bootstrap value lower than the threshold while keeping the other branches intact. Extensive simulation showed that our simple method, the neighbor-joining maximum-likelihood (NJML) method, is highly efficient in improving NJ trees. Furthermore, the performance of the NJML method is nearly equal to or better than existing time-consuming heuristic maximum-likelihood methods. Our method is suitable for reconstructing relatively large molecular phylogenetic trees (number of taxa ≥ 16).

Introduction

The neighbor-joining (NJ) method (Saitou and Nei 1987) is simple and widely used, especially for large molecular phylogenetic trees. On the other hand, the maximum-likelihood (ML) method (Cavalli-Sforza and Edwards 1967; Felsenstein 1981) tends to outperform the NJ method if an appropriate model of nucleotide substitution is used (Fukami-Kobayashi and Tateno 1991; Hasegawa, Kishino, and Saitou 1991; Hasegawa and Fujiwara 1993; Kuhner and Felsenstein 1994; Tateno, Takezaki, and Nei 1994; Huelsenbeck 1995). Unfortunately, the ML method requires a large amount of computational time when many taxa are involved. Therefore, it is desirable to drastically reduce the topology search space by introducing heuristics.
Key words: phylogenetic reconstruction, topology search, subtrees, greedy algorithm.
Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, Illinois 60637. E-mail: [email protected].
Mol. Biol. Evol. 17(9):1401–1409. 2000
© 2000 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

The basic idea of this paper is to use a “divide-and-conquer” strategy, briefly described as follows: An initial tree is constructed by the NJ method. Bootstrap values (Felsenstein 1985) are computed on all internal branches (nodes). The initial tree is then divided into subtrees at internal branches that have a bootstrap value higher than a threshold. Each subtree is referred to as a composite operating taxonomic unit (OTU) and is kept intact to reduce the search space. In other words, the topology search by ML reconsiders only internal branches with bootstrap values lower than the threshold. Therefore, the depth of the search depends on the bootstrap values on the internal branches (nodes) of the NJ tree.

Figure 1 shows the basic principle of the new algorithm. Since internal branches A and E in figure 1a have low bootstrap values, they are removed and the remaining internal nodes are merged. Figure 1b shows a multifurcating tree thus constructed. This is an intermediate tree with which to reconstruct the final bifurcating tree. Reconstruction of a bifurcating tree is performed by inserting new internal branches at the multifurcating nodes using the ML principle. In figure 1b, the tree is divided into four subtrees by three internal branches: B, C, and D. Since subtrees (2, 3) and (7, 6) are already bifurcating trees, a topology search will not be performed here. On the other hand, subtrees (1, 8, (2, 3)) and (4, 5, (6, 7)) are trifurcating, and we need to resolve them to find the optimal tree.
Since each of these two subtrees consists of three OTUs and thus can have three possible alternative topologies, we need to consider a total of 3 × 3 = 9 topologies. This idea, however, is still not practical for very large trees. If a tree has m (≥ 2) adjacent internal branches with low bootstrap values, we need to perform the exhaustive search for m + 2 OTUs. When m is large, the computation will be intractable. To reduce the computational load, we present below a greedy (hill-climbing) search algorithm (e.g., Winston 1993).

FIG. 1. The basic principle of the NJML algorithm. a, An initial neighbor-joining tree. Circles represent nodes. Solid and dashed lines represent internal branches having high and low bootstrap values, respectively (A: low; B: high; C: high; D: high; E: low). b, An intermediate multifurcating tree derived from a.

Materials and Methods

Algorithm of the Neighbor-Joining Maximum-Likelihood Method

In the neighbor-joining maximum-likelihood (NJML) method, we do not simultaneously remove all internal branches having bootstrap values lower than the threshold. Instead, only n such internal branches are removed in each step. The NJML algorithm is very simple:

• Step 1
  Build an NJ tree and perform a simple bootstrap analysis (for example, with 100 replications).
• Step 2
  1. If all bootstrap values are greater than or equal to the critical value C (say, 90% or 95%), take the current tree as the final tree.
  2. Otherwise, make a multifurcating tree from the NJ tree by removing n internal branches having the smallest n bootstrap values (say, n = 3).
• Step 3
  1. Compute the ML value for each of the at most $\prod_{i=1}^{n}(2i+1)$ possible rearranged trees around the multifurcating node derived in the preceding step. For details, see Discussion.
  2. Choose a tree that has the largest likelihood value.
  3. Set an imaginary bootstrap value C (the threshold) to each of the rearranged internal branches. This operation ensures that the program terminates in step 2.
  4. Go to step 2.

Figure 2 shows how the tree reconstruction is performed in this algorithm for n = 3. The bootstrap values on branches A, B, C, F, J, and L are lower than the threshold C (see fig. 2a). However, only the three smallest bootstrap values are chosen and the corresponding internal branches (A, C, and J) are removed in this step (fig. 2b). The multifurcating nodes will be resolved, assuming that the remaining parts reflect the true tree topology. In further steps, internal branches B, F, and L will be removed to perform the topology search.

Since this is a greedy search algorithm, we may be led to a wrong result. Unlike with other stepwise algorithms, however, our working trees are always bifurcating trees and keep the same number of leaves (OTUs) during reconstruction (see fig. 2a). In other words, the number of parameters at each step is always the same in the ML estimation. This means we can compare a working tree with one in the previous stage at step 3 in terms of the ML values (the set of rearranged trees always includes the previous working tree). Therefore, we never choose a tree worse than a previous intermediate tree in terms of ML values.

FIG. 2. Schematic representation of the greedy algorithm. a, An initial neighbor-joining tree. The bootstrap values of internal branches are as follows: A < C < J < B < F < L < 90% ≤ E < G < H < I < M. Suppose that the threshold used is 90%. b, An intermediate multifurcating tree derived from a. In this step, three internal branches A, C, and J were removed because they had the smallest n (= 3) bootstrap values. Solid and broken lines represent internal branches having higher and lower bootstrap values than 90%, respectively. See the legend to figure 1 for more details.

Implementation

In PHYLIP, version 3.5c (Felsenstein 1993), the programs dnadist, seqboot, neighbor, and consensus were used with slight modifications to construct an initial NJ tree (fig. 3).
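The control flow of steps 2 and 3 of the algorithm can be sketched as follows. This is only an illustrative sketch, not the NJML source code: the function name `njml_removal_schedule`, the branch labels, and the numeric bootstrap values are hypothetical, and the ML resolution of each multifurcating node (steps 3.1 and 3.2) is deliberately left out. As a simplification, the rearranged branches keep the labels of the branches they replace. The sketch shows only which branches are removed in each round and why the loop terminates.

```python
from typing import Dict, List


def njml_removal_schedule(bootstrap: Dict[str, float],
                          threshold: float = 90.0,
                          n: int = 3) -> List[List[str]]:
    """Return, in order, the batches of internal branches removed by the greedy loop.

    `bootstrap` maps a (hypothetical) branch label to its bootstrap value in percent.
    The ML evaluation of the rearranged trees is not modeled in this sketch.
    """
    support = dict(bootstrap)  # work on a copy
    batches: List[List[str]] = []
    while True:
        # Step 2.1: stop when no branch is below the threshold C.
        low = [b for b, v in support.items() if v < threshold]
        if not low:
            return batches
        # Step 2.2: remove the n branches with the smallest bootstrap values.
        batch = sorted(low, key=lambda b: support[b])[:n]
        batches.append(batch)
        # Step 3.3: the rearranged branches receive the imaginary value C,
        # so they are never selected again and the loop terminates.
        for b in batch:
            support[b] = threshold


# Hypothetical values that respect the ordering given for figure 2.
fig2 = {"A": 10, "C": 20, "J": 30, "B": 40, "F": 50, "L": 60,
        "E": 90, "G": 92, "H": 94, "I": 96, "M": 98}
schedule = njml_removal_schedule(fig2, threshold=90, n=3)
print(schedule)
```

With these assumed values the first round removes A, C, and J and the second removes B, F, and L, which mirrors the description of figure 2.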
FIG. 3. Implementation of the NJML method. NJML contains four newly developed modules: conbstree constructs bootstrap trees from an initial bootstrap neighbor-joining tree and removes n internal branches whose bootstrap values are less than a threshold; allptopon generates all possible bifurcating trees from a given multifurcating tree and computes the maximum-likelihood value for each tree; selectpmltree selects the maximum-likelihood tree from the candidates; setbs sets the bootstrap value on each branch of a working tree.

We also used part of the source code from MOLPHY, version 2.2 (Adachi and Hasegawa 1996), to develop NJML. In the source code, the eigenvalues and eigenvectors of a transition probability matrix are computed following Kishino and Hasegawa (1990), so occasionally a data set may not yield all the proper eigenvalues when empirical base frequencies are used. In this case, NJML will return the initial NJ tree as the final result (see fig. 3). We cannot exclude the possibility that a pairwise distance from bootstrap resampling is infinite in dnadist of PHYLIP (fig. 3). In this case, NJML will return no result. As shown in figure 3, NJML contains several modules written in C. GNU CC version 2.8.1 was used as a compiler.

Simulation Design

Two types of simulation were carried out to check the efficiency of NJML: (1) fixed model trees were used, and randomly generated ancestral sequences were evolved along the model trees under certain evolutionary models; and (2) randomly generated trees were used, and randomly generated ancestral sequences were evolved in the same way as in (1). No site-specific heterogeneity (Yang 1993) was assumed in either type of simulation. The maximum number (n) of removed internal branches at each step was set at 3 for all runs.

FIG. 4. Five model trees used for the computer simulation. A constant rate of nucleotide substitution was assumed for trees A, B, and E. In trees A and B, U (≡ 8a) = 0.05 and 0.50 were used in the computer simulation. In tree E, the value of a was determined from the pairwise distance between the two most distantly related sequences (dmax). The rate of nucleotide substitution varies among branches for trees C and D with a = 0.01 or 0.05.

FIG. 5. Model trees T1 and T2 with expected substitution rates a and b. In T1, a molecular clock is assumed, whereas T2 describes a situation of extreme rate heterogeneity in different branches of the tree. Modified from Strimmer and von Haeseler (1996b).

Fixed Model Tree Simulation

Seq-Gen (Rambaut and Grassly 1997) was used to generate ancestral sequences. In Seq-Gen, Jukes-Cantor (JC) (Jukes and Cantor 1969) and Kimura’s two-parameter (K2P) (Kimura 1980) models were implemented as special cases of Hasegawa, Kishino, and Yano’s (1985) (HKY) model. We set the nucleotide frequencies to be equal in both models, and we set the transition-to-transversion ratio (TS/TV) to 0.5 and 4.0 for the JC and K2P models, respectively. (We set the ratio to 4.0 for the K2P model rather than the commonly used value of 2.0 because we wanted to consider a more extreme case.) Base frequencies were all set to 0.25. For each case, 500 replications were generated.

The model trees used are shown in figures 4 and 5. In figure 4, model trees A, B, C, and D are modified from Saitou and Imanishi (1989). Trees A and B assume a constant rate of nucleotide substitution. The expected number of nucleotide substitutions per site is denoted by U, with U = 0.05 or 0.5. The length of each branch is expressed as multiples of a (= U/8) (see fig. 4). Model trees C and D have a large variation among branches in rate of nucleotide substitution. For both trees, a = 0.01 and a = 0.05 were used. The threshold of bootstrap values was set to 90% or 95%. Model tree E is modified from Nei, Kumar, and Takahashi (1998).
The expected number of nucleotide substitutions per site between the two most divergent sequences is denoted by dmax, and the a value in the model tree is then determined in proportion to this dmax value. For this tree, dmax values of 0.25, 1.0, and 1.5 are used.

In figure 5, model trees T1 and T2 were modified from Strimmer and von Haeseler (1996b). For each of them, a variety of substitution rates a and b were assumed. As shown in figure 5, T1 and T2 assume a constant rate and two varying rates (among lineages), respectively. Model trees T1 and T2 were used to study the efficiencies of BIONJ (Gascuel 1997), Weighbor 1.0.1 (Bruno, Socci, and Halpern 2000), and fastDNAml 1.0 (Olsen et al. 1994) under the same conditions as the NJML method. BIONJ and Weighbor are recent enhancements to the NJ method. The programs were run with their default options. For fastDNAml 1.0, however, TS/TV and all base frequency parameters were set to 0.5 or 4.0, and 0.25, respectively. To compare inferred trees with model trees, a program modified from PAML’s evolver (Yang 1999) was used.

Table 1
Proportions of Replicates in Which the Correct Topology Was Reconstructed Under a Constant-Rate Model Tree
Table 2
Proportions of Replicates in Which the Correct Topology Was Reconstructed Under a Varying-Rate Model Tree
MODEL TREE JC NJ NJML(90) NJML(95) K2P NJ NJML(90) NJML(95)
A C U = 0.05 300 bp 600 bp 1,000 bp U = 0.50 300 bp 600 bp 1,000 bp
0.524 0.822 0.938 0.598 0.858 0.960 0.558 0.854 0.966 0.516 0.734 0.918 0.574 0.826 0.938 0.606 0.824 0.942 0.554 0.814 0.926 0.556 0.792 0.920 0.562 0.814 0.924 0.494 0.748 0.892 0.620 0.846 0.956 0.606 0.870 0.956
B U = 0.05 300 bp 600 bp 1,000 bp U = 0.50 300 bp 600 bp 1,000 bp
0.754 0.928 0.998 0.824 0.972 1.000 0.804 0.964 0.998 0.608 0.868 0.976 0.742 0.948 0.992 0.706 0.968 0.994 0.738 0.904 0.986 0.906 0.984 1.000 0.902 0.992 0.998 0.652 0.854 0.962 0.856 0.976 1.000 0.812 0.966 1.000 0.736 0.946 0.988 0.806 0.968 1.000 0.800 0.978 1.000 0.594 0.842 0.976 0.762 0.948 0.998 0.750 0.962 0.996 0.736 0.946 0.988 0.958 0.996 1.000 0.954 1.000 1.000 0.630 0.882 0.974 0.906 0.996 1.000 0.930 0.996 1.000
D U = 0.05 300 bp 600 bp 1,000 bp U = 0.50 300 bp 600 bp 1,000 bp
0.650 0.814 0.946 0.680 0.888 0.976 0.688 0.854 0.970 0.548 0.778 0.888 0.622 0.822 0.936 0.598 0.844 0.934 0.520 0.752 0.940 0.564 0.768 0.958 0.606 0.832 0.950 0.570 0.756 0.870 0.666 0.850 0.936 0.650 0.850 0.966
NOTE: JC = Jukes and Cantor’s model; K2P = Kimura’s two-parameter model; NJ = the NJ method; NJML(90) = the NJML method with a 90% threshold; NJML(95) = the NJML method with a 95% threshold. In each case, 500 replications were conducted.

Randomly Generated Model Tree Simulation

For each case, 100 trees were generated randomly, and each possible tree was equally probable. To generate them, evolver was used with slight modification. Randomly generated ancestral sequences evolved along the randomly generated trees under the JC or the K2P model. The number of OTUs was 20, and the sequence length was 1,000 bp. The expected branch lengths of randomly generated trees were varied for three cases: 0.01, 0.05, and 0.1. The other part of this simulation was the same as the fixed model tree simulation.

Computer Run Times

Computer run times were measured in NJML, PUZZLE 4.0.2 (an implementation of the QP method), fastDNAml 1.0, and DNAML 3.5 by using the clock() function of GNU CC version 2.8.1 on a Pentium III machine (450 MHz). Sequences were randomly generated by Seq-Gen under the K2P model.
A given tree was randomly generated by setting the mean branch length to 0.05. These operations were iterated 100 times for each case, and each computer run time was measured. The averages of the computer run times were used to compare performances. In the case of DNAML, however, only one run was carried out, because the computational run times were too large.

Results

Table 1 shows the results of simulation using “constant-rate” trees A and B (fig. 4). All NJ trees given in the tables were consensus trees by bootstrap resampling, and they were used as the initial trees for the NJML method. Table 1 indicates that all initial NJ trees were improved except in two cases under the JC model. In the two exceptions, the sequences were very divergent (U = 0.50) and the simple Jukes-Cantor model was used.

U = 0.05 300 bp 600 bp 1,000 bp U = 0.50 300 bp 600 bp 1,000 bp
NOTE: Abbreviations are as in table 1.

Table 2 shows the results of simulation using “varying-rate” trees C and D. Every initial NJ tree was improved by the NJML method except in one tie case (model tree C; U = 0.05; l = 1,000 bp). Under the K2P model, the improvement in performance was remarkable.

Table 3 shows the efficiency of the NJML and NJ methods when “comblike” model trees (model tree E) were used. In every case, the initial NJ trees were improved by the NJML method. The average ratio of matched internal branches between inferred trees and model trees was also improved by the NJML method; the ratio is defined as $(I - d_T/2)/I$, where $I$ and $d_T$ are the number of internal branches and the topological distance (see Foulds, Penny, and Hendy 1979), respectively.

Table 3
Proportions of Replicates in Which the Correct Topology Was Reconstructed Under Constant-Rate Model Tree E

          JC                               K2P
dmax      NJ              NJML(90)         NJ              NJML(95)
0.25      0.336 (0.879)   0.466 (0.916)    0.226 (0.849)   0.404 (0.900)
1.00      0.250 (0.864)   0.342 (0.895)    0.208 (0.840)   0.394 (0.895)
1.50      –               –                0.164 (0.824)   0.292 (0.873)

NOTE: The values in parentheses are average ratios of matched internal branches between inferred trees and model trees. Dashes indicate that infinite distances were estimated in more than 100 replications. Other abbreviations are as in table 1. The sequence length was 300 bp.

Table 4
Ratios of Correctly Reconstructed Trees for Clocklike Evolution According to Tree T1
SEQUENCE EVOLUTION l (bp) a/b 500 0.01/0.07 0.02/0.19 0.03/0.42 1,000 0.01/0.07 0.02/0.19 0.03/0.42
K2P (TS/TV = 4) JC NJ BN WE NM(90) 0.69 0.52 0.12 0.96 0.86 0.35 0.73 0.52 0.14 0.96 0.87 0.35 0.72 0.47 0.13 0.92 0.83 0.29 0.80 0.62 0.14 0.97 0.90 0.37
NM(95) QPa 0.80 0.64 0.16 0.97 0.92 0.38 0.80 0.70 0.29 0.94 0.92 0.53
fM MLb NJ BN WE NM(90) NM(95) QPa fM MLb 0.83 0.61 0.12 0.98 0.92 0.33 0.87 0.63 0.09 0.96 0.85 0.34 0.59 0.34 0.13 0.89 0.74 0.28 0.49 0.39 0.11 0.90 0.79 0.35 0.56 0.37 0.10 0.89 0.69 0.25 0.68 0.58 0.19 0.93 0.85 0.46 0.87 0.63 0.09 0.96 0.85 0.34 0.70 0.54 0.17 0.94 0.85 0.40 0.71 0.53 0.18 0.94 0.88 0.41 0.70 0.63 0.33 0.89 0.85 0.57
NOTE: BN = BIONJ; WE = Weighbor; NM(90) = NJML with 90% threshold; NM(95) = NJML with 95% threshold; QP = the quartet puzzling algorithm; fM = the maximum-likelihood method implemented in fastDNAml 1.0; ML = the maximum-likelihood method implemented in PHYLIP DNAML, version 3.5c. TS/TV = transition to transversion ratio. The other abbreviations are as in table 1.
a Data from Strimmer and von Haeseler (1997).
b Data from Strimmer and von Haeseler (1996b).

The performance of NJML was also compared with that of BIONJ (BI), Weighbor 1.0.1 (WE), fastDNAml (fM), the quartet puzzling (QP) method (Strimmer and von Haeseler 1996), and PHYLIP DNAML, version 3.5c (DNAML) (Felsenstein 1993). Model trees T1 and T2 (fig. 5) were used. We used the data of Strimmer, Goldman, and von Haeseler (1997) for QP and that of Strimmer and von Haeseler (1996) for DNAML.
As shown in tables 4 and 5, although BIONJ and Weighbor often gave better results than did the NJ method, the NJML method outperformed both of them. The QP method gave better results than any other method in six cases, while the NJML method gave better or tied results in the other cases (table 4). In particular, note that when model tree T2 (a “rate-varying” tree) was used, the efficiency of the NJML method was considerably better than that of both the NJ method and the QP method (table 5). NJML performed almost as well as fastDNAml and DNAML in most cases and gave better results than fastDNAml and DNAML in some cases.

Table 6 shows the results of simulation using randomly generated trees. The NJML method improved the initial NJ trees without exception, not only with regard to the proportion of correct reconstruction, but also with regard to the average ratio of matched internal branches between inferred trees and model trees.

As shown in table 7, NJML is clearly faster than the PUZZLE, DNAML, and fastDNAml programs except for the cases with relatively small numbers of OTUs (≤ 14). Note that the code of PUZZLE and fastDNAml has been highly optimized, while the current version of NJML is still a set of experimental programs. The computer run time for NJML for each case shown in table 7 is actually the sum of the computer run times of the NJML subprograms (see fig. 3). Interestingly, the most time-consuming part was not the ML estimation, but the bootstrap resampling and the computation of distance matrices by seqboot and dnadist of PHYLIP (data not shown). This explains why NJML is slower than PUZZLE and fastDNAml when the numbers of taxa involved are less than or equal to 14 and 8, respectively.

Some simulation data caused dnadist to return infinite distances. When the number of such replicates was greater than 100 in a case, we did not evaluate the results (indicated by dashes in tables 3 and 5).
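The average ratio of matched internal branches reported in parentheses in tables 3 and 6 is $(I - d_T/2)/I$. The sketch below illustrates the arithmetic on a toy example. It assumes that the topological distance $d_T$ is the number of internal-branch bipartitions found in exactly one of the two trees (our reading of the definition, stated here as an assumption), and it represents each tree simply as a set of bipartitions. The helper names `canonical` and `matched_ratio` and the five-taxon trees are hypothetical.

```python
from typing import FrozenSet, Iterable, Set


def canonical(side: Iterable[str], taxa: Set[str]) -> FrozenSet[str]:
    """Represent a bipartition by the side that does not contain the first taxon."""
    side_set = frozenset(side)
    return side_set if min(taxa) not in side_set else frozenset(taxa) - side_set


def matched_ratio(inferred: Set[FrozenSet[str]], model: Set[FrozenSet[str]]) -> float:
    """Compute (I - d_T/2) / I for two bifurcating trees on the same taxa."""
    n_internal = len(model)        # I: number of internal branches
    d_t = len(inferred ^ model)    # assumed d_T: bipartitions found in only one tree
    return (n_internal - d_t / 2) / n_internal


taxa = {"1", "2", "3", "4", "5"}
# Model tree ((1,2),3,(4,5)) and inferred tree ((1,3),2,(4,5)), written as bipartitions.
model = {canonical(s, taxa) for s in [{"1", "2"}, {"4", "5"}]}
inferred = {canonical(s, taxa) for s in [{"1", "3"}, {"4", "5"}]}
ratio = matched_ratio(inferred, model)
print(ratio)
```

In this toy case one of the two internal branches is recovered, so the ratio is one half; identical trees give a ratio of 1.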
A Phylogenetic Tree of the Small-Subunit rRNA for Eukaryotes

We reconstructed a phylogenetic tree of the small-subunit ribosomal RNA for eukaryotes by using the NJ and NJML methods. The multiply aligned sequences are from the Ribosomal Database Project (Maidak et al. 1999). Figure 6a and b shows the initial NJ tree and the NJML tree, respectively. The K2P model was used with TS/TV = 4.0, and the threshold was set to 90%. OTUs of the trees were automatically annotated by programs in DeepForest (Oota 1998). Excluding gaps, 1,465 sites were used. The NJ tree (fig. 6a) has the following major problems:

1. The nematode (Caenorhabditis elegans) is clustered with the acellular slime mold, the cellular slime mold, the malaria parasite, and the dysentery amoeba.
2. The sea urchin and the amphioxus form a monophyletic group.
3. The amoeba (Acanthamoebae) and green plants form a monophyletic group.

In the NJML tree (fig. 6b), these problems are solved: the nematode, the sea urchin, the amphioxus, and the amoeba are located in reasonable places (e.g., see Maddison and Maddison 1998).

Table 5
Ratios of Correctly Reconstructed Trees for Nonclocklike Evolution According to Tree T2
SEQUENCE EVOLUTION l (bp) a/b 500 0.01/0.07 0.02/0.19 0.03/0.42 1,000 0.01/0.07 0.02/0.19 0.03/0.42
K2P (TS/TV = 4) JC NJ BN WE NM(90) 0.82 0.65 – 0.95 0.91 – 0.86 0.81 0.46 0.98 0.99 0.75 0.88 0.89 0.70 0.98 0.99 0.92 0.94 0.94 – 0.99 0.99 –
NM(95) QPa 0.93 0.95 – 1.00 1.00 0.97c 0.86 0.85 0.47 0.97 0.96 0.70
fM MLb NJ BN WE NM(90) NM(95) QPa fM MLb 0.91 0.94 0.81 1.00 1.00 0.95 0.91 0.93 0.72 0.99 0.99 0.92 0.75 0.56 0.24 0.92 0.84 0.46 0.80 0.72 0.40 0.96 0.93 0.67 0.83 0.77 0.53 0.97 0.95 0.83 0.90 0.88 0.74 0.99 1.00 1.00 0.94 0.92 0.73 0.98 0.99 0.96 0.91 0.89 0.70 0.98 0.98 0.92 0.89 0.90 0.72 0.99 0.99 0.90 0.81 0.77 0.52 0.95 0.92 0.74
NOTE: Dashes indicate that infinite distances were estimated in more than 100 replications. Other abbreviations are as in table 4.
a Data from Strimmer and von Haeseler (1997).
b Data from Strimmer and von Haeseler (1996).
c Infinite distances were estimated in three replications.

Table 6
Ratios of Correctly Reconstructed Trees for Randomly Generated Trees

                      JC                                 K2P
MEAN BRANCH LENGTH    NJ               NJML              NJ              NJML
0.01                  0.530 (0.963)    0.540 (0.965)     0.360 (0.946)   0.370 (0.947)
0.05                  0.600 (0.972)    0.660 (0.978)     0.510 (0.964)   0.630 (0.972)
0.1                   0.000a (0.020)   0.000a (0.020)    0.480 (0.958)   0.600 (0.971)

NOTE: For each case, 100 trees were randomly generated with the number of operating taxonomic units = 20. Abbreviations are as in table 1.
a Infinite distances were estimated in five replications (i.e., the estimates of efficiencies were based on 95 replications).

Discussion

Our results showed that NJML improved almost all initial NJ trees; the improvement was especially remarkable in the varying-rate model trees (tables 2 and 5). As shown in Saitou and Imanishi (1989), the NJ method outperformed the ML method when model trees A and B were used under the JC model. Our results are consistent with Saitou and Imanishi’s conclusions (see table 1). Under the K2P model, the NJML method was more efficient than the NJ method without exception.

The results of the QP method in tables 4 and 5 were obtained by the improved QP method using discrete weights (Strimmer, Goldman, and von Haeseler 1997). In fact, in comparison with any other method, its results were surprisingly good when model tree T1 was used with relatively long branch lengths (a/b = 0.02/0.19 and 0.03/0.42). However, we should note that model trees T1 and T2 were originally designed for testing the QP method (Strimmer and von Haeseler 1996) and that the NJML method always outperformed the QP method when model tree T2 was used.

As noted above, NJML outperformed fastDNAml and DNAML in some cases (see tables 4 and 5).
The computational load for DNAML is prohibitive when the number of taxa is large (Strimmer and von Haeseler 1996). Although fastDNAml is considerably faster than DNAML (table 7), the computational load is still not negligible for large trees. In comparison, the search space of NJML is mainly determined by n (the maximum number of removed internal branches at each step), not the total number of taxa, so it is more practical than fastDNAml for large trees.

Interestingly, NJML did not always give better results with the threshold 95% than with the threshold 90%. This is because in the NJML algorithm, the search space varies considerably depending on the distribution of bootstrap values. For n = 3, there are three possibilities with regard to the size of search space if three or more internal branches have lower bootstrap values than the threshold:

1. The three internal branches are not adjacent to each other: the size of the search space is 3 × 3 × 3 = 27.
2. Two of the three internal branches are adjacent to each other: the size of the search space is 15 × 3 = 45.
3. All three internal branches are adjacent to each other: the size of the search space is 105.

This heterogeneity of search space will become higher with greater n. In the worst case, we need to explore a search space whose size is approximately $\sum_{i=1}^{s/3}\prod_{j=1}^{n}(2j+1)$ when there are s internal branches having lower bootstrap values than the threshold. For n = 3, $\prod_{j=1}^{n}(2j+1) = 105$ and $\sum_{i=1}^{s/3}\prod_{j=1}^{n}(2j+1) \le (s/3) \times 105 = 35s$, which increases only linearly with s. On the other hand, in the best case, we need to explore a search space whose size is approximately $\sum_{i=1}^{s/3} 3^{n} = s\,3^{n-1}$. Although this heterogeneity of search space should be explored in the future, we think that n = 3 is suitable for personal computers.
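The search-space sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that a group of m adjacent removed branches can be resolved in $\prod_{j=1}^{m}(2j+1)$ ways and that groups that are not adjacent are resolved independently, so their counts multiply. The function names are hypothetical, and the use of a ceiling for s/n when s is not a multiple of n is our own assumption.

```python
from math import prod
from typing import Sequence


def resolutions(m: int) -> int:
    """Bifurcating resolutions of a node created by removing m adjacent branches."""
    return prod(2 * j + 1 for j in range(1, m + 1))


def search_space(groups: Sequence[int]) -> int:
    """Trees to evaluate in one step; `groups` lists the sizes of adjacent groups."""
    return prod(resolutions(m) for m in groups)


def worst_case(s: int, n: int = 3) -> int:
    """All n branches adjacent in every step: ceil(s/n) steps of resolutions(n) trees."""
    return -(-s // n) * resolutions(n)


def best_case(s: int, n: int = 3) -> int:
    """No adjacent branches in any step: ceil(s/n) steps of 3**n trees."""
    return -(-s // n) * 3 ** n


case_not_adjacent = search_space([1, 1, 1])   # 3 x 3 x 3
case_two_adjacent = search_space([2, 1])      # 15 x 3
case_all_adjacent = search_space([3])         # 105
print(case_not_adjacent, case_two_adjacent, case_all_adjacent)
print(worst_case(6), best_case(6))            # 35 * s and s * 3**(n - 1) for s = 6
```

For s = 6 and n = 3 the two bounds are 210 and 54 trees, which agree with 35s and $s\,3^{n-1}$.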
Since bootstrapping tends to underestimate the confidence level of a subtree (Zharkikh and Li 1992, 1995; Hillis and Bull 1993), a threshold of 90% or even less would be sufficiently high for a good performance of NJML.

Table 7
Average Computer Run Times (in seconds) Corresponding to the Number of Operating Taxonomic Units (OTUs)

               NO. OF OTUS
PROGRAM        8       10      12       14       16       18       20         22
NJML(90)       3.37    5.35    9.44     10.65    14.96    21.17    28.47      33.93
NJML(95)       3.38    5.38    9.66     10.90    15.39    23.02    31.98      38.84
PUZZLE         0.94    2.49    4.96     10.07    18.35    29.48    46.83      70.81
fastDNAml      1.63    5.75    12.20    27.21    43.39    78.55    115.57     180.39
DNAML          24.94   75.94   148.61   268.17   474.24   825.79   1,109.46   1,596.86

NOTE: Sequences were generated by using random trees. The mean branch length was 0.05. The K2P model was used to evolve the sequences. The number of replications was 100 except for DNAML, in which only one run was carried out.

FIG. 6. Phylogenetic trees of 43 small-subunit ribosomal RNA for eukaryotes. Trees a and b were reconstructed by using the NJ and NJML methods, respectively. Numbers by internal branches represent bootstrap values (%) based on 100 pseudoreplications. Thick internal branches without bootstrap values were evaluated by the NJML method in tree b. Note that some of the internal branches had length 0. The scale bars represent the number of substitutions per 100 sites.

Acknowledgments

We thank Masami Hasegawa for permission to use the source code of MOLPHY, version 2.2. Andrew Rodin kindly advised us on simulation studies. Richard Blocker helped us to maintain the computer system. This work was supported by NIH grants GM30998 and GM55759.

LITERATURE CITED

ADACHI, J., and M. HASEGAWA. 1996. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28:1–150.
BRUNO, W. J., N. D. SOCCI, and A. L. HALPERN. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189–197.
CAVALLI-SFORZA, L. L., and A. W. F. EDWARDS. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19:233.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
FELSENSTEIN, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
FELSENSTEIN, J. 1993. PHYLIP: phylogeny inference package. Version 3.5c. University of Washington, Seattle.
FOULDS, L. R., D. PENNY, and M. D. HENDY. 1979. A graph theoretic approach to the development of minimal phylogenetic trees. J. Mol. Evol. 13:151–166.
FUKAMI-KOBAYASHI, K., and Y. TATENO. 1991. Robustness of maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:79–91.
GASCUEL, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685–695.
HASEGAWA, M., and M. FUJIWARA. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:1–5.
HASEGAWA, M., H. KISHINO, and N. SAITOU. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443–445.
HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
HILLIS, D. M., and J. J. BULL. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42:182–192.
HUELSENBECK, J. P. 1995. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 12:843–849.
JUKES, T. H., and C. R. CANTOR. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. MUNRO, ed. Mammalian protein metabolism. Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
KISHINO, H., and M. HASEGAWA. 1990. Converting distance to time: application to human evolution. Methods Enzymol. 183:550–570.
KUHNER, M., and J. FELSENSTEIN. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459–468.
MADDISON, D., and W. P. MADDISON. 1998. The tree of life: a multi-authored, distributed Internet project containing information about phylogeny and biodiversity. College of Agriculture, University of Arizona, Tucson. Internet address: http://phylogeny.arizona.edu/tree/phylogeny.html.
MAIDAK, B. L., J. R. COLE, C. T. P. JR. et al. (14 co-authors). 1999. A new version of the RDP (Ribosomal Database Project). Nucleic Acids Res. 27:171–173.
NEI, M., S. KUMAR, and K. TAKAHASHI. 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA 95:12390–12397.
OLSEN, G., H. MATSUDA, R. HAGSTROM, and R. OVERBEEK. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:41–48.
OOTA, S. 1998. Development of an integrated system for molecular evolutionary study and its application. Ph.D. thesis, Department of Genetics, School of Life Science, Graduate University for Advanced Studies, Mishima, Japan.
RAMBAUT, A., and N. C. GRASSLY. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.
SAITOU, N., and T. IMANISHI. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 6:514–525.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
STRIMMER, K., and A. VON HAESELER. 1996a. PUZZLE. Version 2.5. Zoologisches Institut, Universitaet Muenchen, Munich, Germany.
STRIMMER, K., and A. VON HAESELER. 1996b. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.
STRIMMER, K., N. GOLDMAN, and A. VON HAESELER. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210–211.
TATENO, Y., N. TAKEZAKI, and M. NEI. 1994. Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11:261–277.
WINSTON, P. H. 1993. Artificial intelligence. Addison-Wesley, Mass.
YANG, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396–1401.
YANG, Z. 1999. PAML manual. Department of Biology, Galton Laboratory, University College London, London.
ZHARKIKH, A., and W.-H. LI. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol. Biol. Evol. 9:1119–1147.
ZHARKIKH, A., and W.-H. LI. 1995. Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4:44–63.

YUN-XIN FU, reviewing editor
Accepted June 6, 2000