NJML: A Hybrid Algorithm for the Neighbor-Joining and Maximum-Likelihood Methods
Satoshi Ota and Wen-Hsiung Li
Department of Ecology and Evolution, University of Chicago
In the reconstruction of a large phylogenetic tree, the most difficult part is usually the problem of how to explore
the topology space to find the optimal topology. We have developed a ‘‘divide-and-conquer’’ heuristic algorithm in
which an initial neighbor-joining (NJ) tree is divided into subtrees at internal branches having bootstrap values
higher than a threshold. The topology search is then conducted by using the maximum-likelihood method to reevaluate all branches with a bootstrap value lower than the threshold while keeping the other branches intact.
Extensive simulation showed that our simple method, the neighbor-joining maximum-likelihood (NJML) method,
is highly efficient in improving NJ trees. Furthermore, the performance of the NJML method is nearly equal to or
better than that of existing time-consuming heuristic maximum-likelihood methods. Our method is suitable for reconstructing relatively large molecular phylogenetic trees (number of taxa ≥ 16).
Introduction
The neighbor-joining (NJ) method (Saitou and Nei
1987) is simple and widely used, especially for large
molecular phylogenetic trees. On the other hand, the
maximum-likelihood (ML) method (Cavalli-Sforza and
Edwards 1967; Felsenstein 1981) tends to outperform
the NJ method if an appropriate model of nucleotide
substitution is used (Fukami-Kobayashi and Tateno
1991; Hasegawa, Kishino, and Saitou 1991; Hasegawa
and Fujiwara 1993; Kuhner and Felsenstein 1994; Tateno, Takezaki, and Nei 1994; Huelsenbeck 1995). Unfortunately, the ML method requires a large amount of
computational time when many taxa are involved.
Therefore, it is desirable to drastically reduce the topology search space by introducing heuristics.
The basic idea of this paper is to use a ‘‘divide-and-conquer’’ strategy, briefly described as follows: An
initial tree is constructed by the NJ method. Bootstrap
values (Felsenstein 1985) are computed on all internal
branches (nodes). The initial tree is then divided into
subtrees at internal branches that have a bootstrap value
higher than a threshold. Each subtree is referred to as a
composite operating taxonomic unit (OTU) and is kept
intact to reduce the search space. In other words, the
topology search by ML reconsiders only internal
branches with bootstrap values lower than the threshold.
Therefore, the depth of the search depends on the bootstrap values on the internal branches (nodes) of the NJ
tree.
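As an illustration only, the bootstrap step described above can be sketched in Python. This is not the PHYLIP seqboot code; the function names bootstrap_alignment and bootstrap_support, and the representation of a branch as a frozenset of OTU names, are assumptions made for this sketch.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the sites (columns) of an alignment with replacement.

    alignment maps an OTU name to its aligned sequence; all sequences
    are assumed to have the same length.
    """
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(splits_per_replicate, split):
    """Percentage of replicate trees that contain a given internal branch.

    Each replicate tree is represented by the set of its internal
    branches, and each branch by a frozenset of the OTUs on one side.
    """
    hits = sum(1 for splits in splits_per_replicate if split in splits)
    return 100.0 * hits / len(splits_per_replicate)
```

In such a sketch, a tree would be built from each resampled alignment, and an internal branch of the initial tree would be kept intact only if its support reaches the threshold.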
Figure 1 shows the basic principle of the new algorithm. Since internal branches A and E in figure 1a
have low bootstrap values, they are removed and the
remaining internal nodes are merged. Figure 1b shows
a multifurcating tree thus constructed. This is an intermediate tree with which to reconstruct the final bifurcating tree. Reconstruction of a bifurcating tree is performed by inserting new internal branches at the multifurcating nodes using the ML principle.
Key words: phylogenetic reconstruction, topology search, subtrees, greedy algorithm.
Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology and Evolution, University of Chicago, 1101 East
57th Street, Chicago, Illinois 60637. E-mail: [email protected].
Mol. Biol. Evol. 17(9):1401–1409. 2000
© 2000 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
In figure 1b, the tree is divided into four subtrees
by three internal branches: B, C, and D. Since subtrees
(2, 3) and (7, 6) are already bifurcating trees, a topology
search will not be performed here. On the other hand,
subtrees (1, 8, (2, 3)) and (4, 5, (6, 7)) are trifurcating,
and we need to resolve them to find the optimal tree.
Since each of these two subtrees consists of three OTUs
and thus can have three possible alternative topologies,
we need to consider a total of 3 × 3 = 9 topologies.
This idea, however, is still not practical for very
large trees. If a tree has m (≥ 2) adjacent internal branches with low bootstrap values, we need to perform the
exhaustive search for m + 2 OTUs. When m is large,
the computation will be intractable. To reduce the computational load, we present below a greedy (hill-climbing) search algorithm (e.g., Winston 1993).
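To make the growth concrete, the following Python sketch counts the bifurcating resolutions of a single multifurcating node obtained by removing m adjacent internal branches. It uses the product 3 × 5 × ... × (2m + 1), which agrees with the counts of 3, 15, and 105 given in the Discussion; the function name n_resolutions is hypothetical.

```python
from math import prod


def n_resolutions(m):
    """Number of bifurcating resolutions after removing m adjacent
    internal branches: 3 * 5 * ... * (2m + 1)."""
    return prod(2 * i + 1 for i in range(1, m + 1))


# Growth of the exhaustive search with the number of adjacent branches
for m in (1, 2, 3, 5, 10):
    print(m, n_resolutions(m))
```

For figure 1b, the two trifurcating subtrees each correspond to m = 1, which gives the 3 × 3 = 9 topologies mentioned above, whereas m = 10 already requires more than 10^10 trees.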
Materials and Methods
Algorithm of the Neighbor-Joining Maximum-Likelihood Method
In the neighbor-joining maximum-likelihood
(NJML) method, we do not simultaneously remove all
internal branches having bootstrap values lower than the
threshold. Instead, only n such internal branches are removed in each step. The NJML algorithm is very simple:
• Step 1
Build an NJ tree and perform a simple bootstrap
analysis (for example, with 100 replications).
• Step 2
1. If all bootstrap values are greater than or equal
to the critical value C (say, 90% or 95%), take
the current tree as the final tree.
2. Otherwise, make a multifurcating tree from the
NJ tree by removing n internal branches having
the smallest n bootstrap values (say, n = 3).
• Step 3
1. Compute the ML value for each of the at most
$\prod_{i=1}^{n} (2i + 1)$ possible rearranged trees around
the multifurcating node derived in the preceding
step. For details, see Discussion.
2. Choose a tree that has the largest likelihood
value.
3. Assign an imaginary bootstrap value C (the threshold) to each of the rearranged internal branches.
This operation ensures that the program eventually terminates in
step 2.
4. Go to step 2.
FIG. 1.—The basic principle of the NJML algorithm. a, An initial
neighbor-joining tree. Circles represent nodes. Solid and dashed lines
represent internal branches having high and low bootstrap values, respectively (A: low; B: high; C: high; D: high; E: low). b, An intermediate multifurcating tree derived from a.
Figure 2 shows how the tree reconstruction is performed in this algorithm for n = 3. The bootstrap values
on branches A, B, C, F, J, and L are lower than the
threshold C (see fig. 2a). However, only the three smallest bootstrap values are chosen and the corresponding
internal branches (A, C, and J) are removed in this step
(fig. 2b). The multifurcating nodes will be resolved, assuming that the remaining parts reflect the true tree topology. In further steps, internal branches B, F, and L
will be removed to perform the topology search.
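The control flow of steps 2 and 3 can be sketched in Python as follows. This is a simplified illustration and not the NJML source code: the ML evaluation of step 3 is omitted, and the rearranged branches are simply marked as resolved by assigning them the threshold value. The function name njml_schedule and the numerical bootstrap values in the usage example are hypothetical; only their ordering follows figure 2.

```python
def njml_schedule(bootstrap, threshold=90.0, n=3):
    """Return the batches of internal branches removed at each pass.

    bootstrap maps a branch label to its bootstrap value (%).
    """
    support = dict(bootstrap)
    batches = []
    while True:
        # Step 2: branches below the threshold, smallest values first
        low = sorted((b for b in support if support[b] < threshold),
                     key=lambda b: support[b])
        if not low:
            return batches  # all values >= threshold: the tree is final
        batch = low[:n]
        batches.append(batch)
        # Step 3 (ML rearrangement) is omitted in this sketch; the
        # rearranged branches receive the imaginary value C
        for b in batch:
            support[b] = threshold


# Hypothetical values ordered as in figure 2a
values = {"A": 30, "C": 40, "J": 50, "B": 60, "F": 70, "L": 80,
          "E": 90, "G": 92, "H": 95, "I": 97, "M": 99}
print(njml_schedule(values))
```

With these values the first pass removes A, C, and J and the second pass removes B, F, and L, after which no branch is below the threshold and the loop stops.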
Since this is a greedy search algorithm, we may be
led to a wrong result. Unlike with other stepwise algorithms, however, our working trees are always bifurcating trees and keep the same number of leaves (OTUs)
during reconstruction (see fig. 2a). In other words, the
number of parameters at each step is always the same
in the ML estimation. This means we can compare a
working tree with one in the previous stage at step 3 in
terms of the ML values (all rearranged trees contain the
previous working tree). Therefore, we never choose a
tree worse than a previous intermediate tree in terms of
ML values.
FIG. 2.—Schematic representation of the greedy algorithm. a, An
initial neighbor-joining tree. The bootstrap values of internal branches
are as follows: A < C < J < B < F < L < 90% ≤ E < G < H <
I < M. Suppose that the threshold used is 90%. b, An intermediate
multifurcating tree derived from a. In this step, three internal branches
A, C, and J were removed because they had the smallest n (= 3) bootstrap values. Solid and broken lines represent internal branches having
higher and lower bootstrap values than 90%, respectively. See the legend to figure 1 for more details.
Implementation
In PHYLIP, version 3.5c (Felsenstein 1993), the
programs dnadist, seqboot, neighbor, and consensus
were used with slight modifications to construct an initial NJ tree (fig. 3).
FIG. 3.—Implementation of the NJML method. NJML contains four newly developed modules: conbstree constructs bootstrap trees from
an initial bootstrap neighbor-joining tree and removes n internal branches whose bootstrap values are less than a threshold; allptopon generates
all possible bifurcating trees from a given multifurcating tree and computes the maximum-likelihood value for each tree; selectpmltree selects
the maximum-likelihood tree from the candidates; setbs sets the bootstrap value on each branch of a working tree.
We also used part of the source code from MOLPHY, version 2.2 (Adachi and Hasegawa 1996), to develop NJML. In the source code, the eigenvalues and
eigenvectors of a transition probability matrix are computed following Kishino and Hasegawa (1990), so occasionally a data set may not yield all the proper eigenvalues when empirical base frequencies are used. In this
case, NJML will return the initial NJ tree as the final
result (see fig. 3).
We cannot exclude the possibility that a pairwise
distance from bootstrap resampling is infinite in dnadist
of PHYLIP (fig. 3). In this case, NJML will return no
result.
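How an infinite pairwise distance can arise is easiest to see under the JC model, where the distance is d = −(3/4) ln(1 − 4p/3) for a proportion p of differing sites and is undefined once p reaches 3/4. The following Python sketch is an illustration under that model only; it is not the dnadist code, and the function name jc_distance is hypothetical.

```python
import math


def jc_distance(seq1, seq2):
    """Jukes-Cantor distance between two aligned sequences of equal length.

    Returns math.inf when the proportion of differing sites is 0.75 or
    more, because the logarithm is then undefined.
    """
    sites = list(zip(seq1, seq2))
    p = sum(x != y for x, y in sites) / len(sites)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Bootstrap resampling of short or highly divergent sequences can push p above this limit in some replicates even when the original alignment yields finite distances.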
As shown in figure 3, NJML contains several modules written in C. GNU CC version 2.8.1 was used as a
compiler.
Simulation Design
Two types of simulation were carried out to check
the efficiency of NJML: (1) fixed model trees were used,
and randomly generated ancestral sequences were
evolved along the model trees under certain evolutionary models; and (2) randomly generated trees were used,
and randomly generated ancestral sequences were
evolved in the same way as in (1). No site-specific heterogeneity (Yang 1993) was assumed in either type of
simulation. The maximum number (n) of removed internal branches at each step was set at 3 for all runs.
FIG. 4.—Five model trees used for the computer simulation. A
constant rate of nucleotide substitution was assumed for trees A, B,
and E. In trees A and B, U (≡ 8a) = 0.05 and 0.50 were used in the
computer simulation. In tree E, the value of a was determined from
the pairwise distance between the two most distantly related sequences
(dmax). The rate of nucleotide substitution varies among branches for
trees C and D with a = 0.01 or 0.05.
FIG. 5.—Model trees T1 and T2 with expected substitution rates a
and b. In T1, a molecular clock is assumed, whereas T2 describes a
situation of extreme rate heterogeneity in different branches of the tree.
Modified from Strimmer and von Haeseler (1996b).
Fixed Model Tree Simulation
Seq-Gen (Rambaut and Grassly 1997) was used to
generate ancestral sequences. In Seq-Gen, Jukes-Cantor
(JC) (Jukes and Cantor 1969) and Kimura’s two-parameter (K2P) (Kimura 1980) models were implemented as
special cases of Hasegawa, Kishino, and Yano’s (1985)
(HKY) model. We set the nucleotide frequencies to be
equal (0.25 each) in both models, and we set the transition-to-transversion ratio (TS/TV) to 0.5 and 4.0 for the JC and K2P
models, respectively. (We set the ratio to 4.0 for the K2P
model rather than the commonly used value of 2.0 because we wanted to consider a more extreme case.) For each case, 500 replications were generated.
The model trees used are shown in figures 4 and
5. In figure 4, model trees A, B, C, and D are modified
from Saitou and Imanishi (1989). Trees A and B assume
a constant rate of nucleotide substitution. The expected
number of nucleotide substitutions per site is denoted by
U, with U = 0.05 or 0.5. The length of each branch is
expressed as multiples of a (= U/8) (see fig. 4). Model
trees C and D have a large variation among branches in
rate of nucleotide substitution. For both trees, a = 0.01
and a = 0.05 were used. The threshold of bootstrap
values was set to 90% or 95%.
Model tree E is modified from Nei, Kumar, and
Takahashi (1998). The expected number of nucleotide
substitutions per site between two most divergent sequences is denoted by dmax, and the a value in the model
tree is then determined in proportion to this dmax value.
For this tree, dmax values of 0.25, 1.0, and 1.5 are used.
In figure 5, model trees T1 and T2 were modified
from Strimmer and von Haeseler (1996b). For each of
them, a variety of substitution rates a and b were assumed. As shown in figure 5, T1 and T2 assume a constant rate and two varying rates (among lineages),
respectively.
Model trees T1 and T2 were used to study the efficiencies of BIONJ (Gascuel 1997), Weighbor 1.0.1
(Bruno, Socci, and Halpern 2000), and fastDNAml 1.0
(Olsen et al. 1994) under the same conditions as the
NJML method. BIONJ and Weighbor are recent enhancements to the NJ method. The programs were run
with their default options. For fastDNAml 1.0, however,
TS/TV and all base frequency parameters were set to
0.5 or 4.0, and 0.25, respectively. To compare inferred
trees with model trees, a program modified from
PAML’s evolver (Yang 1999) was used.
Randomly Generated Model Tree Simulation
For each case, 100 trees were generated randomly,
and each possible tree was equally probable. To generate
them, evolver was used with slight modification. Randomly generated ancestral sequences were evolved along the
randomly generated trees under the JC or the K2P model. The number of OTUs was 20, and the sequence
length was 1,000 bp. The expected branch lengths of
randomly generated trees were varied for three cases:
0.01, 0.05, and 0.1. The other part of this simulation
was the same as the fixed model tree simulation.
Table 1
Proportions of Replicates in Which the Correct Topology Was Reconstructed Under a Constant-Rate Model Tree
                           JC                               K2P
MODEL TREE         NJ     NJML(90)  NJML(95)        NJ     NJML(90)  NJML(95)
A
  U = 0.05
    300 bp         0.524  0.598     0.558           0.516  0.574     0.606
    600 bp         0.822  0.858     0.854           0.734  0.826     0.824
    1,000 bp       0.938  0.960     0.966           0.918  0.938     0.942
  U = 0.50
    300 bp         0.554  0.556     0.562           0.494  0.620     0.606
    600 bp         0.814  0.792     0.814           0.748  0.846     0.870
    1,000 bp       0.926  0.920     0.924           0.892  0.956     0.956
B
  U = 0.05
    300 bp         0.650  0.680     0.688           0.548  0.622     0.598
    600 bp         0.814  0.888     0.854           0.778  0.822     0.844
    1,000 bp       0.946  0.976     0.970           0.888  0.936     0.934
  U = 0.50
    300 bp         0.520  0.564     0.606           0.570  0.666     0.650
    600 bp         0.752  0.768     0.832           0.756  0.850     0.850
    1,000 bp       0.940  0.958     0.950           0.870  0.936     0.966
NOTE.—JC = Jukes and Cantor’s model; K2P = Kimura’s two-parameter
model; NJ = the NJ method; NJML(90) = the NJML method with a 90% threshold; NJML(95) = the NJML method with a 95% threshold. In each case, 500
replications were conducted.
Table 2
Proportions of Replicates in Which the Correct Topology Was Reconstructed Under a Varying-Rate Model Tree
                           JC                               K2P
MODEL TREE         NJ     NJML(90)  NJML(95)        NJ     NJML(90)  NJML(95)
C
  U = 0.05
    300 bp         0.754  0.824     0.804           0.608  0.742     0.706
    600 bp         0.928  0.972     0.964           0.868  0.948     0.968
    1,000 bp       0.998  1.000     0.998           0.976  0.992     0.994
  U = 0.50
    300 bp         0.738  0.906     0.902           0.652  0.856     0.812
    600 bp         0.904  0.984     0.992           0.854  0.976     0.966
    1,000 bp       0.986  1.000     0.998           0.962  1.000     1.000
D
  U = 0.05
    300 bp         0.736  0.806     0.800           0.594  0.762     0.750
    600 bp         0.946  0.968     0.978           0.842  0.948     0.962
    1,000 bp       0.988  1.000     1.000           0.976  0.998     0.996
  U = 0.50
    300 bp         0.736  0.958     0.954           0.630  0.906     0.930
    600 bp         0.946  0.996     1.000           0.882  0.996     0.996
    1,000 bp       0.988  1.000     1.000           0.974  1.000     1.000
Computer Run Times
Computer run times were measured in NJML,
PUZZLE 4.0.2 (an implementation of the QP method),
fastDNAml 1.0, and DNAML 3.5 by using the clock()
function of GNU CC version 2.8.1 on a Pentium III
machine (450 MHz). Sequences were randomly generated by Seq-Gen under the K2P model. A given tree
was randomly generated by setting the mean branch
length to 0.05. These operations were iterated 100 times
for each case, and each computer run time was measured. The averages of the computer run times were
used to compare performances. In the case of DNAML,
however, only one run was carried out, because the computational run times were too large.
Results
Table 1 shows the results of simulation using ‘‘constant-rate’’ trees A and B (fig. 4). All NJ trees given in
the tables were consensus trees by bootstrap resampling,
and they were used as the initial trees for the NJML
method. Table 1 indicates that all initial NJ trees were
improved except in two cases under the JC model. In
the two exceptions, the sequences were very divergent
(U = 0.50) and the simple Jukes-Cantor model was
used.
NOTE.—Abbreviations in table 2 are as in table 1.
Table 2 shows the results of simulation using
‘‘varying-rate’’ trees C and D. Every initial NJ tree was
improved by the NJML method except in one tie case
(model tree C; U = 0.05; l = 1,000 bp). Under the
K2P model, the improvement in performance was
remarkable.
Table 3 shows the efficiency of the NJML and NJ
methods when ‘‘comblike’’ model trees (model tree E)
were used. In every case, the initial NJ trees were improved by the NJML method. The average ratio of
matched internal branches between inferred trees and
model trees was also improved by the NJML method;
the ratio is defined as $(I - d_T/2)/I$, where $I$ and $d_T$ are
the number of internal branches and the topological
distance (see Foulds, Penny, and Hendy 1979),
respectively.
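The ratio of matched internal branches can be computed as in the following Python sketch, in which each internal branch is represented by the set of OTUs on the side that does not contain a reference OTU, and the topological distance is taken as the number of branches present in exactly one of the two trees. The function names canonical_split and matched_ratio, and the five-OTU example, are assumptions made for illustration.

```python
def canonical_split(side, otus, ref):
    """Represent an internal branch by the side that excludes ref."""
    side = frozenset(side)
    return side if ref not in side else frozenset(otus) - side


def matched_ratio(splits_inferred, splits_model):
    """(I - d_T / 2) / I for two bifurcating trees on the same OTUs."""
    n_internal = len(splits_model)
    d_t = len(splits_inferred ^ splits_model)  # branches in only one tree
    return (n_internal - d_t / 2) / n_internal


otus = {1, 2, 3, 4, 5}
# Model tree ((1,2),3,(4,5)) and inferred tree ((1,3),2,(4,5))
model = {canonical_split(s, otus, 1) for s in ({1, 2}, {4, 5})}
inferred = {canonical_split(s, otus, 1) for s in ({1, 3}, {4, 5})}
print(matched_ratio(inferred, model))
```

In this example the two trees share one of their two internal branches, so d_T = 2 and the ratio is 0.5.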
The performance of NJML was also compared with
that of BIONJ (BN), Weighbor 1.0.1 (WE), fastDNAml
(fM), the quartet puzzling (QP) method (Strimmer and
von Haeseler 1996b), and PHYLIP DNAML, version
3.5c (DNAML) (Felsenstein 1993). Model trees T1 and
T2 (fig. 5) were used. We used the data of Strimmer,
Goldman, and von Haeseler (1997) for QP and that of
Strimmer and von Haeseler (1996b) for DNAML.
Table 3
Proportions of Replicates in Which the Correct Topology Was Reconstructed Under Constant-Rate Model Tree E
                     JC                                K2P
dmax       NJ              NJML(90)          NJ              NJML(95)
0.25       0.336 (0.879)   0.466 (0.916)     0.226 (0.849)   0.404 (0.900)
1.00       0.250 (0.864)   0.342 (0.895)     0.208 (0.840)   0.394 (0.895)
1.50       —               —                 0.164 (0.824)   0.292 (0.873)
NOTE.—The values in parentheses are average ratios of matched internal
branches between inferred trees and model trees. Dashes indicate that infinite
distances were estimated in more than 100 replications. Other abbreviations are
as in table 1. The sequence length was 300 bp.
Table 4
Ratios of Correctly Reconstructed Trees for Clocklike Evolution According to Tree T1
SEQUENCE EVOLUTION: JC
l (bp)   a/b          NJ     BN     WE     NM(90)  NM(95)  QP(a)  fM     ML(b)
500      0.01/0.07    0.69   0.73   0.72   0.80    0.80    0.80   0.83   0.87
         0.02/0.19    0.52   0.52   0.47   0.62    0.64    0.70   0.61   0.63
         0.03/0.42    0.12   0.14   0.13   0.14    0.16    0.29   0.12   0.09
1,000    0.01/0.07    0.96   0.96   0.92   0.97    0.97    0.94   0.98   0.96
         0.02/0.19    0.86   0.87   0.83   0.90    0.92    0.92   0.92   0.85
         0.03/0.42    0.35   0.35   0.29   0.37    0.38    0.53   0.33   0.34
SEQUENCE EVOLUTION: K2P (TS/TV = 4)
l (bp)   a/b          NJ     BN     WE     NM(90)  NM(95)  QP(a)  fM     ML(b)
500      0.01/0.07    0.59   0.49   0.56   0.68    0.87    0.70   0.70   0.71
         0.02/0.19    0.34   0.39   0.37   0.58    0.63    0.63   0.54   0.53
         0.03/0.42    0.13   0.11   0.10   0.19    0.09    0.33   0.17   0.18
1,000    0.01/0.07    0.89   0.90   0.89   0.93    0.96    0.89   0.94   0.94
         0.02/0.19    0.74   0.79   0.69   0.85    0.85    0.85   0.85   0.88
         0.03/0.42    0.28   0.35   0.25   0.46    0.34    0.57   0.40   0.41
NOTE.—BN = BIONJ; WE = Weighbor; NM(90) = NJML with 90% threshold; NM(95) = NJML with 95% threshold; QP = the quartet puzzling algorithm;
fM = the maximum-likelihood method implemented in fastDNAml 1.0; ML = the maximum-likelihood method implemented in PHYLIP DNAML, version 3.5c.
TS/TV = transition to transversion ratio. The other abbreviations are as in table 1.
(a) Data from Strimmer, Goldman, and von Haeseler (1997).
(b) Data from Strimmer and von Haeseler (1996b).
As shown in tables 4 and 5, although BIONJ and
Weighbor often gave better results than did the NJ method, the NJML method outperformed both of them. The
QP method gave better results than any other method in
six cases, while the NJML method gave better or tied
results in the other cases (table 4). In particular, note
that when model tree T2 (a ‘‘rate-varying’’ tree) was
used, the efficiency of the NJML method was considerably better than that of both the NJ and
QP methods (table 5). NJML performed almost as well
as fastDNAml and DNAML in most cases and gave better results than fastDNAml and DNAML in some cases.
Table 6 shows the results of simulation using randomly generated trees. The NJML method improved the
initial NJ trees without exception, not only with regard
to the proportion of correct reconstruction, but also with
regard to the average ratio of matched internal branches
between inferred trees and model trees.
As shown in table 7, NJML is clearly faster than
the PUZZLE, DNAML, and fastDNAml programs except for the cases with relatively small numbers of
OTUs (≤ 14). Note that PUZZLE and fastDNAml have
been highly optimized with regard to coding, while the
current version of NJML is still a set of experimental
programs. The computer run time for NJML for each
case shown in table 7 is actually the sum of the computer run times of the NJML subprograms (see fig. 3).
Interestingly, the most time-consuming part was not the
ML estimation, but the bootstrap resampling and the
computation of distance matrices by seqboot and dnadist
of PHYLIP (data not shown). This explains why NJML
is slower than PUZZLE and fastDNAml when the numbers of taxa involved are less than or equal to 14 and
8, respectively.
Some simulation data caused dnadist to return infinite distances. When the number of such replicates was
greater than 100 in a case, we did not evaluate the results (indicated by dashes in tables 3 and 5).
A Phylogenetic Tree of the Small-Subunit rRNA for
Eukaryotes
We reconstructed a phylogenetic tree of the small-subunit ribosomal RNA for eukaryotes by using the NJ
and NJML methods. The multiply aligned sequences are
from the Ribosomal Database Project (Maidak et al.
1999). Figures 6a and 6b show the initial NJ tree and the
NJML tree, respectively. The K2P model was used with
TS/TV = 4.0, and the threshold was set to 90%. OTUs
of the trees were automatically annotated by programs
in DeepForest (OOta 1998). Excluding gaps, 1,465 sites
were used. The NJ tree (fig. 6a) has the following major
problems:
1. The nematode (Caenorhabditis elegans) is clustered
with the acellular slime mold, the cellular slime
mold, the malaria parasite, and the dysentery
amoeba.
2. The sea urchin and the amphioxus form a monophyletic group.
3. The amoeba (Acanthamoebae) and green plants form
a monophyletic group.
In the NJML tree (fig. 6b), these problems are
solved: the nematode, the sea urchin, the amphioxus,
and the amoeba are located in reasonable places (e.g.,
see Maddison and Maddison 1998).
Table 5
Ratios of Correctly Reconstructed Trees for Nonclocklike Evolution According to Tree T2
SEQUENCE EVOLUTION: JC
l (bp)   a/b          NJ     BN     WE     NM(90)  NM(95)   QP(a)  fM     ML(b)
500      0.01/0.07    0.82   0.86   0.88   0.94    0.93     0.86   0.91   0.91
         0.02/0.19    0.65   0.81   0.89   0.94    0.95     0.85   0.94   0.93
         0.03/0.42    —      0.46   0.70   —       —        0.47   0.81   0.72
1,000    0.01/0.07    0.95   0.98   0.98   0.99    1.00     0.97   1.00   0.99
         0.02/0.19    0.91   0.99   0.99   0.99    1.00     0.96   1.00   0.99
         0.03/0.42    —      0.75   0.92   —       0.97(c)  0.70   0.95   0.92
SEQUENCE EVOLUTION: K2P (TS/TV = 4)
l (bp)   a/b          NJ     BN     WE     NM(90)  NM(95)   QP(a)  fM     ML(b)
500      0.01/0.07    0.75   0.80   0.83   0.90    0.94     0.81   0.91   0.89
         0.02/0.19    0.56   0.72   0.77   0.88    0.92     0.77   0.89   0.90
         0.03/0.42    0.24   0.40   0.53   0.74    0.73     0.52   0.70   0.72
1,000    0.01/0.07    0.92   0.96   0.97   0.99    0.98     0.95   0.98   0.99
         0.02/0.19    0.84   0.93   0.95   1.00    0.99     0.92   0.98   0.99
         0.03/0.42    0.46   0.67   0.83   1.00    0.96     0.74   0.92   0.90
NOTE.—Dashes indicate that infinite distances were estimated in more than 100 replications. Other abbreviations are as in table 4.
(a) Data from Strimmer, Goldman, and von Haeseler (1997).
(b) Data from Strimmer and von Haeseler (1996b).
(c) Infinite distances were estimated in three replications.
Table 6
Ratios of Correctly Reconstructed Trees for Randomly Generated Trees
                                  JC                                   K2P
MEAN BRANCH LENGTH     NJ                NJML               NJ              NJML
0.01                   0.530 (0.963)     0.540 (0.965)      0.360 (0.946)   0.370 (0.947)
0.05                   0.600 (0.972)     0.660 (0.978)      0.510 (0.964)   0.630 (0.972)
0.1                    0.000(a) (0.020)  0.000(a) (0.020)   0.480 (0.958)   0.600 (0.971)
NOTE.—For each case, 100 trees were randomly generated with the number of operating taxonomic units = 20. Abbreviations are as in table 1.
(a) Infinite distances were estimated in five replications (i.e., the estimates of efficiencies were based on 95 replications).
Discussion
Our results showed that NJML improved almost all
initial NJ trees; the improvement was especially remarkable in the varying-rate model trees (tables 2 and
5). As shown in Saitou and Imanishi (1989), the NJ
method outperformed the ML method when model trees
A and B were used under the JC model. Our results are
consistent with Saitou and Imanishi’s conclusions (see
table 1). Under the K2P model, the NJML method was
more efficient than the NJ method without exception.
The results of the QP method in tables 4 and 5 were
obtained by the improved QP method using discrete
weights (Strimmer, Goldman, and von Haeseler 1997).
In fact, in comparison with any other method, its results
were surprisingly good when model tree T1 was used
with relatively long branch lengths (a/b = 0.02/0.19 and
0.03/0.42). However, we should note that model trees
T1 and T2 were originally designed for testing the QP
method (Strimmer and von Haeseler 1996b) and that the
NJML method always outperformed the QP method
when model tree T2 was used.
As noted above, NJML outperformed fastDNAml
and DNAML in some cases (see tables 4 and 5). The
computational load for DNAML is prohibitive when the
number of taxa is large (Strimmer and von Haeseler
1996). Although fastDNAml is considerably faster than
DNAML (table 7), the computational load is still not
negligible for large trees. In comparison, the search
space of NJML is mainly determined by n (the maximum number of removed internal branches at each
step), not the total number of taxa, so it is more practical
than fastDNAml for large trees.
Interestingly, NJML did not always give better results with the threshold 95% than with the threshold
90%. This is because in the NJML algorithm, the search
space varies considerably depending on the distribution
of bootstrap values. For n = 3, there are three possibilities with regard to the size of search space if three or
more internal branches have lower bootstrap values than
the threshold:
1. The three internal branches are not adjacent to each
other: the size of the search space is 3 × 3 × 3 = 27.
2. Two of the three internal branches are adjacent to
each other: the size of the search space is 15 × 3 = 45.
3. All three internal branches are adjacent to each other:
the size of the search space is 105.
This heterogeneity of search space will become
higher with greater n. In the worst case, we need to
explore a search space whose size is approximately
$\sum_{i=1}^{s/3} \prod_{j=1}^{n} (2j + 1)$ when there are s internal branches
having lower bootstrap values than the threshold. For n = 3,
$\prod_{j=1}^{n} (2j + 1) = 105$ and $\sum_{i=1}^{s/3} \prod_{j=1}^{n} (2j + 1) \le (s/3) \times 105 = 35s$, which increases only linearly with
s. On the other hand, in the best case, we need to explore
a search space whose size is approximately $\sum_{i=1}^{s/3} 3^{n} = s\,3^{n-1}$. Although this heterogeneity of search space
should be explored in the future, we think that n = 3 is
suitable for personal computers.
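The three cases listed above, and the worst-case and best-case totals for n = 3, can be checked with a short Python sketch. The description of one step by the sizes of its groups of mutually adjacent removed branches, and the function names group_resolutions, step_search_space, worst_case_total, and best_case_total, are assumptions made for illustration; s is assumed to be a multiple of 3.

```python
from math import prod


def group_resolutions(k):
    """Resolutions of one node formed by k adjacent removed branches."""
    return prod(2 * j + 1 for j in range(1, k + 1))


def step_search_space(group_sizes):
    """Search space of one step, given the sizes of the groups of
    mutually adjacent removed branches (the sizes sum to n)."""
    return prod(group_resolutions(k) for k in group_sizes)


def worst_case_total(s):
    """All s low-support branches handled in adjacent triples (n = 3)."""
    return (s // 3) * group_resolutions(3)  # (s/3) * 105 = 35s


def best_case_total(s):
    """No two removed branches adjacent in any step (n = 3)."""
    return (s // 3) * 3 ** 3  # (s/3) * 3^n = s * 3^(n-1)


print(step_search_space([1, 1, 1]), step_search_space([2, 1]),
      step_search_space([3]))
print(worst_case_total(6), best_case_total(6))
```

The first line reproduces the sizes 27, 45, and 105 of the three cases; for s = 6 the totals are 210 (= 35s) in the worst case and 54 (= 9s) in the best case.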
Since bootstrapping tends to underestimate the confidence level of a subtree (Zharkikh and Li 1992, 1995;
Hillis and Bull 1993), a threshold of 90% or even less
would be sufficiently high for a good performance of
NJML.
Table 7
Average Computer Run Times (in seconds) Corresponding to the Number of Operating Taxonomic Units (OTUs)
                                          NO. OF OTUS
PROGRAM       8       10      12       14       16       18       20         22
NJML(90)      3.37    5.35    9.44     10.65    14.96    21.17    28.47      33.93
NJML(95)      3.38    5.38    9.66     10.90    15.39    23.02    31.98      38.84
PUZZLE        0.94    2.49    4.96     10.07    18.35    29.48    46.83      70.81
fastDNAml     1.63    5.75    12.20    27.21    43.39    78.55    115.57     180.39
DNAML         24.94   75.94   148.61   268.17   474.24   825.79   1,109.46   1,596.86
NOTE.—Sequences were generated by using random trees. The mean branch length was 0.05. The K2P model was used to evolve the sequences. The number
of replications was 100 except for DNAML, in which only one run was carried out.
FIG. 6.—Phylogenetic trees of 43 small-subunit ribosomal RNA sequences for eukaryotes. Trees a and b were reconstructed by using the NJ and
NJML methods, respectively. Numbers by internal branches represent bootstrap values (%) based on 100 pseudoreplications. Thick internal
branches without bootstrap values were evaluated by the NJML method in tree b. Note that some of the internal branches had length 0. The
scale bars represent the number of substitutions per 100 sites.
Acknowledgments
We thank Masami Hasegawa for permission to use
the source code of MOLPHY, version 2.2. Andrew Rodin kindly advised us on simulation studies. Richard
Blocker helped us to maintain the computer system.
This work was supported by NIH grants GM30998 and
GM55759.
LITERATURE CITED
ADACHI, J., and M. HASEGAWA. 1996. MOLPHY version 2.3:
programs for molecular phylogenetics based on maximum
likelihood. Comput. Sci. Monogr. 28:1–150.
BRUNO, W. J., N. D. SOCCI, and A. L. HALPERN. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:
189–197.
CAVALLI-SFORZA, L. L., and A. W. F. EDWARDS. 1967. Phylogenetic analysis: models and estimation procedures. Am.
J. Hum. Genet. 19:233.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–
376.
———. 1985. Confidence limits on phylogenies: an approach
using the bootstrap. Evolution 39:783–791.
———. 1993. PHYLIP: phylogeny inference package. Version
3.5c. University of Washington, Seattle.
FOULDS, L. R., D. PENNY, and M. D. HENDY. 1979. A graph
theoretic approach to the development of minimal phylogenetic trees. J. Mol. Evol. 13:151–166.
FUKAMI-KOBAYASHI, K., and Y. TATENO. 1991. Robustness of
maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:79–91.
GASCUEL, O. 1997. BIONJ: an improved version of the NJ
algorithm based on a simple model of sequence data. Mol.
Biol. Evol. 14:685–695.
HASEGAWA, M., and M. FUJIWARA. 1993. Relative efficiencies
of the maximum likelihood, maximum parsimony, and
neighbor-joining methods for estimating protein phylogeny.
Mol. Phylogenet. Evol. 2:1–5.
HASEGAWA, M., H. KISHINO, and N. SAITOU. 1991. On the
maximum likelihood method in molecular phylogenetics. J.
Mol. Evol. 32:443–445.
HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating of
the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
HILLIS, D. M., and J. J. BULL. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42:182–192.
HUELSENBECK, J. P. 1995. The robustness of two phylogenetic
methods: four-taxon simulations reveal a slight superiority
of maximum likelihood over neighbor joining. Mol. Biol.
Evol. 12:843–849.
JUKES, T. H., and C. R. CANTOR 1969. Evolution of protein
molecules. Pp. 21–132 in H. N. MUNRO, ed. Mammalian
protein metabolism. Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies
of nucleotide sequences. J. Mol. Evol. 16:111–120.
KISHINO, H., and M. HASEGAWA. 1990. Converting distance to
time: application to human evolution. Methods Enzymol.
183:550–570.
KUHNER, M., and J. FELSENSTEIN. 1994. A simulation comparison of phylogeny algorithms under equal and unequal
evolutionary rates. Mol. Biol. Evol. 11:459–468.
MADDISON, D., and W. P. MADDISON. 1998. The tree of life: a
multi-authored, distributed Internet project containing information about phylogeny and biodiversity. College of Agriculture, University of Arizona, Tucson. Internet address:
http://phylogeny.arizona.edu/tree/phylogeny.html.
MAIDAK, B. L., J. R. COLE, C. T. P. JR. et al. (14 co-authors).
1999. A new version of the RDP (Ribosomal Database Project). Nucleic Acids Res. 27:171–173.
NEI, M., S. KUMAR, and K. TAKAHASHI. 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino
acids used is small. Proc. Natl. Acad. Sci. USA 95:12390–
12397.
OLSEN, G., H. MATSUDA, R. HAGSTROM, and R. OVERBEEK.
1994. fastDNAml: a tool for construction of phylogenetic
trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:41–48.
OOTA, S. 1998. Development of an integrated system for molecular evolutionary study and its application. Ph.D. thesis,
Department of Genetics, School of Life Science, Graduate
University for Advanced Studies, Mishima, Japan.
RAMBAUT, A., and N. C. GRASSLY. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence
evolution along phylogenetic trees. Comput. Appl. Biosci.
13:235–238.
SAITOU, N., and T. IMANISHI. 1989. Relative efficiencies of the
Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of
phylogenetic tree construction in obtaining the correct tree.
Mol. Biol. Evol. 6:514–525.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method:
a new method for reconstructing phylogenetic trees. Mol.
Biol. Evol. 4:406–425.
STRIMMER, K., and A. VON HAESELER. 1996a. PUZZLE. Version 2.5. Zoologisches Institut, Universitaet Muenchen, Munich, Germany.
———. 1996b. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol.
Evol. 13:964–969.
STRIMMER, K., N. GOLDMAN, and A. VON HAESELER. 1997.
Bayesian probabilities and quartet puzzling. Mol. Biol.
Evol. 14:210–211.
TATENO, Y., N. TAKEZAKI, and M. NEI. 1994. Relative efficiencies of the maximum-likelihood, neighbor-joining, and
maximum-parsimony methods when substitution rate varies
with site. Mol. Biol. Evol. 11:261–277.
WINSTON, P. H. 1993. Artificial intelligence. Addison-Wesley,
Mass.
YANG, Z. 1993. Maximum-likelihood estimation of phylogeny
from DNA sequences when substitution rates differ over
sites. Mol. Biol. Evol. 10:1396–1401.
———. 1999. PAML manual. Department of Biology, Galton
Laboratory, University College London, London.
ZHARKIKH, A., and W.-H. LI. 1992. Statistical properties of
bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol.
Biol. Evol. 9:1119–1147.
———. 1995. Estimation of confidence in phylogeny: the
complete-and-partial bootstrap technique. Mol. Phylogenet.
Evol. 4:44–63.
YUN-XIN FU, reviewing editor
Accepted June 6, 2000