Detecting the Node-Density Artifact in Phylogeny

Syst. Biol. 55(4):637-643,2006
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI:10.1080/10635150600865567
Detecting the Node-Density Artifact in Phylogeny Reconstruction
CHRIS VENDITTI, ANDREW MEADE, AND MARK PAGEL
School of Biological Sciences, University of Reading, Whiteknights, Reading RG6 6AJ, England; E-mail: [email protected] (M.P.)
Abstract.— The node-density effect is an artifact of phylogeny reconstruction that can cause branch lengths to be underestimated in areas of the tree with fewer taxa. Webster, Payne, and Pagel (2003, Science 301:478) introduced a statistical procedure
(the "delta" test) to detect this artifact, and here we report the results of computer simulations that examine the test's performance. In a sample of 50,000 random data sets, we find that the delta test detects the artifact in 94.4% of cases in which it is
present. When the artifact is not present (n = 10,000 simulated data sets) the test showed a type I error rate of approximately
1.69%, incorrectly reporting the artifact in 169 data sets. Three measures of tree shape or "balance" failed to predict the size
of the node-density effect. This may reflect the relative homogeneity of our randomly generated topologies, but emphasizes
that nearly any topology can suffer from the artifact, the effect not being confined only to highly unevenly sampled or otherwise imbalanced trees. The ability to screen phylogenies for the node-density artifact is important for phylogenetic inference
and for researchers using phylogenetic trees to infer evolutionary processes, including their use in molecular clock dating.
[Delta test; molecular clock; molecular evolution; node-density effect; phylogenetic reconstruction; speciation; simulation.]
Fitch and Bruschi (1987) and Fitch and Beintema (1990)
identified an artifact of phylogeny reconstruction that
has come to be known as the node-density effect. These
authors noted that branch lengths will tend to be better
estimated in parts of a tree where more taxa have been
sampled. Conversely, where taxon sampling is sparse or
the amount of change between successive nodes of the
tree is large, phylogenetic reconstruction methods will
tend to underestimate the true amount of change. This is
because in longer branches of a tree, multiple "hits," or
two or more changes at a given site, are common. These
multiple hits are mostly invisible and get reconstructed
as one change, causing branch lengths to be underestimated. The effect never disappears but will be smaller in
shorter branches of the tree, where fewer multiple hits
are expected.
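To illustrate why multiple hits lead to underestimation, the following sketch uses the Jukes-Cantor model. That model is an assumption made here for simplicity and is not the model used in this study. Under it, two sequences separated by d substitutions per site are expected to differ at a proportion p = (3/4)(1 - exp(-4d/3)) of sites, so the visible difference p falls further short of d as d grows. The function name is illustrative.

```python
import math

def observed_difference(d):
    """Expected proportion of differing sites under Jukes-Cantor
    for a true distance of d substitutions per site (an assumed,
    simplified model; not the GTR model used in the paper)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.05, 0.2, 1.0, 2.0):
    p = observed_difference(d)
    # Fraction of the true change that remains visible as differences.
    print(f"true d = {d:4.2f}  observed p = {p:.3f}  visible fraction = {p / d:.2f}")
```

The visible fraction shrinks as d increases, which is the sense in which long branches lose proportionally more change than short ones.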
Summed over all of the branch lengths of a phylogeny,
this artifact can cause an apparent relationship between
the number of nodes and the total inferred amount of
change. Where there has been more net speciation (more
internal nodes of the tree), the true amount of change
along each branch is better estimated, giving the appearance that there has been more total evolution along the
summed path from the root of the phylogeny to its tips.
The effect increases with the number of nodes included
along a path until the total path length approaches the
true length. This leads to the expectation of a curvilinear relationship between the reconstructed length of a
path and the number of nodes along that path. Figure 1a
shows a phylogeny in which the artifact is present, and
Figure 1b plots the total root-to-tip (species) path lengths
against the number of nodes along the path, showing the
expected curvilinear trend (see also Fitch and Beintema's
[1990] figure 2, reprinted in Page and Holmes [1998, page
169]).
Webster et al. (2003) introduced a statistical test to detect phylogenies that suffer from the node-density artifact. Those authors fit a curve of the form $n = \beta x^{\delta}$, where $n$
is the number of nodes, $x$ is the phylogenetic path length,
$\beta$ describes the rate of change between path length and
the number of nodes, and $\delta$ captures any curvature. This
is algebraically equivalent to finding a curve of the form
$x = \beta^{*} n^{1/\delta}$, where $\beta^{*} = \beta^{-1/\delta}$, and we expect $\delta > 1$ when
the artifact is present. When the data do not suffer from
the artifact, there can still be a relationship between path
lengths and nodes such that $\beta^{*} > 0$, but $\delta < 1$. To test for
the artifact, Webster et al. (2003; Supplementary Information) describe a generalized least-squares (GLS) procedure based upon Pagel's (1997, 1999) continuous method.
The GLS method assesses the relationship between path
lengths and nodes using all of the information in the
phylogenetic tree and accounting for phylogenetic relatedness in both measures.
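The algebraic equivalence of the two curve forms can be checked in a few lines. The derivation below writes $\beta$ for the rate parameter and $\delta$ for the curvature parameter, and assumes $\beta > 0$, $x > 0$, and $\delta \neq 0$.

```latex
\begin{align*}
n &= \beta x^{\delta} \\
x^{\delta} &= \frac{n}{\beta} \\
x &= \beta^{-1/\delta}\, n^{1/\delta} = \beta^{*} n^{1/\delta},
\qquad \beta^{*} = \beta^{-1/\delta}.
\end{align*}
```

When $\delta > 1$ the exponent $1/\delta$ is less than 1, so path length rises ever more slowly with the number of nodes, which is the curvilinear signature of the artifact.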
Here we report on the performance of the delta test for
detecting the node-density artifact by analyzing simulated gene-sequence data on random phylogenetic trees.
Our particular interest is to determine how well the $\delta > 1$
criterion identifies trees suffering from the artifact.
METHODS
Simulation Data
We used PhyloGen (Rambaut, 2002) to simulate 1000
random ultrametric trees of 50 species each. The speciation rate was set to twice that of the extinction parameter
(birth = 0.2, death = 0.1, respectively). We then added an
artificial outgroup taxon to each tree. This was done to
ensure that all the branches leading to the true root were
estimated properly (as described below).
For each of the 1000 random topologies we used SeqGen (Rambaut and Grassly, 1997) to generate 50 random
gene-sequence data sets of 1000 base pairs. We generated
data from the general time-reversible (GTR + $\Gamma_4$) model
of sequence evolution, choosing the values of the rate
parameters in the GTR matrix at random for each data
set from the uniform interval between 0 and 20, with the
exception of the G -> T rate, which was always 1. All base
frequencies were assumed to be 0.25. We chose the value
of the gamma shape parameter on the uniform interval
0 to 4 and varied the tree length by randomly choosing
the root-to-tip distance (substitutions per site) between
0.2 and 2.2 each time an alignment was simulated. This
gave us 50,000 data sets in which, because the trees are
ultrametric, there is no relationship between the number
of nodes along a path and the path length.
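The parameter draws described above can be sketched as follows. This only illustrates the sampling scheme: the dictionary keys and function name are hypothetical and are not Seq-Gen options, and placing the fixed rate last in the list is a convention chosen for the sketch.

```python
import random

def draw_simulation_parameters(rng):
    """Draw one set of simulation settings as described in the text.
    Dictionary keys are illustrative names, not Seq-Gen options."""
    # Five free GTR exchange rates from U(0, 20); the sixth is fixed at 1.
    rates = [rng.uniform(0.0, 20.0) for _ in range(5)] + [1.0]
    return {
        "gtr_rates": rates,
        "base_frequencies": [0.25, 0.25, 0.25, 0.25],
        "gamma_shape": rng.uniform(0.0, 4.0),
        "root_to_tip": rng.uniform(0.2, 2.2),
        "sequence_length": 1000,
    }

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_topologies, replicates = 1000, 50
print(n_topologies * replicates)  # total number of simulated data sets
params = draw_simulation_parameters(rng)
print(params)
```

Each call yields a fresh parameter set, so one call per alignment mirrors the design in which parameters were redrawn every time an alignment was simulated.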
We estimated the phylogenetic branch lengths for each
of the 50,000 data sets using PAUP* 4.0b10 (Swofford,
2001) and giving it the correct topology. Although the
simulated trees were rooted, all branch lengths were estimated on unrooted trees. The artificial outgroup taxon
was included and used to estimate where the true root
of the tree should be placed along the basal branch. If
branch lengths are estimated on rooted trees, maximum
likelihood will correctly estimate the total length of the
branch leading from the outgroup to the ingroup taxa,
but it does not know where to place the root along this
branch. If the root is placed such that the arbitrary length
of the segment leading from the root to the outgroup is
short, this can falsely give the impression of the node-density artifact.
Although PAUP was given the true topology, we used
a GTR model of evolution without gamma to infer the
branch lengths in each of the 50,000 data sets. The simple
GTR model will fail to capture the exact nature of the
evolutionary process that gave rise to the data and is
therefore expected to produce the node-density effect to
varying degrees (see also Zharkikh, 1994).
We generated a further 10,000 data sets of 1000 base
pairs using 10 replicates each of the same 1000 simulated trees, and the same range of parameters. For each
of these data sets we estimated the branch lengths in
PAUP but using a GTR + $\Gamma_4$ model. Inferring the branch
lengths with the same model under which the data were simulated
means that the evolutionary process that gave rise to
the data will be well approximated and we do not expect
the node-density artifact to be present.
Node-Density Analyses
We removed the artificial outgroup taxon from each
data set and rooted the trees at the point the outgroup
taxon had identified. Then, for each tree we first tested for
a relationship between the reconstructed path length, calculated as the sum of the branches from the root to the tip
for each species, and the number of internal nodes along
that path, starting at zero for the root and not counting the tip at the end of the path as an additional node.
(Webster et al., 2003, count species as additional nodes,
meaning that the values reported in their figure 1 would
differ from ours by one. We prefer the present method of
counting nodes as it corresponds to speciation events on
the tree; see also Discussion.) The relationship between
nodes and path lengths is tested by means of a likelihood-ratio (LR) statistic comparing the likelihood of a random-walk model to a directional random-walk model (Pagel,
1997, 1999; Webster et al., 2003; Supplementary Information). The models differ by the parameter $\beta^{*}$ as described
above in the equation for $x$, where $\beta^{*}$ measures the regression of path length on nodes. We expect $\beta^{*} = 0$ when
no artifact is present. If the artifact is present in the data,
we expect $\beta^{*} > 0$ and that the directional model will provide a better fit. Twice the difference in log-likelihoods (the
LR) is assessed against a $\chi^2_1$ distribution.
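The two quantities compared in this analysis, root-to-tip path length and the number of internal nodes along the path, can be extracted from a rooted tree as in the sketch below. The nested-tuple tree encoding, the function name, and the branch lengths are all hypothetical choices made for illustration; the counting convention (root counted as zero, tip not counted) follows the text.

```python
def root_to_tip_paths(node, length_so_far=0.0, nodes_so_far=0, is_root=True):
    """Yield (tip_name, path_length, node_count) for every tip.

    A tip is (name, branch_length); an internal node is
    (list_of_children, branch_length). The root counts as zero and
    the tip itself is not counted, following the convention in the text.
    """
    label_or_children, branch_length = node
    length = length_so_far + branch_length
    if isinstance(label_or_children, str):
        yield label_or_children, length, nodes_so_far
        return
    # Every internal node below the root adds one to the count.
    count = nodes_so_far if is_root else nodes_so_far + 1
    for child in label_or_children:
        yield from root_to_tip_paths(child, length, count, is_root=False)

# Hypothetical four-tip tree, written as (((A, B), C), D); branch lengths are made up.
tree = ([([([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.25)], 0.3), ("D", 0.4)], 0.0)
for name, path_length, n_nodes in root_to_tip_paths(tree):
    print(name, round(path_length, 2), n_nodes)
```

The resulting (path length, node count) pairs are the raw data for the regression; the phylogenetic GLS fit itself is not reproduced here.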
Because the true trees are ultrametric, a significant association between path length and nodes is evidence,
apart from chance effects, for the node-density artifact. In
real data the nature of the true tree is not known, and a relationship between the number of nodes and path length
could arise for reasons other than the artifact (see, for example, Webster et al., 2003). However, the artifact can be
distinguished from other causes by the nature of the relationship it produces between path lengths and nodes.
In particular, the delta test asserts that when a significant
association has been caused by the artifact, we expect the
parameter $\delta$ to be greater than 1. For each significant directional model ($\beta^{*}$ significantly > 0), we therefore also
separately estimated $\delta$ and recorded its value (the test
makes no predictions about $\delta$ when the artifact is not
present). In practice we find that $\beta^{*}$ and $\delta$ are more accurately estimated from $n = \beta x^{\delta}$ than from the equivalent
regression of path length on nodes (see Appendix), and
all of our analyses used this form of the equation. We took
any numerical value of $\delta > 1$ in conjunction with a significant directional model to be evidence of the node-density
effect. The performance of the delta test is measured by
the proportion of the simulated data sets with significant
associations between nodes and path length in which
the parameter $\delta$ is greater than 1. Software to implement
the test is available from www.evolution.reading.ac.uk
or www.ams.rdg.ac.uk/zoology/pagel.
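The decision rule just described can be written as a small function. This is a minimal sketch, assuming the LR statistic, the slope, and the curvature parameter have already been estimated by the phylogenetic GLS procedure; the function name and return strings are hypothetical, and the threshold 3.84 is the 5% critical value of a chi-square distribution with one degree of freedom.

```python
CHI2_1DF_CRITICAL_5PCT = 3.84  # 5% critical value of chi-square, 1 d.f.

def delta_test(lr, beta_star, delta, critical=CHI2_1DF_CRITICAL_5PCT):
    """Classify one tree from already-estimated quantities.

    lr, beta_star and delta are assumed to come from the phylogenetic
    GLS fit described in the text; this function does not estimate them.
    """
    significant_positive = lr > critical and beta_star > 0
    if not significant_positive:
        return "no significant positive association"
    if delta > 1:
        return "node-density artifact indicated"
    return "association not attributed to the artifact"

# Values reported for the example tree in Figure 1.
print(delta_test(lr=7.38, beta_star=0.13, delta=7.33))
```

Keeping the significance check and the curvature check separate mirrors the two-step logic of the test: curvature is only interpreted once a positive association has been established.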
Distributional Statistics
Using the methods described above we derived for
each data set a likelihood-ratio statistic comparing the
directional to the random walk model—this is the test of
$\beta^{*}$. Under the null hypothesis of no artifact, we expect
the cumulative density of LR values to conform to a $\chi^2_1$
density. We compared distributions of the LR statistics to
these expected $\chi^2_1$ densities using the Kolmogorov-Smirnov (K-S) D statistic.
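The K-S D statistic is the largest vertical gap between the empirical cumulative distribution of the LR values and the reference distribution. The sketch below computes it with the standard library only; the LR values are stand-ins generated here for illustration (squares of standard normal draws, which follow a chi-square distribution with one degree of freedom), not results from the study, and the program used by the authors is not reproduced.

```python
import math
import random

def chi2_1df_cdf(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov D: the largest gap between the
    empirical CDF of the sample and the reference CDF."""
    values = sorted(sample)
    n = len(values)
    d = 0.0
    for i, value in enumerate(values, start=1):
        f = cdf(value)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

rng = random.Random(0)
# Stand-in LR values drawn under the null: squares of standard normals.
null_lr = [rng.gauss(0.0, 1.0) ** 2 for _ in range(5000)]
# Stand-in LR values shifted to the right, mimicking an excess of large LRs.
shifted_lr = [value + 2.0 for value in null_lr]
print(ks_statistic(null_lr, chi2_1df_cdf))
print(ks_statistic(shifted_lr, chi2_1df_cdf))
```

A small D indicates agreement with the reference distribution, whereas a distribution displaced toward large LR values yields a large D.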
Tree Shape
To examine whether the shape of the simulated trees
influenced the probability of obtaining an artifact, we
calculated three measures of tree shape for each tree using the computer program MeSA (Agapow and Purvis,
2002): Colless' (1982) index $I_c$, a measure of tree imbalance; Shao and Sokal's (1990) $B_1$ index, a measure of tree
balance; and Rohlf et al.'s (1990) noncumulative steminess index.
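As an illustration of the first of these measures, Colless' index is commonly computed as the sum, over internal nodes, of the absolute difference in tip numbers between the two daughter clades, divided by its maximum value (n - 1)(n - 2)/2 for n tips. The sketch below assumes that normalization and a strictly bifurcating tree with at least three tips; whether MeSA applies exactly this normalization is not stated in the text, and the nested-tuple encoding is hypothetical.

```python
def colless_index(tree):
    """Normalized Colless imbalance for a strictly bifurcating tree.

    The tree is a nested 2-tuple of tip labels (strings), a hypothetical
    encoding chosen for this sketch. Requires at least three tips.
    """
    total = 0

    def count_tips(node):
        nonlocal total
        if isinstance(node, str):
            return 1
        left, right = (count_tips(child) for child in node)
        total += abs(left - right)
        return left + right

    n = count_tips(tree)
    # Maximum possible sum, reached by a pectinate tree of n tips.
    return total / ((n - 1) * (n - 2) / 2)

balanced = (("A", "B"), ("C", "D"))
pectinate = ((("A", "B"), "C"), "D")
print(colless_index(balanced), colless_index(pectinate))  # 0.0 1.0
```

The two example trees sit at the extremes of the scale, a fully balanced tree at 0 and a pectinate tree at 1.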
RESULTS
The tree in Figure 1 shows the artifact. The LR test of
the directional model returns a significant LR of 7.38, the
slope $\beta^{*}$ is estimated to be 0.13, and $\delta = 7.33$ (all values
estimated by maximum likelihood).
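The reported values can be examined with a short calculation. The sketch below assumes the LR statistic is referred to a chi-square distribution with one degree of freedom, whose upper-tail probability is erfc(sqrt(LR/2)), and evaluates the fitted curve $x = \beta^{*} n^{1/\delta}$ at the estimates given for Figure 1. Function names are illustrative.

```python
import math

def chi2_1df_pvalue(lr):
    """Upper-tail probability of a chi-square variate with 1 d.f."""
    return math.erfc(math.sqrt(lr / 2.0))

def fitted_path_length(n_nodes, beta_star=0.13, delta=7.33):
    """Path length predicted by x = beta* * n**(1/delta)."""
    return beta_star * n_nodes ** (1.0 / delta)

print(f"p-value for LR = 7.38: {chi2_1df_pvalue(7.38):.4f}")
for n in (1, 2, 5, 10):
    print(n, round(fitted_path_length(n), 3))
```

With an exponent of 1/7.33, the predicted path length climbs quickly over the first few nodes and then nearly levels off.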
FIGURE 1. (a) A tree that displays the node-density artifact. (b) Plot of the total path length from root to tip against the number of nodes for each taxon in (a), showing the curvilinear trend associated with the node-density artifact. The directional random-walk model fits these data significantly better than the random-walk model (LR = 7.38; $\beta^{*} = 0.13$). The parameter $\delta$ is estimated to be 7.33 (see text). Therefore, the solid line in (b) is of the form $x = 0.13\, n^{1/7.33}$, where $x$ is the total path length and $n$ is the number of nodes (see text).
Manipulating the Presence/Absence of the Artifact
We expect to see the artifact at much higher than
chance levels in the 50,000-tree data set (hereafter, artifact data), but not in the 10,000-tree data set (hereafter,
nonartifact data). Figure 2a plots the cumulative distribution of the 10,000 observed LR values for the nonartifact data along with the cumulative distribution of
a true $\chi^2_1$. The two lines fall on top of each other, and the
K-S test confirms that the observed cumulative density
does not depart from the expected $\chi^2_1$ distribution (D =
0.09559, P = 0.3189): analyzing the simulated data with
the model that generated it returns accurately estimated
branch lengths. On the other hand, Figure 2b shows that
the distribution of the LR statistics resulting from the artifact data set of 50,000 trees is considerably skewed to
the right of the expected $\chi^2_1$ distribution. This indicates
more large LR scores than expected, and the distribution
returns a significant K-S test (D = 0.4735, P < 0.0001). Inferring the branch lengths on ultrametric trees using the
"wrong" model of sequence evolution gives rise to the
node-density artifact.
Detecting the Node-Density Artifact
In the artifact data, 48.67% (n = 24,336) of the simulated data sets showed a significant and positive association between total path length and the number of nodes.
The artifact, as measured by the size of the LR statistic, was more likely to arise in trees with greater rate
heterogeneity, as indicated by the $\alpha$-shape parameter of
the gamma distribution (r = -0.6024, P < 0.0001), and
somewhat more likely to arise in longer trees (r = 0.272,
P < 0.0001). These results are expected: in shorter trees
and in trees with minimal rate heterogeneity, the inferred branch lengths capture all or nearly all of the true
changes in the data, and the artifact is negligible or not
present.
FIGURE 2. (a) Cumulative distribution of the $\chi^2_1$ distribution (grey line) compared with that of the LR statistics derived from the 10,000 trees in which the branch lengths were estimated using the GTR + $\Gamma_4$ model (black line): the two lines fall directly on top of each other and the K-S test is not significant (D = 0.09559, P = 0.3189). (b) The same $\chi^2_1$ distribution (grey line) compared with the cumulative probability distribution of the LR statistics derived from the 50,000 trees in which only the GTR model was used to estimate the branch lengths (black line). The distribution of LRs is significantly skewed to the right of the $\chi^2_1$ distribution, indicating more large LR scores than expected, and the K-S test is significant (D = 0.4735, P < 0.0001). The horizontal axis of each panel is the likelihood ratio.
TABLE 1. The number of significant positive associations in the artifact data set, and the number of these that had an estimate of $\delta$ greater than 1.
Sample size | Trees with a significant positive association between nodes and total path length | Of these, trees with ML estimate of $\delta > 1$
50,000 | 24,336 (48.7%) | 22,983 (94.4%)
The delta test expects that data sets displaying the artifact will return values of $\delta > 1$. In 94.4% of the 24,336 data
sets with significant LR statistics, the maximum likelihood estimate of $\delta$ exceeded 1 (Table 1). Thus the delta test
correctly identifies cases of the artifact at a high rate. By
comparison, only the expected 5% (5.12%) of the 10,000
nonartifact data sets showed a significant association between nodes and path length. Fewer than half of these
(1.95% of the total) showed the positive association expected of the artifact. Of this 1.95% about 87% return an
estimate of $\delta$ greater than 1. This means that the delta test
has a type I error rate of about 1.7% in these data.
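The type I error rate follows from multiplying the two reported proportions. The sketch below repeats that arithmetic; because the 87% figure is given only approximately, the product recovers the 169 false positives reported in the Abstract only to within rounding. Variable names are illustrative.

```python
n_nonartifact = 10_000
positive_fraction = 0.0195   # significant and positive associations (from the text)
delta_gt_one = 0.87          # "about 87%" of those, as reported (rounded)

type_one_rate = positive_fraction * delta_gt_one
print(f"{type_one_rate:.4f}")                # approximate type I error rate
print(round(type_one_rate * n_nonartifact))  # approximate number of data sets
print(f"{169 / n_nonartifact:.4f}")          # rate implied by 169 false positives
```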
Figure 3 shows the LR statistic plotted against the estimate of $\delta$ for each of the 24,336 artifact data sets with
significant positive associations between path length and
nodes. As the estimate of $\delta$ moves past 1, the LR statistic increases sharply. Because $\delta$ measures the curvature
of the relationship, this plot emphasizes that when the
node-density artifact is present (LR > 3.84), the expected
curvilinear relationship between path length and nodes
arises, such as in Figure 1. The opposite point also holds:
values of $\delta < 1$ are not expected when the artifact is
present and Figure 3 confirms this with only 5.6% of the
estimated $\delta$ values less than 1.0.
The decline in LR values for larger values of $\delta$ probably arises from trees with a small variance in total path
lengths across the tips. Consider what Figure 1b would look like if there
were very little difference among species in total path
lengths. In the limit, if all species have the same path
length, the plot will produce a horizontal line. As this
limit is approached the directional model offers less and
less improvement on the nondirectional model, eventually declining to zero. At the same time, as the limit is
approached, the $x = \beta^{*} n^{1/\delta}$ curve is required to turn an
increasingly sharp corner, requiring higher values of $\delta$.
In support of this conjecture, we find that for the 24,336
results plotted in Figure 3, the correlation between the
variance in path lengths and LR is 0.48 (P < 0.0001).
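The flattening of the curve at large values of the curvature parameter can be seen numerically. In the sketch below the slope parameter cancels, so the ratio of predicted path lengths at 10 nodes and at 1 node depends only on the curvature parameter; the choice of 10 nodes and of the example values is arbitrary.

```python
def path_length_ratio(n_max, delta):
    """Ratio x(n_max) / x(1) implied by x = beta* * n**(1/delta).
    beta* cancels, so the ratio depends only on delta."""
    return n_max ** (1.0 / delta)

for delta in (1.0, 2.0, 7.33, 50.0):
    # As delta grows, the ratio tends to 1: the curve flattens after n = 1.
    print(delta, round(path_length_ratio(10, delta), 3))
```

A ratio close to 1 means nearly equal path lengths across tips, the situation in which the directional model can offer little improvement in fit.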
Tree Shape
The shape of the tree, at least as revealed by the three
measures we employed, did not influence the probability
of finding a significant association between path lengths
and nodes. The $r^2$ values relating the likelihood ratio to
the $I_c$, $B_1$, and steminess scores were 0.008, 0.001, and
0.015, respectively. This may reflect that randomly generated trees of size n = 50 tend to be relatively homogeneous. Colless' $I_c$ statistic, for example, varies between 0
(perfectly balanced tree) and 1 (pectinate or ladder tree).
In our sample, the mean $I_c$ was 0.12 ± 0.03; most trees
were relatively balanced.
In the limits, a perfectly balanced tree cannot suffer
from the node-density artifact because all paths from the
root to the tips traverse the same number of nodes. At
the other extreme, a pectinate tree has the potential to
show a large effect. However, the same simulated topology often gave qualitatively different results in our study,
depending upon the parameters used to generate the
data. Figure 4 shows a single simulated tree with an $I_c$
score of 0.15. The tree has seven independent clades in
which node density varies in a pectinate-like manner. It
returned the highest LR statistic we observed for data
simulated with an $\alpha$-shape parameter of 0.05 and a root-to-tip tree length of 2.16. With an $\alpha$-shape parameter of
3.36 and a length of 0.71, the same tree returned one of
the lowest observed LRs. The node-density artifact is not
confined to highly imbalanced or poorly sampled trees
but can arise whenever the true amounts of change are
underestimated.
DISCUSSION
Using the $\delta > 1$ criterion in conjunction with a significant regression of path lengths on nodes, the delta test
correctly identified the node-density artifact in 94.4% of
the simulated data sets in which it was present. When
the artifact was absent, the test had a type I error rate of
about 1.7%. This makes it a useful statistic for identifying
cases in which inferred branch lengths may suffer from
the systematic bias to which Fitch and Bruschi (1987) and
Fitch and Beintema (1990) first called attention. It can be
used as a general phylogenetic diagnostic tool, and for
other cases in which it is important first to rule out the
artifact, such as reconstructing ancestral states or calculating molecular clocks. Out of historical interest, we applied the delta test to the Fitch and Bruschi and Fitch
and Beintema trees. Both return significant relationships
between nodes and path lengths, and both have $\delta$ estimated to be greater than 1 (Fitch and Bruschi's tree: LR = 25.99 and $\delta = 1.54$; Fitch and Beintema's tree: LR = 8.33
and $\delta = 1.66$).
FIGURE 3. The ML estimate of $\delta$ and the LR statistic plotted for the 24,336 trees that showed a significant positive association between branch lengths and node (at the P < 0.05 level). The sharp rise in the LR statistic as $\delta$ moves past 1 shows that the signal of the artifact is the curvilinear relationship between nodes and path length.
FIGURE 4. (a) Simulated tree topology that returned one of the weakest relationships observed between nodes and path lengths in the 50,000 data sets under one set of simulation parameters (topology with branch lengths shown in (b), $\alpha$-shape parameter = 3.36, root-to-tip length = 0.71), LR = 0.01, and the strongest association observed under another set (topology with branch lengths shown in (c), $\alpha$-shape parameter = 0.05, root-to-tip length = 2.16), LR = 102.53.
Webster et al. (2003) introduced the delta test in their
study of speciation rates affecting rates of molecular evolution. These authors analysed whether higher speciation rates, as evidenced by a larger number of internal
nodes along a path, were associated with greater
amounts of overall genetic change. The delta test was
used to identify trees in which an apparent relationship
between rates of speciation and path lengths could have
arisen as a result of the node-density effect. After removing trees with significant regressions and $\delta > 1$, these
authors found evidence for higher rates of molecular
evolution linked to speciation in 34.8% of the trees that
remained (this figure is 28.2% when nodes are counted
as in this paper, see Methods).
Commenting on the Webster et al. study, Witt and
Brumfield (2004) suggested that $\delta < 1$ is compatible with
the artifact and cited the Fitch and Bruschi (1987) tree
as an example. Mathematically $\delta < 1$ is not compatible
with the artifact (see Webster et al., 2004, in reply), and
our simulations support this: when the node-density artifact is present, values of $\delta < 1$ arise only around 5% of
the time, and then as a result of chance variation. Had
Witt and Brumfield analyzed the Fitch and Bruschi tree,
they would have discovered (see above) that it reveals
the predicted $\delta > 1$, despite appearing to produce a linear relationship between path lengths and nodes. This
emphasizes the importance of applying phylogenetically
based statistics to this problem.
Some authors have suggested that maximum likelihood inference is robust to the node-density effect
because it uses a substitutional model of evolution
(Bromham, 2003; Bromham and Penny, 2003; Bromham
et al., 2002). Maximum likelihood methods are expected
to perform far better than parsimony methods in reconstructing change along branches by allowing multiple
changes, whereas parsimony can only "see" at most one.
But as our results here and others (e.g., Zharkikh, 1994)
have shown, even likelihood methods will underestimate the true amount of change, especially when the
wrong model of sequence evolution is used to analyze the
data. Yang (1994, 1996) and Pagel and Meade (2004, 2005)
note that tree lengths often increase when more realistic
models of sequence evolution are applied. Better fitting
models of sequence evolution should reduce the strength
of any observed relationship between nodes and path
lengths, and this could be easily assessed by comparing
the $\delta$ values for trees inferred from different models.
Molecular sequence data are likely to harbor complex
signals of their evolutionary history. Detecting, characterizing, and interpreting these signals using statistical
methods is a powerful way to reconstruct the past (Pagel,
1997, 1999). The results we report here show that it is
possible to detect phylogenies that display an artifact of
phylogeny reconstruction that can bias inferences about
such historical evolutionary events.
ACKNOWLEDGEMENTS
This work was supported by BBSRC G19848 and a BBSRC
Studentship to C.V. Tom Kirkman kindly modified his computer
program to implement the Kolmogorov-Smirnov test and calculate
the cumulative distribution frequency.
REFERENCES
Agapow, P. M., and A. Purvis. 2002. Power of eight tree shape statistics
to detect nonrandom diversification: A comparison by simulation of
two models of cladogenesis. Syst. Biol. 51:866-872.
Bromham, L. 2003. Molecular clocks and explosive radiations. J. Mol.
Evol. 57 (Suppl 1):S13-S20.
Bromham, L., and D. Penny. 2003. The modern molecular clock. Nat.
Rev. Genet. 4:216-224.
Bromham, L., M. Woolfit, M. S. Lee, and A. Rambaut. 2002. Testing the
relationship between morphological and molecular rates of change
along phylogenies. Evol. Int. J. Org. Evol. 56:1921-1930.
Fitch, W. M., and J. J. Beintema. 1990. Correcting parsimonious trees
for unseen nucleotide substitutions: The effect of dense branching as
exemplified by ribonuclease. Mol. Biol. Evol. 7:438-443.
Fitch, W. M., and M. Bruschi. 1987. The evolution of prokaryotic
ferredoxins—with a general method correcting for unobserved substitutions in less branched lineages. Mol. Biol. Evol. 4:381-394.
Pagel, M. 1997. Inferring evolutionary processes from phylogenies.
Zool. Scripta 26:331-348.
Pagel, M. 1999. Inferring the historical patterns of biological evolution.
Nature 401:877-884.
Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for
detecting pattern-heterogeneity in gene sequence or character-state
data. Syst. Biol. 53:571-581.
Pagel, M., and A. Meade. 2005. Mixture models in phylogenetic inference. Pages 121-139 in Mathematics of evolution and phylogeny (O.
Gascuel, ed.). Oxford University Press, New York.
Rambaut, A. 2002. PhyloGen: Phylogenetic tree simulator package, version 1.1. Department of Zoology, University of Oxford.
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: An application for the
Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.
Rohlf, F. J., W. S. Chang, R. R. Sokal, and J. Y. Kim. 1990. Accuracy
of estimated phylogenies: Effects of tree topology and evolutionary
model. Evolution 44:1671-1684.
Swofford, D. L. 2001. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer Associates,
Sunderland, Massachusetts.
Webster, A. J., R. J. Payne, and M. Pagel. 2003. Molecular phylogenies
link rates of evolution and speciation. Science 301:478.
Webster, A. J., R. J. Payne, and M. Pagel. 2004. Response to comments
on "Molecular phylogenies link rates of evolution and speciation."
Science 303:173d-174d.
Witt, C. C., and R. T. Brumfield. 2004. Comment on "Molecular phylogenies link rates of evolution and speciation" (I). Science 303:173;
author reply 173.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from
DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306-314.
Yang, Z. 1996. Among-site rate variation and its impact on phylogenetic
analyses. Trends Ecol. Evol. 11:367-372.
Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329.
First submitted 2 September 2005; reviews returned 11 November 2005; final acceptance 11 January 2006
Associate Editor: Thomas Buckley
APPENDIX 1.
ESTIMATING $\delta$ FROM PATH LENGTH AND NODES DATA
Webster et al. (2003) fit a curve of the form $n = \beta x^{\delta}$ to detect the node-density artifact, where $n$ is the number of nodes, $x$ is the phylogenetic
path length, $\beta$ describes the rate of change between path length and
the number of nodes, and $\delta$ captures any curvature. This is algebraically
equivalent to $x = \beta^{*} n^{1/\delta}$, where $\beta^{*} = \beta^{-1/\delta}$, and we expect $\delta > 1$ when
the artifact is present. When the data do not suffer from the artifact,
there can still be a relationship between path lengths and nodes such
that $\beta^{*} > 0$, but $\delta < 1$.
In practice $\beta^{*}$ and $\delta$ can in some cases be tricky to estimate owing to
vagaries of path length and nodes data. It is essential that the parameters are estimated controlling for the phylogenetic relationships among
taxa (see Webster et al., 2003, Supplementary Information). In addition,
we find that the parameters are normally more accurately estimated
from $n = \beta x^{\delta}$ than from the equivalent regression of path length on
nodes. This is especially true when $\delta$ as estimated from $n = \beta x^{\delta}$ is less
than 1.0. But there are exceptions and it is easiest to see from examples
of nodes versus reconstructed path lengths how some of the estimation
problems arise. Fortunately all of them can be resolved from viewing
a plot of the data.
True $\delta > 1$
Figure A1a and b plot the same phylogenetic data first as nodes
versus path lengths and second as path lengths versus nodes. Estimating $\delta$ from $n = \beta x^{\delta}$ (Fig. A1a) yields $\delta = 3.25$, and estimating it from
$x = \beta^{*} n^{1/\delta}$ (Fig. A1b) yields $\delta = 3.27$. The estimated regression lines are
drawn through the data. In general, when $\delta > 1$, it makes little difference which equation is used to estimate it (but see below for $\delta \gg 1$).
Nevertheless, we prefer the $n = \beta x^{\delta}$ equation on the assumption that
in real data, path lengths will tend to be better estimated than the number of nodes (representing speciation events), and it is well known that
regression models underestimate parameters to the extent that there is
error in the independent variable.
True $\delta \gg 1$
An exception to the rule of using $n = \beta x^{\delta}$ can arise for trees that produce a large $\delta$. This can occur in short trees or trees with little rate heterogeneity. Figure A2a plots nodes versus reconstructed path lengths for
a tree of 50 tips. The relationship is curvilinear with $\delta \gg 1$. Estimating
$\delta$ from $n = \beta x^{\delta}$ yields the starkly incorrect line shown, with $\delta = 0.74$.
Ironically, the parameter is poorly estimated because all of the path
lengths are reasonably well reconstructed, producing a nearly vertical
array of points. As a result, outliers can have large vertical deviations
from the correct curve, which here is estimated to be $\delta = 10.76$. When
this occurs it is often the case that the maximum likelihood estimator
is a downwards curving line, such as the one obtained for these data,
because it has smaller vertical deviations on average than the "correct"
line. In this case the problem is apparent by inspection and can be resolved either by fitting by eye, or by estimating $\delta$ from $x = \beta^{*} n^{1/\delta}$ as in
Figure A2b.
True $\delta < 1$
We do not expect a true $\delta < 1$ in data with the node-density artifact,
but $\delta < 1$ can arise when the artifact is absent. When the true $\delta$ is less
than 1, it may be estimated poorly from $x = \beta^{*} n^{1/\delta}$, even giving the impression that the artifact is present (i.e., $\delta > 1$). Figure A3a plots data
from a real phylogeny for which by inspection it can be seen that $\delta < 1$.
Estimating $\delta$ from $n = \beta x^{\delta}$ yields $\delta = 0.73$ ($r^2 = 0.20$), and the regression line plausibly captures the curvature. Estimating $\delta$ from Figure
A3b according to $x = \beta^{*} n^{1/\delta}$ yields $\delta = 1.29$ ($r^2 = 0.09$), and the regression line fails to capture the curvature in the data. The $r^2$ values differ
because the two equations presume different variance-covariance matrices in the generalized least-squares regression (see Webster et al.,
2003, Supplementary Information).
The second fitting procedure returns a worse log-likelihood and
fails in this case because of an unusual feature of nodes data. Node
numbers vary in discrete jumps, and most trees will have a range of
path lengths for the same number of nodes. These two features cause
the discretely spaced vertical stacks of data in Figure A3b. As with the
previous example, an upwards curving line drawn through such data
can have long vertical deviations from the points, and this tendency
becomes more prominent the steeper the line. When this occurs, it is
often the case that the maximum likelihood estimator is a downwards
curving line, such as the one obtained for these data, and for the same
reasons as given above. Estimating $\delta$ from $n = \beta x^{\delta}$ avoids this problem.
It also uses path lengths on the x-axis and these are likely to be better
estimated than numbers of nodes.
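The appendix's remark that regression underestimates parameters when the independent variable is measured with error can be illustrated with a small simulation. This is a minimal sketch using ordinary least squares on made-up data, not the phylogenetic GLS procedure used in the study; the sample size, noise levels, and function name are arbitrary choices for illustration.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(42)
true_slope = 2.0
x_true = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y = [true_slope * x + rng.gauss(0.0, 0.1) for x in x_true]
# Add measurement error with the same variance as the predictor itself.
x_noisy = [x + rng.gauss(0.0, 1.0) for x in x_true]

print(round(ols_slope(x_true, y), 2))   # close to the true slope
print(round(ols_slope(x_noisy, y), 2))  # attenuated toward zero
```

With error variance equal to the predictor variance, the estimated slope shrinks to roughly half its true value, which is why it helps to place the better-measured quantity on the x-axis.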
FIGURE A1. Phylogenetic information taken from a single tree of 50 tips with branch lengths inferred from simulated artifact data (see Methods). (a) Data plotted as nodes versus path lengths, with $\delta$ estimated from $n = \beta x^{\delta}$ ($\delta = 3.25$). (b) Data plotted as path lengths versus nodes, with $\delta$ estimated from $x = \beta^{*} n^{1/\delta}$ ($\delta = 3.27$). The corresponding regression line is drawn through the data.
FIGURE A2. Number of nodes and inferred total path lengths for a single tree with 50 tips derived from simulated artifact data (see Methods). (a) $\delta$ was estimated from $n = \beta x^{\delta}$ ($\delta = 0.73$); the regression line shows that the parameter was poorly estimated in this case. (b) $\delta$ was estimated from $x = \beta^{*} n^{1/\delta}$ ($\delta = 10.76$); the regression line shows that this is the better estimate.
FIGURE A3. True $\delta < 1$. Phylogenetic information for a tree of 147 tips (inferred from real data). (a) $\delta$ was estimated from $n = \beta x^{\delta}$ ($\delta = 0.73$) and the regression line plausibly captures the curvature ($r^2 = 0.20$). (b) $\delta$ was estimated from $x = \beta^{*} n^{1/\delta}$ ($\delta = 1.29$); the regression line plotted in that panel fails to capture the curvature of the data ($r^2 = 0.09$).