Syst. Biol. 55(4):637-643, 2006
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150600865567

Detecting the Node-Density Artifact in Phylogeny Reconstruction

CHRIS VENDITTI, ANDREW MEADE, AND MARK PAGEL

School of Biological Sciences, University of Reading, Whiteknights, Reading RG6 6AJ, England; E-mail: [email protected] (M.P.)

Abstract. The node-density effect is an artifact of phylogeny reconstruction that can cause branch lengths to be underestimated in areas of the tree with fewer taxa. Webster, Payne, and Pagel (2003, Science 301:478) introduced a statistical procedure (the "delta" test) to detect this artifact, and here we report the results of computer simulations that examine the test's performance. In a sample of 50,000 random data sets, we find that the delta test detects the artifact in 94.4% of cases in which it is present. When the artifact is not present (n = 10,000 simulated data sets) the test showed a type I error rate of approximately 1.69%, incorrectly reporting the artifact in 169 data sets. Three measures of tree shape or "balance" failed to predict the size of the node-density effect. This may reflect the relative homogeneity of our randomly generated topologies, but emphasizes that nearly any topology can suffer from the artifact, the effect not being confined only to highly unevenly sampled or otherwise imbalanced trees. The ability to screen phylogenies for the node-density artifact is important for phylogenetic inference and for researchers using phylogenetic trees to infer evolutionary processes, including their use in molecular clock dating. [Delta test; molecular clock; molecular evolution; node-density effect; phylogenetic reconstruction; speciation; simulation.]

Fitch and Bruschi (1987) and Fitch and Beintema (1990) identified an artifact of phylogeny reconstruction that has come to be known as the node-density effect.
These authors noted that branch lengths will tend to be better estimated in parts of a tree where more taxa have been sampled. Conversely, where taxon sampling is sparse or the amount of change between successive nodes of the tree is large, phylogenetic reconstruction methods will tend to underestimate the true amount of change. This is because in longer branches of a tree, multiple "hits," or two or more changes at a given site, are common. These multiple hits are mostly invisible and get reconstructed as one change, causing branch lengths to be underestimated. The effect never disappears but will be smaller in shorter branches of the tree, where fewer multiple hits are expected.

Summed over all of the branch lengths of a phylogeny, this artifact can cause an apparent relationship between the number of nodes and the total inferred amount of change. Where there has been more net speciation (more internal nodes of the tree), the true amount of change along each branch is better estimated, giving the appearance that there has been more total evolution along the summed path from the root of the phylogeny to its tips. The effect increases with the number of nodes included along a path until the total path length approaches the true length. This leads to the expectation of a curvilinear relationship between the reconstructed length of a path and the number of nodes along that path. Figure 1a shows a phylogeny in which the artifact is present, and Figure 1b plots the total root-to-tip (species) path lengths against the number of nodes along the path, showing the expected curvilinear trend (see also Fitch and Beintema's [1990] figure 2, reprinted in Page and Holmes [1998, page 169]).

Webster et al. (2003) introduced a statistical test to detect phylogenies that suffer from the node-density artifact.
Those authors fit a curve of the form $n = \beta x^{\delta}$, where $n$ is the number of nodes, $x$ is the phylogenetic path length, $\beta$ describes the rate of change between path length and the number of nodes, and $\delta$ captures any curvature. This is algebraically equivalent to finding a curve of the form $x = \beta^* n^{1/\delta}$, where $\beta^* = \beta^{-1/\delta}$, and we expect $\delta > 1$ when the artifact is present. When the data do not suffer from the artifact, there can still be a relationship between path lengths and nodes such that $\beta^* > 0$, but $\delta < 1$. To test for the artifact, Webster et al. (2003; Supplementary Information) describe a generalized least-squares (GLS) procedure based upon Pagel's (1997, 1999) continuous method. The GLS method assesses the relationship between path lengths and nodes using all of the information in the phylogenetic tree and accounting for phylogenetic relatedness in both measures.

Here we report on the performance of the delta test for detecting the node-density artifact by analyzing simulated gene-sequence data on random phylogenetic trees. Our particular interest is to determine how well the $\delta > 1$ criterion identifies trees suffering from the artifact.

METHODS

Simulation Data

We used PhyloGen (Rambaut, 2002) to simulate 1000 random ultrametric trees of 50 species each. The speciation rate was set to twice that of the extinction parameter (birth = 0.2, death = 0.1, respectively). We then added an artificial outgroup taxon to each tree. This was done to ensure that all the branches leading to the true root were estimated properly (as described below).

For each of the 1000 random topologies we used SeqGen (Rambaut and Grassly, 1997) to generate 50 random gene-sequence data sets of 1000 base pairs. We generated data from the general time-reversible (GTR + $\Gamma_4$) model of sequence evolution, choosing the values of the rate parameters in the GTR matrix at random for each data set from the uniform interval between 0 and 20, with the exception of the G -> T rate, which was always 1.
All base frequencies were assumed to be 0.25. We chose the value of the gamma shape parameter on the uniform interval 0 to 4 and varied the tree length by randomly choosing the root-to-tip distance (substitutions per site) between 0.2 and 2.2 each time an alignment was simulated. This gave us 50,000 data sets in which, because the trees are ultrametric, there is no relationship between the number of nodes along a path and the path length.

We estimated the phylogenetic branch lengths for each of the 50,000 data sets using PAUP* 4.0b10 (Swofford, 2001) and giving it the correct topology. Although the simulated trees were rooted, all branch lengths were estimated on unrooted trees. The artificial outgroup taxon was included and used to estimate where the true root of the tree should be placed along the basal branch. If branch lengths are estimated on rooted trees, maximum likelihood will correctly estimate the total length of the branch leading from the outgroup to the ingroup taxa, but it does not know where to place the root along this branch. If the root is placed such that the arbitrary length of the segment leading from the root to the outgroup is short, this can falsely give the impression of the node-density artifact.

Although PAUP was given the true topology, we used a GTR model of evolution without gamma to infer the branch lengths in each of the 50,000 data sets. The simple GTR model will fail to capture the exact nature of the evolutionary process that gave rise to the data and is therefore expected to produce the node-density effect to varying degrees (see also Zharkikh, 1994).

We generated a further 10,000 data sets of 1000 base pairs using 10 replicates each of the same 1000 simulated trees, and the same range of parameters. For each of these data sets we estimated the branch lengths in PAUP but using a GTR + $\Gamma_4$ model.
Inferring the branch lengths with the same model under which the data were simulated means that the evolutionary process that gave rise to the data will be well approximated, and we do not expect the node-density artifact to be present.

Node-Density Analyses

We removed the artificial outgroup taxon from each data set and rooted the trees at the point the outgroup taxon had identified. Then, for each tree we first tested for a relationship between the reconstructed path length, calculated as the sum of the branches from the root to the tip for each species, and the number of internal nodes along that path, starting at zero for the root and not counting the tip at the end of the path as an additional node. (Webster et al., 2003, count species as additional nodes, meaning that the values reported in their figure 1 would differ from ours by one. We prefer the present method of counting nodes as it corresponds to speciation events on the tree; see also Discussion.)

The relationship between nodes and path lengths is tested by means of a likelihood-ratio (LR) statistic comparing the likelihood of a random-walk model to a directional random-walk model (Pagel, 1997, 1999; Webster et al., 2003; Supplementary Information). The models differ by the parameter $\beta^*$ as described above in the equation for $x$, where $\beta^*$ measures the regression of path length on nodes. We expect $\beta^* = 0$ when no artifact is present. If the artifact is present in the data, we expect $\beta^* > 0$ and that the directional model will provide a better fit. Twice the difference in log-likelihoods (the LR) is assessed by a $\chi^2_1$ distribution.

Because the true trees are ultrametric, a significant association between path length and nodes is evidence, apart from chance effects, for the node-density artifact. In real data the nature of the true tree is not known, and a relationship between the number of nodes and path length could arise for reasons other than the artifact (see, for example, Webster et al., 2003).
However, the artifact can be distinguished from other causes by the nature of the relationship it produces between path lengths and nodes. In particular, the delta test asserts that when a significant association has been caused by the artifact, we expect the parameter $\delta$ to be greater than 1. For each significant directional model ($\beta^*$ significantly > 0), we therefore also separately estimated $\delta$ and recorded its value (the test makes no predictions about $\delta$ when the artifact is not present). In practice we find that $\beta^*$ and $\delta$ are more accurately estimated from $n = \beta x^{\delta}$ than from the equivalent regression of path length on nodes (see Appendix), and all of our analyses used this form of the equation.

We took any numerical value of $\delta > 1$ in conjunction with a significant directional model to be evidence of the node-density effect. The performance of the delta test is measured by the proportion of the simulated data sets with significant associations between nodes and path length in which the parameter $\delta$ is greater than 1. Software to implement the test is available from www.evolution.reading.ac.uk or www.ams.rdg.ac.uk/zoology/pagel.

Distributional Statistics

Using the methods described above we derived for each data set a likelihood-ratio statistic comparing the directional to the random-walk model; this is the test of $\beta^*$. Under the null hypothesis of no artifact, we expect the cumulative density of LR values to conform to a $\chi^2_1$ density. We compared distributions of the LR statistics to these expected $\chi^2_1$ densities using the Kolmogorov-Smirnov (K-S) D statistic.

Tree Shape

To examine whether the shape of the simulated trees influenced the probability of obtaining an artifact, we calculated three measures of tree shape for each tree using the computer program MeSA (Agapow and Purvis, 2002): Colless' (1982) index $I_c$, a measure of tree imbalance; Shao and Sokal's (1990) $B_1$ index, a measure of tree balance; and Rohlf et al.'s (1990) noncumulative steminess index.
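The decision rule described under Node-Density Analyses can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the function and argument names are hypothetical, the log-likelihoods and the maximum likelihood estimates of $\beta^*$ and $\delta$ are assumed to have been obtained elsewhere (for example from the GLS fit), and the log-likelihood values in the example are invented so that the LR matches the 7.38 reported for Figure 1.

```python
import math

# 95th percentile of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_CRIT_05 = 3.841458820694124


def chi2_1df_sf(x):
    """Upper-tail probability of a chi-squared variate with 1 d.f.

    With 1 d.f. the variate is Z**2 for standard normal Z, so
    P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


def delta_test(lnl_random_walk, lnl_directional, beta_star, delta, alpha=0.05):
    """Apply the delta-test criterion to already-fitted models (hypothetical helper).

    The artifact is flagged only when the directional model fits
    significantly better, the slope beta_star is positive, and delta > 1.
    """
    lr = 2.0 * (lnl_directional - lnl_random_walk)
    p_value = chi2_1df_sf(lr)
    significant = p_value < alpha
    artifact = significant and beta_star > 0 and delta > 1
    return {"LR": lr, "p": p_value, "significant": significant, "artifact": artifact}


# Illustrative log-likelihoods (invented) chosen to give LR = 7.38;
# beta_star and delta are the Figure 1 estimates.
example = delta_test(-103.69, -100.0, beta_star=0.13, delta=7.33)
print(example["LR"] > CHI2_1DF_CRIT_05, example["artifact"])
```

A significant positive association with an estimated $\delta$ below 1 would return `artifact` as false under this rule, which mirrors the distinction the test draws between the artifact and other causes of a nodes and path length relationship.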
RESULTS

The tree in Figure 1 shows the artifact. The LR test of the directional model returns a significant LR of 7.38, the slope $\beta^*$ is estimated to be 0.13, and $\delta = 7.33$ (all values estimated by maximum likelihood).

FIGURE 1. (a) A tree that displays the node-density artifact. (b) Plot of the total path length from root to tip against the number of nodes for each taxon in (a), showing the curvilinear trend associated with the node-density artifact. The directional random-walk model fits these data significantly better than the random-walk model (LR = 7.38; $\beta^* = 0.13$). The parameter $\delta$ is estimated to be 7.33 (see text). Therefore, the solid line in (b) is of the form $x = 0.13\,n^{1/7.33}$, where $x$ is the total path length and $n$ is the number of nodes (see text).

Manipulating the Presence/Absence of the Artifact

We expect to see the artifact at much higher than chance levels in the 50,000-tree data set (hereafter, artifact data), but not in the 10,000-tree data set (hereafter, nonartifact data). Figure 2a plots the cumulative distribution of the 10,000 observed LR values for the nonartifact data along with the cumulative distribution of a true $\chi^2_1$. The two lines fall on top of each other, and the K-S test confirms that the observed cumulative density does not depart from the expected $\chi^2_1$ distribution (D = 0.09559, P = 0.3189): analyzing the simulated data with the model that generated it returns accurately estimated branch lengths.

On the other hand, Figure 2b shows that the distribution of the LR statistics resulting from the artifact data set of 50,000 trees is considerably skewed to the right of the expected $\chi^2_1$ distribution. This indicates more large LR scores than expected, and the distribution returns a significant K-S test (D = 0.4735, P < 0.0001).
Inferring the branch lengths on ultrametric trees using the "wrong" model of sequence evolution gives rise to the node-density artifact.

FIGURE 2. (a) A plot to compare the cumulative distribution frequency for the $\chi^2_1$ distribution (grey line) with that of the LR statistics derived from the 10,000 trees in which the branch lengths were estimated using the GTR + $\Gamma_4$ model (black line): the two lines fall directly on top of each other and the K-S test is not significant (D = 0.09559, P = 0.3189). (b) Compares the same $\chi^2_1$ distribution (grey line) with the cumulative probability distribution of the LR statistics derived from the 50,000 trees in which only the GTR model was used to estimate the branch lengths (black line). The distribution of LRs is significantly skewed to the right of the $\chi^2_1$ distribution, indicating more large LR scores than expected, and the K-S test is significant (D = 0.4735, P < 0.0001). Horizontal axis in both panels: likelihood ratio.

Detecting the Node-Density Artifact

In the artifact data, 48.67% (n = 24,336) of the simulated data sets showed a significant and positive association between total path length and the number of nodes. The artifact, as measured by the size of the LR statistic, was more likely to arise in trees with greater rate heterogeneity, as indicated by the $\alpha$-shape parameter of the gamma distribution (r = -0.6024, P < 0.0001), and somewhat more likely to arise in longer trees (r = 0.272, P < 0.0001). These results are expected: in shorter trees and in trees with minimal rate heterogeneity, the inferred branch lengths capture all or nearly all of the true changes in the data, and the artifact is negligible or not present.

TABLE 1. The number of significant positive associations in the artifact data set, and the number of these that had an estimate of $\delta$ greater than 1.
Sample size: 50,000
Number of trees that showed a significant positive association between nodes and total path length: 24,336 (48.7%)
ML estimate of $\delta > 1$ in cases where there was a significant positive association between nodes and total path length: 22,983 (94.4%)

The delta test expects that data sets displaying the artifact will return values of $\delta > 1$. In 94.4% of the 24,336 data sets with significant LR statistics, the maximum likelihood estimate of $\delta$ exceeded 1 (Table 1). Thus the delta test correctly identifies cases of the artifact at a high rate.

By comparison, only the expected 5% (5.12%) of the 10,000 nonartifact data sets showed a significant association between nodes and path length. Fewer than half of these (1.95% of the total) showed the positive association expected of the artifact. Of this 1.95%, about 87% return an estimate of $\delta$ greater than 1. This means that the delta test has a type I error rate of about 1.7% in these data.

Figure 3 shows the LR statistic plotted against the estimate of $\delta$ for each of the 24,336 artifact data sets with significant positive associations between path length and nodes. As the estimate of $\delta$ moves past 1, the LR statistic increases sharply. Because $\delta$ measures the curvature of the relationship, this plot emphasizes that when the node-density artifact is present (LR > 3.84), the expected curvilinear relationship between path length and nodes arises, such as in Figure 1. The opposite point also holds: values of $\delta < 1$ are not expected when the artifact is present, and Figure 3 confirms this with only 5.6% of the estimated $\delta$ values less than 1.0.

The decline in LR values for larger values of $\delta$ probably arises from trees with a small variance in total path lengths across the tips. Consider in Figure 1b if there were very little difference among species in total path lengths.
In the limit, if all species have the same path length, the plot will produce a horizontal line. As this limit is approached the directional model offers less and less improvement on the nondirectional model, eventually declining to zero. At the same time, as the limit is approached, the $x = \beta^* n^{1/\delta}$ curve is required to turn an increasingly sharp corner, requiring higher values of $\delta$. In support of this conjecture, we find that for the 24,336 results plotted in Figure 3, the correlation between the variance in path lengths and LR is 0.48 (P < 0.0001).

Tree Shape

The shape of the tree, at least as revealed by the three measures we employed, did not influence the probability of finding a significant association between path lengths and nodes. The $r^2$ values relating the likelihood ratio to the $I_c$, $B_1$, and steminess scores were 0.008, 0.001, and 0.015, respectively. This may reflect that randomly generated trees of size n = 50 tend to be relatively homogeneous. Colless' $I_c$ statistic, for example, varies between 0 (perfectly balanced tree) and 1 (pectinate or ladder tree). In our sample, the mean $I_c$ was 0.12 ± 0.03; most trees were relatively balanced.

In the limits, a perfectly balanced tree cannot suffer from the node-density artifact because all paths from the root to the tips traverse the same number of nodes. At the other extreme, a pectinate tree has the potential to show a large effect. However, the same simulated topology often gave qualitatively different results in our study, depending upon the parameters used to generate the data. Figure 4 shows a single simulated tree with an $I_c$ score of 0.15. The tree has seven independent clades in which node density varies in a pectinate-like manner. It returned the highest LR statistic we observed for data simulated with an $\alpha$-shape parameter of 0.05 and a root-to-tip tree length of 2.16. With an $\alpha$-shape parameter of 3.36 and a length of 0.71, the same tree returned one of the lowest observed LRs.
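For readers unfamiliar with the 0-to-1 scale quoted for Colless' index, the following Python sketch computes a normalized imbalance score for a small rooted binary tree. It is an assumption-laden illustration rather than MeSA's implementation: the nested-tuple tree representation and the function name are hypothetical, and the normalization by $(n-1)(n-2)/2$ is the one that yields 0 for a perfectly balanced tree and 1 for a pectinate tree, as described in the text.

```python
def colless_index(tree):
    """Normalized Colless imbalance for a rooted binary tree.

    The tree is a nested 2-tuple (left, right); anything that is not a
    tuple is treated as a tip. At every internal node the absolute
    difference in tip counts between the two daughter clades is summed,
    then divided by the maximum possible sum, (n - 1)(n - 2) / 2.
    """
    total_imbalance = 0

    def count_tips(node):
        nonlocal total_imbalance
        if not isinstance(node, tuple):
            return 1  # a tip
        left, right = node
        n_left = count_tips(left)
        n_right = count_tips(right)
        total_imbalance += abs(n_left - n_right)
        return n_left + n_right

    n_tips = count_tips(tree)
    if n_tips < 3:
        raise ValueError("the normalized index needs at least 3 tips")
    return total_imbalance / ((n_tips - 1) * (n_tips - 2) / 2)


# Hypothetical example trees: a 5-tip ladder and a 4-tip balanced tree.
pectinate = (((("A", "B"), "C"), "D"), "E")
balanced = (("A", "B"), ("C", "D"))
print(colless_index(pectinate), colless_index(balanced))
```

On the balanced example every root-to-tip path crosses the same number of nodes, which is the limiting case in which the node-density artifact cannot appear.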
The node-density artifact is not confined to highly imbalanced or poorly sampled trees but can arise whenever the true amounts of change are underestimated.

DISCUSSION

Using the $\delta > 1$ criterion in conjunction with a significant regression of path lengths on nodes, the delta test correctly identified the node-density artifact in 94.4% of the simulated data sets in which it was present. When the artifact was absent, the test had a type I error rate of about 1.7%. This makes it a useful statistic for identifying cases in which inferred branch lengths may suffer from the systematic bias to which Fitch and Bruschi (1987) and Fitch and Beintema (1990) first called attention. It can be used as a general phylogenetic diagnostic tool, and for other cases in which it is important first to rule out the artifact, such as reconstructing ancestral states or calculating molecular clocks. Out of historical interest, we applied the delta test to the Fitch and Bruschi and Fitch and Beintema trees. Both return significant relationships between nodes and path lengths, and both have $\delta$ estimated to be greater than 1 (Fitch and Bruschi's tree LR = 25.99 and $\delta = 1.54$, Fitch and Beintema's tree LR = 8.33 and $\delta = 1.66$).

FIGURE 3. The ML estimate of $\delta$ and the LR statistic plotted for the 24,336 trees that showed a significant positive association between branch lengths and nodes (at the P < 0.05 level). The sharp rise in the LR statistic as $\delta$ moves past 1 shows that the signal of the artifact is the curvilinear relationship between nodes and path length. Horizontal axis: ML estimate of $\delta$.

FIGURE 4. (a) Simulated tree topology that returned one of the weakest relationships observed between nodes and path lengths in the 50,000 data sets under one set of simulation parameters (topology with branch lengths shown in (b), $\alpha$-shape parameter = 3.36, root-to-tip length = 0.71), LR = 0.01, and the strongest association observed under another set (topology with branch lengths shown in (c), $\alpha$-shape parameter = 0.05, root-to-tip length = 2.16), LR = 102.53.

Webster et al. (2003) introduced the delta test in their study of speciation rates affecting rates of molecular evolution. These authors analysed whether higher speciation rates, as evidenced by a larger number of internal nodes along a path, were associated with greater amounts of overall genetic change. The delta test was used to identify trees in which an apparent relationship between rates of speciation and path lengths could have arisen as a result of the node-density effect. After removing trees with significant regressions and $\delta > 1$, these authors found evidence for higher rates of molecular evolution linked to speciation in 34.8% of the trees that remained (this figure is 28.2% when nodes are counted as in this paper, see Methods).

Commenting on the Webster et al. study, Witt and Brumfield (2004) suggested that $\delta < 1$ is compatible with the artifact and cited the Fitch and Bruschi (1987) tree as an example. Mathematically $\delta < 1$ is not compatible with the artifact (see Webster et al., 2004, in reply), and our simulations support this: when the node-density artifact is present, values of $\delta < 1$ arise only around 5% of the time, and then as a result of chance variation. Had Witt and Brumfield analyzed the Fitch and Bruschi tree, they would have discovered (see above) that it reveals the predicted $\delta > 1$, despite appearing to produce a linear relationship between path lengths and nodes. This emphasizes the importance of applying phylogenetically based statistics to this problem.

Some authors have suggested that maximum likelihood inference is robust to the node-density effect because it uses a substitutional model of evolution (Bromham, 2003; Bromham and Penny, 2003; Bromham et al., 2002).
Maximum likelihood methods are expected to perform far better than parsimony methods in reconstructing change along branches by allowing multiple changes, whereas parsimony can only "see" at most one. But as our results here and others (e.g., Zharkikh, 1994) have shown, even likelihood methods will underestimate the true amount of change, especially when the wrong model of sequence evolution is used to analyze the data. Yang (1994, 1996) and Pagel and Meade (2004, 2005) note that tree lengths often increase when more realistic models of sequence evolution are applied. Better fitting models of sequence evolution should reduce the strength of any observed relationship between nodes and path lengths, and this could be easily assessed by comparing the $\delta$ values for trees inferred from different models.

Molecular sequence data are likely to harbor complex signals of their evolutionary history. Detecting, characterizing, and interpreting these signals using statistical methods is a powerful way to reconstruct the past (Pagel, 1997, 1999). The results we report here show that it is possible to detect phylogenies that display an artifact of phylogeny reconstruction that can bias inferences about such historical evolutionary events.

ACKNOWLEDGEMENTS

This work was supported by BBSRC G19848 and a BBSRC Studentship to C.V. Tom Kirkman kindly modified his computer program to implement the Kolmogorov-Smirnov test and calculate the cumulative distribution frequency.

REFERENCES

Agapow, P. M., and A. Purvis. 2002. Power of eight tree shape statistics to detect nonrandom diversification: A comparison by simulation of two models of cladogenesis. Syst. Biol. 51:866-872.
Bromham, L. 2003. Molecular clocks and explosive radiations. J. Mol. Evol. 57(Suppl 1):S13-S20.
Bromham, L., and D. Penny. 2003. The modern molecular clock. Nat. Rev. Genet. 4:216-224.
Bromham, L., M. Woolfit, M. S. Lee, and A. Rambaut. 2002. Testing the relationship between morphological and molecular rates of change along phylogenies. Evol. Int. J. Org. Evol. 56:1921-1930.
Fitch, W. M., and J. J. Beintema. 1990. Correcting parsimonious trees for unseen nucleotide substitutions: The effect of dense branching as exemplified by ribonuclease. Mol. Biol. Evol. 7:438-443.
Fitch, W. M., and M. Bruschi. 1987. The evolution of prokaryotic ferredoxins—with a general method correcting for unobserved substitutions in less branched lineages. Mol. Biol. Evol. 4:381-394.
Pagel, M. 1997. Inferring evolutionary processes from phylogenies. Zool. Scripta 26:331-348.
Pagel, M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877-884.
Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571-581.
Pagel, M., and A. Meade. 2005. Mixture models in phylogenetic inference. Pages 121-139 in Mathematics of evolution and phylogeny (O. Gascuel, ed.). Oxford University Press, New York.
Rambaut, A. 2002. PhyloGen: Phylogenetic tree simulator package, version 1.1. Department of Zoology, University of Oxford.
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.
Rohlf, F. J., W. S. Chang, R. R. Sokal, and J. Y. Kim. 1990. Accuracy of estimated phylogenies: Effects of tree topology and evolutionary model. Evolution 44:1671-1684.
Swofford, D. L. 2001. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer Associates, Sunderland, Massachusetts.
Webster, A. J., R. J. Payne, and M. Pagel. 2003. Molecular phylogenies link rates of evolution and speciation. Science 301:478.
Webster, A. J., R. J. Payne, and M. Pagel. 2004. Response to comments on "Molecular phylogenies link rates of evolution and speciation." Science 303:173d-174d.
Witt, C. C., and R. T.
Brumfield. 2004. Comment on "Molecular phylogenies link rates of evolution and speciation" (I). Science 303:173; author reply 173.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306-314.
Yang, Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.
Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329.

First submitted 2 September 2005; reviews returned 11 November 2005; final acceptance 11 January 2006
Associate Editor: Thomas Buckley

APPENDIX 1. ESTIMATING $\delta$ FROM PATH LENGTH AND NODES DATA

Webster et al. (2003) fit a curve of the form $n = \beta x^{\delta}$ to detect the node-density artifact, where $n$ is the number of nodes, $x$ is the phylogenetic path length, $\beta$ describes the rate of change between path length and the number of nodes, and $\delta$ captures any curvature. This is algebraically equivalent to $x = \beta^* n^{1/\delta}$, where $\beta^* = \beta^{-1/\delta}$, and we expect $\delta > 1$ when the artifact is present. When the data do not suffer from the artifact, there can still be a relationship between path lengths and nodes such that $\beta^* > 0$, but $\delta < 1$.

In practice $\beta^*$ and $\delta$ can in some cases be tricky to estimate owing to vagaries of path length and nodes data. It is essential that the parameters are estimated controlling for the phylogenetic relationships among taxa (see Webster et al., 2003; Supplementary Information). In addition, we find that the parameters are normally more accurately estimated from $n = \beta x^{\delta}$ than from the equivalent regression of path length on nodes. This is especially true when $\delta$ as estimated from $n = \beta x^{\delta}$ is less than 1.0. But there are exceptions and it is easiest to see from examples of nodes versus reconstructed path lengths how some of the estimation problems arise. Fortunately all of them can be resolved from viewing a plot of the data.

True $\delta > 1$

Figure A1a and b plot the same phylogenetic data first as nodes versus path lengths and second as path lengths versus nodes. Estimating $\delta$ from $n = \beta x^{\delta}$ (Fig. A1a) yields $\delta = 3.25$, and estimating it from $x = \beta^* n^{1/\delta}$ (Fig. A1b) yields $\delta = 3.27$. The estimated regression lines are drawn through the data. In general, when $\delta > 1$, it makes little difference which equation is used to estimate it (but see below for $\delta \gg 1$). Nevertheless, we prefer the $n = \beta x^{\delta}$ equation on the assumption that in real data, path lengths will tend to be better estimated than the number of nodes (representing speciation events), and it is well known that regression models underestimate parameters to the extent that there is error in the independent variable.

True $\delta \gg 1$

An exception to the rule of using $n = \beta x^{\delta}$ can arise for trees that produce a large $\delta$. This can occur in short trees or trees with little rate heterogeneity. Figure A2a plots nodes versus reconstructed path lengths for a tree of 50 tips. The relationship is curvilinear with $\delta \gg 1$. Estimating $\delta$ from $n = \beta x^{\delta}$ yields the starkly incorrect line shown, with $\delta = 0.74$. Ironically, the parameter is poorly estimated because all of the path lengths are reasonably well reconstructed, producing a nearly vertical array of points. As a result, outliers can have large vertical deviations from the correct curve, which here is estimated to be $\delta = 10.76$. When this occurs it is often the case that the maximum likelihood estimator is a downwards curving line, such as the one obtained for these data, because it has smaller vertical deviations on average than the "correct" line. In this case the problem is apparent by inspection and can be resolved either by fitting by eye, or by estimating $\delta$ from $x = \beta^* n^{1/\delta}$ as in Figure A2b.

True $\delta < 1$

We do not expect a true $\delta < 1$ in data with the node-density artifact, but $\delta < 1$ can arise when the artifact is absent. When the true $\delta$ is less than 1, it may be estimated poorly from $x = \beta^* n^{1/\delta}$, even giving the impression that the artifact is present (i.e., $\delta > 1$). Figure A3a plots data from a real phylogeny for which by inspection it can be seen that $\delta < 1$. Estimating $\delta$ from $n = \beta x^{\delta}$ yields $\delta = 0.73$ ($r^2 = 0.20$), and the regression line plausibly captures the curvature. Estimating $\delta$ from Figure A3b according to $x = \beta^* n^{1/\delta}$ yields $\delta = 1.29$ ($r^2 = 0.09$), and the regression line fails to capture the curvature in the data. The $r^2$ values differ because the two equations presume different variance-covariance matrices in the generalized least-squares regression (see Webster et al., 2003; Supplementary Information).

The second fitting procedure returns a worse log-likelihood and fails in this case because of an unusual feature of nodes data. Node numbers vary in discrete jumps, and most trees will have a range of path lengths for the same number of nodes. These two features cause the discretely spaced vertical stacks of data in Figure A3b. As with the previous example, an upwards curving line drawn through such data can have long vertical deviations from the points, and this tendency becomes more prominent the steeper the line. When this occurs, it is often the case that the maximum likelihood estimator is a downwards curving line, such as the one obtained for these data, and for the same reasons as given above. Estimating $\delta$ from $n = \beta x^{\delta}$ avoids this problem. It also uses path lengths on the x-axis and these are likely to be better estimated than numbers of nodes.

FIGURE A1. Phylogenetic information taken from a single tree of 50 tips with branch lengths inferred from simulated artifact data (see Methods). (a) Data plotted as nodes versus path lengths, with $\delta$ estimated from $n = \beta x^{\delta}$ ($\delta = 3.25$). (b) Data plotted as path lengths versus nodes, with $\delta$ estimated from $x = \beta^* n^{1/\delta}$ ($\delta = 3.27$). The corresponding regression line is drawn through the data. Axes: total path length and number of nodes.

FIGURE A2. Number of nodes and inferred total path lengths for a single tree with 50 tips derived from simulated artifact data (see Methods). (a) $\delta$ was estimated from $n = \beta x^{\delta}$ ($\delta = 0.73$); the regression line shows that the parameter was poorly estimated in this case. (b) $\delta$ was estimated from $x = \beta^* n^{1/\delta}$ ($\delta = 10.76$); the regression line shows that this is the better estimate. Axes: total path length and number of nodes.

FIGURE A3. True $\delta < 1$. Plots the phylogenetic information for a tree of 147 tips (inferred from real data). (a) $\delta$ was estimated from $n = \beta x^{\delta}$ ($\delta = 0.73$) and the regression line plausibly captures the curvature ($r^2 = 0.20$). (b) $\delta$ was estimated from $x = \beta^* n^{1/\delta}$ ($\delta = 1.29$); the regression line plotted in that panel fails to capture the curvature of the data ($r^2 = 0.09$). Axes: total path length and number of nodes.
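The point made in the Appendix, that the two algebraically equivalent forms need not return the same estimate of $\delta$ once the data are noisy, can be illustrated with a small Python sketch. This is a simplification under loud assumptions: it uses ordinary least squares on log-transformed values, not the phylogenetic GLS procedure the Appendix requires, so it ignores relatedness among taxa; the function names, the node counts, and the multiplicative "jitter" values are all hypothetical; and every node count must be at least 1 so that its logarithm exists.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def delta_nodes_on_path(paths, nodes):
    """delta from n = beta * x**delta: slope of log(n) on log(x)."""
    return ols_slope([math.log(x) for x in paths], [math.log(n) for n in nodes])


def delta_path_on_nodes(paths, nodes):
    """delta from x = beta_star * n**(1/delta): reciprocal slope of log(x) on log(n)."""
    return 1.0 / ols_slope([math.log(n) for n in nodes], [math.log(x) for x in paths])


# Noise-free toy data generated from x = 0.13 * n**(1/3), i.e. delta = 3.
nodes = [1, 2, 3, 4, 5, 6, 7, 8]
exact_paths = [0.13 * n ** (1.0 / 3.0) for n in nodes]

# The same data with arbitrary multiplicative scatter (made-up values).
jitter = [1.05, 0.93, 1.08, 0.97, 1.02, 0.90, 1.10, 0.96]
noisy_paths = [x * j for x, j in zip(exact_paths, jitter)]

print(delta_nodes_on_path(exact_paths, nodes), delta_path_on_nodes(exact_paths, nodes))
print(delta_nodes_on_path(noisy_paths, nodes), delta_path_on_nodes(noisy_paths, nodes))
```

Without scatter both forms recover the generating $\delta$. With scatter the two ordinary least-squares estimates separate, because each regression attributes all of the error to a different variable; this is one reason it is worth plotting the data in both orientations before trusting either estimate, as the Appendix recommends.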