Syst. Biol. 54(6):895–899, 2005
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500354696

Nodes in Phylogenetic Trees: The Relation Between Imbalance and Number of Descendent Species

ERIC W. HOLMAN
Department of Psychology, University of California, Los Angeles, California 90095, USA; E-mail: [email protected]

Abstract. The imbalance of a node in a phylogenetic tree can be defined in terms of the relative numbers of species (or higher taxa) on the branches that originate at the node. Empirically, imbalance also turns out to depend on the absolute total number of species on the branches: in a sample of large trees, nodes with more descendent species tend to be more unbalanced. Subsidiary analyses suggest that this pattern is not a result of errors in tree estimation. Instead, the increase in imbalance with species is consistent with a cumulative effect of differences in diversification rates between branches. [Equal-rates Markov model; imbalance; phylogeny shape; proportional-to-distinguishable-arrangements model.]

Since the pioneering work of Savage (1983), a large body of research has been devoted to the question of what inferences about evolution can be drawn from the shape of phylogenetic trees. The property of trees most thoroughly studied is their degree of imbalance, that is, the extent to which some branches lead to many species (or higher taxa) while others lead only to a few. The observed degree of imbalance is typically compared to the imbalance predicted by a null model called the simple birth and death process or the equal-rates Markov model, which assumes that species originate and become extinct at stochastically constant rates on all branches of the tree. In their review of this literature, Mooers and Heard (1997) concluded that most empirical trees are more unbalanced than predicted by the model.
The usual explanation for imbalance is differences among lineages in rates of speciation relative to rates of extinction; these differences in net diversification rates are presumably caused by biological differences among organisms. Slowinski and Guyer (1989) established the Markov model as the appropriate null hypothesis for testing differences of this sort. As Heard and Mooers (2002) pointed out, the surprising empirical result here is that the imbalance of most trees exceeds not only the prediction from the Markov model, but also predictions from biologically plausible differences in net diversification rates. Such differences do not have enough time to produce much imbalance in the small trees typically studied in research on tree shape. Heard and Mooers showed by simulation that occasional mass extinctions can enhance the effect of differences in diversification rates and increase the imbalance of the resulting trees, although whether the predicted imbalance matches that of empirical trees is not clear.

The main alternative explanation for imbalance is errors that cause the estimated trees to deviate from the true phylogenies. Slowinski (1990) pointed out that random errors can be represented by the every-tree-is-equiprobable or proportional-to-distinguishable-arrangements model, which assumes that trees are chosen at random from the set of all possible trees for a given number of species. This model predicts higher levels of imbalance than does the Markov model, and the simulation studies reviewed by Mooers and Heard (1997) show that adding random error to the data does indeed increase the imbalance of trees estimated by cladistic methods. The evidence is mixed on whether such errors account for the imbalance of empirical trees. Guyer and Slowinski (1991) found that the trees in their sample that were supported by the largest number of characters were consistent with the Markov model, whereas trees with less support were more unbalanced. Mooers et al.
(1995) also observed a negative correlation between imbalance and data quality in another sample of trees. Stam (2002), however, used a different measure of imbalance, which more completely compensated for tree size, and found no correlation between imbalance and data quality in a new sample of trees. As a further complication, Scotland and Sanderson (2004) showed that the rules used by taxonomists to set the boundaries between higher taxa have a large effect on imbalance as measured by the distribution of number of species per taxon.

In hopes of narrowing down the possible reasons for imbalance, the present paper addresses a specific empirical question. At any given bifurcating node in a phylogenetic tree, imbalance can be defined in terms of the relative numbers of species on the two branches that originate at the given node. The empirical question is whether imbalance also depends on the absolute total number of species on the two branches. The question can be answered with the aid of a measure of imbalance developed by Fusco and Cronk (1995) and Purvis et al. (2002), which is predicted by the Markov model to be independent of the total number of species. In contrast to the Markov model, the results of Heard and Mooers (2002) suggest that differences between branches in diversification rates should have a cumulative effect to produce more imbalance with more species. The random errors embodied in the proportional-to-distinguishable-arrangements model also predict an increase in imbalance with number of species, according to a specific distribution that can be tested empirically. An alternative analysis of errors raises the possibility that the number of branches per node may be related to the number of species per branch; this question can also be answered empirically.

DATA AND METHODS

The data are drawn mainly from the sample of phylogenetic trees previously collected by Purvis and Agapow (2002).
Three reasons recommend this sample for the present study. First, most of the trees are already published and thus available for further analysis. Second, the terminals of the trees are superspecific taxa, such as genera or families, with approximately known numbers of species. Because these trees contain more species than most trees with a single species at each terminal, the effect of number of species per node can be studied over a relatively wide range. Third, the sample is probably unbiased with respect to the present hypothesis, because it was collected for a different purpose. Purvis and Agapow used the sample to show that imbalance tends to be greater when the units of analysis are higher taxa rather than species. For this hypothesis, number of species is if anything a nuisance variable: in the one analysis that included number of species per node as a factor, Purvis and Agapow deliberately restricted its range to 20 or fewer species and found no significant effect. The question remains whether an effect can be demonstrated over a much wider range.

The sample of Purvis and Agapow includes 61 trees: 25 of arthropods, 21 of angiosperms, and 15 of vertebrates. The trees in the present sample (see Appendix, available at www.systbio.org) were obtained from the same sources cited by Purvis and Agapow, with the following exceptions. For arthropods, the tree of noncyclostome braconids is unpublished and therefore was not used here. The tree of Syrphidae, published by Katzourakis et al. (2001), was used instead; this tree was discussed by Purvis and Agapow but not included in their sample. For angiosperms, the tree of all angiosperms was unpublished at the time but has since been published by Davies et al. (2004); the published version was used here.
For vertebrates, many of the nodes in the tree of Odontoceti also occur in the tree of Eutheria; to avoid counting any node more than once, all the species of Odontoceti in the tree of Eutheria were here lumped together into a single terminal. Finally, in order to maximize the range of number of species, one more tree was added to the present sample: the tree of all living organisms published by Lecointre and Le Guyader (2001). Because many of the nodes in the tree of Eutheria also occur in the tree of all organisms, all the eutherian species in the latter tree were lumped together into a single terminal. None of the remaining nodes in any tree occurs in any other tree; thus, each node was analyzed only once. Most measures of imbalance are defined for an entire tree, which contains various nodes with different numbers of species. An effect of number of species would be easier to observe if imbalance were measured for individual nodes or sets of nodes within a tree. Just such a measure of imbalance was introduced by Fusco and Cronk (1995) and extended by Purvis et al. (2002). For a given bifurcating node, let S be the total number of species on the two branches, let B be the total number of species on the branch with more species, and let m be the smallest integer not smaller than S/2. It can be assumed without loss of generality that S is at least 4, the smallest number of species for which nodes can have different levels of imbalance. Fusco and Cronk (1995) defined an imbalance score I as follows: I = (B − m)/(S − m − 1). I has a maximum value of 1 if the node is as unbalanced as possible, with one species on one branch and all remaining species on the other; I has a minimum value of 0 if the node is as balanced as possible, with the numbers of species on the two branches either equal or differing by only one. Purvis et al. (2002) showed, however, that the expected value of I depends on S even if the Markov model is true. 
To correct this problem, they defined a weight w as follows: w = 1 if S is odd; w = (S − 1)/S if S is even and I > 0; w = 2(S − 1)/S if S is even and I = 0. For any set of nodes, such as those with a particular value of S, Purvis et al. (2002) also defined the weighted mean imbalance Iw as the weighted mean of I with weights w. They then showed that under the Markov model, Iw (unlike I) has an expected value of 0.5 for any value of S. Therefore, any empirical effect of S on Iw implies that the total number of species at a node influences the extent to which imbalance exceeds the prediction from the Markov model.

Although the Markov model assumes that all nodes in a tree are statistically independent, Purvis and Agapow (2002) already showed that the model does not apply to the present collection of trees. Consequently, the assumption of independence is not appropriate for testing the statistical significance of differences among nodes. Instead, a bootstrap test was conducted that assumes only the independence of the 62 trees. In each of 10,000 bootstrap samples of 62 trees with replacement from the original collection, the data were analyzed in the same way as the original data; the proportion of these samples that show an effect opposite from a given prediction is an estimate of the one-tailed descriptive significance level of the predicted effect.

RESULTS

The trees in the original sample contain 1251 bifurcating nodes, along with 131 nodes with more than two branches (polytomous nodes). The bifurcating nodes were sorted into sets according to the number of species per node, in intervals with lower bounds of 4, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 20,000, and 100,000 species per node. In each set, the weighted mean imbalance Iw was calculated as described above, and the weighted geometric mean number of species per node was calculated with the same weights w.
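The definitions of I, w, and Iw given in the Methods can be sketched in a few lines of Python. The function names below are illustrative and do not come from the paper. The final function assumes the standard property of the equal-rates Markov model that a node with S species is equally likely to place 1, 2, ..., S − 1 of them on a given branch; under that assumption the weighted mean comes out at exactly 0.5 for every S, in line with the result attributed to Purvis et al. (2002).

```python
from fractions import Fraction


def imbalance(S, B):
    """Imbalance score I for a bifurcating node.

    S: total number of species on the two branches (at least 4).
    B: number of species on the larger branch.
    """
    if S < 4:
        raise ValueError("I is only defined here for S >= 4")
    m = (S + 1) // 2  # smallest integer not smaller than S/2
    if not m <= B <= S - 1:
        raise ValueError("B must lie between ceil(S/2) and S - 1")
    return Fraction(B - m, S - m - 1)


def weight(S, B):
    """Weight w as defined in the text."""
    if S % 2 == 1:
        return Fraction(1)
    if imbalance(S, B) > 0:
        return Fraction(S - 1, S)
    return Fraction(2 * (S - 1), S)


def weighted_mean_imbalance(nodes):
    """Weighted mean imbalance Iw for an iterable of (S, B) pairs."""
    nodes = list(nodes)
    total_weight = sum(weight(S, B) for S, B in nodes)
    weighted_sum = sum(weight(S, B) * imbalance(S, B) for S, B in nodes)
    return weighted_sum / total_weight


def expected_weighted_mean_markov(S):
    """Iw over all equally likely splits k : S - k (assumed Markov property)."""
    splits = [(S, max(k, S - k)) for k in range(1, S)]
    return weighted_mean_imbalance(splits)
```

Exact rational arithmetic is used so that the value 0.5 can be checked without rounding error; for empirical data, ordinary floating-point numbers would serve equally well.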
The solid line in Figure 1 plots imbalance as a function of species per node, with the latter on a logarithmic scale. The function increases across its entire range except for minor fluctuations. The increase is negligible, however, across the much narrower range of 4 to 20 species per node, confirming the results of Purvis and Agapow (2002). Also in agreement with Purvis and Agapow is the fact that imbalance is consistently above the 0.5 predicted by the Markov model. As a summary measure of association, the weighted product-moment correlation between imbalance and the logarithm of number of species per node was calculated across all 1251 bifurcating nodes, again with the weights w. The correlation is 0.20. The correlation also proved to be positive in all the bootstrap samples (P < .0001).

FIGURE 1. Weighted mean imbalance (Iw) as a function of total number of species per node (S). Solid line: data. Dotted line: prediction from proportional-to-distinguishable-arrangements model. Markov model predicts that Iw is 0.5 for all S.

FIGURE 2. Proportion of completely unbalanced trees as a function of total number of species (S). Solid line: data. Upper dotted line: prediction from proportional-to-distinguishable-arrangements model. Lower dotted line: prediction from Markov model.

The dotted line in Figure 1 plots the imbalance predicted by the proportional-to-distinguishable-arrangements model according to equations 1, 12, and 13 of Slowinski (1990). This model, unlike the Markov model but like the data, implies a positive relation between imbalance and number of species. In fact, the data fall about halfway between the predictions of the two models, except that the last data point (for more than 100,000 species) is noticeably above the halfway point.
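The weighted correlation and the tree-level bootstrap described above can be sketched as follows. This is a minimal illustration and not the author's code: the data layout (a list of trees, each a list of (S, I, w) tuples for its bifurcating nodes), the function names, and the treatment of degenerate resamples with zero variance (counted as not positive) are all assumptions made here.

```python
import math
import random


def weighted_correlation(x, y, w):
    """Weighted product-moment correlation of x and y with weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a degenerate sample shows no positive effect
    return cov / math.sqrt(var_x * var_y)


def bootstrap_opposite_fraction(trees, n_boot=10000, seed=0):
    """Proportion of bootstrap samples in which the correlation is not positive.

    Whole trees are resampled with replacement, so only the independence
    of trees (not of nodes within a tree) is assumed.
    """
    rng = random.Random(seed)
    opposite = 0
    for _ in range(n_boot):
        sample = rng.choices(trees, k=len(trees))
        nodes = [node for tree in sample for node in tree]
        r = weighted_correlation(
            [math.log(S) for S, _, _ in nodes],
            [I for _, I, _ in nodes],
            [w for _, _, w in nodes],
        )
        if r <= 0:
            opposite += 1
    return opposite / n_boot
```

Resampling trees rather than nodes keeps all nodes of a tree together in each bootstrap sample, which is the point of the test: nodes within a tree are not treated as independent observations.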
To compare the models with the data for a different aspect of imbalance, Figure 2 shows the proportion of bifurcating nodes that are completely unbalanced; such nodes have one species on one branch, all the other species on the other branch, and an imbalance score of 1. For the data (solid line), the nodes were sorted according to number of species (S) in the same intervals as in Figure 1, but all the nodes were weighted equally. According to Slowinski (1990), the proportion of completely unbalanced nodes is predicted to be 2/(S − 1) by the Markov model (lower dotted line), and S/(2S − 3) by the proportional-to-distinguishable-arrangements model (upper dotted line). As before, the data fall between the predictions of the two models, but this time the data move much closer to the Markov model as the number of species increases.

Because the data fall between the models, a probability mixture of the models can be explored as a possible compromise. Let the Markov model hold with probability P(S), which may depend upon S, and let the proportional-to-distinguishable-arrangements model hold with probability 1 − P(S). According to Figure 1, P(S) is about 0.5 and decreases if anything for large S. According to Figure 2, however, P(S) increases to near 1 as S increases. This discrepancy contradicts any probability mixture of the models. In other words, the trees that fail to obey the Markov model are not chosen at random from the set of all possible trees.

As further evidence on the empirical pattern of imbalance, Figure 3 presents relative frequency histograms of imbalance scores for nodes with different numbers of species. The range of possible imbalance scores is divided into ten intervals of length 0.1; the graph shows the relative frequency of scores in each interval, with nodes weighted by the weights w. The three histograms refer to nodes with 20 to 199 species (white bars), 200 to 1999 species (grey bars), and 2000 or more species (black bars). Nodes with fewer than 20 species are not included because the underlying discrete distribution of imbalance scores is not well approximated by an interval histogram for small numbers of species. The Markov model predicts a discrete uniform distribution with a probability close to 0.1 in each interval. The empirical distributions are in fact approximately uniform for imbalance scores below about 0.7, although the relative frequencies are lower than predicted. For imbalance scores above 0.7, the relative frequencies increase with imbalance; the highest frequency in each distribution is observed for imbalance scores in the interval from 0.9 to 1.0. As number of species increases, relative frequencies decrease for imbalance below 0.7 and increase for imbalance above 0.9, resulting in a general increase in imbalance.

FIGURE 3. Weighted relative frequency histograms of imbalance. White bars: 20 to 199 species per node. Gray bars: 200 to 1999 species per node. Black bars: 2000+ species per node. Markov model predicts that each relative frequency is 0.1.

One factor that may contribute to the relation between imbalance and number of species is the proximity of a node to the root of the tree. On any branch of a tree, nodes closer to the root also have more species. Thus, if for any reason the methods used to construct trees tend to produce more imbalance at nodes closer to the root, then there could also be more imbalance at nodes with more species. To test this possibility, the distance from any node to the root was defined as the number of other nodes on the path from the given node to the root. The weighted correlation between imbalance and distance from the root is 0.03, although the correlation would be negative if the greater imbalance at nodes with more species were a secondary effect of proximity to the root. The correlation was not negative in 80% of the bootstrap samples, indicating no significant correlation.
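Returning briefly to the comparison in Figure 2, the two predicted proportions of completely unbalanced nodes, and the mixture probability P(S) implied by an observed proportion, can be computed as in the sketch below. The function names are hypothetical, and the observed proportion in the usage lines is an invented placeholder, not a value from the paper.

```python
from fractions import Fraction


def p_unbalanced_markov(S):
    """Predicted proportion of completely unbalanced nodes, Markov model."""
    return Fraction(2, S - 1)


def p_unbalanced_pda(S):
    """Predicted proportion under the proportional-to-distinguishable-arrangements model."""
    return Fraction(S, 2 * S - 3)


def mixture_probability(observed, S):
    """Solve observed = P * markov + (1 - P) * pda for P.

    Valid for S >= 4, where the two predictions differ.
    """
    markov = p_unbalanced_markov(S)
    pda = p_unbalanced_pda(S)
    return (pda - observed) / (pda - markov)


# Hypothetical usage: an invented observed proportion of 0.05 at S = 1000.
P = mixture_probability(Fraction(1, 20), 1000)
print(float(P))
```

As S grows, the Markov prediction tends to 0 while the other prediction tends to 1/2, so an observed proportion near 0 at large S implies P(S) near 1, which is the reading of Figure 2 given in the text.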
Another possibly relevant factor is the strength of the data supporting nodes with different numbers of species. If nodes with more species tend to be less strongly supported, then their greater imbalance could be explained by the inverse relation between support and imbalance found in simulated trees (Mooers and Heard, 1997). Investigation of this possibility is hampered by the heterogeneity of the published information on degree of support for individual nodes, and also by the heterogeneity of the very methods used to construct the trees. Information on degree of support ranges from none in some trees to a variety of different measures in others, depending on how the trees were constructed. One general albeit indirect measure of support can nevertheless be derived from the fact that trees are most informative if they have the highest degree of resolution justified by the data. Consequently, the presence of nodes with more than two branches suggests that the data are not strong enough to support nodes with higher resolution. In particular, nodes with more than two branches are the inevitable result when poorly supported nodes collapse in a consensus tree; this process accounts for 91 of the 131 nodes with more than two branches in the present data. Because each branch contributes its species to the total at a node, the number of species per node must be replaced by the mean number of species per branch in comparisons of nodes with different numbers of branches. The empirical question is whether nodes with more branches also tend to have more species per branch. The unweighted correlation across nodes was therefore calculated between the logarithm of the number of branches per node and the logarithm of the mean number of species per branch. The correlation is −0.02; the correlation was nonpositive in 64% of the bootstrap samples. 
In case the correlation is diluted by heterogeneity among the trees in the criteria used for collapsing nodes, correlations were also calculated separately within each of the 34 individual trees that include at least one node with more than two branches; 18 of the correlations are positive and 16 are negative. In case the correlations are vitiated because only 10% of the nodes have more than two branches, the mean numbers of species per branch were also compared between nodes with two branches and nodes with more than two branches; the geometric mean was at least as great for nodes with two branches in 26% of the bootstrap samples and in 16 of the 34 individual trees. This series of null results suggests that nodes with more species per branch do not have more branches, and therefore that the greater imbalance of nodes with more species is not an effect of weaker support.

DISCUSSION

Most measures of imbalance are defined for entire trees and are thus suitable for comparisons between trees. The imbalance scores of Fusco and Cronk (1995), however, along with the weights of Purvis et al. (2002), can be defined for sets of nodes within trees and are thus appropriate for comparisons within as well as between trees. In the first such comparison, Purvis and Agapow (2002) showed an effect of taxonomic rank: imbalance tends to be greater when calculated in terms of higher taxa rather than species. The present comparison shows an effect of total number of species: imbalance tends to be greater at nodes with more species. The flexibility of the weighted imbalance measure recommends its use in further comparisons within and between phylogenetic trees.

The proportional-to-distinguishable-arrangements model has served its purpose in explaining why the addition of random error to simulated data increases the imbalance of estimated trees (Mooers and Heard, 1997).
As a mechanism for generating trees, however, the process of completely random choice embodied in the model becomes less plausible as the number of species increases and the number of possible trees increases even faster. It is therefore no surprise that the model progressively fails as an alternative to the Markov model in describing the pattern of imbalance as the number of species increases in the present data.

A better description of how trees are constructed might start with the fact that for more than a few species, there are far more possible trees than can ever be explored exhaustively. Even computerized heuristic searches, the fastest of which now rely on Bayesian statistics, rarely attempt to find trees with more than a few hundred terminals (Huelsenbeck et al., 2001). Larger numbers of species can only be accommodated with the aid of additional approximations. The most common approximation, used in nearly all the trees in the present sample, is to substitute higher taxa as terminals in place of species. This technique assumes that the higher taxa are strictly monophyletic, and also that they can be adequately represented by a subset of their species or character states. Another approximation, used in some of the largest trees in the present sample, is to combine a number of smaller trees into a single large one, commonly called a supertree (Bininda-Emonds, 2004). Because these approximations become more common and controversial with more species, nodes with more species may be less well supported and for that reason more unbalanced. An admittedly indirect test of this possibility in the present sample found no relation between the number of branches per node and the number of species per branch.
To the extent that more branches at a node indicate less support, these results suggest that the approximations necessary to construct nodes with many species do not substantially undermine their support or exaggerate their imbalance.

In the context of biological explanations for imbalance, the present results partly account for the surprisingly high levels of imbalance pointed out by Heard and Mooers (2002). If differences in diversification rates evolve incrementally, then the effects of such differences on imbalance should be inconspicuous at nodes with few species before accumulating at nodes with more species. Figure 1 does indeed show the predicted increase in imbalance with number of species, but the increase starts from a level of imbalance that already indicates substantial differences in diversification rates. The high initial level of imbalance remains to be explained.

In addition to the general increase in imbalance with number of species, large trees like those in the present sample contain a wealth of more detailed information about imbalance that the present research has just begun to explore. For instance, Figure 2 shows that as the number of species increases, the proportion of completely unbalanced trees approaches an asymptote that is close to 0 but nevertheless above 0. Also, Figure 3 shows a strikingly simple pattern in the distributions of imbalance, which are nearly uniform except for a peak near the upper end of their range; the only apparent effect of increasing the number of species is to shift relative frequency from the uniform portion to the peak. Any successful models for phylogenetic trees will have to account for these patterns.

ACKNOWLEDGMENTS

I thank Paul-Michael Agapow, Marshal Hedin, Roderic Page, and an anonymous referee for their helpful suggestions.

REFERENCES

Bininda-Emonds, O. R. P. 2004. The evolution of supertrees. Trends Ecol. Evol. 19:315–322.
Davies, T. J., T. G. Barraclough, M. W. Chase, P. S. Soltis, D. E. Soltis, and V. Savolainen. 2004. Darwin’s abominable mystery: Insights from a supertree of the angiosperms. Proc. Nat. Acad. Sci. USA 101:1904–1909.
Fusco, G., and Q. C. B. Cronk. 1995. A new method for evaluating the shape of large phylogenies. J. Theor. Biol. 175:235–243.
Guyer, C., and J. B. Slowinski. 1991. Comparisons of observed phylogenetic topologies with null expectations among three monophyletic lineages. Evolution 45:340–350.
Heard, S. B., and A. Ø. Mooers. 2002. Signatures of random and selective mass extinctions in phylogenetic tree balance. Syst. Biol. 51:889–897.
Huelsenbeck, J. P., F. Ronquist, R. Nielsen, and J. P. Bollback. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310–2314.
Katzourakis, A., A. Purvis, S. Azmeh, G. Rotheray, and F. Gilbert. 2001. Macroevolution of hoverflies (Diptera: Syrphidae): The effect of using higher-level taxa in studies of biodiversity, and correlates of species richness. J. Evol. Biol. 14:219–227.
Lecointre, G., and H. Le Guyader. 2001. Classification phylogénétique du vivant, 2e édition. Belon, Paris.
Mooers, A. Ø., and S. B. Heard. 1997. Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol. 72:31–54.
Mooers, A. Ø., R. D. M. Page, A. Purvis, and P. H. Harvey. 1995. Phylogenetic noise leads to unbalanced cladistic tree reconstructions. Syst. Biol. 44:332–342.
Purvis, A., and P.-M. Agapow. 2002. Phylogeny imbalance: Taxonomic level matters. Syst. Biol. 51:844–854.
Purvis, A., A. Katzourakis, and P.-M. Agapow. 2002. Evaluating phylogenetic tree shape: Two modifications to Fusco and Cronk’s method. J. Theor. Biol. 214:99–103.
Savage, H. M. 1983. The shape of evolution: Systematic tree topology. Biol. J. Linn. Soc. 20:225–244.
Scotland, R. W., and M. J. Sanderson. 2004. The significance of few versus many in the tree of life. Science 303:643.
Slowinski, J. B. 1990. Probabilities of n-trees under two models: A demonstration that asymmetrical interior nodes are not improbable. Syst. Zool. 39:89–94.
Slowinski, J. B., and C. Guyer. 1989. Testing the stochasticity of patterns of organismal diversity: An improved null model. Am. Nat. 134:907–921.
Stam, E. 2002. Does imbalance in phylogenies reflect only bias? Evolution 56:1292–1295.

First submitted 4 January 2005; reviews returned 31 March 2005; final acceptance 7 June 2005
Associate Editor: Marshal Hedin