Syst. Biol. 54(6):895–899, 2005
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500354696
Nodes in Phylogenetic Trees: The Relation Between Imbalance
and Number of Descendent Species
ERIC W. HOLMAN
Department of Psychology, University of California, Los Angeles, California 90095, USA; E-mail: [email protected]
Abstract.— The imbalance of a node in a phylogenetic tree can be defined in terms of the relative numbers of species (or
higher taxa) on the branches that originate at the node. Empirically, imbalance also turns out to depend on the absolute
total number of species on the branches: in a sample of large trees, nodes with more descendent species tend to be more
unbalanced. Subsidiary analyses suggest that this pattern is not a result of errors in tree estimation. Instead, the increase
in imbalance with species is consistent with a cumulative effect of differences in diversification rates between branches.
[Equal-rates Markov model; imbalance; phylogeny shape; proportional-to-distinguishable-arrangements model.]
Since the pioneering work of Savage (1983), a large
body of research has been devoted to the question of
what inferences about evolution can be drawn from the
shape of phylogenetic trees. The property of trees most
thoroughly studied is their degree of imbalance, that is,
the extent to which some branches lead to many species
(or higher taxa) while others lead only to a few. The observed degree of imbalance is typically compared to the
imbalance predicted by a null model called the simple
birth and death process or the equal-rates Markov model,
which assumes that species originate and become extinct
at stochastically constant rates on all branches of the tree.
In their review of this literature, Mooers and Heard (1997)
concluded that most empirical trees are more unbalanced
than predicted by the model.
The usual explanation for imbalance is differences
among lineages in rates of speciation relative to rates
of extinction; these differences in net diversification
rates are presumably caused by biological differences
among organisms. Slowinski and Guyer (1989) established the Markov model as the appropriate null hypothesis for testing differences of this sort. As Heard
and Mooers (2002) pointed out, the surprising empirical result here is that the imbalance of most trees exceeds not only the prediction from the Markov model,
but also predictions from biologically plausible differences in net diversification rates. Such differences do
not have enough time to produce much imbalance in
the small trees typically studied in research on tree
shape. Heard and Mooers showed by simulation that
occasional mass extinctions can enhance the effect of
differences in diversification rates and increase the imbalance of the resulting trees, although whether the predicted imbalance matches that of empirical trees is not
clear.
The main alternative explanation for imbalance is
errors that cause the estimated trees to deviate from
the true phylogenies. Slowinski (1990) pointed out
that random errors can be represented by the every-tree-is-equiprobable or proportional-to-distinguishable-arrangements model, which assumes that trees are
chosen at random from the set of all possible trees for
a given number of species. This model predicts higher
levels of imbalance than does the Markov model, and
the simulation studies reviewed by Mooers and Heard
(1997) show that adding random error to the data does indeed increase the imbalance of trees estimated by cladistic methods. The evidence is mixed on whether such
errors account for the imbalance of empirical trees. Guyer
and Slowinski (1991) found that the trees in their sample
that were supported by the largest number of characters were consistent with the Markov model, whereas
trees with less support were more unbalanced. Mooers
et al. (1995) also observed a negative correlation between
imbalance and data quality in another sample of trees.
Stam (2002), however, used a different measure of imbalance, which more completely compensated for tree
size, and found no correlation between imbalance and
data quality in a new sample of trees. As a further complication, Scotland and Sanderson (2004) showed that
the rules used by taxonomists to set the boundaries between higher taxa have a large effect on imbalance as
measured by the distribution of number of species per
taxon.
In hopes of narrowing down the possible reasons for
imbalance, the present paper addresses a specific empirical question. At any given bifurcating node in a phylogenetic tree, imbalance can be defined in terms of the
relative numbers of species on the two branches that
originate at the given node. The empirical question is
whether imbalance also depends on the absolute total
number of species on the two branches. The question
can be answered with the aid of a measure of imbalance
developed by Fusco and Cronk (1995) and Purvis et al.
(2002), which is predicted by the Markov model to be
independent of the total number of species. In contrast
to the Markov model, the results of Heard and Mooers
(2002) suggest that differences between branches in diversification rates should have a cumulative effect to produce more imbalance with more species. The random
errors embodied in the proportional-to-distinguishable-arrangements model also predict an increase in imbalance with number of species, according to a specific
distribution that can be tested empirically. An alternative
analysis of errors raises the possibility that the number
of branches per node may be related to the number of
species per branch; this question can also be answered
empirically.
DATA AND METHODS
The data are drawn mainly from the sample of phylogenetic trees previously collected by Purvis and Agapow
(2002). Three reasons recommend this sample for the
present study. First, most of the trees are already published and thus available for further analysis. Second,
the terminals of the trees are superspecific taxa, such
as genera or families, with approximately known numbers of species. Because these trees contain more species
than most trees with a single species at each terminal,
the effect of number of species per node can be studied
over a relatively wide range. Third, the sample is probably unbiased with respect to the present hypothesis,
because it was collected for a different purpose. Purvis
and Agapow used the sample to show that imbalance
tends to be greater when the units of analysis are higher
taxa rather than species. For this hypothesis, number of
species is if anything a nuisance variable: in the one analysis that included number of species per node as a factor,
Purvis and Agapow deliberately restricted its range to
20 or fewer species and found no significant effect. The
question remains whether an effect can be demonstrated
over a much wider range.
The sample of Purvis and Agapow includes 61 trees: 25
of arthropods, 21 of angiosperms, and 15 of vertebrates.
The trees in the present sample (see Appendix, available at www.systbio.org) were obtained from the same
sources cited by Purvis and Agapow, with the following
exceptions. For arthropods, the tree of noncyclostome
braconids is unpublished and therefore was not used
here. The tree of Syrphidae, published by Katzourakis
et al. (2001), was used instead; this tree was discussed
by Purvis and Agapow but not included in their sample.
For angiosperms, the tree of all angiosperms was unpublished at the time but has since been published by Davies
et al. (2004); the published version was used here. For vertebrates, many of the nodes in the tree of Odontoceti also
occur in the tree of Eutheria; to avoid counting any node
more than once, all the species of Odontoceti in the tree of
Eutheria were here lumped together into a single terminal. Finally, in order to maximize the range of number of
species, one more tree was added to the present sample:
the tree of all living organisms published by Lecointre
and Le Guyader (2001). Because many of the nodes in
the tree of Eutheria also occur in the tree of all organisms,
all the eutherian species in the latter tree were lumped
together into a single terminal. None of the remaining
nodes in any tree occurs in any other tree; thus, each
node was analyzed only once.
Most measures of imbalance are defined for an entire
tree, which contains various nodes with different numbers of species. An effect of number of species would
be easier to observe if imbalance were measured for individual nodes or sets of nodes within a tree. Just such
a measure of imbalance was introduced by Fusco and
Cronk (1995) and extended by Purvis et al. (2002). For
a given bifurcating node, let S be the total number of
species on the two branches, let B be the total number of
species on the branch with more species, and let m be the
smallest integer not smaller than S/2. It can be assumed
without loss of generality that S is at least 4, the smallest
number of species for which nodes can have different
levels of imbalance. Fusco and Cronk (1995) defined an
imbalance score I as follows:
I = (B − m)/(S − m − 1).
I has a maximum value of 1 if the node is as unbalanced
as possible, with one species on one branch and all remaining species on the other; I has a minimum value of
0 if the node is as balanced as possible, with the numbers
of species on the two branches either equal or differing
by only one. Purvis et al. (2002) showed, however, that
the expected value of I depends on S even if the Markov
model is true. To correct this problem, they defined a
weight w as follows:
w = 1 if S is odd;
w = (S − 1)/S if S is even and I > 0;
w = 2(S − 1)/S if S is even and I = 0.
For any set of nodes, such as those with a particular value
of S, Purvis et al. (2002) also defined the weighted mean
imbalance Iw as the weighted mean of I with weights
w. They then showed that under the Markov model, Iw
(unlike I ) has an expected value of 0.5 for any value
of S. Therefore, any empirical effect of S on Iw implies
that the total number of species at a node influences the
extent to which imbalance exceeds the prediction from
the Markov model.
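The definitions of I, w, and Iw above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the analyses; the function names and the representation of a node as a pair (S, B) are assumptions made here for convenience.

```python
def imbalance_score(S, B):
    """Imbalance score I = (B - m) / (S - m - 1), where m is the
    smallest integer not smaller than S / 2."""
    if S < 4:
        raise ValueError("nodes need at least 4 species to differ in imbalance")
    m = (S + 1) // 2  # ceiling of S / 2 for positive integers
    if not m <= B <= S - 1:
        raise ValueError("B must be the species count on the larger branch")
    return (B - m) / (S - m - 1)


def weight(S, I):
    """Weight w: 1 for odd S; (S - 1)/S or 2(S - 1)/S for even S,
    depending on whether I > 0 or I = 0."""
    if S % 2 == 1:
        return 1.0
    return (S - 1) / S if I > 0 else 2 * (S - 1) / S


def weighted_mean_imbalance(nodes):
    """Weighted mean imbalance Iw for an iterable of (S, B) pairs."""
    numerator = 0.0
    denominator = 0.0
    for S, B in nodes:
        I = imbalance_score(S, B)
        w = weight(S, I)
        numerator += w * I
        denominator += w
    return numerator / denominator


def markov_splits(S):
    """All splits of S species that are equally likely under the Markov
    model: the first branch holds 1, 2, ..., S - 1 species."""
    return [(S, max(k, S - k)) for k in range(1, S)]
```

Feeding `markov_splits(S)` to `weighted_mean_imbalance` enumerates the equally likely splits under the Markov model, so the result should equal 0.5 for any S of at least 4, whether odd or even; this is the property that motivates the weights.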
Although the Markov model assumes that all nodes in
a tree are statistically independent, Purvis and Agapow
(2002) already showed that the model does not apply
to the present collection of trees. Consequently, the assumption of independence is not appropriate for testing
the statistical significance of differences among nodes.
Instead, a bootstrap test was conducted that assumes
only the independence of the 62 trees. In each of 10,000
bootstrap samples of 62 trees with replacement from the
original collection, the data were analyzed in the same
way as the original data; the proportion of these samples
that show an effect opposite from a given prediction is an
estimate of the one-tailed descriptive significance level
of the predicted effect.
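The bootstrap test resamples whole trees rather than nodes, which can be sketched as follows. The function name, the representation of each tree as a list of node records, and the convention that `statistic` returns a positive value when the effect lies in the predicted direction are all assumptions of this sketch; counting a value of exactly zero as opposite to the prediction is likewise an assumption.

```python
import random


def bootstrap_opposite_fraction(trees, statistic, n_boot=10_000, seed=0):
    """Estimate a one-tailed descriptive significance level by
    resampling trees with replacement.

    trees     : list of trees, each a list of node records
    statistic : function of a pooled list of node records; positive
                values mean the effect is in the predicted direction
    """
    rng = random.Random(seed)
    opposite = 0
    for _ in range(n_boot):
        # Draw as many trees as the original collection, with replacement.
        sample = [rng.choice(trees) for _ in trees]
        # Analyze the resampled nodes in the same way as the original data.
        pooled = [node for tree in sample for node in tree]
        if statistic(pooled) <= 0:
            opposite += 1
    return opposite / n_boot
```

Because only trees are resampled, nodes within a tree stay together in every bootstrap sample, so no independence among nodes of the same tree is assumed.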
RESULTS
FIGURE 1. Weighted mean imbalance (Iw) as a function of total number of species per node (S). Solid line: data. Dotted line: prediction from proportional-to-distinguishable-arrangements model. Markov model predicts that Iw is 0.5 for all S.
FIGURE 2. Proportion of completely unbalanced trees as a function of total number of species (S). Solid line: data. Upper dotted line: prediction from proportional-to-distinguishable-arrangements model. Lower dotted line: prediction from Markov model.
The trees in the original sample contain 1251 bifurcating nodes, along with 131 nodes with more than
two branches (polytomous nodes). The bifurcating nodes
were sorted into sets according to the number of species
per node, in intervals with lower bounds of 4, 10, 20, 50,
100, 200, 500, 1000, 2000, 5000, 20,000, and 100,000 species
per node. In each set, the weighted mean imbalance Iw
was calculated as described above, and the weighted geometric mean number of species per node was calculated
with the same weights w. The solid line in Figure 1 plots
imbalance as a function of species per node, with the latter on a logarithmic scale. The function increases across
its entire range except for minor fluctuations. The increase is negligible, however, across the much narrower
range of 4 to 20 species per node, confirming the results
of Purvis and Agapow (2002). Also in agreement with
Purvis and Agapow is the fact that imbalance is consistently above the 0.5 predicted by the Markov model.
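The sorting of nodes into intervals and the weighted geometric mean used for the horizontal axis of Figure 1 can be sketched as below. The lower bounds are those listed in the text; the function names are hypothetical, and treating each interval as running from its lower bound up to, but not including, the next lower bound is an assumption of this sketch.

```python
from bisect import bisect_right
from math import exp, log

# Lower bounds of the species-per-node intervals, as listed in the text.
LOWER_BOUNDS = [4, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 20_000, 100_000]


def interval_index(S):
    """Index of the interval containing S (0 for 4 to 9 species, etc.)."""
    if S < LOWER_BOUNDS[0]:
        raise ValueError("nodes with fewer than 4 species are not analyzed")
    return bisect_right(LOWER_BOUNDS, S) - 1


def weighted_geometric_mean(values, weights):
    """Weighted geometric mean: exp of the weighted mean of the logs."""
    total = sum(weights)
    return exp(sum(w * log(v) for v, w in zip(values, weights)) / total)
```

A geometric rather than arithmetic mean is the natural summary here because the number of species per node is plotted on a logarithmic scale.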
As a summary measure of association, the weighted
product-moment correlation between imbalance and the
logarithm of number of species per node was calculated
across all 1251 bifurcating nodes, again with the weights
w. The correlation is 0.20. The correlation also proved to
be positive in all the bootstrap samples (P < .0001).
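A weighted product-moment correlation of this kind can be computed as in the following sketch, which assumes the standard weighted analogues of the mean, variance, and covariance; the function name is hypothetical. In the present setting, x would hold the logarithm of the number of species per node, y the imbalance scores I, and w the weights defined in the methods.

```python
from math import sqrt


def weighted_correlation(x, y, w):
    """Weighted Pearson correlation between x and y with weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(var_x * var_y)
```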
The dotted line in Figure 1 plots the imbalance predicted by the proportional-to-distinguishable-arrangements model according to equations 1, 12, and
13 of Slowinski (1990). This model, unlike the Markov
model but like the data, implies a positive relation between imbalance and number of species. In fact, the data
fall about halfway between the predictions of the two
models, except that the last data point (for more than
100,000 species) is noticeably above the halfway point.
To compare the models with the data for a different aspect of imbalance, Figure 2 shows the proportion of bifurcating nodes that are completely unbalanced; such nodes
have one species on one branch, all the other species on
the other branch, and an imbalance score of 1. For the data
(solid line), the nodes were sorted according to number
of species (S) in the same intervals as in Figure 1, but all
the nodes were weighted equally. According to Slowinski
(1990), the proportion of completely unbalanced nodes
is predicted to be 2/(S − 1) by the Markov model
(lower dotted line), and S/(2S − 3) by the proportional-to-distinguishable-arrangements model (upper dotted
line). As before, the data fall between the predictions
of the two models, but this time the data move much
closer to the Markov model as the number of species
increases.
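The two predicted proportions can be tabulated exactly, as in the sketch below. The function names are hypothetical. The counting check in `pda_unbalanced_by_counting` is an independent verification added here, not the derivation of Slowinski (1990); it relies on the standard result that there are (2n − 3)!! labelled rooted bifurcating trees on n species, and it applies for S of at least 3.

```python
from fractions import Fraction


def markov_unbalanced(S):
    """Predicted proportion of completely unbalanced nodes, Markov model."""
    return Fraction(2, S - 1)


def pda_unbalanced(S):
    """Predicted proportion under the
    proportional-to-distinguishable-arrangements model."""
    return Fraction(S, 2 * S - 3)


def rooted_binary_trees(n):
    """Number of labelled rooted bifurcating trees on n species,
    (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count


def pda_unbalanced_by_counting(S):
    """Choose the lone species in S ways, arrange the other S - 1
    species in any tree, and divide by the number of trees on S."""
    return Fraction(S * rooted_binary_trees(S - 1), rooted_binary_trees(S))
```

As S grows, `markov_unbalanced(S)` tends to 0 while `pda_unbalanced(S)` tends to 1/2, which is why the two dotted lines in Figure 2 diverge for large S.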
Because the data fall between the models, a probability mixture of the models can be explored as a possible compromise. Let the Markov model hold with
probability P(S), which may depend upon S, and
let the proportional-to-distinguishable-arrangements
model hold with probability 1 − P(S). According to
Figure 1, P(S) is about 0.5 and decreases if anything for
large S. According to Figure 2, however, P(S) increases
to near 1 as S increases. This discrepancy contradicts any
probability mixture of the models. In other words, the
trees that fail to obey the Markov model are not chosen
at random from the set of all possible trees.
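The mixture argument amounts to solving a linear equation for P(S): if an observed quantity equals P(S) times the Markov prediction plus 1 − P(S) times the proportional-to-distinguishable-arrangements prediction, then P(S) follows directly. The sketch below makes that step explicit; the function name is hypothetical, and no values from the figures are assumed.

```python
def implied_markov_probability(observed, markov_pred, pda_pred):
    """Solve observed = P * markov_pred + (1 - P) * pda_pred for P."""
    if markov_pred == pda_pred:
        raise ValueError("the two models make the same prediction; P is undetermined")
    return (pda_pred - observed) / (pda_pred - markov_pred)
```

A single mixture would require the same P(S) to come out of this calculation whether it is applied to the weighted mean imbalance of Figure 1 or to the proportion of completely unbalanced nodes of Figure 2; the discrepancy described above is that it does not.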
FIGURE 3. Weighted relative frequency histograms of imbalance. White bars: 20 to 199 species per node. Gray bars: 200 to 1999 species per node. Black bars: 2000+ species per node. Markov model predicts that each relative frequency is 0.1.
As further evidence on the empirical pattern of imbalance, Figure 3 presents relative frequency histograms
of imbalance scores for nodes with different numbers
of species. The range of possible imbalance scores is divided into ten intervals of length 0.1; the graph shows the
relative frequency of scores in each interval, with nodes weighted by the weights w. The three histograms refer
to nodes with 20 to 199 species (white bars), 200 to 1999
species (gray bars), and 2000 or more species (black bars).
Nodes with fewer than 20 species are not included because the underlying discrete distribution of imbalance
scores is not well approximated by an interval histogram
for small numbers of species.
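A weighted relative frequency histogram of this kind can be sketched as follows. The function name is hypothetical, and the treatment of interval boundaries (each interval includes its lower end, and a score of exactly 1 goes in the top interval) is an assumption, since the text does not specify it.

```python
def weighted_histogram(scores, weights, n_bins=10):
    """Weighted relative frequencies of imbalance scores in n_bins
    equal intervals spanning 0 to 1."""
    totals = [0.0] * n_bins
    for score, w in zip(scores, weights):
        # Scores of exactly 1 would index past the end, so clamp them
        # into the top interval.
        index = min(int(score * n_bins), n_bins - 1)
        totals[index] += w
    grand_total = sum(totals)
    return [t / grand_total for t in totals]
```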
The Markov model predicts a discrete uniform distribution with a probability close to 0.1 in each interval.
The empirical distributions are in fact approximately uniform for imbalance scores below about 0.7, although the
relative frequencies are lower than predicted. For imbalance scores above 0.7, the relative frequencies increase
with imbalance; the highest frequency in each distribution is observed for imbalance scores in the interval from
0.9 to 1.0. As number of species increases, relative frequencies decrease for imbalance below 0.7 and increase
for imbalance above 0.9, resulting in a general increase
in imbalance.
One factor that may contribute to the relation between
imbalance and number of species is the proximity of a
node to the root of the tree. On any branch of a tree,
nodes closer to the root also have more species. Thus,
if for any reason the methods used to construct trees
tend to produce more imbalance at nodes closer to the
root, then there could also be more imbalance at nodes
with more species. To test this possibility, the distance
from any node to the root was defined as the number
of other nodes on the path from the given node to the
root. The weighted correlation between imbalance and
distance from the root is 0.03, although the correlation
would be negative if the greater imbalance at nodes with
more species were a secondary effect of proximity to the
root. The correlation was not negative in 80% of the bootstrap samples, indicating no significant correlation.
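The distance measure can be sketched as below, assuming a tree stored as a mapping from each node to its parent (with the root mapped to None). Both this representation and the reading of "other nodes on the path" as including the root itself, so that the root has distance 0 and its immediate descendants distance 1, are assumptions of this sketch.

```python
def distance_from_root(parent, node):
    """Number of other nodes on the path from the given node to the root."""
    steps = 0
    while parent[node] is not None:
        node = parent[node]
        steps += 1
    return steps
```

For example, with `parent = {"root": None, "a": "root", "b": "a"}`, the node `"b"` has two other nodes on its path to the root, namely `"a"` and the root.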
Another possibly relevant factor is the strength of the
data supporting nodes with different numbers of species.
If nodes with more species tend to be less strongly supported, then their greater imbalance could be explained
by the inverse relation between support and imbalance
found in simulated trees (Mooers and Heard, 1997). Investigation of this possibility is hampered by the heterogeneity of the published information on degree of
support for individual nodes, and also by the heterogeneity of the very methods used to construct the trees.
Information on degree of support ranges from none in
some trees to a variety of different measures in others,
depending on how the trees were constructed. One general albeit indirect measure of support can nevertheless
be derived from the fact that trees are most informative
if they have the highest degree of resolution justified
by the data. Consequently, the presence of nodes with
more than two branches suggests that the data are not
strong enough to support nodes with higher resolution.
In particular, nodes with more than two branches are
the inevitable result when poorly supported nodes collapse in a consensus tree; this process accounts for 91
of the 131 nodes with more than two branches in the
present data. Because each branch contributes its species
to the total at a node, the number of species per node
must be replaced by the mean number of species per
branch in comparisons of nodes with different numbers
of branches. The empirical question is whether nodes
with more branches also tend to have more species per
branch.
The unweighted correlation across nodes was therefore calculated between the logarithm of the number of
branches per node and the logarithm of the mean number of species per branch. The correlation is −0.02; the
correlation was nonpositive in 64% of the bootstrap samples. In case the correlation is diluted by heterogeneity
among the trees in the criteria used for collapsing nodes,
correlations were also calculated separately within each
of the 34 individual trees that include at least one node
with more than two branches; 18 of the correlations are
positive and 16 are negative. In case the correlations are
vitiated because only 10% of the nodes have more than
two branches, the mean numbers of species per branch
were also compared between nodes with two branches
and nodes with more than two branches; the geometric
mean was at least as great for nodes with two branches in
26% of the bootstrap samples and in 16 of the 34 individual trees. This series of null results suggests that nodes
with more species per branch do not have more branches,
and therefore that the greater imbalance of nodes with
more species is not an effect of weaker support.
DISCUSSION
Most measures of imbalance are defined for entire trees
and are thus suitable for comparisons between trees. The
imbalance scores of Fusco and Cronk (1995), however,
along with the weights of Purvis et al. (2002), can be defined for sets of nodes within trees and are thus appropriate for comparisons within as well as between trees.
In the first such comparison, Purvis and Agapow (2002)
showed an effect of taxonomic rank: imbalance tends to
be greater when calculated in terms of higher taxa rather
than species. The present comparison shows an effect of
total number of species: imbalance tends to be greater at
nodes with more species. The flexibility of the weighted
imbalance measure recommends its use in further comparisons within and between phylogenetic trees.
The proportional-to-distinguishable-arrangements
model has served its purpose in explaining why the
addition of random error to simulated data increases
the imbalance of estimated trees (Mooers and Heard,
1997). As a mechanism for generating trees, however,
the process of completely random choice embodied
in the model becomes less plausible as the number
of species increases and the number of possible trees
increases even faster. It is therefore no surprise that
the model progressively fails as an alternative to the
Markov model in describing the pattern of imbalance as
the number of species increases in the present data.
A better description of how trees are constructed might
start with the fact that for more than a few species, there
are far more possible trees than can ever be explored
exhaustively. Even computerized heuristic searches, the
fastest of which now rely on Bayesian statistics, rarely
attempt to find trees with more than a few hundred terminals (Huelsenbeck et al., 2001). Larger numbers of
species can only be accommodated with the aid of additional approximations. The most common approximation, used in nearly all the trees in the present sample, is
to substitute higher taxa as terminals in place of species.
This technique assumes that the higher taxa are strictly
monophyletic, and also that they can be adequately represented by a subset of their species or character states.
Another approximation, used in some of the largest
trees in the present sample, is to combine a number of
smaller trees into a single large one, commonly called
a supertree (Bininda-Emonds, 2004). Because these approximations become more common and controversial
with more species, nodes with more species may be
less well supported and for that reason more unbalanced. An admittedly indirect test of this possibility in
the present sample found no relation between the number of branches per node and the number of species per
branch. To the extent that more branches at a node indicate less support, these results suggest that the approximations necessary to construct nodes with many species
do not substantially undermine their support or exaggerate their imbalance.
In the context of biological explanations for imbalance, the present results partly account for the surprisingly high levels of imbalance pointed out by Heard
and Mooers (2002). If differences in diversification rates
evolve incrementally, then the effects of such differences
on imbalance should be inconspicuous at nodes with few
species before accumulating at nodes with more species.
Figure 1 does indeed show the predicted increase in imbalance with number of species, but the increase starts
from a level of imbalance that already indicates substantial differences in diversification rates. The high initial
level of imbalance remains to be explained.
In addition to the general increase in imbalance with
number of species, large trees like those in the present
sample contain a wealth of more detailed information
about imbalance that the present research has just begun
to explore. For instance, Figure 2 shows that as the number of species increases, the proportion of completely unbalanced trees approaches an asymptote that is close to
0 but nevertheless above 0. Also, Figure 3 shows a strikingly simple pattern in the distributions of imbalance,
which are nearly uniform except for a peak near the upper end of their range; the only apparent effect of increasing the number of species is to shift relative frequency
from the uniform portion to the peak. Any successful
models for phylogenetic trees will have to account for
these patterns.
ACKNOWLEDGMENTS
I thank Paul-Michael Agapow, Marshal Hedin, Roderic Page, and an
anonymous referee for their helpful suggestions.
REFERENCES
Bininda-Emonds, O. R. P. 2004. The evolution of supertrees. Trends
Ecol. Evol. 19:315–322.
Davies, T. J., T. G. Barraclough, M. W. Chase, P. S. Soltis, D. E. Soltis, and
V. Savolainen. 2004. Darwin’s abominable mystery: Insights from a
supertree of the angiosperms. Proc. Nat. Acad. Sci. USA 101:1904–
1909.
Fusco, G., and Q. C. B. Cronk. 1995. A new method for evaluating the
shape of large phylogenies. J. Theor. Biol. 175:235–243.
Guyer, C., and J. B. Slowinski. 1991. Comparisons of observed phylogenetic topologies with null expectations among three monophyletic
lineages. Evolution 45:340–350.
Heard, S. B., and A. Ø. Mooers. 2002. Signatures of random and selective mass extinctions in phylogenetic tree balance. Syst. Biol. 51:889–
897.
Huelsenbeck, J. P., F. Ronquist, R. Nielsen, and J. P. Bollback. 2001.
Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310–2314.
Katzourakis, A., A. Purvis, S. Azmeh, G. Rotheray, and F. Gilbert. 2001.
Macroevolution of hoverflies (Diptera: Syrphidae): The effect of using higher-level taxa in studies of biodiversity, and correlates of
species richness. J. Evol. Biol. 14:219–227.
Lecointre, G., and H. Le Guyader. 2001. Classification phylogénétique
du vivant, 2e édition. Belin, Paris.
Mooers, A. Ø., and S. B. Heard. 1997. Inferring evolutionary process
from phylogenetic tree shape. Q. Rev. Biol. 72:31–54.
Mooers, A. Ø., R. D. M. Page, A. Purvis, and P. H. Harvey. 1995. Phylogenetic noise leads to unbalanced cladistic tree reconstructions. Syst.
Biol. 44:332–342.
Purvis, A., and P.-M. Agapow. 2002. Phylogeny imbalance: Taxonomic
level matters. Syst. Biol. 51:844–854.
Purvis, A., A. Katzourakis, and P.-M. Agapow. 2002. Evaluating phylogenetic tree shape: Two modifications to Fusco and Cronk’s method.
J. Theor. Biol. 214:99–103.
Savage, H. M. 1983. The shape of evolution: Systematic tree topology.
Biol. J. Linn. Soc. 20:225–244.
Scotland, R. W., and M. J. Sanderson. 2004. The significance of few
versus many in the tree of life. Science 303:643.
Slowinski, J. B. 1990. Probabilities of n-trees under two models: A
demonstration that asymmetrical interior nodes are not improbable.
Syst. Zool. 39:89–94.
Slowinski, J. B., and C. Guyer. 1989. Testing the stochasticity of patterns
of organismal diversity: An improved null model. Am. Nat. 134:907–
921.
Stam, E. 2002. Does imbalance in phylogenies reflect only bias? Evolution 56:1292–1295.
First submitted 4 January 2005; reviews returned 31 March 2005;
final acceptance 7 June 2005
Associate Editor: Marshal Hedin