AMERICAN JOURNAL OF PHYSICAL ANTHROPOLOGY 98:355-367 (1995)
A Simulation Test of Smith’s “Degrees of Freedom” Correction
for Comparative Studies
CHARLES L. NUNN
Department of Biological Anthropology and Anatomy, Duke University,
DUMC Box 90383, Durham, North Carolina 27708-0383
KEY WORDS
Comparative methods, Phylogenetic constraint,
Nonindependence, Computer simulation
ABSTRACT
Computer simulation was used to test Smith’s (1994) correction for phylogenetic nonindependence in comparative studies. Smith’s
method finds effective N, which is computed using nested analysis of variance,
and uses this value in place of observed N as the baseline degrees of freedom
(df) for calculating statistical significance levels. If Smith’s formula finds the
correct df, distributions of computer-generated statistics from simulations
with observed N nonindependent species should match theoretical distributions (from statistical tables) with the df based on effective N.
The computer program developed to test Smith’s method simulates character evolution down user-specified phylogenies. Parameters were systematically varied to discover their effects on Smith’s method. In simulations in
which the phylogeny and taxonomy were identical (tests of narrow-sense
validity), Smith’s method always gave conservative statistical results when
the taxonomy had fewer than five levels. This conservative departure gave
way to a liberal deviation in type I error rates in simulations using more than
five taxonomic levels, except when species values were nearly independent.
Reducing the number of taxonomic levels used in the analysis, and thereby
eliminating available information regarding evolutionary relationships, also
increased type I error rates (broad-sense validity), indicating that this may
be inappropriate under conditions shown to have high type I error rates.
However, the use of taxonomic categories over more accurate phylogenies did
not create a liberal bias in all cases in the analysis performed here. The effect
of correlated trait evolution was ambiguous but, relative to other parameters,
negligible. © 1995 Wiley-Liss, Inc.
Species values in comparative studies are
not necessarily independent of one another.
However, the statistical techniques commonly used in these studies require independent data points. When species values are
falsely considered independent, the degrees
of freedom (df) appropriate for statistical
testing are inflated, resulting in an overstatement of statistical significance
(Felsenstein, 1985; Harvey and Pagel, 1991;
Martins and Garland, 1991).
Received July 1, 1994; accepted May 21, 1995.
All the new comparative methods that address problems of phylogenetic nonindependence are limited in their application and
have strict assumptions. Some methods,
such as Felsenstein’s (1985) independent
contrasts method, require a phylogeny with
known branch lengths. Other methods, including higher nodes approaches that use
nested analysis of variance (ANOVA) to identify an independent level of analysis (Clutton-Brock and Harvey, 1977), assume that
the taxonomy used estimates the actual phylogenetic relationships, while autocorrelation techniques (Cheverud et al., 1985; Gittleman and Kot, 1990) require a hierarchical
ordering of relatedness and an association
between phylogenetic relatedness and trait
variation (“phylogenetic correlation”: Gittleman and Luh, 1992, 1994). By averaging
traits of lower taxonomic levels, some methods, such as higher nodes approaches (Clutton-Brock and Harvey, 1977) and parsimony
techniques (Ridley, 1983; Maddison, 1990),
ignore species-level variation (Anthony and
Kay, 1993; Smith, 1994). Finally, although
computer software that calculates many of
the corrections is widely available, the mathematical complexity of most methods seems
to limit their application by many in the scientific community.
Smith’s (1994) method, a correction for the
degrees of freedom that uses nested ANOVA
to partition variance to taxonomic levels,
avoids some of the restrictions of other comparative methods. However, Smith’s method
has its own limitations. In Smith’s method,
species values are considered independent
for the initial calculation of a statistical measure. Only when computing statistical significance levels are phylogenetic effects
incorporated. Using nested ANOVA procedures on taxonomic levels, Smith partitions the overall sample variance into percentage variance components (PVCs). Smith
computes effective N, which is used in place
of observed N when calculating the degrees
of freedom, by multiplying the PVCs by the
number of taxa at each hierarchical level
and summing these values through the taxonomic levels under investigation:
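\[
\text{Effective } N = (\#\,\text{species})(\mathrm{PVC}_{\text{species}}) + (\#\,\text{genera})(\mathrm{PVC}_{\text{genera}}) + \cdots + (\#\,\text{taxa})(\mathrm{PVC}_{\text{taxa}}).
\]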
Because of the structure of evolutionary
trees, effective N must be less than or equal
to the number of species in the data set, or
observed N. The difference between effective
N and observed N depends on how variation
is partitioned across taxonomic levels. When
phylogenetic constraints are weak and species are relatively free to vary, variation will
occur mostly at lower taxonomic levels (species). In this case, the PVC for species approaches 1, and effective N approaches observed N. At the other extreme, when
phylogenetic effects are strong and lower
taxonomic levels are constrained, nested
ANOVA is expected to partition most of the
variance to higher taxonomic levels. In this
case, the PVC for species (and other low taxonomic levels) approaches 0, and effective N
is consequently much less than observed N.
By using nested ANOVA to partition variance, and then using this partitioning to find
the baseline degrees of freedom, Smith’s correction is supposed to account for phylogenetic nonindependence when calculating
statistical significance levels. In many ways,
Smith’s method is a compromise between the
older, nonphylogenetic comparative methods and their new counterparts; it allows one
to use the “traditional ‘equilibrium’ analysis”
(Martins and Garland, 1991) by providing
a correction for phylogenetic effects that is
implemented at a later stage in the analysis.
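The effective N formula above can be sketched in a few lines of Python. This is a minimal illustration, not Smith's own implementation: the function name `effective_n`, the three-level data set (40 species, 12 genera, 3 families), and both sets of PVCs are hypothetical values chosen only to show the weak-constraint and strong-constraint extremes described in the text.

```python
def effective_n(taxa_counts, pvcs):
    """Sum over taxonomic levels of (number of taxa) x (PVC at that level).

    taxa_counts and pvcs are listed in the same order, lowest level first.
    PVCs are proportions and are assumed to sum to 1.
    """
    if len(taxa_counts) != len(pvcs):
        raise ValueError("need one PVC per taxonomic level")
    if abs(sum(pvcs) - 1.0) > 1e-9:
        raise ValueError("PVCs must sum to 1")
    return sum(n * p for n, p in zip(taxa_counts, pvcs))


# Hypothetical data set: 40 species in 12 genera in 3 families.
counts = [40, 12, 3]

# Weak phylogenetic constraint: most variance at the species level.
weak = effective_n(counts, [0.90, 0.07, 0.03])

# Strong phylogenetic constraint: most variance at the family level.
strong = effective_n(counts, [0.05, 0.15, 0.80])
```

With these assumed PVCs, `weak` stays close to the observed N of 40 while `strong` falls far below it, which is the behavior the paragraph describes.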
Smith’s method has three advantages over
the other new comparative methods. First,
Smith’s method uses the observed variance
at each taxonomic level to estimate phylogenetic constraint and is thus independent of
a specific model of evolutionary change (for
other methods, see Gittleman and Luh,
1994). Second, Smith’s method uses all information at the species level, thus avoiding
the criticism that available information is
lost with new comparative methods (Anthony and Kay, 1993; Smith, 1994). Finally,
the ease of interpreting and calculating effective N, which
can be done with commonly used statistical packages, makes
Smith’s method an attractive alternative to
more mathematically complicated procedures. No transformations or contrasts are
required, further easing statistical interpretation of the output.
Although Smith’s method overcomes some
weaknesses of previous comparative methods, the method still has limitations. Smith’s
method may be unsatisfactory when the
taxonomy chosen by the user does not
accurately represent all evolutionary relationships, such as when phylogenetic relationships are misstated or nodes are incompletely resolved (polytomies). Inability to
meet the assumptions necessary for proper
nested ANOVA calculations, such as the requirements of homoscedasticity and the description of evolution as hierarchical, may
also limit application of Smith’s method. In
addition, equal sample sizes are preferred
when calculating nested ANOVAs. When
sample sizes are unequal, calculation of
nested ANOVAs becomes more complex, and
only approximate tests of significance are
available (Sokal and Rohlf, 1995).
Fig. 1. Example of an observed distribution (x-axis: correlation coefficient, r). The evolution of two traits is simulated down an input
phylogeny, and the values at the branch tips (the species values) are used to calculate the correlation
between the traits (r). The number of r in the observed distribution equals the number of simulations.
Smith’s method lowers the df, which
makes it more conservative than simply using observed N. In addition, Smith (1994)
suggests that his method gives conservative
statistical results even after taking phylogeny into account. However, the theoretical
basis of Smith’s method is unclear, and the
ability of Smith’s method to estimate the
true effective N, the value that a correction
for the degrees of freedom should calculate,
has not been established analytically.
This paper describes a simulation test of
Smith’s method. The computer program created for this test simulates the evolution of
two traits down fully known phylogenies,
which in some cases are identical to the taxonomy used to calculate effective N. By repeatedly simulating evolution down a known
phylogeny, the program generates distributions of statistics (“observed distributions”)
calculated from the resulting branch tip trait
values (Fig. 1). In the analysis of Smith’s
method, goodness-of-fit tests were used to
compare observed distributions to their expected distributions (calculated using
Smith’s method), and type I error rates and
true effective Ns were estimated. Thus, the
possible bias and statistical error associated
with Smith’s correction were established empirically with different phylogenies and under different evolutionary scenarios.
METHODS
The computer program was written in
THINK Pascal 4.0 (Symantec Corp., 1991, Cupertino, CA). It allows the user to specify
the input phylogeny, the correlation between
traits (ρ), and the number of simulations to
perform (=2,000 for all results discussed
here). Trait changes are calculated for each
internode branch on the input phylogeny,
and each trait change is normally distributed with mean equal to zero and variance
proportional to the branch length. This follows a model of evolutionary change known
as Brownian motion (Felsenstein, 1985),
where the variance accumulates at a rate
linearly proportional to time. The assumption of Brownian motion simplifies partitioning of variance when branch lengths for
each taxonomic level are equal, as the PVC
of a taxonomic level is simply its branch
length’s proportion of the total phylogeny’s
length.
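The branch-length shortcut just described can be illustrated with a short Python sketch. It assumes a balanced, bifurcating tree in which level L (counted from the root) holds 2^L taxa, and it reads "total phylogeny's length" as the summed root-to-tip length; both function names and the second set of branch lengths are hypothetical.

```python
def pvcs_from_branch_lengths(level_lengths):
    """PVC of each level = its branch length / summed root-to-tip length.

    Assumes Brownian motion and equal branch lengths within each level.
    """
    total = float(sum(level_lengths))
    return [length / total for length in level_lengths]


def effective_n_balanced(level_lengths):
    """Effective N for a balanced, bifurcating tree.

    level_lengths[0] belongs to the level nearest the root; level L
    (counted from 1) is assumed to hold 2**L taxa.
    """
    pvcs = pvcs_from_branch_lengths(level_lengths)
    return sum(pvc * 2 ** (i + 1) for i, pvc in enumerate(pvcs))


# Five levels, all branch lengths equal: every PVC is 0.2.
equal = effective_n_balanced([1, 1, 1, 1, 1])

# Four levels with long terminal branches (hypothetical lengths):
# species values are closer to independent, so effective N is a
# larger share of the 16 observed species.
long_tips = effective_n_balanced([1, 1, 1, 5])
```

The equal-branch case returns 12.4, the value the paper works out by hand for five taxonomic levels; lengthening the terminal branches shifts variance toward the species level and raises effective N relative to observed N.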
For each simulation of trait evolution the
trait values at the branch tips are used to
calculate a product-moment correlation coefficient (r). Each r is an estimate of the user-specified correlation ρ. The observed distribution is composed of the r calculated from
each simulation. Thus, each r is computed
using all species in the input phylogeny (observed N), and the number of rs in the observed distribution equals the user-specified
number of simulations.
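The simulation procedure can be sketched as follows. This is not the original THINK Pascal program; it is a minimal Python illustration that assumes a balanced, bifurcating tree with one branch length per taxonomic level, and the names `simulate_tips`, `pearson_r`, and `observed_distribution` are hypothetical. Correlated changes are built from two independent standard normal draws, a standard construction that the paper does not specify.

```python
import math
import random


def simulate_tips(level_lengths, rho, rng):
    """Evolve two traits by Brownian motion down a balanced bifurcating tree.

    Each branch adds a normal change with mean 0 and variance equal to the
    branch length; changes in the two traits have correlation rho.
    """
    nodes = [(0.0, 0.0)]  # trait values at the root
    for length in level_lengths:
        sd = math.sqrt(length)
        children = []
        for x, y in nodes:
            for _ in range(2):  # two daughter branches per node
                z1 = rng.gauss(0.0, 1.0)
                z2 = rng.gauss(0.0, 1.0)
                dx = sd * z1
                dy = sd * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
                children.append((x + dx, y + dy))
        nodes = children
    return nodes


def pearson_r(pairs):
    """Product-moment correlation of the tip values."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def observed_distribution(level_lengths, rho, n_sims, seed=0):
    """One r per simulation, each computed from all branch tips."""
    rng = random.Random(seed)
    return [pearson_r(simulate_tips(level_lengths, rho, rng))
            for _ in range(n_sims)]


# Small run for illustration; the paper used 2,000 simulations per setting.
rs = observed_distribution([1, 1, 1], rho=0.0, n_sims=200, seed=1)
```

Each call to `simulate_tips` uses every species on the tree, so the length of `rs` equals the number of simulations, as in the paper's observed distributions.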
If Smith’s effective N is an unbiased estimate of the true effective N, experimentally
generated distributions of correlation coefficients calculated from observed N should
match their expected distributions based on
effective N, despite the fact that effective N
must be less than observed N. Goodness-of-fit tests, calculation of type I error rates, and
comparison of effective N to true effective
N show whether Smith’s method correctly
estimates true effective N, and, if not,
whether the method is statistically conservative or liberal.
Statistical tests of goodness-of-fit (G-statistics, using Williams’ correction) were used
to compare observed distributions to their
expected distributions. When comparing two
observed distributions generated with different input phylogenies, the Kolmogorov-Smirnov two-sample test was used. Statistical measures of goodness-of-fit, such as G,
give the probability of a departure at least as large as that observed if the two distributions
have the same underlying distribution. With
this probability, the null hypothesis that no
difference exists between the distributions
(and, hence, that Smith’s method estimates
the correct baseline df) was accepted or rejected at a significance level of 5%.
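For readers unfamiliar with the G-statistic, the calculation can be sketched as below. The formulas are the standard ones for a goodness-of-fit G-test with Williams' correction (G divided by q = 1 + (a² - 1)/(6nν), with a categories, n observations, and ν = a - 1); the function name `g_adjusted` and the two-category example counts are hypothetical and are not taken from the paper.

```python
import math


def g_adjusted(observed, expected):
    """G-statistic for goodness of fit, with Williams' correction.

    observed and expected are frequency counts per category; expected
    counts are assumed positive and to sum to the same total as observed.
    Categories with an observed count of 0 contribute nothing to G.
    """
    n = sum(observed)
    a = len(observed)
    g = 2.0 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)
    q = 1.0 + (a * a - 1) / (6.0 * n * (a - 1))
    return g / q


# Hypothetical example: 40 observations in two categories,
# with an expected even split.
g_adj = g_adjusted([10, 30], [20, 20])
```

The adjusted statistic is then compared with the chi-square distribution with (number of categories - 1) degrees of freedom.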
Type I error rates were calculated by finding the percentage of the observed distribution that exceeds the α = 0.05 critical value,
where the α = 0.05 critical values were
taken from statistical tables (r distributions)
using effective N - 2 as the df. If Smith’s correction estimates the true effective N, type
I error rates should be approximately 5%.
If Smith’s correction is conservative, type I
error rates should be less than 5%, and if
Smith’s correction is liberal, type I error
rates should exceed 5%.
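The type I error calculation reduces to counting how many simulated r values exceed the tabulated critical value. The sketch below assumes a two-tailed test (|r| compared with the critical value), which the text does not state explicitly; the function names and the four example r values are hypothetical.

```python
def type_i_error_rate(r_values, r_critical):
    """Proportion of simulated r whose magnitude exceeds the critical value.

    Assumes a two-tailed test. r_critical is the tabulated critical r for
    df = effective N - 2, supplied by the caller.
    """
    exceed = sum(1 for r in r_values if abs(r) > r_critical)
    return exceed / len(r_values)


def classify(rate, alpha=0.05):
    """Label a type I error rate relative to the nominal alpha."""
    if rate < alpha:
        return "conservative"
    if rate > alpha:
        return "liberal"
    return "nominal"


# Hypothetical r values checked against the tabulated two-tailed
# critical r for 10 df (0.576).
rate = type_i_error_rate([0.10, -0.70, 0.60, 0.20], 0.576)
```

In practice the list of r values would hold one entry per simulation (2,000 in the paper), and sampling error means rates near 0.05 should not be over-interpreted.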
The true effective N was determined by
finding the expected distribution (Zar, 1984,
Table B.16) that (1) best fits the observed
data (using G-statistics), and (2) maintains
a conservative bias in type I error rates (type
I error rate ≤ α = 0.05). Harmonic interpolation was used to find noninteger df when
integer values for the true effective N did
not fit the data. The resolution used in finding the true effective N was limited to 0.25.
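Harmonic interpolation weights the two neighboring table entries by reciprocals of their df rather than by the df themselves. A minimal sketch, assuming the usual form of the procedure for statistical tables; the function name is hypothetical, and the example uses the commonly tabulated two-tailed α = 0.05 critical values of r for 10 and 11 df (0.576 and 0.553).

```python
def harmonic_interpolate(df, df_lo, crit_lo, df_hi, crit_hi):
    """Critical value at a noninteger df between two tabulated df.

    Interpolation is linear in 1/df, not in df.
    """
    weight = (1.0 / df_lo - 1.0 / df) / (1.0 / df_lo - 1.0 / df_hi)
    return crit_lo - weight * (crit_lo - crit_hi)


# Critical r for df = 10.4, i.e. effective N = 12.4.
r_crit = harmonic_interpolate(10.4, 10, 0.576, 11, 0.553)
```

Here `r_crit` is about 0.566, slightly closer to the 10 df entry than simple linear interpolation would give.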
Narrow- versus broad-sense validity
Two sets of simulations were performed
that test different aspects of Smith’s correction. First, simulations were conducted
when the assumptions of Smith’s method
hold. This set of simulations tests the method’s “narrow-sense” validity (Pagel and Harvey, 1992), a type of validity that applies
when the assumptions of the comparative
method are met by the data set. One key
assumption of Smith’s method is that the
taxonomy used accurately reflects all phylogenetic relationships. This assumption is
preserved in tests of narrow-sense validity
by simulating evolution on a phylogeny that
is really an ideal taxonomy: the phylogeny
is a dichotomously branching tree with equal
branch lengths at each taxonomic level. (The
terms “phylogeny” and “taxonomy” are thus
interchangeable in tests of narrow-sense validity.) Branch lengths can differ between
taxonomic levels (e.g., terminal branches
may be longer than internal branches), but
within each taxonomic level branches have
the same length. An example of such an input phylogeny is given in Figure 2. Three
parameters were experimentally manipulated in this series of simulations: the number of taxonomic levels; how variance accumulates at successive taxonomic levels (or,
the amount of phylogenetic nonindependence); and the correlation between the simulated traits (ρ). These tests are summarized
in Table 1.
Fig. 2. In tests of narrow-sense validity the input
phylogeny was really an ideal taxonomy: within each
taxonomic level branch lengths were equal. However,
branch lengths could differ between taxonomic levels.
This is shown in the hypothetical phylogeny here, with
the terminal level having longer branch lengths than
internal levels.
The assumptions of a comparative study
are almost never perfectly satisfied in actual
data sets. Therefore, the second set of simulations tests the method’s “broad-sense” validity (Pagel and Harvey, 1992), or how the
method fares when some (or all) of its assumptions are broken. Smith’s method uses
taxonomic categories. Therefore, one potential problem with the method is that an accurate phylogeny, because it includes branch
lengths, will provide more information about
evolutionary relationships than an accurate
taxonomy. In addition, as fewer taxonomic
levels are used in the comparative study,
more phylogenetic information is lost: essentially, the phylogeny has more unresolved
nodes, or polytomies. Decreasing the resolution of phylogenetic relationships has been
shown to increase the type I error rates of
other methods (Purvis et al., 1994), and the
same may be true of Smith’s method.
To test this possibility, evolution was simulated down an input phylogeny used in previous simulation studies [Sessions and Larson’s (1987) plethodontid salamander
phylogeny used in Martins and Garland’s
(1991) study, chosen because branch lengths
are provided]. From the input phylogeny,
three possible taxonomies were created,
each having a different number of taxonomic
levels. The taxonomy using the most levels
best approximates the input phylogeny,
while the taxonomy using the fewest levels
has the poorest resemblance to the input
phylogeny. By calculating Smith’s effective
N for each of these taxonomies and then comparing these values to the true effective N,
the effect on type I error of using unresolved
evolutionary relationships was established
for this phylogeny.
RESULTS
Narrow-sense validity
For testing the narrow-sense validity of
Smith’s method, simulations were conducted
with the parameter of interest varying,
while all other parameters were held constant. The input phylogeny bifurcated at
each node. Three parameters were systematically varied: the number of taxonomic levels, the distribution of variance across taxonomic levels, and the correlation between
the simulated traits (ρ). The number of taxonomic levels and ρ are not included as variables in Smith’s formula for effective N.
However, effective N does incorporate the
distribution of variance across taxonomic
levels by using nested ANOVA to calculate
PVCs.
TABLE 1. Summary of the simulations conducted
Type of validity | Factor investigated | Parameters simulated
Narrow-sense validity | Number of taxonomic levels | 3-9 taxonomic levels, all branch lengths equal
Narrow-sense validity | Variance partitioning (PVCs differ) | Effective N/observed N ranges from 0.125 to 0.875 in increments of 0.125; taxonomic levels range from 4 to 7
Narrow-sense validity | Trait correlation (ρ) | ρ = 0, 0.25, 0.50, 0.75, 0.90, 0.95, and 0.98; various effective N and number of taxonomic levels
Broad-sense validity | The phylogenetic resolution of the taxonomy used in finding effective N | 3-5 taxonomic levels (5-11 internal nodes), taxonomies based on Sessions and Larson’s (1987) plethodontid salamander phylogeny
Number of taxonomic levels. To determine whether the number of taxonomic levels biases estimates of true effective N, simulations were conducted with the number of
taxonomic levels ranging from 3 to 9 (seven
different input phylogenies). All branch
lengths on the input phylogenies were equal.
Using G-tests, the observed distributions
were compared to an expected distribution
of correlation coefficients with df = effective
N - 2 (Zar, 1984, Table B.16; 9 ranges of the
expected distribution are provided by Zar
and used in the analysis; a 10th range was
added to reflect the limits of correlation coefficients: -1 to +1).
Because the computer program simulates
evolution by Brownian motion and because
all branch lengths were equal, PVCs were
the same for each taxonomic level (holding
the number of taxonomic levels constant).
The input phylogeny bifurcates at each node,
so at each taxonomic level, 2^L taxa occur,
where L is the hierarchical level counted
from the root of the tree. These values (PVCs and the number of taxa at each hierarchical level) were used to calculate effective
N. For example, with five taxonomic levels,
PVC = 1.0/5 = 0.2 at each level. Thus,
\[
\text{Effective } N = 0.2\,(2^1 + 2^2 + 2^3 + 2^4 + 2^5) = 12.4.
\]
Harmonic interpolation between values
from statistical tables was used to find noninteger df. Although noninteger df have only
theoretical value, interpolation eliminates
the conservative departure from type I error
that is more pronounced at small values of
effective N.
The df appropriate for finding values from
statistical tables of r distributions is n-2.
Returning to the example, using Smith’s
method with harmonic interpolation, the observed distribution of correlation coefficients
should follow the expected distribution with
12.4 - 2 = 10.4 df. By contrast, an analysis
that ignores phylogenetic effects would use
2^5 - 2 = 30 df (5 taxonomic levels in a balanced, bifurcating phylogeny gives 2^5 = 32
branch tips, or observed N). Table 2 lists
observed Ns and effective Ns for simulations
in which the number of taxonomic levels
was varied.
TABLE 2. Observed N, effective N, and effective N/observed N for simulations with taxonomic levels varying and branch lengths equal
No. of taxonomic levels | Observed N | Effective N | Effective N/observed N
3 | 8 | 4.67 | 0.584
4 | 16 | 7.50 | 0.469
5 | 32 | 12.40 | 0.388
6 | 64 | 21.00 | 0.328
7 | 128 | 36.29 | 0.284
8 | 256 | 63.75 | 0.249
9 | 512 | 113.56 | 0.222
G-statistics were compared to the χ² distribution with df = number of categories - 1.
To avoid observed frequencies of 0 in a category, which complicates calculation of G-statistics, simulations of input phylogenies with
3 and 4 taxonomic levels were pooled into
seven ranges of the theoretical distribution.
For the remaining simulations (5 to 9 taxonomic levels), it was possible to use all 10
ranges of the theoretical distribution.
Table 3 lists effective N, G_adj, the type I
error rate, and true effective N for simulations run with different numbers of taxonomic levels. For all tests the observed
distribution differed significantly from
expected (all P-values are <0.005). Using
G_adj as an estimate of goodness of fit, Smith’s
correction worked best in the middle ranges
of taxonomic levels simulated (5 and 6 levels); simulations on input phylogenies with
fewer (3 and 4) or more (7 to 9) taxonomic
levels generated observed distributions that
departed more from expected.
When 3 to 5 taxonomic levels were simulated, Smith’s correction gave conservative
type I error rates (type I error < α = 0.05).
This conservative departure disappeared
when 6 or more taxonomic levels were simulated (type I error > α = 0.05). For simulations of 8 and 9 taxonomic levels, the type I
error rate was substantial (8 levels, type I
error = 0.175; 9 levels, type I error = 0.243).
Another way of looking at this is by comparing Smith’s effective N to true effective N:
as more taxonomic levels were simulated,
Smith’s effective N overestimated true effective N, indicating a liberal statistical bias
to the method as more taxonomic levels are
included in the analysis (Fig. 3). Note, however, that few comparative studies use more
than 5 taxonomic levels (see examples in
Harvey and Pagel, 1991), and Smith’s
method is therefore expected to be conservative in the narrow sense.
TABLE 3. Results of simulations, with the number of taxonomic levels varying and branch lengths equal¹
Taxonomic levels | Effective N | G_adj | Type I error | True effective N
3 | 4.67 | 220.735 | 0.010 | 6.25
4 | 7.50 | 76.438 | 0.019 | 9.00
5 | 12.40 | 24.459 | 0.038 | 13.50
6 | 21.00 | 29.036 | 0.069 | 18.00
7 | 36.29 | 266.545 | 0.129 | 24.00
8 | 63.75 | 689.022 | 0.175 | 31.00
9 | 113.56 | 1600.100 | 0.243 | 39.00
¹All observed distributions are significantly different from expected (G_adj: 3 and 4 taxonomic levels, df = 6; ≥5 taxonomic levels, df = 9; all P-values <0.005).
All internode branch lengths were equal
in this set of simulations. This essentially
assumes a punctuational model of evolutionary change, where the amount of change is
proportional to the number of speciation
events (Martins and Garland, 1991; Martins, 1993). More importantly, the equality
of branch lengths means that effective N/observed N declines as more taxonomic levels are simulated (Table 2). Consequently,
the increase in type I error rates as more
taxonomic levels were simulated may instead only reflect the decline in effective N/observed N, which is really a measure of the
nonindependence in the data set. The next
section separates the effects of these two parameters.
TABLE 4. Results of Kolmogorov-Smirnov two-sample test, testing for the effect on observed distributions of changing the input phylogeny while holding effective N and number of taxonomic levels constant¹
Type of test | Comparison parameters
Magnitude of branch lengths differs (but relative branch lengths the same) | Variance = 1 vs. var = 10, 4 taxonomic levels
Magnitude of branch lengths differs (but relative branch lengths the same) | var = 1 vs. var = 100, 4 taxonomic levels
Magnitude of branch lengths differs (but relative branch lengths the same) | var = 1 vs. var = 10, 6 taxonomic levels
Magnitude of branch lengths differs (but relative branch lengths the same) | var = 1 vs. var = 100, 6 taxonomic levels
Relative branch lengths differ | Effective N = 20, 5 taxonomic levels
Relative branch lengths differ | Effective N = 25, 6 taxonomic levels
Relative branch lengths differ | Effective N = 30, 6 taxonomic levels
D values for the seven comparisons, in the order they appear in the source (assignment to rows is not recoverable from the layout): 0.0240, 0.0185, 0.0190, 0.0345, 0.0100, 0.0260, 0.0185; all n.s.
¹Critical value of D = 0.0430.
Variance partitioning. When species values are less independent, variance is partitioned mostly to high taxonomic levels (e.g.,
orders), reducing effective N/observed N.
When species values are nearly independent, the variance is partitioned mostly to
low taxonomic levels (e.g., species), and effective N/observed N approaches 1. Effective
N/observed N, really a measure of phylogenetic nonindependence, may explain the deviations in type I error rates noted above.
Internode and terminal branch lengths of
the input phylogeny were altered such that
effective N/observed N varied from 0.125 to
0.875 in 0.125 increments. For low values of
effective N (strong phylogenetic nonindependence), higher taxonomic levels had the longest relative branch lengths, while for high
values of effective N (species values nearly
independent), lower taxonomic levels had
the longest relative branch lengths. To preserve the assumption that the taxonomy accurately reflects all phylogenetic relationships, branch lengths were identical within
each taxonomic level. Simulations were run
on input phylogenies with 4, 5, 6, and 7 taxonomic levels (effective N/observed N = 0.125
was not possible for 4 taxonomic levels without some branch lengths of 0 and, consequently, it was not tested here). By varying
both the number of taxonomic levels and effective N/observed N, the effects of these two
parameters could be separated.
Fig. 3. Graphic representation of effective N and true effective N, the value that Smith’s method
should calculate, with the number of taxonomic levels (x-axis) varying and branch lengths equal. Smith’s method
underestimated true effective N when 3 to 5 taxonomic levels were simulated; when more than 5
taxonomic levels were simulated, Smith’s method overestimated true effective N.
Before simulating different effective Ns
by changing relative branch lengths, the relative branch lengths themselves had to be
excluded as factors influencing observed distributions and type I error rates. The Kolmogorov-Smirnov two-sample test was used to
compare observed distributions from input
phylogenies that differed in their branch
lengths but had the same number of taxonomic levels and effective N. Two types of
changes in input phylogenies were tested:
(1) differences in the magnitude of internode
branch lengths (with relative branch lengths
the same), and (2) differences in relative
branch lengths. Table 4 presents the results
of these tests. In no cases were significant
differences found between distributions simulated with the same number of taxonomic
levels and effective N but different absolute
or relative branch lengths, suggesting that
effective N could be varied by changing
branch lengths.
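The two-sample comparison can be sketched in Python as below. The D statistic is the largest gap between the two empirical cumulative distributions; the critical value uses the standard large-sample approximation with coefficient 1.36 for α = 0.05, which is an assumption here since the paper does not say how its critical value was obtained. Function names are hypothetical.

```python
import math


def ks_two_sample_d(sample_a, sample_b):
    """Largest absolute difference between two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n_a, n_b = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n_a and j < n_b:
        x = min(a[i], b[j])
        # Step both CDFs past every value equal to x (handles ties).
        while i < n_a and a[i] <= x:
            i += 1
        while j < n_b and b[j] <= x:
            j += 1
        d = max(d, abs(i / n_a - j / n_b))
    return d


def ks_critical_d(n_a, n_b, coefficient=1.36):
    """Large-sample critical D; 1.36 corresponds to alpha = 0.05."""
    return coefficient * math.sqrt((n_a + n_b) / (n_a * n_b))


# Two observed distributions of 2,000 simulations each.
d_crit = ks_critical_d(2000, 2000)
```

For two samples of 2,000 simulations, this approximation gives a critical D of about 0.0430; two observed distributions differ significantly at the 5% level only if their D exceeds it.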
Returning to simulations with different effective Ns and different numbers of taxonomic levels, Table 5 lists the parameters
simulated and their type I error rates. Figure 4 provides a graphic representation of
the results. As effective N/observed N was
increased (species values were more independent) type I error rates approached the
expected type I error rate (α = 0.05). This
suggests that Smith’s correction works best
when branch tip data points are nearly independent. As effective N/observed N was decreased the magnitude and the direction of
departure from α depended on the number
of taxonomic levels simulated. Type I error
rates for simulations of 4 and 5 taxonomic
levels never exceeded α = 0.05, suggesting
that Smith’s method is statistically conservative, regardless of effective N/observed N,
when 5 or fewer taxonomic levels are used
in the study. By contrast, results from simulations of 6 and 7 taxonomic levels indicate
that Smith’s method is statistically liberal
when effective N/observed N is less than
about 0.75. Furthermore, this liberal deviation increased as effective N/observed N decreased. In every case, for a given effective
N/observed N, simulations on input phylogenies with more taxonomic levels had higher
type I error rates.
TABLE 5. Results of simulations with effective N and number of taxonomic levels varying
Taxonomic levels | Effective N | Effective N/observed N | Type I error
4 | 4 | 0.25 | 0.002
4 | 6 | 0.375 | 0.019
4 | 8 | 0.5 | 0.0185
4 | 10 | 0.625 | 0.0275
4 | 12 | 0.75 | 0.0255
4 | 14 | 0.875 | 0.0385
5 | 4 | 0.125 | 0.003
5 | 8 | 0.25 | 0.036
5 | 12 | 0.375 | 0.035
5 | 16 | 0.5 | 0.037
5 | 20 | 0.625 | 0.034
5 | 24 | 0.75 | 0.0345
5 | 28 | 0.875 | 0.04
6 | 8 | 0.125 | 0.143
6 | 16 | 0.25 | 0.092
6 | 24 | 0.375 | 0.0735
6 | 32 | 0.5 | 0.076
6 | 40 | 0.625 | 0.064
6 | 48 | 0.75 | 0.0465
6 | 56 | 0.875 | 0.0515
7 | 16 | 0.125 | 0.246
7 | 32 | 0.25 | 0.151
7 | 48 | 0.375 | 0.116
7 | 64 | 0.5 | 0.109
7 | 80 | 0.625 | 0.097
7 | 96 | 0.75 | 0.0515
7 | 112 | 0.875 | 0.0545
Fig. 4. Type I error rates (expected = 0.05) with the number of taxonomic levels and effective N/observed N (x-axis) varying. When 4 or 5 taxonomic levels were simulated, Smith’s method always gave conservative statistical significance levels. For 6 and 7 taxonomic levels, decreasing effective N/observed N increased type I error rates.
Correlation between traits. To test for an
effect of correlated character evolution on
Smith’s method, simulations were run with
ρ = 0.00, 0.25, 0.50, 0.75, 0.90, 0.95, and
0.98. Different numbers of taxonomic levels
(4,5, and 6) and effective Ns were simulated.
Linear regression techniques were used to
see whether a relationship exists between
type I error rates and ρ.
Early results suggested that as ρ increases
type I error rates also increase, although this
positive relationship was small (linear regression: b = 0.0217, P = 0.0054). However,
this result was not consistently found when
tested over a wider range of parameters. In
fact, b ranged from -0.0119 to 0.0293, with
most values insignificantly different from 0.
In all cases b was small, indicating that the
effect of correlated trait evolution on type I
error rates is minor compared to the effect
of the parameters discussed above.
One explanation for the inconsistency of
this result may be the method of calculating
critical values for nonzero input correlations. As ρ approaches its limits (±1.0), the
statistical distributions become more asymmetrical. Authors differ in their methods of
finding critical values for ρ ≠ 0 (e.g., compare Zar, 1984, with Sokal and Rohlf, 1995).
Furthermore, the approximations used may
be biased in an unknown way for small sample sizes and high values of ρ. This could
create an observed increase in type I error
rates for some sets of simulation parameters
when in fact no relationship exists.
One way of dealing with these potential
biases is to find critical values by computer
simulation. These values are, however, only
estimates of the true critical values. This
means that finding a relationship between
ρ and type I error rates is still difficult, as an
actual trend would be diluted by stochastic
error in the estimated critical values. This
is especially true when the effect is small,
as initial analyses indicated here. In one series of simulations (6 taxonomic levels, all
internode branches equal) with critical values found by simulation (n = 10,000), a significant positive trend was discovered
(b = 0.0293; P = 0.0273). However, under
the same simulation conditions and four taxonomic levels, the trend was negative, but
not significantly so (b = -0.0119; P =
0.057). These were also the most extreme
values of b found.
The inconsistency in these findings may
reflect (1) unknown variables or an interaction between variables, (2) a lack of resolution in finding critical values, because of either biased statistical methods or stochastic
error associated with computer simulations,
or (3) statistical artifacts coupled with no
real trend. In all cases, the effect is small
relative to the effects of the other parameters tested.
Broad-sense validity
Some researchers, citing the narrow-sense
results from above, may be inclined to group
their species into fewer taxonomic levels to
reduce excessively high type I error rates.
However, the above results are from tests of
narrow-sense validity, where the taxonomy
and the phylogeny were identical. Grouping
the species into fewer taxonomic levels
breaks this assumption by obscuring the
species’ true phylogenetic relationships, and
this makes it incorrect to apply tests of narrow-sense validity to this situation. This section tests the “broad-sense” validity of
Smith’s method, or how the method works
when some of its assumptions are broken.
As Pagel and Harvey (1992) point out,
there are an infinite number of ways in
which the assumptions of a comparative
method can fail. Therefore, only a subset of
possible violations can ever be tested experimentally. Smith’s method makes use of the
observed variance at each taxonomic level
and therefore makes no assumptions about
the model of evolutionary change. Consequently, the model of evolutionary change is
not an interesting parameter to test in this
case. Instead, the analysis of broad-sense validity here focuses on how rearrangements
of the taxonomy, where the taxonomy used
in calculating effective N does not equal the
input phylogeny, affect Smith’s method.

Fig. 5. Sessions and Larson’s (1987) plethodontid
salamander phylogeny, used as the input phylogeny in
tests of broad-sense validity. Tip labels legible in the figure: Ensatina (A), Aneides ferreus (B), A. flavipunctatus (C), A. lugubris (D), A. hardii (E), Plethodon larselli (F), P. elongatus (G), P. vehiculum (H), P. dunni (I), P. jordani (M), P. yonahlossee (N), P. glutinosus (O).
Evolution was simulated using Sessions
and Larson’s (1987) plethodontid salamander phylogeny (Fig. 5) as the input phylogeny
(also used in Martins and Garland, 1991),
and the resulting observed distribution was
used to find the true effective N. Three taxonomies, each with fewer taxonomic levels (5,
4, and 3), were created from the input phylogeny. Taxonomies with more levels best approximated the input phylogeny, and thus
had the greatest phylogenetic resolution.
However, no taxonomy provided the resolution of the input phylogeny. The taxonomies
created and used in the analysis are shown
in Figure 6.
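The idea of simulating a trait down an input phylogeny can be sketched as follows. The sketch assumes a Brownian motion model and a hypothetical four-species tree with unit branch lengths; neither is taken from this study, and the tree is not Sessions and Larson's phylogeny. The names `TREE` and `simulate_tips` are illustrative.

```python
import math
import random

# Hypothetical tree: each node is (name, branch_length, children).
TREE = ("root", 0.0, [
    ("anc1", 1.0, [("sp_A", 1.0, []), ("sp_B", 1.0, [])]),
    ("anc2", 1.0, [("sp_C", 1.0, []), ("sp_D", 1.0, [])]),
])


def simulate_tips(node, parent_value, rng, sigma=1.0, tips=None):
    """Simulate one trait down the tree under Brownian motion.

    The change along a branch is drawn from a normal distribution with
    mean 0 and variance sigma**2 * branch_length. Returns a dict of
    tip (species) values.
    """
    if tips is None:
        tips = {}
    name, length, children = node
    value = parent_value + rng.gauss(0.0, sigma * math.sqrt(length))
    if not children:
        tips[name] = value
    for child in children:
        simulate_tips(child, value, rng, sigma, tips)
    return tips


rng = random.Random(1)
print(simulate_tips(TREE, 0.0, rng))
```

Species that share more of their history (here sp_A and sp_B) inherit the same ancestral changes, so their simulated values covary. This is the nonindependence that an effective N is meant to account for.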
Then, with a set of branch tip data points,
simulated using Sessions and Larson’s phylogeny as the input phylogeny, PVCs were
calculated using the procedure “proc nested”
in SAS (SAS Institute Inc., 1992, Cary, NC).
This procedure was repeated 25 times, each
time with different simulated branch tips,
and the PVCs calculated from the 25 nested
ANOVAs were averaged for each taxonomic
level. The entire process was then repeated
for the next taxonomic arrangement. Table
6 gives the PVCs for each of the taxonomies.

Fig. 6. Taxonomies used in finding effective N when testing broad-sense validity. Letters correspond
to the species in Fig. 5. Sessions and Larson's (1987) phylogeny broken down into 5 taxonomic levels
(a), 4 taxonomic levels (b), and 3 taxonomic levels (c).

TABLE 6. Tests of broad-sense validity: number of taxa and PVCs for each taxonomic level

Taxonomic levels   Level number   Number of taxa   PVC
5                  1              2                0.3405
                   2              3                0.1307
                   3              5                0.1262
                   4              9                0.1914
                   5              15               0.2111
4                  1              2                0.3580
                   2              3                0.1358
                   3              5                0.1649
                   4              15               0.3413
3                  1              2                0.3836
                   2              3                0.0980
                   3              15               0.5184

To summarize, traits were simulated down
the actual plethodontid salamander phylogeny (to find true effective N), but PVCs were
calculated from the taxonomies (to find
Smith's effective N). The departure of Smith's effective N from true effective N thus estimates
how Smith's method fares when evolutionary relationships are obscured by eliminating taxonomic levels. Because none of the
taxonomies uses all the available phylogenetic information, the effect of using taxonomic relationships over phylogenetic ones
is also tested.
The true effective N for Sessions and Larson's input phylogeny is 7.5 (resolution to
0.25, type I error = 0.048). Table 7 gives effective N and the type I error rate for each
of the taxonomies. Even though none of the
taxonomies captures all the evolutionary information of the input phylogeny, the taxonomies using four and five levels still give
conservative statistical significance levels.
However, a noticeable trend occurs, with taxonomies with fewer levels having higher
type I error rates. This result differs from
the relationship found in tests of narrow-sense validity, where type I error rates decreased as fewer taxonomic levels were used.
This trend suggests that intentionally reducing the number of taxonomic levels increases
type I error rates. Therefore, reducing the
number of taxonomic levels may not eliminate the excessive type I error rates associated with other parameters.

TABLE 7. Effective N, effective N/observed N, and type I error rates from tests of broad-sense validity

Taxonomic levels   Effective N   Effective N/observed N   Type I error rate
5                  6.593         0.4395                   0.0300
4                  7.067         0.4711                   0.0380
3                  8.837         0.5891                   0.0795
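
The effective N values in Table 7 can be reproduced from the taxa counts and PVCs in Table 6 if effective N is taken to be the PVC-weighted sum of the number of taxa at each level. That weighting is an assumption here, inferred from the tabled values rather than quoted from Smith (1994); the names `effective_n` and `TABLE_6` are illustrative.

```python
OBSERVED_N = 15  # number of species in the input phylogeny

# Table 6: taxonomic levels -> (number of taxa per level, PVC per level)
TABLE_6 = {
    5: ([2, 3, 5, 9, 15], [0.3405, 0.1307, 0.1262, 0.1914, 0.2111]),
    4: ([2, 3, 5, 15], [0.3580, 0.1358, 0.1649, 0.3413]),
    3: ([2, 3, 15], [0.3836, 0.0980, 0.5184]),
}


def effective_n(taxa_counts, pvcs):
    """PVC-weighted sum of the number of taxa at each taxonomic level."""
    if len(taxa_counts) != len(pvcs):
        raise ValueError("need one PVC per taxonomic level")
    return sum(count * pvc for count, pvc in zip(taxa_counts, pvcs))


for levels, (taxa_counts, pvcs) in TABLE_6.items():
    n_eff = effective_n(taxa_counts, pvcs)
    print(f"{levels} levels: effective N = {n_eff:.3f}, "
          f"ratio = {n_eff / OBSERVED_N:.4f}")
```

The computed values agree with Table 7 to within rounding of the published PVCs. Because the species level carries the largest count (15), shifting variance toward that level when intermediate levels are dropped raises effective N, which is consistent with the higher type I error rates for the coarser taxonomies.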
DISCUSSION
When the assumptions of Smith's method
hold (narrow-sense validity) and 5 or fewer
taxonomic levels are used, the simulation
results show that Smith’s df correction gives
conservative statistical results, regardless of
how nonindependent the branch tips are. Because most comparative studies employ 5 or
fewer taxonomic levels, the liberal departure
from expected type I error with more than
5 taxonomic levels is not a serious shortcoming of Smith’s method, provided users are
aware of this limitation. However, the conservative nature of Smith’s method means
that a false null hypothesis (e.g., zero correlation
when the true correlation coefficient is nonzero) is more likely to be
retained by the statistical tests. In other
words, conservative type I error rates come
at the expense of statistical power.
If the true evolutionary relationships are
best represented by more than 5 taxonomic
levels, the simulation results suggest this
conservative departure in type I error rates
disappears and, in many cases, becomes liberal. When the species were generally independent (effective N/observed N ≥ 0.75), type
I error rates approached expected error
rates. However, as nonindependence increased, type I error rates became more statistically liberal. Tests of broad-sense validity suggest that deliberately sacrificing
phylogenetic resolution by reducing the
number of taxonomic levels may not eliminate liberal deviations in type I error rates.
However, although liberal bias increases as
phylogenetic relationships are obscured, the
broad-sense results also show that using taxonomic categories over phylogenies does not
necessarily invalidate Smith’s method, as
statistical results can still be conservative.
The effect of correlated trait evolution on
type I error rates is difficult to evaluate.
However, the simulation results suggest
that if it occurs, the effect is positive (type
I error rates increased with higher ρ) but
small (highest value of b = 0.0293). Given
that Smith’s method is generally used in situations expected to give conservative statistical results, this possible trend, because it
is small, does not impose a serious limitation
on the method.
The computer simulation program developed for this study could be used to estimate
the true effective N, which could then be used in place of
Smith’s effective N when testing statistical
significance. This would have an advantage
over Smith’s method by having a type I error
rate equal to the expected type I error rate;
in other words, true effective N would be
neither conservative nor liberal.
However, calculation of true effective N by
computer simulation would require a phylogeny with known branch lengths, as well
as assumptions about the model of evolutionary change (Martins and Garland, 1991).
Such a technique would also require the use
of computer simulation, which is probably
more difficult (in terms of application) than
using computer programs that implement
other comparative methods. Thus, the benefits of Smith’s method are lost by using computer simulation to empirically find true effective N, and, if the necessary information
is available, other methods, notably
Felsenstein’s (1985) independent contrasts
method, may actually be easier to implement. Tests of narrow-sense validity of independent contrasts methods show that type
I error rates equal expected error rates
(Martins and Garland, 1991; Purvis et al.,
1994).
This initial round of simulations suggests
that Smith’s method is statistically conservative under the conditions common to most
comparative studies, provided the taxonomy
employed captures most of the evolutionary
relationships. Because the correction is
mathematically tractable, Smith’s method
may be used for many data sets as an alternative to other comparative methods, particularly in the early stages of a comparative
analysis.
ACKNOWLEDGMENTS
I thank Marcy Uyenoyama, Diane Waddle,
and Frances White for their comments on
an early version of this work. Emilia Martins
helped in many aspects of this project, and
the suggestions of John Gittleman, Lyle
Konigsberg, Richard Smith, and an anonymous reviewer greatly improved the first
draft of this paper. Thanks also to Joe
Felsenstein for introducing me to the subject
of comparative studies, and to Ken Korey for
his help and encouragement over the years.
This research was supported by an NSF
Graduate Student Fellowship.
LITERATURE CITED
Anthony MRL, and Kay RF (1993) Tooth form and diet
in Ateline and Alouattine primates: Reflections on the
comparative method. Am. J. Sci. 293-A:356-382.
Cheverud JM, Dow MM, and Leutenegger W (1985)
The quantitative assessment of phylogenetic constraints in comparative analyses: Sexual dimorphism
in body weight among primates. Evolution 39:
1335-1351.
Clutton-Brock TH, and Harvey PH (1977) Primate ecology and social organization. J. Zool. (Lond.) 183:1-33.
Felsenstein J (1985) Phylogenies and the comparative
method. Am. Nat. 125:1-15.
Gittleman JL, and Kot M (1990) Adaptation: Statistics
and a null model for estimating phylogenetic effects.
Syst. Zool. 39:227-241.
Gittleman JL, and Luh HK (1992) On comparing comparative methods. Annu. Rev. Ecol. Syst. 23:383-404.
Gittleman JL, and Luh HK (1994) Phylogeny, Evolutionary Models, and Comparative Methods: A Simulation
Study. In Eggleton P, and Vane-Wright RI (eds.): Phylogenetics and Ecology. London: Academic Press, pp.
103-122.
Harvey PH, and Pagel MD (1991) The Comparative
Method in Evolutionary Biology. Oxford: Oxford University Press.
Maddison WP (1990) A method for testing the correlated
evolution of two binary characters: Are gains or losses
concentrated on certain branches of a phylogenetic
tree? Evolution 44:539-557.
Martins EP (1993) A comparative study of the evolution
of Sceloporus push-up displays. Am. Nat. 142:
994-1018.
Martins EP, and Garland T Jr (1991) Phylogenetic analyses of the correlated evolution of continuous characters: A simulation study. Evolution 45:534-557.
Pagel MD, and Harvey PH (1992) On solving the correct
problem: Wishing does not make it so. J. Theor.
Biol. 156:425-430.
Purvis A, Gittleman JL, and Luh HK (1994) Truth or
consequences: Effects of phylogenetic accuracy on two
comparative methods. J. Theor. Biol. 167:293-300.
Ridley M (1983) The Explanation of Organic Diversity.
The Comparative Method and Adaptations of Mating.
Oxford: Clarendon Press.
Sessions SK, and Larson A (1987) Developmental correlates of genome size in plethodontid salamanders and
their implications for genome evolution. Evolution
41:1239-1251.
Smith RJ (1994) Degrees of freedom in interspecific allometry: An adjustment for the effects of phylogenetic
constraint. Am. J. Phys. Anthropol. 93:95-107.
Sokal RR, and Rohlf FJ (1995) Biometry. New York:
WH Freeman.
Zar JH (1984) Biostatistical Analysis. Englewood Cliffs,
NJ: Prentice-Hall.