AMERICAN JOURNAL OF PHYSICAL ANTHROPOLOGY 98:355-367 (1995)

A Simulation Test of Smith’s “Degrees of Freedom” Correction for Comparative Studies

CHARLES L. NUNN
Department of Biological Anthropology and Anatomy, Duke University, DUMC Box 90383, Durham, North Carolina 27708-0383

KEY WORDS Comparative methods, Phylogenetic constraint, Nonindependence, Computer simulation

ABSTRACT Computer simulation was used to test Smith’s (1994) correction for phylogenetic nonindependence in comparative studies. Smith’s method finds effective N, which is computed using nested analysis of variance, and uses this value in place of observed N as the baseline degrees of freedom (df) for calculating statistical significance levels. If Smith’s formula finds the correct df, distributions of computer-generated statistics from simulations with observed N nonindependent species should match theoretical distributions (from statistical tables) with the df based on effective N. The computer program developed to test Smith’s method simulates character evolution down user-specified phylogenies. Parameters were systematically varied to discover their effects on Smith’s method. In simulations in which the phylogeny and taxonomy were identical (tests of narrow-sense validity), Smith’s method always gave conservative statistical results when the taxonomy had fewer than five levels. This conservative departure gave way to a liberal deviation in type I error rates in simulations using more than five taxonomic levels, except when species values were nearly independent. Reducing the number of taxonomic levels used in the analysis, and thereby eliminating available information regarding evolutionary relationships, also increased type I error rates (broad-sense validity), indicating that this practice may be inappropriate under conditions shown to have high type I error rates.
However, the use of taxonomic categories over more accurate phylogenies did not create a liberal bias in all cases in the analysis performed here. The effect of correlated trait evolution was ambiguous but, relative to other parameters, negligible. © 1995 Wiley-Liss, Inc.

Received July 1, 1994; accepted May 21, 1995.

Species values in comparative studies are not necessarily independent of one another. However, the statistical techniques commonly used in these studies require independent data points. When species values are falsely considered independent, the degrees of freedom (df) appropriate for statistical testing are inflated, resulting in an overstatement of statistical significance (Felsenstein, 1985; Harvey and Pagel, 1991; Martins and Garland, 1991).

All the new comparative methods that address problems of phylogenetic nonindependence are limited in their application and have strict assumptions. Some methods, such as Felsenstein’s (1985) independent contrasts method, require a phylogeny with known branch lengths. Other methods, including higher nodes approaches that use nested analysis of variance (ANOVA) to identify an independent level of analysis (Clutton-Brock and Harvey, 1977), assume that the taxonomy used estimates the actual phylogenetic relationships, while autocorrelation techniques (Cheverud et al., 1985; Gittleman and Kot, 1990) require a hierarchical ordering of relatedness and an association between phylogenetic relatedness and trait variation (“phylogenetic correlation”: Gittleman and Luh, 1992, 1994). By averaging traits of lower taxonomic levels, some methods, such as higher nodes approaches (Clutton-Brock and Harvey, 1977) and parsimony techniques (Ridley, 1983; Maddison, 1990), ignore species-level variation (Anthony and Kay, 1993; Smith, 1994).
Finally, although computer software that calculates many of the corrections is widely available, the mathematical complexity of most methods seems to limit their application by many in the scientific community.

Smith’s (1994) method, a correction for the degrees of freedom that uses nested ANOVA to partition variance to taxonomic levels, avoids some of the restrictions of other comparative methods. However, Smith’s method has its own limitations.

In Smith’s method, species values are considered independent for the initial calculation of a statistical measure. Phylogenetic effects are incorporated only when statistical significance levels are computed. Using nested ANOVA procedures on taxonomic levels, Smith partitions the overall sample variance into percentage variance components (PVCs). Smith computes effective N, which is used in place of observed N when calculating the degrees of freedom, by multiplying the PVCs by the number of taxa at each hierarchical level and summing these values through the taxonomic levels under investigation:

\[
\text{Effective } N = (\#\,\text{species})(\mathrm{PVC}_{\text{species}}) + (\#\,\text{genera})(\mathrm{PVC}_{\text{genus}}) + \cdots + (\#\,\text{taxa})(\mathrm{PVC}_{\text{taxa}}).
\]

Because a higher taxonomic level can never contain more taxa than the level below it, and the PVCs sum to 1, effective N must be less than or equal to the number of species in the data set, or observed N. The difference between effective N and observed N depends on how variation is partitioned across taxonomic levels. When phylogenetic constraints are weak and species are relatively free to vary, variation will occur mostly at lower taxonomic levels (species). In this case, the PVC for species approaches 1, and effective N approaches observed N. At the other extreme, when phylogenetic effects are strong and lower taxonomic levels are constrained, nested ANOVA is expected to partition most of the variance to higher taxonomic levels. In this case, the PVC for species (and other low taxonomic levels) approaches 0, and effective N is consequently much less than observed N.
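The two extremes just described can be made concrete with a short numerical sketch. The function name `effective_n`, the tolerance argument, and the three-level data set (12 species, 6 genera, 2 families) are hypothetical and chosen only for illustration; the PVCs are invented, not estimated from any nested ANOVA.

```python
def effective_n(taxa_counts, pvcs, tol=1e-3):
    """Sum over taxonomic levels of (number of taxa) x (that level's PVC).

    taxa_counts and pvcs must be ordered the same way, one entry per
    taxonomic level. PVCs are proportions and should sum to about 1.
    """
    if len(taxa_counts) != len(pvcs):
        raise ValueError("need one PVC per taxonomic level")
    if abs(sum(pvcs) - 1.0) > tol:
        raise ValueError("PVCs should sum to 1")
    return sum(n * p for n, p in zip(taxa_counts, pvcs))


# Hypothetical data set: 12 species in 6 genera in 2 families.
counts = [12, 6, 2]

# Weak phylogenetic constraint: most variance lies among species.
weak = effective_n(counts, [0.90, 0.05, 0.05])

# Strong phylogenetic constraint: most variance lies among families.
strong = effective_n(counts, [0.10, 0.20, 0.70])

print(round(weak, 2), round(strong, 2))  # 11.2 3.8
```

With weak constraint the result stays close to the 12 observed species; with strong constraint it drops toward the number of higher taxa.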
By using nested ANOVA to partition variance, and then using this partitioning to find the baseline degrees of freedom, Smith’s correction is supposed to account for phylogenetic nonindependence when calculating statistical significance levels. In many ways, Smith’s method is a compromise between the older, nonphylogenetic comparative methods and their new counterparts; it allows one to use the “traditional ‘equilibrium’ analysis” (Martins and Garland, 1991) by providing a correction for phylogenetic effects that is implemented at a later stage in the analysis.

Smith’s method has three advantages over the other new comparative methods. First, Smith’s method uses the observed variance at each taxonomic level to estimate phylogenetic constraint and is thus independent of a specific model of evolutionary change (for other methods, see Gittleman and Luh, 1994). Second, Smith’s method uses all information at the species level, thus avoiding the criticism that available information is lost with new comparative methods (Anthony and Kay, 1993; Smith, 1994). Finally, the ease of interpreting and calculating effective N (which can be done with commonly used statistical packages) makes Smith’s method an attractive alternative to more mathematically complicated procedures. No transformations or contrasts are required, further easing statistical interpretation of the output.

Although Smith’s method overcomes some weaknesses of previous comparative methods, the method still has limitations. Smith’s method may be unsatisfactory when the taxonomy chosen by the user does not accurately represent all evolutionary relationships, such as when phylogenetic relationships are misstated or nodes are incompletely resolved (polytomies). Inability to meet the assumptions necessary for proper nested ANOVA calculations, such as the requirements of homoscedasticity and the description of evolution as hierarchical, may also limit application of Smith’s method. In addition, equal sample sizes are preferred when calculating nested ANOVAs. When sample sizes are unequal, calculation of nested ANOVAs becomes more complex, and only approximate tests of significance are available (Sokal and Rohlf, 1995).

Smith’s method lowers the df, which makes it more conservative than simply using observed N. In addition, Smith (1994) suggests that his method gives conservative statistical results even after taking phylogeny into account. However, the theoretical basis of Smith’s method is unclear, and the ability of Smith’s method to estimate the true effective N, the value that a correction for the degrees of freedom should calculate, has not been established analytically.

This paper describes a simulation test of Smith’s method. The computer program created for this test simulates the evolution of two traits down fully known phylogenies, which in some cases are identical to the taxonomy used to calculate effective N. By repeatedly simulating evolution down a known phylogeny, the program generates distributions of statistics (“observed distributions”) calculated from the resulting branch tip trait values (Fig. 1). In the analysis of Smith’s method, goodness-of-fit tests were used to compare observed distributions to their expected distributions (calculated using Smith’s method), and type I error rates and true effective Ns were estimated. Thus, the possible bias and statistical error associated with Smith’s correction were established empirically with different phylogenies and under different evolutionary scenarios.

Fig. 1. Example of an observed distribution (horizontal axis: correlation coefficient, r). The evolution of two traits is simulated down an input phylogeny, and the values at the branch tips (the species values) are used to calculate the correlation between the traits (r). The number of r values in the observed distribution equals the number of simulations.
METHODS

The computer program was written in THINK Pascal 4.0 (Symantec Corp., 1991, Cupertino, CA). It allows the user to specify the input phylogeny, the correlation between traits (ρ), and the number of simulations to perform (2,000 for all results discussed here). Trait changes are calculated for each internode branch on the input phylogeny, and each trait change is normally distributed with mean equal to zero and variance proportional to the branch length. This follows a model of evolutionary change known as Brownian motion (Felsenstein, 1985), where the variance accumulates at a rate linearly proportional to time. The assumption of Brownian motion simplifies partitioning of variance when branch lengths for each taxonomic level are equal, as the PVC of a taxonomic level is simply its branch length’s proportion of the total phylogeny’s length.

For each simulation of trait evolution, the trait values at the branch tips are used to calculate a product-moment correlation coefficient (r). Each r is an estimate of the user-specified correlation ρ. The observed distribution is composed of the r calculated from each simulation. Thus, each r is computed using all species in the input phylogeny (observed N), and the number of rs in the observed distribution equals the user-specified number of simulations.

If Smith’s effective N is an unbiased estimate of the true effective N, experimentally generated distributions of correlation coefficients calculated from observed N should match their expected distributions based on effective N, despite the fact that effective N must be less than observed N. Goodness-of-fit tests, calculation of type I error rates, and comparison of effective N to true effective N show whether Smith’s method correctly estimates true effective N, and, if not, whether the method is statistically conservative or liberal.
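The simulation procedure described above can be sketched in a few lines. This is not the original THINK Pascal program: the function names are hypothetical, the tree is assumed to be balanced and bifurcating with one shared branch length per level, and the way the trait correlation ρ enters (correlated normal increments on each branch) is an assumption, since the text does not specify it.

```python
import math
import random


def simulate_tips(branch_lengths, rho, rng):
    """Evolve two traits by Brownian motion down a balanced, bifurcating tree.

    branch_lengths[k] is the length shared by every branch at level k + 1
    (counted from the root), so the tree has 2 ** len(branch_lengths) tips.
    Each change is normal with mean 0 and variance equal to the branch
    length; the two traits' changes have correlation rho (assumed here).
    """
    nodes = [(0.0, 0.0)]  # both traits start at 0 at the root
    for length in branch_lengths:
        sd = math.sqrt(length)
        children = []
        for x, y in nodes:
            for _ in range(2):  # two daughter lineages per node
                z1 = rng.gauss(0.0, 1.0)
                z2 = rng.gauss(0.0, 1.0)
                dx = sd * z1
                dy = sd * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
                children.append((x + dx, y + dy))
        nodes = children
    return nodes


def pearson_r(pairs):
    """Product-moment correlation of a list of (x, y) tip values."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def observed_distribution(branch_lengths, rho, n_sims, seed=1):
    """One r per simulated data set; every r uses all tips (observed N)."""
    rng = random.Random(seed)
    return [pearson_r(simulate_tips(branch_lengths, rho, rng))
            for _ in range(n_sims)]


# Five taxonomic levels, all branch lengths equal, uncorrelated traits.
rs = observed_distribution([1.0] * 5, rho=0.0, n_sims=200)
```

Each call to `observed_distribution` yields one observed distribution of r; the paper used 2,000 simulations per distribution, and 200 is used here only to keep the sketch fast.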
Statistical tests of goodness-of-fit (G-statistics, using Williams’ correction) were used to compare observed distributions to their expected distributions. When comparing two observed distributions generated with different input phylogenies, the Kolmogorov-Smirnov two-sample test was used. Statistical measures of goodness-of-fit, such as G, give the probability of obtaining a discrepancy at least as large as that observed if the two distributions share the same underlying distribution. With this probability, the null hypothesis that no difference exists between the distributions (and, hence, that Smith’s method estimates the correct baseline df) was accepted or rejected at a significance level of 5%.

Type I error rates were calculated by finding the percentage of the observed distribution that exceeds the α = 0.05 critical value, where the α = 0.05 critical values were taken from statistical tables (r distributions) using effective N - 2 as the df. If Smith’s correction estimates the true effective N, type I error rates should be approximately 5%. If Smith’s correction is conservative, type I error rates should be less than 5%, and if Smith’s correction is liberal, type I error rates should exceed 5%.

The true effective N was determined by finding the expected distribution (Zar, 1984, Table B.16) that (1) best fits the observed data (using G-statistics), and (2) maintains a conservative bias in type I error rates (type I error rate ≤ α = 0.05). Harmonic interpolation was used to find noninteger df when integer values for the true effective N did not fit the data. The resolution used in finding the true effective N was limited to 0.25.

Narrow- versus broad-sense validity

Two sets of simulations were performed that test different aspects of Smith’s correction. First, simulations were conducted when the assumptions of Smith’s method hold.
This set of simulations tests the method’s “narrow-sense” validity (Pagel and Harvey, 1992), a type of validity that applies when the assumptions of the comparative method are met by the data set. One key assumption of Smith’s method is that the taxonomy used accurately reflects all phylogenetic relationships. This assumption is preserved in tests of narrow-sense validity by simulating evolution on a phylogeny that is really an ideal taxonomy: the phylogeny is a dichotomously branching tree with equal branch lengths at each taxonomic level. (The terms “phylogeny” and “taxonomy” are thus interchangeable in tests of narrow-sense validity.) Branch lengths can differ between taxonomic levels (e.g., terminal branches may be longer than internal branches), but within each taxonomic level branches have the same length. An example of such an input phylogeny is given in Figure 2. Three parameters were experimentally manipulated in this series of simulations: the number of taxonomic levels; how variance accumulates at successive taxonomic levels (or, the amount of phylogenetic nonindependence); and the correlation between the simulated traits (ρ). These tests are summarized in Table 1.

Fig. 2. In tests of narrow-sense validity the input phylogeny was really an ideal taxonomy: within each taxonomic level branch lengths were equal. However, branch lengths could differ between taxonomic levels. This is shown in the hypothetical phylogeny here, with the terminal level having longer branch lengths than internal levels.

The assumptions of a comparative study are almost never perfectly satisfied in actual data sets. Therefore, the second set of simulations tests the method’s “broad-sense” validity (Pagel and Harvey, 1992), or how the method fares when some (or all) of its assumptions are violated. Smith’s method uses taxonomic categories.
Therefore, one potential problem with the method is that an accurate phylogeny, because it includes branch lengths, will provide more information about evolutionary relationships than an accurate taxonomy. In addition, as fewer taxonomic levels are used in the comparative study, more phylogenetic information is lost: essentially, the phylogeny has more unresolved nodes, or polytomies. Decreasing the resolution of phylogenetic relationships has been shown to increase the type I error rates of other methods (Purvis et al., 1994), and the same may be true of Smith’s method.

To test this possibility, evolution was simulated down an input phylogeny used in previous simulation studies [Sessions and Larson’s (1987) plethodontid salamander phylogeny used in Martins and Garland’s (1991) study, chosen because branch lengths are provided]. From the input phylogeny, three possible taxonomies were created, each having a different number of taxonomic levels. The taxonomy using the most levels best approximates the input phylogeny, while the taxonomy using the fewest levels has the poorest resemblance to the input phylogeny. By calculating Smith’s effective N for each of these taxonomies and then comparing these values to the true effective N, the effect on type I error of using unresolved evolutionary relationships was established for this phylogeny.

RESULTS

Narrow-sense validity

For testing the narrow-sense validity of Smith’s method, simulations were conducted with the parameter of interest varying, while all other parameters were held constant. The input phylogeny bifurcated at each node. Three parameters were systematically varied: the number of taxonomic levels, the distribution of variance across taxonomic levels, and the correlation between the simulated traits (ρ). The number of taxonomic levels and ρ are not included as variables in Smith’s formula for effective N. However, effective N does incorporate the distribution of variance across taxonomic levels by using nested ANOVA to calculate PVCs.

TABLE 1. Summary of the simulations conducted

Type of validity   Factor investigated                   Parameters simulated
Narrow-sense       Number of taxonomic levels            3-9 taxonomic levels, all branch lengths equal
Narrow-sense       Variance partitioning (PVCs differ)   Effective N/observed N ranges from 0.125 to 0.875 in increments of 0.125; taxonomic levels range from 4 to 7
Narrow-sense       Trait correlation (ρ)                 ρ = 0, 0.25, 0.50, 0.75, 0.90, 0.95, and 0.98; various effective N and number of taxonomic levels
Broad-sense        Phylogenetic resolution of the        3-5 taxonomic levels (5-11 internal nodes), taxonomies based on Sessions and Larson’s (1987)
                   taxonomy used in finding effective N  plethodontid salamander phylogeny

Number of taxonomic levels. To determine whether the number of taxonomic levels biases estimates of true effective N, simulations were conducted with the number of taxonomic levels ranging from 3 to 9 (seven different input phylogenies). All branch lengths on the input phylogenies were equal. Using G-tests, the observed distributions were compared to an expected distribution of correlation coefficients with df = effective N - 2 (Zar, 1984, Table B.16; 9 ranges of the expected distribution are provided by Zar and used in the analysis; a 10th range was added to reflect the limits of correlation coefficients: -1 to +1).

Because the computer program simulates evolution by Brownian motion and because all branch lengths were equal, PVCs were the same for each taxonomic level (holding the number of taxonomic levels constant). The input phylogeny bifurcates at each node, so at each taxonomic level, $2^L$ taxa occur, where L is the hierarchical level counted from the root of the tree.
These values (PVCs and the number of taxa at each hierarchical level) were used to calculate effective N. For example, with five taxonomic levels, PVC = 1.0/5 = 0.2 at each level. Thus,

\[
\text{Effective } N = 0.2\,(2^1 + 2^2 + 2^3 + 2^4 + 2^5) = 12.4.
\]

Harmonic interpolation between values from statistical tables was used to find noninteger df. Although noninteger df have only theoretical value, interpolation eliminates the conservative departure in type I error rates that is more pronounced at small values of effective N. The df appropriate for finding values from statistical tables of r distributions is n - 2. Returning to the example, using Smith’s method with harmonic interpolation, the observed distribution of correlation coefficients should follow the expected distribution with 12.4 - 2 = 10.4 df. By contrast, an analysis that ignores phylogenetic effects would use $2^5 - 2 = 30$ df (5 taxonomic levels in a balanced, bifurcating phylogeny give $2^5 = 32$ branch tips, or observed N). Table 2 lists observed Ns and effective Ns for simulations in which the number of taxonomic levels was varied.

TABLE 2. Observed N, effective N, and effective N/observed N for simulations with taxonomic levels varying and branch lengths equal

No. of taxonomic levels   Observed N   Effective N   Effective N/observed N
3                         8            4.67          0.584
4                         16           7.50          0.469
5                         32           12.40         0.388
6                         64           21.00         0.328
7                         128          36.29         0.284
8                         256          63.75         0.249
9                         512          113.56        0.222

G-statistics were compared to the χ² distribution with df = number of categories - 1. To avoid observed frequencies of 0 in a category, which complicates calculation of G-statistics, simulations of input phylogenies with 3 and 4 taxonomic levels were pooled into seven ranges of the theoretical distribution. For the remaining simulations (5 to 9 taxonomic levels), it was possible to use all 10 ranges of the theoretical distribution. Table 3 lists effective N, G_adj, the type I error rate, and true effective N for simulations run with different numbers of taxonomic levels.
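The worked example above, and the effective N column of Table 2, can be reproduced with a short sketch. The function names are hypothetical. Harmonic interpolation is implemented as linear interpolation in 1/df, and the two tabled critical values of r (0.576 at 10 df and 0.553 at 11 df, two-tailed α = 0.05) are standard table values assumed here for illustration; they are not quoted from the paper.

```python
def balanced_effective_n(levels):
    """Effective N for a balanced, bifurcating tree with equal branch lengths.

    Under Brownian motion every level gets the same PVC (1 / levels), and
    level L (counted from the root) holds 2 ** L taxa.
    """
    pvc = 1.0 / levels
    taxa_per_level = [2 ** level for level in range(1, levels + 1)]
    return pvc * sum(taxa_per_level)


def harmonic_interpolate(df, df_lo, crit_lo, df_hi, crit_hi):
    """Interpolate a tabled critical value linearly in 1/df."""
    weight = (1.0 / df_lo - 1.0 / df) / (1.0 / df_lo - 1.0 / df_hi)
    return crit_lo + weight * (crit_hi - crit_lo)


# Effective N for 3 to 9 taxonomic levels (compare with Table 2).
for levels in range(3, 10):
    print(levels, 2 ** levels, round(balanced_effective_n(levels), 2))

# Five levels: effective N = 12.4, so df = 10.4 lies between 10 and 11.
df = balanced_effective_n(5) - 2
# Assumed two-tailed alpha = 0.05 critical values of r at 10 and 11 df.
r_crit = harmonic_interpolate(df, 10, 0.576, 11, 0.553)
print(round(df, 1), round(r_crit, 3))
```

Interpolating on the reciprocal of df, rather than on df itself, follows the usual convention for tables whose critical values flatten as df grows.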
For all tests the observed distribution differed significantly from expected (all P-values are <0.005). Using G_adj as an estimate of goodness of fit, Smith’s correction worked best in the middle ranges of taxonomic levels simulated (5 and 6 levels); simulations on input phylogenies with fewer (3 and 4) or more (7 to 9) taxonomic levels generated observed distributions that departed more from expected. When 3 to 5 taxonomic levels were simulated, Smith’s correction gave conservative type I error rates (type I error < α = 0.05). This conservative departure disappeared when 6 or more taxonomic levels were simulated (type I error > α = 0.05). For simulations of 8 and 9 taxonomic levels, the type I error rate was substantial (8 levels, type I error = 0.175; 9 levels, type I error = 0.243). Another way of looking at this is by comparing Smith’s effective N to true effective N: as more taxonomic levels were simulated, Smith’s effective N overestimated true effective N, indicating a liberal statistical bias to the method as more taxonomic levels are included in the analysis (Fig. 3). Note, however, that few comparative studies use more than 5 taxonomic levels (see examples in Harvey and Pagel, 1991), and Smith’s method is therefore expected to be conservative in the narrow sense.

TABLE 3. Results of simulations, with the number of taxonomic levels varying and branch lengths equal (1)

Taxonomic levels   Effective N   G_adj      Type I error   True effective N
3                  4.67          220.735    0.010          6.25
4                  7.50          76.438     0.019          9.00
5                  12.40         24.459     0.038          13.50
6                  21.00         29.036     0.069          18.00
7                  36.29         266.545    0.129          24.00
8                  63.75         689.022    0.175          31.00
9                  113.56        1600.100   0.243          39.00

(1) All observed distributions are significantly different from expected (G_adj: 3 and 4 taxonomic levels, df = 6; ≥5 taxonomic levels, df = 9; all P-values <0.005).

TABLE 4. Results of Kolmogorov-Smirnov two-sample test, testing for the effect on observed distributions of changing the input phylogeny while holding effective N and number of taxonomic levels constant (1)

Type of test: magnitude of branch lengths differs (but relative branch lengths the same). Comparisons: variance = 1 vs. variance = 10, 4 taxonomic levels; variance = 1 vs. variance = 100, 4 taxonomic levels; variance = 1 vs. variance = 10, 6 taxonomic levels; variance = 1 vs. variance = 100, 6 taxonomic levels.
Type of test: relative branch lengths differ. Comparisons: effective N = 20, 5 taxonomic levels; effective N = 25, 6 taxonomic levels; effective N = 30, 6 taxonomic levels.
D for the seven comparisons, in ascending order: 0.0100, 0.0185, 0.0185, 0.0190, 0.0240, 0.0260, 0.0345 (each n.s.).

(1) Critical value of D = 0.0430.

All internode branch lengths were equal in this set of simulations. This essentially assumes a punctuational model of evolutionary change, where the amount of change is proportional to the number of speciation events (Martins and Garland, 1991; Martins, 1993). More importantly, the equality of branch lengths means that effective N/observed N declines as more taxonomic levels are simulated (Table 2). Consequently, the increase in type I error rates as more taxonomic levels were simulated may instead only reflect the decline in effective N/observed N, which is really a measure of the nonindependence in the data set. The next section separates the effects of these two parameters.

Variance partitioning. When species values are less independent, variance is partitioned mostly to high taxonomic levels (e.g., orders), reducing effective N/observed N. When species values are nearly independent, the variance is partitioned mostly to low taxonomic levels (e.g., species), and effective N/observed N approaches 1. Effective N/observed N, really a measure of phylogenetic nonindependence, may explain the deviations in type I error rates noted above. Internode and terminal branch lengths of the input phylogeny were altered such that effective N/observed N varied from 0.125 to 0.875 in 0.125 increments.
For low values of effective N (strong phylogenetic nonindependence), higher taxonomic levels had the longest relative branch lengths, while for high values of effective N (species values nearly independent), lower taxonomic levels had the longest relative branch lengths. To preserve the assumption that the taxonomy accurately reflects all phylogenetic relationships, branch lengths were identical within each taxonomic level. Simulations were run on input phylogenies with 4, 5, 6, and 7 taxonomic levels (effective N/observed N = 0.125 was not possible for 4 taxonomic levels without some branch lengths of 0 and, consequently, it was not tested here). By varying both the number of taxonomic levels and effective N/observed N, the effects of these two parameters could be separated.

Fig. 3. Graphic representation of effective N and true effective N, the value that Smith’s method should calculate, with the number of taxonomic levels varying and branch lengths equal (horizontal axis: taxonomic levels, 3 to 9). Smith’s method underestimated true effective N when 3 to 5 taxonomic levels were simulated; when more than 5 taxonomic levels were simulated, Smith’s method overestimated true effective N.

Before simulating different effective Ns by changing relative branch lengths, the relative branch lengths themselves had to be excluded as factors influencing observed distributions and type I error rates. The Kolmogorov-Smirnov two-sample test was used to compare observed distributions from input phylogenies that differed in their branch lengths but had the same number of taxonomic levels and effective N. Two types of changes in input phylogenies were tested: (1) differences in the magnitude of internode branch lengths (with relative branch lengths the same), and (2) differences in relative branch lengths. Table 4 presents the results of these tests.
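The two-sample comparison described above can be sketched as follows. The function names are hypothetical, and the large-sample coefficient 1.36 for α = 0.05 is the commonly tabulated approximation, assumed here; with two observed distributions of 2,000 correlation coefficients each it gives a critical D of about 0.0430, the value reported with Table 4.

```python
import bisect
import math


def ks_two_sample_d(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The maximum gap between two step functions occurs at a data point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_d(n_a, n_b, coefficient=1.36):
    """Large-sample critical D; 1.36 is the assumed alpha = 0.05 coefficient."""
    return coefficient * math.sqrt((n_a + n_b) / (n_a * n_b))


# Two observed distributions of 2,000 correlation coefficients each.
critical = ks_critical_d(2000, 2000)
print(round(critical, 4))
```

Two observed distributions are judged different at the 5% level only when their D exceeds this critical value.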
In no cases were significant differences found between distributions simulated with the same number of taxonomic levels and effective N but different absolute or relative branch lengths, suggesting that effective N could be varied by changing branch lengths.

Returning to simulations with different effective Ns and different numbers of taxonomic levels, Table 5 lists the parameters simulated and their type I error rates. Figure 4 provides a graphic representation of the results.

TABLE 5. Results of simulations with effective N and number of taxonomic levels varying

Taxonomic levels   Effective N   Effective N/observed N   Type I error
4                  4             0.25                     0.002
4                  6             0.375                    0.019
4                  8             0.5                      0.0185
4                  10            0.625                    0.0275
4                  12            0.75                     0.0255
4                  14            0.875                    0.0385
5                  4             0.125                    0.003
5                  8             0.25                     0.036
5                  12            0.375                    0.035
5                  16            0.5                      0.037
5                  20            0.625                    0.034
5                  24            0.75                     0.0345
5                  28            0.875                    0.04
6                  8             0.125                    0.143
6                  16            0.25                     0.092
6                  24            0.375                    0.0735
6                  32            0.5                      0.076
6                  40            0.625                    0.064
6                  48            0.75                     0.0465
6                  56            0.875                    0.0515
7                  16            0.125                    0.246
7                  32            0.25                     0.151
7                  48            0.375                    0.116
7                  64            0.5                      0.109
7                  80            0.625                    0.097
7                  96            0.75                     0.0515
7                  112           0.875                    0.0545

Fig. 4. Type I error rates (expected = 0.05) with the number of taxonomic levels and effective N/observed N varying (horizontal axis: effective N/observed N, 0 to 1). When 4 or 5 taxonomic levels were simulated, Smith’s method always gave conservative statistical significance levels. For 6 and 7 taxonomic levels, decreasing effective N/observed N increased type I error rates.

As effective N/observed N was increased (species values were more independent), type I error rates approached the expected type I error rate (α = 0.05). This suggests that Smith’s correction works best when branch tip data points are nearly independent. As effective N/observed N was decreased, the magnitude and the direction of departure from α depended on the number of taxonomic levels simulated.
Type I error rates for simulations of 4 and 5 taxonomic levels never exceeded α = 0.05, suggesting that Smith’s method is statistically conservative, regardless of effective N/observed N, when 5 or fewer taxonomic levels are used in the study. By contrast, results from simulations of 6 and 7 taxonomic levels indicate that Smith’s method is statistically liberal when effective N/observed N is less than about 0.75. Furthermore, this liberal deviation increased as effective N/observed N decreased. In every case, for a given effective N/observed N, simulations on input phylogenies with more taxonomic levels had higher type I error rates.

Correlation between traits. To test for an effect of correlated character evolution on Smith’s method, simulations were run with ρ = 0.00, 0.25, 0.50, 0.75, 0.90, 0.95, and 0.98. Different numbers of taxonomic levels (4, 5, and 6) and effective Ns were simulated. Linear regression techniques were used to see whether a relationship exists between type I error rates and ρ. Early results suggested that as ρ increases type I error rates also increase, although this positive relationship was small (linear regression: b = 0.0217, P = 0.0054). However, this result was not consistently found when tested over a wider range of parameters. In fact, b ranged from -0.0119 to 0.0293, with most values not significantly different from 0. In all cases b was small, indicating that the effect of correlated trait evolution on type I error rates is minor compared to the effect of the parameters discussed above.

One explanation for the inconsistency of this result may be the method of calculating critical values for nonzero input correlations. As ρ approaches its limits (±1.0), the statistical distributions become more asymmetrical. Authors differ in their methods of finding critical values for ρ ≠ 0 (e.g., compare Zar, 1984, with Sokal and Rohlf, 1995).
Furthermore, the approximations used may be biased in an unknown way for small sample sizes and high values of ρ. This could create an observed increase in type I error rates for some sets of simulation parameters when in fact no relationship exists.

One way of dealing with these potential biases is to find critical values by computer simulation. These values are, however, only estimates of the true critical values. This means that finding a relationship between ρ and type I error rates is still difficult, as an actual trend would be diluted by stochastic error in the estimated critical values. This is especially true when the effect is small, as initial analyses indicated here. In one series of simulations (6 taxonomic levels, all internode branches equal) with critical values found by simulation (n = 10,000), a significant positive trend was discovered (b = 0.0293; P = 0.0273). However, under the same simulation conditions and four taxonomic levels, the trend was negative, but not significantly so (b = -0.0119; P = 0.057). These were also the most extreme values of b found.

The inconsistency in these findings may reflect (1) unknown variables or an interaction between variables, (2) a lack of resolution in finding critical values, because of either biased statistical methods or stochastic error associated with computer simulations, or (3) statistical artifacts coupled with no real trend. In all cases, the effect is small relative to the effects of the other parameters tested.

Broad-sense validity

Some researchers, citing the narrow-sense results from above, may be inclined to group their species into fewer taxonomic levels to reduce excessively high type I error rates. However, the above results are from tests of narrow-sense validity, where the taxonomy and the phylogeny were identical.
Grouping the species into fewer taxonomic levels breaks this assumption by obscuring the species’ true phylogenetic relationships, and this makes it incorrect to apply tests of narrow-sense validity to this situation.

This section tests the “broad-sense” validity of Smith’s method, or how the method works when some of its assumptions are broken. As Pagel and Harvey (1992) point out, there are an infinite number of ways in which the assumptions of a comparative method can fail. Therefore, only a subset of possible violations can ever be tested experimentally. Smith’s method makes use of the observed variance at each taxonomic level and therefore makes no assumptions about the model of evolutionary change. Consequently, the model of evolutionary change is not an interesting parameter to test in this case. Instead, the analysis of broad-sense validity here focuses on how rearrangements of the taxonomy, where the taxonomy used in calculating effective N does not equal the input phylogeny, affect Smith’s method.

Fig. 5. Sessions and Larson’s (1987) plethodontid salamander phylogeny, used as the input phylogeny in tests of broad-sense validity.

Evolution was simulated using Sessions and Larson’s (1987) plethodontid salamander phylogeny (Fig. 5) as the input phylogeny (also used in Martins and Garland, 1991), and the resulting observed distribution was used to find the true effective N. Three taxonomies, each with fewer taxonomic levels (5, 4, and 3), were created from the input phylogeny. Taxonomies with more levels best approximated the input phylogeny, and thus had the greatest phylogenetic resolution. However, no taxonomy provided the resolution of the input phylogeny. The taxonomies created and used in the analysis are shown in Figure 6.
Then, with a set of branch tip data points, simulated using Sessions and Larson’s phylogeny as the input phylogeny, PVCs were calculated using the procedure “proc nested” in SAS (SAS Institute Inc., 1992, Cary, NC). This procedure was repeated 25 times, each time with different simulated branch tips, and the PVCs calculated from the 25 nested ANOVAs were averaged for each taxonomic level. The entire process was then repeated for the next taxonomic arrangement. Table 6 gives the PVCs for each of the taxonomies.

Fig. 6. Taxonomies used in finding effective N when testing broad-sense validity. Letters correspond to the species in Fig. 5. Sessions and Larson’s (1987) phylogeny broken down into 5 taxonomic levels (a), 4 taxonomic levels (b), and 3 taxonomic levels (c).

TABLE 6. Tests of broad-sense validity: number of taxa and PVCs for each taxonomic level

Taxonomic levels   Level number   Number of taxa   PVC
5                  1              2                0.3405
                   2              3                0.1307
                   3              5                0.1262
                   4              9                0.1914
                   5              15               0.2111
4                  1              2                0.3580
                   2              3                0.1358
                   3              5                0.1649
                   4              15               0.3413
3                  1              2                0.3836
                   2              3                0.0980
                   3              15               0.5184

To summarize, traits were simulated down the actual plethodontid salamander phylogeny (to find true effective N), but PVCs were calculated from the taxonomies (to find Smith’s effective N). The departure of effective N from true effective N thus estimates how Smith’s method fares when evolutionary relationships are obscured by eliminating taxonomic levels. Because none of the taxonomies uses all the available phylogenetic information, the effect of using taxonomic relationships over phylogenetic ones is also tested.

TABLE 7. Effective N, effective N/observed N, and type I error rates from tests of broad-sense validity

Taxonomic levels   Effective N   Effective N/observed N   Type I error rate
5                  6.593         0.4395                   0.0300
4                  7.067         0.4711                   0.0380
3                  8.837         0.5891                   0.0795

The true effective N for Sessions and Larson’s input phylogeny is 7.5 (resolution to 0.25, type I error = 0.048).
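As a check on Tables 6 and 7, the effective N values can be recomputed from the PVCs. The minimal Python sketch below assumes that Smith's effective N is the sum, over taxonomic levels, of each level's PVC multiplied by the number of taxa at that level. That weighting is not spelled out in this section, but it reproduces the effective N values of Table 7 to three decimals. The function and variable names are illustrative only, and observed N is taken to be the 15 species at the branch tips.

```python
# Sketch: recompute effective N from the Table 6 PVCs.
# Assumption: effective N = sum over levels of (number of taxa) x (PVC).

OBSERVED_N = 15  # species at the branch tips (lowest level in Table 6)

# (number of taxa, PVC) for each level, highest level first, transcribed from Table 6.
TABLE_6 = {
    5: [(2, 0.3405), (3, 0.1307), (5, 0.1262), (9, 0.1914), (15, 0.2111)],
    4: [(2, 0.3580), (3, 0.1358), (5, 0.1649), (15, 0.3413)],
    3: [(2, 0.3836), (3, 0.0980), (15, 0.5184)],
}


def effective_n(levels):
    """PVC-weighted sum of the number of taxa at each taxonomic level."""
    return sum(n_taxa * pvc for n_taxa, pvc in levels)


def summarize(table=TABLE_6, observed_n=OBSERVED_N):
    """Return {number of levels: (effective N, effective N / observed N)}."""
    rows = {}
    for n_levels, levels in table.items():
        eff = effective_n(levels)
        rows[n_levels] = (eff, eff / observed_n)
    return rows


if __name__ == "__main__":
    for n_levels, (eff, ratio) in summarize().items():
        print(f"{n_levels} levels: effective N = {eff:.3f}, "
              f"effective N/observed N = {ratio:.4f}")
```

Because the PVCs of each taxonomy sum to approximately 1, shifting variance toward the species level, as happens when intermediate levels are removed, raises effective N toward observed N.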
Table 7 gives effective N and the type I error rate for each of the taxonomies. Even though none of the taxonomies captures all the evolutionary information of the input phylogeny, the taxonomies using four and five levels still give conservative statistical significance levels. However, a noticeable trend occurs, with taxonomies with fewer levels having higher type I error rates. This result differs from the relationship found in tests of narrow-sense validity, where type I error rates decreased as fewer taxonomic levels were used. This trend suggests that intentionally reducing the number of taxonomic levels increases type I error rates. Therefore, reducing the number of taxonomic levels may not eliminate the excessive type I error rates associated with other parameters.

DISCUSSION

When the assumptions of Smith’s method hold (narrow-sense validity) and 5 or fewer taxonomic levels are used, the simulation results show that Smith’s df correction gives conservative statistical results, regardless of how nonindependent the branch tips are. Because most comparative studies employ 5 or fewer taxonomic levels, the liberal departure from expected type I error with more than 5 taxonomic levels is not a serious shortcoming of Smith’s method, provided users are aware of this limitation. However, the conservative nature of Smith’s method means that a false null hypothesis, e.g., a nonzero correlation coefficient, will more likely be judged true by the statistical tests. In other words, conservative type I error rates come at the expense of statistical power.

If the true evolutionary relationships are best represented by more than 5 taxonomic levels, the simulation results suggest this conservative departure in type I error rates disappears and, in many cases, becomes liberal. When the species were generally independent (effective N/observed N ≥ 0.75) type I error rates approached expected error rates.
However, as nonindependence increased, type I error rates became more statistically liberal.

Tests of broad-sense validity suggest that deliberately sacrificing phylogenetic resolution by reducing the number of taxonomic levels may not eliminate liberal deviations in type I error rates. However, although liberal bias increases as phylogenetic relationships are obscured, the broad-sense results also show that using taxonomic categories over phylogenies does not necessarily invalidate Smith’s method, as statistical results can still be conservative.

The effect of correlated trait evolution on type I error rates is difficult to evaluate. However, the simulation results suggest that if it occurs, the effect is positive (type I error rates increased with higher ρ) but small (highest value of b = 0.0293). Given that Smith’s method is generally used in situations expected to give conservative statistical results, this possible trend, because it is small, does not impose a serious limitation on the method.

The computer simulation program developed for this study could be used to estimate the true effective N, and this value could be used in place of Smith’s effective N when testing statistical significance. This would have an advantage over Smith’s method by having a type I error rate equal to the expected type I error rate; in other words, true effective N would be neither conservative nor liberal. However, calculation of true effective N by computer simulation would require a phylogeny with known branch lengths, as well as assumptions about the model of evolutionary change (Martins and Garland, 1991). Such a technique would also require the use of computer simulation, which is probably more difficult (in terms of application) than using computer programs that implement other comparative methods.
Thus, the benefits of Smith’s method are lost by using computer simulation to empirically find true effective N, and, if the necessary information is available, other methods, notably Felsenstein’s (1985) independent contrasts method, may actually be easier to implement. Tests of narrow-sense validity of independent contrasts methods show that type I error rates equal expected error rates (Martins and Garland, 1991; Purvis et al., 1994).

This initial round of simulations suggests that Smith’s method is statistically conservative under the conditions common to most comparative studies, provided the taxonomy employed captures most of the evolutionary relationships. Because the correction is mathematically tractable, Smith’s method may be used for many data sets as an alternative to other comparative methods, particularly in the early stages of a comparative analysis.

ACKNOWLEDGMENTS

I thank Marcy Uyenoyama, Diane Waddle, and Frances White for their comments on an early version of this work. Emilia Martins helped in many aspects of this project, and the suggestions of John Gittleman, Lyle Konigsberg, Richard Smith, and an anonymous reviewer greatly improved the first draft of this paper. Thanks also to Joe Felsenstein for introducing me to the subject of comparative studies, and to Ken Korey for his help and encouragement over the years. This research was supported by an NSF Graduate Student Fellowship.

LITERATURE CITED

Anthony MRL, and Kay RF (1993) Tooth form and diet in Ateline and Alouattine primates: Reflections on the comparative method. Am. J. Sci. 293-A:356-382.
Cheverud JM, Dow MM, and Leutenegger W (1985) The quantitative assessment of phylogenetic constraints in comparative analyses: Sexual dimorphism in body weight among primates. Evolution 39:1335-1351.
Clutton-Brock TH, and Harvey PH (1977) Primate ecology and social organization. J. Zool. (Lond.) 183:1-33.
Felsenstein J (1985) Phylogenies and the comparative method. Am. Nat. 125:1-15.
Gittleman JL, and Kot M (1990) Adaptation: Statistics and a null model for estimating phylogenetic effects. Syst. Zool. 39:227-241.
Gittleman JL, and Luh HK (1992) On comparing comparative methods. Annu. Rev. Syst. 23:383-404.
Gittleman JL, and Luh HK (1994) Phylogeny, evolutionary models, and comparative methods: A simulation study. In Eggleton P, and Vane-Wright RI (eds.): Phylogenetics and Ecology. London: Academic Press, pp. 103-122.
Harvey PH, and Pagel MD (1991) The Comparative Method in Evolutionary Biology. Oxford: Oxford University Press.
Maddison WP (1990) A method for testing the correlated evolution of two binary characters: Are gains or losses concentrated on certain branches of a phylogenetic tree? Evolution 44:539-557.
Martins EP (1993) A comparative study of the evolution of Sceloporus push-up displays. Am. Nat. 142:994-1018.
Martins EP, and Garland T Jr (1991) Phylogenetic analyses of the correlated evolution of continuous characters: A simulation study. Evolution 45:534-557.
Pagel MD, and Harvey PH (1992) On solving the correct problem: Wishing does not make it so. J. Theor. Biol. 156:425-430.
Purvis A, Gittleman JL, and Luh HK (1994) Truth or consequences: Effects of phylogenetic accuracy on two comparative methods. J. Theor. Biol. 167:293-300.
Ridley M (1983) The Explanation of Organic Diversity. The Comparative Method and Adaptations of Mating. Oxford: Clarendon Press.
Sessions SK, and Larson A (1987) Developmental correlates of genome size in plethodontid salamanders and their implications for genome evolution. Evolution 41:1239-1251.
Smith RJ (1994) Degrees of freedom in interspecific allometry: An adjustment for the effects of phylogenetic constraint. Am. J. Phys. Anthropol. 93:95-107.
Sokal RR, and Rohlf JF (1995) Biometry. New York: WH Freeman.
Zar JH (1984) Biostatistical Analysis. Englewood Cliffs, NJ: Prentice-Hall.