Austral Ecology (2002) 27, 681–688

Of rowing boats, ocean liners and tests of the homogeneity of variance assumption in ANOVA

KEITH A. MCGUINNESS

School of Biological, Environmental and Chemical Sciences, Northern Territory University, Darwin, Northern Territory 0909, Australia (Email: [email protected])

Abstract One of the assumptions of analysis of variance (ANOVA) is that the variances of the groups being compared are approximately equal. This assumption is routinely checked before doing an analysis, although some workers consider ANOVA robust and do not bother, and others avoid parametric procedures entirely. Two of the more commonly used heterogeneity tests are Bartlett’s and Cochran’s although, as for most of these tests, they may well be more sensitive to violations of the ANOVA assumptions than is ANOVA itself. Simulations were used to examine how well these two tests protected ANOVA against the problems created by variance heterogeneity. Although Cochran’s test performed a little better than Bartlett’s, both tests performed poorly, frequently disallowing perfectly valid analyses. Recommendations are made about how to proceed, given these results.

Key words: ANOVA, assumption, homogeneity, Bartlett’s, Cochran’s, variance, Type I error.

INTRODUCTION

‘To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!’ (Box 1953)

The analysis of variance (ANOVA) is generally regarded as one of the most powerful and flexible methods for testing null hypotheses about means of samples (Underwood 1997). Indeed, if its assumptions are met, ANOVA provides the most powerful (i.e. most likely to reject the null hypothesis when it is actually false) tests of such null hypotheses.
The validity of the ANOVA assumptions, however, frequently causes much angst, and no little confusion, among ecologists; a situation certainly created, at least in part, by some authors of statistical texts (Table 1). The most critical assumption, and possibly the least understood (although it should be the easiest to satisfy), is that the observations (strictly speaking, the error terms) are uncorrelated, or independent, and, although ‘this difficulty is largely taken care of by proper randomization’, ‘substantial biases’ may occur if this assumption is not met (Cochran 1947). Glass et al. (1972), in their extended review of the ANOVA assumptions, conclude that ‘non-independence of errors seriously affects both the level of significance and power of the F-test regardless of whether n’s are equal or unequal’ (see Underwood (1997) for an extended discussion of non-independence and the other ANOVA assumptions).

Accepted for publication May 2002.

Most attention, however, routinely focuses on the assumptions that the observations are (after appropriate transformation) normally distributed, and that the compared groups have equal variances, despite ANOVA being widely regarded as robust against moderate violations of these requirements (Cochran 1947; Box 1953; Glass et al. 1972; Winer et al. 1991; Underwood 1997). Underwood (1981), for instance, was critical of workers who used ANOVA without checking its requirements, but texts, particularly at an introductory level, may mention neither the assumptions, nor the procedures which may be used to test them (e.g. Brown & Hollander 1977; Ennos 2000; Table 1). Texts that do discuss these assumptions often present differing views of the importance of violating them and contradictory advice on which methods should be used to check them (Table 1). Fowler et al.
(1998), for instance, describe the use of the F-max test (also known as Hartley’s test) and then recommend that if ‘there is doubt as to whether a particular set of data satisfies the assumptions made in the use of a parametric test, then a non-parametric alternative should be used.’ Oehlert (2000), in contrast, takes the opposite line: ‘There are formal tests for equality of variance—do not use them!’ (italics in original). Oehlert (2000) is one of several authors who describe and recommend graphical procedures, such as residual and normal plots, although some of these authors also describe formal tests (e.g. Neter et al. 1982; Weiss 1995).

The three formal tests that appear to be most commonly recommended are the F-max test, Bartlett’s test and Cochran’s test (Table 1; Conover et al. 1981; Winer et al. 1991). Levene’s test is becoming more frequently mentioned, although it is not clear why, as the version usually described does not appear to have any particular advantages (Conover et al. 1981). Of these three tests, Bartlett’s may be the most frequent choice (Underwood 1997). Underwood (1997) suggests that this is because it handles unbalanced designs, but it is also one of the most frequently described procedures (Table 1). Texts describing Bartlett’s test (Zar 1984; Neter et al. 1985) often stress its undue sensitivity to non-normality and recommend that it not be used unless the observations are reasonably normal. In fact, excessive sensitivity to non-normality is the Achilles heel of most variance homogeneity tests (Conover et al. 1981). This point is noted by several texts (Zar 1984; Neter et al. 1985; Winer et al. 1991; Underwood 1997), with Oehlert (2000) being the most scathing: ‘classical tests of constant variance (such as Bartlett’s test or Hartley’s test) are so incredibly sensitive to non-normality that their inferences are worthless in practice’ (italics in original).
In fact, these ‘classical tests’ are more sensitive to non-normality and heterogeneity than ANOVA itself is, and Neter et al. (1985) suggested that ‘if the populations are reasonably normal, so that the Bartlett test can be employed and the sample sizes do not differ greatly, a fairly low level may be justified in testing the equality of variances for determining the aptness of the ANOVA model, since only large differences between variances need be detected’.

Conover et al. (1981) compared a wide range of homogeneity tests and identified three that were robust against non-normality and were also reasonably powerful (see their paper for details). These tests would be useful when differences among variances were of genuine interest, as may sometimes be the case (Parker 1979; Underwood 1997), but they are not usually included in standard texts or statistical packages. In any case, the simulations done by Conover et al. (1981) did not explicitly test the usefulness of these variance tests for checking the importance of violating the ANOVA assumption of homogeneity. Thus, their results do not address the criticism made of all such tests by Oehlert (2000): that ‘such tests do not tell us what we need to know: the amount of non-constant variance that is present and how it affects our inferences’. The key issue is not whether variances are equal, but whether any heterogeneity that may be present affects the reliability of the ANOVA. Despite extensive tests of the effects of heterogeneity and non-normality on ANOVA (Glass et al. 1972), and on homogeneity tests such as Bartlett’s test (Conover et al. 1981), there does not seem to have been any attempt to address this specific issue.
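Both statistics are simple to compute. As a minimal sketch (assuming Python with numpy and scipy; `cochrans_c` is my own helper name, since scipy provides Bartlett’s test but not Cochran’s, and the example data are illustrative only):

```python
# Cochran's C: largest group variance divided by the sum of the group
# variances (as defined later in this paper). Bartlett's test comes from
# scipy. The helper name and example data are assumptions of this sketch.
import numpy as np
from scipy import stats

def cochrans_c(*groups):
    variances = [np.var(g, ddof=1) for g in groups]  # sample variances
    return max(variances) / sum(variances)

g1, g2, g3 = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 5.0, 10.0]
C = cochrans_c(g1, g2, g3)            # variances 1, 4, 25 -> C = 25/30
chi2, p = stats.bartlett(g1, g2, g3)  # Bartlett's chi-squared statistic
```

Cochran’s C is then referred to tables of critical values (e.g. those in Winer et al. 1991), whereas Bartlett’s statistic is compared with the χ² distribution on k − 1 degrees of freedom.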
The work described here aimed to do this by answering three questions: (i) Are ANOVA and variance homogeneity tests similarly responsive to heterogeneity of variance? (ii) Do variance homogeneity tests reliably identify situations in which the validity of ANOVA is compromised? (iii) Does the use of a smaller significance level (e.g. α = 0.01), as suggested by Neter et al. (1985), improve the reliability of homogeneity tests?

Only Bartlett’s and Cochran’s tests were examined: the former because it is probably the most frequently implemented and used; the latter because it has properties, as Winer et al. (1991) and Underwood (1981, 1997) discuss, that make it potentially more useful than other procedures: these issues are discussed in more detail later. Cochran’s test statistic is simply the largest variance divided by the sum of all variances (tables are available in Winer et al. 1991). Formulae for calculating Bartlett’s test, which gives a χ² value, are available in many texts (Table 1) and it is routinely output by statistical packages.

Table 1. Summary of the amount of discussion of the ANOVA homogeneity assumption in textbooks at a range of levels, together with the procedures that the author(s) recommend

Authors                      Level         Discussion  Tests
Brown and Hollander (1977)   Intermediate  Minimal     None
Ennos (2000)                 Introductory  None        None
Fowler et al. (1998)         Introductory  Minimal     F-max
Neter et al. (1985)          Advanced      Moderate    Residual plots; if populations are reasonably normal then Bartlett’s, F-max
Neter et al. (1982)          Introductory  Moderate    Residual plots
Oehlert (2000)               Introductory  Moderate    Residual plots; recommends against the use of tests but favours Levene’s test if variances must be compared
Parker (1979)                Introductory  Minimal     F-max
Sokal and Rohlf (1969)       Intermediate  Moderate    Bartlett’s, F-max; problems with these noted
Snedecor and Cochran (1980)  Intermediate  Moderate    Bartlett’s, F-max (Hartley), Levene’s; latter recommended
Underwood (1997)             Advanced      Extended    Discusses Bartlett’s, Levene’s and Cochran’s tests; latter recommended
Weiss (1995)                 Introductory  Minimal     Normal probability plot followed by correlation between data and normal scores
Winer et al. (1991)          Advanced      Extended    Describes Cochran’s, Bartlett’s, F-max, Box–Scheffé, Brown–Forsythe; recommends Cochran’s and F-max for general use
Zar (1984)                   Intermediate  Moderate    Recommends none, but suggests Bartlett’s if populations normal, otherwise none

The selection of texts is idiosyncratic (they were just those I owned). In the classification used, an introductory text would be suitable for a first or second year introductory course in experimental design and analysis.

METHODS

Simulations

All simulations were run using Visual Basic for Applications in Microsoft Excel 97. A test of the function used to generate random numbers confirmed that it produced normally distributed numbers (Kolmogorov–Smirnov test, P > 0.20) with the required mean (mean of 5000 numbers = 100.01; required mean = 100), although the variance was slightly lower than expected (variance of 5000 numbers = 0.97; required variance = 1). The latter result is of little importance as the results of all situations were compared with ‘control’ simulations, in which all means and variances were equal. In most cases, the percentage of significant results in ‘control’ simulations was very close to the nominal significance level (Table 2; Table 3), indicating that the simulations were sound.

Effect of variance heterogeneity, sample size and number of groups on Cochran’s and Bartlett’s tests

Sample data were generated for ANOVA with k = 3 and k = 5 groups, and with sample sizes of n = 3, 10, 17 and 37 (the latter three sizes were used because Winer et al. (1991) has critical values for Cochran’s test with samples of these sizes). In all cases, the means of the groups were all equal (= 100), so the null hypothesis being tested by the ANOVA F-test was always true; any significant F-tests were therefore Type I errors. It is well known that the greatest Type I error rates in ANOVA result from situations in which a single variance differs from the remainder (Glass et al. 1972; Underwood 1981, 1997; Winer et al. 1991), so the simulations done here followed this pattern. In all, five scenarios were examined. In one, the control scenario, all variances were equal. There were two scenarios in which one variance was much larger than the remainder (9 and 81) and two in which one variance was much smaller than the remainder (1/9 and 1/81).

Table 2. Effect of heterogeneity of variance, and increasing sample size, on the Type I error rate (%) of the ANOVA F-test, and the percentage of significant Cochran’s and Bartlett’s tests (power), in comparisons of k = 3 means

                     ‘Odd’ variance (α = 0.05)             ‘Odd’ variance (α = 0.01)
                     1/81    1/9     1      9      81      1/81    1/9     1      9      81
n = 3   F-ratio      8.2     6.8     5.2    9.2    14.3    2.4     1.9     1.0    2.9    7.2
        Cochran’s    24.3    14.4    5.0    32.6   84.9    9.6     4.0     1.0    12.5   69.3
        Bartlett’s   57.7    14.8    4.4    24.0   77.8    17.7    3.0     0.7    6.6    51.7
n = 10  F-ratio      7.2     5.9     4.8    7.1    9.3     2.0     1.8     1.1    2.6    4.1
        Cochran’s    47.2    34.3    5.2    93.9   100.0   23.6    14.3    0.8    85.6   100.0
        Bartlett’s   100.0   88.1    5.0    91.4   100.0   100.0   65.5    1.1    79.6   100.0
n = 17  F-ratio      6.7     5.7     5.3    7.4    9.1     2.1     1.7     0.9    2.4    3.7
        Cochran’s    69.2    53.4    5.2    99.6   100.0   38.1    25.1    1.3    98.7   100.0
        Bartlett’s   100.0   99.6    4.9    99.4   100.0   100.0   97.3    1.3    98.0   100.0
n = 37  F-ratio      6.2     5.7     6.0    7.7    8.4     1.6     1.3     1.2    2.7    3.4
        Cochran’s    100.0   96.5    4.8    100.0  100.0   82.2    60.3    1.0    100.0  100.0
        Bartlett’s   100.0   100.0   5.1    100.0  100.0   100.0   100.0   0.9    100.0  100.0

Results are given for tests at the 5% and 1% significance levels.

Table 3. Effect of heterogeneity of variance, and increasing sample size, on the Type I error rate (%) of the ANOVA F-test, and the percentage of significant Cochran’s and Bartlett’s tests (power), in comparisons of k = 5 means

                     ‘Odd’ variance (α = 0.05)             ‘Odd’ variance (α = 0.01)
                     1/81    1/9     1      9      81      1/81    1/9     1      9      81
n = 3   F-ratio      6.5     6.7     5.1    11.3   17.8    1.7     1.6     1.1    4.7    10.8
        Cochran’s    12.2    10.3    5.1    43.0   89.8    3.8     2.9     1.1    24.7   83.2
        Bartlett’s   39.9    12.5    4.5    28.1   82.8    10.6    2.5     0.9    12.0   70.3
n = 10  F-ratio      6.3     6.0     5.1    9.8    13.6    1.5     1.2     1.0    4.8    7.6
        Cochran’s    20.6    18.4    4.9    96.1   100.0   6.4     5.2     1.2    92.1   100.0
        Bartlett’s   100.0   84.6    4.6    93.1   100.0   100.0   56.0    1.2    84.9   100.0
n = 17  F-ratio      5.8     5.7     5.3    9.8    12.6    1.4     1.5     1.2    4.3    7.0
        Cochran’s    29.7    24.1    5.4    99.8   100.0   10.3    7.8     1.0    99.5   100.0
        Bartlett’s   100.0   99.7    5.1    99.6   100.0   100.0   96.1    1.2    98.9   100.0
n = 37  F-ratio      5.8     5.9     5.1    9.3    11.9    1.3     1.4     1.1    4.2    6.7
        Cochran’s    50.2    41.3    4.9    100.0  100.0   20.4    15.5    1.0    100.0  100.0
        Bartlett’s   100.0   100.0   4.6    100.0  100.0   100.0   100.0   1.0    100.0  100.0

Results are given for tests at the 5% and 1% significance levels.
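The simulation design just described can be sketched as follows (the original used Visual Basic for Applications; this is a re-implementation sketch in Python, assuming numpy and scipy, with the replicate count cut from 5000 for speed; `type1_rate` is my own name):

```python
# Monte Carlo sketch of the Type I error simulations: k groups with equal
# means (100) and unit variance, except one 'odd' group whose variance is
# varied; because all means are equal, every significant F-test is a Type I
# error. nreps is reduced from the paper's 5000 -- an assumption for speed.
import numpy as np
from scipy import stats

def type1_rate(odd_var, k=3, n=3, nreps=2000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(nreps):
        groups = [rng.normal(100.0, 1.0, n) for _ in range(k - 1)]
        groups.append(rng.normal(100.0, np.sqrt(odd_var), n))  # the 'odd' group
        if stats.f_oneway(*groups).pvalue < alpha:
            hits += 1
    return hits / nreps

for v in (1/81, 1/9, 1, 9, 81):
    print(f"odd variance {v:>8.4g}: F-test Type I error rate = {type1_rate(v):.3f}")
```

With one variance 81 times the others, the rejection rate climbs well above the nominal 5%, in line with Table 2.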
The values for the largest and smallest variances (81 and 1/81) were extreme, but using such values does provide ‘worst case’ scenarios. Each scenario was run 5000 times and the proportions of F, Cochran’s and Bartlett’s tests significant at α = 0.05 and 0.01 were tabulated (critical F and χ² values were calculated using Excel functions; critical Cochran’s values were taken from the tables in Winer et al. 1991). For the ANOVA F-test, this gives the Type I error rate (because the null hypothesis was always true). When the null hypothesis tested by Cochran’s and Bartlett’s tests was true (i.e. all variances were equal), the results give the Type I error rates for these tests; otherwise they give the power of the test (i.e. the probability of rejecting the null hypothesis when it is false).

Reliability of the results of Cochran’s and Bartlett’s tests

It is well known that as heterogeneity increases, so does the ANOVA Type I error rate (Glass et al. 1972; Underwood 1981, 1997). The simulations in this study were not designed to test this but, instead, to compare the frequency of these errors, under different heterogeneity conditions, with the frequency with which Cochran’s and Bartlett’s tests identified a problem (i.e. indicated that variances were heterogeneous). Because these homogeneity tests are used in a ‘gatekeeper’ role (being used to indicate whether or not the ANOVA is likely to be valid), it seems desirable that these tests should detect heterogeneous variances on most of the occasions on which the F-test results in a Type I error. Thus, in the simulations described previously, records were also made of the number of cases in which the F-test and the homogeneity tests agreed (both significant or non-significant) and disagreed (one significant and the other not).
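This bookkeeping can be sketched in the same framework. One caveat: Cochran’s critical values come from tables rather than a standard distribution, so this sketch substitutes Bartlett’s test (which scipy supplies with a P-value) as the homogeneity ‘gatekeeper’; the function name, category labels and reduced replicate count are my own assumptions.

```python
# Cross-classify the ANOVA F-test against a homogeneity test for each
# simulated data set: 'false +ve' = homogeneity test significant but F not;
# 'false -ve' = F significant (a Type I error here, as all means are equal)
# but the homogeneity test not. Bartlett's test stands in for Cochran's.
import numpy as np
from scipy import stats

def gatekeeper_tally(odd_var, k=3, n=3, nreps=2000, alpha=0.05, seed=2):
    rng = np.random.default_rng(seed)
    counts = {'no problem': 0, 'false +ve': 0, 'false -ve': 0, 'VH found': 0}
    for _ in range(nreps):
        groups = [rng.normal(100.0, 1.0, n) for _ in range(k - 1)]
        groups.append(rng.normal(100.0, np.sqrt(odd_var), n))
        f_sig = stats.f_oneway(*groups).pvalue < alpha
        h_sig = stats.bartlett(*groups).pvalue < alpha   # gatekeeper test
        key = ('VH found' if f_sig and h_sig else
               'false -ve' if f_sig else
               'false +ve' if h_sig else 'no problem')
        counts[key] += 1
    return {key: 100.0 * c / nreps for key, c in counts.items()}

print(gatekeeper_tally(81))  # one variance 81 times the other two
```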
From these totals it is possible to determine how often the homogeneity tests resulted in a false positive, indicating that there was a problem when there wasn’t, or in a false negative, failing to detect a problem that existed. These simulations used the 5% significance level for the F-test and Cochran’s test because this is the usual practice.

Effect of a smaller significance level (α = 0.01) on the reliability of Cochran’s test

The utility of the suggestion of Neter et al. (1985) of using a smaller significance level for testing the homogeneity assumption was examined during the ‘reliability’ simulations by also testing for heterogeneity at the 1% level while retaining the 5% level for the ANOVA F-test. This comparison was only made for Cochran’s test because, as noted by other authors (Winer et al. 1991; Underwood 1997) and confirmed by the simulations described above, this test has desirable properties that Bartlett’s test does not.

RESULTS AND DISCUSSION

Effect of variance heterogeneity

As expected, the frequency of ANOVA Type I errors and of significant Cochran’s and Bartlett’s tests increased with heterogeneity among variances (Table 2; Table 3). As reported by other authors, the frequency of ANOVA Type I errors did not increase greatly, even when one variance was much larger than all others (Cochran 1947; Glass et al. 1972; Underwood 1997). The two homogeneity tests, however, rejected the null hypothesis of variance equality most of the time in this situation, even at small sample sizes (Table 2; Table 3). For instance, at k = 3 and n = 3, the ANOVA Type I error rate at the nominal level of 5% was actually approximately 14% (i.e. the null was rejected almost three times as often as it should be), but Cochran’s test and Bartlett’s test indicated heterogeneous variances in approximately 85 and 78% of analyses, respectively. Thus, not only are these tests considerably more sensitive to non-normality than is ANOVA (e.g. Box 1953; Zar 1984; Winer et al.
1991), they are also considerably more sensitive to variance heterogeneity. This is the problem to which Neter et al. (1985) and Oehlert (2000) allude. It is also worth noting, as reported by other authors (Glass et al. 1972; Winer et al. 1991; Underwood 1997), that ANOVA was more seriously affected by one variance being larger than the others than by one being smaller (Table 2; Table 3). One important result was that both homogeneity tests displayed similar behaviour, rejecting the null hypothesis more frequently when one variance was much larger than the others, but this trend was much more apparent with Cochran’s test than with Bartlett’s (Table 2; Table 3). This desirable behaviour follows directly from the definition of Cochran’s test, which compares the largest variance to the sum of all variances.

Effect of sample size

As expected, as sample size increased so did the power of the homogeneity tests, leading to the null hypothesis being rejected more frequently (Table 2; Table 3). In contrast, it is known that as sample size increases, ANOVA becomes less sensitive to variance heterogeneity (Glass et al. 1972; see Table 2; Table 3). Thus, paradoxically, the homogeneity tests are more likely to indicate a problem in situations in which ANOVA is actually less likely to be affected. Indeed, anyone who analyses large experiments, with more than approximately 15–20 groups, will routinely find it difficult to accept the homogeneity assumption.

Effect of number of groups

It is also known that as the number of groups increases, ANOVA tends to become more sensitive to variance heterogeneity (Glass et al. 1972; Wilcox et al. 1986; see Table 2; Table 3), but the homogeneity tests tended to become less sensitive (Table 2; Table 3). Thus, these homogeneity tests again tended to be more sensitive when it was less important for ANOVA, and vice versa.
Rate of false positives and negatives

The results given herein do suggest that homogeneity tests fulfil their primary role, albeit overzealously, as ‘gatekeepers’, warning of situations in which the ANOVA might be unreliable. Unfortunately, this impression may not be entirely correct. Results indicated that not only did the homogeneity tests have a high rate of ‘false positives’, indicating that there was a problem when there wasn’t, they also often had a high rate of ‘false negatives’, failing to detect a problem when one existed (Table 4; results for Bartlett’s test, which are not shown, were similar to those for Cochran’s test, save that the former had a slightly higher rate of false positives and a lower rate of false negatives).

This paradoxical result is, perhaps, easiest explained by example. When the ratio of variances was 1:1:81, and n = 3, ANOVA gave a ‘significant’ result in 684 of the 5000 simulated analyses: a Type I error rate of approximately 14%. Cochran’s test was significant in 4262 of those simulated analyses, a power of approximately 85%. Unfortunately, Cochran’s test and the ANOVA F-test were only jointly significant on 345 occasions; thus Cochran’s test only correctly identified approximately half of the ANOVA in which Type I errors occurred.

As would be expected from earlier results, the frequency of false positives (indicating a problem when none existed) increased greatly with variance heterogeneity and sample size (Table 4), but averaged approximately 49%. As a consequence, the rate of false negatives (failing to identify a problem when one actually existed) decreased with increasing sample size and variance heterogeneity (Table 4). Even so, much of the time (38% on average) Cochran’s test failed to detect a problem when one actually existed (note in Table 4 that the percentage in the ‘false negative’ row is usually much greater than that in the ‘VH found’ row beneath). In other words, in simulations with heterogeneous variances, the ANOVA often gave a Type I error in cases where Cochran’s test indicated that variances were not significantly unequal. (It is important to note that even if homogeneity tests worked well they would not necessarily always indicate a problem when a Type I error occurred in the ANOVA, as Type I errors will arise even when variances are equal.) Thus, even Cochran’s test does not appear to provide consistent and reasonably reliable information about the issue of most concern: the likely validity of the ANOVA.

Table 4. Percentage of significant and non-significant Cochran’s (C) tests, categorized according to whether or not the ANOVA F-test was significant

                                          k = 3                                  k = 5
‘Odd’ variance               1/81   1/9    1      9      81     1/81   1/9    1      9      81
n = 3
  No problem (F NS; C NS)    69.8   78.7   90.5   59.3   8.0    82.4   83.9   89.7   48.0   4.5
  False +ve (F NS; C *)      22.5   14.2   4.8    32.2   78.3   10.8   9.6    5.0    41.2   77.2
  False –ve (F *; C NS)      6.1    6.4    4.5    7.8    6.8    5.9    5.9    5.1    9.1    5.2
  VH found (F *; C *)        1.7    0.7    0.2    0.8    6.9    0.9    0.5    0.2    1.6    13.1
n = 10
  No problem (F NS; C NS)    48.6   59.8   90.8   4.3    0.0    74.7   77.8   91.3   3.0    0.0
  False +ve (F NS; C *)      44.7   33.6   4.7    88.0   90.6   19.4   16.9   4.1    86.9   86.7
  False –ve (F *; C NS)      3.3    4.6    4.3    0.9    0.0    4.7    4.4    4.5    1.1    0.0
  VH found (F *; C *)        3.4    2.0    0.3    6.8    9.4    1.2    0.9    0.1    9.0    13.3
n = 17
  No problem (F NS; C NS)    29.0   42.9   90.5   0.2    0.0    66.0   72.4   89.8   0.2    0.0
  False +ve (F NS; C *)      64.3   51.1   4.8    91.8   90.1   28.2   22.5   5.3    91.2   87.0
  False –ve (F *; C NS)      2.2    3.0    4.4    0.2    0.0    4.3    3.6    4.8    0.0    0.0
  VH found (F *; C *)        4.4    3.0    0.3    7.8    9.9    1.6    1.5    0.1    8.6    13.0
n = 37
  No problem (F NS; C NS)    0.0    3.8    90.5   0.0    0.0    49.8   58.7   95.1   0.0    0.0
  False +ve (F NS; C *)      94.0   90.3   4.1    92.5   91.7   46.7   38.8   4.7    91.5   88.2
  False –ve (F *; C NS)      0.0    0.4    5.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
  VH found (F *; C *)        6.0    5.6    0.3    7.5    8.3    3.5    2.5    0.2    8.5    12.8

In this context, a false positive (+ve) occurred when the ANOVA was not significant but Cochran’s test was: in these cases the test ‘identified’ a problem that did not exist. A false negative (–ve) occurred when the ANOVA was significant but the heterogeneity test indicated that variances were homogeneous: in these cases the test failed to identify a problem that did exist. ‘No problem’ means that Cochran’s test correctly indicated no variance heterogeneity and ‘VH found’ means that it correctly found a variance problem. All significance tests used the 5% level. Results are given for analyses with 3 or 5 groups, with each of four sample sizes, and for five variance heterogeneity conditions. NS, non-significant; *significant.

Test of Cochran’s test at α = 0.01

Testing Cochran’s value at α = 0.01, instead of α = 0.05, markedly reduced the number of false positives, but only for the smaller sample sizes and usually only when heterogeneity resulted from one small variance (compare Table 4 and Table 5). The penalty for this was, of course, a reduction, although sometimes only slight, in the number of times that the test correctly identified heterogeneity that affected the ANOVA result (i.e. an increase in false negatives). For instance, when one variance was nine times larger than the other two, and there were three replicates, use of a 0.01 significance level resulted in false positives dropping from 32.2 to 12.3%, whereas false negatives only increased from 7.8 to 8.4%. This resulted in Cochran’s test, in this case, only correctly identifying a small proportion of the analyses in which the ANOVA gave a Type I error but, as noted previously, using the usual 0.05 level gave results that were only slightly better. In almost all cases with large sample sizes, and where one variance was larger than the remainder, using a significance level of 0.01 for Cochran’s test resulted in no change in the number of ANOVA with Type I errors that were identified. Overall, using Cochran’s test with a significance level of 0.01 does improve the performance of the test, but this improvement decreases with increasing sample size and heterogeneity.

Table 5. Percentage of significant and non-significant Cochran’s (C) tests, categorized according to whether or not the F-test was significant, when the ANOVA F-ratio was tested at the 5% level but Cochran’s value was tested at the 1% level

                                          k = 3                                  k = 5
‘Odd’ variance               1/81   1/9    1      9      81     1/81   1/9    1      9      81
n = 3
  No problem (F NS; C NS)    83.2   88.7   94.3   79.2   20.9   89.9   90.6   93.9   65.1   8.3
  False +ve (F NS; C *)      9.1    4.1    1.0    12.3   65.4   3.3    3.0    0.9    24.2   73.5
  False –ve (F *; C NS)      7.1    7.1    4.7    8.4    10.0   6.6    6.3    5.2    10.1   7.9
  VH found (F *; C *)        0.6    0.1    0.0    0.1    3.6    0.2    0.1    0.1    0.7    10.3
n = 10
  No problem (F NS; C NS)    71.8   78.6   94.5   11.4   0.0    88.3   89.2   94.5   6.3    0.0
  False +ve (F NS; C *)      21.5   14.8   1.0    80.9   90.6   5.9    5.4    0.9    83.6   86.7
  False –ve (F *; C NS)      4.9    5.9    4.5    2.0    0.0    5.4    5.1    4.6    1.9    0.0
  VH found (F *; C *)        1.8    0.7    0.0    5.6    9.4    0.4    0.3    0.0    8.2    13.3
n = 17
  No problem (F NS; C NS)    58.7   69.2   94.3   0.8    0.0    84.3   87.6   94.1   0.5    0.0
  False +ve (F NS; C *)      34.6   24.8   0.9    91.2   90.1   9.9    7.2    1.0    90.9   87.0
  False –ve (F *; C NS)      4.3    4.5    4.7    0.3    0.0    5.4    4.6    4.9    0.1    0.0
  VH found (F *; C *)        2.4    1.5    0.0    7.7    9.9    0.4    0.5    0.0    8.5    13.0
n = 37
  No problem (F NS; C NS)    16.1   39.1   93.8   0.0    0.0    49.8   58.7   95.1   0.0    0.0
  False +ve (F NS; C *)      77.9   55.0   0.8    92.5   91.7   19.1   14.7   1.0    91.5   88.2
  False –ve (F *; C NS)      1.2    2.6    5.3    0.0    0.0    29.7   25.8   3.8    0.0    0.0
  VH found (F *; C *)        4.8    3.4    0.1    7.5    8.3    1.3    0.8    0.1    8.5    11.8

Results are given for analyses with 3 or 5 groups, with each of four sample sizes, and for five variance heterogeneity conditions. See caption for Table 4 for further details. +ve, positive; –ve, negative; NS, non-significant; *significant.

Conclusions and recommendations

These results confirm Box’s (1953) conclusion that testing the homogeneity assumption ‘may well lead to more wrong conclusions than if the preliminary test was omitted’. These simulations all used normally distributed observations: results would, if anything, have been worse with non-normal observations because ANOVA, with a balanced design, is not greatly affected by moderate non-normality (Glass et al. 1972), but most homogeneity tests, including Cochran’s and Bartlett’s, are (Conover et al. 1981). There is also no reason to suspect that any of the procedures recommended by Conover et al. (1981) would have performed better than Cochran’s test, and they may have performed worse: compared with Cochran’s test, all three lacked power at the small sample sizes, when ANOVA is least robust and where power is most needed.

How then should one proceed? Given the results here, and the observations of Neter et al. (1985), Winer et al.
(1991) and Underwood (1981, 1997), several recommendations can be made.

1. A balanced design should be used whenever possible and, as Underwood (1997) discusses, there are rarely good reasons for not doing this. A few replicates may be lost during the study but this should be of little importance.

2. Graphical methods (residual plots, normal plots, variance vs mean plots) should be used to check the distribution of observations and variances. All that is required for the analysis to be reliable is approximate normality of observations and equality of variances. Only when there is a single, large variance, or marked non-normality, are there likely to be substantial problems. Winer et al. (1991), Underwood (1981, 1997) and Conover et al. (1981) may be consulted for further discussion (but also see Wilcox et al. 1986).

3. Groups with large variances should be checked for errors and outliers. An unusually large variance may simply result from an incorrectly recorded or entered observation (Underwood 1981, 1997). This is probably not a common reason for heterogeneous variances but it does happen.

4. The data should be appropriately transformed if there is a relationship between means and variances, or if there are biological reasons for expecting non-normal distributions (for appropriate transformations see standard texts; Underwood 1981, 1997). Such relationships may be the rule, rather than the exception, in ecological studies.

5. If a formal test of the homogeneity assumption is required, Cochran’s test, with α = 0.01, may be used, because it is less sensitive to heterogeneity caused by small variances, but this procedure is still likely to be overly conservative except for small sample sizes.

6. Workers should remember that, as the usual effect of heterogeneous variances on ANOVA is to increase the frequency of Type I errors, a non-significant F-test is likely to be reliable, whether or not variances are heterogeneous (Underwood 1981, 1997).

7.
Workers should accept that statistical tests do not give exact results and that the true Type I error rate (significance level) for any test will only approximate, albeit perhaps closely in some cases, the nominal level. (Anyone who writes P < 0.05135 is fooling themselves.)

8. In cases where there is very substantial non-normality or heterogeneity, alternative procedures (such as the use of randomization tests, or generalized linear models with non-normal error distributions) may be pursued through specialist advice.

ACKNOWLEDGEMENTS

This work resulted from discussions with Michael Douglas about detecting and dealing with heterogeneity, which prompted me to run simulations to check the reliability of a short-cut I had been using. These confirmed that my short-cut was probably reliable, but this result proved less comforting than I might have hoped. Comments by referees improved the manuscript.

REFERENCES

Box G. E. P. (1953) Non-normality and tests on variances. Biometrika 40, 318–35.
Brown B. W. & Hollander M. (1977) Statistics, a Biomedical Introduction. John Wiley & Sons, New York.
Cochran W. G. (1947) Some consequences when the assumptions for the analysis of variance are not satisfied. Biometrics 3, 22–38.
Conover W. J., Johnson M. E. & Johnson M. M. (1981) A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics 23, 351–61.
Ennos R. (2000) Statistical and Data Handling Skills in Biology. Pearson Education, Edinburgh Gate.
Fowler J., Cohen L. & Jarvis P. (1998) Practical Statistics for Field Biology, 2nd edn. John Wiley & Sons, New York.
Glass G. V., Peckham P. D. & Sanders J. R. (1972) Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Rev. Educ. Res. 42, 239–88.
Neter J., Wasserman W. & Kutner M. H. (1985) Applied Linear Statistical Models, 2nd edn. Richard D. Irwin, Homewood.
Neter J., Wasserman W. & Whitmore G. A. (1982) Applied Statistics, 2nd edn. Allyn and Bacon, Boston.
Oehlert G. W. (2000) A First Course in Design and Analysis of Experiments. W. H. Freeman, New York.
Parker R. E. (1979) Introductory Statistics for Biology, 2nd edn. The Institute of Biology’s Studies in Biology no. 43. Edward Arnold, London.
Snedecor G. W. & Cochran W. G. (1980) Statistical Methods, 7th edn. Iowa State University Press, Ames.
Sokal R. R. & Rohlf F. J. (1969) Biometry, the Principles and Practice of Statistics in Biological Research. W. H. Freeman, San Francisco.
Underwood A. J. (1981) Techniques of analysis of variance in experimental marine biology and ecology. Ann. Rev. Oceanogr. Mar. Biol. 19, 513–605.
Underwood A. J. (1997) Experiments in Ecology, Their Logical Design and Interpretation Using Analysis of Variance. Cambridge University Press, Cambridge.
Weiss N. A. (1995) Introductory Statistics, 4th edn. Addison-Wesley, Reading.
Wilcox R. R., Charlin V. & Thompson K. (1986) New Monte Carlo results on the robustness of the ANOVA F, W and F(*) statistics. Commun. Stat. Simul. Comp. 15, 933–44.
Winer B. J., Brown D. R. & Michels K. M. (1991) Statistical Principles in Experimental Design, 3rd edn. McGraw-Hill, New York.
Zar J. H. (1984) Biostatistical Analysis, 2nd edn. Prentice Hall, Englewood Cliffs.