Austral Ecology (2002) 27, 681–688

Of rowing boats, ocean liners and tests of the ANOVA homogeneity of variance assumption

KEITH A. MCGUINNESS
School of Biological, Environmental and Chemical Sciences, Northern Territory University, Darwin,
Northern Territory 0909, Australia (Email: [email protected])
Abstract One of the assumptions of analysis of variance (ANOVA) is that the variances of the groups being
compared are approximately equal. This assumption is routinely checked before doing an analysis, although some
workers consider ANOVA robust and do not bother, while others avoid parametric procedures entirely. Two of the more
commonly used heterogeneity tests are Bartlett’s and Cochran’s, although, as for most of these tests, they may well
be more sensitive to violations of the ANOVA assumptions than is ANOVA itself. Simulations were used to examine
how well these two tests protected ANOVA against the problems created by variance heterogeneity. Although
Cochran’s test performed a little better than Bartlett’s, both tests performed poorly, frequently disallowing perfectly
valid analyses. Recommendations are made about how to proceed, given these results.
Key words: ANOVA, assumption, homogeneity, Bartlett’s, Cochran’s, variance, Type I error.
INTRODUCTION
‘To make the preliminary test on variances is rather like
putting to sea in a rowing boat to find out whether
conditions are sufficiently calm for an ocean liner to
leave port!’ (Box 1953)
The analysis of variance (ANOVA) is generally
regarded as one of the most powerful and flexible
methods for testing null hypotheses about means of
samples (Underwood 1997). Indeed, if its assumptions
are met, ANOVA provides the most powerful (i.e. most
likely to reject the null when it is actually false) tests of
such null hypotheses. The validity of the ANOVA
assumptions, however, frequently causes much angst,
and no little confusion, among ecologists; a situation
certainly created, at least in part, by some authors of
statistical texts (Table 1).
The most critical assumption, and possibly the least
understood (although it should be the easiest to satisfy),
is that the observations (strictly speaking, the error
terms) are uncorrelated, or independent and, although
‘this difficulty is largely taken care of by proper
randomization’, ‘substantial biases’ may occur if this
assumption is not met (Cochran 1947). Glass et al.
(1972), in their extended review of the ANOVA assumptions, conclude that ‘non-independence of errors
seriously affects both the level of significance and
power of the F-test regardless of whether n’s are equal
or unequal’ (see Underwood (1997) for an extended
discussion of non-independence and the other ANOVA
assumptions).
Accepted for publication May 2002.
Most attention, however, routinely focuses on the
assumptions that the observations are (after appropriate transformation) normally distributed, and that the
compared groups have equal variances, despite ANOVA
being widely regarded as robust against moderate
violations of these requirements (Cochran 1947; Box
1953; Glass et al. 1972; Winer et al. 1991; Underwood
1997). Underwood (1981), for instance, was critical of
workers who used ANOVA without checking its requirements, but texts, particularly at an introductory level,
may mention neither the assumptions, nor the procedures which may be used to test these (e.g. Brown &
Hollander 1977; Ennos 2000; Table 1).
Texts that do discuss these assumptions often present
differing views of the importance of violating them and
contradictory advice on which methods should be
used to check them (Table 1). Fowler et al. (1998), for
instance, describe the use of the F-max test (also
known as Hartley’s test) then recommend that if ‘there
is doubt as to whether a particular set of data satisfies
the assumptions made in the use of a parametric test,
then a non-parametric alternative should be used.’
Oehlert (2000), in contrast, takes the opposite line:
‘There are formal tests for equality of variance—do not
use them!’ (italics in original). Oehlert (2000) is one of
several authors who describe and recommend graphical procedures, such as residual and normal plots,
although some of these authors also describe formal
tests (e.g. Neter et al. 1982; Weiss 1995).
The three formal tests that appear to be most
commonly recommended are the F-max test, Bartlett’s
test and Cochran’s test (Table 1; Conover et al. 1981;
Winer et al. 1991). Levene’s test is becoming more
frequently mentioned, although it is not clear why as
the version usually described does not appear to have
any particular advantages (Conover et al. 1981). Of
these three tests, Bartlett’s may be the most frequent
choice (Underwood 1997). Underwood (1997)
suggests that this is because it handles unbalanced
designs but it is also one of the most frequently
described procedures (Table 1).
Texts describing Bartlett’s test (Zar 1984; Neter
et al. 1985) often stress its undue sensitivity to non-normality and recommend that it not be used unless the
observations are reasonably normal. In fact, excessive
sensitivity to non-normality is the Achilles heel of most
variance homogeneity tests (Conover et al. 1981). This
point is noted by several texts (Zar 1984; Neter et al.
1985; Winer et al. 1991; Underwood 1997), with
Oehlert (2000) being the most scathing: ‘classical tests
of constant variance (such as Bartlett’s test or Hartley’s
test) are so incredibly sensitive to non-normality that
their inferences are worthless in practice’ (italics in
original). In fact, these ‘classical tests’ are more
sensitive to non-normality and heterogeneity than
ANOVA itself is, and Neter et al. (1985) suggested that
‘if the populations are reasonably normal, so that the
Bartlett test can be employed and the sample sizes do
not differ greatly, a fairly low level may be justified in
testing the equality of variances for determining the
aptness of the ANOVA model, since only large differences between variances need be detected’.
Conover et al. (1981) compared a wide range of
homogeneity tests and identified three that were robust
against non-normality and were also reasonably
powerful (see their paper for details). These tests would
be useful when differences among variances were of
genuine interest, as may sometimes be the case (Parker
1979; Underwood 1997), but they are not usually
included in standard texts or statistical packages. In any
case, the simulations done by Conover et al. (1981) did
not explicitly test the usefulness of these variance tests
for checking the importance of violating the ANOVA
assumption of homogeneity. Thus, their results do not
address the criticism made of all such tests by Oehlert
(2000) that ‘such tests do not tell us what we need to
know: the amount of non-constant variance that is
present and how it affects our inferences’.
The key issue is not whether variances are equal, but
whether any heterogeneity that may be present affects
the reliability of the ANOVA. Despite extensive tests of
the effects of heterogeneity and non-normality on
ANOVA (Glass et al. 1972), and on homogeneity tests
such as Bartlett’s tests (Conover et al. 1981), there does
not seem to have been any attempt to address this
specific issue. The work described here aimed to do this
by answering three questions: (i) Are ANOVA and variance homogeneity tests similarly responsive to heterogeneity of variance? (ii) Do variance homogeneity tests reliably identify situations in which the validity of ANOVA is compromised? (iii) Does the use of a smaller significance level (e.g. α = 0.01), as suggested by Neter et al. (1985), improve the reliability of homogeneity tests? Only Bartlett’s and Cochran’s tests were examined: the former because it is probably the most frequently implemented and used; the latter because it has properties, as Winer et al. (1991) and Underwood (1981, 1997) discuss, that make it potentially more useful than other procedures: these issues are discussed in more detail later. Cochran’s test statistic is simply the largest variance divided by the sum of all variances (tables are available in Winer et al. 1991). Formulae for calculating Bartlett’s test, which gives a χ² value, are available in many texts (Table 1) and it is routinely output by statistical packages.

Table 1. Summary of the amount of discussion of the ANOVA homogeneity assumption in textbooks at a range of levels, together with the procedures that the author(s) recommend

Authors                       Level         Discussion  Tests
Brown and Hollander (1977)    Intermediate  Minimal     None
Ennos (2000)                  Introductory  None        None
Fowler et al. (1998)          Introductory  Minimal     F-max
Neter et al. (1985)           Advanced      Moderate    Residual plots; if populations are reasonably normal then Bartlett’s, F-max
Neter et al. (1982)           Introductory  Moderate    Residual plots
Oehlert (2000)                Introductory  Moderate    Residual plots; recommends against the use of tests but favours Levene’s test if variances must be compared
Parker (1979)                 Introductory  Minimal     F-max
Sokal and Rohlf (1969)        Intermediate  Moderate    Bartlett’s, F-max, problems with these noted
Snedecor and Cochran (1980)   Intermediate  Moderate    Bartlett’s, F-max (Hartley), Levene’s; latter recommended
Underwood (1997)              Advanced      Extended    Discusses Bartlett’s, Levene’s and Cochran’s tests; latter recommended
Weiss (1995)                  Introductory  Minimal     Normal probability plot followed by correlation between data and normal scores
Winer et al. (1991)           Advanced      Extended    Describes Cochran’s, Bartlett’s, F-max, Box–Scheffé, Brown–Forsythe; recommends Cochran’s and F-max for general use
Zar (1984)                    Intermediate  Moderate    Recommends none, but suggests Bartlett’s, if populations normal, otherwise none

The selection of texts is idiosyncratic (they were just those I owned). In the classification used, an introductory text would be suitable for a first or second year introductory course in experimental design and analysis.

METHODS

Simulations

All simulations were run using Visual Basic for Applications in Microsoft Excel 97. A test of the function used to generate random numbers confirmed that it produced normally distributed numbers (Kolmogorov–Smirnov test, P > 0.20) with the required mean (mean of 5000 numbers = 100.01; required mean = 100), although the variance was slightly lower than expected (variance of 5000 numbers = 0.97; required variance = 1). The latter result is of little importance as the results of all situations were compared with ‘control’ simulations, in which all means and variances were equal. In most cases, the percentage of significant results in ‘control’ simulations was very close to the nominal significance level (Table 2; Table 3), indicating that the simulations were sound.

Effect of variance heterogeneity, sample size and number of groups on Cochran’s and Bartlett’s tests
Sample data were generated for ANOVA with k = 3 and
k = 5 groups, and with sample sizes of n = 3, 10, 17
and 37 (the latter three sizes were used because Winer
et al. (1991) has critical values for Cochran’s test with samples of these sizes). In all cases, the means of the groups were all equal (= 100), so the null hypothesis being tested by the ANOVA F-test was always true; any significant F-tests were therefore Type I errors.

Table 2. Effect of heterogeneity of variance, and increasing sample size, on the Type I error rate (%) of the ANOVA F-test, and the percentage of significant Cochran’s and Bartlett’s tests (power), in comparisons of k = 3 means

‘Odd’            n = 3                n = 10               n = 17               n = 37
variance     F      C      B      F      C      B      F      C      B      F      C      B
α = 0.05
1/81        8.2   24.3   57.7    7.2   47.2  100.0    6.7   69.2  100.0    6.2  100.0  100.0
1/9         6.8   14.4   14.8    5.9   34.3   88.1    5.7   53.4   99.6    5.7   96.5  100.0
1           5.2    5.0    4.4    4.8    5.2    5.0    5.3    5.2    4.9    6.0    4.8    5.1
9           9.2   32.6   24.0    7.1   93.9   91.4    7.4   99.6   99.4    7.7  100.0  100.0
81         14.3   84.9   77.8    9.3  100.0  100.0    9.1  100.0  100.0    8.4  100.0  100.0
α = 0.01
1/81        2.4    9.6   17.7    2.0   23.6  100.0    2.1   38.1  100.0    1.6   82.2  100.0
1/9         1.9    4.0    3.0    1.8   14.3   65.5    1.7   25.1   97.3    1.3   60.3  100.0
1           1.0    1.0    0.7    1.1    0.8    1.1    0.9    1.3    1.3    1.2    1.0    0.9
9           2.9   12.5    6.6    2.6   85.6   79.6    2.4   98.7   98.0    2.7  100.0  100.0
81          7.2   69.3   51.7    4.1  100.0  100.0    3.7  100.0  100.0    3.4  100.0  100.0

Results are given for tests at the 5% and 1% significance levels. F, ANOVA F-test; C, Cochran’s test; B, Bartlett’s test.

Table 3. Effect of heterogeneity of variance, and increasing sample size, on the Type I error rate (%) of the ANOVA F-test, and the percentage of significant Cochran’s and Bartlett’s tests (power), in comparisons of k = 5 means

‘Odd’            n = 3                n = 10               n = 17               n = 37
variance     F      C      B      F      C      B      F      C      B      F      C      B
α = 0.05
1/81        6.5   12.2   39.9    6.3   20.6  100.0    5.8   29.7  100.0    5.8   50.2  100.0
1/9         6.7   10.3   12.5    6.0   18.4   84.6    5.7   24.1   99.7    5.9   41.3  100.0
1           5.1    5.1    4.5    5.1    4.9    4.6    5.3    5.4    5.1    5.1    4.9    4.6
9          11.3   43.0   28.1    9.8   96.1   93.1    9.8   99.8   99.6    9.3  100.0  100.0
81         17.8   89.8   82.8   13.6  100.0  100.0   12.6  100.0  100.0   11.9  100.0  100.0
α = 0.01
1/81        1.7    3.8   10.6    1.5    6.4  100.0    1.4   10.3  100.0    1.3   20.4  100.0
1/9         1.6    2.9    2.5    1.2    5.2   56.0    1.5    7.8   96.1    1.4   15.5  100.0
1           1.1    1.1    0.9    1.0    1.2    1.2    1.2    1.0    1.2    1.1    1.0    1.0
9           4.7   24.7   12.0    4.8   92.1   84.9    4.3   99.5   98.9    4.2  100.0  100.0
81         10.8   83.2   70.3    7.6  100.0  100.0    7.0  100.0  100.0    6.7  100.0  100.0

Results are given for tests at the 5% and 1% significance levels. F, ANOVA F-test; C, Cochran’s test; B, Bartlett’s test.
It is well known that the greatest Type I error rates in
ANOVA result from situations in which a single variance
differs from the remainder (Glass et al. 1972; Underwood 1981, 1997; Winer et al. 1991), so the simulations done here followed this pattern. In all, five
scenarios were examined. In one, the control scenario,
all variances were equal. There were two scenarios in
which one variance was much larger than the remainder (9 and 81) and two in which one variance was
much smaller than the remainder (1/9 and 1/81).
The values for the largest and smallest variances (81
and 1/81) were extreme but using such values does
provide ‘worst case’ scenarios.
Each scenario was run 5000 times and the proportions of F, Cochran’s and Bartlett’s tests significant at α = 0.05 and 0.01 were tabulated (critical F and χ² values were calculated using Excel functions; critical
Cochran’s values were taken from the tables in Winer
et al. 1991). For the ANOVA F-test, this gives the Type I
error rate (because the null hypothesis was always
true). When the null hypothesis tested by Cochran’s
and Bartlett’s tests was true (i.e. all variances were
equal), the results give the Type I error rates for these
tests; otherwise they give the power of the test (i.e. the
probability of rejecting the null when it is false).
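In modern terms, one run of these simulations can be sketched as follows (an illustrative reconstruction in Python with NumPy and SciPy, not the original Visual Basic code; `simulate` and `cochran_C` are hypothetical names, and significance of Cochran's statistic is left to tabulated critical values, as in the study):

```python
import numpy as np
from scipy import stats


def cochran_C(groups):
    """Cochran's statistic: the largest group variance divided by
    the sum of all group variances."""
    v = [np.var(g, ddof=1) for g in groups]
    return max(v) / sum(v)


def simulate(k=3, n=3, odd_var=81.0, trials=2000, alpha=0.05, seed=0):
    """Proportion of significant ANOVA F-tests and Bartlett's tests when
    all group means are equal (100) and one 'odd' group has variance
    odd_var; any significant F-test is therefore a Type I error."""
    rng = np.random.default_rng(seed)
    f_sig = bartlett_sig = 0
    for _ in range(trials):
        groups = [rng.normal(100.0, 1.0, n) for _ in range(k - 1)]
        groups.append(rng.normal(100.0, np.sqrt(odd_var), n))
        f_sig += stats.f_oneway(*groups).pvalue < alpha
        bartlett_sig += stats.bartlett(*groups).pvalue < alpha
    return f_sig / trials, bartlett_sig / trials
```

With one variance 81 times the others and small samples, this reproduces the qualitative pattern of Table 2: a modestly inflated F-test Type I error rate, but a Bartlett's test that rejects far more often.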
Reliability of the results of Cochran’s and Bartlett’s
tests
It is well known that as heterogeneity increases, so does
the ANOVA Type I error rate (Glass et al. 1972; Underwood 1981, 1997). The simulations in this study were
not designed to test this but, instead, to compare the
frequency of these errors, under different heterogeneity
conditions, with the frequency with which Cochran’s
and Bartlett’s tests identified a problem (i.e. indicated
that variances were heterogeneous). Because these
homogeneity tests are used in a ‘gatekeeper’ role (being
used to indicate whether or not the ANOVA is likely to
be valid) it seems desirable that these tests should
detect heterogeneous variances on most of the occasions in which the F-test results in a Type I error. Thus,
in the simulations described previously, records were
also made of the number of cases in which the F-test
and the homogeneity tests agreed (both significant or
non-significant) and disagreed (one significant and the
other not). From these totals it is possible to determine
how often the homogeneity tests resulted in a false
positive, indicating that there was a problem when there
wasn’t, or in a false negative, failing to detect a problem
that existed. These simulations used the 5% significance
level for the F-test and Cochran’s test because this is the
usual practice.
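The cross-tabulation just described can be sketched as follows (a hypothetical Python helper, not the paper's code: it classifies each simulated data set by the joint outcome of the F-test and Cochran's test, with the critical value for Cochran's statistic estimated by Monte Carlo rather than read from the published tables used in the study):

```python
import numpy as np
from scipy import stats


def cochran_C(groups):
    # Largest group variance over the sum of all group variances
    v = [np.var(g, ddof=1) for g in groups]
    return max(v) / sum(v)


def cochran_crit(k, n, alpha=0.05, reps=4000, seed=0):
    # Monte Carlo critical value for Cochran's C under equal variances
    rng = np.random.default_rng(seed)
    null_c = [cochran_C(rng.normal(0.0, 1.0, (k, n))) for _ in range(reps)]
    return float(np.quantile(null_c, 1.0 - alpha))


def classify(k=3, n=3, odd_var=81.0, trials=2000, alpha=0.05, seed=1):
    """Percentage of runs in each joint outcome of the ANOVA F-test and
    Cochran's test (all group means equal, one group with variance odd_var)."""
    rng = np.random.default_rng(seed)
    crit = cochran_crit(k, n, alpha)
    counts = {'no problem': 0, 'false +ve': 0, 'false -ve': 0, 'VH found': 0}
    for _ in range(trials):
        groups = [rng.normal(100.0, 1.0, n) for _ in range(k - 1)]
        groups.append(rng.normal(100.0, np.sqrt(odd_var), n))
        f_sig = stats.f_oneway(*groups).pvalue < alpha
        c_sig = cochran_C(groups) > crit
        if f_sig:
            counts['VH found' if c_sig else 'false -ve'] += 1
        else:
            counts['false +ve' if c_sig else 'no problem'] += 1
    return {key: 100.0 * val / trials for key, val in counts.items()}
```

Mimicking the third set of simulations (F-test at the 5% level, Cochran's test at the 1% level) would amount to calling `cochran_crit` with its own, smaller alpha.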
Effect of a smaller significance level (α = 0.01) on the reliability of Cochran’s test
The utility of the suggestion of Neter et al. (1985) of
using a smaller significance level for testing the homogeneity assumption was examined during the ‘reliability’ simulations by also testing for heterogeneity at the
1% level while retaining the 5% level for the ANOVA
F-test. This comparison was only made for Cochran’s
test because, as noted by other authors (Winer et al.
1991; Underwood 1997) and confirmed by the simulations described above, this test has desirable properties
that Bartlett’s test does not.
RESULTS AND DISCUSSION
Effect of variance heterogeneity
As expected, the frequency of ANOVA Type I errors and
significant Cochran’s and Bartlett’s tests increased with
heterogeneity among variances (Table 2; Table 3). As
reported by other authors, the frequency of ANOVA
Type I errors did not increase greatly, even when one
variance was much larger than all others (Cochran
1947; Glass et al. 1972; Underwood 1997). The two
homogeneity tests, however, rejected the null of
variance equality most of the time in this situation,
even using small sample sizes (Table 2; Table 3). For
instance, at k = 3 and n = 3, the ANOVA Type I error
rate at the nominal level of 5% was actually approximately 14% (i.e. the null was rejected almost three
times as often as it should be) but Cochran’s test and
Bartlett’s test indicated heterogeneous variances in
approximately 85 and 78% of analyses, respectively.
Thus, not only are these tests considerably more
sensitive to non-normality than is ANOVA (e.g. Box
1953; Zar 1984; Winer et al. 1991), they are also
considerably more sensitive to variance heterogeneity.
This is the problem to which Neter et al. (1985) and
Oehlert (2000) allude.
It is also worth noting, as reported by other authors
(Glass et al. 1972; Winer et al. 1991; Underwood
1997), that ANOVA was more seriously affected by one
variance being larger than the others than by one being
smaller (Table 2; Table 3). One important result was
that both homogeneity tests displayed similar behaviour, rejecting the null more frequently when one
variance was much larger than the others, but this
trend was much more apparent with Cochran’s test
than with Bartlett’s (Table 2; Table 3). This desirable
behaviour follows directly from the definition of
Cochran’s test, which compares the largest variance to
the sum of all variances.
Effect of sample size
As expected, as sample size increased so did the power
of the homogeneity tests, leading to the null being
rejected more frequently (Table 2; Table 3). In contrast, it is known that as sample size increases, ANOVA
becomes less sensitive to variance heterogeneity (Glass
et al. 1972; see Table 2; Table 3). Thus, paradoxically,
the homogeneity tests are more likely to indicate a
problem in situations in which ANOVA is actually less
likely to be affected. Indeed, anyone who analyses large
experiments, with more than approximately 15–20
groups, will routinely find it difficult to accept the
homogeneity assumption.
Effect of number of groups
It is also known that as the number of groups increases,
ANOVA tends to become more sensitive to variance
heterogeneity (Glass et al. 1972; Wilcox et al. 1986; see
Table 2; Table 3) but the homogeneity tests tended
to become less sensitive (Table 2; Table 3). Thus,
these homogeneity tests again tended to be more
sensitive when it was less important for ANOVA, and
vice versa.
Rate of false positives and negatives
The results given herein do suggest that homogeneity
tests fulfil their primary role, albeit overzealously, as
‘gatekeepers’, warning of situations in which the ANOVA
might be unreliable. Unfortunately, this impression
may not be entirely correct. Results indicated that not
only did the homogeneity tests have a high rate of ‘false
positives’, indicating that there was a problem when
there wasn’t, they also often had a high rate of ‘false
negatives’, failing to detect a problem when one existed
(Table 4; results for Bartlett’s test, which are not
shown, were similar to those for Cochran’s test, save
that the former had a slightly higher rate of false
positives and lower rate of false negatives). This
paradoxical result is, perhaps, most easily explained by
example. When the ratio of variances was 1:1:81, and
n = 3, ANOVA gave a ‘significant’ result in 684 of the
5000 simulated analyses: a Type I error rate of approximately 14%. Cochran’s test was significant in 4262 of
those simulated analyses, a power of approximately
85%. Unfortunately, Cochran’s test and the ANOVA
F-test were only jointly significant on 345 occasions;
thus Cochran’s test only correctly identified approximately half of the ANOVA in which Type I errors
occurred.
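The arithmetic behind this example is worth setting out explicitly:

```python
# Worked example from the text: k = 3, variances 1:1:81, n = 3, 5000 runs
runs = 5000
f_sig = 684     # runs with a significant ANOVA F-test (all Type I errors)
c_sig = 4262    # runs with a significant Cochran's test
joint = 345     # runs in which both tests were significant

type_i_rate = f_sig / runs    # 0.1368, the 'approximately 14%'
cochran_power = c_sig / runs  # 0.8524, the 'approximately 85%'
flagged = joint / f_sig       # about 0.50: Cochran's test flagged only
                              # half of the analyses with a Type I error
```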
As would be expected from earlier results, the
frequency of false positives (indicating a problem
when none existed) increased greatly with variance
heterogeneity and sample size (Table 4), but averaged approximately 49%. As a consequence, the rate of false negatives (failing to identify a problem when one actually existed) decreased with increasing sample size and variance heterogeneity (Table 4). Even so, much of the time (38% on average) Cochran’s test failed to detect a problem when one actually existed (note in Table 4 that the percentage in the ‘false negative’ row is usually much greater than that in the ‘VH found’ row beneath). In other words, in simulations with heterogeneous variances, the ANOVA often gave a Type I error in cases where Cochran’s test indicated that variances were not significantly unequal. (It is important to note that even if homogeneity tests worked well they would not necessarily always indicate a problem when a Type I error occurred in the ANOVA, as Type I errors will arise even when variances are equal.) Thus, even Cochran’s test does not appear to provide consistent and reasonably reliable information about the issue of most concern: the likely validity of the ANOVA.

Table 4. Percentage of significant and non-significant Cochran’s (C) tests, categorized according to whether or not the ANOVA F-test was significant

                                        k = 3                              k = 5
Test result                 1/81   1/9     1     9    81    1/81   1/9     1     9    81
n = 3
 No problem (F NS; C NS)    69.8  78.7  90.5  59.3   8.0    82.4  83.9  89.7  48.0   4.5
 False +ve (F NS; C *)      22.5  14.2   4.8  32.2  78.3    10.8   9.6   5.0  41.2  77.2
 False –ve (F *; C NS)       6.1   6.4   4.5   7.8   6.8     5.9   5.9   5.1   9.1   5.2
 VH found (F *; C *)         1.7   0.7   0.2   0.8   6.9     0.9   0.5   0.2   1.6  13.1
n = 10
 No problem (F NS; C NS)    48.6  59.8  90.8   4.3   0.0    74.7  77.8  91.3   3.0   0.0
 False +ve (F NS; C *)      44.7  33.6   4.7  88.0  90.6    19.4  16.9   4.1  86.9  86.7
 False –ve (F *; C NS)       3.3   4.6   4.3   0.9   0.0     4.7   4.4   4.5   1.1   0.0
 VH found (F *; C *)         3.4   2.0   0.3   6.8   9.4     1.2   0.9   0.1   9.0  13.3
n = 17
 No problem (F NS; C NS)    29.0  42.9  90.5   0.2   0.0    66.0  72.4  89.8   0.2   0.0
 False +ve (F NS; C *)      64.3  51.1   4.8  91.8  90.1    28.2  22.5   5.3  91.2  87.0
 False –ve (F *; C NS)       2.2   3.0   4.4   0.2   0.0     4.3   3.6   4.8   0.0   0.0
 VH found (F *; C *)         4.4   3.0   0.3   7.8   9.9     1.6   1.5   0.1   8.6  13.0
n = 37
 No problem (F NS; C NS)     0.0   3.8  90.5   0.0   0.0    49.8  58.7  95.1   0.0   0.0
 False +ve (F NS; C *)      94.0  90.3   4.1  92.5  91.7    46.7  38.8   4.7  91.5  88.2
 False –ve (F *; C NS)       0.0   0.4   5.0   0.0   0.0     0.0   0.0   0.0   0.0   0.0
 VH found (F *; C *)         6.0   5.6   0.3   7.5   8.3     3.5   2.5   0.2   8.5  12.8

In this context, a false positive (+ve) occurred when the ANOVA was not significant but Cochran’s test was: in these cases the test ‘identified’ a problem that did not exist. A false negative (–ve) occurred when the ANOVA was significant but the heterogeneity test indicated that variances were homogeneous: in these cases the test failed to identify a problem that did exist. ‘No problem’ means that Cochran’s test correctly indicated no variance heterogeneity and ‘VH found’ means that it correctly found a variance problem. All significance tests used the 5% level. Results are given for analyses with 3 or 5 groups, with each of four sample sizes, and for five variance heterogeneity conditions. NS, non-significant; *, significant.
Test of Cochran’s test at α = 0.01

Testing Cochran’s value at α = 0.01, instead of α = 0.05, markedly reduced the number of false positives, but only for the smaller sample sizes and usually only when heterogeneity resulted from one small variance (compare Table 4 and Table 5). The penalty for this was, of course, a reduction, although sometimes only slight, in the number of times that the test correctly identified heterogeneity that affected the ANOVA result (i.e. an increase in false negatives). For instance, when one variance was nine times larger than the other two, and there were three replicates, use of a 0.01 significance level resulted in false positives dropping from 32.2 to 12.3%, whereas false negatives only increased from 7.8 to 8.4%. This meant that Cochran’s test, in this case, still only correctly identified a small proportion of the analyses in which the ANOVA gave a Type I error but, as noted previously, using the usual 0.05 level gave results that were only slightly better. In almost all cases with large sample sizes, and where one variance was larger than the remainder, using a significance level of 0.01 for Cochran’s test resulted in no change in the number of ANOVA with Type I errors that were identified. Overall, using Cochran’s test with a significance level of 0.01 does improve the performance of the test but this improvement decreases with increasing sample size and heterogeneity.

Conclusions and recommendations

These results confirm Box’s (1953) conclusion that
testing the homogeneity assumption ‘may well lead to
more wrong conclusions than if the preliminary test
was omitted’. These simulations all used normally
distributed observations: results would, if anything,
have been worse with non-normal observations
because ANOVA, with a balanced design, is not greatly
affected by moderate non-normality (Glass et al. 1972)
but most homogeneity tests, including Cochran’s and
Bartlett’s, are (Conover et al. 1981). There is also no
reason to suspect that any of the procedures recommended by Conover et al. (1981) would have performed better than Cochran’s test and they may have performed worse: compared with Cochran’s test, all three lacked power at the small sample sizes, when ANOVA is least robust and where power is most needed.

Table 5. Percentage of significant and non-significant Cochran’s (C) tests, categorized according to whether or not the ANOVA F-test was significant, when the ANOVA F-ratio was tested at the 5% level but Cochran’s value was tested at the 1% level

                                        k = 3                              k = 5
Test result                 1/81   1/9     1     9    81    1/81   1/9     1     9    81
n = 3
 No problem (F NS; C NS)    83.2  88.7  94.3  79.2  20.9    89.9  90.6  93.9  65.1   8.3
 False +ve (F NS; C *)       9.1   4.1   1.0  12.3  65.4     3.3   3.0   0.9  24.2  73.5
 False –ve (F *; C NS)       7.1   7.1   4.7   8.4  10.0     6.6   6.3   5.2  10.1   7.9
 VH found (F *; C *)         0.6   0.1   0.0   0.1   3.6     0.2   0.1   0.1   0.7  10.3
n = 10
 No problem (F NS; C NS)    71.8  78.6  94.5  11.4   0.0    88.3  89.2  94.5   6.3   0.0
 False +ve (F NS; C *)      21.5  14.8   1.0  80.9  90.6     5.9   5.4   0.9  83.6  86.7
 False –ve (F *; C NS)       4.9   5.9   4.5   2.0   0.0     5.4   5.1   4.6   1.9   0.0
 VH found (F *; C *)         1.8   0.7   0.0   5.6   9.4     0.4   0.3   0.0   8.2  13.3
n = 17
 No problem (F NS; C NS)    58.7  69.2  94.3   0.8   0.0    84.3  87.6  94.1   0.5   0.0
 False +ve (F NS; C *)      34.6  24.8   0.9  91.2  90.1     9.9   7.2   1.0  90.9  87.0
 False –ve (F *; C NS)       4.3   4.5   4.7   0.3   0.0     5.4   4.6   4.9   0.1   0.0
 VH found (F *; C *)         2.4   1.5   0.0   7.7   9.9     0.4   0.5   0.0   8.5  13.0
n = 37
 No problem (F NS; C NS)    16.1  39.1  93.8   0.0   0.0    49.8  58.7  95.1   0.0   0.0
 False +ve (F NS; C *)      77.9  55.0   0.8  92.5  91.7    19.1  14.7   1.0  91.5  88.2
 False –ve (F *; C NS)       1.2   2.6   5.3   0.0   0.0    29.7  25.8   3.8   0.0   0.0
 VH found (F *; C *)         4.8   3.4   0.1   7.5   8.3     1.3   0.8   0.1   8.5  11.8

Results are given for analyses with 3 or 5 groups, with each of four sample sizes, and for five variance heterogeneity conditions. See caption for Table 4 for further details. +ve, positive; –ve, negative; NS, non-significant; *, significant.
How then should one proceed? Given the results
here, and the observations of Neter et al. (1985), Winer
et al. (1991) and Underwood (1981, 1997), several
recommendations can be made.
1. A balanced design should be used whenever
possible and, as Underwood (1997) discusses, there are
rarely good reasons for not doing this. A few replicates
may be lost during the study but this should be of little
importance.
2. Graphical methods (residual plots, normal plots,
variance vs means plots) should be used to check the
distribution of observations and variances. All that is
required for the analysis to be reliable is approximate
normality of observations and equality of variances.
Only when there is a single, large variance, or marked
non-normality, are there likely to be substantial
problems. Winer et al. (1991), Underwood (1981,
1997) and Conover et al. (1981) may be consulted for
further discussion (but also see Wilcox et al. 1986).
3. Groups with large variances should be checked
for errors and outliers. An unusually large variance may
simply result from an incorrectly recorded or entered
observation (Underwood 1981, 1997). This is probably
not a common reason for heterogeneous variances but
it does happen.
4. The data should be appropriately transformed if
there is a relationship between means and variances, or
if there are biological reasons for expecting non-normal
distributions (for appropriate transformations see
standard texts; Underwood 1981, 1997). Such
relationships may be the rule, rather than the exception,
in ecological studies.
5. If a formal test of the homogeneity assumption is
required, Cochran’s test, with = 0.01, may be used,
because it is less sensitive to heterogeneity caused by
small variances, but this procedure is still likely to be
overly conservative except for small sample sizes.
6. Workers should remember that, as the usual effect
of heterogeneous variances on ANOVA is to increase the
frequency of Type I errors, a non-significant F-test
is likely to be reliable, whether or not variances are
heterogeneous (Underwood 1981, 1997).
7. Workers should accept that statistical tests do not
give exact results and the true Type I error rate
(significance level) for any test will only approximate,
albeit perhaps closely in some cases, the nominal level.
(Anyone who writes P < 0.05135 is fooling themselves.)
8. In cases where there is very substantial non-normality or heterogeneity, alternative procedures
(such as the use of randomization tests, or generalized
linear models with non-normal error distributions)
may be pursued through specialist advice.
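The transformation advice in recommendation 4 can be illustrated with a small sketch (a hypothetical example, not from the paper: three lognormal groups whose standard deviation scales with their mean, so a log transformation removes the mean-variance relationship):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three groups whose spread grows with their mean (SD roughly proportional
# to the mean), a common pattern in ecological count and biomass data
group_means = [10.0, 50.0, 250.0]
groups = [rng.lognormal(mean=np.log(m), sigma=0.3, size=20) for m in group_means]

raw_vars = [np.var(g, ddof=1) for g in groups]
log_vars = [np.var(np.log(g), ddof=1) for g in groups]

# On the raw scale the largest variance dwarfs the smallest; after the
# log transformation the group variances are of comparable size
raw_ratio = max(raw_vars) / min(raw_vars)
log_ratio = max(log_vars) / min(log_vars)
```

Plotting group variances against group means before and after such a transformation is exactly the kind of graphical check referred to in recommendation 2.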
ACKNOWLEDGEMENTS
This work resulted from discussions with Michael
Douglas about detecting and dealing with heterogeneity, which prompted me to run simulations to
check the reliability of a short-cut I had been using.
These confirmed that my short-cut was probably
reliable but this result proved less comforting than I
might have hoped. Comments by referees improved the
manuscript.
REFERENCES
Box G. E. P. (1953) Non-normality and tests on variances.
Biometrika 40, 318–35.
Brown B. W. & Hollander M. (1977) Statistics, a Biomedical
Introduction. John Wiley & Sons, New York.
Cochran W. G. (1947) Some consequences when the assumptions for the analysis of variance are not satisfied. Biometrics
3, 22–38.
Conover W. J., Johnson M. E. & Johnson M. M. (1981) A
comparative study of tests for homogeneity of variances,
with applications to the outer continental shelf bidding data.
Technometrics 23, 351–61.
Ennos R. (2000) Statistical and Data Handling Skills in Biology.
Pearson Education, Edinburgh Gate.
Fowler J., Cohen L. & Jarvis P. (1998) Practical Statistics for Field
Biology, 2nd edn. John Wiley & Sons, New York.
Glass G. V., Peckham P. D. & Sanders J. R. (1972) Consequences of failure to meet assumptions underlying the fixed
effects analysis of variance and covariance. Rev. Educ. Res.
42, 239–88.
Neter J., Wasserman W. & Kutner M. H. (1985) Applied Linear
Statistical Models, 2nd edn. Richard D. Irwin, Homewood.
Neter J., Wasserman W. & Whitmore G. A. (1982) Applied
Statistics, 2nd edn. Allyn and Bacon, Boston.
Oehlert G. W. (2000) A First Course in Design and Analysis of
Experiments. W. H. Freeman, New York.
Parker R. E. (1979) Introductory Statistics for Biology, 2nd edn.
The Institute in Biology’s Studies in Biology no. 43. Edward
Arnold, London.
Snedecor G. W. & Cochran W. G. (1980) Statistical Methods, 7th
edn. Iowa State University Press, Ames.
Sokal R. R. & Rohlf F. J. (1969) Biometry, the Principles and
Practice of Statistics in Biological Research. W. H. Freeman,
San Francisco.
Underwood A. J. (1981) Techniques of analysis of variance in
experimental marine biology and ecology. Ann. Rev.
Oceanogr. Mar. Biol. 19, 513–605.
Underwood A. J. (1997) Experiments in Ecology, Their Logical
Design and Interpretation Using Analysis of Variance.
Cambridge University Press, Cambridge.
Weiss N. A. (1995) Introductory Statistics, 4th edn. Addison-Wesley, Reading.
Wilcox R. R., Charlin V. & Thompson K. (1986) New Monte
Carlo results on the robustness of the ANOVA F, W and F(*)
statistics. Commun. Stat. Simul. Comp. 15, 933–44.
Winer B. J., Brown D. R. & Michels K. M. (1991) Statistical
Principles in Experimental Design, 3rd edn. McGraw-Hill,
New York.
Zar J. H. (1984) Biostatistical Analysis, 2nd edn. Prentice Hall,
Englewood Cliffs.