Newsom
USP 634 Data Analysis I
Spring 2013
Some Comments and Definitions Related to the
Assumptions of Within-subjects ANOVA
The Sphericity Assumption
The sphericity assumption states that the variances of the difference scores in a within-subjects design (the sD² in a paired t-test) are equal across all pairs of conditions. Imagine a within-subjects design with three levels of the independent variable (e.g., pretest, posttest, and follow-up), and that difference scores are calculated for each subject comparing pretest with posttest, posttest with follow-up, and pretest with follow-up. (The sphericity assumption does not apply to within-subjects ANOVAs that have only two levels, because there is only one set of difference scores.) The sphericity assumption (sometimes
called the “circularity” assumption) assumes that the variances of each of these sets of
difference scores are not statistically different from one another. The sphericity assumption is
similar to the homogeneity of variance assumption in between-subjects ANOVA. When this
assumption is violated, there will be an increase in Type I errors, because the critical values in
the F-table are too small. One could say that the F-test is positively biased under these
conditions. There are several different approaches to correcting for this bias.
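Before turning to the corrections, here is a minimal Python/NumPy sketch of the assumption itself. The data are made up (the column means only loosely echo the example at the end of this handout); the point is that sphericity concerns the variances of the pairwise difference scores, not the variances of the raw scores at each level.

    import numpy as np

    # Hypothetical scores: rows are subjects, columns are the three
    # conditions (e.g., pretest, posttest, follow-up).
    rng = np.random.default_rng(0)
    scores = rng.normal(loc=[40.0, 26.0, 34.5], scale=5.0, size=(12, 3))

    # Compute the variance of each set of pairwise difference scores;
    # sphericity holds when these variances are equal in the population.
    for i, j in [(0, 1), (1, 2), (0, 2)]:
        d = scores[:, i] - scores[:, j]
        print(f"var of differences ({i},{j}): {d.var(ddof=1):.2f}")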
Lower bound correction. This is a correction for a violation of the sphericity
assumption. The correction works by using a higher F-critical value to reduce the likelihood of a
Type I error. Outside of SPSS, the lower bound correction is usually referred to as the
“Geisser-Greenhouse” correction, in which dfA = 1 and dfAxS = s - 1 (where a is the number of levels and s is the number of subjects) are used instead of the usual
dfA = a - 1 and dfAxS = (a - 1)(s - 1). Under this correction to the dfs, the F-critical values will be
larger and it will be harder to detect significance, thus reducing Type I error. Unfortunately, this
correction approach tends to overcorrect, so there are too many Type II errors with this
procedure (i.e., low power). This correction assumes the maximum degree of heterogeneity
among the cells.
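As a rough illustration of why this correction is conservative, the Python sketch below compares the usual and lower-bound critical F values, assuming a = 3 levels and s = 12 subjects (the numbers from the example later in this handout):

    from scipy.stats import f

    a, s = 3, 12   # a = number of levels, s = number of subjects (assumed)

    # Usual critical value: dfA = a - 1, dfAxS = (a - 1)(s - 1)
    usual = f.ppf(0.95, a - 1, (a - 1) * (s - 1))

    # Lower-bound (Geisser-Greenhouse) critical value: dfA = 1, dfAxS = s - 1
    lower = f.ppf(0.95, 1, s - 1)

    print(usual, lower)   # the lower-bound critical F is noticeably larger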
Huynh & Feldt correction. This correction is based on a similar correction by Box
(1954). In both cases, an adjustment factor based on the amount of variance heterogeneity
(i.e., how much the variances are unequal) is computed (the adjustment factor is called epsilon).
Then, both dfAxS and dfA are adjusted by this factor, so that the F-critical will be somewhat
larger. The correction is not as severe as the "lower bound" correction.
Geisser-Greenhouse correction. The Geisser-Greenhouse correction referred to in
SPSS is another variant on the procedure described above under Huynh & Feldt. A slightly
different correction factor (epsilon) is computed, which corrects the degrees of freedom slightly
more than the Huynh & Feldt correction. So, significance tests with this correction will be a little
more conservative (higher p-value) than those using the Huynh-Feldt correction.
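For readers who want to see the machinery, below is a sketch of the usual textbook formulas for the two epsilons, computed from the covariance matrix of the repeated measures via normalized Helmert contrasts. The Huynh-Feldt formula shown is one common single-group form; SPSS's exact computational details may differ slightly, so treat this as illustrative rather than a reproduction of SPSS output.

    import numpy as np

    def gg_epsilon(S):
        """Greenhouse-Geisser epsilon from the p x p covariance matrix S
        of the repeated measures. Ranges from 1/(p - 1) (the lower bound,
        maximum violation) to 1 (sphericity holds exactly)."""
        p = S.shape[0]
        C = np.zeros((p - 1, p))          # normalized Helmert contrasts
        for i in range(p - 1):
            C[i, :i + 1] = 1.0
            C[i, i + 1] = -(i + 1)
            C[i] /= np.linalg.norm(C[i])
        T = C @ S @ C.T                   # covariance of the contrast scores
        return np.trace(T) ** 2 / ((p - 1) * np.trace(T @ T))

    def hf_epsilon(eps_gg, n, p):
        """Huynh-Feldt epsilon (capped at 1), computed from the GG
        epsilon; it is at least as large as the GG epsilon, which is
        why its correction is less severe."""
        k = p - 1
        return min(1.0, (n * k * eps_gg - 2) / (k * (n - 1 - k * eps_gg)))

    # e.g., with an n x p data matrix `scores` (hypothetical):
    # eps = gg_epsilon(np.cov(scores, rowvar=False))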
Mauchly's test. Mauchly's test is a test of the sphericity assumption using a chi-square statistic. Unfortunately, this test is not a very useful one. For small sample sizes, it tends to have too many Type II errors (i.e., it misses sphericity violations), and for large sample sizes, it tends to be significant even though the violation is small in magnitude (Type I error; e.g., Keselman, Rogan, Mendoza, & Breen, 1980). Thus, Mauchly's test may not be much help and may be misleading when deciding whether there is a violation of the sphericity assumption.
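For illustration only, here is a Python sketch of how Mauchly's W and its standard chi-square approximation can be computed from an n x p (subjects by conditions) data matrix; the contrast-building code mirrors the epsilon sketch above, and SPSS's internal computations may differ in detail.

    import numpy as np
    from scipy.stats import chi2

    def mauchly_test(scores):
        """Mauchly's sphericity test for an n x p matrix of repeated
        measures, using the standard chi-square approximation."""
        n, p = scores.shape
        C = np.zeros((p - 1, p))          # normalized Helmert contrasts
        for i in range(p - 1):
            C[i, :i + 1] = 1.0
            C[i, i + 1] = -(i + 1)
            C[i] /= np.linalg.norm(C[i])
        T = C @ np.cov(scores, rowvar=False) @ C.T
        k = p - 1
        W = np.linalg.det(T) / (np.trace(T) / k) ** k
        d = 1 - (2 * k ** 2 + k + 2) / (6 * k * (n - 1))
        stat = -(n - 1) * d * np.log(W)
        df = p * (p - 1) // 2 - 1         # df = 0 when there are 2 levels
        return W, stat, chi2.sf(stat, df)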
Compound Symmetry
Another assumption of within-subjects ANOVA that you may hear about is the “compound
symmetry” assumption. The compound symmetry assumption is a stricter assumption than the sphericity assumption. Imagine computing the variance of the scores at each level and the covariance (technically, the unstandardized version of a correlation) between scores for each possible pair of levels. Under the compound symmetry assumption, the variances must all be equal in the population, and the covariances must all be equal as well. Compound symmetry implies sphericity, but the reverse is not true: a violation of this stricter compound symmetry assumption does not necessarily mean the sphericity assumption will be violated. SPSS does not currently provide a test of this assumption.
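A small numerical illustration of why compound symmetry is sufficient for sphericity: with one common variance and one common covariance, every pairwise difference score has the same variance, var(Xi - Xj) = 2(variance - covariance). The numbers below are arbitrary.

    import numpy as np

    # A compound symmetric covariance matrix: one common variance on the
    # diagonal and one common covariance everywhere off the diagonal.
    var, cov = 25.0, 10.0
    p = 3
    sigma = np.full((p, p), cov)
    np.fill_diagonal(sigma, var)

    # Variance of each pairwise difference: var_i + var_j - 2*cov_ij
    for i in range(p):
        for j in range(i + 1, p):
            print(sigma[i, i] + sigma[j, j] - 2 * sigma[i, j])  # all 30.0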
Nonadditivity
The error term for within-subjects ANOVA is the subject-by-treatment interaction term, S x A. In other words, we assume that any variation in the differences between levels of the independent variable is due to error variation. It is possible, however, that the effect of the independent variable A differs across subjects, so that there is truly an interaction between S and A. In that case, some of what we consider to be error when we calculate S x A is really an interaction of subject and treatment and not error variation. For example, if Factor A represents program groups, then an S x A interaction suggests the program is not equally effective for each subject. This is the so-called “additivity” assumption: the assumption that there is no interaction between A and S beyond error (interactions are multiplicative, or “nonadditive”). Because nonadditivity (a violation of this assumption) implies heterogeneous variances for the difference scores, the sphericity assumption will be
violated if nonadditivity occurs. SPSS provides a test for this assumption, the Tukey test for
nonadditivity, under the scale reliability command. The output includes a suggested
transformation of the data (raising each score to a certain power) that can be employed if the
test is significant.
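For the curious, here is a Python sketch of the textbook computation behind Tukey's one-degree-of-freedom test for nonadditivity. SPSS's exact implementation details (such as the residual degrees of freedom it reports) may differ, so this is a sketch of the general technique rather than a reproduction of the SPSS procedure.

    import numpy as np
    from scipy.stats import f

    def tukey_nonadditivity(y):
        """Tukey's one-df test for nonadditivity on an n x p
        (subjects x conditions) data matrix."""
        n, p = y.shape
        grand = y.mean()
        r = y.mean(axis=1) - grand        # subject (row) effects
        c = y.mean(axis=0) - grand        # condition (column) effects
        # One-df sum of squares for nonadditivity
        ss_nonadd = (r @ y @ c) ** 2 / ((r @ r) * (c @ c))
        # S x A (residual) sum of squares
        resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0) + grand
        ss_res = (resid ** 2).sum()
        df_bal = (n - 1) * (p - 1) - 1    # balance df after removing 1 df
        F = ss_nonadd / ((ss_res - ss_nonadd) / df_bal)
        return F, f.sf(F, 1, df_bal)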
Comments and Recommendations
The sphericity and compound symmetry assumptions do not apply when there are only two
levels (or cells) of the within-subjects factor (e.g., pretest vs. posttest only). The sphericity assumption, for instance, does not stipulate that the variances of the scores at each level are the same; it concerns the variances of the difference scores calculated for pairs of levels (e.g., Time 2 scores minus Time 1 scores vs. Time 3 scores minus Time 2 scores). With only two levels, there is just one set of difference scores, so there is nothing to compare. In SPSS, Mauchly's W is reported as 1.000 and the sig as “.” when there are only two levels of the within-subjects factor.
I'll make a couple of important points regarding these corrections (based on recommendations
of Greenhouse & Geisser): 1) If the F is nonsignificant, do not worry about the corrections, because
the corrections will only increase the p-values. 2) If the F is significant using all three
approaches, do not worry about the corrections. That is, if the most conservative approach is
still significant, there is no increased risk of Type I error. If the various correction tests lead to
different conclusions, there is no one perfect solution. I generally use the Huynh-Feldt
correction, because it addresses the Type I error problem and is more powerful than the "lower
bound" correction. It does not perform perfectly, however, and it might be safest to report the
results from all correction approaches when they do not show the same result. With large
sample sizes, a small departure from sphericity might be significant, but the correction in these
situations should be relatively minor, and there is not likely to be a difference in the conclusions
drawn from the significance tests using the various corrections.
There are two other possible solutions: 1) transform the dependent variable scores using a
square root transformation or raise the score to a fractional power (e.g., .33), or 2) use a
multivariate analysis of variance (MANOVA) test. For the transformation approach, you may
have to try different transformations until the sphericity problem is resolved. The transformation
approach may be quite helpful in resolving the problem, but the researcher will have more
difficulty interpreting the results. Under some conditions, the MANOVA approach can be more
powerful than the within-subjects ANOVA, and the MANOVA test does not assume sphericity.
The MANOVA test is printed on the GLM repeated measures output in SPSS. MANOVA
assumes that the data have a “multivariate” normal distribution—that the analysis variable is
jointly normally distributed when all levels are considered together. Algina and Keselman (1997) suggest that for few levels and large sample sizes, the MANOVA approach may be more powerful. Their guidelines are to use MANOVA if a) the number of levels is 4 or fewer (a ≤ 4) and n is greater than the number of levels plus 15 (n > a + 15); or b) the number of levels is between 5 and 8 (5 ≤ a ≤ 8) and n is greater than the number of levels plus 30 (n > a + 30).
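These guidelines are easy to encode; a small Python helper (hypothetical, simply restating the rule above) might look like this:

    def prefer_manova(a, n):
        """Rough decision rule following Algina and Keselman (1997):
        True when MANOVA is expected to be the more powerful choice
        for a within-subjects factor with `a` levels and `n` subjects."""
        if a <= 4:
            return n > a + 15
        if 5 <= a <= 8:
            return n > a + 30
        return False  # no guideline stated for a > 8

    # The example below: a = 3 levels, n = 12 subjects
    print(prefer_manova(3, 12))  # False: too few cases for MANOVA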
Example
Below is an example of a within-subjects ANOVA with three levels of the independent variable, which shows the sphericity assumption test and the corrections. This hypothetical study compares performance on a vocabulary test after different lecture topics. Each student hears each lecture topic and takes a vocabulary test afterward. Notice that the adjustments are not made to the calculated F-values. Instead, the corrections are made to the critical values (not shown) by adjusting the degrees of freedom, so the only difference you may see is in the “Sig” values (p-values). The biggest correction to the df is the Lower-bound, followed by the Greenhouse-Geisser, and then the Huynh-Feldt. Thus, the Lower-bound will have the largest p-value (the most conservative significance test), the Greenhouse-Geisser will have an intermediate p-value, and the Huynh-Feldt will have the smallest p-value (the most liberal significance test of the corrections).
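The df adjustment itself is simple: each correction multiplies both the numerator and denominator df by its epsilon. The Python sketch below uses the F value from the write-up at the end of this handout; the .85 is an invented, purely illustrative Greenhouse-Geisser epsilon, not the value from the actual output.

    from scipy.stats import f

    a, s = 3, 12                  # levels and subjects in the example
    F_obs = 12.305                # F reported in the write-up below

    # A smaller epsilon means a bigger correction and a larger p-value.
    for label, eps in [("no correction", 1.0),
                       ("hypothetical Greenhouse-Geisser", 0.85),
                       ("lower bound", 1.0 / (a - 1))]:
        df1 = eps * (a - 1)
        df2 = eps * (a - 1) * (s - 1)
        print(f"{label}: df = ({df1:.2f}, {df2:.2f}), "
              f"p = {f.sf(F_obs, df1, df2):.4f}")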
The following is the MANOVA portion of the output automatically generated by the GLM
repeated measures test.
There are a = 3 levels and 12 cases, so n is not greater than a + 15 in this example, and I would not recommend the use of the MANOVA test. If MANOVA were appropriate, there are four different tests presented by SPSS, and with a large sample size they will all tend to show the same results. Roy's largest root is the most liberal of the tests, and Pillai's trace is said to be the most robust to variance differences with small samples (Olson, 1979). Wilks' lambda seems to be the most commonly reported for some reason.
Below I ran a reliability analysis (Analyze → Scale → Reliability Analysis), clicked the Statistics button, and checked Tukey's test of additivity. The reliability analysis is usually used to assess the internal reliability of a scale and obtain Cronbach's alpha, but we can use it here for the nonadditivity test.
The row of the table labeled “Nonadditivity” gives the significance test for a violation of this assumption, F(1,24) = 2.027, ns, which suggests no evidence of an A x S interaction. The footnote contains the recommended transformation of the data that could be used if the nonadditivity test were significant.
Write-up Example
A within-subjects ANOVA was used to compare vocabulary scores in the three lecture conditions. The average vocabulary score was highest for the first topic (M1 = 40.00), followed by the third topic (M3 = 34.50) and then the second topic (M2 = 26.00) [means not shown in the above output excerpts]. The results indicated that there was a significant difference among the three conditions, F(2,22) = 12.305, p < .05, for all sphericity assumption correction tests.
References
Algina, J., & Keselman, H. J. (1997). Detecting repeated measures effects with univariate and multivariate statistics. Psychological Methods, 2, 208-218.
Keselman, H. J., Rogan, J. C., Mendoza, J. L., & Breen, L. J. (1980). Testing the validity conditions of repeated measures F tests. Psychological Bulletin, 87, 479-481.