
Significantly Insignificant F Tests
Ronald CHRISTENSEN
P values near 1 are sometimes viewed as unimportant. In fact, P
values near 1 should raise red flags cautioning data analysts that
something may be wrong with their model. This article examines
reasons why F statistics might get small in general linear models. One-way and two-way analysis of variance models are used
to illustrate the general ideas. The article focuses on the intuitive
motivation behind F tests based on second moment arguments.
In particular, it argues that when the mean structure of the model
being tested is correct, small F statistics can be caused by not
accounting for negatively correlated data or heteroscedasticity;
alternatively, they can be caused by an unsuspected lack of fit. It
is also demonstrated that large F statistics can be generated by
not accounting for positively correlated data or heteroscedasticity.
KEY WORDS: Analysis of variance; Correlation; Heteroscedasticity; Homoscedasticity; Linear models; Regression;
Robustness.
1. INTRODUCTION
This article looks at statistical reasons why F statistics might
get small in general linear models. The discussion targets results for general models, using familiar models to illustrate the
ideas. It focuses on the intuitive motivation behind F tests using
second-moment arguments. In particular, we argue that when
the mean structure of the model being tested is correct, small F
statistics can be caused by not accounting for negatively correlated data or heteroscedasticity. Alternatively, small F statistics
can be caused by an unsuspected lack of fit. We also demonstrate that large F statistics can be generated by not accounting
for positively correlated data or heteroscedasticity even when
the mean structure of the model being tested is correct.
Example: Schneider and Pruett (1994) discussed control
chart data obtained “to monitor the production of injection
molded bottles. Experience has shown that the outside diameter
of the bottle is an appropriate measure of process performance.”
A molding machine makes four bottles simultaneously, so rational subgroups of size four were used. Data were presented for 20
groups. The one-way ANOVA test that the group means are all
equal gives F = 0.45 with 19 and 60 degrees of freedom for a
P value of 0.970. The F statistic is suspiciously small. Standard
residual plots look fine.
Although the small F value suggests that something is wrong,
it does not provide any specific suggestions as to the problem. In
this example, the solution is easy to find. The four observations
in each group were being made by different molding heads that
were assumed to be performing identically. Cross-classifying the
data and running a two-way ANOVA, the test for groups gives
F = 1.34 while the test for heads gives F = 40.03. Note that if
there had been substantial differences between the groups, we
would not have had a significantly small F statistic in the one-way ANOVA to suggest the existence of a problem. Nonetheless,
whenever we do see a significantly small F statistic, it should
give us pause.
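For readers who want to quantify how unusual a small F statistic is, the following is a minimal sketch (my own illustration, not part of the original analysis); it simply computes the left-tail probability of the F value from the bottle example under the null model with normal errors.

```python
# A minimal sketch (not from the original analysis) of how to flag a
# "significantly insignificant" F statistic: compute the chance of an F value
# this small or smaller under the null model with normal errors.
from scipy import stats

f_obs, df_num, df_den = 0.45, 19, 60            # values from the bottle example
left_tail = stats.f.cdf(f_obs, df_num, df_den)  # P(F <= 0.45) under the reduced model
print(f"P(F <= {f_obs:.2f}) = {left_tail:.3f}") # about 0.03, matching the P value of 0.970 for P(F >= 0.45)
```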
There exists a substantial literature on the robustness of t
and F statistics. One noteworthy historical discussion is Scheffé
(1959, chap. 10). Recent examples include Kiviet (1980), Benjamini (1983), Westfall (1988), Posten (1992), and Baksalary,
Nurhonen, and Puntanen (1992). Robustness studies typically
involve examining the effect of nonnormality, correlation, or
heteroscedasticity on the Type I error rate of the test. For example, Scariano and Davenport (1987) examined the effect of
correlation on the normal theory one-way ANOVA F test by examining a specific family of covariance matrices as an alternative
to the usual homoscedastic, independence model. The argument
here is similar but distinct. We argue that an unusual F statistic, in this case a small one, suggests problems with the (null) model, and we examine models that can generate small F
statistics. Corresponding Type I error rates do not look at how
often F is small, but how infrequently it is large.
Section 2 examines first moment lack-of-fit issues. Section
3 provides simple inequalities involving the covariance matrix
that determine the behavior of the F statistic. In particular, for
a rank r full model yᵢ = xᵢβ + εᵢ and a rank r₀ reduced model yᵢ = x₀ᵢγ + εᵢ, we will see that if the reduced model is true but

    Σᵢ₌₁ⁿ var(xᵢβ̂)/r < Σᵢ₌₁ⁿ var(yᵢ)/n,        (1)

and

    Σᵢ₌₁ⁿ var(xᵢβ̂ − x₀ᵢγ̂)/(r − r₀) < Σᵢ₌₁ⁿ var(yᵢ)/n,        (2)

the F statistic for testing the models tends to get small. Conversely, if the inequalities are reversed, the F statistic gets large.
The entire discussion is based on first and second moments.
Normality is not assumed.
The presentation of testing in general linear models follows ideas and notations in Christensen (2002, sec. 3.2). In particular, for any matrix A, let r(A) denote the rank of A, C(A) denote the column space of A, and M_A denote the perpendicular projection operator (ppo) onto C(A), that is, M_A = A(A′A)⁻A′, where (A′A)⁻ denotes a generalized inverse. Additionally, for matrices X, X₀, denote their ppos M and M₀, respectively. C(X) ⊥ C(A) indicates that the column spaces are orthogonal. Write C(X)⊥ for the orthogonal complement of C(X) and, if C(X₀) ⊂ C(X), write C(X₀)⊥_C(X) for the orthogonal complement of C(X₀) with respect to C(X). To derive
a test, start by assuming that a full model Y = Xβ + e,
E(e) = 0, cov(e) = σ 2 I fits the data. Here Y is an n × 1
vector of observable random variables, X is an n × p matrix
of fixed observed numbers, β is a p × 1 vector of unknown
fixed parameters, and e is an n × 1 vector of unobservable
random errors. We then test the adequacy of a reduced model
Y = X0 γ + e which has the same assumptions about e but in
which C(X0 ) ⊂ C(X). Based on second-moment arguments,
the test statistic is a ratio of variance estimates. We construct
an unbiased estimate of σ², Y′(I − M)Y/r(I − M), and another statistic Y′(M − M₀)Y/r(M − M₀) that has E[Y′(M − M₀)Y/r(M − M₀)] = σ² + β′X′(M − M₀)Xβ/r(M − M₀). Given the covariance structure, this second statistic is an unbiased estimate of σ² if and only if the reduced model is correct.
The test statistic

    F = [Y′(M − M₀)Y/r(M − M₀)] / [Y′(I − M)Y/r(I − M)]        (3)

is a (biased) estimate of

    [σ² + β′X′(M − M₀)Xβ/r(M − M₀)] / σ² = 1 + β′X′(M − M₀)Xβ/[σ² r(M − M₀)].

Under the null hypothesis, F is an estimate of the number 1. Values of F much larger than 1 suggest that F is estimating something larger than 1, which suggests that β′X′(M − M₀)Xβ/[σ² r(M − M₀)] > 0, which occurs if and only if the
reduced model is false. The usual normality assumption leads
to an exact central F distribution for the test statistic under the
null hypothesis, so we are able to quantify how unusual it is to
observe any F statistic greater than 1.
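The statistic in (3) is easy to compute directly from projection operators. The short Python sketch below is mine (the helper names ppo and f_stat are illustrative, not from the article) and assumes C(X₀) ⊂ C(X).

```python
# A sketch of the F statistic in (3) built from perpendicular projection
# operators; the helper names are illustrative, not from the article.
import numpy as np

def ppo(A):
    """Perpendicular projection operator onto C(A): M_A = A (A'A)^- A'."""
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def f_stat(Y, X, X0):
    """F statistic (3) for testing E(Y) = X0 gamma against E(Y) = X beta,
    assuming C(X0) is contained in C(X)."""
    M, M0 = ppo(X), ppo(X0)
    n = len(Y)
    r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)
    num = Y @ (M - M0) @ Y / (r - r0)        # estimates sigma^2 iff the reduced model holds
    den = Y @ (np.eye(n) - M) @ Y / (n - r)  # estimates sigma^2 under the full model
    return num / den
```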
If, when rejecting a test, you want to reasonably conclude that
the alternative is true, you have to have a reasonable belief in
the validity of all the assumptions of the model other than the
null hypothesis. Thus, in a linear model, you need to check independence, homoscedasticity, and normality. There are standard
methods based on residual analysis for checking homoscedasticity and normality. If the data are taken in time sequence, there
are standard methods based on residual analysis for checking
independence. More generally, Christensen and Bedrick (1997)
discussed methods for testing the independence assumption. Ultimately, there is no substitute for thinking hard about the data
and how they were collected. If all of the assumptions check
out, then one can conclude that a large F statistic indicates that
the full model is needed. This article illustrates two things: (1)
that if these assumptions are violated, there are perfectly plausible explanations for getting a large F statistic other than that
the full model is more appropriate than the reduced model and,
more importantly, (2) that if you get a significantly small F
statistic, something may be wrong with your assumptions, regardless of whether you performed standard diagnostic checks.
In other words, large P values should not be ignored; they
should be used as a diagnostic that suggests problems with the
model.
2. LACK OF FIT AND SMALL F STATISTICS
In testing Y = Xβ + e against Y = X0 γ + e, C(X0 ) ⊂
C(X), large F statistics (3) are generally taken to indicate that
the reduced model suffers from a lack of fit, that is, the more
general mean structure of the full model is needed. However,
small F statistics can occur if the reduced model’s lack of fit
exists in the error space of Y = Xβ + e.
Suppose that the correct model for the data has the form

    Y = X₀γ + Wδ + e,    C(W) ⊥ C(X₀),        (4)

where assuming C(W) ⊥ C(X₀) creates no loss of generality. Rarely will anyone tell us the true matrix W. The behavior of F depends on the vector Wδ.

1. The F statistic estimates 1 if the model Y = X₀γ + e is correct, that is, Wδ = 0.

2. The F statistic estimates something greater than 1 if the full model Y = Xβ + e is correct and needed, that is, if 0 ≠ Wδ ∈ C(X₀)⊥_C(X).

3. F estimates something less than 1 if 0 ≠ Wδ ∈ C(X)⊥, that is, if Wδ is in the error space of the full model. Then the numerator estimates σ² but the denominator estimates

    σ² + δ′W′(I − M)Wδ/r(I − M) = σ² + δ′W′Wδ/r(I − M).

4. If Wδ is in neither C(X₀)⊥_C(X) nor C(X)⊥, it is not clear how F will behave because neither the numerator nor the denominator estimates σ². (Under normality, F is doubly noncentral.)
Christensen (1989, 1991) contains related discussion of these
concepts. We now illustrate the ideas.
Example: Simple Linear Regression. The full model is yi =
β0 + β1 xi + εi and we test the reduced model yi = β0 + εi . If
the true model is yi = β0 + β1 xi + δ(xi − x̄· )2 + εi , Case 4
applies and neither the numerator nor the denominator estimates σ². However, if the xᵢ’s are equally spaced, say, xᵢ = a + bi and
β1 = 0, we are in Case 3 so that unless δ is close to 0, the true
model causes the F statistic to be close to 0 and the P value to be
close to 1. The term δ(xi − x̄· )2 only determines Wδ indirectly
but Wδ is easily seen to be in the error space of the simple linear regression model because Σᵢ₌₁ⁿ (xᵢ − x̄·)(xᵢ − x̄·)² = 0.
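A small simulation sketch of this Case 3 situation follows; the sample size, δ, and σ are illustrative choices of mine, and the f_stat helper simply repeats the construction in (3).

```python
# Simulation sketch of Case 3: equally spaced x's, beta1 = 0, and a quadratic
# term that lies in the error space of the simple linear regression model.
# All settings (n, delta, sigma, reps) are illustrative choices.
import numpy as np
rng = np.random.default_rng(0)

n, delta, sigma, reps = 20, 0.1, 1.0, 2000
x = np.arange(1.0, n + 1)                       # equally spaced x_i
X  = np.column_stack([np.ones(n), x])           # full model: simple linear regression
X0 = np.ones((n, 1))                            # reduced model: intercept only

def f_stat(Y, X, X0):
    ppo = lambda A: A @ np.linalg.pinv(A.T @ A) @ A.T
    M, M0 = ppo(X), ppo(X0)
    r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)
    num = Y @ (M - M0) @ Y / (r - r0)
    den = Y @ (np.eye(len(Y)) - M) @ Y / (len(Y) - r)
    return num / den

mean = 2.0 + delta * (x - x.mean()) ** 2        # beta1 = 0; lack of fit sits in the error space
fs = [f_stat(mean + rng.normal(0, sigma, n), X, X0) for _ in range(reps)]
print(np.mean(fs))                              # typically well below 1
```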
Lack of fit testing for a general regression model Y = Xβ+e
involves testing the model against a constructed full model, say,
Y = X∗β∗ + e with C(X) ⊂ C(X∗) that generalizes the mean structure of Y = Xβ + e; see Christensen (2002, sec.
6.6). However, if lack of fit exists because of features that are
not part of the original model, generalizing Y = Xβ + e may
be inappropriate.
Example: Traditional lack of fit test. For simple linear regression this begins with the replication model yij = β0 +
β1 xi + eij , i = 1, . . . , a, j = 1, . . . , Ni . It then assumes
E(yij ) = µ(xi ) for some function µ(·) which leads to constructing the full model yij = µi + eij , a one-way ANOVA, and
thus to the traditional lack of fit test. The test statistic is
    F = [SSTrts − SS(lin)] / [(a − 2) MSE],        (5)
where, from the one-way ANOVA model, SSTrts is the sum
of squares for treatments, SS(lin) is the sum of squares for the
linear contrast, and MSE is the mean squared error. If there is
no lack of fit in the reduced model, F should be near 1 (Case 1 in the list). If lack of fit exists in the replicated simple linear regression
model because the more general mean structure of the one-way
ANOVA fits the data better, then the F statistic tends to be larger
than 1 (Case 2 in the list).
Suppose that the simple linear regression model is balanced,
that is, all Ni = N , that for each i the data are taken in time
order t1 < t2 < · · · < tN , and that the lack of fit is due to the
true model being
    yᵢⱼ = β₀ + β₁xᵢ + δtⱼ + eᵢⱼ,    δ ≠ 0.        (6)
Thus, depending on the sign of δ, the observations within each
group are subject to an increasing or decreasing trend. In this
model, for fixed i the E(yij )s are not the same for all j, thus invalidating the assumption of the traditional test. In fact, this form
of lack of fit puts us in Case 3 of our list because Wδ becomes a
vector with elements δ(tⱼ − t̄), which is orthogonal to the vector of xᵢ’s because Σᵢⱼ xᵢ(tⱼ − t̄) = 0. This causes the traditional
lack-of-fit test to have a small F statistic. Another way to see this
is to view the problem in terms of a balanced two-way ANOVA.
The true model (6) is a special case of the two-way ANOVA
model yij = µ + αi + ηj + eij in which the only nonzero contrasts are the linear contrast in the αi ’s and the linear contrast in
the ηj ’s. Under model (6), the numerator of the statistic (5) gives
an unbiased estimate of σ² because SSTrts in (5) is SS(α) for
the two-way model and the only nonzero α effect is being eliminated from the treatments. However, the mean squared error in
the denominator of (5) is a weighted average of the error mean
square from the two-way model and the mean square for the ηj ’s
in the two-way model. The sum of squares for the significant linear contrast in the ηj ’s from model (6) is included in the error
term of the lack-of-fit test (5), thus biasing the error term to estimate something larger than σ². In particular, the denominator has an expected value of σ² + δ²a Σⱼ₌₁ᴺ (tⱼ − t̄·)²/[a(N − 1)]. If the appropriate model is (6), the statistic in (5) estimates σ²/{σ² + δ²a Σⱼ₌₁ᴺ (tⱼ − t̄·)²/[a(N − 1)]}, which is a number
less than 1. Thus, a lack of fit that exists within the groups of the
one-way ANOVA can cause values of F much smaller than 1.
Note that in this balanced case, true models involving interaction
terms, for example, models like
    yᵢⱼ = β₀ + β₁xᵢ + δtⱼ + γxᵢtⱼ + eᵢⱼ

also fall into Case 3 and tend to make the F statistic small if either δ ≠ 0 or γ ≠ 0. Finally, if there exists lack of fit both
between the groups of observations and within the groups, we
are in Case 4 and it can be very difficult to identify. For example,
if β₂ ≠ 0 and either δ ≠ 0 or γ ≠ 0 in the true model

    yᵢⱼ = β₀ + β₁xᵢ + β₂xᵢ² + δtⱼ + γxᵢtⱼ + eᵢⱼ,
there is both a traditional lack of fit between the groups (the
significant β₂xᵢ² term) and lack of fit within the groups (δtⱼ + γxᵢtⱼ). In this case, neither the numerator nor the denominator in (5) is an estimate of σ².
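The within-group time trend is easy to reproduce numerically. The sketch below is my own illustration (the settings a, N, δ, and σ are arbitrary choices); it computes the traditional lack-of-fit statistic (5) under model (6) and shows F values sitting well below 1.

```python
# Simulation sketch of model (6): a time trend within each group of the
# replicated simple linear regression. Settings (a, N, delta, sigma) are
# illustrative, not from the article.
import numpy as np
rng = np.random.default_rng(1)

a, N, delta, sigma, reps = 5, 4, 1.0, 1.0, 2000
x = np.repeat(np.arange(1.0, a + 1), N)         # x_i, replicated N times per group
t = np.tile(np.arange(1.0, N + 1), a)           # time order t_j within each group
groups = np.repeat(np.arange(a), N)

ppo = lambda A: A @ np.linalg.pinv(A.T @ A) @ A.T
X_lin = np.column_stack([np.ones(a * N), x])    # simple linear regression (reduced)
X_ow  = np.column_stack([(groups == i).astype(float) for i in range(a)])  # one-way ANOVA (full)
M, M0 = ppo(X_ow), ppo(X_lin)

def lack_of_fit_F(Y):
    num = Y @ (M - M0) @ Y / (a - 2)            # [SSTrts - SS(lin)] / (a - 2)
    den = Y @ (np.eye(a * N) - M) @ Y / (a * (N - 1))  # MSE from the one-way ANOVA
    return num / den

mean = 2.0 + 0.5 * x + delta * t                # model (6)
fs = [lack_of_fit_F(mean + rng.normal(0, sigma, a * N)) for _ in range(reps)]
print(np.mean(fs))                              # tends to be well below 1
```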
Graphing such data is unlikely to show these kinds of problems. In the one-way ANOVA example, it would be possible to
plot the data against time in such a way that the relationship with
time is visible, but if you thought there might be a time trend you
would not do a one-way ANOVA in the first place. If you thought
there might be a time trend, you would fit a two-way ANOVA with time as a factor and you would see the important
time effect. If you did not think to put time in the ANOVA, it is
unlikely that you would think of using time in a graphical procedure. The point is that when you get a small F statistic, you
need to try to figure out what is causing it. You need to look for
things like possible time trends even though you initially were
convinced that they would not occur.
The illustration involving a linear trend within treatment
groups is closely related to a more general phenomenon that happens all too frequently: analyzing a randomized complete block
design as if it were a completely randomized design (CRD). If
there are no treatment effects, the numerator in the CRD analysis
estimates σ², but the denominator estimates σ² plus a positive
quantity due to nonzero block effects, so the F statistic tends
to be small. More generally, if the treatment effects are nonzero
but small relative to the block effects, the F statistic still tends
to be small, even though neither the numerator nor the denominator of the statistic is estimating σ². Specifically, in this case
the expected value of the numerator of the F statistic is still
much smaller than the expected value of the denominator of the
F statistic, thus causing F to be small.
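A compact simulation sketch of this block/CRD phenomenon is given below; the block standard deviation and layout are illustrative choices of mine.

```python
# Simulation sketch: a randomized complete block design analyzed as a CRD.
# With no treatment effects, block-to-block variation inflates the CRD error
# term and pushes F below 1. All settings are illustrative.
import numpy as np
from scipy import stats
rng = np.random.default_rng(2)

trts, blocks, reps = 4, 6, 2000
block_eff = rng.normal(0, 2.0, blocks)          # nonzero block effects
fs = []
for _ in range(reps):
    y = block_eff[None, :] + rng.normal(0, 1.0, (trts, blocks))  # no treatment effects
    fs.append(stats.f_oneway(*y).statistic)     # one-way (CRD) F test across treatments
print(np.mean(fs))                              # tends to be well below 1
```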
Similar things happen in other situations. Ignoring a split plot
design and analyzing it as a two-way ANOVA leads to using
only one error term obtained by pooling the whole plot and subplot errors. Typically, whole plot effects look more significant
than they should while subplot and interaction effects look less
significant. With fractional factorials, high order interactions are
often assumed to be 0. If they are not 0, F statistics become inappropriately small.
Another possible explanation for a small F statistic is the
existence of “negative correlation” in the data.
3. THE EFFECT OF CORRELATION AND HETEROSCEDASTICITY ON F STATISTICS
The test of the reduced model is no better than the assumptions
made about the full model. The mean structure of the reduced
model may be perfectly valid, but the F statistic can become
small or large because the assumed covariance structure is incorrect. We now examine the F statistic (3) when the true model
is Y = X₀γ + e, E(e) = 0, cov(e) = σ²V. We refer to the condition

    tr[MV] < (1/n) tr(V) tr[M] = [r(X)/n] tr(V)        (7)

as having negative correlation in the full model. The condition

    tr([M − M₀]V) < (1/n) tr(V) tr[M − M₀] = {[r(X) − r(X₀)]/n} tr(V)        (8)
is referred to as having negative correlation for distinguishing
between models. If the inequalities are reversed, we refer to
positive correlations. Note that when V = I, equality holds in
both conditions. In spite of the terminology, the inequalities can
be caused by heteroscedasticity as well as negative correlations.
Together, (7) and (8) cause small F statistics even when the mean
structure of the reduced model is true. Reversing the inequalities
causes large F statistics.
To establish these claims we use a standard result on the
expected value of a quadratic form, for example, Christensen
(2002, theorem 1.3.1). The numerator of the F statistic (3) estimates
    E{Y′[M − M₀]Y/r(M − M₀)} = σ² tr{[M − M₀]V}/r(M − M₀) < σ² tr(V)/n,

and the denominator of the F statistic estimates

    E{Y′(I − M)Y/r(I − M)} = σ² tr{[I − M]V}/r(I − M)
        = σ² (tr{V} − tr{[M]V})/r(I − M)
        > σ² [tr{V} − (1/n) tr(M) tr(V)]/r(I − M)
        = σ² [(n − r(M))/n] tr(V)/r(I − M) = σ² tr(V)/n.
F in (3) estimates

    E{Y′[M − M₀]Y/r(M − M₀)} / E{Y′(I − M)Y/r(I − M)}
        = [tr{[M − M₀]V}/r(M − M₀)] / [tr{[I − M]V}/r(I − M)] < [tr(V)/n] / [tr(V)/n] = 1,
so having both negative correlation in the full model and having
negative correlation for evaluating differences between models
tends to make F statistics small. Exactly analogous computations show that having both positive correlation in the full model
and having positive correlation for evaluating differences between models tends to make F statistics greater than 1.
The general condition (7) is equivalent to (1), so, under homoscedasticity, negative correlation in the full model amounts to
having an average variance for the fitted values (averaging over
the rank of the covariance matrix of the fitted values) that is less
than the common variance of the observations. Similarly, having
negative correlation for distinguishing the full model from the
reduced model leads to (2).
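For a concrete check of (7) and (8) in any particular problem, one can simply compare the traces. The helper below is my own sketch (the function name correlation_conditions is not from the article); it returns both comparisons for given X, X₀, and V, and it can be used to reproduce the one-way ANOVA examples later in this section.

```python
# Sketch of a checker for the trace conditions (7) and (8); the helper name is
# illustrative, not from the article.
import numpy as np

def ppo(A):
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def correlation_conditions(X, X0, V):
    """Return (tr(MV), benchmark) for condition (7) and
    (tr((M - M0)V), benchmark) for condition (8); a left value below its
    benchmark corresponds to 'negative correlation'."""
    n = X.shape[0]
    M, M0 = ppo(X), ppo(X0)
    r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)
    full = (np.trace(M @ V), (r / n) * np.trace(V))
    diff = (np.trace((M - M0) @ V), ((r - r0) / n) * np.trace(V))
    return full, diff
```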
Example: General Mixed Models. For mixed models, results similar to those for lack of fit hold. Suppose that the correct
model is Y = X₀γ + Wδ + e, E(e) = 0, cov(e) = σ²I, where δ is random with E(δ) = 0 and cov(δ) = D, so cov(Y) = σ²I + WDW′. If C(W) ⊂ C(X)⊥, conditions (7) and (8) hold, so F tends to be small. If C(W) ⊂ C(X₀)⊥_C(X), the reverse of (7) and (8) hold, causing large F statistics.
Example: Scariano and Davenport (1987) mentioned the
possibility of generalizing their work on F test robustness beyond one-way ANOVA. To study lack of independence, their
class of alternative covariance matrices can be generalized to
the family determined by
V = c1 M0 + c2 (M − M0 ) + c3 (I − M ),
where the ci ’s are positive. Scariano and Davenport (1987) presented normal theory size and power computations in one-way
ANOVA for two cases with homoscedastic, equicorrelation covariance structures. These determine specific values of M , M0 ,
and the ci ’s. In particular, their Cases 1 and 2 are specific instances with c1 = c2 > c3 and c1 = c3 > c2 , respectively.
Additionally, when c1 = c2 < c3 , the covariance structure is
similar to that of the previous example with C(W) ⊂ C(X)⊥
and, when c1 = c3 < c2 , the covariance structure is similar to
that with C(W) ⊂ C(X₀)⊥_C(X). It is easy to check that conditions (7) and (8) hold by taking c1 = c2 < c3 or c1 = c3 > c2. Similarly, by taking c1 = c3 < c2 or c1 = c2 > c3, the reverse
of conditions (7) and (8) hold. The resulting second moment
based qualitative assessments of how F will behave agree with
Scariano and Davenport’s explicit normal theory computations.
Although this alternative family is well suited for studying lack
of independence, it does not lend itself to studying independent
observations with unequal variances.
Example: Homoscedastic balanced one-way ANOVA. Consider both a full model yij = µi + eij , i = 1, . . . , a, j =
1, . . . , N , and n ≡ aN , which we write Y = Zγ + e where
Z is a matrix of 0’s and 1’s indicating group inclusion, and a
reduced model yij = µ + eij , which in matrix terms we write
Y = Jµ + e, where J is an n × 1 vector of 1’s. In matrix terms
the usual one-way ANOVA F statistic is
    F = {Y′[M_Z − (1/n)JJ′]Y/(a − 1)} / {Y′(I − M_Z)Y/[a(N − 1)]},        (9)

so M ≡ M_Z and M₀ ≡ (1/n)JJ′. We now assume that the true model is Y = Jμ + e, E(e) = 0, cov(e) = σ²V and examine
the behavior of the F statistic.
From condition (7), negative correlation in the full model is
characterized by
    (N/σ²) Σᵢ₌₁ᵃ var(ȳᵢ·) = tr[M_Z V] < (1/n) tr(V) tr[M_Z] = (a/n) tr(V).        (10)

This can also be written as Σᵢ₌₁ᵃ var(ȳᵢ·)/a < σ²/N, that is, the average variance of the group means is less than σ²/N. Positive correlation in the full model is characterized by the reverse inequality.
From condition (8), negative correlation for evaluating differences between models reduces to
    (N/σ²) Σᵢ₌₁ᵃ var(ȳᵢ· − ȳ··) = tr([M_Z − (1/n)JJ′]V) < (1/n) tr(V) tr[M_Z − (1/n)JJ′] = [(a − 1)/n] tr(V).        (11)

This condition can also be written

    Σᵢ₌₁ᵃ var(ȳᵢ· − ȳ··)/a < [(a − 1)/a] (σ²/N).
Positive correlation for evaluating differences between models
is characterized by the reverse inequality.
If all the observations in different groups are uncorrelated, they have a block diagonal covariance matrix σ²V, in which case condition (11) holds if and only if condition (10) holds. This follows because tr(M_Z V) = tr[(1/N)Z′VZ] = a tr[(1/n)J′VJ] = a tr[(1/n)JJ′V].
To illustrate these ideas for homoscedastic balanced one-way
ANOVA, we examine some simple examples with a = 2, N =
2. The first two observations are a group and the last two are a
group.
First, consider a covariance structure

    V1 = [  1    .9    .1   .09 ]
         [ .9     1   .09    .1 ]
         [ .1   .09     1    .9 ]
         [ .09   .1    .9     1 ].
Intuitively, there is high positive correlation between the two observations in each group, and weak positive correlation between
the groups. Condition (10) becomes
    3.8 = 2(1/2)[3.8] = tr[M_Z V1] > (a/n) tr(V1) = (2/4)4 = 2,
so there is positive correlation within groups, and (11) becomes
    1.71 = 3.8 − 2.09 = tr([M_Z − (1/n)JJ′]V1) > [(a − 1)/n] tr(V1) = (1/4)4 = 1,
so there is positive correlation for evaluating differences between
groups.
Now consider

    V2 = [  1    .1    .9   .09 ]
         [ .1     1   .09    .9 ]
         [ .9   .09     1    .1 ]
         [ .09   .9    .1     1 ].
Intuitively, this has weak positive correlation between the two
observations in each group, with high positive correlation between some observations in different groups. Now
    2.2 = 2(1/2)[2.2] = tr[M_Z V2] > (a/n) tr(V2) = (2/4)4 = 2,
so there is positive correlation within groups, but
    .11 = 2.2 − 2.09 = tr([M_Z − (1/n)JJ′]V2) < [(a − 1)/n] tr(V2) = (1/4)4 = 1,
so there is negative correlation for evaluating differences between groups.
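The arithmetic for V1 and V2 can be verified directly; the following self-contained check is mine (a verification sketch, not the article's code) and reproduces the numbers 3.8, 1.71, 2.2, and .11.

```python
# Numerical check of the V1 and V2 calculations for the balanced one-way
# ANOVA with a = 2 groups of N = 2 observations (a verification sketch).
import numpy as np

Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # group indicators
J = np.ones((4, 1))
ppo = lambda A: A @ np.linalg.pinv(A.T @ A) @ A.T
MZ, M0 = ppo(Z), ppo(J)

V1 = np.array([[1, .9, .1, .09], [.9, 1, .09, .1],
               [.1, .09, 1, .9], [.09, .1, .9, 1]])
V2 = np.array([[1, .1, .9, .09], [.1, 1, .09, .9],
               [.9, .09, 1, .1], [.09, .9, .1, 1]])

for name, V in [("V1", V1), ("V2", V2)]:
    print(name,
          round(np.trace(MZ @ V), 2),           # compare with (a/n) tr(V) = 2
          round(np.trace((MZ - M0) @ V), 2))    # compare with ((a-1)/n) tr(V) = 1
```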
Suppose the observations have the correlation structure of an
AR(1) process,

    V3 = [  1    ρ    ρ²   ρ³ ]
         [  ρ    1    ρ    ρ² ]
         [ ρ²    ρ    1    ρ  ]
         [ ρ³   ρ²    ρ    1  ].
When −1 < ρ < 0, we have negative correlation within groups
because
    2(1 + ρ) = tr[M_Z V3] < 2.
If 0 < ρ < 1, the inequalities are reversed. Similarly, for −1 <
ρ < 0 we have negative correlation for evaluating differences
between groups because
    1 + (ρ/2)(1 − 2ρ − ρ²) = tr([M_Z − (1/n)JJ′]V3) < 1.
However, we only get positive correlation for evaluating differences between groups when 0 < ρ < √2 − 1. Thus, for negative ρ we tend to get small F statistics, for 0 < ρ < √2 − 1 we tend to get large F statistics, and for √2 − 1 < ρ < 1 the result is not clear.
To illustrate what happens for large ρ, suppose ρ = 1 and the
observations all have the same mean; then, with probability one,
all the observations are equal and, in particular, ȳi· = ȳ·· with
probability one. It follows that
    0 = Σᵢ₌₁ᵃ var(ȳᵢ· − ȳ··)/a < [(a − 1)/a] (σ²/N),
and there is negative correlation for evaluating differences between groups. More generally, for very strong positive correlations, there will be positive correlation within groups but negative correlation for evaluating differences between groups. Note
also that if ρ = −1, it is not difficult to see that F = 0.
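The AR(1) conditions can also be traced numerically as functions of ρ. The sketch below is my own (the chosen ρ values are illustrative) and shows the between-group trace crossing its benchmark of 1 at ρ = √2 − 1.

```python
# Sketch tracing the AR(1) trace conditions as a function of rho for the
# a = 2, N = 2 layout; the chosen rho values are illustrative.
import numpy as np

Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
J = np.ones((4, 1))
ppo = lambda A: A @ np.linalg.pinv(A.T @ A) @ A.T
MZ, M0 = ppo(Z), ppo(J)

def ar1(rho, n=4):
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

for rho in [-0.5, 0.2, np.sqrt(2) - 1, 0.6]:
    V3 = ar1(rho)
    within  = np.trace(MZ @ V3)                 # compare with 2: within-group condition
    between = np.trace((MZ - M0) @ V3)          # compare with 1: between-group condition
    print(f"rho = {rho:+.3f}: tr(MZ V3) = {within:.3f}, tr((MZ - M0) V3) = {between:.3f}")
```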
Example: Heteroscedastic balanced one-way ANOVA. In
the heteroscedastic balanced one-way ANOVA with uncorrelated observations, V is diagonal. This generates equality between the left sides and right sides of (7) and (8), so under
heteroscedasticity F still estimates the number 1.
Example: Heteroscedastic unbalanced one-way ANOVA.
The covariance conditions (7) and (8) can be caused for uncorrelated observations by heteroscedasticity. For example, consider
the unbalanced one-way ANOVA F test. For concreteness, assume that var(yᵢⱼ) = σᵢ². As mentioned earlier, since the observations are uncorrelated, we need only check condition (10), which amounts to

    Σᵢ₌₁ᵃ σᵢ²/a < Σᵢ₌₁ᵃ (Nᵢ/n) σᵢ².
Thus, when the group means are equal, F statistics will get small
if many observations are taken in groups with large variances
and few observations are taken in groups with small variances.
F statistics will get large if the reverse relationship holds.
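A brief simulation sketch of this unbalanced heteroscedastic case follows; the group sizes and standard deviations are illustrative choices of mine.

```python
# Simulation sketch: unbalanced one-way ANOVA with equal group means, many
# observations in the high-variance group and few in the low-variance groups.
# Group sizes and standard deviations are illustrative.
import numpy as np
from scipy import stats
rng = np.random.default_rng(3)

Ns     = [30, 5, 5]              # many observations where the variance is large ...
sigmas = [3.0, 1.0, 1.0]         # ... and few where it is small
fs = [stats.f_oneway(*[rng.normal(0.0, s, n) for n, s in zip(Ns, sigmas)]).statistic
      for _ in range(2000)]
print(np.mean(fs))               # tends to be well below 1; reverse Ns to reverse the effect
```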
4. CONCLUSIONS
Christensen (1995) argued that testing should be viewed as
an exercise in examining whether or not the data are consistent
with a particular (predictive) model. While possible alternative
hypotheses may drive the choice of a test statistic, any unusual
values of the test statistic should be considered important. By this
standard, perhaps the only general way to decide which values
of the test statistic are unusual is to identify as unusual those
values that have small densities or mass functions as computed
under the model being tested.
The argument here is similar. The test statistic is driven by the
idea of testing a particular model, in this case the reduced model,
against an alternative that is here determined by the full model.
However, given the test statistic, any unusual values of that statistic should be recognized as indicating data that are inconsistent
with the model being tested. Values of F much larger than 1 are
inconsistent with the reduced model. Values of F much larger
than 1 are consistent with the full model but, as we have seen,
they are consistent with other models as well. Similarly, values
of F much smaller than 1 are also inconsistent with the reduced
model and we have examined models that can generate small
F statistics. Thus, significantly small F statistics should always
cause one to rethink the assumptions of the model. As always, F
values near 1 merely mean that the data are consistent with the
assumptions; they do not imply that the assumptions are valid.
[Received March 2001. Revised December 2001.]
REFERENCES
Baksalary, J. K., Nurhonen, M., and Puntanen, S. (1992), “Effect of Correlations and
Unequal Variances in Testing for Outliers in Linear Regression,” Scandinavian Journal of Statistics, 19, 91–95.
Benjamini, Y. (1983), “Is the t Test Really Conservative When the Parent Distribution is Long-Tailed?” Journal of the American Statistical Association, 78,
645–654.
Christensen, R. (1989), “Lack of Fit Tests Based on Near or Exact Replicates,”
The Annals of Statistics, 17, 673–683.
(1991), “Small Sample Characterizations of Near Replicate Lack of Fit
Tests,” Journal of the American Statistical Association, 86, 752–756.
(1995), Comment on Inman (1994), The American Statistician, 49, 400.
(2002), Plane Answers to Complex Questions: The Theory of Linear
Models (3rd ed.), New York: Springer-Verlag.
Christensen, R., and Bedrick, E. J. (1997), “Testing the Independence Assumption in Linear Models,” Journal of the American Statistical Association, 92,
1006–1016.
Kiviet, J. F. (1980), “Effects of ARMA Errors on Tests for Regression Coefficients: Comments on Vinod’s Article; Improved and Additional Results,”
Journal of the American Statistical Association, 75, 353–358.
Posten, H. O. (1992), “Robustness of the Two-Sample t Test Under Violations
of the Homogeneity of Variance Assumption, Part II,” Communications in
Statistics, A, 21, 2169–2184.
Scariano, S. M., and Davenport, J. M. (1987), “The Effects of Violations of Independence Assumptions in the One-Way ANOVA,”
The American Statistician, 41, 123–129.
Scheffé, H. (1959), The Analysis of Variance, New York: Wiley.
Schneider, H., and Pruett, J. M. (1994), “Control Charting Issues in the Process
Industries,” Quality Engineering, 6, 345–373.
Westfall, P. (1988), “Robustness and Power of Tests for a Null Variance Ratio,”
Biometrika, 75, 207–214.