JOURNAL OF CONSUMER PSYCHOLOGY, 10(1&2), 5-35
Copyright 2001, Lawrence Erlbaum Associates, Inc.

Iacobucci, Dawn (Ed.). (2001). Journal of Consumer Psychology's Special Issue on Methodological and Statistical Concerns of the Experimental Behavioral Researcher, 10(1&2), Mahwah, NJ: Lawrence Erlbaum Associates, 5-35.

Analysis of Variance
I.A. CAN I TEST FOR SIMPLE EFFECTS IN THE PRESENCE OF AN INSIGNIFICANT INTERACTION?
On more than a few occasions I have encountered the situation in which a hypothesized two-way interaction was not significant according to the analysis of variance (ANOVA), and yet simple effects tests yielded results consistent with the expected interaction (e.g., a significant simple effect in one condition but an insignificant simple effect in the other condition). What can account for this apparent inconsistency? Moreover, if an interaction is hypothesized, is it appropriate to proceed with the simple effects tests even though the interaction term fails to achieve significance in the ANOVA? I have seen this done in published articles and believe that it is appropriate when a priori expectations exist, but I have also heard others (including reviewers) argue that simple effects tests are appropriate only if the ANOVA interaction term is significant.
Editor: I see two elements in this question-namely, (a) Why might a simple effect be significant when the overall interaction was not?, and (b) Must the overall interaction be significant to conduct tests of the simple effects?
To answer the first question, consider the apparent interaction depicted in Figure 1. One's MSerror term may be large enough that the interaction would not be significant. The simple effect of Factor A at Level b = 1 may also be insignificant. However, a test of the simple effect of Factor A at Level b = 2 could be significant. In the assessment of the overall interaction, which is, after all, an aggregate of these components, the significant simple effect may be washed out by the insignificant one.
It would be a superiorly clean result if both simple effects were significant-for example, either a qualitative difference in that the slopes were going in opposite directions, or a quantitative difference with both slopes going in the same direction, but one being steeper than the other. Either of these situations captures what is meant intuitively by an "interaction" (that the effect of one factor is contingent on the other). When one slope, or difference between means, is significant, but the other is not, the researcher is left to conclude that one half of the data are interesting, and one half yield the philosophically and statistically troublesome null result.
Why is such a result, like that depicted in the figure, suboptimal? The analysis of "simple effects" breaks down sums of squares, not just those attributable to the A x B interaction, but also the A main effect for the simple effects of A at each level of B (or the B main effect for the simple effects of B at each level of A). Most researchers will test all four simple effect combinations-SSA@b1, SSA@b2, SSB@a1, SSB@a2-and then write up the two that illuminate the theorizing most clearly. Let us say we focus on the simple effects of Factor A at each level of Factor B. The sum of those simple effects' sums of squares, SSA@b1 + SSA@b2, will equal not SSAxB, but SSA + SSAxB (cf. Keppel, 1991, pp. 241-242). The interaction in the typical 2 x 2 design has a single degree of freedom. Each simple effect also has one numerator degree of freedom. Testing two simple effects then uses two degrees of freedom, one more than may be derived from the interaction. The second degree of freedom is borrowed from the main effect of Factor A-that is, the simple effects will in part reflect the main effects; the simple effects do not reflect only the interaction. Therefore, for example, an SSA@b1 effect is partly determined by the A x B interaction effect (which we want) and partly capitalizes on the A main effect in the presence or even absence of an interaction (which is not good, and mathematically the only thing we can do about that is to verify that the interaction itself is significant).
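To make the arithmetic concrete, here is a minimal numerical sketch (in Python, with invented data; the balanced 2 x 2 layout is an assumption) that computes the relevant sums of squares from cell means and confirms that the simple effects of A pool the main effect with the interaction, SSA@b1 + SSA@b2 = SSA + SSAxB.

```python
import numpy as np

n = 10                                    # participants per cell (hypothetical)
rng = np.random.default_rng(0)
# cells[(a, b)] holds the n scores for that condition; means are arbitrary
cells = {(a, b): rng.normal(loc=5 + 2*a + b + 1.5*a*b, size=n)
         for a in (0, 1) for b in (0, 1)}

m = {ab: scores.mean() for ab, scores in cells.items()}    # cell means
grand = np.mean(list(m.values()))
ma = {a: (m[(a, 0)] + m[(a, 1)]) / 2 for a in (0, 1)}      # A marginal means
mb = {b: (m[(0, b)] + m[(1, b)]) / 2 for b in (0, 1)}      # B marginal means

ss_a  = n * 2 * sum((ma[a] - grand) ** 2 for a in (0, 1))
ss_ab = n * sum((m[(a, b)] - ma[a] - mb[b] + grand) ** 2
                for a in (0, 1) for b in (0, 1))
ss_a_at_b = {b: n * sum((m[(a, b)] - mb[b]) ** 2 for a in (0, 1))
             for b in (0, 1)}                              # simple effects of A

print(ss_a_at_b[0] + ss_a_at_b[1])  # SSA@b1 + SSA@b2 ...
print(ss_a + ss_ab)                 # ... equals SSA + SSAxB (up to rounding)
```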
FIGURE 1  Hypothetical two-way interaction plot of means (lines for b = 1 and b = 2 plotted across a = 1 and a = 2).

An intuitive way to understand these relations is again by examining the plot. One pattern of data that is consistent with
the scenario under discussion would be if there was a strong main effect for Factor A. If there were, even in the absence of a significant interaction, the simple effect of A at b = 2 could be significant. As depicted, the a = 1 mean is lower than the a = 2 mean, both for the main effect (i.e., collapsing over b = 1 and b = 2), and specifically also for the b = 2 case, the focal simple effect.
If one's data looked like those in the figure, and one concluded that there was a significant simple effect of Factor A at Level b = 2, but not for b = 1, it is somewhat misleading to suggest that there is an element of interaction in one's data. The significant simple effect states that the top line's slope is significantly different from (greater than) zero. However, the insignificant interaction indicates that it is nevertheless not significantly different from the slope of the line beneath it.
Abstractly, the second part of the question addresses the appropriate relation between an omnibus, overall F test of a hypothesis and tests for follow-up comparison of means that comprise a part of that global hypothesis. Overall F tests in our typical experimental world consist of two kinds-those for main effects and those for interactions. The statistics texts that address this issue tend to present the argument in the simpler, main effects context. We begin with main effects, but subsequently will address the case of interactions. One may expect that the argument for the interaction is a simple extension of that for the main effect, but stay tuned, because it is not.
Within the main effects context, we can further distinguish two classes of questions: What to do when the factor has two levels, and what to do when the factor has three or more levels. Even after exposure to the most elemental ANOVA material, researchers know that factors with two levels are mathematically simple; if a 2 x 2 factorial yields significant main effects for Factors A or B, there is no further analytical work to be done to understand the nature of those main effects. The researcher knows the null hypothesis, H0: μ1 = μ2, is to be rejected, and taken together with the means on the two levels of the factor, the interpretation of such a finding is unambiguous. (The big mean is significantly larger than the small one.) Furthermore, if one were to conduct a "contrast" between the only two means available, the mean square for the contrast would equal exactly the mean square for the omnibus test of that main effect-the tests are identical and therefore redundant. Given that the extraordinarily vast majority of the experiments we conduct and report have factors with only two levels (usually 2 x 2s), this scenario is predominant, so the question of contrasts on main effects is usually moot. The situation is more complicated for a significant interaction in a 2 x 2 factorial because there are four means, the comparison of several pairs of which may be theoretically interesting. We turn to interactions shortly.
Consider now the situation of a factor with three or more levels. The null hypothesis tested is of the form, H0: μ1 = μ2 = ... = μk. Rejecting this null indicates that at least one mean, or a combination of means, is different from others, but follow-up contrasts are necessary to determine the nature of the mean differences (Scheffé, 1959, pp. 55, 66; Winer, Brown, & Michels, 1991, p. 140; cf. "A significant F allows, if not demands, a further analysis of the data," Keppel, 1991, p. 111; "An overall F test is often merely the first step in analyzing a set of data," Kirk, 1982, p. 90). Furthermore, we know that our follow-up analysis will be rewarding: "If an overall F test is significant, an experimenter can be certain that some set of orthogonal¹ contrasts contains at least one significant contrast among the means" (Kirk, 1982, p. 95; Scheffé, 1959, p. 71). Accordingly, we operate by the decision rule: If my F is significant, I must conduct contrasts to compare the means more precisely.
We also operate by the obverse rule: If my F is not significant, there is no point in conducting contrasts because I will not find any significant differences. For example, Keppel (1991) stated,

With a nonsignificant omnibus F, we are prepared to assert that there are no real differences among the treatment means and that the particular sample means we have observed show differences that are reasonably accounted for by experimental error. We will stop the analysis there. Why analyze any further when the differences can be presumed to be chance differences? (p. 111)
If there would seem to be a one-to-one relation between a significant overall F and the presence of at least one significant contrast (albeit sometimes not the one we want), how do we reconcile this intuition with statements that say we can do "planned" comparisons whether or not the overall F is significant? Consider these three exemplars: (a) "In some circumstances ... comparisons are undertaken following a significant F test conducted on the entire experiment. In other circumstances, they are conducted instead of the overall F test, in which case they are often referred to as planned comparisons," or planned "comparisons can be conducted directly on a set of data without reference to the significance or nonsignificance of the omnibus F test" (Keppel, 1991, pp. 110, 112); (b) "It is not necessary to perform an overall test of significance prior to testing planned orthogonal contrasts" (Kirk, 1982, p. 96); (c) "Tests may ... arise because of the particular hypotheses the experimenter has which (s)he wants to evaluate. Those hypotheses can be evaluated with or without the overall analysis of variance" (Winer et al., 1991, p. 141).
We know that a main effect for Factor A has (a - 1) degrees of freedom, and so, if we tested a set of (a - 1) orthogonal contrasts, ψ1, ψ2, ..., ψa-1 (and there would be a multitude of possibilities, some sets being more interesting theoretically than others), the sum of these contrasts' sums of squares, SSψ1 + SSψ2 + ... + SSψa-1, would equal SSA. (Of course, not all the interesting research questions involve orthogonal contrasts, cf. Kirk, 1982, p. 95, but this assumption helps to simplify this discussion.) For some sets of (a - 1) ψs, we could order them from largest to smallest and know that if MSA was significant, so too will at least the largest MSψ be, as well as perhaps others. On the other hand, it could be that we have a (barely) significant MSA, but that among the ψs we have our heart set on, even the maximum SSψ may not exceed the critical value. This relation may be seen by considering the inherent power of the main effect test, wherein all the data from all the levels of Factor A are lent to the test, compared with that of the one degree of freedom contrast. For example, imagine our design is 5 x 2, with the typical 10 participants per cell. The degrees of freedom to test MSA are (5 - 1) = 4 and 5*2*(10 - 1) = 90, for a critical Fα=.05 = 2.53. The degrees of freedom to test MSψ are 1 and 90; Fα=.05 = 4.00. That is, the main effect test may detect some variation from the null, but we may have to search a while before finding it in one contrast or two. This result suggests to me that even though we are given license to test a contrast without testing the overall F, I would always be interested in the diagnostic information of the overall F, particularly in case my prior guesses or hypotheses had been wrong (perhaps this has never happened to you?)-the data in this case would be indicating that although my pet theory was not supported, something else interesting may be found. (And, in the "what is the big deal" category, no one computes the overall or contrast Fs by hand, and even if one was only intent on studying and reporting the contrast Fs, the statistical computing packages all produce the overall Fs in the ANOVAs, so why would one ignore complete information?)

¹Do not be overly concerned with the adjective orthogonal here, because, as Kirk (1982) acknowledged, any contrast "can be expressed as linear combinations of those in an orthogonal set" (p. 93).
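For readers who want to reproduce the degrees-of-freedom comparison, a small sketch follows (assuming scipy is available; the exact quantiles differ slightly from the tabled values quoted above, which appear to be conservative table look-ups).

```python
from scipy import stats

alpha = .05
df_error = 5 * 2 * (10 - 1)    # 5 x 2 design, 10 participants per cell: 90 df

# critical F for the omnibus test of A (4 numerator df) versus a 1-df contrast
f_crit_omnibus = stats.f.ppf(1 - alpha, 5 - 1, df_error)
f_crit_contrast = stats.f.ppf(1 - alpha, 1, df_error)
print(f_crit_omnibus, f_crit_contrast)   # the 1-df cutoff is the larger one
```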
We have seen that if an overall F is significant, then at least the maximum contrast will achieve significance. What we have not seen yet is a scenario for which a contrast may be significant and the overall F not significant. That is because this scenario is logically and statistically impossible, unless one begins to compare apples to oranges. Two examples of where the overall F test is an apple and the follow-up contrasts are oranges are these: First, Keppel (1991, p. 124) discussed planned comparisons for which only a subset of the data are used to compute an error term for each contrast-namely, only those cells being compared contribute to the temporary error term (also, see his response to the question on identifying appropriate mean square error [MSE] terms in this special issue). The argument is that the within-cell variability may be smaller for the focal cells, so that the fuller, aggregate MSerror term provides insufficient sensitivity to detect the contrast. Thus, the overall F test would not be significant, whereas the test of the contrasts against a presumably smaller error term would be more sensitive and could yield significance.
However, note that the variances would need to be fairly heterogeneous to benefit from this adjustment (not to mention reviewers going bonkers); for example, in our 5 x 2 example with 10 participants per cell, we saw the MSA would be tested on 4 and 90 degrees of freedom, for an Fα=.05 = 2.53. If we conducted a contrast on a = 1 versus a = 2, and created a temporary MSerror term involving only the data in those cells, the degrees of freedom would be 1 and 2*2*9 = 36, for an Fα=.05 = 4.17. In any event, this sort of test is changing slightly the setting in which the contrast is tested from that in which the overall F was tested.
A second example of apples to oranges is due to the numerous comparison test procedures, each of which is designed to detect somewhat different alternative hypotheses (i.e., noncentrality parameters; e.g., Hochberg & Tamhane, 1987; Klockars & Hancock, 1992; Seaman, Levin, & Serlin, 1991; also see Wahlsten, 1991). For example, Scheffé's follow-up test uses the F distribution, whereas Tukey's test (for comparing pairs of means) uses a different distribution to obtain critical values (Scheffé, 1959, p. 73). Scheffé (p. 67) said that if cell sample sizes are equal, and you really only want to compare means in pairs (e.g., μi - μj = 0, not, e.g., μi + μj - 2μk = 0), then Tukey's test provides greater precision (narrower confidence intervals). Thus, it may be possible to have a significant overall F; a significant Scheffé test of some complicated combination like μi + μj - 2μk (i.e., the average in conditions i and j differs from k); no significantly different pairs of means using the Scheffé test, but significantly different pairs of means using the Tukey test; or, an insignificant overall F; no significant Scheffé follow-up tests; and a significant Tukey test.
(It is even conceivable that the computation of contrasts on portions of one's data rather than on the fuller data set had a simple sociological explanation, e.g., the computation of such statistics by hand in the days before computers. Even so, this situation no longer applies, so lingering practices should also be updated.)
There remains to be discussed the issue regarding the alpha level of contrast tests that are a priori or planned versus comparisons that are a posteriori or post hoc (Hays, 1988, pp. 384-385; Kirk, 1982). Most recommendations allow each planned comparison to be tested at the .05 error rate, relying on the logic that the hypothesized contrast was constructed before the researcher saw the data, so there would have been no possibility for capitalization on chance. Note the maximum numbers of contrasts are still constrained by their degrees of freedom (e.g., a - 1 for A, etc.), because otherwise, the researcher could delineate a very long list of contrasts and claim them all as planned, for who among us is not clever enough to create some rationalization if need be.
Recommendations usually urge a correction on the .05 error rate for post hoc contrasts, usually of a Bonferroni form in which .05 is divided by the number of tests run for Factor A (and then .05 is adjusted separately for the family of tests run on B, etc.; cf. Holland & Copenhauer, 1988). Alternatively, one could use the macho (statistically conservative) Scheffé (1959) test, which permits testing of any sort of post hoc comparison (e.g., pairs of means or more complicated combinations) and "is applicable only in a situation where a preliminary overall F test for the treatment has shown significance. If such a test has shown significance, then there exists at least one comparison for which the null hypothesis will be rejected at the same level of significance" (Minium, 1978, p. 420; Scheffé, 1959, p. 70). Given the conservative nature of the Scheffé test, statisticians classify it as an alternative to the Bonferroni correction on alpha (i.e., if one is using a Scheffé test, one need not also correct for the number of tests on the alpha rate; Hays, 1988, pp. 414-415).
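As an illustration of the Bonferroni form of the correction, the sketch below (hypothetical p values; statsmodels' multipletests is one standard implementation) holds each of three post hoc contrasts on Factor A to .05/3.

```python
from statsmodels.stats.multitest import multipletests

p_factor_a = [.012, .034, .049]        # hypothetical post hoc contrast p values
reject, p_adjusted, _, _ = multipletests(p_factor_a, alpha=.05,
                                         method='bonferroni')
print(reject)       # only p < .05/3 = .0167 survives the family-wise correction
print(p_adjusted)   # equivalently, each p multiplied by the number of tests
```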
My theoretician colleagues see statistics as a tool, not a master, and hold the data to the standard that a working hypothesis is to be judged superior, at least for the moment, if it is more parsimonious and explains more of the data patterns than any competitive theory. Certainly, and one key modifier, as I think they would agree, is the phrase "for the moment." Significance tests are indexes to aid the researcher in understanding what effects are likely replicable. If a post hoc analysis temporarily yields a better theoretical explanation, yet is not replicated with repeated, concerted efforts, then one must deduce that the result had been a sampling blip and that the formerly superior theory may need to be replaced with its apparently less elegant predecessors. It would be interesting to explore in greater detail the intersections of philosophy and statistics, given their common heritage in logic and mathematics. Just as we should pay attention to substantive theories, we must also play by the rules of statistical theories: Many statisticians are unforgiving of post hoc tests unless their foundational assumptions are met, like the corresponding overall omnibus tests being significant-for example, "What one cannot do is to attach an unequivocal probability statement to such post hoc comparisons, unless the conditions underlying the method have been met" (Hays, 1988, p. 418). Accordingly, I do not think it hurts to be a little cautious-namely, statistically conservative-in post hoc tests by correcting alpha or using the Scheffé (1959) test (cf. Cliff, 1983, pp. 123-124).
Finally, let us return to the question of the application of this logic to the heretofore seemingly analogous overall F test for an interaction and the follow-up simple effects. Given the answer previously to (a), wherein it was described that simple effects break down not just the interaction SSAxB, but also the sums of squares for one of the main effects, it would be inappropriate to test and report simple effects in the absence of a significant overall F test for that interaction. We have seen that a simple effect can be significant driven by an interaction or a main effect. The autonomy in the main effect case between the overall F and the follow-up contrasts is due to the fact that all the contrast sums of squares are component portions of the sums of squares for the main effect-that is, the components sum up to the whole. By comparison, the tests of an overall interaction and its simple effects cannot be viewed as similarly autonomous; if a simple effect is significant, one hypothesis is that the interaction is significant, but for many data patterns, an alternative explanation-a completely reasonable and likely plausible competing theory-is that there is no interaction, only a main effect. Consider Keppel's (1991) words on the issue:
Occasionally you will see simple effects tested with no mention of the statistical test of interaction. The typical pattern of results will consist of two simple effects-one that is significant and one that is not-and the researcher concludes that interaction is present. The trouble with this argument is that the formal test of interaction assesses the differences in simple effects. An inference of interaction that is based on the test of simple effects alone does not provide this necessary information.... A simple effect of Factor A does not reflect "pure" interaction but is influenced also by the average main effect of the factor.... Thus, testing the simple effects alone does not provide an unambiguous picture of interaction. (pp. 244-245)
However, let us look at a few important exceptions, occasions for which one may proceed to interpret simple effects unambiguously. Recall the basic equations that testing a simple effect of Factor A at b1 or b2 breaks down the SSAxB and SSA, and that testing a simple effect of Factor B at a1 or a2 breaks down the SSAxB and SSB. In Figure 2, we see a pattern of means that is consistent with there being no significant main effect for A but a significant main effect for B. To say SSA is not significant is to say it is statistically not greater than zero. If so, then the simple effects of Factor A at b1 or b2 may be attributed more certainly to the interaction, given that the SSA term is statistically zero. Note, though, that the interpretation of the simple effects of Factor B at a1 or a2 is still ambiguous due to the significant main effect for B. Also, note that the SSA will not be precisely zero; if the main effect for A is approaching significance, it would be conservative to not interpret the simple effects of A at b1 or b2 either.
FIGURE 2  Now, imagine SSA was not significant and SSB was as in the figure.

Figure 3 presents the converse situation: the main effect for A is significant and for B it is not. Given that SSB is statistically negligible, the simple effects of Factor B at a1 or a2 are interpretable unambiguously as attributable to the interaction. The simple effects of Factor A at b1 or b2 are, of course, still problematic due to the significant main effect for A.

FIGURE 3  Now, imagine SSB was not significant and SSA was as in the figure.

Finally, Figure 4 depicts a scenario for which neither main effect is significant. In cases like these, SSA and SSB are both statistically zero, so "have at it." Either set of simple effects, those that examine the effect of Factor A at b1 or b2 or those that examine the effect of Factor B at a1 or a2, are interpretable unambiguously.

FIGURE 4  Finally, the result if neither SSA nor SSB were significant.

In conclusion, with regard to testing contrasts on main effects without testing the omnibus main effect hypothesis, there is some difference of opinion among experts. For the vast majority of the behavioral research in our field, the debate is moot because most of our factors have only two levels. (My personal approach, had I a factor with three or more levels, would be to examine full information-the omnibus main effect, as well as a priori and post hoc contrasts; why would we neglect certain aspects of understanding our data?)

Regarding the testing of simple effects, clearly the preference is to do so after having demonstrated a significant interaction. If the interaction itself is not significant, one must proceed carefully. If the main effect of A is significant and B is not, one may conduct the tests of SSB@a1 and SSB@a2, but not SSA@b1 or SSA@b2. If the main effect of B is significant and A is not, one may conduct the tests of SSA@b1 and SSA@b2, but not SSB@a1 or SSB@a2. If neither main effect is significant, any of the four simple effects is defensible. Finally, there is no debate among the experts that proceeding to test simple effects, without demonstrating a significant interaction, in the presence of both main effects being significant is indefensible.

REFERENCES

Cliff, Norman. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115-126.
Hays, William L. (1988). Statistics (4th ed.). New York: Holt, Rinehart, & Winston.
Hochberg, Yosef, & Tamhane, Ajit C. (1987). Multiple comparison procedures. New York: Wiley.
Holland, Burt S., & Copenhauer, Margaret DiPonzio. (1988). Improved Bonferroni-type multiple testing procedures. Psychological Bulletin, 104, 145-149.
Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Kirk, Roger E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.
Klockars, Alan J., & Hancock, Gregory R. (1992). Power of recent multiple comparison procedures as applied to a complete set of planned orthogonal contrasts. Psychological Bulletin, 111, 505-510.
Minium, Edward W. (1978). Statistical reasoning in psychology and education (2nd ed.). New York: Wiley.
Scheffé, Henry. (1959). The analysis of variance. New York: Wiley.
Seaman, Michael A., Levin, Joel R., & Serlin, Ronald C. (1991). New developments in pairwise multiple comparisons: Some powerful and practicable procedures. Psychological Bulletin, 110, 577-586.
Wahlsten, Douglas. (1991). Sample size to detect a planned contrast and a one degree-of-freedom interaction effect. Psychological Bulletin, 110, 587-595.
Winer, B. J., Brown, Donald R., & Michels, Kenneth M. (1991). Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.
The issues of a priori versus post hoc tests and logics are intriguing, so I invited two of my philosophically inclined colleagues to speak to the issue, and their response follows.
Professors Alice Tybout and
Brian Sternthal
Northwestern University
To address the a priori versus post hoc explanation issue, consider the following situation: A study is conducted in which the outcomes predicted from theory are not confirmed by the data. The investigator develops an alternative explanation for the data that appears to provide a compelling account of the findings. No rival explanations can be thought of that account for the data as well as the investigator's theorizing. Does this represent a rigorous test of theory?
Intuitively, it would seem that a test of theory is more rigorous if the phenomena observed are predicted prior to conducting the research rather than if an explanation is fashioned after the outcomes are known. Accurate prediction is viewed as a formidable task because the researcher must anticipate outcomes that are unseen and that may be influenced by poor measurement and extraneous events. In contrast, post hoc analysis is often pejoratively referred to as a fishing expedition that is simplified by the availability of data at the time the explanation is being constructed. The concern is that the explanation offered post hoc is likely to be particular to the data at hand and unlikely to be confirmed in subsequent tests. The conventional wisdom is that when explanation is post hoc, another test in which the theory's predictions are confirmed is required for the test to be rigorous.
The view that prediction offers a more compelling test of theory than post hoc explanation thus rests on the notion that prediction is a more difficult task. Even if this were the case, it is inappropriate to equate task difficulty with the rigor of a theory test. The rigor of a theory test is determined by whether one theoretical view offers a more parsimonious explanation for the findings than do its rivals (Sternthal, Tybout, & Calder, 1994). An explanation is preferred if it accounts for the observed outcomes with fewer concepts than other explanations, or if it can explain more phenomena with the same number of concepts as its rivals.
Defining rigor in terms of parsimony has an important implication with regard to prediction and post hoc explanation. It implies that when an explanation is conceived in relation to data collection is immaterial. For the moment, the best available account for this and other relevant data has been identified. Another study may be conducted to test the predictions a priori, but if the theory's predictions are simply confirmed, the study would not represent further theoretical progress. Given the limitation of induction, another study would not increase confidence about the adequacy of the theory or the rigor of the test.
The view that post hoc explanation advantages the researcher is fallacious on other grounds. First, if the explanation is derived after the data are collected, the inclusion of experimental controls that would be useful in distinguishing a preferred explanation from its rivals is less likely to have been anticipated and administered than if the theory test were based on prediction. Thus, finding a unique explanation for the data despite this impediment would seem impressive.
Furthermore, it should also be noted that the seeming advantage of developing an explanation post hoc is offset by the fact that for an explanation to be parsimonious and for a test to be rigorous, the explanation must account for relevant data beyond that reported by the investigator. As Brush (1989) noted, in this case postdiction is even more impressive than prediction because it offers a better account of data in a situation in which rivals have had an opportunity to be formulated and tested. In contrast, prediction makes it far more likely that later views will be found to offer more parsimonious explanations than the currently preferred theory.
The focus on parsimony as the criterion for the rigor of a theory test circumvents the problem attendant to when an explanation is conceived. Post hoc explanation is typically suspected when predictions are counterintuitive or complex. In this situation, the investigator is often challenged to document that the outcomes obtained were predicted and, regardless of the reality, may have difficulty doing so. By focusing on parsimony, the thorny issue of when explanation was conceived becomes immaterial.
In developing a post hoc explanation, how well the data are explained by alternative theories is examined. If two or more views offer equally parsimonious explanations for the data, additional research is needed if the test of theory is to be rigorous. If the data in this and related studies are explained more parsimoniously by one theory than another, the test is rigorous and the research is in some sense complete.
REFERENCES

Brush, Stephen G. (1989). Prediction and theory evaluation: The case of light bending. Science, 246, 1124-1129.
Sternthal, Brian, Tybout, Alice M., & Calder, Bobby J. (1994). Experimental design: Generalization and theoretical explanation. In Richard P. Bagozzi (Ed.), Principles of marketing research (pp. 195-223). Cambridge, MA: Blackwell.
I.B. CAN I MAKE CONCLUSIONS BASED ON AN INTERACTION WITHOUT TESTING THE SIMPLE EFFECTS?

A researcher predicts a crossover interaction between two variables (one continuous, the other dichotomous) that is tested via a linear contrast regression model containing the appropriate main effect and interaction terms. The interaction term receives a significant coefficient. The researcher then concludes that the data support the predicted interaction. Is this an appropriate conclusion? More specifically, given that a significant interaction could occur for data patterns that differ from the one predicted (e.g., a noncrossover pattern, a crossover pattern in the opposite direction), is it not necessary to undertake the appropriate simple main effects tests to establish whether the data actually support the differences predicted by the crossover interaction?
Editor: Yes, it is imperative to defend a statement purported to describe data with a statistic. The scenario you have described for regression would be like obtaining a significant F test for the MSAxB interaction in ANOVA and then doing no further investigations-that is, omitting both the plot of the cell means and the tests of simple effects that substantiate claims about precisely which means are significantly different from each other. You are absolutely right that a mere significant regression coefficient associated with an interaction term yields no detailed information about the nature of that interaction. If you see someone trying to do this-make a claim about their data without a statistic to support that claim (regardless of whether in regression or ANOVA or anything else for that matter)-nail them. (You may find Questions I.A. and II.H. in this special issue helpful also.)
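A sketch of what those further investigations might look like in the regression setting follows (invented data; pandas and statsmodels assumed). The interaction coefficient is tested in the full model, and the simple slope at each level of the dichotomous moderator is then obtained by recoding the moderator, so both slopes reuse the full model's error term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'x': rng.normal(size=n),
                   'g': rng.integers(0, 2, size=n)})     # dichotomous moderator
df['y'] = 1 + 0.2*df.x + 0.5*df.g - 0.9*df.x*df.g + rng.normal(size=n)
df['g_flip'] = 1 - df.g

fit = smf.ols('y ~ x * g', data=df).fit()
print(fit.pvalues['x:g'])    # the interaction test the question describes

# Simple slopes: 'x' is the slope of x at g = 0; refitting with the moderator
# reverse-coded gives the slope at g = 1, with the same residual error term.
fit_flip = smf.ols('y ~ x * g_flip', data=df).fit()
print(fit.params['x'], fit.pvalues['x'])            # slope of x at g = 0
print(fit_flip.params['x'], fit_flip.pvalues['x'])  # slope of x at g = 1
```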
I.C. IDENTIFYING THE APPROPRIATE MSE TERM FOR AN F TEST

When you have a design involving multiple factors, what is the appropriate MSE to employ in estimating follow-up contrasts involving a subset of the overall design? For example, if I have two factors (A and B) with two levels each, and I want to run a contrast analyzing the simple effect of A, what MSE should be used? When all factors are between subjects in nature, a general recommendation I have seen is that the correct error term is that associated with the overall design. However, in some recent e-mails I have exchanged with Geoff Keppel (personal communication, 1997), he indicated that in the new edition of his book he will advocate using the MSE only from the data relevant to the contrast-that is, in the earlier example, using a one-way ANOVA based on data from participants exposed to the first level of B for one contrast, then a separate one-way using data from those exposed to the second level of B only.
A more complicated, but related, issue is when a mixed design is involved-for example, where A and B are between-subjects factors and C and D are within-subjects factors. In analyzing the simple effect of A, the problem here is that there is no MSE from the overall design. Rather, there are MSE terms for relevant subsets of the design (e.g., one for the subset of the design including only A and B, one that includes A, B, and C, etc.). I have seen an article recently (I cannot recall which one in particular) in which the MSE from the most relevant subset of the design is used. However, given that this may include data not entirely relevant to the contrast of interest, a more precise way to estimate the simple main effect of the variable of interest would be, again, to use a one-way ANOVA using an error term based only on data contributing to that effect.
Although more precise (although that may be debatable), a problem with running the contrasts using separate error terms is the lower power associated with this alternative relative to that for an error term based on a larger part of the design. Therefore, this is an interesting dilemma: more precision or more power?
Professor Geoffrey Keppel
University of California, Berkeley
Although this question focuses on the nature of the error terms for evaluating follow-up contrasts involving a subset extracted from a larger design, I would prefer to broaden this discussion to include the planned contrasts that usually are conducted in lieu of an overall or omnibus test.
Between-subjects designs. When all factors are between subjects (a between-subjects design), the frequent choice for the error term is the one based on the variances of the independent groups, pooled or averaged over all treatment conditions. This choice is appropriate provided that the group variances are homogeneous and, less important, that the scores are normally distributed for all groups. Under these circumstances, the overall error term provides a more precise estimate of the population variances for the treatment groups than error terms based on subsets of the experiment and with greater power due to the increased number of degrees of freedom associated with this estimate. When the assumption of homogeneous variances cannot be reasonably met, however, the average error term either overestimates or underestimates the error variability appropriate for any given subset of the data, the direction depending on the particular variances included in the subset of groups (e.g., see Keppel, 1991, pp. 123-124). Under these circumstances, then, I suggest that the correct choice of error term is dictated by concerns for precision, which would be an error term based only on those groups included in the subset under consideration. On the issue of power, a researcher should include sufficient numbers of observations to provide acceptable power for these critical comparisons-comparisons that in effect should be viewed as separate experiments whose power is determined in part by the researcher's choice of the number of participants.
Single-factor designs. In a single-factor design, most planned and follow-up analyses would consist of comparisons between one treatment condition and another. With heterogeneity present, one might use the Welch test, which calculates the error term as a weighted average (with two groups and equal numbers of participants, the error term is simply an average of the two group variances), with its degrees of freedom adjusted somewhat, depending on the degree of heterogeneity present (see Keppel, 1991, pp. 124-128).
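A minimal sketch of such a heterogeneity-robust pairwise comparison (invented data; scipy's equal_var=False option implements the Welch adjustment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group1 = rng.normal(10, 1, size=15)    # small within-group variance
group2 = rng.normal(12, 4, size=15)    # much larger variance

# Welch test: separate variances, adjusted (non-integer) degrees of freedom
t, p = stats.ttest_ind(group1, group2, equal_var=False)
print(t, p)
```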
Two-factor designs. In a two-factor factorial design, the possibilities multiply. Analyses focusing on subsets of treatment groups may consist of comparisons between two means contributing to either of the two main effects, interaction contrasts that focus on different processes contributing to the overall interaction, simple effects that focus on the effects of one of the independent variables (IVs) at a particular level of the other, and so-called simple comparisons that usually contrast two of the treatment means making up any given simple effect. If variance homogeneity is a safe assumption, the error term for all of these analyses is the average group variance for the entire factorial design. On the other hand, if this assumption cannot be reasonably met, a researcher should consider calculating individual error terms for each analysis. One way to visualize these analyses is to block out or identify the groups contributing to a given analysis and use as the error term an average of those particular group variances. Again, the logic behind this approach is based on using a more precise estimate of error variance for the particular analysis involved. Concerns for power should be addressed in the design of the experiment and not be dependent on the small gain in apparent power afforded by the degrees of freedom associated with the error term from the overall analysis.
Within-subjects designs. Let us turn now to the specification of error terms in within-subjects designs-designs in which participants serve in all possible treatment conditions or treatment combinations. The assumptions associated with these designs are more complicated than those associated with between-subjects designs. Moreover, violations of these assumptions appear to be more common and more serious. Fortunately, many of these complications are safely solved by following a procedure analogous to the one I suggested for between-subjects designs when heterogeneity is present.
Single-factor designs. For the single-factor design, which I refer to as an (A x S) design, comparisons between two treatment means are evaluated with an error term based only on the data contributing to that particular analysis. To illustrate, a comparison between two means becomes in effect a miniature single-factor design-an (Acomp. x S) design in which "Acomparison" designates the factor represented by the two conditions being compared (see Keppel, 1991, pp. 356-361). The resulting F is associated with one degree of freedom for the numerator (Acomp.) and (a - 1)(n - 1) = (2 - 1)(n - 1) = n - 1 degrees of freedom for the denominator (Acomp. x S).
Two-factor designs. The detailed analysis of the two-factor within-subjects design follows a similar procedure, covered in detail in Keppel (1991, pp. 468-478). Any analysis of the two main effects starts by combining or collapsing the data for each participant over the variable not involved in the analysis (over the levels of Factor B in the analysis of the A main effect and over the levels of Factor A in the analysis of the B main effect). For the purposes of this analysis, then, we can treat these combined data as if they were derived from an actual single-factor within-subjects design, using the miniature single-factor arrangements of the data described in the preceding paragraph-an (Acomp. x S) design for a comparison based on the A main effect and a (Bcomp. x S) design for a comparison based on the B main effect. The analysis of simple effects divides the factorial design into a number of single-factor within-subjects designs. For example, an analysis of the simple effects of Factor A produces a set of single-factor within-subjects designs, one (A x S) design for each level of Factor B. The error term for each analysis is based only on the data relevant for the analysis. For the analysis of a simple effect of Factor A, the error term is the interaction of Factor A with participants at a particular level of Factor B, whereas the error term for a simple comparison between two means contributing to a simple effect is an interaction based on this particular comparison and participants (Acomp. x S). The analysis of interaction contrasts is also based on miniature experiments-namely, a minimal two-factor within-subjects design in which each factor represents a comparison between two means; the error term is the three-way interaction based on this miniature experiment (Acomp. x Bcomp. x S).
In summary, we can see an overall analysis strategy emerging from this discussion. If the homogeneity assumption for the between-subjects designs and the additional assumptions for the within-subjects designs are in question, the safest course of action is to avoid error terms that include estimates of error variance that are not directly relevant to the analysis under consideration. Specialized error terms for either type of design are derived from miniature designs formed by extracting the relevant data from the overall design and basing the analysis on this restricted set of data.
Mixed factorial designs. The analysis of mixed factorial designs in which both types of IVs, between-subjects and within-subjects, are present further complicates the specification of error terms for analytical comparisons and analyses, when there is reason to believe that the homogeneity assumptions are violated, which, in my opinion, is a highly likely possibility. Under these circumstances, you might prefer to seek out detailed expositions available in books, such as the two chapters I devoted to the discussion of these sorts of analyses within the context of the mixed two-factor design (Keppel, 1991, chaps. 17 and 18), which provide extensive elaboration and worked numerical examples. For this relatively short comment, however, I simply note that the justification for the error terms I recommend in those chapters are extensions of the principles discussed here for the analysis of pure between-subjects and within-subjects designs. To be more specific, you first isolate that portion of the data matrix that contributes directly to the specific analysis you are contemplating. Doing so guarantees that you will base your estimates of error variability on the relevant portion of your data matrix. Second, you should determine the nature of the type of design that is reflected after you have isolated the relevant data matrix. Let us see how this works out for the analysis of the two sets of simple effects and of an interaction contrast, using the two-factor mixed design as an example.
Simple effects involving the between-subjects factor. Suppose you are considering the analysis of the simple effects of the between-subjects factor (Factor A) at Level b1. Once you have identified the relevant portion of the data matrix, you should be able to see that the resulting design is a single-factor between-subjects design and analyze the data accordingly-that is, use for the error term the average within-cell variance for the different A conditions at b1. This is a between-subjects design because each participant supplies only one observation for this particular analysis. If the simple effect is significant, you would probably consider conducting meaningful comparisons between means that are contributing to this significant effect. Because this subset of data is in effect a single-factor between-subjects design, you would proceed as I suggested in the first section of this note if the cell variances are not reasonably homogeneous-namely, by calculating separate error terms and adjusting the degrees of freedom for the error term. You would repeat this entire process with the subsets of data representing the remaining levels of Factor B (for a numerical example, see Keppel, 1991, Section 17.5).
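As a sketch of this recommendation (invented data; scipy assumed), the simple effect of a three-level between-subjects Factor A at Level b1 is run as its own one-way design, so the error term pools only the cells observed at b1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# the three cells of Factor A observed at Level b1 (hypothetical scores)
a1_b1 = rng.normal(5.0, 1.0, size=10)
a2_b1 = rng.normal(5.5, 1.0, size=10)
a3_b1 = rng.normal(7.0, 1.0, size=10)

f, p = stats.f_oneway(a1_b1, a2_b1, a3_b1)   # error term from b1 cells only
print(f, p)
```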
Simple effects involving the within-subjects factor. Let us consider next the analysis of the simple effects of the repeated- or within-subjects factor at the different levels of Factor A. Starting with Level a1, you would identify the relevant portion of the data matrix, which you will find is a single-factor within-subjects design-a (B x S) design at Level a1. From this point on, you would follow the procedures I described for analyzing an actual single-factor within-subjects design. In this case, the error term for the simple effect of Factor B at Level a1 is the B x S interaction obtained from this data subset. The error term for each comparison between means you conduct would be based on that portion of this data subset that is relevant to this particular comparison. You would repeat this entire process with the subsets of data representing the remaining levels of Factor A (for a numerical example, see Keppel, 1991, Section 17.6).

Mixed interaction contrasts. Finally, I briefly describe the analysis of interaction contrasts. Assuming that the two interacting contrasts (Acomp. and Bcomp.) consist of comparisons between two means, the resulting design is a mixed 2 x 2 design extracted from the larger, original mixed factorial. Once you have isolated the relevant portion of the data for this analysis, you should realize that the error term for the interaction contrast is the interaction of the contrast with participants (Acomp. x Bcomp. x S; for a numerical example, see Keppel, 1991, Section 18.3).

Higher order factorial designs. A discussion of the detailed analysis of higher order factorial designs, such as the four-factor design described by the poser of this question, would require a detailed and lengthy answer and, perhaps, a numerical example to render the answer comprehensible. On the other hand, it should be possible to divine a strategy by considering what we do with simpler designs and then extending that strategy to the more complex ones. The strategy I have been advocating is one of isolating the portion of the data directly relevant to the question being asked and then analyzing this subset as if it were a separate experiment-for example, an analysis of the simple effects in a two-factor design as separate one-way designs. Offhand, I do not see why this strategy could not be applied to complex designs.

REFERENCE

Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Editor: If you obtain a significant interaction between Factors A and B, having tested an F statistic of the form, F = MSAxB/MSerror, where MSerror is typically the MSS/AB (in Keppel's, 1991, notation), where participants are said to be nested in A and B because each participant is exposed to one and only one combination of levels of those factors-it is the "error term that is associated with the overall design," as phrased in the question-and you then wish to follow up that omnibus interaction F test with simple effects of A at each level of B (or B at each level of A), the mean squares for those simple effects must also be divided by the same MSerror term if you want the statistical testing to be coherent; that is, if you do not find a significant interaction, you will not find any significant simple effect.
Bird and Hadzi-Pavlovic (1983) defined coherence in the context of a multivariate analysis of variance (MANOVA). If you do not reject an omnibus null in the MANOVA using a particular test statistic (e.g., Wilks's likelihood ratio test Λ, Roy's R, Pillai-Bartlett trace V, Hotelling-Lawley trace T), there will not be any significant multivariate contrasts to be found as long as you are consistent in examining the follow-up contrasts using the same choice of statistic (the F based on Λ, R, V, or T). Thus, Bird and Hadzi-Pavlovic stated,

In a coherent two-stage analysis, the overall test can be regarded as a theoretically unnecessary but practically useful starting point in the exploration of data: theoretically unnecessary because the overall test could be omitted without affecting the outcomes of tests on contrasts; useful in practice because a nonsignificant overall test indicates that there would be no point in carrying out follow-up tests. (p. 168)
Analogously, in the context of an ANOVA, Scheffé (1959, p. 67) described the relation between the overall F test and the follow-up contrasts among subsets of means using his multiple comparison procedure. These analytical situations are close analogs to the issue raised in the question.
The logic of testing contrasts (for main effects) or simple effects (the term for a contrast in an interaction) is that of further breaking down variance terms to try to understand what occurred in the data. Think of a Venn diagram with the largest circle representing SStotal; within that circle are the components due to the variability of SSA, SSB, SSAxB, and the remainder is SSS/AB. A significant interaction indicates that the variability due to the A x B term is large relative to the within-cell variability, hence we call it significant. Simple effects are attempts to break down the interesting A, B, and A x B variabilities into even smaller units that make better sense theoretically-that is, the main effects and interaction sums of squares portions of the larger Venn diagram can be seen as comprised of units representing SSA@b1 and SSA@b2 (or SSB@a1 and SSB@a2; however you wish to slice the data further).² If you test the overall interaction against MSS/AB, but test the simple effects against some other error term (a variant or portion of MSS/AB), your overall test of the interaction and your simple effects will represent results that compare apples to oranges. You can almost always do something mathematically or statistically, but that does not mean it is particularly good logic.
It is not even persuasive to argue that computing an overall interaction F test (MSAxB/MSS/AB) and then running a one-factor ANOVA on the b = 1 data to simulate the simple effect of A@b1 (and then running another ANOVA on the b = 2 data to simulate A@b2) is more precise because only the sums of squares due to within-cell variability for the cells of focus in each test are contributing (i.e., a1b1 and a2b1 for the first test, and a1b2 and a2b2 for the second test). The only condition under which this could be more precise is if you believed the assumption of homogeneity did not hold for your data. For example, if the within-cell variabilities are tiny in the a1b1 and a2b1 cells, but large in the a1b2 and a2b2 cells, then the pooled overall error term MSS/AB would seem too big for the A@b1 test and too small for the A@b2 test.
²The simple effects of A at each level of B break down SSAxB + SSA (SSA@b1 + SSA@b2 = SSAxB + SSA), and the simple effects of B at each level of A break down SSAxB + SSB (SSB@a1 + SSB@a2 = SSAxB + SSB). In essence, the Venn diagram components of SSAxB and SSA are remerged and then reapportioned. The terms SSA@b1 and SSA@b2 should be tested against the same error term as were SSAxB and SSA. For more detail, see Question I.A. in this special issue.
Even then, two criticisms prevail:
1. We know that the univariate analysis of variance is rather robust to fairly disparate heterogeneity-we count on this robustness so much that we rarely investigate whether the assumption holds or not, because we know that it largely does not matter (i.e., we will still be led to the proper decision of rejecting the null hypothesis or not). If the variances are so different that the conditions of purported precision as just described held, the analyst has far bigger problems to worry about than whether the simple effects are run properly (with the overall error term) or as simulated within the single-factor ANOVAs.
2. As the questioner implies, an even bigger problem results from the fact that power is lost due to the new degrees of freedom being much smaller than for the overall MSS/AB term (usually ab(n - 1), now a(n - 1), etc.). It is really optimal to analyze the data properly, in the sense of representing the actual two-factor design and using all the data simultaneously to enhance one's power and accuracy. Sometimes researchers do the pseudo follow-up ANOVAs (e.g., parsing the data set into the a = 1 data and separately the a = 2 data), simply because they had not figured out how to coax the statistical computing package they used into computing simple effects. This can be a nontrivial pragmatic problem, but it is solvable (e.g., in Statistical Analysis System, or SAS, to test the simple effect of A at level b = 1 in a 2 x 2 factorial, include the statement, "contrast 'A@b1' a 1 -1 a*b 1 0 -1 0"; to test A at level b = 2, the statement, "contrast 'A@b2' a 1 -1 a*b 0 1 0 -1"; to test for the simple effect of B at a = 1, include "contrast 'B@a1' b 1 -1 a*b 1 -1 0 0"; and to test B at a = 2, include "contrast 'B@a2' b 1 -1 a*b 0 0 1 -1").
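For packages other than SAS, the same simple effect can be computed by hand against the pooled error term; the sketch below (invented, balanced 2 x 2 data) forms SSA@b1 from the cell means and divides by the overall MSS/AB, as the coherence argument requires.

```python
import numpy as np

n = 10
rng = np.random.default_rng(4)
y = {(a, b): rng.normal(5 + a + a*b, 1, size=n)
     for a in (1, 2) for b in (1, 2)}                # scores per cell

# pooled within-cell error, MSS/AB, on ab(n - 1) df
ss_error = sum(((s - s.mean()) ** 2).sum() for s in y.values())
df_error = 2 * 2 * (n - 1)
ms_error = ss_error / df_error

# simple effect of A at b = 1 (1 df): n * (mean difference)^2 / 2
m11, m21 = y[(1, 1)].mean(), y[(2, 1)].mean()
ss_a_at_b1 = n * (m11 - m21) ** 2 / 2
f_simple = ss_a_at_b1 / ms_error                     # F on (1, df_error) df
print(f_simple, df_error)
```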
I had conditioned my first statement on the typical choice of MSerror being MSS/AB. It is important to remember that sometimes MSerror takes on a form different from MSS/AB. For example, if Factor A is a fixed-effects factor and Factor B is a random-effects factor, then the tests of the main effect of B and the interaction are each tested against MSS/AB, but the test of the main effect of A is tested against MSAxB (with corresponding loss of error degrees of freedom, from ab(n - 1) to (a - 1)(b - 1)). Most experiments are run with all factors as fixed, and so this issue does not usually arise-both main effects and the interaction are tested against the MSS/AB term; the overall error term as it is phrased in the question. (See Hicks, 1982, Iacobucci, 1995, and Jackson & Brashers, 1994, for the issues of fixed vs. random effects and the computation of expected mean squares [EMS], which will guide you to the correct error term. Other references useful with regard to experimental design include Box, Hunter, & Hunter, 1978; Cochran & Cox, 1957; Maxwell & Delaney, 1990.)
Regarding the issue in the second paragraph of the question, it is true that mixed designs can be more complicated to track which MS term serves as an error term for which MS effect terms; however, the basic logic of the simpler two-factor, between-subjects design as discussed thus far still holds. Hence, in the four-factor example given in the question (assuming all factors as fixed for simplicity), the MSAxB interaction would be tested against MSS/AB, and the MSC main effect would be tested against MSCxS/AB. Simple effects on the A x B interaction would also be tested against MSS/AB, and contrasts on Factor C would be tested against MSCxS/AB. Therefore, although it is true that there is no one overall error term for such a design, what is important is that the error term used in each omnibus test (e.g., the overall A x B interaction, the overall C main effect) is also consistently used as the error term when doing follow-up tests (e.g., simple effects within A x B, contrasts within C, respectively).
In conclusion, determining the appropriate MSerror term for an F test depends on the EMS that may be derived as a function of whether the factors are fixed or random and the nature of the experimental design. (The usual case is that all factors are fixed, which leads all MSerrors to be of the form MSS/AB, the simple typical analysis.) If any factors other than participants are random, or the experimental design is not a straightforward factorial, consult design books like Box et al. (1978), Cochran and Cox (1957), Hicks (1982), Jackson and Brashers (1994), or Maxwell and Delaney (1990). (You may also find Question I.G. in this special issue of interest.) When one determines the error term for the test of the omnibus hypothesis, the same error term should be used to test the follow-up contrasts and comparisons.
REFERENCES

Bird, Kevin D., & Hadzi-Pavlovic, Dusan. (1983). Simultaneous test procedures and the choice of a test statistic in MANOVA. Psychological Bulletin, 93, 167-178.
Box, George E. P., Hunter, William G., & Hunter, J. Stuart. (1978). Statistics for experimenters. New York: Wiley.
Cochran, William G., & Cox, Gertrude M. (1957). Experimental designs (2nd ed.). New York: Wiley.
Hicks, Charles R. (1982). Fundamental concepts in the design of experiments. New York: Holt, Rinehart, & Winston.
Iacobucci, Dawn. (1995). Review of random factors in ANOVA. Journal of Marketing Research, 32, 238-239.
Jackson, Sally, & Brashers, Dale E. (1994). Random effects in ANOVA. Thousand Oaks, CA: Sage.
Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Maxwell, Scott E., & Delaney, Harold D. (1990). Designing experiments and analyzing data. Belmont, CA: Wadsworth.
Scheffé, Henry. (1959). The analysis of variance. New York: Wiley.
I.D. POOLING ACROSS FACTORS

What is the appropriate basis for pooling across the levels of a factor in a multifactor design (e.g., pooling across C in a design including Factors A, B, and C)? I have seen or heard of several recommendations, but each independently of the others (of course). One criterion is that pooling may be allowed if there is no theoretical basis for expecting the interaction between A and B to be qualified by C (i.e., no A x B x C interaction would be anticipated). A second, related criterion is whether, empirically, C is found to interact with A and B. A third criterion is suggested by MacKenzie, Lutz, and Belch (1986) that involves testing the equality of variances for all relevant dependent variables (DVs) across all levels of C-pooling is appropriate if no significant differences in variances are detected. Given that one criterion can be met without the others (e.g., significant interactions may be detected even though none were anticipated conceptually, variances may be equal even though interactions involving C are anticipated or detected), some consolidation or hierarchy of these criteria may be useful as a means of reconciling any potential conflicts.
REFERENCE

MacKenzie, Scott B., Lutz, Richard J., & Belch, George E. (1986). The role of attitude toward the ad as a mediator of advertising effectiveness: A test of competing explanations. Journal of Marketing Research, 23, 130-143.
Editor: You are certainly right that the criteria you describe can yield conflicting results, leaving the researcher to ponder whether 'tis nobler to pool or not to pool. The first two of your stated criteria represent the classic interplay between theory and data. We usually put theory on a pedestal; in this application, that means saying that no matter what the data look like, we intend to collapse over levels of Factor C (although, then, one must wonder why Factor C was included in the experimental design). Doing so would be foolhardy if there is a significant A x B x C, A x C, or B x C interaction; even a C main effect suggests that the A x B design would be muddied a bit without recognizing the underlying more complicated factorial structure. The most practical danger of aggregating in the presence of any of these significant terms would be that the remaining A x B effects could be washed out or misrepresented (i.e., the bias that results from missing variables in a model).
An advantage of maintaining the A x B x C structure in reporting is that it is a more valid representation of the actual study, and the significance tests should be more precise because the MSE term has been parsed more finely (i.e., the effects of C, on its own and jointly with the other factors, have been removed). Of course, we might acknowledge that if an A x B x C factorial were submitted to a journal, with no significant main effect for C and no interactions involving C, surely the reviewers would pounce, wondering what was the point of (the probably exploratory) Factor C. Although collapsing over levels of C may not retain the sensitivity of the A x B tests, given the MSE issue, the power may still be strong due to the same effective sample size.
Empirics need to dominate theory in this decision. If there is a significant effect involving C, the fuller factorial ought to be represented, because the data are essentially indicating that different things are happening depending on Factor C. This more complex effect indicates that the theorizing needs to be enriched to account for the empirically demonstrated contingencies. Of course, one of the ways we allow theory to dominate is to question the data; in this case, questioning the importance of the significant effects involving C; for example, "well, it is only a small effect"; "the results are qualitatively similar, it is just that A x B at C1 does not achieve significance, whereas A x B at C2 does"; or even, "there are a couple of significant effects involving C, but I do not know what to do with them, and the A x B effects that I wish to present are still significant even though I have collapsed the design in the analysis" (though, no, you do not get bonus points for the test being more conservative here), and so on. Rationalizations to oneself or others can be persuasive, as long as the researcher is not missing an opportunity to advance the theoretical explanation.
The MacKenzie et al. (1986, p. 135) criterion was a different kind of animal. They sought aggregation to have bigger sample sizes to enable them to execute structural equations models. (They tested for homogeneity of variances on each of the 12 measures they were proposing to model and did not find differences across conditions. Perhaps they could have used simultaneous, i.e., multivariate, tests to examine whether the 12 x 12 variance-covariance matrices were equal across the conditions. It is also interesting that, although we expect our manipulations to affect means, we expect variances, or in this case, covariances, as input to the structural equations models, to remain constant.) The focus of the aforementioned question seems to be on the appropriateness of collapsing across a factor in the context of analysis of variance, not structural equations models, and ANOVA is known to be fairly robust to violations of homogeneity of variance. That is, even if one were to apply the MacKenzie et al. (1986) criterion and find that the variances differed significantly across conditions, the subsequent ANOVA is less likely to be affected. Therefore, I do not really see their criterion as a third viable candidate for the typical data analysis scenario in ANOVA.
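If one nonetheless wished to run a variance-equality check in the spirit of MacKenzie et al., a minimal SAS sketch would be the following (the data set and variable names are hypothetical; HOVTEST is available on the MEANS statement for one-way models):

  proc glm data=study;
    class c;
    model y = c;
    means c / hovtest=levene;   /* Levene's test of homogeneity of variance across levels of C */
  run; quit;

A nonsignificant Levene statistic would satisfy their criterion for the measure y.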
I.E. UNEQUAL CELL SIZES (MISSING DATA)
How critical is the problem of unequal cell sizes (missing data), and what should be done about it? My experimental design professor in our educational psychology department says that ending up with unequal cell sizes gives the appearance that the researcher did not have complete control over the experiment, in addition to creating even more potentially serious concerns about differential dropout rates from different conditions (experimental mortality) and violating equal variance assumptions. However, despite our best efforts (e.g., signing up equal numbers of participants for each condition), not all participants show up for their assigned sessions, complete the task, or supply usable data. I have heard solutions such as randomly dropping one or two participants to even out cell sizes. Is this preferred to collecting more data to add to the size of the smaller cells? And, how should we report or describe this problem (and how we handled it) in writing up the results of the experiment?
Editor: It seems a bit harsh to me to make an attribution regarding a researcher's (lack of) control when participants do not show up for their assigned sessions or fail to yield data for other reasons. If it is any consolation, I think most researchers regularly experience the problems you have described.

It is certainly true that perfectly balanced designs are highly desirable; for a given sample size, equal cell sizes contribute to the overall power in detecting significant effects and to the robustness in detecting those effects in the presence of violated assumptions (e.g., nonnormality or heteroscedasticity; cf. Algina & Oshima, 1990; Hakstian, Roed, & Lind, 1979; Sawilowsky & Blair, 1992). Your solutions are commonly offered, with the usual counterobjections and caveats: it seems cost-ineffective and wasteful to drop participants, and in contrast, it may be effortful to gather additional data that now carry a possible confounding of having been collected later in time than the earlier, previously unbalanced data (e.g., Efron, 1994; Little, 1992; Little & Rubin, 1987; Searle, 1987).
The problem of unequal cell sizes is actually severe. In a 2 x 2 design, cell sizes of {10, 10, 10, 11} are clearly better than {10, 10, 10, 20} (a pattern with high variance) or {8, 12, 14, 6} (some cell sizes are quite small), for example, but strictly speaking, all three of these sets are unbalanced.
One of the issues is easy to see. For example, in the tiny data set of N = 6 observations depicted in the following, the cell means suggest there will be no significant main effect for Factor A, nor an interaction, but perhaps a main effect for B. The marginal means will vary depending on whether one computes them as means of cell means (e.g., (8 + 5)/2 = 6.5 for both a = 1 and a = 2) or means of raw data (e.g., (7 + 9 + 5)/3 = 7 for a = 1 and (8 + 4 + 6)/3 = 6 for a = 2).
data:
           b=1      b=2
  a=1      7, 9     5
  a=2      8        4, 6

cell means:
           b=1      b=2
  a=1      8.0      5.0
  a=2      8.0      5.0
The bigger pickle actually comes in that the unbalance creates a problem analogous to multicollinearity in regression, in that the hypotheses to be tested are affected (e.g., Iacobucci, 1995; Milligan, Wong, & Thompson, 1987; Perreault & Darden, 1975).
Imagine writing those six data points in terms of the ANOVA model parameters ($Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$), or at least the first several of the model terms, to keep it simple. To test for the main effect of Factor A, we would posit $H_0: \alpha_1 = \alpha_2 = 0$, or $\alpha_1 - \alpha_2 = 0$.

  a=1, b=1: $7 = \mu + \alpha_1 + \beta_1$ and $9 = \mu + \alpha_1 + \beta_1$
  a=1, b=2: $5 = \mu + \alpha_1 + \beta_2$
  a=2, b=1: $8 = \mu + \alpha_2 + \beta_1$
  a=2, b=2: $4 = \mu + \alpha_2 + \beta_2$ and $6 = \mu + \alpha_2 + \beta_2$
If we compared the mean for a = 1 to the mean for a = 2, where the means are computed as means of cell means, we would have

$\frac{1}{2}\{\frac{1}{2}[(\mu+\alpha_1+\beta_1)+(\mu+\alpha_1+\beta_1)] + (\mu+\alpha_1+\beta_2)\} - \frac{1}{2}\{(\mu+\alpha_2+\beta_1) + \frac{1}{2}[(\mu+\alpha_2+\beta_2)+(\mu+\alpha_2+\beta_2)]\}$
$= \frac{1}{2}[(\mu+\alpha_1+\beta_1)+(\mu+\alpha_1+\beta_2)] - \frac{1}{2}[(\mu+\alpha_2+\beta_1)+(\mu+\alpha_2+\beta_2)]$
$= \frac{1}{2}(2\mu+2\alpha_1+\beta_1+\beta_2) - \frac{1}{2}(2\mu+2\alpha_2+\beta_1+\beta_2)$
$= \mu+\alpha_1+\frac{1}{2}\beta_1+\frac{1}{2}\beta_2 - \mu-\alpha_2-\frac{1}{2}\beta_1-\frac{1}{2}\beta_2$
$= \alpha_1 - \alpha_2$

which is precisely the hypothesis we wish to test.
If, instead, we used the means for a = 1 and a = 2, where they were computed as means of raw data, we would have

$\frac{1}{3}[(\mu+\alpha_1+\beta_1)+(\mu+\alpha_1+\beta_1)+(\mu+\alpha_1+\beta_2)] - \frac{1}{3}[(\mu+\alpha_2+\beta_1)+(\mu+\alpha_2+\beta_2)+(\mu+\alpha_2+\beta_2)]$
$= \frac{1}{3}(3\mu+3\alpha_1+2\beta_1+\beta_2) - \frac{1}{3}(3\mu+3\alpha_2+\beta_1+2\beta_2)$
$= \mu+\alpha_1+\frac{2}{3}\beta_1+\frac{1}{3}\beta_2 - \mu-\alpha_2-\frac{1}{3}\beta_1-\frac{2}{3}\beta_2$
$= \alpha_1 - \alpha_2 + \frac{1}{3}(\beta_1 - \beta_2)$
which is some interesting bizarro hypothesis, but not one we wish to test. We want to isolate the model terms purely: the $\alpha$s for the main effect of A, the $\beta$s for the main effect of B, and the $(\alpha\beta)$s for the interaction. The difference in the previous examples is picking up on some effect of B in addition to the focal effect on A; that is, we would say that the effect of A is biased by that of B. We need to adjust the effect of A to remove this contamination of Factor B. Other ways of stating this goal include the following: We want our effects to be orthogonal (i.e., the tests of one factor should be statistically independent of the other effects), we wish to examine the effects having been adjusted for the other effects, or we want the other effects to have been partialled out.
Thankfully, as nasty as the problem is, the solution is truly simple (and it is endorsed with fairly clear consensus across statisticians). The major statistical computing packages make available more than one type of sums of squares computation, and it is simply a matter of reporting the statistics (SSs, MSs, Fs, and p values) associated with the proper type of computation that is testing the right set of hypotheses.
In SAS, the analyst should use the Procedure General Linear Model (PROC GLM; PROC ANOVA assumes perfectly balanced data). Within GLM, SAS computes four types of sums of squares and, by default, reports Types 1 and 3. It is the set of results listed under Type 3 that should be reported and interpreted. (It is actually good practice to obtain only the Type 3 sums of squares from SAS, so as not to get confused, by including the option SS3 on the model statement; e.g., "model depvar = a b a*b / ss3;".)
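Put together, a complete run on an unbalanced design would look like the following minimal sketch (the data set name expt is a placeholder, not from the original text):

  proc glm data=expt;
    class a b;
    model depvar = a b a*b / ss3;   /* report Type 3 sums of squares only */
  run; quit;

The Type 3 tests are the ones corresponding to the properly adjusted (partialled) hypotheses derived earlier.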
Thanks go to my colleague, Angela Lee, who demonstrated to me that a current version of Windows PC Statistical Package for the Social Sciences (SPSS) has a menu from which to select one of the four types of sums of squares, and the proper Type 3 is the correct default. However, check the documentation or help function, because depending on whether one is using a PC or mainframe-network package version of SPSS, and depending on which version is installed, the defaults vary and the types of sums of squares have different labels. If the type of sums of squares is not labeled "3," use those labeled "unique," "partial," or "regression," but not "sequential." In some versions of SPSS, the default for the MANOVA procedure is correct, but the default for the ANOVA procedure is not. (Be choosy with statistical packages; be wary of purchasing one that is user friendly with neat graphics but whose documentation does not even address the issue of unbalanced designs, probably signaling that the programmers were unaware of the statistical issues. How user friendly is a package that gives the wrong results? "Hop." That was the sound of me getting off my soapbox.)
In conclusion, aim for balance; the mechanical effort of obtaining more participants will pay off statistically (in terms of power and robustness), but if the resulting data are not perfectly balanced, use SAS Type 3 sums of squares (or Type 3 in SPSS) to analyze them.
REFERENCES
Algina, James, & Oshima, Takako C. (1990). Robustness of the independent samples Hotelling's T2 to variance-covariance heteroscedasticity when sample sizes are unequal and in small ratios. Psychological Bulletin, 108, 308-313.
Efron, Bradley. (1994). Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 89, 463-479.
Hakstian, A. Ralph, Roed, J. Christian, & Lind, John C. (1979). Two-sample T2 procedure and the assumption of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Iacobucci, Dawn. (1995). The analysis of variance for unbalanced data. In David W. Stewart & Naufel J. Vilcassim (Eds.), Marketing theory and applications (Vol. 6, pp. 337-343). Chicago: American Marketing Association.
Little, Roderick J. A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87, 1227-1237.
Little, Roderick J. A., & Rubin, Donald B. (1987). Statistical analysis with missing data. New York: Wiley.
Milligan, Glenn W., Wong, Danny S., & Thompson, Paul A. (1987). Robustness properties of nonorthogonal analysis of variance. Psychological Bulletin, 101, 464-470.
Perreault, William D., & Darden, William R. (1975). Unequal cell sizes in marketing experiments. Journal of Marketing Research, 12, 333-342.
Sawilowsky, Shlomo S., & Blair, R. Clifford. (1992). A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111, 352-360.
Searle, Shayle. (1987). Linear models for unbalanced data. New York: Wiley.
I.F. DIRECTIONAL PREDICTIONS IN MEAN DIFFERENCES

When a directional prediction exists for an interaction in a 2 x 2 ANOVA, is it appropriate to divide in half the significance of the F or is it necessary to run a specific contrast?
Professor Joseph Verducci
Ohio State University
Yes. In the 2 x 2 case, the F test for an interaction is just the square of a t test, so it is okay to test for a positive interaction at the a = .05 level by rejecting only if the estimated interaction effect is positive and the p value for the associated F test is less than .10.
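To spell out the arithmetic behind this answer (a restatement, not part of the original reply): with one numerator degree of freedom, $F = t^2$, so the directional test rejects when the interaction estimate has the predicted sign and $t > t_{.95}(df)$. Because the two tails of the t distribution map onto the single upper tail of the F distribution, this is equivalent to requiring the reported (two-sided) p value of the F test to fall below $2 \times .05 = .10$.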
Editor: Okay, but why? In a 2 x 2, if either main effect is significant, once you have the means, you know immediately which is larger or smaller than the other. Theoretically, it is the rare circumstance in which a researcher could argue persuasively that a hypothesis was truly one-tailed; if one expects A1 > A2, is it truly of no interest if instead A2 > A1? (It might be of great interest at least because evidently the a priori guess was inaccurate.) Statistically, if an effect is significant at a two-tailed level (.025), the test statistic will exceed the critical value at the one-tailed level (.05), so one-tail tests usually strike me as a move of desperation by the researcher. Finally, in testing simple effects (follow-up testing on interactions), it is not clear what a directional test would look like.
I.G. SHOULD I TREAT A PARTICIPANT ATTRIBUTE (E.G., GENDER) AS A BLOCKING FACTOR?: PART 1

One question that sorely needs to be addressed is about a common misconception that I have encountered when reviewing for journals and serving on dissertation committees. Authors will often say something like, "A two-factor factorial design was used, with X levels of the first factor and two levels of the gender factor." Then, the authors will proceed to treat the experiment and analysis as a between-subjects, two-factor analysis design, instead of as a one-factor randomized block design. Often, the theory section has a prediction of an interaction, and they perform the test as such, which is clearly incorrect. I would like to see this matter be given attention. (Other things that are commonly done are to treat repeated measures as fixed effects; to discount the relevance of aberrations in structural equation models, such as Heywood cases; to not use log-linear analysis in an appropriate manner; or to simply apply a probit model without explicitly testing for interactions among the predictors.)
Professor Joan Meyers-Levy
University of Minnesota
Assuming that authors who are making statements about two-factor factorial designs like the one you mentioned are studying theoretical issues like those that may pertain to gender differences, I see no problem in treating the analysis as a between-subjects, two-factor design rather than a one-factor randomized block design. Given that predictions have been made involving gender and the other factor, gender is not a factor that is introducing undesirable error variance that could obscure the observation of some other more focal (main) effect. Rather, any effects involving gender are an object of study and interest in their own right, as the effect of the first factor is indeed anticipated to vary as a function of gender. Although Keppel (1973, p. 504; cf. Keppel, 1991) noted that the use of randomized blocks generally "results in an experiment that is more sensitive than the corresponding experiment without blocks" because the error term "reflects the variability of participants from populations in which variation in the blocking factor is greatly restricted," controlling the variance associated with gender does not serve an essential purpose here because the effects associated with all factors (including gender) are equally the focus of interest. In fact, one may argue that if the anticipated effect (e.g., an interaction involving gender) emerges despite the fact that the data have been analyzed as a two-factor design with the variance associated with gender allowed to vary unchecked, the observed effect could be viewed as all the more robust.
REFERENCES
Keppel, Geoffrey. (1973). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.
Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Professor James Barnes
University of Mississippi
When treatment levels are combined such that each level of Factor A appears once with each level of Factor B, the treatments are said to be completely crossed, a characteristic of all factorial designs. Participants are then randomly assigned to one of the treatment combinations (Hicks, 1993; Kirk, 1995).
In the situation that you describe, participants cannot be randomly assigned to the different levels of gender (excluding surgical changes).

A randomized block design allows for differences (gender) existing before the experiment to be evaluated. A randomized block design enables the researcher to test two null hypotheses; namely, that treatment population means are equal and that block population means are equal (Kirk, 1995).

The presence of block-treatment interaction effects alters the expected values of the mean squares, producing a nonadditive model (see Table 7.3-1 in Hicks, 1993, p. 267). Furthermore, the research problem that you describe is either a fixed-effects model (Factor A and block fixed) or a mixed-effects model (Factor A random and block fixed). Thus, researchers must ensure that the appropriate mean sum of squares is being used for the particular situation that is being examined.

A repeated measures design (each participant is observed under all, or a subset, of the conditions in the experiment) is a special case of the randomized block design. That is, each participant serves as his or her own blocking variable (Hicks, 1993; Kirk, 1995).
The general situations in which models of discrete DVs are relevant are cases in which the researcher is interested in choice behavior or the occurrence or nonoccurrence of an event. One is usually not interested in estimating the value or size of the DV but in analyzing the underlying probability of the given event or choice. Logit and probit have been used extensively in this situation. The two models require different assumptions concerning the underlying probability distribution function of the DV. As the name suggests, logit assumes the logistic function as the underlying distribution. Probit, on the other hand, assumes that the underlying probability distribution function is normal.
REFERENCES
Hicks, Charles R. (1993). Fundamental concepts in the design of experiments (4th ed.). New York: Holt, Rinehart, & Winston.
Kirk, Roger E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Editor: Hmm. Well, it is true that gender is not typically randomly assigned. However, all the levels of both factors appear in all possible combinations; that is, the factors are crossed. It is also true that blocking is usually done to mechanically control for an extraneous factor, whereas in the question posed, gender is a theoretically focal factor. Therefore, perhaps a different tack is to determine when it actually matters statistically. There are two issues to address: First, are the factors random or fixed? Second, does one fit the full or reduced block ANOVA model?
Factors that characterize participants are usually thought to be random, but as Barnes notes here, gender would be considered fixed, because the entire population of levels of gender is captured with the two; the researcher has access to the entire population of levels of that factor. (If one's subject factor or potential blocking factor were something like citizenship or country of origin, and representatives truly were sampled randomly, one could make a case for the factor being random, with the possibility of generalizing to the participant population.) The F statistics for the main effects for A, B, and the A x B interaction in a 2 x 2 factorial (following Hicks, 1993, etc.) would be:
1. If the subject factor, B, is fixed, as here,
   (a) and if Factor A were fixed, the usual case, F = MS_A / MS_S/AB, F = MS_B / MS_S/AB, and F = MS_AxB / MS_S/AB;
   (b) if Factor A were random, F = MS_A / MS_S/AB, F = MS_B / MS_AxB, and F = MS_AxB / MS_S/AB.
2. To be complete, if the subject Factor B were random,
   (a) and Factor A fixed, F = MS_A / MS_AxB, F = MS_B / MS_S/AB, and F = MS_AxB / MS_S/AB;
   (b) or Factor A also random, F = MS_A / MS_AxB, F = MS_B / MS_AxB, and F = MS_AxB / MS_S/AB.
It is rare that researchers in our field make claims that they have selected the levels of their factors randomly. Thus, cases 1(b) and 2(b) are less prevalent. If we keep our focus on the question's subject variable of gender, then Factor B will be considered fixed also; that is, scenario 1(a) is relevant, not 2(a). The F statistics in circumstance 1(a) are the usual Fs. The tests for B and A x B are the same. The tests for A differ; furthermore, 1(a) will be more powerful than 2(a).
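If a factor truly were random, one need not assemble these ratios by hand. A minimal SAS sketch (the data set and variable names are hypothetical) declares the random terms and requests tests built from the expected mean squares:

  proc glm data=study;
    class a b;
    model y = a b a*b;
    random b a*b / test;   /* F tests synthesized from the EMS when B is random */
  run; quit;

In scenario 1(a), with both factors fixed, the default tests against MS_S/AB are already the appropriate ones.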
Another issue arises in blocking designs regarding whether one should be fitting a full or reduced ANOVA model. If the blocking factor is a nuisance variable, one that is expected to yield differences on the DV yet is not of focal interest, one groups respondents into homogeneous blocks and then distributes those respondents within each block across the experimental conditions. For example, if the DV of interest is memory (recall or something), a potential blocking factor may be intelligence. (Smarter people are likely to recall more, regardless of the experimental treatment, but this finding would not be a great theoretical insight.) Blocks would be formed in which similarly intelligent people (as measured by some indicator variables) would be distributed across levels of the experimental factor: a portion of bright respondents would go into Levels a1, a2, and so on. A portion of moderately bright respondents would similarly be assigned to Levels a1, a2, and so on. A portion of the low wattage group similarly would be assigned across Levels a1, a2, and so on. At this point, the nuisance variable of intelligence is approximately equally represented across the groups, with the important implication that the A(factor) x B(intelligence) interaction is assumed to be zero. Highly intelligent people would be expected to remember more, regardless of the experimental treatment to which they were exposed; the same applies for the other groups. Therefore, in condition a1, there will be a group of highly intelligent people who recalled the most, a group of less intelligent people who recalled the next most, and so on; and these comparative results would be true for a2, and so on. The question regarding the effectiveness of the experimental intervention is simply whether the overall performance (of all three groups) was better in a1, or a2, and so on, regardless of the individual differences caused by intelligence, which hopefully have been made uniform across the conditions.
This scenario is not that which describes the gender research in our journals. In gender research, the question is in fact whether men and women (or boys and girls) will react differently in different treatment conditions; that is, the assumption is not made that men (or women) will behave the same under all conditions. Perhaps men or women will be more or less sensitive or affected by exposures to stimuli of different sorts. Thus, the theoretical questions revolve around the interaction between the Factor A and the subject property B. Why does this matter?
If one believes that the blocking has created an orthogonal setup for which an interaction has no logical meaning, one would expect that by definition, SS_AxB = 0. Accordingly, it is suggested that one fit a reduced model. A full model includes all terms, A, B, and A x B; a reduced model would include only terms A and B (e.g., in SAS with a statement: model y = a b;). Note that doing so means the A x B interaction cannot be tested because the model will not produce an estimate for its SS. The assumed zero or negligible SS_AxB term is aggregated into the SS_S/AB term, the old error term, resulting in a new error term, SS_"E" = SS_S/AB + SS_AxB, with degrees of freedom df_E = df_S/AB + df_AxB = ab(n - 1) + (a - 1)(b - 1). The F tests would be F = MS_A / MS_E and F = MS_B / MS_E, for all possibilities of A and B being fixed or random.
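As a sketch of the two modeling choices in SAS (the data set and variable names are hypothetical):

  proc glm data=study;    /* full model: A x B is estimable and testable */
    class a b;
    model y = a b a*b;
  run; quit;

  proc glm data=study;    /* reduced (blocking) model: A x B pooled into the error term */
    class a b;
    model y = a b;
  run; quit;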
In conclusion, the first resolution must be a determination of the theoretical intent of the research. If one is trying simply to control for gender, the blocking approach makes sense and a reduced model can be defended (and the interaction not tested). (Alternatively, one could run an analysis of covariance, or ANCOVA, with gender introduced as a dummy variable, discussed elsewhere in this special issue.) If one is trying to document differences between men and women in their reactions to experimental stimuli, a different philosophical approach is embraced, the full model is fit to allow for the testing of the interaction, and then one must simply verify the status (random or fixed) of the experimental factor to choose between the F tests in 1(a) or 1(b). If gender is not the subject variable, one might instead be choosing between 2(a) and 2(b). A related question follows.
I.H. SHOULD I TREAT A PARTICIPANT ATTRIBUTE (E.G., GENDER) AS A BLOCKING FACTOR?: PART 2

If I measure an IV, such as participant gender, and expect an interaction between this IV and another, manipulated IV, would this still be considered a blocking variable? My understanding is that blocking is intended to remove error, to control for the effects of an extraneous source of variance possibly not handled through randomization. But, if the blocked variable exhibits theoretically interesting differential effects on the other IV, would it not make sense to treat it as another IV? Cook and Campbell (1979) said I can do this, but I have had a reviewer insist that mixing IVs and blocking variables is not "clean," and I should focus on main effects. What is really the difference between blocking variables and measured variables?
REFERENCE
Cook, Thomas D., & Campbell, Donald T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Professor Joan Meyers-Levy
University of Minnesota
I agree with your view that if one draws on a theory that suggests gender should interact with some manipulated IV, gender need not be treated as a blocking variable. Keppel (1973) noted that, consistent with this view, "blocking may be thought of as a control factor, where its introduction is motivated by a desire to reduce error variance rather than to study the effect of the subject factor per se" (p. 509). In such a situation, the use of randomized blocks generally "results in an experiment that is more sensitive than the corresponding experiment without blocks" because the error term "reflects the variability of participants from populations in which variation in the blocking factor is greatly restricted" (p. 504). Yet, Keppel (1973) also noted that a randomized block design may be used when factors such as gender are "the object of study" and the design is "not merely a means to reduce error variance" (p. 509). Using a randomized block design in this case, however, seems to be appropriate when there is a desire to assess the generalizability of some other effect across certain classifications of situations or individuals (e.g., gender). Nonetheless, if as in the case you outline, a study is intended to test a theory that specifically predicts that a subject factor like gender and some other factor interact, such a study does not constitute simply an assessment of whether some particular effect generalizes across gender. As such, I see no reason why the study must or necessarily should be treated as a randomized block design.
REFERENCE
Keppel, Geoffrey. (1973). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.
Professor James Barnes
University of Mississippi
A randomized block design allows for differences, such as gender, existing before the experiment to be evaluated. In the situation that you describe, participants cannot be randomly assigned to the different levels of gender, so blocking is still necessary even with an expected gender-factor interaction. However, the presence of block-treatment interaction effects alters the expected values of the mean squares, producing a nonadditive model (see Table 7.3-1 in Hicks, 1993, p. 267). Furthermore, the research problem that you describe is either a fixed-effects model (Factor A and block fixed) or a mixed-effects model (Factor A random and block fixed). Thus, you need to be sure that the appropriate mean sums of squares are being used for statistical testing. The reviewers' comments concerning "clean" results most likely result from the inability to interpret main effects when interactions are present (Hicks, 1993).

When treatment levels can be combined such that each level of Factor A appears once with each level of Factor B and participants can be randomly assigned to one of the treatment combinations, then a factorial design would be appropriate (Hicks, 1993; Kirk, 1995). If randomization can be accomplished, then Cook and Campbell (1979) are correct, and either a blocking or a factorial design should produce similar results.
REFERENCES
Hicks, Charles R. (1993). Fundamental concepts in the design of experiments (4th ed.). New York: Holt, Rinehart, & Winston.
Kirk, Roger E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.
I.I. HOW TO ANALYZE DATA FROM A NESTED DESIGN

I often run experiments in which a construct is operationalized by choosing two or more types of advertisements or brands, and a number of real ads or brands are nested within type (T). A typical design would look like this:

                    Type 1                          Type 2
        Brand 1   Brand 2   Brand 3     Brand 4   Brand 5   Brand 6
  A1    S1-S20    S1-S20    S1-S20      S1-S20    S1-S20    S1-S20
  A2    S21-S40   S21-S40   S21-S40     S21-S40   S21-S40   S21-S40
I am interested in the A x T interaction. I always run into problems when analyzing this design. My questions are the following:

1. How do I decide whether it is better to make type a within-subjects factor (as shown earlier) or a between-subjects factor?
2. Suppose I have the choice between adding more brands or adding more participants per cell. How do I make that trade-off?
3. I realize that I should have as many brands as possible within each type, but how do I decide how many are enough? Does it matter whether type is within-subjects or between-subjects?
4. Usually, I want to add a Latin square to balance out order effects. How do I do this, and how should I specify all my effects afterwards? Maybe I should not do this at all.
5. Working with real brands or ads, I once had an unequal number of brands or ads in each level of type: for example, three brands of Type 1 and five brands of Type 2. I did not succeed in programming the analysis in SAS. Is this due to SAS or is there a real problem? (Note that SAS problems are real problems for me; I am not a statistician.) Can I only do this kind of analysis if I have equal numbers?
6. What if my DV is categorical?
7. My psychologist friends tell me I should not bother with trying to model the brand effect and just average across brands; that is, compute an average for each cell in the A x T design and analyze the 2 x 2 interaction. I know that this is wrong, but how wrong is it? What are the consequences for Type I and Type II error?
8. What about the selection of brands? Ideally, I assume, both samples of brands are random from an infinitely large sample from each type. I treat brands as a random factor, because I want to generalize beyond the chosen brands.
   (a) However, in reality, I often do this experiment with brands from a single product category (e.g., domestic brands and imported brands of cars). This is not an infinitely large population. What if my sample exhausts the population? What if I know the sample is large relative to the population, but I do not exhaust it (say I include half of the existing brands)? What if I know the sample must be large relative to the population, but I do not know exactly how large?
   (b) If I am honest, I know that my samples of brands or ads are not really random; I take what I can get. Does this change anything about the inferences I can make? What exactly?
Professor Joan Meyers-Levy
University of Minnesota
1. In most cases, it is likely to be better to make type a between-subjects factor, because doing so eliminates artifactual influences (e.g., demand effects, carry-over effects, reactivity) that can emerge from treating it as a within-subjects variable. At the same time, exceptions may occur if the anticipated effects are likely to be very subtle and jeopardized by the heightened variance introduced by treating type as a between-subjects factor. In this case, and it will be a judgment call, treating type as a within-subjects factor may be preferable (see Keppel, 1973).
2. Adding more brands would seem advisable to the extent that you anticipate (for theoretical reasons) that different outcomes may emerge for different sorts of brands. Adding more brands also may be advisable if your research is primarily application versus theory oriented. For more discussion of this issue, see Calder, Phillips, and Tybout (1981). On the other hand, if multiple brands are being included simply to offer some evidence for the generalizability of the effects of type and A, I do not see all that much value in greatly increasing the number of brands; after all, the observation that the anticipated effects were robust for n brands can never rule out the possibility that they will not generalize for an n + 1 brand.

3. The decision of how many brands is enough depends largely on whether your research is theory or application focused (cf. Calder et al., 1981). If your research is primarily theoretical, including more brands that are likely only to show more and more evidence of generalizability does not seem to be all that worthwhile. Again, as noted earlier, assembling a huge number of replications of an effect does not add much theoretically, and it can never eliminate the possibility that the focal effect may not replicate in yet some other instance.
REFERENCES
Calder, Bobby J., Phillips, Lynn W., & Tybout, Alice M. (1981). Designing research for application. Journal of Consumer Research, 8, 197-207.
Keppel, Geoffrey. (1973). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.
Professor James Barnes
University of Mississippi
The author of this question poses an interesting problem dealing with hierarchical or nested factorial designs and repeated measures. Such a design could result from a study of cooperative advertising where a retailer features several different brands in a single advertisement.

Kirk (1995, p. 476) provided an excellent graphical example of the difference between hierarchical and crossed designs. In a nested or hierarchical design (T(A)), T1 and T2 appear only with A1, and as a result, there can be no A x T interaction (see the following schematic).
  A1: T1, T2        A2: T3, T4        (T nested within A)
In a crossed design, each level of T appears once and only once with each level of Treatment A, and an A x T interaction can be estimated (Kirk, 1995). Winer (1971, p. 540) considered two cases of three-factor studies with repeated measures. Case 1 has two of the three (fixed) factors crossing (or repeated on) all subjects. The model for Case 1 would be

$Y_{ijkm} = \mu + A_i + S_{j(i)} + T_k + AT_{ik} + TS_{kj(i)} + B_m + AB_{im} + BS_{mj(i)} + TB_{km} + ATB_{ikm} + TBS_{mkj(i)}$

where $A_i$ and $S_{j(i)}$ are the only between-subjects effects: $A_i$ = advertisement, i = 1, 2; $S_{j(i)}$ = subject, j = 1, 2, ..., 20 for all i; $T_k$ = type, k = 1, 2; and $B_m$ = brand, m = 1, 2, ..., 6.
In Case 2, subjects are nested within two (fixed) factors and the third factor crosses (or is repeated on) all subjects. The model for Case 2 is

$Y_{ijkm} = \mu + A_i + T_k + AT_{ik} + S_{j(ik)}$ (between-subjects terms) $+\, B_m + AB_{im} + TB_{km} + ATB_{ikm} + BS_{mj(ik)}$ (within-subjects terms)
In Case 1, type is a within-subjects factor, whereas in Case 2, it is a between-subjects factor. To answer your first question, one needs to determine whether type is to be a crossed or a nested factor. Your design appears to have type crossed with subjects and Factor A.
In my experimental design class, I always stress to my students that they develop an EMS table when first formulating an experimental project. An EMS table provides a deeper understanding of the experiment, shows which tests can be conducted, and reveals areas of potential problems before making a heavy investment in time and resources. Detailed procedures for creating an EMS table are provided by Hicks (1993, pp. 160-163) and Kirk (1995, pp. 86-95). If I am correctly reading your example, your model is analogous to Case 1 because type and brand are repeated on all subjects, but subjects are nested within Factor A (e.g., advertisement type), and brand is nested in type. The following is an EMS table determined by this design and the sample sizes that you provide. (I am assuming that subjects and brands are random.)
[EMS table, reconstructed in outline; parts of the original table did not survive reproduction. Columns: the subscript ranges and fixed/random status of each factor (A: i = 1, 2, fixed; Subjects: j = 1, ..., 20, random; Type: k = 1, 2, fixed; Brand: m = 1, 2, 3, random), the expected mean square for each source, and the available F test. Rows (sources): A_i, S_j(i), T_k, B_m(k), AT_ik, AB_im(k), TS_kj(i), and BS_m(k)j(i). The recoverable entries show the brand terms tested against the Brand x Subject mean square (MS_BS), and "no test" for the A and type main effects and the A x T interaction.]
Direct tests are indicated in the previous EMS table. For example, the test of the null hypothesis concerning brand is against the Brand x Subject interaction. Notice that there are no direct tests for the A or type main effects, nor their A x T interaction. It is possible to compute a pseudo F test (Hicks, 1993, pp. 167-169) by combining and subtracting other EMS components. I am not aware of any software packages that compute pseudo F tests. However, because most statistical packages output mean square values, manually constructing such a test is not very difficult.
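As a hedged illustration of the general recipe (following the logic of Hicks's treatment, not a formula printed in this exchange): one assembles sums of mean squares whose expectations differ only by the effect under test, for example $F' = (MS_1 + MS_2)/(MS_3 + MS_4)$, and approximates the degrees of freedom of each composite by Satterthwaite's formula, e.g., $df \approx (MS_1 + MS_2)^2 / (MS_1^2/df_1 + MS_2^2/df_2)$ for the numerator. Both quantities can be computed in a few lines from the mean squares and dfs printed in any ANOVA output.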
An EMS table is also invaluable when deciding where additional resources should be allocated. Consider a table of critical values for the F distribution. For small degrees of freedom, critical values are quite large. For example, F(1, 1; .95) = 161, whereas F(2, 2; .95) = 19.0. The implication is that for small degrees of freedom, a larger effect size is necessary to be able to reject the null hypothesis. Because much of what one studies in consumer psychology consists of small effect sizes, small degrees of freedom make the research task more difficult. At an alpha level of 0.05, the critical value of F for the test of the A x Brand interaction null hypothesis (on k(i - 1)(m - 1) = 4 and (m - 1)k(j - 1)i = 152 df) is F(4, 152) = 2.43. If all other things are equal, increasing the number of types by one (F(6, 228)) produces a critical value of 2.14. In comparison, increasing the levels of Factor A by one (F(8, 228)) reduces the critical value to 1.98. Even though these differences can be slight, it could mean the difference between rejecting and not rejecting the null hypothesis for a small effect size. (However, practical issues are also a concern. Note that increasing the levels of Factor A to achieve the slightly more powerful, lower critical value would require running more subjects because Factor A is a between-subjects factor, whereas the drop in critical values from 2.43 to 2.14 requires an additional type category, but no more subjects.) Thus, the answer to your question of whether to add more brands or more subjects or whatever should be based on an assessment of the effect size you wish to be able to detect relative to the critical F value.
A Graeco-Latin square can be a useful design if one is dealing with three factors and each of those factors has the same number of treatment levels (Hicks, 1993). The smallest such design would be two levels of ad, two levels of type, and two levels of the third factor. Increasing the number of levels of any factor requires increasing the others. The Graeco-Latin square is like a cube in that it is the same size in all three dimensions. Although this design deals very well with the problem of order effects, there are several constraints that one must be willing to accept. A major restriction is that a factor level can appear once and only once in each row, column, and depth of the cube. Finding naturally occurring ads, brands, and so on may be a very difficult task. Because of the design, it is not possible to recover any interactions between factors; only main effects can be examined. In addition, missing values pose a greater problem because each combination of factor levels occurs only once in the design. Taking additional replications can be used to offset missing cells, and there are some procedures for estimating missing values (Hicks, 1993).
ANOVA is based on three assumptions: (a) observations are normally distributed on the DV in each group, (b) the population variances of the groups are homogeneous, and (c) observations are independent. Stevens (1992, pp. 240-241) provided a succinct summary in table form describing the consequences of violating each of these assumptions. Unequal ns have the greatest impact on the actual alpha level if the assumption violation results from heterogeneous variances. If the ns are equal, there is only a slight effect on the actual alpha level. Your psychologist friends' suggestion of computing an average for each cell and then proceeding as if there is a single observation in each cell is often suggested as a method for dealing with unequal ns. Provided your effect size is sufficient to be detected by an F test with single degrees of freedom, there should be no statistical differences from this approach.
The SAS program will test all effects against the error term if one exists in the model. Therefore, the program must be instructed as to which tests are appropriate (yet another use for the EMS table). This instruction is given by the "test" command, where hypothesis (H) equals the term to be tested, and error (E) equals the proper divisor for testing. I do not profess to be an expert SAS programmer, but I think the instructions would look something like this (PROC GLM; CLASS A T B; MODEL Y = A|T(S)|B(S); TEST H = T E = T*B;) to test Type against the Type x Brand interaction.
A categorical DV turns the analysis into a log-linear model. Although there is insufficient space to discuss details of log-linear analysis, I will highlight some differences between the approaches. Recall that in ANOVA, one of the basic assumptions is the normal distribution. In log-linear models, the appropriate distribution is the multinomial. In log-linear analysis, one fits a series of models to the data, whereas in ANOVA, we generally think of fitting a model to the data. As a result, we need to reverse our thinking about tests of significance. In ANOVA, we usually want the statistic to be significant, whereas in log-linear analysis, a test that is not significant is good in the sense that the given model fits the data. Stevens (1992) devoted a chapter in his text to log-linear analysis in the multivariate case.
Finally, turning to your last question concerning the brand factor as a fixed variable, it would be useful to return to the EMS table, change the status of brand from random to fixed, and recompute the EMS. Changing brand to a fixed factor would create different statistical tests, with different degrees of freedom.
REFERENCES
Hicks, Charles R. (1993). Fundamental concepts in the design of experiments (4th ed.). New York: Holt, Rinehart, & Winston.
Kirk, Roger E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Stevens, James. (1992). Applied multivariate statistics for the social sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Editor: Special note to the Northwestern alumni of my ANOVA PhD seminar: See? I was not (merely) being a sadistic professor when I tortured you with EMS.
I.J. HOW TO ANALYZE DATA FROM A "YOKED" DESIGN
I am interested in the effects of attentional self-control; that is, the opportunity of a consumer to control the speed of information processing or the rate at which information items are processed (e.g., Internet vs. television). I want to be able to assess the effect of being in control independently of the actual speed at which information is processed.

A long time ago, in an undergraduate psychology class, we read an old article on the relation between behavioral control and stress (I forgot who the author was) in which monkeys were delivered shocks (this was before animal rights existed). They were tested in pairs. One monkey had the opportunity to control the shocks (they were preceded by an auditory signal, and the monkey could press a button to avoid some of them). The other monkey received exactly the same shocks at the same times but was not able to control them. The difficulty of the control task was manipulated such that the control monkey did not succeed in avoiding all the shocks. If a shock was not avoided, it was delivered to both monkeys. The actual number of shocks varied from pair to pair. The DV was the formation of ulcers; paired monkeys were compared on ulcer formation. Monkeys with control developed many fewer ulcers than monkeys not in control. This design was called a yoked design. At the time we did not focus on the data analysis, but this problem seems methodologically similar to the research problem I introduced earlier.
Therefore, the question is how to design and analyze experiments in which respondents are randomly assigned to pairs (i.e., in a yoked design). Both members of a pair receive exactly the same treatment, except for the opportunity to exert control. There is variability between pairs, though, in the actual situation to which they are exposed. In the proposed scenario, both members of the pair receive exactly the same information at the same rate about a product, but only one is able to control the rate at which the information is delivered. There may be differences between pairs in the amount of information that is delivered.

How are data like these analyzed? What if the main IV (control) is crossed with another between-subjects variable? What if I want to test the interaction of the control manipulation with an individual difference variable?
Editor: It is very important not to scramble the analysis of yoked experiments, but rather to conduct the analysis "eggxactly" right. Thankfully, the Journal of Consumer Psychology (JCP) reader is a hard-boiled researcher, so this discussion should go over easy. Okay, enough yokes about yolks.
One can envision your lab setup in which one person would be making mouse clicks to investigate some consumer purchase, while another person would be seated passively watching a networked screen across which the same images flew, with no opportunity to send input to the system. The two conditions would be labeled "control" and "no control," and the dyad is paired due to their shared information, which may differ somewhat across yoked pairs.
The data from this yoked design would be analyzed via a straightforward two-group matched t test, which is the simplest version of a repeated measures model. Matched t tests are easily envisioned when we hear of classic studies on twins (i.e., put one twin in one condition, the other twin in the other condition, and their data are linked given their special relationship). Sometimes the criterion of matching is self-selected, like when husband-wife pairs are distributed across counseling sessions. (And, cohort analysis is essentially the linking of large groups of people who are approximately the same age, who therefore would have experienced similar cultural phenomena; cf. Menard, 1991.) In this arrangement, the matching, or yoking, has been imposed by the experimenter.
For any of these scenarios, the data would look like this:

  yoked pair    score for person     score for person
                in Condition 1       in Condition 2
  1             X11                  X21
  2             X12                  X22
  3             X13                  X23
  ...           ...                  ...
  n             X1n                  X2n
When matched t tests are presented in introductory statistics books (e.g., Hays, 1988, pp. 313-315; Mann, 1995, pp. 494, 521-528; Weinberg & Goldberg, 1979, pp. 310-316), the examples illustrate explicitly how the performance of people in the first condition is compared to that of the matched persons in the second, literally by taking difference scores:

  yoked pair    score for    score for    difference
                Person 1     Person 2     score
  1             X11          X21          d1 = X11 - X21
  2             X12          X22          d2 = X12 - X22
  3             X13          X23          d3 = X13 - X23
  ...           ...          ...          ...
  n             X1n          X2n          dn = X1n - X2n
A one-sample t test (with n - 1 df) is computed on the difference scores to test $H_0: \mu_d = 0$:

$t = \frac{\bar{d}}{s_d / \sqrt{n}}$

where $\bar{d}$ is the mean of the ds and $s_d$ is their standard deviation.
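In practice, the packages handle this directly. For example, a minimal SAS sketch (assuming a data set with one record per yoked pair; the data set and variable names are hypothetical):

  proc ttest data=yoked;
    paired control*nocontrol;   /* one-sample t test on the pair differences */
  run;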
As the design becomes more complex (adding conditions or factors, etc.), the matched t test becomes a repeated measures design (or mixed design if there are also between-subjects factors), and one proceeds as usual for a within-subjects data analysis. (Those designs are applicable whether we are comparing one person in multiple conditions or structurally linked persons in multiple conditions.) The computer model uses the instructions regarding which participants are matched with which others and conducts the analysis accordingly; the user need not actually compute difference scores.
REFERENCES
Hays, William L. (1988). Statistics (4th ed.). New York: Holt, Rinehart & Winston.
Mann, Prem S. (1995). Introductory statistics (2nd ed.). New York: Wiley.
Menard, Scott. (1991). Longitudinal research. Newbury Park, CA: Sage.
Weinberg, Sharon L., & Goldberg, Kenneth P. (1979). Basic statistics for education and the behavioral sciences. Boston: Houghton Mifflin.
I.K. DEALING WITH HIGHER ORDER INTERACTIONS IN REPEATED MEASURES DESIGNS
How should one deal with statistically significant high-order interactions in multifactor repeated measures designs? I often use scenarios as experimental stimuli in the context of my research. These scenarios vary according to several characteristics. Participants generally react to a subset or all of the scenarios, so I use repeated measures ANOVA for statistical analysis. For instance, in a recent study on sponsorship, I varied four factors in a 2 x 2 x 2 x 2 factorial design. It is not rare that I use more complex designs. The problem I often have when I proceed to analyze the data is observing several statistically significant high-order interactions, because, as you know, repeated measures designs are more sensitive. Now, I understand that significant interactions have to be interpreted. However, you must admit that the interpretation of a quadruple interaction is not an easy thing to do. In addition, the variance explained by such interactions is often small.

What should be done? Concentrate on the interpretation of significant main effects only? Concentrate on those interactions that are more important? How do you judge the importance of an effect (apart from looking at the range of mean differences and omega squared statistics, which are problematic in the context of repeated measures ANOVA)?

I find this problem coming back every time I do this type of experimental research, particularly when sample size is large (which ought to be a good thing). I think most of the time the interactions are not worth looking at, but what should I say in reporting the results? There are so many reviewers out there eager to find the little bug in an article who will notice the problem quickly if you try to evade it. What is most frustrating is the fact that increasing sample size is likely to lead to this type of situation. One is tempted to reduce sample size to avoid these problems.
A related question from a different researcher follows.

Although my major research interests are main effects or two-way interaction effects, my results showed a three-way interaction (which is not really important to my research objective). In such a case, is the interpretation of the main effect or the two-way interaction effect still meaningful, and should we make any conclusions based on the main effects or two-way interaction effect?
Professor Scott Maxwell
Notre Dame
It is important to realize that a main effect represents differences in marginal means averaging over all other factors in the design. As such, the main effect is an average (which may be either unweighted or weighted) of relevant simple effects (e.g., Maxwell & Delaney, 1990, p. 244; also see Cole, Maxwell, Arvey, & Salas, 1994; Maxwell, 1992; Maxwell & Delaney, 1990). A statistically significant interaction implies that these various simple effects are not all equal to one another. Even so, it is still legitimate to regard the main effect as a correct statement regarding average effects. However, when an interaction is present, the individual components of this average are known (probabilistically) to be different from one another. Thus, the real question here is whether it is meaningful to interpret an average effect when the individual effects making up the average are different from one another. This question cannot be answered simply from a statistical perspective, because meaningfulness derives largely from the purpose of the study. However, one aspect of the data that is often important is whether the interaction that qualifies the interpretation of the main effect is ordinal or disordinal (cf. Maxwell & Delaney, 1990, pp. 244-247, 298). When the interaction is ordinal, the simple effects all have the same sign and differ only in magnitude. In a case such as this, it can make sense to interpret an average effect. However, when the interaction is disordinal, the simple effects have different signs, making an average less meaningful. Finally, it should be noted that even in a simple two-way design, the two-way interaction can be ordinal with respect to one factor but disordinal with respect to the other.
REFERENCES
Cole, David A., Maxwell, Scott E., Arvey, Richard D., & Salas, Eduardo. (1994). How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychological Bulletin, 115, 465-474.
Huberty, Carl J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105, 302-308.
Maxwell, Scott E. (1992). Recent developments in MANOVA applications. In Bruce Thompson (Ed.), Advances in social science methodology (pp. 1-40). Greenwich, CT: JAI.
Maxwell, Scott E., & Delaney, Harold D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.
Editor: As Maxwell's answer suggests, the response does not depend on whether the experimental design contains within-subjects factors. As the questioner seems to know, we speak of the interpretation of lower order effects (e.g., main effects, two-way interactions) as being moderated by the higher order interactions (three- and four-way interactions, etc.). I have empathy for this researcher who, seeking reliable data, evidently gathers relatively large samples and then "pays the price" by watching all the terms come up significant. Significance is usually cause for celebration, but it does seem tedious to explain away the significant interactions that do not matter; for example, theoretically they were intended as replications (i.e., the stimuli exposures). I have also seen reviewers criticize the fact that an effect was not perfectly consistent across the stimuli (e.g., demonstrations over purchase categories). To be fair, spotting such patterns of results is surely part of a reviewer's job, because if there could be found consistencies among the inconsistencies, additional contingent factors would yield more complex theorizing. If a researcher's theory largely explains the data, but the inconsistencies are troubling the reviewer, it may be the burden of the reviewer to offer a theory that explains the data better (as parsimoniously and also encompassing the inconsistencies). The questioner's instincts are good, for although it may seem like some index of effect size could prove valuable (because they are independent of sample size), that class of indexes is not without its own problems (cf. Fern & Monroe, 1996). (See also the discussion in this special issue of stimuli as replications in a nested design, Question I.I.)
REFERENCE

Fern, Edward F., & Monroe, Kent B. (1996). Effect-size estimates: Issues and problems in interpretation. Journal of Consumer Research, 23(2), 89-105.
I.L. MULTIPLE STIMULI AND LEARNING STUDIED THROUGH A REPEATED MEASURES DESIGN
We have completed collecting data in a computer interactive experiment, the goal of which was to ascertain how individuals react to outcome feedback in probabilistic environments given externally set expectation levels. To achieve this objective, graduate students made sales estimates given various levels of advertising budgets for seven consecutive periods, receiving outcome feedback after each trial. Experiment participants were randomly assigned to one of four treatment conditions: small-large forecasting errors x attainable-unattainable preset expectation levels. The pattern of errors (over- and underestimating sales) was randomized across participants to mitigate the possibility of cheating, but in all cases, the sum of the errors was zero. After determining the pattern of errors for a participant in the small error condition, the pattern was multiplied by four and presented to an individual in the large error condition. After each decision, outcome feedback was provided; thus, the participants could react to feedback on the second to the seventh trials. The externally set performance expectation levels were embedded, but highlighted, in the task description, which was presented in a case format. Expectations were manipulated so that they could or could not be met.
Numerous hypotheses are being examined. For clarification purposes, we share one. We hypothesize that attainable versus unattainable expectation levels will affect the way managers interpret their success or failure, and that this in turn will affect subsequent decision-making processes. In early iterations solving recurring problems, we expect that when participants cannot meet performance expectations, they will change their behavior. Specifically, we expect them to interpret their failure to meet performance goals as due to their inability to correctly estimate the parameter value in the provided advertising response decision model, rather than the result of inherent unpredictability (noise) in the decision environment. As a result, they will adjust the model's parameter more than will participants who succeeded in meeting performance goals. Thus, if we measure the extent to which participants change the parameter attached to the DV (advertising), cue weights should be revised more in the unattainable expectation condition than in the attainable expectation condition.
To analyze the data, we have used two methodological approaches. The first is to use the MANOVA routine in SPSS where the six trials are the repeated measures and the crossed treatment conditions are the IVs. As noted earlier, the pattern of errors received by each subject was shuffled (though in all cases the sum of the errors was zero). Thus, one subject may have received a small negative error on the first trial and another in the same treatment condition a positive error after the first trial. We did not "back shuffle" the responses so that the order of the errors was the same for each subject prior to analysis. We believe we are correct in not reordering the responses, because one of our goals is to tap the speed of learning over trials. As we understand it, this means we cannot examine learning across treatments at each trial; rather, we must examine the averaged F test by each of the two main effects as well as the interaction. If we are correct thus far, what appears in the printout is the univariate F test by trial, followed by the average F test. Would we just report the latter in a publication? There are no treatment means associated with the latter. How do we know that significant effects are significant in the intended direction?
Given that one of the goals is to examine how quickly people learn to react appropriately in a specific decision-making environment, a more complicated but interesting way to analyze the data is as follows. Assuming we believe one treatment condition will cause participants to learn faster (i.e., the changes to their decision weights will more rapidly decrease toward zero), we fit a regression model to the changes in the cue weights over trials for each individual participant. The DV is the percentage change in cue weight over trials; the IV is the trial number. This is done for each participant. We anticipate that the slopes in the fitted models for those in the "learn faster" condition would be larger, negative numbers. Although the individual regression models are unreliable (there are only six observations), these betas would then be used as DVs in a 2 x 2 full factorial ANOVA model. Does the unreliability at the individual level wash out when analyzed with similar data from numerous individuals? The slopes of the fitted lines do appear to be a wonderful way of capturing the speed with which people learn. We are not aware of anyone using this technique.
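The two-stage plan the questioner describes is straightforward to express in code. The following is only a minimal sketch, not the questioner's actual analysis; the data file and column names (subject, trial, pct_change, error_size, attainability) are hypothetical, and it assumes Python with pandas and statsmodels.

# Two-stage "slopes as DVs" analysis: fit one regression per participant,
# then treat the fitted slopes as the DV in a 2 x 2 full-factorial ANOVA.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Long-format data with hypothetical columns:
#   subject, trial (1-6), pct_change, error_size, attainability
df = pd.read_csv("cue_weights.csv")

# Stage 1: one slope per participant (percentage change in cue weight
# regressed on trial number; only six observations each, hence noisy).
slopes = (
    df.groupby(["subject", "error_size", "attainability"])
      .apply(lambda g: smf.ols("pct_change ~ trial", data=g).fit().params["trial"])
      .reset_index(name="slope")
)

# Stage 2: the slopes become the DV in the 2 x 2 between-subjects ANOVA.
model = smf.ols("slope ~ C(error_size) * C(attainability)", data=slopes).fit()
print(sm.stats.anova_lm(model, typ=2))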
Professor Greg Allenby
Ohio State University
The goal here is to produce a model in which noisy data produces learning and accurate data produces more learning. The researcher knows the extent of noisiness (uncertainty or variance), but the key is to understand the perceived or operational amount of uncertainty. I would study this using a Bayesian model of updating, but instead of knowing the prior and likelihood and then deriving the posterior (as in a standard analysis), I would specify the prior and posterior and then derive the likelihood. The variance parameter in the likelihood reflects the operational level of uncertainty. I would need to know more about the problem to flesh out the exact models that I would use, but the key here is that the measurement must condition on what is observed, and what is observed in this problem is the prior and posterior amount of knowledge.
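For the simplest conjugate case, this specify-the-prior-and-posterior strategy can be sketched directly. This is only an illustration under an assumed normal-normal model with n conditionally independent feedback observations; the function name and the numbers are hypothetical, not part of Allenby's answer.

# Back out the likelihood variance (the "operational" uncertainty) from an
# elicited prior variance and posterior variance, assuming normal updating:
#   1/post_var = 1/prior_var + n/lik_var  =>  lik_var = n / (1/post_var - 1/prior_var)
def implied_likelihood_variance(prior_var, posterior_var, n_obs):
    gain = 1.0 / posterior_var - 1.0 / prior_var
    if gain <= 0:
        raise ValueError("the posterior must be tighter than the prior")
    return n_obs / gain

# e.g., prior variance 4.0 shrinking to posterior variance 1.0 over 6 trials
print(implied_likelihood_variance(4.0, 1.0, 6))  # -> 8.0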
Editor: Bayesian updating models do sound apropos and worth investigating. Your choice of the repeated measures ANOVA model sounds perfectly well suited for addressing your concerns. I am not sure what you mean by back shuffle (was that a dance in the late 1970s?), but I am guessing that you are trying to make a case that the order of stimuli presented over trials is unimportant. Often researchers who test for "order," hoping it will not have an effect or interaction with any other factor, find their hope fulfilled partly because so few participants were assigned to any given order that there is no power to detect an order effect.
Your situation seems a bit more complicated. I sense some ambiguity in how you characterize the stimuli in your design. On the one hand, you do not wish to examine the F tests per trial, because you exposed different participants to different stimuli as they proceeded along, so an aggregate statistic (across respondents within a trial) may be meaningless. In this case, as you intuit, the best you could do is report the final F test, as capturing the outcome of the process, but not the steps along the process itself (nor, as you recognize, the means at each trial).

On the other hand, you make it sound like you could argue philosophically that the stimuli presented over trials are expected to be theoretical replicates, in which case you could presumably interpret the F tests and means per trial. Another perspective would be to not worry about the stimuli within a trial whatsoever and rather create an index that flags when (i.e., during which trial) the respondent has proceeded up the learning curve, as determined by some criterion that seems suited to your learning task (correct answers, speed in responding, etc.). This suggestion is related to your investigation into slopes - again, a nice intuition. However, you need not shift paradigms to test the shapes of individuals' learning curve functions or their points of "inflection" (essentially, at which trials do you see substantial increments of improvement for each respondent?). Within the ANOVA framework, you would treat trials as quantitative factors, and you could test "trend" contrasts (i.e., linear, quadratic, etc.; cf. Keppel, 1991), as in the brief sketch that follows. Related questions in this special issue may also be useful, including (a) studying effects in regression analogous to simple effects in ANOVA (Section II.H.) and (b) hierarchically nested designs in ANOVA (Section I.I.).
REFERENCE

Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
I.M. SHOULD WE PLOT MEANS OR MEAN DIFFERENCES WHEN ANALYZING WITHIN-SUBJECTS DESIGNS?

Should we plot means or mean differences when analyzing within-subjects designs?
Professor Jan-Benedict Steenkamp
Catholic University of Leuven, Belgium
I do not feel strongly about this. Mean differences are easily computed from the means. In any case, the data suffer from correlated errors, so I am not sure whether mean differences are a great help.

Editor: Steenkamp is right, though the intuitions behind the question are good, given that the underlying philosophy in a within-subjects, or repeated measures, design is one of comparing participants' responses under one condition to their responses under another condition, to see what kinds of changes have been established. McNeil (1992) had some useful suggestions for plotting data observations that are structurally tied to each other. To that same end, you may find Harris's (1985) presentation of "profile analysis" interesting, in plotting and analyzing shapes of curves across trials or conditions.
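For readers who want a concrete starting point, the following is a minimal matplotlib sketch of one such display (not McNeil's own procedure): every participant's profile drawn across conditions, with the mean profile overlaid. The simulated numbers are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = rng.normal([3.0, 3.6, 4.4], 0.6, size=(25, 3))  # 25 participants x 3 conditions

x = np.arange(scores.shape[1])
plt.plot(x, scores.T, color="gray", alpha=0.4, linewidth=1)   # individual profiles
plt.plot(x, scores.mean(axis=0), color="black", linewidth=3,
         marker="o", label="mean profile")
plt.xticks(x, ["cond 1", "cond 2", "cond 3"])
plt.ylabel("response")
plt.legend()
plt.show()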
REFERENCES

Harris, Richard J. (1985). A primer of multivariate statistics (2nd ed.). New York: Academic.

McNeil, Don. (1992). On graphing paired data. American Statistician, 46, 307-311.
I.N. USES OF ANCOVA

I understand the typical use of the ANCOVA. However, is it appropriate to use ANCOVA to measure whether the experiment result was confounded by some uncontrolled variables (e.g., demographic variables)? If yes, how do we interpret the ANCOVA result in such a case? If not, what are other statistical tools to detect such confounding effects?
Editor: Yes, the use of the ANCOVA as suggested in this question is legitimate and could be implemented beneficially in this manner more frequently. When conducting experiments, we randomly assign respondents to conditions to enhance internal validity - our hopes at being able to attribute group differences on various DVs to our experimental interventions. If we have randomly assigned persons to conditions, then a priori, before our manipulations, these groups should be, statistically speaking, roughly (i.e., within sampling variability) equal on nearly any measure one could concoct. However, "random" and "statistically speaking" always work in the long run (over one's career?), which is to say that any given random assignment might look peculiar indeed to the eye. In particular, we worry about confounding - extant group differences on extraneous (i.e., nonfocal) attributes that could serve as alternative explanations for the group differences observed on the dependent measures of interest. In such situations, we can use the ANCOVA to correct for, or statistically control for, such observed differences. In these circumstances, ANCOVA serves as a post hoc equating of the groups when the random assignment did not appear sufficient.

In paradigms where we know certain extraneous factors to be persistent, we may opt to mechanically control for differences by "blocking" on that extraneous factor - purposively distributing equal numbers of persons high and low on the blocking attribute across the experimental conditions to assure there will be no confounding. Blocking can be clumsy, however, and it is difficult to know just which nuisance factors will flare up on any given occasion. An advantage of the ANCOVA is that all the potential covariates may be measured (after the DVs so as to minimize any possible contamination), and then these variables may be tested as candidate covariates, one at a time or in groups, to see which corrections should be maintained in the analysis.
As an example of using ANCOVA to correct for initial group differences, I have been playing with data that arose from a classic pretest-posttest design in which we found the predicted posttest results. However, on further inspection, we also found the pretest results to be significant, with a similar pattern of means. Once the pretest attitude scores were included as a covariate, the adjusted effects on the DV were no longer impressive (nor significant); the groups' attitudes simply differed across conditions before we ever intervened. A minimal sketch of this kind of adjustment follows.
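This sketch assumes hypothetical column names (condition, pretest, posttest) and Python's statsmodels; it simply compares the unadjusted and covariate-adjusted ANOVA tables.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("experiment.csv")  # hypothetical: condition, pretest, posttest

unadjusted = smf.ols("posttest ~ C(condition)", data=df).fit()
adjusted = smf.ols("posttest ~ pretest + C(condition)", data=df).fit()

print(sm.stats.anova_lm(unadjusted, typ=2))  # the raw condition effect...
print(sm.stats.anova_lm(adjusted, typ=2))    # ...which may vanish once the
                                             # pre-existing differences are removed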
How does one interpret ANCOVA results when using the model to correct for extant group differences? Hopefully, the effect for the covariate (and perhaps its interaction with one or more of the experimental factors) is significant, but even if it is not, including the covariate in the model usually helps to clean up the analysis by reducing the MSE term, which enhances the likelihood of finding significance in the DV. In reporting regressions, researchers frequently describe the theoretically important findings, mentioning that they also controlled for certain other variables - for example, by having included variables and dummy variables representing the extraneous factors simultaneously in the model. The logic for ANCOVA is similar, and this sort of reporting would seem to be a reasonable format - for example, "using a covariate to control for the a priori group differences observed on [weight], we nevertheless saw significant differences on [self-reported cravings for chocolate] across the various [Advertisement x Room Scent] conditions."
For additional information on the ANCOVA, consult Edwards (1979), Keppel (1991, pp. 297-326), Kirk (1982, pp. 715-757), and Wildt and Ahtola (1978).
REFERENCES

Edwards, Allen L. (1979). Multiple regression and the analysis of variance and covariance. New York: Freeman.

Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Kirk, Roger E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.

Wildt, Albert R., & Ahtola, Olli T. (1978). Analysis of covariance. Beverly Hills, CA: Sage.
I.O. WHEN TO USE MANOVA, AND SIGNIFICANT MANOVAS AND INSIGNIFICANT ANOVAS OR VICE VERSA
Consider an experimental study that involves multiple dependent measures hypothesized to respond in the same manner to an experimental manipulation. When the data are analyzed with MANOVA, the experimental manipulation has a nonsignificant effect. Nonetheless, the univariate ANOVAs show a significant effect in the predicted direction for each of the dependent measures. Why does this inconsistency occur, and which set of results should the researcher rely on?

A related question follows: I have run into confusion when determining when to use MANOVA versus separate ANOVAs. Most textbooks are vague on this subject and just mention that MANOVA can be used with correlated DVs. The question is, What is the desired level of correlation to justify use of MANOVA? I asked my resident statistics expert, and he feels that DVs correlated from about .3 to about .7 are eligible, because anything lower indicates variables that really are not very related, and anything higher indicates redundancy. He was also quick to point out that it must make sense conceptually to package DVs together before doing so. What is the conventional wisdom on this matter? I have had reviewers that insist MANOVA must be used if there are multiple DVs, period, without considering the conceptual implications.

Another question is, What do I do if MANOVA is not significant, but ANOVAs on the individual DVs are? Is it possible that the DVs share variance and that prevents the MANOVA from approaching significance? Is there any way to justify using the individual ANOVAs instead?
Professor Scott Maxwell
University of Notre Dame

The MANOVA null hypothesis is that group mean differences are zero on each and every DV. Thus, all it takes is for groups to differ on a single DV for the null hypothesis to be false. Therefore, with a large enough sample, the multivariate null hypothesis will be rejected with a probability approaching 1.0 even if the groups are identical on all but a single variable. Most researchers, however, are not blessed with an indeterminately large sample. Thus, statistical power becomes an important consideration in planning as well as evaluating group comparisons.

In some situations, univariate tests may be more powerful than the multivariate test, leading to one or more statistically significant univariate results but a nonsignificant multivariate test. The opposite is also possible - the multivariate test can be more powerful than any of the univariate tests, in which case the multivariate test may be statistically significant, but there are no statistically significant group differences on any of the individual DVs, according to univariate tests. The reason such discrepancies can occur is because the power of each univariate test depends on mean differences and variances (and obviously sample size), but the power of MANOVA depends not just on group mean differences and variances, but also on correlations among the DVs. Although some researchers have suggested that high correlations lead to greater power, others have suggested just the opposite. Cole, Maxwell, Arvey, and Salas (1994) showed that the power of MANOVA can either increase or decrease as a function of the intercorrelations among the DVs.

In particular, to the extent that the various DVs tend to measure the same underlying construct, the power of MANOVA is likely to be less than if the measures are more disparate. Thus, especially when variables are highly related, approaches other than MANOVA may be preferable. For example, Maxwell (1992) pointed out the potential benefits of forming a composite dependent measure such as a simple unweighted average or first principal component of the original DVs.

Alternatively, a latent variable model could be adopted and latent variable mean differences investigated (cf. Cole, Maxwell, Arvey, & Salas, 1993). More generally, researchers are advised to heed Huberty and Morris's (1989) advice about the extent to which their multiple dependent measures constitute a variable system.

If the researcher's only reason for using MANOVA is to control the family-wise Type 1 error rate, Huberty and Morris (1989) suggested that better methods may be available. On the other hand, when researchers are interested in examining linear combinations of the DVs, MANOVA is likely to be the method of choice.

REFERENCES

Cole, David A., Maxwell, Scott E., Arvey, Richard D., & Salas, Eduardo. (1993). Multivariate group comparisons of variable systems: MANOVA and structural equation modeling. Psychological Bulletin, 114, 174-184.

Cole, David A., Maxwell, Scott E., Arvey, Richard D., & Salas, Eduardo. (1994). How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychological Bulletin, 115, 465-474.

Huberty, Carl J., & Morris, John D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105, 302-308.

Maxwell, Scott E. (1992). Recent developments in MANOVA applications. In Bruce Thompson (Ed.), Advances in social science methodology (Vol. 2, pp. 200-222). Greenwich, CT: JAI Press.

Maxwell, Scott E., & Delaney, Harold D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.
Editor: "MANOVA is a generalization of analysis of variance that allows the researcher to analyze more than one dependent variable. Many times it is of interest to compare group means on several variables simultaneously, and MANOVA allows the researcher to do this in both experimental and observational research" (Bray & Maxwell, 1985, p. 5).

Bray and Maxwell (p. 31ff) provided a nice demonstration of one of the arguments for MANOVA, that it can be more powerful in detecting group differences - for example, the marginal distributions on either of two DVs may yield insignificant ANOVA results, yet jointly the groups are cleanly separable in the 2-D multivariate space, via MANOVA (cf. Iacobucci, 1994). A simulated sketch of that joint-separation argument appears below.
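This sketch is not Bray and Maxwell's own example; it simulates hypothetical data in which each DV separates the groups only weakly on its margin, yet the MANOVA test on the two DVs jointly can be clearly significant. It assumes Python with numpy, pandas, and statsmodels.

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(7)
n = 40
# Positively correlated DVs, with the group offset along the (1, -1)
# direction, so each marginal (univariate) separation is weak.
cov = [[1.0, 0.8], [0.8, 1.0]]
g1 = rng.multivariate_normal([0.0, 0.0], cov, n)
g2 = rng.multivariate_normal([0.35, -0.35], cov, n)

df = pd.DataFrame(np.vstack([g1, g2]), columns=["y1", "y2"])
df["group"] = ["a"] * n + ["b"] * n

# The multivariate test exploits the correlation between the DVs.
print(MANOVA.from_formula("y1 + y2 ~ group", data=df).mv_test())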
Usually the selection of the variables to model simultaneously in a MANOVA is conceptually driven. Multiple indicators of a single construct are likely to be correlated, and they make for a sensible variable system (as stated previously) or conceptual package (as per the second question stated previously). Alternatively, a researcher may be interested in modeling the joint behavior of multiple complementary DVs, whose measurement may not be redundant, but whose information is correlated, as in components of a behavioral process system. In either case, "the dependent variables should be theoretically correlated as well as empirically correlated" (Weinfurt, 1995, p. 251).

If, as is typically argued, an advantage of MANOVA over ANOVA is that the former takes into account the correlations among the dependent measures in these conceptual sets, as the second question poses, just how correlated ought those measures be? I agree wholeheartedly with the questioner's resident stats expert - a range of .3 to .7 seems sensible, albeit a subjective rule of thumb. If all the DVs are pairwise correlated greater than .7, it may be simpler to create a composite score - a simple average - and execute a univariate ANOVA.

If all the correlations fall short of .3 in magnitude, the DVs are essentially contributing unique information, and a series of ANOVAs may be conducted with little loss of information. However, if one were to do so, it would be important to consider the Type 1 error rates and compensate in a Bonferroni fashion, correcting alpha by dividing it by the number of DVs (a small sketch follows this paragraph). Protection from excessive Type 1 errors has been another argument for MANOVA, though researchers remind us that this protection is true only of the first stage in hypothesis testing (i.e., the MANOVA overall test that determines whether anything is significant). In the follow-up tests to identify just which groups differ on just what variables, Type 1 errors may abound (Bird & Hadzi-Pavlovic, 1983; Ramsey, 1980, 1982; see also Huberty & Morris, 1989).
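A minimal sketch of that Bonferroni routine, with hypothetical DV and column names, assuming Python with pandas and statsmodels:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("study.csv")            # hypothetical data file
dvs = ["recall", "attitude", "intent"]   # hypothetical, weakly correlated DVs
alpha = 0.05 / len(dvs)                  # Bonferroni: .05 / 3 = .0167

for dv in dvs:
    fit = smf.ols(f"{dv} ~ C(condition)", data=df).fit()
    p = sm.stats.anova_lm(fit, typ=2).loc["C(condition)", "PR(>F)"]
    print(dv, f"p = {p:.4f}", "significant" if p < alpha else "n.s.")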
Having said all that, all the right things, I have to say that I personally cannot remember ever reading a single journal article in which the hypotheses to be tested were actually multivariate. ANOVA and MANOVA are linear models (one term, like a main effect, gets added to or subtracted from another term, like an interaction). Linear models are inherently compensatory, meaning that if a respondent scores high on one variable and low on another, their data are treated like those of a respondent with the reverse pattern, or one who scores in the middle on both variables (within limits, e.g., if the variables are roughly equally weighted). Stated another way, linear models treat data in a disjunctive form (is X high or is Y high). Most journal articles state hypotheses in a conjunctive form (e.g., H1: "We expect X to be high under these conditions," and H2: "We expect Y to be high under these conditions"). Researchers with these predictions would not conclude a success if X were extra high and Y were low - the forms of the hypothetical priors are not compensatory. If a univariate hypothesis is stated, it may make most sense to simply test it univariately. If a number of indicator variables have been measured and they are shown (through a factor analysis) to be tapping the same construct, a composite (e.g., average) may be formed and a simple ANOVA conducted. If a number of DVs are to be tested and they cannot be aggregated - for example, they are hypothesized to represent different constructs with different anticipated results - then the researcher should do separate ANOVAs and probably operate conservatively by adjusting alpha.

Nevertheless, a most definitive scenario in which MANOVA should be used rather than ANOVA is when modeling repeated measures (within-subjects) variables or mixed designs (i.e., in conjunction with between-subjects factors). The ANOVA treatment for a repeated measures or mixed design is known to be nonrobust to violations of the homogeneity of treatment difference variances assumption, which is relaxed in the MANOVA modeling approach on such data.
In theory, a MANOVA is a straightforward extension of ANOVA. And, in theory, if several DVs are at least somewhat correlated and their results are qualitatively similar (showing the same patterns of means, etc.), then the conclusions drawn between the two techniques should converge.
REFERENCES

Bird, Kevin D., & Hadzi-Pavlovic, Dusan. (1983). Simultaneous test procedures and the choice of a test statistic in MANOVA. Psychological Bulletin, 93, 167-178.

Bray, James H., & Maxwell, Scott E. (1985). Multivariate analysis of variance. Newbury Park, CA: Sage.

Iacobucci, Dawn. (1994). Analysis of experimental data. In Richard Bagozzi (Ed.), Principles of marketing research (pp. 224-278). Cambridge, MA: Blackwell.

Ramsey, Philip H. (1980). Choosing the most powerful pairwise multiple comparison procedure in multivariate analysis of variance. Journal of Applied Psychology, 65, 317-326.

Ramsey, Philip H. (1982). Empirical power of procedures for comparing two groups on p variables. Journal of Educational Statistics, 7, 139-156.

Weinfurt, Kevin P. (1995). Multivariate analysis of variance. In Laurence G. Grimm & Paul R. Yarnold (Eds.), Reading and understanding multivariate statistics (pp. 245-276). Washington, DC: American Psychological Association.
I.P. TESTING FOR SIGNIFICANT DIFFERENCES IN VARIANCES, RATHER THAN MEANS

Is there a way to compare whether variances on some variable in two different groups are different? I have two groups in my sample of survey respondents, and I have noticed that mean ratings are higher in one group compared to the other, but their judgments also seem to have lower variance. This finding makes theoretical sense and helps the research, but I do not know how to demonstrate statistically that the variance is smaller in the one group versus the other.
Professor Greg Allenby
Ohio State University
Sure, you bet. You treat the variance as any other parameter in a model and build two models of the data: (a) the variance is constrained to be the same in the two subsamples, and (b) the variance parameter is allowed to differ in the two subsamples. Fit these two models to the data via likelihood methods. Minus two times the difference in the log likelihoods is distributed chi-square with 1 df in this problem. If -2 times the difference in log likelihoods is greater than the magic cutoff number for a chi-square with 1 df, then the variances are significantly different.
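A sketch of that likelihood-ratio comparison, assuming normal data within each group; the simulated samples are hypothetical stand-ins for the questioner's two groups.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x1 = rng.normal(5.2, 1.0, 60)   # hypothetical group 1 ratings
x2 = rng.normal(4.4, 2.0, 60)   # hypothetical group 2 ratings

def loglik(x, var):
    # Normal log likelihood at the group's own mean, with a supplied variance.
    return stats.norm.logpdf(x, loc=x.mean(), scale=np.sqrt(var)).sum()

# Model (b): each group keeps its own (MLE) variance.
ll_sep = loglik(x1, x1.var()) + loglik(x2, x2.var())
# Model (a): variance constrained equal -- pooled MLE across both groups.
pooled = (((x1 - x1.mean())**2).sum() + ((x2 - x2.mean())**2).sum()) / (len(x1) + len(x2))
ll_eq = loglik(x1, pooled) + loglik(x2, pooled)

lr = -2 * (ll_eq - ll_sep)               # -2 times the log likelihood difference
print(lr, ">", stats.chi2.ppf(0.95, 1))  # reject equal variances if larger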
Professor Sachin Gupta
Northwestern University

Assume these are random samples from two normal populations with means and variances (μ1, σ1²) and (μ2, σ2²). Several hypotheses about the population variances may be tested (see Mood, Graybill, & Boes, 1974):

(a) H0: σ1² ≤ σ2² vs. H1: σ1² > σ2²;
(b) H0: σ1² ≥ σ2² vs. H1: σ1² < σ2²;
(c) H0: σ1² = σ2² vs. H1: σ1² ≠ σ2².

The test is derived from the basic idea that if X11, X12, ..., X1n1 is a random sample from a normal density with mean μ1 and variance σ1², if X21, X22, ..., X2n2 is a random sample from a normal density with mean μ2 and variance σ2², and if the two samples are independent, then we know that the statistic

Q = [Σ(X1i - X̄1)² / ((n1 - 1)σ1²)] / [Σ(X2i - X̄2)² / ((n2 - 1)σ2²)]

has the F distribution with (n1 - 1) and (n2 - 1) df when σ1² = σ2². The statistic Q tends to be large when σ1² > σ2² and small when σ1² < σ2². We use this characteristic to formulate tests for the hypotheses given previously. For example, the test of hypothesis (c) is the following: Prefer H0 if k1 < Q < k2, where k1 and k2 are selected so that the test has size α. If we let the two tails have equal areas of α/2, then k1 = F(α/2)(n1 - 1, n2 - 1) and k2 = F(1 - α/2)(n1 - 1, n2 - 1).

REFERENCE

Mood, Alexander M., Graybill, Franklin A., & Boes, Duane C. (1974). Introduction to the theory of statistics (3rd ed.). New York: McGraw-Hill.

Editor: I do not know whether the typical JCP reader will be familiar with the likelihood methods suggested in the first scenario, so I am going to focus on the second. Within the second response, let me focus on Hypothesis (c). It may help to make this statistic more familiar looking if we restate the hypotheses from

H0: σ1² = σ2² vs. H1: σ1² ≠ σ2²

to

H0: σ1²/σ2² = 1 vs. H1: σ1²/σ2² ≠ 1.
The F statistic, after all, is a ratio of two variances - specifically, the null hypothesis in the ANOVA scenario involves MSeffect/MSerror.

In the statistic Q mentioned previously, we do not really do anything with the σ1² in the numerator or the σ2² in the denominator because, according to the null hypothesis, these are theoretically equal, and so they essentially drop out of the equation. If you look at the pieces of the equation that remain, you have the variance of Sample 1 (S1²) in the numerator and the variance as estimated in Sample 2 (S2²) in the denominator; for example,

S1² = [1/(n1 - 1)] Σ(X1i - X̄1)².

Therefore, you simply compute Q as the ratio of S1² over S2²; literally, Q = S1²/S2². You would reject the null hypothesis and claim that σ1² > σ2² if Q exceeded the right-hand critical value in the F distribution, and you would support the claim that σ1² < σ2² if Q was less than the left-hand critical value in the F distribution.
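In code, the computation is a one-line ratio plus the two critical values; the samples below are simulated, hypothetical stand-ins.

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x1 = rng.normal(0, 1.6, 31)  # hypothetical Sample 1
x2 = rng.normal(0, 1.0, 37)  # hypothetical Sample 2

s1, s2 = x1.var(ddof=1), x2.var(ddof=1)            # unbiased sample variances
q = s1 / s2                                        # Q = S1^2 / S2^2
lo = stats.f.ppf(0.025, len(x1) - 1, len(x2) - 1)  # left-hand critical value
hi = stats.f.ppf(0.975, len(x1) - 1, len(x2) - 1)  # right-hand critical value
print(f"Q = {q:.2f}; reject H0 if Q < {lo:.2f} or Q > {hi:.2f}")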
This test of variances may look a little odd because our typical experience with F tests is that we reject our null hypotheses only for large values of F - we have put the entire rejection region in the right-hand side of the distribution, caring to demonstrate only whether the numerator variance (MSeffect) is larger than the denominator (MSerror). (In fact, this approach so dominates the uses and tests of comparative variances that most of our statistical texts provide only right-hand tail critical values in their tables.) Therefore, if you put the bigger of your two variances in the numerator, call it Sample 1, and then test against the new alternative:

H0: σ1²/σ2² = 1 vs. H1: σ1²/σ2² > 1,

you would use the critical value Fα(n1 - 1, n2 - 1), where the entire rejection region is in the right-hand tail - for example, F.05,30,36 = 1.78; F.01,12,10 = 4.71, and so on. (Statistical purists would say that examining the two variances, seeing which is larger, and putting it in the numerator as representing Sample 1 is already a heuristic of a significance test, and doing so increases the likelihood of rejecting the null. If this issue is of concern, a more stringent significance level could be implemented. I was simply trying to make salient the similarity between this test and something likely to be familiar to the reader - the F test in ANOVA.)
By the way, if you consider this test the variance analog to a two-sample t test for testing mean differences, we also know there is a t test for testing a single mean against some hypothetical value (i.e., H0: μ1 = μ0, compared to the two-sample version, μ1 = μ2). There is a similar test to see whether a single variance estimate is equal to some given value - that is, H0: σ² = σ0², where σ0² is some constant c (say, H0: σ² = 4.0). The statistic takes the following form: (n - 1)s²/c. For example, for α = .05, sample size n = 30, and df = 29, the two-tailed chi-square cutoff values are 16.05 and 45.72 (i.e., reject and conclude your observed variance estimate is significantly less than c if this statistic falls short of 16.05, reject and conclude your observed variance is significantly greater than c if it exceeds 45.72, and do not reject the null, or maintain that your variance is plausibly equal to c in the population, if the statistic falls within the range). The one-tailed cutoff value is χ² = 42.56 (if you care only whether your variance is greater than c; Hogg & Tanis, 1979, pp. 250-251).
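A quick check of those numbers, with a hypothetical observed variance:

import numpy as np
from scipy import stats

n, c = 30, 4.0
s2 = 5.9                              # hypothetical observed sample variance
stat = (n - 1) * s2 / c               # (n - 1)s^2 / c ~ chi-square with n - 1 df
lo, hi = stats.chi2.ppf([0.025, 0.975], n - 1)
print(f"statistic = {stat:.2f}, two-tailed cutoffs = ({lo:.2f}, {hi:.2f})")
# chi2.ppf gives 16.05 and 45.72 here, and the one-tailed .05 cutoff is 42.56,
# matching the values quoted in the text.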
REFERENCE

Hogg, Robert V., & Tanis, Elliot A. (1979). Probability and statistical inference. New York: Macmillan.

I.Q. MANIPULATION CHECK: RELATIVE VERSUS ABSOLUTE MEAN DIFFERENCES

Consider an experimental study containing two levels of a construct that includes a manipulation intended to represent the ends of the continuum embodied by the construct (e.g., high vs. low involvement; near vs. far brand extension). Participants' responses to an appropriate manipulation check are affected significantly by the manipulation. However, inspection of the mean responses reveals that, in one condition, they averaged around the midpoint of the manipulation check scale (i.e., the mean does not differ significantly from the scale midpoint), but were more polarized in the other condition (i.e., the mean does differ significantly from the midpoint). What labels should the researcher use in describing these two conditions (e.g., the original high-low labels vs. higher-lower vs. high-moderate or low-moderate)? And, given that the scale is interval rather than ratio in nature, is it inappropriate to assign meaning to any given point (e.g., the midpoint) on the scale?

Professor Jan-Benedict Steenkamp
Catholic University of Leuven, Belgium

I am bothered by the fact that sometimes the reported means on the manipulation check suggest that the one condition was not low at all (or high) but was actually close to the midpoint. The theory may still hold for, say, average versus high, but people still often use the labels low-high. I believe that is not correct, and the correct labels should be used, reflecting the actual scores given. I believe in any case that most phenomena are continuous, so that some kind of dichotomization is a simplification of reality. Your question shows that this may not be the ideal strategy.

I see no particular problem in assigning meaning to the scale. After all, the scale is verbally anchored, and, hence, we know that on a 5-point scale ranging from 1 (very low) to 5 (very high), 4 means high. Of course, the scale may not be interval scaled at all. In that case, we do not know whether it is rather high, pretty high, or close to very high. As you know, we can get better insight into these issues by applying "Homals" (a program for correspondence analysis in SPSS; see Hair, Anderson, Tatham, & Black, 1998, p. 553) or a related technique to these scores. However, in short, to make the score meaningful, we have to give it verbal meaning, which is also what the respondent does.

REFERENCE

Hair, Joseph F., Jr., Anderson, Rolph E., Tatham, Ronald L., & Black, William C. (1998). Multivariate data analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Editor: I am not sure just how definitively to take a midpoint, or a participant's interpretation of one. Having been baptized in measurement, I guess I have a healthy (overactive?) skepticism of taking any measured variable literally. Furthermore, I think the important point in the application posed is that researchers are trying to provide demonstrations of phenomena that occur at relatively high and relatively low ends of a scale. Therefore, labeling two means as high and low instead of high and middle would not keep me up at night. The empirical question is whether the high-middle, less extreme, outcomes provide sufficient variance for interesting effects to result - that is, if one can produce the results of interest under the high-middle conditions, presumably the focal effects would be even more pronounced had the calibration yielded high-low scores (i.e., the former contrast, perhaps produced by using a more subtle manipulation, yields a conservative test).
In addition, even if the researcher thought high and middle were more precise labels for their current data, the next researcher might attempt to replicate the study conceptually (i.e., same construct), but perhaps doing so with a different operationalization and measure (i.e., different variable), thereby likely creating an altogether new calibration, one that was even more exaggerated or truncated, depending on the purpose of the new study. If this issue is really troublesome, the extent of variation on these operationalizations can be captured easily by indexes of effect size (omega-squared, eta-squared, etc.; cf. Cramer & Nicewander, 1979; Dodd & Schultz, 1973; Dwyer, 1974; Keppel, 1991; Maxwell, Camp, & Arvey, 1981; O'Grady, 1982), as in the sketch below.
REFERENCES

Cramer, Elliot M., & Nicewander, W. Alan. (1979). Some symmetric, invariant measures of multivariate association. Psychometrika, 44, 43-54.

Dodd, David H., & Schultz, Roger F., Jr. (1973). Computational procedures for estimating magnitude of effect for some analysis of variance designs. Psychological Bulletin, 79, 391-395.

Dwyer, James H. (1974). Analysis of variance and the magnitude of effects: A general approach. Psychological Bulletin, 81, 731-737.

Keppel, Geoffrey. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Maxwell, Scott E., Camp, Cameron J., & Arvey, Richard D. (1981). Measures of strength of association: A comparative examination. Journal of Applied Psychology, 66, 525-534.

O'Grady, Kevin C. (1982). Measures of explained variance: Cautions and limitations. Psychological Bulletin, 92, 766-777.
I.R. EFFECTIVE MANIPULATION IN THE RESOURCE MATCHING CONTEXT

One recent stream of empirical efforts in the persuasion literature has focused on the resource matching framework, which predicts that persuasion depends on the extent to which the cognitive resources needed for advertising processing match the cognitive resources consumers have available at the time of advertising processing. A typical study would entail different experimental conditions that vary the match of available resources (AR) and required resources (RR). For example, one condition would be designed to represent a match (AR = RR), whereas other conditions would be designed to represent a mismatch (AR > RR or AR < RR).

To validate assumptions about whether a given condition represents a match or mismatch, a recent Journal of Consumer Research (JCR) article reports evidence in the form of response times (RTs) to a secondary task. RTs were fastest in the condition designed to represent AR > RR, slowest in the condition intended to represent AR < RR, with the RTs in the AR = RR condition falling between these two extremes. Based on the results, it was concluded that the conditions represented the desired levels of match-mismatch between AR and RR.

I am unsure as to whether this conclusion is warranted for the following reasons. I accept that any condition representing AR > RR should lead to the fastest RTs given that it is the only condition where excess resources are available. Less obvious to me, however, is whether differences would be expected for RTs between the AR = RR and AR < RR conditions. In the match condition, participants are presumably devoting all of their resources to advertising processing, leaving minimal resources for responding to the secondary task. Accordingly, RTs should be slower than those observed in the AR > RR condition. Similarly, participants in the AR < RR condition should be allocating all of their resources to ad processing, which again would slow their RTs.

However, because participants are presumed to allocate all of their resources in both the AR = RR and AR < RR conditions, it would seem that RTs should be the same in these two conditions. Do you agree? If so, then what would be an appropriate interpretation of what each condition represents? For example, might one fairly conclude that the data indicate that AR always exceeded RR, but that the extent of this excess varied such that it was greatest in the AR > RR condition, smallest in the AR < RR condition, with the excess falling between these two extremes in the AR = RR condition? More generally, how might researchers attempt to establish the relative levels of AR and RR represented by a particular experimental condition? This is a critical issue because the degree to which a given data pattern supports the resource matching framework directly depends on what the conditions actually represent.
Professor Joan Meyers-Levy
University of Minnesota
Because it is impossible to directly assess the level of resources an individual truly has activated and has available for processing (because, of course, this is not directly observable), assessing individuals' RT to a secondary task would seem to provide some, albeit perhaps incomplete or nondefinitive, insight into this issue. An observation that under particular conditions significant differences occur in people's RT for performing a secondary task can serve as an indicator of the degree of difficulty such persons encounter in conjuring up and recruiting resources to perform the secondary task. Therefore, to the extent that RTs are longer in one condition than another, it follows that individuals are experiencing more difficulty identifying or freeing up resources for performing the secondary task, and, thus, available resources are to some degree less than plentiful. Hence, if one wishes to assert that one condition represents AR < RR, whereas another represents AR = RR, a necessary condition is that longer RT must occur in the so-called AR < RR than in the AR = RR condition, because even though it also may be difficult and time consuming to identify and recruit resources for the secondary task in the AR = RR condition, it should be even more so in the more resource-challenged AR < RR condition (i.e., it should take longer to identify or free up such resources).

At the same time, observing significant differences in such RT across any three or more conditions only indicates that relative differences exist in ease of recruiting resources in these conditions, which by itself does not necessarily mean (as you note) that the three conditions truly represent ones in which AR < RR, AR = RR, and AR > RR. As such, it seems that observing such significant RT differences may represent a necessary but not sufficient condition to make such a claim. What is needed further is additional evidence of some unique pattern of outcomes on some other measure such as persuasion, which follows from the particular logic of resource matching theory and how this outcome measure in question should be affected if AR are in fact greater, equal, or less than RR.

Although I can appreciate your desire for some other means of substantiating or verifying the levels of available relative to required resources, I am unable to offer any suggestions about how researchers might provide more definitive evidence. Indeed, I doubt whether this is possible, given that the levels of available relative to required resources can never be observed directly and thus can be inferred only indirectly via the effects they exert on necessarily indirect measures.
Editor: Perhaps RR needed to have been defined in the article more clearly. Perhaps this questioner took RR to mean the resources required to complete the focal (advertising processing) task (which I think is the typical understanding), whereas the authors of the article conceptualized or operationalized RR as the resources required to complete both the focal and secondary tasks. The questioner's perspective would predict that AR = RR ought to yield results on the focal task performance different from those in AR < RR (i.e., in the AR = RR condition, the participant would have just sufficient resources to complete the primary task, but insufficient in the AR < RR condition) but similar performance on secondary task requirements (i.e., in both the AR = RR and AR < RR conditions, the participant would not have had the opportunity to fully process the second task). (It may help to envision these equalities and inequalities through the use of Figure 5.) Even if the questioner and authors were defining RR similarly, it is important to remember that no matter how AR or RR are operationalized, each is experienced as a cognitive, not directly observable, process, so calibrations will be rough, exhibiting individual differences, though hopefully sufficient for comparing across conditions.

[Figure 5. Relations between available and required resources. One panel plots performance on the focal and secondary tasks, and a second panel plots reaction time on the secondary task, each across the AR < RR, AR = RR, and AR > RR conditions.]
I think the issue can be resolved more clearly if, as Meyers-Levy suggests, a pattern of results is sought across multiple dependent measures. Distinctions may be refined and the process understood better if the researchers built an argument by simultaneously examining measures of performance on both the primary and secondary tasks, and RTs on at least the latter. If, say, we operationalize "resources available" simply as time allocation, it does seem puzzling that RT for AR < RR could be less than for AR = RR, given that, when the allotted time is over and a response is required of the participant, the AR = RR participants will have just completed the focal task, whereas the AR < RR participants will not have done so, but they must also yield a response due to the time constraint, say 1,500 msec, so RTs for respondents in both of these conditions would be at ceiling. Hence, the RTs may be predicted to be AR > RR (<) AR = RR (=) AR < RR (e.g., 1,000 msec < 1,500 msec = 1,500 msec). That is, the time constraint puts a ceiling on how high the RTs can be; if there were no time constraint, then the RTs in AR = RR might be expected to be less than those in AR < RR, but it is the very time constraint that defines AR. (Of course, duration is not the only variable available for manipulating resources.)
The questioner states that the data in the article indicated that RTs were ordered: AR > RR (<) AR = RR (<) AR < RR. Perhaps we might conclude that the conditions, however conceptualized, were manifest operationally as AR >>> RR, AR >> RR, and AR > RR. That is, available resources exceed those required in all conditions, differing in a matter of degree in which the calibration was not that fine (it is not clear how to do so error free), but the theoretical inquiry was one of relative comparisons, so was this result sufficient?

If the researcher would also examine some performance indicator, memory, or extent-of-processing data, then presumably performance on the focal task should follow the order suggested by the writers of the article referred to in the question - that is, AR > RR (>) AR = RR (>) AR < RR.
That is, in the first condition, participants can succeed on the primary task and do so before the time constraint. In the middle (matching) condition, participants can succeed on the primary task, just completing it during the time allocated. In the insufficient resources condition, when time is up, participants' performance is incomplete, or they respond with less accuracy, given that the focal task would not have been completed. Therefore, even if RTs were equal on the secondary task for the latter two conditions, the processes may be inferred as different if data on a performance construct were also analyzed.