D.G. Bonett (3/2016)
Lecture Notes – Module 3
One-factor Experiments
A between-subjects treatment factor is an independent variable with k ≥ 2 levels
in which participants are randomized into k groups. It is common, but not
necessary, to have an equal number of participants in each group. Each group
receives one of the k levels of the independent variable with participants being
treated identically in every other respect. The two-group experiment considered
previously is a special case of this type of design.
In a one-factor experiment with k levels of the independent variable, the
population parameters are $\mu_1, \mu_2, \ldots, \mu_k$ where $\mu_j$ (j = 1 to k) is the mean of the
dependent variable if all members of the study population had received level j of
the independent variable. One way to assess the differences among the k
population means is to compute confidence intervals for all possible pairs of
differences. For instance, with k = 3 levels the following pairwise differences in
population means could be examined.
πœ‡1 – πœ‡2
πœ‡1 – πœ‡3
πœ‡2 – πœ‡ 3
In a one-factor study with k levels there are k(k – 1)/2 pairwise comparisons.
For any single 100(1 − α)% confidence interval, the probability of capturing the population parameter is approximately 1 − α, if all assumptions have been satisfied. If v 100(1 − α)% confidence intervals are computed, it can be shown that the probability that all v confidence intervals have captured their population parameters could be as low as 1 − vα. For instance, if six 95% confidence intervals are computed, the probability that all six 95% confidence intervals have captured their population parameters could be as low as 1 – 6(.05) = .70.
The researcher would like to be at least 100(1 − α)% confident that all v confidence intervals have captured their population parameters. A simple method of achieving this is to replace α with α* = α/v in the critical t-value of Equation 2.1 for each pair of means. Any confidence interval that uses α* = α/v in place of α is called a Bonferroni confidence interval.
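For readers who want to compute the adjustment directly, here is a minimal Python sketch (the notes themselves use SPSS and R; the α, v, and df values below are hypothetical, chosen to match a three-group design with 25 participants per group):

```python
# Sketch: Bonferroni-adjusted critical t-value (values below are hypothetical).
from scipy import stats

alpha = 0.05   # desired familywise alpha
v = 3          # number of confidence intervals (all pairs when k = 3)
df = 72        # error degrees of freedom (n - k), assumed for illustration

alpha_star = alpha / v                        # Bonferroni adjustment
t_crit = stats.t.ppf(1 - alpha_star / 2, df)  # critical value for each interval
print(round(t_crit, 3))  # about 2.45, versus about 1.99 unadjusted
```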
In the special case where the psychologist is interested in all v = k(k – 1)/2 pairwise
comparisons of means, Tukey confidence intervals are preferred because they are
slightly narrower than the Bonferroni confidence intervals. Tukey confidence
intervals replace the critical t-value in Equation 2.1 with a special type of critical
value that is optimized for examining all possible pairs of means. The Bonferroni
method is recommended when examining only a subset of all possible pairs of
means. Confidence intervals based on the Tukey or Bonferroni methods are called simultaneous confidence intervals.
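As an illustration, a reasonably recent version of SciPy can compute simultaneous Tukey intervals from raw data; the scores below are fabricated for a k = 3 design:

```python
# Sketch: simultaneous Tukey confidence intervals for all pairwise mean
# differences (scipy.stats.tukey_hsd requires SciPy 1.8 or later).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1, g2, g3 = (rng.normal(m, 5, 25) for m in (30, 34, 32))  # fabricated data

res = stats.tukey_hsd(g1, g2, g3)
ci = res.confidence_interval(confidence_level=0.95)
print(ci.low)   # lower limits for each pair (i, j)
print(ci.high)  # upper limits for each pair (i, j)
```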
Example 3.1. There is considerable variability in measures of intellectual ability among
college students. One psychologist believes that some of this variability can be explained
by differences in how students expect to perform on these tests. Seventy-five students were
randomly selected from a directory of about 2,000 students. The 75 students were
randomly divided into three groups of equal size and all 75 students were given a nonverbal intelligence test (Raven's Progressive Matrices) under identical testing conditions. The raw scores for this test range from 0 to 60. The participants in Group 1 were told that they were taking a very difficult intelligence test. The participants in Group 2 were told that they were taking an interesting "puzzle", and the participants in Group 3
were not told anything. Simultaneous Tukey confidence intervals for all pairwise
comparisons of population means are given below.
Comparison          95% Lower Limit    95% Upper Limit
$\mu_1 - \mu_2$          -5.4               -3.1
$\mu_1 - \mu_3$          -3.2               -1.4
$\mu_2 - \mu_3$           1.2                3.5
In this study population of 2,000 students, the psychologist is 95% confident that the mean
intelligence score would be 3.1 to 5.4 greater if all students had been told that the test was
a puzzle instead of a difficult IQ test, 1.4 to 3.2 greater if they all had been told nothing
instead of being told that the test is a difficult IQ test, and 1.2 to 3.5 greater if they all had
been told the test was a puzzle instead of being told nothing. The psychologist is 95%
confident regarding all three conclusions.
Linear Contrasts
Some research questions can be expressed in terms of a linear contrast of
population means. For example, in an experiment that compares two costly
treatments (Treatments 1 and 2) with a new inexpensive treatment (Treatment 3),
the magnitude of the linear contrast $(\mu_1 + \mu_2)/2 - \mu_3$ might provide valuable information regarding the relative costs and benefits of the new treatment. In general, a linear contrast can be expressed as $\sum_{j=1}^{k} c_j \mu_j$ where $c_j$ is called a contrast coefficient. Statistical packages and some statistical formulas require values of the contrast coefficients. For instance, $(\mu_1 + \mu_2)/2 - \mu_3$ can be expressed as $\frac{\mu_1}{2} + \frac{\mu_2}{2} - \mu_3$, which can then be expressed as $(\tfrac{1}{2})\mu_1 + (\tfrac{1}{2})\mu_2 + (-1)\mu_3$, so that the contrast coefficients are $c_1 = .5$, $c_2 = .5$, and $c_3 = -1$. Consider another example where
Treatment 1 is delivered to groups 1 and 2 by experimenters A and B and Treatment
2 is delivered to groups 3 and 4 by experimenters C and D. In this study we might
want to estimate $(\mu_1 + \mu_2)/2 - (\mu_3 + \mu_4)/2$, which can be expressed as $(\tfrac{1}{2})\mu_1 + (\tfrac{1}{2})\mu_2 + (-\tfrac{1}{2})\mu_3 + (-\tfrac{1}{2})\mu_4$, so that the contrast coefficients are $c_1 = .5$, $c_2 = .5$, $c_3 = -.5$, and $c_4 = -.5$.
A 100(1 − α)% confidence interval for $\sum_{j=1}^{k} c_j \mu_j$ is

$$\sum_{j=1}^{k} c_j \hat{\mu}_j \pm t_{\alpha/2;\,df}\sqrt{\hat{\sigma}_p^2 \sum_{j=1}^{k} c_j^2/n_j} \qquad (3.1)$$

where $\hat{\sigma}_p^2 = \left[\sum_{j=1}^{k}(n_j - 1)\hat{\sigma}_j^2\right]/df$, $df = \left(\sum_{j=1}^{k} n_j\right) - k$, and $\sqrt{\hat{\sigma}_p^2 \sum_{j=1}^{k} c_j^2/n_j}$ is the estimated standard error of $\sum_{j=1}^{k} c_j \hat{\mu}_j$.
Example 3.2. Ninety randomly selected elderly patients being treated for depression were
randomized into three drug treatment conditions with the first group receiving Prozac,
the second group receiving Zoloft, and the third group receiving Wellbutrin. Prozac and
Zoloft are each a selective serotonin reuptake inhibitor (SSRI) and Wellbutrin is a
norepinephrine and dopamine reuptake inhibitor (NDRI). The researcher wants to
compare the effectiveness of SSRI and NDRI drugs in a study population of elderly
depressed patients. Sample means and variances for a measure of depression following
treatment are given below.
                     Treatment 1    Treatment 2    Treatment 3
$\hat{\mu}_j$            24.9           23.1           31.6
$\hat{\sigma}_j^2$       27.2           21.8           24.8
$n_j$                     30             30             30
The 95% confidence interval for $(\mu_1 + \mu_2)/2 - \mu_3$ is [-9.82, -5.38]. The researcher is 95%
confident that the population mean depression score averaged across the two SSRI drug
treatments is 5.38 to 9.82 lower than the population mean depression score under the
NDRI drug treatment in the study population of elderly patients.
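A minimal sketch of Equation 3.1 applied to these summary statistics follows; it reproduces the interval above to within rounding of the reported summary values:

```python
# Sketch: 95% confidence interval for a linear contrast (Equation 3.1),
# using the summary statistics from Example 3.2.
import numpy as np
from scipy import stats

means = np.array([24.9, 23.1, 31.6])   # mu-hat for each group
vars_ = np.array([27.2, 21.8, 24.8])   # sigma-hat^2 for each group
n = np.array([30, 30, 30])
c = np.array([0.5, 0.5, -1.0])         # contrast: (mu1 + mu2)/2 - mu3

df = n.sum() - len(n)                         # 87
var_p = np.sum((n - 1) * vars_) / df          # pooled within-group variance
est = np.sum(c * means)                       # -7.6
se = np.sqrt(var_p * np.sum(c**2 / n))        # estimated standard error
t_crit = stats.t.ppf(0.975, df)
print(est - t_crit * se, est + t_crit * se)   # about (-9.80, -5.40)
```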
Hypothesis Tests for Linear Contrasts
The three-decision rule may be used to assess the following null and alternative hypotheses regarding the value of $\sum_{j=1}^{k} c_j \mu_j$.

H0: $\sum_{j=1}^{k} c_j \mu_j = 0$
H1: $\sum_{j=1}^{k} c_j \mu_j > 0$
H2: $\sum_{j=1}^{k} c_j \mu_j < 0$

A confidence interval for $\sum_{j=1}^{k} c_j \mu_j$ may be used to test the above hypotheses. If the lower limit for $\sum_{j=1}^{k} c_j \mu_j$ is greater than 0, then reject H0 and accept H1. If the upper limit for $\sum_{j=1}^{k} c_j \mu_j$ is less than 0, then reject H0 and accept H2. The results are inconclusive if the confidence interval includes 0. The test statistic for testing H0 is $t = \sum_{j=1}^{k} c_j \hat{\mu}_j \Big/ \sqrt{\hat{\sigma}_p^2 \sum_{j=1}^{k} c_j^2/n_j}$. SPSS will compute the p-value for this test statistic, which can be used to decide if H0: $\sum_{j=1}^{k} c_j \mu_j = 0$ can be rejected.
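Continuing Example 3.2, a short sketch of the test statistic and its two-sided p-value; the estimate, standard error, and df are taken from the previous sketch:

```python
# Sketch: test statistic and p-value for H0: sum(c_j * mu_j) = 0,
# continuing Example 3.2 (est, se, and df computed in the previous sketch).
from scipy import stats

est, se, df = -7.6, 1.109, 87
t = est / se                       # about -6.85
p = 2 * stats.t.sf(abs(t), df)     # two-sided p-value
print(t, p)                        # p is far below .05, so H0 is rejected
```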
One-way Analysis of Variance
The total amount of variability of the quantitative dependent variable scores in a one-factor study can be decomposed into two sources of variability – the variance of scores within treatments (also called error variance) and the variance due to mean differences across treatments (also called between-group variance). The decomposition of variability in a one-factor study can be summarized in a one-way analysis of variance (one-way ANOVA) table, where SS stands for sum of squares, MS stands for mean square, and n is the total sample size ($n = n_1 + n_2 + \cdots + n_k$). The components of the ANOVA table for a one-factor study are shown below.
Source     SS    df            MS              F
__________________________________________________________
BETWEEN    SSB   dfB = k – 1   MSB = SSB/dfB   MSB/MSE
ERROR      SSE   dfE = n – k   MSE = SSE/dfE
TOTAL      SST   dfT = n – 1
__________________________________________________________
The sum of squares formulas are

$$SSB = \sum_{j=1}^{k} n_j(\hat{\mu}_j - \hat{\mu}_+)^2 \quad\text{where } \hat{\mu}_+ = \sum_{j=1}^{k}\sum_{i=1}^{n_j} y_{ij} \Big/ \sum_{j=1}^{k} n_j$$

$$SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \hat{\mu}_j)^2 = \sum_{j=1}^{k}(n_j - 1)\hat{\sigma}_j^2$$

$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \hat{\mu}_+)^2 = SSB + SSE.$$
SSB will equal zero if all sample means are equal and will be large if the sample means are highly unequal. MSE = SSE/dfE is called the mean squared error and is equal to the pooled within-group variance ($\hat{\sigma}_p^2$) defined in Equation 3.1.

The values in the ANOVA table may be used to estimate $\eta^2$ (eta-squared), which describes the proportion of dependent variable variance in the population that is predicted by the independent variable. Eta-squared is a measure of effect size for a one-factor study. An estimate of $\eta^2$ is

$$\hat{\eta}^2 = SSB/SST. \qquad (3.2)$$
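The following sketch computes the one-way ANOVA quantities above directly from the formulas; the raw scores are fabricated for illustration:

```python
# Sketch: one-way ANOVA sums of squares, F statistic, and eta-squared
# computed from the formulas above (data are fabricated).
import numpy as np
from scipy import stats

groups = [np.array([12., 15, 14, 11]),
          np.array([18., 20, 17, 19]),
          np.array([13., 16, 15, 14])]

k = len(groups)
n = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()                    # mu-hat-plus

ssb = sum(len(g) * (g.mean() - grand)**2 for g in groups)
sse = sum(((g - g.mean())**2).sum() for g in groups)
sst = ssb + sse

msb, mse = ssb / (k - 1), sse / (n - k)
F = msb / mse
p = stats.f.sf(F, k - 1, n - k)
eta_sq = ssb / sst                                       # Equation 3.2
print(F, p, eta_sq)
```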
The F statistic (F = MSB/MSE) from the ANOVA table is used to test the null hypothesis H0: $\mu_1 = \mu_2 = \cdots = \mu_k$ against an alternative hypothesis that at least one pair of population means is not equal. This type of hypothesis test is called an F-test. The null and alternative hypotheses that are tested by the F-test also may be expressed as H0: $\eta^2 = 0$ and H1: $\eta^2 > 0$.
In APA journals, it is common to report the F statistic and its two degrees of freedom (e.g., dfB and dfE for a one-way ANOVA) and the p-value associated with the F statistic. It is common practice to declare the ANOVA test to be "significant" when the p-value is less than .05, but it is important to remember that a significant result simply indicates a rejection of H0. Furthermore, a p-value greater than .05 should not be interpreted as evidence that H0: $\eta^2 = 0$ is true. The test of equal population means in the one-way ANOVA has been criticized as being uninformative because in any real experiment H0: $\eta^2 = 0$ is known to be false and H1: $\eta^2 > 0$ is known to be true prior to conducting the study. Unlike the test of H0: $\eta^2 = 0$, a confidence interval for $\eta^2$ will provide useful information regarding the magnitude of $\eta^2$.
Rejecting H0: $\eta^2 = 0$ in a one-factor experiment does not reveal anything about how the population means are ordered or the magnitudes of the population mean differences. When reporting the results of a one-factor experiment, the F statistic and p-value should be supplemented with a confidence interval for $\eta^2$, pairwise differences in means, or a linear contrast of means. Computing a confidence interval for $\eta^2$ is computationally intensive, but routines for doing so are available in R.
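As a rough illustration of what such a routine does, the sketch below inverts the noncentral F distribution; the conversion $\eta^2 = \lambda/(\lambda + N)$ and the finite search bound are assumptions of this sketch, not a prescription from these notes:

```python
# Rough sketch: confidence interval for eta-squared by pivoting the
# noncentral F distribution. Assumes eta^2 = lambda / (lambda + N),
# where N is the total sample size in a one-way design.
from scipy import stats
from scipy.optimize import brentq

def eta_sq_ci(F, df1, df2, conf=0.95):
    N = df1 + df2 + 1            # total sample size (df1 = k-1, df2 = N-k)

    def lam(target):
        # Noncentrality lambda solving ncf.cdf(F; df1, df2, lambda) = target.
        g = lambda l: stats.ncf.cdf(F, df1, df2, l) - target
        if g(1e-8) < 0:          # no solution: limit is at the boundary
            return 0.0
        return brentq(g, 1e-8, 1000)   # upper bound assumed adequate here

    lam_lo, lam_hi = lam((1 + conf) / 2), lam((1 - conf) / 2)
    return lam_lo / (lam_lo + N), lam_hi / (lam_hi + N)

print(eta_sq_ci(F=7.2, df1=2, df2=87))   # hypothetical F from a k = 3 study
```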
Two-Factor Experiments
Human behavior is complex and is influenced in many different ways. In a one-factor experiment, the researcher is able to assess the causal effect of only one independent variable on the dependent variable. The causal effects of two independent variables on the dependent variable can be assessed in a two-factor experiment. The two factors will be referred to as "Factor A" and "Factor B". The simplest type of two-factor experiment has two levels of Factor A and two levels of Factor B. We call this a 2 × 2 factorial experiment. If Factor A had 4 levels and Factor B had 3 levels, it would be called a 4 × 3 factorial experiment. In general, an a × b factorial experiment has a levels of Factor A and b levels of Factor B.
There are two types of two-factor between-subjects experiments. In one case, both
factors are between-subjects treatment factors. In the other case, one factor is a
treatment factor and the other is a classification factor. A classification factor is a
factor with levels to which participants are classified according to some existing
characteristic such as sex, age, political affiliation, etc. A study also can have two
classification factors, but then it would be a nonexperimental design.
Example 3.3. An experiment with two treatment factors takes randomly selected enlisted
military personnel and randomizes them to one of four treatment conditions: 24 hours of
sleep deprivation and 15 hours without food; 36 hours of sleep deprivation and 15 hours
without food; 24 hours of sleep deprivation and 30 hours without food; and 36 hours of
sleep deprivation and 30 hours without food. One treatment factor is hours of sleep
deprivation (24 or 36 hours) and the other treatment factor is hours of food deprivation
(15 or 30 hours). The dependent variable is the score on a complex problem-solving task.
Example 3.4. An experiment with one classification factor and one treatment factor uses
a random sample of males and a random sample of females from a volunteer list of
students taking introductory chemistry. The males were randomized into two groups with
one group receiving 4 hours of chemistry review and the other group receiving 6 hours of
chemistry review. The females also were randomly divided into two groups with one group
receiving 4 hours of chemistry review and the other group receiving 6 hours of chemistry
review. The treatment factor is the amount of review (4 or 6 hours) and the classification
factor is sex. The dependent variable is the score on the final comprehensive exam.
One advantage of a two-factor experiment is that the effects of both Factor A and
Factor B can be assessed in a single study. Questions about the effects of Factor A
and Factor B could be answered using two separate one-factor experiments.
However, two one-factor experiments would require at least twice the total number
of participants to obtain confidence intervals with the same precision or hypothesis
tests with the same power that could be obtained from a single two-factor
experiment. Thus, a single two-factor experiment is more economical than two
one-factor experiments.
A two-factor experiment also can provide information that cannot be obtained
from two one-factor experiments. Specifically, a two-factor experiment can
provide unique information about the interaction effect between Factor A and
Factor B. An interaction effect occurs when the effect of Factor A is not the same
across the levels of Factor B (which is equivalent to saying that the effect of Factor
B is not the same across the levels of Factor A).
The inclusion of a second factor can improve the external validity of an experiment.
For instance, if there is a concern that participants might perform a particular task
differently in the morning than in the afternoon, then time of day (e.g., morning
vs. afternoon) could serve as a second 2-level factor in the experiment. If the
interaction effect between the treatment factor and the time-of-day factor is small,
then the effect for treatment would generalize to both morning and afternoon
testing conditions, thus increasing the external validity of the experiment.
The external validity of an experiment also can be improved by including a
classification factor. In stratified random sampling, random samples are taken
from two or more different study populations that differ geographically or in other
demographic characteristics. If the interaction between the classification factor
and the treatment factor is small, then the effect of treatment can be generalized
to the multiple study populations, thereby increasing the external validity of the
experiment.
The inclusion of a classification factor also can reduce error variance (MSE), which
will in turn increase the power of statistical tests and reduce the widths of
confidence intervals. For instance, in a one-factor experiment that compares
several treatment conditions using both male and female participants, if females
tend to score higher than males, then this will increase the error variance (the
variance of scores within treatments). If gender is added as a classification factor,
the error variance will then be determined by the variability of scores within each
treatment and within each gender, which could result in a smaller MSE value.
The population means for a 2 × 2 design are shown below.

                     Factor B
                   b1           b2
Factor A   a1   $\mu_{11}$   $\mu_{12}$
           a2   $\mu_{21}$   $\mu_{22}$
The main effects of Factor A and Factor B are defined below.

A: $(\mu_{11} + \mu_{12})/2 - (\mu_{21} + \mu_{22})/2$
B: $(\mu_{11} + \mu_{21})/2 - (\mu_{12} + \mu_{22})/2$

The AB interaction effect is defined as

AB: $(\mu_{11} - \mu_{12}) - (\mu_{21} - \mu_{22})$ or equivalently $(\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22})$
The simple main effects of A and B are defined below.

A at b1: $\mu_{11} - \mu_{21}$
A at b2: $\mu_{12} - \mu_{22}$
B at a1: $\mu_{11} - \mu_{12}$
B at a2: $\mu_{21} - \mu_{22}$
The interaction effect can be expressed as a difference in simple main effects, specifically $(\mu_{11} - \mu_{12}) - (\mu_{21} - \mu_{22})$ = (B at a1) – (B at a2), or equivalently, $(\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22})$ = (A at b1) – (A at b2).

The main effects can be expressed as averages of simple main effects. The main effect of A is equal to (A at b1 + A at b2)/2 = $(\mu_{11} - \mu_{21} + \mu_{12} - \mu_{22})/2$ = $(\mu_{11} + \mu_{12})/2 - (\mu_{21} + \mu_{22})/2$. The main effect of B is equal to (B at a1 + B at a2)/2 = $(\mu_{11} - \mu_{12} + \mu_{21} - \mu_{22})/2$ = $(\mu_{11} + \mu_{21})/2 - (\mu_{12} + \mu_{22})/2$. Note also that all of the
above effects are special cases of a linear contrast of means, and confidence
intervals for these effects may be obtained using Equation 3.1.
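To make these contrasts concrete, the sketch below writes each 2 × 2 effect as a coefficient vector over the cell means ($\mu_{11}, \mu_{12}, \mu_{21}, \mu_{22}$), ready to plug into Equation 3.1; the numerical cell means are hypothetical:

```python
# Sketch: 2 x 2 effects as contrast coefficient vectors over the cell means
# ordered (mu11, mu12, mu21, mu22). The cell means below are hypothetical.
import numpy as np

c_A  = np.array([ 0.5,  0.5, -0.5, -0.5])    # main effect of A
c_B  = np.array([ 0.5, -0.5,  0.5, -0.5])    # main effect of B
c_AB = np.array([ 1.0, -1.0, -1.0,  1.0])    # AB interaction
c_A_at_b1 = np.array([1.0, 0.0, -1.0, 0.0])  # simple main effect: A at b1

mu = np.array([10., 12., 14., 13.])          # hypothetical cell means
print(c_AB @ mu)                             # (10 - 12) - (14 - 13) = -3
```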
The main effect of A (which is the average of A at b1 and A at b2) could be misleading
if the AB interaction is large because A at b1 and A at b2 will be highly dissimilar.
Likewise, the main effect of B (which is the average of B at a1 and B at a2) could be
misleading if the AB interaction is large because B at a1 and B at a2 will be highly
dissimilar. If the AB interaction effect is large, then an analysis of simple main
effects will be more meaningful than an analysis of main effects. If the AB
interaction is small, then an analysis of the main effects of Factor A and Factor B
will not be misleading.
Two-Way Analysis of Variance
Now consider a general a × b factorial design where Factor A has a ≥ 2 levels and Factor B has b ≥ 2 levels. The total variability of the quantitative dependent variable scores in a two-factor design can be decomposed into four sources of variability: the variance due to differences in means across the levels of Factor A, the variance due to differences in means across the levels of Factor B, the variance due to differences in the simple main effects of one factor across the levels of the other factor (the AB interaction), and the variance of scores within treatments (the error variance). The decomposition of the total variance in a two-factor design can be summarized in the following two-way analysis of variance (two-way ANOVA) table where n is the total sample size.
Source    SS     df                     MS                 F
____________________________________________________________________
A         SSA    dfA = a – 1            MSA = SSA/dfA      MSA/MSE
B         SSB    dfB = b – 1            MSB = SSB/dfB      MSB/MSE
AB        SSAB   dfAB = (a – 1)(b – 1)  MSAB = SSAB/dfAB   MSAB/MSE
ERROR     SSE    dfE = n – ab           MSE = SSE/dfE
TOTAL     SST    dfT = n – 1
____________________________________________________________________
The TOTAL and ERROR sum of squares formulas in a two-way ANOVA shown below are conceptually similar to the one-way ANOVA formulas

$$SST = \sum_{k=1}^{b}\sum_{j=1}^{a}\sum_{i=1}^{n_{jk}} (y_{ijk} - \hat{\mu}_{++})^2$$

$$SSE = \sum_{k=1}^{b}\sum_{j=1}^{a}\sum_{i=1}^{n_{jk}} (y_{ijk} - \hat{\mu}_{jk})^2$$

where $\hat{\mu}_{++} = \sum_{k=1}^{b}\sum_{j=1}^{a}\sum_{i=1}^{n_{jk}} y_{ijk} \Big/ \left(\sum_{k=1}^{b}\sum_{j=1}^{a} n_{jk}\right)$. The formulas for SSA, SSB, and SSAB are complicated unless the sample sizes are equal. If all sample sizes are equal to $n_0$, the formulas for SSA, SSB, and SSAB are

$$SSA = b n_0 \sum_{j=1}^{a} (\hat{\mu}_{j+} - \hat{\mu}_{++})^2 \quad\text{where } \hat{\mu}_{j+} = \sum_{k=1}^{b}\sum_{i=1}^{n_0} y_{ijk}/bn_0$$

$$SSB = a n_0 \sum_{k=1}^{b} (\hat{\mu}_{+k} - \hat{\mu}_{++})^2 \quad\text{where } \hat{\mu}_{+k} = \sum_{j=1}^{a}\sum_{i=1}^{n_0} y_{ijk}/an_0$$

$$SSAB = SST - SSE - SSA - SSB.$$
Partial eta-squared estimates of $\eta_A^2$, $\eta_B^2$, and $\eta_{AB}^2$ are computed from the ANOVA sum of squares estimates, as shown below.

$$\hat{\eta}_A^2 = SSA/(SST - SSAB - SSB) \qquad (3.3a)$$
$$\hat{\eta}_B^2 = SSB/(SST - SSAB - SSA) \qquad (3.3b)$$
$$\hat{\eta}_{AB}^2 = SSAB/(SST - SSA - SSB) \qquad (3.3c)$$
The partial eta-squared measures are called "partial" effect sizes because variability in the dependent variable due to the effects of the other factors is removed, which is clearly seen in the denominators of the partial eta-squared estimates. For instance, the denominator of the partial eta-squared for Factor A subtracts the sums of squares due to Factor B and the AB interaction.
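A sketch of these computations for a balanced design follows; the 2 × 2 data below are fabricated, with $n_0 = 4$ scores per cell:

```python
# Sketch: balanced two-way ANOVA sums of squares and partial eta-squared
# (Equations 3.3a-3.3c), with fabricated data for a 2 x 2 design, n0 = 4.
import numpy as np

# cells[j][k] holds the scores for level j of Factor A and level k of Factor B
cells = np.array([[[11., 13, 12, 14], [15., 17, 16, 18]],
                  [[12., 14, 13, 15], [22., 24, 23, 25]]])
a, b, n0 = cells.shape

grand = cells.mean()                                   # mu-hat-(++)
sst = ((cells - grand)**2).sum()
sse = ((cells - cells.mean(axis=2, keepdims=True))**2).sum()
ssa = b * n0 * ((cells.mean(axis=(1, 2)) - grand)**2).sum()
ssb = a * n0 * ((cells.mean(axis=(0, 2)) - grand)**2).sum()
ssab = sst - sse - ssa - ssb

peta_a  = ssa  / (sst - ssab - ssb)    # Equation 3.3a
peta_b  = ssb  / (sst - ssab - ssa)    # Equation 3.3b
peta_ab = ssab / (sst - ssa - ssb)     # Equation 3.3c
print(peta_a, peta_b, peta_ab)
```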
The tests of H0: $\eta_A^2 = 0$, H0: $\eta_B^2 = 0$, and H0: $\eta_{AB}^2 = 0$ suffer from the same problem as hypothesis testing in the one-way ANOVA in that a "significant" result simply indicates that the null hypothesis can be rejected, and a "nonsignificant" result does not imply that the effect is zero. The new APA guidelines require the F statistics and p-values for each effect to be supplemented with confidence intervals for appropriate measures of effect size (e.g., pairwise differences, eta-squared, linear contrasts).
Although a "nonsignificant" (i.e., inconclusive) test for the AB interaction effect does not imply that the population interaction effect is zero, it is customary to examine main effects rather than simple main effects if the AB interaction test is inconclusive. If the test for the AB interaction effect is "significant", it is customary to analyze simple main effects.
Assumptions
The assumptions for ANOVA tests, pairwise confidence intervals, a confidence interval for a linear contrast, and a confidence interval for $\eta^2$ are the same as the assumptions for a confidence interval for $\mu_1 - \mu_2$ described previously (random sampling, independence among participants, approximate normality of the response variable, and equal population variances).
The effects of assumption violations for the ANOVA tests, pairwise confidence intervals, and linear contrast confidence intervals are identical to those for the confidence interval for $\mu_1 - \mu_2$ described previously. However, the confidence interval for $\eta^2$ is sensitive to a violation of the normality assumption, even in large samples. A data transformation (log, square root, reciprocal) might help to reduce skewness of the response variable.
If the sample sizes are not approximately equal, a violation of the equal variance assumption is a serious concern. SPSS provides unequal variance options for linear contrasts, all pairwise comparisons (the Games-Howell method), and the one-way ANOVA test (the Welch test) that should be used when the sample sizes are not similar.
Sample Size Planning
The sample size requirement per group to estimate a linear contrast of k population means with desired confidence and precision is approximately

$$n_j = 4\tilde{\sigma}^2 \left(\sum_{j=1}^{k} c_j^2\right)(z_{\alpha/2}/w)^2 \qquad (3.4)$$

where $\tilde{\sigma}^2$ is the planning value of the average within-group variance and w is the desired confidence interval width. Equation 3.4 also may be used for factorial designs where k = ab is the total number of treatment combinations. Note that Equation 3.4 reduces to Equation 2.3 for the special case of comparing two means. The MSE from a previous study that used the same dependent variable as the proposed study could be used as a planning value for the average within-group variance. Replace α with α* = α/v in Equation 3.4 when examining v linear contrasts.
Example 3.5. A psychologist wants to estimate $(\mu_{11} + \mu_{12})/2 - (\mu_{21} + \mu_{22})/2$ in a 2 × 2 factorial experiment with 95% confidence, a desired confidence interval width of 3.0, and a planning value of 8.0 for the average within-group error variance. The contrast coefficients are 1/2, 1/2, -1/2, and -1/2. The sample size requirement per group is approximately $n_j = 4(8.0)(1/4 + 1/4 + 1/4 + 1/4)(1.96/3.0)^2 = 13.7 \approx 14$.
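A one-line check of this computation, using scipy only for the critical z value:

```python
# Sketch: Equation 3.4 applied to Example 3.5.
import numpy as np
from scipy import stats

var_plan = 8.0                          # planning value, within-group variance
c = np.array([0.5, 0.5, -0.5, -0.5])    # contrast coefficients
w = 3.0                                 # desired confidence interval width
z = stats.norm.ppf(0.975)               # about 1.96 for 95% confidence

n_per_group = 4 * var_plan * np.sum(c**2) * (z / w)**2
print(np.ceil(n_per_group))             # 14, matching Example 3.5
```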
A simple formula for approximating the sample size needed to obtain a confidence interval for $\eta^2$ having a desired width is currently not available. However, if sample data can be obtained in two stages, then the confidence interval width for $\eta^2$ obtained in the first-stage sample can be used in Equation 1.7 to approximate the additional number of participants needed in the second-stage sample.
Example 3.6. A first-stage sample size of 12 participants per group in a one-factor experiment gave a 95% confidence interval for $\eta^2$ with a width of 0.51. The psychologist would like to obtain a 95% confidence interval for $\eta^2$ that has a width of 0.30. To achieve this goal, $[(0.51/0.30)^2 - 1]12 = 22.7 \approx 23$ additional participants per group are needed.
Using Prior Information
Suppose a population mean difference for a particular response variable has been estimated in a previous study and also in a new study. The previous study used a random sample to estimate $\mu_1 - \mu_2$ from one study population, and the new study used a random sample to estimate $\mu_3 - \mu_4$ from another study population. This is a 2 × 2 factorial design with a classification factor where Study 1 and Study 2 are the levels of the classification factor. The two study populations are assumed to be conceptually similar. If a confidence interval for $(\mu_1 - \mu_2) - (\mu_3 - \mu_4)$ suggests that $\mu_1 - \mu_2$ and $\mu_3 - \mu_4$ are not too dissimilar, then the researcher might want to compute a confidence interval for $(\mu_1 + \mu_3)/2 - (\mu_2 + \mu_4)/2$. A confidence interval for $(\mu_1 + \mu_3)/2 - (\mu_2 + \mu_4)/2$ will have greater external validity and could be substantially narrower than the confidence interval for $\mu_1 - \mu_2$ or $\mu_3 - \mu_4$. A 100(1 − α)% confidence interval for $(\mu_1 + \mu_3)/2 - (\mu_2 + \mu_4)/2$ is obtained from Equation 3.1, with contrast coefficients of $c_1 = .5$, $c_2 = -.5$, $c_3 = .5$, and $c_4 = -.5$.
Example 3.12. An eye-witness identification study with 20 participants per group at Kansas State University assessed participants' certainty in their selection of a suspect individual from a photo lineup after viewing a short video of a crime scene. Two treatment conditions were assessed in each study. In the first treatment condition the participants were told that the target individual "will be" in a 5-person photo lineup, and in the second treatment condition participants were told that the target individual "might be" in a 5-person photo lineup. The suspect was included in the lineup in both instruction conditions. The estimated means were 7.4 and 6.3 and the estimated standard deviations were 1.7 and 2.3 in the "will be" and "might be" conditions, respectively. This study was replicated at UCLA using 40 participants per group. In the UCLA study, the estimated means were 6.9 and 5.7, and the estimated standard deviations were 1.5 and 2.0 in the "will be" and "might be" conditions, respectively. A 95% confidence interval for $(\mu_1 - \mu_2) - (\mu_3 - \mu_4)$ indicated that $\mu_1 - \mu_2$ and $\mu_3 - \mu_4$ do not appear to be substantially dissimilar. The 95% confidence interval for $(\mu_1 + \mu_3)/2 - (\mu_2 + \mu_4)/2$, which describes the Kansas State and UCLA study populations, was [0.43, 1.87].
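A sketch of Equation 3.1 applied to these summary statistics; the group ordering below is an assumption, and the result matches the reported interval to within rounding:

```python
# Sketch: Equation 3.1 applied to the two-study contrast in Example 3.12.
# Groups ordered (will-be KSU, might-be KSU, will-be UCLA, might-be UCLA).
import numpy as np
from scipy import stats

means = np.array([7.4, 6.3, 6.9, 5.7])
sds   = np.array([1.7, 2.3, 1.5, 2.0])
n     = np.array([20, 20, 40, 40])
c     = np.array([0.5, -0.5, 0.5, -0.5])   # (mu1 + mu3)/2 - (mu2 + mu4)/2

df = n.sum() - len(n)                        # 116
var_p = np.sum((n - 1) * sds**2) / df        # pooled within-group variance
est = np.sum(c * means)                      # 1.15
se = np.sqrt(var_p * np.sum(c**2 / n))
t_crit = stats.t.ppf(0.975, df)
print(est - t_crit * se, est + t_crit * se)  # about (0.44, 1.86)
```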
Graphing Results
The two-group bar chart with 95% confidence interval bars described in Module 2
can be extended in an obvious way to accommodate single-factor designs with
more than two groups. Results of a two-factor design can be illustrated using a
clustered bar chart where the means for the levels of one factor are represented by
a cluster of contiguous bars (with different colors, shades, or patterns) and the
levels of the second factor are represented by different clusters. An example of a
clustered bar chart for a 2 × 2 design is shown below.

[Figure: clustered bar chart of the four cell means in a 2 × 2 design, with 95% confidence interval bars]
If one factor is more interesting than the other factor, the clustered levels should
represent the more interesting factor because it is easier to visually compare means
within a cluster than across clusters. In the above graph, it is easy to see that the means for level 2 of Factor A are greater than the means for level 1 of Factor A.