D.G. Bonett (12/2016)
Module 4
Within-subjects Experiments
An experiment where each participant is measured under all a ≥ 2 treatment conditions,
with treatments presented in random order, is called a within-subjects experiment (also
called a randomized block design). The independent variable in a within-subjects
experiment is called a within-subjects factor. With the same participants used in all a
treatment conditions, the total sample size in a within-subjects experiment will be a times
smaller than in a comparable between-subjects experiment where participants are
randomized into a groups and each participant receives only one treatment. For instance,
suppose a researcher wants to compare two medications. Instead of using, say, 20
participants per group in a two-group experiment (total of 40 participants) where the first
group receives one medication and the second group receives the other medication, one
group of 20 participants could be evaluated after receiving one medication, and then later
the same group of participants could be evaluated after receiving the other medication.
In a 2-level within-subjects experiment, the goal is to estimate μ1 − μ2 where μ1 is the
mean of the dependent variable if everyone in the study population had received
Treatment 1, and μ2 is the mean of the dependent variable if everyone in the study
population had received Treatment 2. These interpretations of μ1 and μ2 assume no
practice effect, fatigue effect, or carryover effect. A carryover effect occurs when the effect
of one treatment persists during exposure to a second treatment. When this occurs, the
dependent variable under the second treatment will also be affected by the first treatment.
If there are no practice, fatigue, or carryover effects, the interpretations of μ1 and μ2 in a
within-subjects experiment are the same as in a between-subjects experiment.
Confidence Interval for a Population Mean Difference
Consider a random sample of n participants who have been measured under two
treatment conditions. The two measurements for participant i are y_i1 and y_i2. Compute a
difference score d_i = y_i1 − y_i2 for each of the n participants. Let d̄ be the sample mean of
the n difference scores and let σ̂d² be the sample variance of the n difference scores. It can
be shown that d̄ = ȳ1 − ȳ2; that is, the mean of the difference scores is equal to the
difference in the means. A 100(1 − α)% confidence interval for μ1 − μ2 is

d̄ ± t(α/2;df) √(σ̂d²/n)                                                 (4.1)

where df = n − 1 and √(σ̂d²/n) is the estimated standard error of ȳ1 − ȳ2.
It can be shown that σ̂d² = σ̂1² + σ̂2² − 2ρ̂12σ̂1σ̂2, where ρ̂12 is the sample Pearson correlation
between the two measurements (a Pearson correlation is a measure of association
between two quantitative variables). From this equation we see that the variance of the
difference scores is smaller for larger values of the correlation between the measurements.
It is common for the two measurements in a within-subjects experiment to be moderately
or highly correlated, and the variance of the difference scores is often much smaller than
the variance of either y1 or y2. From Equation 4.1, it is clear that smaller values of σ̂d² give
narrower confidence intervals for μ1 − μ2.
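This relationship can be checked numerically. The sketch below uses invented paired scores (all values are illustrative only) and verifies that the variance of the difference scores equals σ̂1² + σ̂2² − 2ρ̂12σ̂1σ̂2:

```python
from statistics import mean, variance

# Invented paired scores, for illustration only (not from the module's examples)
y1 = [12, 15, 11, 14, 13, 16]
y2 = [10, 14, 10, 11, 12, 13]
n = len(y1)

d = [a - b for a, b in zip(y1, y2)]   # difference scores d_i
var_d = variance(d)                   # sample variance of the d_i

var1, var2 = variance(y1), variance(y2)
m1, m2 = mean(y1), mean(y2)
cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (n - 1)
r12 = cov / (var1 * var2) ** 0.5      # sample Pearson correlation

# Identity: var(d) = var(y1) + var(y2) - 2 * r12 * sd1 * sd2
rhs = var1 + var2 - 2 * r12 * var1 ** 0.5 * var2 ** 0.5
print(round(var_d, 6), round(rhs, 6))
```

The identity holds exactly for any paired sample, not just for these numbers.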
Confidence Interval for a Population Standardized Mean Difference
The population standardized mean difference in a within-subjects experiment is defined
in exactly the same way as in a between-subjects experiment. A standardized mean
difference might be easier to explain than μ1 − μ2 in applications where the metric of the
dependent variable is not familiar to the intended audience.
A 100(1 − α)% confidence interval for δ = (μ1 − μ2)/√((σ1² + σ2²)/2) in a within-subjects
design is

δ̂ ± z(α/2) SE(δ̂)                                                       (4.2)

where δ̂ = (ȳ1 − ȳ2)/√((σ̂1² + σ̂2²)/2) and

SE(δ̂) = √( δ̂²(1 + ρ̂12²)/(4(n − 1)) + 2(1 − ρ̂12)/n ).

Equation 4.2 assumes equal population variances, but more complicated confidence
interval methods for δ are available that do not make this assumption. An alternative
version of δ uses the standard deviation from one treatment condition (usually a control
condition) as the standardizer.
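A small sketch of Equation 4.2, using invented paired data and the equal-variance standard error given above:

```python
from statistics import mean, variance

# Invented paired data, for illustration only
y1 = [10, 12, 11, 13]
y2 = [8, 9, 9, 10]
n = len(y1)

var1, var2 = variance(y1), variance(y2)
m1, m2 = mean(y1), mean(y2)
cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (n - 1)
r12 = cov / (var1 * var2) ** 0.5

# Estimated standardized mean difference (average-variance standardizer)
d_hat = (m1 - m2) / ((var1 + var2) / 2) ** 0.5

# Standard error per Equation 4.2 (assumes equal population variances)
se = (d_hat ** 2 * (1 + r12 ** 2) / (4 * (n - 1)) + 2 * (1 - r12) / n) ** 0.5

z = 1.960  # z critical value for 95% confidence
print(round(d_hat, 2), round(d_hat - z * se, 2), round(d_hat + z * se, 2))
```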
Example 4.1. A random sample of eight college students participated in a reaction time study
where they pressed button 1 as soon as they saw the letter E displayed on a computer screen and
pressed button 2 if the letter E was printed backwards. The Es and backwards Es were displayed
two ways: upright (zero rotation) and rotated 60°. The reaction times (in milliseconds) to the
backwards Es are given below.

                    Rotation
Participant      0°        60°
______________________________
     1           621       704
     2           589       690
     3           604       741
     4           543       625
     5           588       724
     6           647       736
     7           639       780
     8           590       625
______________________________
The sample means are ȳ1 = 602.63 and ȳ2 = 703.13 and the sample variances are σ̂1² = 1106.56 and
σ̂2² = 3034.36. The sample Pearson correlation between the two measurements is ρ̂12 = 0.767. The
sample variance of the difference scores is 1329.33. The 95% confidence interval for μ1 − μ2 is
−100.5 ± 2.37√(1329.33/8) = [−131, −70]. The researcher is 95% confident that the mean reaction
time to the rotated backwards E is 70 to 131 ms longer than the mean reaction time to the
unrotated backwards E in the study population of college students.
The 95% confidence interval for δ is

−2.2 ± 1.96√( 2.2²(1 + .767²)/28 + 2(1 − .767)/8 ) = [−3.33, −1.07]

where δ̂ = −100.5/√((1106.6 + 3034.4)/2) = −2.2 is the estimated standardized mean difference. The
researcher is 95% confident that the mean reaction time to the rotated backwards E is 1.07 to 3.33
standard deviations greater than the mean reaction time to the unrotated backwards E in the
study population of college students. (Imagine two normal curves with the curve for the rotated E
shifted to the right by 1.07 to 3.33 standard deviations.) In this example, the researcher would
report the confidence interval for μ1 − μ2 rather than the confidence interval for δ because
reaction time in milliseconds is a well understood and meaningful measurement. However, if
reaction time had been log or square-root transformed to reduce skewness, then δ would be more
easily interpreted than μ1 − μ2.
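The mean-difference interval in Example 4.1 can be reproduced from the raw reaction times; the critical value t(.025; 7) ≈ 2.365 is hard-coded here rather than taken from a table:

```python
from statistics import mean, variance

# Example 4.1 reaction times (ms): 0-degree and 60-degree rotation conditions
y0 = [621, 589, 604, 543, 588, 647, 639, 590]
y60 = [704, 690, 741, 625, 724, 736, 780, 625]
n = len(y0)

d = [a - b for a, b in zip(y0, y60)]
d_bar = mean(d)                  # equals ybar1 - ybar2 = -100.5
var_d = variance(d)

t_crit = 2.365                   # t(.025; 7), hard-coded
half = t_crit * (var_d / n) ** 0.5
print(round(d_bar, 1), round(d_bar - half), round(d_bar + half))
```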
Confidence Interval for a Ratio of Means
If the dependent variable is measured on a ratio scale, a ratio of population means (μ1/μ2)
is a unitless measure of effect size that could be more meaningful and easier to interpret than
a standardized mean difference. An approximate 100(1 − α)% confidence interval for
μ1/μ2 is

exp[ ln(ȳ1/ȳ2) ± t(α/2;df) √( σ̂1²/(nȳ1²) + σ̂2²/(nȳ2²) − 2ρ̂12σ̂1σ̂2/(nȳ1ȳ2) ) ]          (4.3)

where df = n − 1.
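A sketch of Equation 4.3 applied to the Example 4.1 reaction times (reaction time is a ratio-scale measurement); the standard-error expression follows the paired form shown above, and the critical value is hard-coded:

```python
from math import exp, log
from statistics import mean, variance

# Example 4.1 reaction times again; ratio-of-means interval per Equation 4.3
y0 = [621, 589, 604, 543, 588, 647, 639, 590]
y60 = [704, 690, 741, 625, 724, 736, 780, 625]
n = len(y0)

m1, m2 = mean(y0), mean(y60)
v1, v2 = variance(y0), variance(y60)
cov = sum((a - m1) * (b - m2) for a, b in zip(y0, y60)) / (n - 1)
r12 = cov / (v1 * v2) ** 0.5

# Estimated standard error of ln(ybar1/ybar2) for paired measurements
se = (v1 / (n * m1 ** 2) + v2 / (n * m2 ** 2)
      - 2 * r12 * v1 ** 0.5 * v2 ** 0.5 / (n * m1 * m2)) ** 0.5

t_crit = 2.365  # t(.025; 7), hard-coded
lo, hi = exp(log(m1 / m2) - t_crit * se), exp(log(m1 / m2) + t_crit * se)
print(round(m1 / m2, 3), round(lo, 3), round(hi, 3))
```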
Linear Contrasts
In a within-subjects study with a levels, participant i produces a scores (y_i1, y_i2, …, y_ia)
and a linear contrast score for participant i is

g_i = Σ h_j y_ij                                                        (4.4)

where h_j is a contrast coefficient and the sum runs over j = 1 to a. Note that g_i
specializes to a difference score when one coefficient is 1, another coefficient is −1, and all
other coefficients are zero. It can be shown that the mean of the linear contrast scores is
equal to a linear contrast of sample means (i.e., ḡ = Σ h_j ȳ_j). The estimated variance of
the linear contrast scores is σ̂g² = Σ (g_i − ḡ)²/(n − 1), summing over the n participants.
A 100(1 − α)% confidence interval for Σ h_j μ_j is

ḡ ± t(α/2;df) √(σ̂g²/n)                                                  (4.5)

where df = n − 1 and √(σ̂g²/n) is the estimated standard error of Σ h_j ȳ_j. A Bonferroni
adjustment to α in the critical t-value of Equation 4.5 can be made when two or more
simultaneous confidence intervals are required.
Example 4.2. Six 4th year college students were given detailed descriptions of four San Francisco
based manufacturing companies. The first two companies produced electronic products and the
second two companies produced clothing products. The students were asked to rate each company
on a 1 to 50 scale in terms of how interested they would be to work at each company. The ratings
are given below.

Participant   Company 1   Company 2   Company 3   Company 4     g
_________________________________________________________________
     1            24          27          21          20       5.0
     2            37          35          31          29       6.0
     3            20          18          17          18       1.5
     4            45          48          40          41       6.0
     5            49          52          43          40       9.0
     6            32          34          30          27       4.5
_________________________________________________________________

The researcher wants a 95% confidence interval for the linear contrast (μ1 + μ2)/2 − (μ3 + μ4)/2.
The linear contrast score g_i = (y_i1 + y_i2)/2 − (y_i3 + y_i4)/2 was computed and is reported in the last
column. The sample mean of the linear contrast scores is 5.33 with a sample variance of 5.97. A
95% confidence interval for (μ1 + μ2)/2 − (μ3 + μ4)/2 is [2.8, 7.9]. The researcher is 95% confident
that the mean rating, averaged over the two electronic manufacturing companies, is 2.8 to 7.9
points greater than the mean rating, averaged over the two clothing manufacturing companies, in the
study population of 4th year college students.
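The Example 4.2 computations can be reproduced as follows (t(.025; 5) ≈ 2.571 is hard-coded):

```python
from statistics import mean, variance

# Ratings from Example 4.2: rows are participants, columns are Companies 1-4
ratings = [
    [24, 27, 21, 20],
    [37, 35, 31, 29],
    [20, 18, 17, 18],
    [45, 48, 40, 41],
    [49, 52, 43, 40],
    [32, 34, 30, 27],
]
h = [0.5, 0.5, -0.5, -0.5]  # coefficients for (mu1 + mu2)/2 - (mu3 + mu4)/2
n = len(ratings)

g = [sum(c * y for c, y in zip(h, row)) for row in ratings]
g_bar, var_g = mean(g), variance(g)

t_crit = 2.571  # t(.025; 5), hard-coded
half = t_crit * (var_g / n) ** 0.5
print(round(g_bar, 2), round(g_bar - half, 1), round(g_bar + half, 1))
```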
Standardized Linear Contrasts
In applications where a linear contrast of means (Σ h_j μ_j) could be difficult to explain
because the metric of the dependent variable is not familiar to the intended audience, it
might be helpful to report a confidence interval for a standardized linear contrast of
population means

ψ = Σ h_j μ_j / √( Σ σ_j²/a )                                           (4.6)

which is a generalization of the population standardized mean difference. A 100(1 − α)%
confidence interval for ψ that assumes equal population variances and equal population
correlations is

ψ̂ ± z(α/2) SE(ψ̂)                                                       (4.7)

where ψ̂ = Σ h_j ȳ_j / √( Σ σ̂_j²/a ),

SE(ψ̂) = √( ψ̂²[1 + (a − 1)ρ̄²]/(2a(n − 1)) + (1 − ρ̄) Σ h_j²/n ),

and ρ̄ is the average of the sample correlations for the a(a − 1)/2 pairs of measurements.
A confidence interval for ψ that does not require equal population variances and equal
population correlations is available but the standard error formula is more complicated
than the one given in Equation 4.7. An alternative version of ψ uses a standardizer that
averages only those variances corresponding to treatments with non-zero contrast
coefficients. Another version uses the standard deviation from a control condition as the
standardizer.
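A sketch computing ψ̂ and the Equation 4.7 standard error for the Example 4.2 ratings; the SE expression assumes equal population variances and equal population correlations, as stated above:

```python
from statistics import mean, variance

# Ratings from Example 4.2: rows are participants, columns are Companies 1-4
ratings = [
    [24, 27, 21, 20],
    [37, 35, 31, 29],
    [20, 18, 17, 18],
    [45, 48, 40, 41],
    [49, 52, 43, 40],
    [32, 34, 30, 27],
]
h = [0.5, 0.5, -0.5, -0.5]
a, n = 4, len(ratings)
cols = list(zip(*ratings))

contrast = sum(hj * mean(col) for hj, col in zip(h, cols))
avg_var = sum(variance(col) for col in cols) / a
psi_hat = contrast / avg_var ** 0.5     # standardized linear contrast estimate

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / (sxx * syy) ** 0.5

# Average correlation over the a(a - 1)/2 = 6 pairs of measurements
pairs = [(i, j) for i in range(a) for j in range(i + 1, a)]
r_bar = sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)

se = (psi_hat ** 2 * (1 + (a - 1) * r_bar ** 2) / (2 * a * (n - 1))
      + (1 - r_bar) * sum(hj ** 2 for hj in h) / n) ** 0.5
print(round(psi_hat, 2), round(psi_hat - 1.96 * se, 2), round(psi_hat + 1.96 * se, 2))
```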
Hypothesis Testing for a Population Mean Difference
The confidence interval for μ1 − μ2 (Equation 4.1) can be used to test the following
hypotheses.

H0: μ1 = μ2
H1: μ1 > μ2
H2: μ1 < μ2

If the lower confidence limit for μ1 − μ2 is greater than 0, then reject H0 and accept
H1: μ1 > μ2; if the upper confidence limit for μ1 − μ2 is less than 0, then reject H0 and
accept H2: μ1 < μ2. The results are inconclusive if the confidence interval includes 0.
When this three-decision rule is applied to the mean difference in a within-subjects
design, it is commonly referred to as a paired-samples t-test. Statistical packages will
compute the test statistic t = (ȳ1 − ȳ2)/SE(ȳ1 − ȳ2) = d̄/√(σ̂d²/n) and its associated p-value.
A confidence interval for μ1 − μ2 also can be used to choose between the following two
hypotheses in an equivalence test:

H0: |μ1 − μ2| ≤ b
H1: |μ1 − μ2| > b

where b is some value specified by the researcher. Usually b represents a value of μ1 − μ2
that would be considered by experts to be small or unimportant. If the confidence interval
for μ1 − μ2 is completely contained within the range −b to b, then H0 is accepted; if the
confidence interval for μ1 − μ2 is completely outside the range −b to b, then H1 is accepted;
otherwise, the results are inconclusive. The probability of falsely accepting H1: |μ1 − μ2| >
b is at most α.
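Both decision rules reduce to comparisons on the confidence limits; a minimal sketch (function names are illustrative):

```python
def directional_decision(lo, hi):
    """Three-decision rule applied to a CI [lo, hi] for mu1 - mu2."""
    if lo > 0:
        return "accept H1: mu1 > mu2"
    if hi < 0:
        return "accept H2: mu1 < mu2"
    return "inconclusive"

def equivalence_decision(lo, hi, b):
    """Equivalence test: accept H0 if the CI lies entirely inside [-b, b],
    accept H1 if it lies entirely outside, otherwise inconclusive."""
    if -b < lo and hi < b:
        return "accept H0: |mu1 - mu2| <= b"
    if hi < -b or lo > b:
        return "accept H1: |mu1 - mu2| > b"
    return "inconclusive"

# Applied to the Example 4.1 interval [-131, -70]:
print(directional_decision(-131, -70))
print(equivalence_decision(-131, -70, 10))
```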
Hypothesis Testing for a Linear Contrast
The confidence interval for Σ h_j μ_j (Equation 4.5) can be used to test the following
hypotheses.

H0: Σ h_j μ_j = 0
H1: Σ h_j μ_j > 0
H2: Σ h_j μ_j < 0

If the lower confidence limit for Σ h_j μ_j is greater than 0, reject H0 and accept
H1: Σ h_j μ_j > 0; if the upper confidence limit for Σ h_j μ_j is less than 0, reject H0 and
accept H2: Σ h_j μ_j < 0. The results are inconclusive if the confidence interval includes 0.
Statistical packages will compute a test statistic t = Σ h_j ȳ_j /SE(Σ h_j ȳ_j) = ḡ/√(σ̂g²/n) and
its associated p-value.
One-way Within-subjects Analysis of Variance
The variance of the dependent variable scores across participants and treatments in a
one-factor within-subjects study can be decomposed into three sources of variability as shown
in the within-subjects ANOVA table below. The levels of Factor P are the n participants.
Factor A is a fixed factor and Factor P is a random factor.

Source    SS       df               MS                   F
_______________________________________________________________
A         SSA      dfA = a − 1      MSA = SSA/dfA        MSA/MSAP
P         SSP      dfP = n − 1      MSP = SSP/dfP
AP        SSAP     dfAP = dfAdfP    MSAP = SSAP/dfAP
_______________________________________________________________

The sums of squares (SS) are computed as follows

SSA = n Σ_j (ȳ+j − ȳ++)²
SSP = a Σ_i (ȳi+ − ȳ++)²
SSAP = Σ_i Σ_j (y_ij − ȳ++)² − SSA − SSP

where ȳ++ = Σ_i Σ_j y_ij/an, ȳ+j = Σ_i y_ij/n, and ȳi+ = Σ_j y_ij/a.

The estimate of η²_A for the within-subjects factor is η̂²_A = SSA/(SSA + SSP + SSAP). Note that
SSP is included in the denominator because variability among participants contributes to
the total variability of scores. There is currently no simple method for computing a
confidence interval for η²_A, and this limits the usefulness of η²_A as a measure of effect size
in within-subjects designs. If standardized measures of effect size are needed in a
within-subjects experiment, confidence intervals for standardized linear contrasts of means are
recommended.
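The decomposition can be verified on the Example 4.1 data; with a = 2 conditions, the resulting F equals the squared paired-samples t statistic:

```python
from statistics import mean

# Example 4.1 scores: n = 8 participants by a = 2 conditions
scores = [[621, 704], [589, 690], [604, 741], [543, 625],
          [588, 724], [647, 736], [639, 780], [590, 625]]
n, a = len(scores), len(scores[0])

grand = mean(y for row in scores for y in row)
col_means = [mean(col) for col in zip(*scores)]
row_means = [mean(row) for row in scores]

ss_a = n * sum((m - grand) ** 2 for m in col_means)    # treatment SS
ss_p = a * sum((m - grand) ** 2 for m in row_means)    # participant SS
ss_tot = sum((y - grand) ** 2 for row in scores for y in row)
ss_ap = ss_tot - ss_a - ss_p                           # interaction SS

ms_a = ss_a / (a - 1)
ms_ap = ss_ap / ((a - 1) * (n - 1))
f_stat = ms_a / ms_ap
eta_sq = ss_a / (ss_a + ss_p + ss_ap)
print(round(f_stat, 2), round(eta_sq, 3))
```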
The F statistic from the ANOVA table can be used to test H0: μ1 = μ2 = … = μa. A test of
this hypothesis in the within-subjects ANOVA suffers from the same problem as the test
of equal population means in the between-subjects ANOVA. Specifically, we know with near
certainty that the population means across the a levels of the within-subjects factor will
almost never be identical (or equivalently, we know that η²_A will almost never be exactly
zero). Thus a statistical test that simply rejects or fails to reject the null hypothesis of equal
population means does not provide useful scientific information. Although the
within-subjects ANOVA is still widely used, pairwise comparisons or other linear contrasts of the
a within-subjects means (standardized or unstandardized) are the recommended
supplements to the within-subjects ANOVA test of H0: μ1 = μ2 = … = μa.
Two-factor Within-subjects Experiment
In a two-factor within-subjects experiment, all participants are measured under all
combinations of the two factors. In the simple case of a 2 × 2 factorial experiment, all
participants are measured under four conditions (a1b1, a1b2, a2b1, a2b2) with scores for
participant i denoted as y_i11, y_i12, y_i21, and y_i22. The population means under these four
conditions are μ11, μ12, μ21, and μ22. The main effects and interaction effects that were
previously defined for a 2 × 2 between-subjects design also apply to a 2 × 2 within-subjects
design as shown below.

AB interaction effect:      (μ11 − μ12) − (μ21 − μ22)
Main effect of Factor A:    (μ11 + μ12)/2 − (μ21 + μ22)/2
Main effect of Factor B:    (μ11 + μ21)/2 − (μ12 + μ22)/2

The main effects can be misleading if there is an AB interaction effect. Simple main effects
should be examined if an AB interaction effect has been detected. The simple main effects
that were previously defined for a 2 × 2 between-subjects design also apply to a 2 × 2
within-subjects design as shown below.

A at b1:    μ11 − μ21
A at b2:    μ12 − μ22
B at a1:    μ11 − μ12
B at a2:    μ21 − μ22

Confidence intervals for the above effects are obtained by computing the appropriate
linear contrast score for all n participants and then applying Equation 4.5. As one
example, the linear contrast score for the AB interaction for participant i is g_i = y_i11 − y_i12
− y_i21 + y_i22.
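A sketch computing these contrast-score intervals for invented 2 × 2 within-subjects data (the critical value t(.025; 4) ≈ 2.776 is hard-coded):

```python
from statistics import mean, variance

# Invented scores for a 2 x 2 within-subjects design:
# columns are the conditions a1b1, a1b2, a2b1, a2b2
scores = [[14, 11, 10, 9],
          [18, 13, 12, 10],
          [12, 10, 11, 8],
          [16, 12, 9, 9],
          [15, 14, 13, 11]]
n = len(scores)
t_crit = 2.776  # t(.025; 4), hard-coded

def contrast_ci(h):
    """CI for a contrast of the four condition means, via contrast scores."""
    g = [sum(c * y for c, y in zip(h, row)) for row in scores]
    half = t_crit * (variance(g) / n) ** 0.5
    return round(mean(g) - half, 2), round(mean(g) + half, 2)

# AB interaction: (mu11 - mu12) - (mu21 - mu22)
print(contrast_ci([1, -1, -1, 1]))
# Main effect of Factor A: (mu11 + mu12)/2 - (mu21 + mu22)/2
print(contrast_ci([0.5, 0.5, -0.5, -0.5]))
```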
In two-factor within-subjects designs where one or both factors have more than two levels,
pairwise main effects, pairwise interaction effects, or pairwise simple main effects could
be examined. These pairwise effects are defined in exactly the same way as they were in
the case of a two-factor between-subjects design. Consider the following population
means for a 3 × 3 within-subjects design

                  Factor B
               b1      b2      b3
Factor A  a1   μ11     μ12     μ13
          a2   μ21     μ22     μ23
          a3   μ31     μ32     μ33

and consider the following examples.

• The main effect comparison of levels 1 and 2 of Factor A is (μ11 + μ12 + μ13)/3 −
  (μ21 + μ22 + μ23)/3 and the corresponding linear contrast score for participant i is
  g_i = (y_i11 + y_i12 + y_i13)/3 − (y_i21 + y_i22 + y_i23)/3.

• The simple main effect comparison of levels 1 and 2 of Factor A at level 1 of Factor
  B is μ11 − μ21 and the corresponding linear contrast score for participant i is
  g_i = y_i11 − y_i21.

• The interaction comparison of levels 1 and 2 of Factor A and levels 2 and 3 of Factor
  B is μ12 − μ13 − μ22 + μ23 and the corresponding linear contrast score for
  participant i is g_i = y_i12 − y_i13 − y_i22 + y_i23.

A confidence interval for any of these pairwise effects is obtained by computing the
appropriate linear contrast score for all n participants and then applying Equation 4.5.
Two-way Within-subjects Analysis of Variance
The variance of the dependent variable scores across participants and the two
within-subjects factors can be decomposed into seven sources of variability, as shown in the
within-subjects ANOVA table below.

Source    SS        df                  MS                     F
__________________________________________________________________
A         SSA       dfA = a − 1         MSA = SSA/dfA          MSA/MSAP
B         SSB       dfB = b − 1         MSB = SSB/dfB          MSB/MSBP
AB        SSAB      dfAB = dfAdfB       MSAB = SSAB/dfAB       MSAB/MSABP
P         SSP       dfP = n − 1         MSP = SSP/dfP
AP        SSAP      dfAP = dfAdfP       MSAP = SSAP/dfAP
BP        SSBP      dfBP = dfBdfP       MSBP = SSBP/dfBP
ABP       SSABP     dfABP = dfAdfBdfP   MSABP = SSABP/dfABP
__________________________________________________________________

There is currently no simple method for computing a confidence interval for the
population partial eta-squared values in a two-way within-subjects design, and
hypothesis testing in the two-way within-subjects ANOVA suffers from the same
problems as the two-way between-subjects ANOVA. Confidence intervals for linear
contrasts of means or standardized linear contrasts of means are the recommended
alternatives.
Two-factor Mixed Designs
A mixed two-factor design has one between-subjects factor and one within-subjects
factor. The mixed two-factor design is also called a split-plot design. The mixed design
often provides greater power and narrower confidence intervals than a two-factor
between-subjects design. The mixed design might be preferred to a two-factor
within-subjects design if there is a concern about carryover effects for one of the factors. A mixed
design also might be preferred in studies where the levels of one factor are most
conveniently applied to each participant and the levels of the other factor are most
conveniently or appropriately applied to different groups of participants. The two-factor
mixed design is useful in studies where participants may have difficulty responding to all
a × b levels of two within-subjects factors. For instance, instead of measuring participants
under all six treatment conditions of a 2 × 3 within-subjects design, participants could be
randomly divided into two groups with one group receiving treatment a1, the other group
receiving treatment a2, and all participants receiving just the three within-subjects treatments
b1, b2, and b3.
The between-subjects factor can be a treatment factor or a classification factor. The
within-subjects factor can be a treatment factor, where participants are measured under
all treatment conditions in random order, or a repeated factor where participants are
measured on two or more occasions such as before and after exposure to a single
treatment.
The 2 × 2 mixed design is the simplest mixed two-factor design. Consider the following
2 × 2 mixed design where Factor A is the within-subjects factor and Factor B is the
between-subjects factor.

                     Factor A
                     a1       a2
Factor B    b1       μ11      μ21
            b2       μ12      μ22
All of the effects that were previously defined for the 2 × 2 between-subjects design and
the 2 × 2 within-subjects design also apply to the 2 × 2 mixed design. Confidence intervals
for these effects can be computed using a combination of methods and principles
described previously. A confidence interval for the AB interaction is obtained by first
computing a difference score for each participant (h1 = 1 and h2 = −1) and then applying
Equation 2.1 (Module 2) or Equation 3.1 (Module 3) with c1 = 1 and c2 = −1. The
population mean of the difference scores for participants at level b1 is μ11 − μ21, and the
population mean of the difference scores for participants at level b2 is μ12 − μ22. Thus, a
confidence interval for the difference in population mean difference scores gives a
confidence interval for (μ11 − μ12) − (μ21 − μ22).
A confidence interval for the main effect of Factor B (the between-subjects factor) is
obtained by computing an average within-subjects score (h1 = 1/2 and h2 = 1/2) for each
participant and then applying Equation 2.1 or Equation 3.1 with c1 = 1 and c2 = −1. The
population mean of the average of the two scores for participants at level b1 is (μ11 +
μ21)/2, and the mean of the average of the two scores for participants at level b2 is (μ12 +
μ22)/2. Thus, a confidence interval for the difference in population mean average scores
gives a confidence interval for (μ11 + μ21)/2 − (μ12 + μ22)/2, which is the main effect of
Factor B.
A confidence interval for the main effect of Factor A (the within-subjects factor) is
obtained by computing a difference score for each participant (h1 = 1 and h2 = −1) and
then applying Equation 2.1 or Equation 3.1 with coefficients c1 = 1/2 and c2 = 1/2. The
population mean of the difference scores for participants at level b1 is μ11 − μ21, and the
population mean of the difference scores for participants at level b2 is μ12 − μ22. Thus, a
confidence interval for an average of two population difference scores gives a confidence
interval for (μ11 − μ21)/2 + (μ12 − μ22)/2, which is equal to (μ11 + μ12)/2 − (μ21 + μ22)/2,
the main effect of Factor A.
A confidence interval for the simple main effect of A at b1 is obtained by computing a
difference score for each participant (h1 = 1 and h2 = −1) and applying Equation 1.6 or
Equation 3.1 with c1 = 1 and c2 = 0. Likewise, a confidence interval for the simple main
effect of A at b2 is obtained by computing a difference score for each participant (h1 = 1 and
h2 = −1) and applying Equation 1.6 or Equation 3.1 with c1 = 0 and c2 = 1.
A confidence interval for the simple main effect of B at a1 is obtained by computing a linear
contrast score using h1 = 1 and h2 = 0 and applying Equation 2.1 or Equation 3.1 with
c1 = 1 and c2 = −1. Likewise, a confidence interval for the simple main effect of B at a2 is
obtained by computing a linear contrast score using h1 = 0 and h2 = 1 and applying
Equation 2.1 or Equation 3.1 with c1 = 1 and c2 = −1.
The procedures described above suggest a general approach for computing confidence
intervals for the effects in a 2 × 2 mixed design. The basic idea is to compute an
appropriate linear contrast score for the within-subjects factor and then estimate an
appropriate function of the population means for the between-subjects factor. The following
table summarizes the coefficients that define the effects in a 2 × 2 mixed design, where the
c_j coefficients are applied to the levels of the between-subjects factor (Factor B) and the
h_j coefficients define a linear contrast score for the within-subjects factor (Factor A).

 c1     c2     h1     h2     Effect
_____________________________________________________
 1/2    1/2    1      −1     Main effect of A
 1      −1     1/2    1/2    Main effect of B
 1      −1     1      −1     AB interaction effect
 1      −1     1      0      Simple main effect of B at a1
 1      −1     0      1      Simple main effect of B at a2
 1      0      1      −1     Simple main effect of A at b1
 0      1      1      −1     Simple main effect of A at b2
_____________________________________________________
This approach can be used to estimate a wide range of interesting effects in a general
a × b mixed design. Consider a 3 × 4 mixed design where the between-subjects factor
(Factor A) has 3 levels and the within-subjects factor (Factor B) has 4 levels. A pairwise
main effect comparing levels 1 and 2 of Factor B would use c1 = 1/3, c2 = 1/3, c3 = 1/3,
h1 = 1, h2 = −1, h3 = 0, and h4 = 0. A pairwise main effect comparing levels 2 and 3 of Factor
A would use c1 = 0, c2 = 1, c3 = −1, h1 = 1/4, h2 = 1/4, h3 = 1/4, and h4 = 1/4. A pairwise
interaction effect comparing levels 1 and 2 of Factor A and levels 1 and 2 of Factor B would
use c1 = 1, c2 = −1, c3 = 0, h1 = 1, h2 = −1, h3 = 0, and h4 = 0. A Factor B main effect contrast
that compares the average of levels 1 and 2 with level 3 would use c1 = 1/3, c2 = 1/3,
c3 = 1/3, h1 = 1/2, h2 = 1/2, h3 = −1, and h4 = 0.
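The two-step recipe (a within-subjects contrast score per participant, then a between-subjects contrast of the group means of those scores) can be sketched as follows; the pooled-variance interval here stands in for Equation 2.1, which is defined in Module 2, and all data are invented:

```python
from statistics import mean, variance

# Invented 2 x 2 mixed-design data: two groups (levels of between factor B),
# each participant measured at the two levels of within factor A
group_b1 = [[14, 10], [16, 11], [13, 9], [15, 12]]
group_b2 = [[12, 11], [14, 12], [11, 10], [13, 13]]

def effect_ci(groups, c, h, t_crit):
    """CI for sum_k c_k * mean(g) in group k, where g_i = sum_j h_j * y_ij.
    Uses a pooled-variance standard error (equal-variance assumption)."""
    g = [[sum(hj * y for hj, y in zip(h, row)) for row in grp] for grp in groups]
    est = sum(ck * mean(gk) for ck, gk in zip(c, g))
    df = sum(len(gk) - 1 for gk in g)
    pooled = sum((len(gk) - 1) * variance(gk) for gk in g) / df
    se = (pooled * sum(ck ** 2 / len(gk) for ck, gk in zip(c, g))) ** 0.5
    return est, est - t_crit * se, est + t_crit * se

# AB interaction: difference score (h = 1, -1) compared across groups (c = 1, -1);
# t(.025; 6) = 2.447 is hard-coded
est, lo, hi = effect_ci([group_b1, group_b2], [1, -1], [1, -1], t_crit=2.447)
print(round(est, 2), round(lo, 2), round(hi, 2))
```

Changing c and h per the coefficient table above yields the other effects without new machinery.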
Two-way Mixed ANOVA
A two-way ANOVA table for a two-factor mixed design, where Factor B is a
between-subjects factor and Factor A is a within-subjects factor, is shown below, where n is the total
sample size. The notation P/B indicates that Factor P (the "Participant" factor) is
nested within the levels of Factor B (the between-subjects factor). The notation AP/B
indicates that the combination of the within-subjects factor levels (Factor A) and the levels of
Factor P is nested within the levels of the between-subjects factor (Factor B).
Source    SS        df                  MS                       F
________________________________________________________________________
B         SSB       dfB = b − 1         MSB = SSB/dfB            MSB/MSP/B
P/B       SSP/B     dfP/B = n − b       MSP/B = SSP/B/dfP/B
A         SSA       dfA = a − 1         MSA = SSA/dfA            MSA/MSAP/B
AB        SSAB      dfAB = dfAdfB       MSAB = SSAB/dfAB         MSAB/MSAP/B
AP/B      SSAP/B    dfAP/B = dfAdfP/B   MSAP/B = SSAP/B/dfAP/B
________________________________________________________________________
An estimate of η²_B for the between-subjects factor is η̂²_B = SSB/(SSB + SSP/B), and a
confidence interval for η²_B can be obtained using SAS or R. Confidence intervals for
standardized or unstandardized linear contrasts (e.g., pairwise main effects, pairwise
interaction effects, pairwise simple main effects) are recommended alternatives to the
null hypothesis tests for the main effect of Factor A and the AB interaction.
Counterbalancing
The usefulness of a within-subjects experimental design is limited by the assumption of no
practice, fatigue, or carryover effects. It is possible to completely control for practice,
fatigue, and a specific pattern of carryover effects by counterbalancing the order of the
treatment conditions. For example, with a = 2 treatment conditions, one group of
participants receives treatment t1 followed by treatment t2 (t1 -> t2) and a second group
receives treatment t2 followed by treatment t1 (t2 -> t1). With a = 3 treatments, there
are six possible orders:

t1 -> t2 -> t3
t1 -> t3 -> t2
t2 -> t1 -> t3
t2 -> t3 -> t1
t3 -> t1 -> t2
t3 -> t2 -> t1

A design that uses all possible order conditions is called a completely counterbalanced
design.
With a = 4 treatments there are 24 possible treatment orders. Complete counterbalancing
would require a sample size that is a multiple of 24, which might be difficult to achieve.
Instead of using all possible orders, a particular subset of the 24 possible orders can be
used to control for practice, fatigue, and a specific pattern of carryover effects. With a = 4
treatments, the following four order conditions are recommended:

t1 -> t2 -> t4 -> t3
t2 -> t3 -> t1 -> t4
t3 -> t4 -> t2 -> t1
t4 -> t1 -> t3 -> t2
This particular subset of order conditions is called balanced Latin square (BLS)
counterbalancing, in which each treatment condition immediately follows every other
treatment condition in only one of the selected order conditions. Using the above four
order conditions, the sample size can be a multiple of 4 rather than 24. BLS
counterbalancing requires only a order conditions for even values of a, but 2a order
conditions are needed for odd values of a. With a = 5 treatments, the following ten order
conditions provide BLS counterbalancing:

t1 -> t2 -> t5 -> t3 -> t4
t2 -> t3 -> t1 -> t4 -> t5
t3 -> t4 -> t2 -> t5 -> t1
t4 -> t5 -> t3 -> t1 -> t2
t5 -> t1 -> t4 -> t2 -> t3
t4 -> t3 -> t5 -> t2 -> t1
t5 -> t4 -> t1 -> t3 -> t2
t1 -> t5 -> t2 -> t4 -> t3
t2 -> t1 -> t3 -> t5 -> t4
t3 -> t2 -> t4 -> t1 -> t5

With complete or BLS counterbalanced designs, an approximately equal number of
participants should be randomly assigned to each order condition.
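For even a, a BLS set of orders can be generated by writing the first order as 1, 2, a, 3, a − 1, … and then shifting every entry by 1 (mod a) to obtain the remaining a − 1 orders; a sketch of this standard construction:

```python
def balanced_latin_square(a):
    """Generate the a order conditions of a balanced Latin square for even a.
    First row: 1, 2, a, 3, a-1, ...; each later row shifts entries by 1 mod a."""
    if a % 2 != 0:
        raise ValueError("this simple construction is for even a")
    first, low, high = [1], 2, a
    while len(first) < a:
        first.append(low)
        low += 1
        if len(first) < a:
            first.append(high)
            high -= 1
    return [[(x - 1 + i) % a + 1 for x in first] for i in range(a)]

for order in balanced_latin_square(4):
    print(order)
```

For a = 4 this reproduces the four recommended order conditions listed above.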
To illustrate how counterbalancing can control for practice, fatigue, and a specific type of
carryover effect, consider a 2 × 2 mixed design where the within-subjects factor (A) has
two treatment conditions (t1 and t2) and the between-subjects factor (B) has two order
conditions (b1 = t1 -> t2 and b2 = t2 -> t1). Recall from Module 2 that in a 2-group
experiment the population mean of the dependent variable under the t1 treatment is
equal to μ1, and the population mean of the dependent variable under the t2 treatment is
equal to μ2. A carryover effect can be symmetric or asymmetric. With a symmetric
carryover effect, the carryover from treatment t1 to t2 is equal to the carryover from
treatment t2 to t1. With an asymmetric carryover effect, the carryover from treatment t1
to t2 is not equal to the carryover from treatment t2 to t1.

Let k be the value of a practice effect, a fatigue effect, or a symmetric carryover effect. The
population means for the 2 × 2 mixed design are given below.

                           Factor B (Order)
                           b1          b2
Factor A       a1          μ1          μ1 + k
(Treatment)    a2          μ2 + k      μ2

With practice, fatigue, or symmetric carryover, the main effect of A is equal to (μ1 + μ1 +
k)/2 − (μ2 + k + μ2)/2 = μ1 − μ2. Thus, the main effect of A in this 2 × 2 mixed design is
identical to the effect that would be estimated using a 2-group design. In this type of
design, we do not follow the convention of analyzing simple main effects if an A × B
interaction effect is detected. The simple main effects of A at b1 and b2 in this design are
biased estimates of μ1 − μ2. The main effect of A should always be examined in this type
of design regardless of the size of the A × B interaction.
With asymmetric carryover, the main effect of A will not equal μ1 − μ2. Let k1 be the
carryover from t1 to t2 and let k2 be the carryover from t2 to t1. With asymmetric
carryover, the main effect of A is equal to (μ1 + μ1 + k2)/2 − (μ2 + k1 + μ2)/2 = μ1 − μ2 +
(k2 − k1)/2, and the estimate of μ1 − μ2 is biased by an amount equal to (k2 − k1)/2. Thus,
with a = 2, complete counterbalancing will not control for asymmetric carryover.

With a > 2, complete and BLS counterbalanced designs control for any pattern of practice
or fatigue effects. For example, the practice effect could be large after the first treatment
but smaller for subsequent treatments; or there might not be any fatigue effect until the
last treatment. Complete and BLS counterbalancing also controls for specific patterns of
asymmetric carryover effects. For example, with a = 3 the following table describes all
possible carryover effects from one treatment to the treatment that immediately follows.
             Carryover from:
To:        t1       t2       t3
t1         ---      k3       k5
t2         k1       ---      k6
t3         k2       k4       ---
If it is reasonable to assume that k3 + k5 = k1 + k6 = k2 + k4, then estimates of all
pairwise differences among the a = 3 treatment means will be unbiased in a mixed design
with complete or BLS counterbalancing. Some researchers are willing to assume that
there is a common symmetric carryover effect for all pairs of treatments. This assumption
implies k1 = k2 = k3 = k4 = k5 = k6, which is more restrictive than necessary when
complete or BLS counterbalancing is used.
In studies that use a within-subjects treatment factor, complete or BLS counterbalancing
of the treatment orders is recommended unless there is a compelling argument that there
will be no practice, fatigue, or carryover effects. When complete or BLS counterbalancing
is used, the data should be analyzed using a mixed design rather than a single factor
within-subjects design. The mixed design will provide more powerful tests and narrower
confidence intervals of the treatment effects than the statistical methods for a single-factor within-subjects design (e.g., paired-samples t-test, one-way within-subjects ANOVA, Equations 4.1, 4.2, 4.5, 4.7). The lack of power and precision in the single-factor design is due to the addition of a practice, fatigue, or carryover effect to only some of the participant scores within each treatment condition. This increases the variance of the dependent variable within each treatment condition and also decreases the correlation between any two treatment conditions. The increased variance and decreased correlation produce a larger standard error, which in turn results in less powerful tests and wider confidence intervals. Researchers routinely use counterbalancing in their within-subjects experimental designs, but they frequently analyze the data using a single-factor within-subjects design, which can result in an unnecessary loss of power and precision.
Complete and BLS counterbalancing controls for any type of practice and fatigue effects.
However, complete and BLS counterbalancing only controls for specific types of carryover
effects. If the researcher is uncertain about the nature of the carryover effects, it might be
possible to reduce all types of carryover effects by increasing the length of time between
treatments or requiring participants to complete some unrelated task between
treatments. If these precautionary measures are impractical or could be ineffective, then
a between-subjects treatment factor should be used instead of a within-subjects treatment
factor.
Longitudinal Designs
A longitudinal design (or repeated measures design) is a within-subjects design where the within-subjects factor is a repeated factor. In this type of nonexperimental design, each participant is measured on two or more occasions for the purpose of assessing natural changes over time. In applications where the change in population means over time is assumed to be approximately linear, the slope of the line provides a useful description of change. With a = 2 time points where all n participants are measured at time x1 and again at time x2, the slope of the line is β1 = (μ2 − μ1)/(x2 − x1). For instance, if the social skills of kindergarten students are measured on week 1 and then again on week 20, x2 − x1 = 20 − 1 = 19. Dividing the endpoints of Equation 4.1 by 19 (and defining di as yi2 − yi1) gives a confidence interval for β1.
In longitudinal designs with a > 2 time points, the slope of the line could be an interesting parameter to estimate. The population slope is defined as β1 = Σhjμj (sum over j = 1 to a), where hj = (xj − μx)/[Σ(xj − μx)²] and μx is the mean of the x1 … xa values. Consider a longitudinal design where participants are measured in months x1 = 1, x2 = 3, x3 = 5, x4 = 7, and x5 = 9. In this study μx = 5, Σ(xj − μx)² = (−4)² + (−2)² + 0² + 2² + 4² = 40, and the contrast coefficients are h1 = −0.1, h2 = −0.05, h3 = 0, h4 = 0.05, and h5 = 0.1.
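As a sketch, the slope contrast coefficients for the example above can be computed directly from the measurement times given in the text:

```python
# Contrast coefficients for the linear-slope contrast beta_1 = sum h_j mu_j,
# using the measurement times from the example (months 1, 3, 5, 7, 9).
x = [1, 3, 5, 7, 9]
mx = sum(x) / len(x)                      # mean of the time points (5.0)
ss = sum((xj - mx) ** 2 for xj in x)      # sum of squared deviations (40.0)
h = [(xj - mx) / ss for xj in x]          # contrast coefficients
print(h)  # [-0.1, -0.05, 0.0, 0.05, 0.1]
```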
In longitudinal studies with a > 2 time points where the population means are not expected to increase or decrease linearly over time, confidence intervals for the a − 1 slopes at all adjacent time points can provide useful information. For instance, in a study with a = 4 time points, confidence intervals for (μ2 − μ1)/(x2 − x1), (μ3 − μ2)/(x3 − x2), and (μ4 − μ3)/(x4 − x3) would provide information about each two-period change in population means.
Pretest-posttest Designs
A special type of longitudinal design is the pretest-posttest design, where the dependent variable is measured on one or more occasions prior to treatment and on one or more occasions following treatment. In the simplest and most popular version of this design, each participant is measured once prior to treatment (the pretest) and once following treatment (the posttest). The two measurements are used to construct a confidence interval (Equation 4.1) for μ1 − μ2. If the confidence interval excludes 0, it is tempting to conclude that the treatment caused the mean of the dependent variable to change. However, it is likely that other factors, such as maturational effects or other events that occurred between the pretest and posttest, are responsible for the change.
If more than one pretest or more than one posttest measurement can be obtained, it may be possible to rule out maturation and other confounding effects. For instance, suppose that two pretest measurements (y1 and y2) and two posttest measurements (y3 and y4) are obtained for each participant. Furthermore, suppose that the treatment (which is given between Time 2 and Time 3) is expected to have a long-term effect on the dependent variable. In this design, if confidence intervals indicate that μ1 − μ2 is small, μ2 − μ3 is not small, and μ3 − μ4 is small, this pattern of results would suggest that the treatment has caused a mean change in the dependent variable from Time 2 to Time 3 that has lasted into Time 4. Now suppose that a small maturational effect may be present. In this situation, the researcher would like to show that μ2 − μ3 is larger than μ1 − μ2 and that μ2 − μ3 is also larger than μ3 − μ4. Specifically, the researcher would compute confidence intervals for the two linear contrasts (μ2 − μ3) − (μ1 − μ2) = 2μ2 − μ3 − μ1 and (μ2 − μ3) − (μ3 − μ4) = μ2 − 2μ3 + μ4. Consider another situation where the effect of treatment is expected to be short-lived and there are no expected maturational effects. In this study the researcher would like to show that μ1 − μ2 is small, μ2 − μ3 is not small, and μ1 − μ4 is small.
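A confidence interval for a linear contrast such as 2μ2 − μ1 − μ3 can be computed from each participant's linear contrast scores. The sketch below uses hypothetical data for five participants measured at four time points; the data and the t critical value for df = 4 are illustrative assumptions, not values from the text:

```python
import math
import statistics

# Hypothetical scores: rows = participants, columns = Times 1-4.
y = [[10, 12, 15, 15],
     [ 8, 11, 14, 13],
     [12, 13, 17, 16],
     [ 9, 10, 13, 14],
     [11, 14, 16, 16]]
h = [-1, 2, -1, 0]            # coefficients for (mu2 - mu3) - (mu1 - mu2)

# Each participant's linear contrast score.
c = [sum(hj * yij for hj, yij in zip(h, row)) for row in y]

n = len(c)
mean = statistics.mean(c)                  # sample mean of contrast scores
se = statistics.stdev(c) / math.sqrt(n)    # estimated standard error
t = 2.776                                  # 95% t critical value, df = n - 1 = 4
ci = (mean - t * se, mean + t * se)
print(mean, ci)
```

If the interval excludes 0, the data are inconsistent with a purely maturational interpretation of the Time 2 to Time 3 change.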
In longitudinal designs with multiple pretests and posttests, if the researcher can specify precise changes in population means that are predicted from theory or previous research, and these predictions are confirmed in a set of Bonferroni confidence intervals, then it might be reasonable to cautiously conclude that the treatment has a causal effect on the dependent variable.
Although pretest-posttest designs with multiple pretests and posttests can rule out many different types of confounding variables, it is possible that the participants might experience other events around the time of the treatment, and a change in the mean response following treatment may be caused by some other event and not the treatment. If a second group of participants who will not receive the treatment can be measured at the same points in time as the treated group, a comparison of changes over time between the two groups can provide stronger evidence of a possible causal effect of treatment than a single-group pretest-posttest design. For instance, suppose two groups of participants are measured on three occasions (Time 1, Time 2, and Time 3) and the second group receives a treatment after Time 2. The population means for this design are given below.
Condition      Time 1     Time 2     Time 3
____________________________________________
Untreated       μ11        μ12        μ13
Treated         μ21        μ22        μ23
If a confidence interval for (μ11 − μ12) − (μ21 − μ22) suggests that the treated and untreated subpopulations have similar changes over time prior to treatment, and if a confidence interval for (μ12 − μ13) − (μ22 − μ23) suggests that the treated subpopulation has changed more than the untreated subpopulation following treatment, these two findings provide tentative evidence of a causal effect of treatment on the response variable. In two-group designs with two or more pretests and two or more posttests, additional linear contrasts could be specified with predictions that, if confirmed, could provide more compelling evidence of a possible causal effect of treatment.
Assumptions
Within-subjects ANOVA (one-way and two-way) hypothesis tests require three assumptions in addition to the random sampling and independence of participants assumptions: 1) the population variances of the dependent variable are assumed to be equal across the levels of the within-subjects factor(s), 2) the population correlation between each pair of within-subjects measurements is assumed to be equal for all pairs, and 3) the dependent variable in the study population is assumed to have an approximate normal distribution under a given treatment condition. The equal variance and equal correlation assumptions together are called the compound symmetry assumption. A less restrictive but more difficult to explain assumption of sphericity is actually required, but the sphericity assumption is satisfied if compound symmetry is satisfied. The within-subjects ANOVA hypothesis test will not perform properly when the compound symmetry assumption has been violated, even in large samples. Most statistical packages provide alternative within-subjects ANOVA hypothesis tests, such as the Geisser-Greenhouse and Huynh-Feldt tests, or a multivariate test that will perform properly when the compound symmetry assumption has been violated. The within-subjects ANOVA and alternative tests are sensitive to skewness in the study population but should perform properly unless the dependent variable is highly skewed and the sample size is small (n < 20).
The confidence interval for a linear contrast of means (Equation 4.4) requires only one assumption in addition to the random sampling and independence assumptions. The only additional assumption is that the linear contrast scores have an approximate normal distribution in the study population. Skewness, rather than kurtosis, of the linear contrast scores is the major concern. The confidence interval will perform properly unless the dependent variable is highly skewed and the sample size is small (n < 20). Greater amounts of skewness can be tolerated with larger sample sizes. The confidence intervals for δ and ψ are sensitive to a violation of the normality assumption (primarily leptokurtosis) regardless of sample size.
In addition to the random sampling, independence, and normality assumptions, the two-way mixed ANOVA assumes compound symmetry (or sphericity) within each level of the between-subjects factor and also assumes that the study population variances and correlations are equal across the levels of the between-subjects factor. Hypothesis tests in the mixed ANOVA will not perform properly when the compound symmetry assumption or the assumption of equal variances and correlations across treatments has been violated, even in large samples. Given the sensitivity of the within-subjects ANOVA hypothesis test to assumption violations, and the fact that confidence intervals for population eta-squared measures are not readily available for within-subjects factors, a confidence interval for a linear contrast of means of linear contrast scores is the recommended analysis in a mixed two-way ANOVA.
Missing data is more of a problem in a within-subjects ANOVA than a between-subjects
ANOVA. If a participant fails to produce a score for any of the within-subject conditions,
that participant is dropped from the analysis (this is called listwise deletion). In
longitudinal designs, a substantial number of participants could get dropped with listwise deletion because they were unavailable in the later time periods. Missing data is less
of a problem with pairwise comparisons because this analysis only needs to drop
participants who do not have the two scores required for a particular pairwise comparison
(this is called pairwise deletion). As in between-subjects designs, a random loss of data
does not affect the internal or external validity of a within-subjects study but it will
decrease the power of statistical tests and increase confidence interval widths.
Distribution-free Methods
If the linear contrast variable is skewed, a confidence interval for a population median of
the linear contrast scores may be more appropriate and meaningful than a confidence
interval for a linear contrast of population means. Equation 1.7 of Module 1 can be applied
to the n linear contrast scores. Recall that the mean of linear contrast scores is equal to a
linear contrast of means. However, it is not the case that a median of linear contrast
scores is equal to a linear contrast of medians. Nevertheless, a median of linear contrast
scores provides a useful description of a contrast effect. Recall also that Equation 1.7 only
requires random sampling and independence among participants.
For within-subjects designs with a = 2, the sign test of Module 1 can be applied to the difference scores to test H0: θ = 0, where θ is the population median of the difference scores. The Wilcoxon signed rank test is a more powerful test of H0: θ = 0 than the sign test but assumes that the distribution of the difference scores is symmetric. The Wilcoxon signed rank test is usually a little less powerful than the paired-samples t-test, but it can be more powerful than the t-test if the response variable is highly leptokurtic.
Equation 1.7 can be used to obtain a confidence interval for the median of the difference scores. If the metric of the dependent variable values is not familiar to the intended audience, the median of the difference scores may not have a clear interpretation. In these situations, the following confidence interval has a simple interpretation because it describes the proportion of people in the study population who have a y1 score that is greater than their y2 score

p̂ ± zα/2 √[p̂(1 − p̂)/(n + 4)]     (4.8)

where p̂ = (f + 2)/(n + 4) and f is the number of participants in the sample with y1 scores that are greater than their y2 scores.
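Equation 4.8 is simple to compute. A minimal sketch, assuming the adjusted sample size n + 4 is also used in the standard error (the counts f = 30 and n = 50 are hypothetical):

```python
import math

def prop_ci(f, n, z=1.96):
    """Adjusted CI for the proportion with y1 > y2 (Equation 4.8)."""
    p = (f + 2) / (n + 4)                  # adjusted proportion estimate
    se = math.sqrt(p * (1 - p) / (n + 4))  # adjusted n assumed in the SE
    return p - z * se, p + z * se

lo, hi = prop_ci(f=30, n=50)
print(lo, hi)
```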
For within-subjects designs with a > 2, the Friedman test is a distribution-free alternative to the one-way within-subjects ANOVA. The Friedman test is a test of the null hypothesis that the dependent variable distribution has the same location, variance, and shape at each level of the within-subjects factor. Although the Friedman test is used in situations where the response variable is skewed and the sample size is small, it does not provide useful scientific information because the null hypothesis is known to be false in virtually every study. An alternative to the Friedman test involves performing multiple sign tests or Wilcoxon signed rank tests for some or all pairwise comparisons using a Holm procedure. Additionally, Equation 4.8 could be computed for some or all pairwise comparisons using a Bonferroni adjustment to α. Some researchers use the Friedman test as a screening test to determine if multiple pairwise tests or confidence intervals are necessary. If the p-value for the Friedman test is greater than .05, then pairwise tests and confidence intervals are not performed.
Sample Size Requirement for Desired Precision
The width of the confidence interval for μ1 − μ2 in a within-subjects design depends on the correlation between the two measurements. If the correlation of the measurements is positive (which is typical in within-subjects designs), the sample size requirement is often much smaller than the sample size requirement for a corresponding two-group design. The required sample size to estimate μ1 − μ2 with desired precision and confidence in a within-subjects design is

n = 8σ̃²(1 − ρ̃12)(zα/2/w)² + (zα/2)²/2     (4.9)

where ρ̃12 is a planning value of the Pearson correlation between the two measurements, and σ̃² is a planning value of the average within-group variance. Note that the sample size requirement is larger for smaller values of ρ̃12.
Example 4.3. A researcher wants to compare men's and women's opinions about including issues of cultural diversity in elementary school curriculums. The researcher wants to estimate μ1 − μ2 with 95% confidence and wants the width of the interval to be about 2. From previous research, the researcher decides to set σ̃² = 5.0. If the researcher estimates μ1 − μ2 using a two-group study with a random sample of married men and a second random sample of married women, the sample size requirement per group is nj = 8(5.0)(1.96/2)² + 0.96 = 39.4 ≈ 40, or a total of 80 participants. If the researcher estimates μ1 − μ2 using a within-subjects study of husbands and wives, the required number of couples is given below for three different values of ρ̃12.

ρ̃12     n
______________
 .5     22
 .7     14
 .9      6
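Equation 4.9 and the table above can be reproduced in a few lines (sample sizes are rounded up to the next whole number):

```python
import math

def n_within(var, rho, w, z=1.96):
    """Sample size to estimate mu1 - mu2 in a within-subjects design (Eq 4.9)."""
    return math.ceil(8 * var * (1 - rho) * (z / w) ** 2 + z ** 2 / 2)

print([n_within(5.0, r, w=2.0) for r in (0.5, 0.7, 0.9)])  # [22, 14, 6]
```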
The sample size required to estimate δ in a within-subjects design with desired confidence and precision is

n = 4[δ̃²(1 + ρ̃12²)/4 + 2(1 − ρ̃12)](zα/2/w)²     (4.10)

Unless δ̃ is close to zero, the required sample size to estimate δ will be larger than the required sample size to estimate μ1 − μ2. In general, larger sample sizes are required for larger values of δ̃. The sample size requirements for 95% confidence and a desired width of 0.5 are shown below for three values of δ̃ and three values of ρ̃12.
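Equation 4.10 can be evaluated for any planning values. As an illustration with assumed values δ̃ = 0.5 and w = 0.5 (these are not necessarily the values used in the text's table):

```python
import math

def n_delta(d, rho, w, z=1.96):
    """Sample size to estimate delta in a within-subjects design (Eq 4.10)."""
    return math.ceil(4 * (d ** 2 * (1 + rho ** 2) / 4 + 2 * (1 - rho)) * (z / w) ** 2)

print([n_delta(0.5, r, w=0.5) for r in (0.5, 0.7, 0.9)])  # [67, 43, 20]
```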
With a ratio-scale dependent variable, the sample size requirement to estimate μ1/μ2 with desired confidence and precision is approximately

n = 8σ̃²[1/μ̃1² + 1/μ̃2² − 2ρ̃12/(μ̃1μ̃2)][zα/2/ln(r)]² + (zα/2)²/2     (4.11)

where μ̃j is a planning value of μj, r is the desired upper to lower confidence interval endpoint ratio, and ln(r) is the natural logarithm of r.
The approximate sample size requirement to estimate Σhjμj with desired confidence and precision in a within-subjects study is

n = 4σ̃²(Σhj²)(1 − ρ̃)(zα/2/w)² + (zα/2)²/2     (4.12)

where σ̃² is a planning value of the largest within-treatment variance, and ρ̃ is a planning value of the smallest correlation among all pairs of measurements. This formula assumes Σhj = 0. A Bonferroni adjustment to α in the critical z-value can be made when two or more simultaneous confidence intervals are required.
Example 4.4. A researcher wants to replicate a published study that compared four graphical user interfaces. The researcher wants a 95% confidence interval for (μ1 + μ2)/2 − (μ3 + μ4)/2 that has a width of about 4.0. Using the sample variances and correlations from the original study as planning values, interface 2 had the largest sample variance (161.9) and the smallest sample correlation was between interfaces 2 and 4 (0.77). The required number of participants is approximately n = 4(161.9)(¼ + ¼ + ¼ + ¼)(1 − 0.77)(1.96/4.0)² + 1.92 = 37.7 ≈ 38.
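The Example 4.4 computation, as a sketch of Equation 4.12:

```python
import math

def n_contrast(var, h, rho, w, z=1.96):
    """Sample size to estimate a linear contrast of means (Equation 4.12)."""
    ssh = sum(hj ** 2 for hj in h)
    return math.ceil(4 * var * ssh * (1 - rho) * (z / w) ** 2 + z ** 2 / 2)

h = [0.5, 0.5, -0.5, -0.5]   # coefficients for (mu1 + mu2)/2 - (mu3 + mu4)/2
print(n_contrast(161.9, h, rho=0.77, w=4.0))  # 38
```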
The approximate sample size requirement to estimate ψ with desired confidence and precision in a within-subjects study is

n = 4[ψ̃²[1 + (a − 1)ρ̃²]/(2a) + (1 − ρ̃)Σhj²](zα/2/w)²     (4.13)

where ψ̃ is a planning value for ψ and ρ̃ is a planning value for the smallest correlation among all pairs of measurements. This formula assumes Σhj = 0. A Bonferroni adjustment to α in the critical z-value can be made when two or more simultaneous confidence intervals will be computed.
Example 4.5. A researcher wants to estimate ψ in a 4-level within-subjects experiment with 95% confidence and contrast coefficients 1/3, 1/3, 1/3, and −1. After reviewing previous research, the researcher decides to set ψ̃ = 0.5, ρ̃ = 0.7, and w = 0.4. The required sample size is approximately n = 4[0.25{1 + 3(0.49)}/8 + 0.3(1.33)](1.96/0.4)² = 45.8 ≈ 46.
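The Example 4.5 computation, as a sketch of Equation 4.13:

```python
import math

def n_psi(psi, a, h, rho, w, z=1.96):
    """Sample size to estimate a standardized contrast psi (Equation 4.13)."""
    ssh = sum(hj ** 2 for hj in h)
    inner = psi ** 2 * (1 + (a - 1) * rho ** 2) / (2 * a) + (1 - rho) * ssh
    return math.ceil(4 * inner * (z / w) ** 2)

h = [1/3, 1/3, 1/3, -1]
print(n_psi(0.5, a=4, h=h, rho=0.7, w=0.4))  # 46
```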
Sample Size Requirement for Desired Power
The approximate sample size required to test H0: Σhjμj = 0 with desired power in a within-subjects design is

n = σ̃²(Σhj²)(1 − ρ̃)(zα/2 + zβ)²/(Σhjμ̃j)² + (zα/2)²/2     (4.14)

where σ̃² is a planning value for the largest variance of the a measurements, ρ̃ is a planning value for the smallest correlation among all pairs of measurements, and Σhjμ̃j is a planning value of the effect size. This formula assumes Σhj = 0. A Bonferroni adjustment to α in the critical z-value can be made when two or more simultaneous tests will be computed. In applications where Σhjμ̃j or σ̃² is difficult to specify, Equation 4.14 can be expressed more simply in terms of a planning value for ψ, as shown below

n = (Σhj²)(1 − ρ̃)(zα/2 + zβ)²/ψ̃² + (zα/2)²/2     (4.15)
Example 4.6. A researcher is planning a 2 × 2 within-subjects experiment and wants to reject the null hypothesis of a zero two-way interaction effect with power of .95 at α = .05. After conducting a pilot study and reviewing previous research, it was decided to set σ̃² = 15 and ρ̃ = 0.8. The expected size of the interaction contrast is 3.0. The required sample size is approximately n = 15(4)(1 − 0.8)(1.96 + 1.65)²/3.0² + 1.92 = 19.3 ≈ 20.
The sample size requirement to test H0: |μ1 − μ2| ≥ b for a specified level of α and with desired power is approximately

n = 2σ̃²(1 − ρ̃)(zα + zβ)²/(|μ̃1 − μ̃2| − b)² + zα²/2     (4.16)

where zα is a one-sided critical z-value and |μ̃1 − μ̃2| is the expected effect size, which must be smaller than b. Equivalence tests usually require large sample sizes.
Example 4.7. A researcher wants to show that two prototype navigation programs have similar usability mean ratings. A sample of participants will use both programs for 20 days and then rate each program on a 1-30 scale. The researcher believes that a 3-point difference in mean ratings is small and unimportant. The required sample size to test H0: |μ1 − μ2| ≥ 3 with power of .9, α = .05, an expected effect size of 0.5, and a standard deviation planning value of 5 is approximately n = 2(25)(1.65 + 1.28)²/(0.5 − 3)² + 1.36 = 70.03 ≈ 71.
Using Prior Information
Suppose a population mean difference in a within-subjects design has been estimated in a previous study and also in a new study. The previous study used a random sample to estimate μ1 − μ2 from one study population, and the new study used a random sample to estimate μ3 − μ4 from another study population. This is a 2 × 2 mixed factorial design where the between-group factor is a classification factor with Study 1 and Study 2 as the two levels. The two study populations are assumed to be conceptually similar. Let μd1 = μ1 − μ2 and μd2 = μ3 − μ4. A 100(1 − α)% confidence interval for μd1 − μd2 is

μ̂d1 − μ̂d2 ± tα/2;df SE     (4.17)

where tα/2;df is a critical t-value, SE = √(σ̂d1²/n1 + σ̂d2²/n2), df = (σ̂d1²/n1 + σ̂d2²/n2)²/[σ̂d1⁴/(n1²(n1 − 1)) + σ̂d2⁴/(n2²(n2 − 1))], and σ̂dj² is the estimated variance of the difference scores in study j.
If the confidence interval for μd1 − μd2 suggests that μ1 − μ2 and μ3 − μ4 are not too dissimilar, the researcher might want to compute a confidence interval for (μd1 + μd2)/2. A confidence interval for (μd1 + μd2)/2 will have greater external validity and could be substantially narrower than the confidence interval for μ1 − μ2 or μ3 − μ4. A 100(1 − α)% confidence interval for (μd1 + μd2)/2 is

(μ̂d1 + μ̂d2)/2 ± tα/2;df SE     (4.18)

where SE = √(σ̂d1²/(4n1) + σ̂d2²/(4n2)) and df = (σ̂d1²/(4n1) + σ̂d2²/(4n2))²/[σ̂d1⁴/(16n1²(n1 − 1)) + σ̂d2⁴/(16n2²(n2 − 1))].
Example 4.8. A study at UCSC (n = 50) used a within-subjects design to compare the favorability of two proposals to assist homeless teenagers. The estimated mean difference in favorability ratings was 1.37 with an estimated standard error of 0.247. This study was replicated at UC Berkeley (n = 52) where the estimated mean difference was 1.51 with an estimated standard error of 0.219. A 95% confidence interval for μd1 − μd2 suggests that μd1 and μd2 are similar. A 95% confidence interval for (μd1 + μd2)/2 was [0.76, 2.12], which applies to both study populations.
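The standard error and degrees of freedom in Equation 4.17 can be sketched as follows; the summary statistics here are hypothetical, with vard1 and vard2 denoting the estimated variances of the difference scores (so vard/n is the squared standard error of a mean difference):

```python
import math

def diff_ci_parts(vard1, n1, vard2, n2):
    """SE and Satterthwaite df for the CI of mu_d1 - mu_d2 (Equation 4.17)."""
    se = math.sqrt(vard1 / n1 + vard2 / n2)
    df = (vard1 / n1 + vard2 / n2) ** 2 / (
        vard1 ** 2 / (n1 ** 2 * (n1 - 1)) + vard2 ** 2 / (n2 ** 2 * (n2 - 1)))
    return se, df

se, df = diff_ci_parts(vard1=9.0, n1=25, vard2=6.25, n2=30)
# The CI is (diff1 - diff2) +/- t * se, with t taken from a t-table at df.
print(se, df)
```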
Reliability Designs
In the physical sciences, attributes such as weight, length, time, volume, and pressure can
be measured with great accuracy. When measuring the weight of an object, two
laboratory-grade scales will yield virtually the same value, two different technicians using
the same scale will obtain virtually the same value, or the same technician using the same
scale will obtain virtually the same value on two different occasions. In the behavioral
sciences, however, psychological attributes cannot be measured with high accuracy. For
instance, if a particular student takes two forms of the ACT, or takes the same form on
two different occasions, or if two expert graders both evaluate the student's written essay,
the two scores may be substantially different.
Measurement error for person i, denoted as ei, is the unknown and unpredictable difference between that person's true score (τi) and some measurement of the true score for that person (yi), so that yi = τi + ei. In any given study population, the variance of the observed scores (σy²) is assumed to equal the variance of the true scores (στ²) plus the variance of the measurement errors (σe²). The reliability coefficient of a single measurement, denoted as ρy, is defined as the true score variance divided by the observed score variance, ρy = στ²/σy² = στ²/(στ² + σe²), and has a range of 0 to 1. A reliability coefficient of 1 indicates that the measurements contain no measurement error and a reliability coefficient of 0 indicates that the measurements are pure measurement error.
Although the reliability of a measurement is a function of the true score variance, and the true scores will be unknown in behavioral science applications, a fundamental theorem in psychometrics shows that the reliability of a measurement can be estimated using multiple measurements of the same attribute. In a reliability design where a ≥ 2 equally reliable measurements are obtained from a random sample of n participants, a one-way within-subjects ANOVA can be used to estimate the reliability of any single measurement, where the levels of Factor A represent the multiple measurements. The a measurements per participant could be ratings from a ≥ 2 different raters (to estimate interrater reliability), scores on a particular questionnaire at a = 2 points in time (to estimate test-retest reliability), scores from a ≥ 2 different forms of a test or questionnaire (to estimate alternate form reliability), or the responses to a ≥ 2 items of a questionnaire (to estimate internal consistency reliability).
The following estimate of ρy can be obtained from a one-way within-subjects ANOVA

ρ̂y = (MSP − MSAP)/(MSP + (a − 1)MSAP)     (4.19)

which is also a type of intraclass correlation coefficient. The reliability of a sum (or average) of a ≥ 2 equally reliable measures, denoted as ρs, is estimated as

ρ̂s = 1 − MSAP/MSP     (4.20)

and is referred to as coefficient alpha (or Cronbach's alpha).

A sum (or average) of a ≥ 2 equally reliable measurements will be more reliable than any single measurement. The following Spearman-Brown formulas show the relation between ρs and ρy

ρs = aρy/[1 + (a − 1)ρy]     (4.21)

ρy = ρs/[a − (a − 1)ρs]     (4.22)
To illustrate the use of the Spearman-Brown formulas, suppose the reliability of a single measurement is 0.5; then the reliability of the sum or average of a = 3 equally reliable measurements is 3(0.5)/[1 + 2(0.5)] = 0.75. Or suppose a 5-item questionnaire score has a reliability of 0.9. Assuming equally reliable items, the reliability of a single item is 0.9/[5 − 4(0.9)] = 0.643.
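The two Spearman-Brown calculations above can be sketched as:

```python
def rel_sum(a, rho_y):
    """Reliability of a sum/average of a equally reliable measures (Eq 4.21)."""
    return a * rho_y / (1 + (a - 1) * rho_y)

def rel_single(a, rho_s):
    """Reliability of one measure given the reliability of a sum (Eq 4.22)."""
    return rho_s / (a - (a - 1) * rho_s)

print(rel_sum(3, 0.5))     # 0.75
print(rel_single(5, 0.9))  # about 0.643
```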
An approximate 100(1 − α)% confidence interval for ρs is

1 − exp[ln(1 − ρ̂s) − ln{n/(n − 1)} ± zα/2 √(2a/[(a − 1)(n − 2)])]     (4.23)

and a 100(1 − α)% confidence interval for ρy is obtained by transforming the endpoints of the confidence interval for ρs using Equation 4.22. An exact confidence interval for ρs can be obtained in SPSS and R.
Example 4.9. Two parole officers independently assigned recidivism scores to a random sample of 50 sex offenders taken from a Midwest prison population of about 16,000 sex offenders. The estimate of ρs is 0.87. An approximate 95% confidence interval for ρs is

1 − exp[ln(1 − 0.87) − ln(50/49) ± 1.96√(4/48)] = [0.77, 0.93]

The researcher is 95% confident that the reliability of the average of these two parole officer ratings in the population of sex offenders is between 0.77 and 0.93. A 95% confidence interval for the reliability of a single parole officer rating is [0.77/(2 − 0.77), 0.93/(2 − 0.93)] = [0.62, 0.87].
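The Example 4.9 interval can be sketched with Equation 4.23; the endpoints agree with the example's values to within rounding:

```python
import math

def alpha_ci(rel, a, n, z=1.96):
    """Approximate CI for the reliability of a sum (Equation 4.23)."""
    center = math.log(1 - rel) - math.log(n / (n - 1))
    half = z * math.sqrt(2 * a / ((a - 1) * (n - 2)))
    return 1 - math.exp(center + half), 1 - math.exp(center - half)

lo, hi = alpha_ci(rel=0.87, a=2, n=50)
print(lo, hi)
```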
Effect of Measurement Error on Statistical Methods
Measurement error increases the variance of the dependent variable within treatment conditions, which reduces the power of statistical tests and increases the widths of confidence intervals. Measurement error also attenuates the estimates of δ, ψ, and η². In within-subjects designs, measurement error has the additional detrimental effect of attenuating the correlations among the measurements, which contributes to a further decrease in power and an increase in confidence interval width. An important consequence of measurement error is the need for a larger sample size.

It is possible to substantially reduce the sample size requirement if the reliability of the dependent variable can be improved. Using a sum (or average) of two or more equally reliable measurements of the dependent variable is one way to increase reliability, as shown in Equation 4.21. The following table illustrates the effect of increasing the number of equally reliable measurements per participant on the sample size requirement in a two-group design where the researcher wants a 95% confidence interval for μ1 − μ2 to have a width of 1.0 and assumes the within-group variance of the true scores is 1.0. The sample size requirements are given for three different values of ρy and a = 1 to 4 equally reliable measurements.
Example 4.10.

 a     ρy = .4    ρy = .6    ρy = .8
________________________________________
 1       78         53         40
 2       55         42         36
 3       48         39         35
 4       44         37         34
________________________________________
If the reliability of a single measurement is low, increasing the number of equally reliable
measurements per participant can substantially decrease the sample size requirement.
For instance, if the reliability of a single measurement is 0.4, the sample size requirement
can be reduced from 78 to 44 by taking four equally reliable measurements per
participant.
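The table values follow from combining the Spearman-Brown formula with the two-group sample size formula; the sketch below assumes a true-score variance of 1.0 (so the observed-score variance is 1/ρs) and the per-group formula with its (zα/2)²/4 term, as in Example 4.3:

```python
import math

def n_two_group(a, rho_y, w=1.0, z=1.96):
    """Per-group n when averaging a equally reliable measures per person."""
    rho_s = a * rho_y / (1 + (a - 1) * rho_y)  # reliability of the average
    var_obs = 1.0 / rho_s                      # observed variance (true var = 1)
    return math.ceil(8 * var_obs * (z / w) ** 2 + z ** 2 / 4)

for a in (1, 2, 3, 4):
    print(a, [n_two_group(a, r) for r in (0.4, 0.6, 0.8)])
```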
Measurement error can have a greater effect on sample size requirements for within-subjects designs than between-subjects designs because measurement error attenuates the correlation among the within-subjects scores, and larger sample sizes are needed with a smaller correlation among the within-subjects scores. The following table illustrates the effect of increasing the number of equally reliable measurements per participant on the sample size requirement in a 2-level within-subjects design where the researcher wants a 95% confidence interval for μ1 − μ2 to have a width of 1.0 and assumes the within-group variance of the true scores is 1.0. The sample size requirements are given below for three values of ρy and two values of the correlation between the within-subjects true scores (ρτ1τ2).
Example 4.11.

            ρτ1τ2 = .7                    ρτ1τ2 = .9
____________________________________________________________________
 a    ρy = .4   ρy = .6   ρy = .8    ρy = .4   ρy = .6   ρy = .8
____________________________________________________________________
 1      58        32        19         52        26        13
 2      35        22        15         29        16         9
 3      27        18        14         21        12         8
 4      23        17        14         17        11         7
____________________________________________________________________
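The within-subjects table can be generated the same way, assuming the observed correlation between the two measurements is the true-score correlation attenuated by the reliability of the averaged scores (ρ12 = ρτ1τ2 ρs, a standard attenuation result) and applying Equation 4.9:

```python
import math

def n_within_rel(a, rho_y, rho_tau, w=1.0, z=1.96):
    """Within-subjects n when averaging a equally reliable measures (Eq 4.9)."""
    rho_s = a * rho_y / (1 + (a - 1) * rho_y)  # reliability of the average
    var_obs = 1.0 / rho_s                      # observed variance (true var = 1)
    rho_12 = rho_tau * rho_s                   # attenuated correlation
    return math.ceil(8 * var_obs * (1 - rho_12) * (z / w) ** 2 + z ** 2 / 2)

for a in (1, 2, 3, 4):
    print(a,
          [n_within_rel(a, r, 0.7) for r in (0.4, 0.6, 0.8)],
          [n_within_rel(a, r, 0.9) for r in (0.4, 0.6, 0.8)])
```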
Some Psych 214A Topics
The subsequent course (Psych 214A) will introduce several classes of statistical models
(univariate general linear model, multilevel linear model, multivariate general linear
model, path model) that can be used to analyze other types of designs that were not
covered in this course. In this course only qualitative between-subject factors were
considered (e.g., training method 1, training method 2, training method 3). In the next
course, we will consider designs with quantitative between-subject factors (e.g., 2, 4, 8, 15
hours of training). In the next course, the analysis of covariance (ANCOVA) will be
introduced as a way to reduce error variance by including covariates into the design. Using
covariates can increase the power of hypothesis tests, decrease the width of confidence
intervals, or reduce the sample size requirement. In this course, only fixed factors were
considered. A fixed factor has levels that were deliberately selected and the statistical
results apply only to the factor levels included in the study. The next course will introduce
the concept of a random factor in which the factor levels are randomly selected from a
large set of all possible factor levels. The statistical results from a design with a random
factor will generalize to all possible factor levels rather than just those levels used in the
study. In this course, we assumed that the independent variable in an experimental design
had a direct causal effect on the dependent variable. In the next course, path models will
be used to evaluate causal chains in which the independent variable is assumed to have a
causal effect on a mediator variable which in turn has an assumed causal effect on the
dependent variable. These path models can be used to assess the direct effect as well as
the indirect effect (through the mediator variable) of the independent variable on the
dependent variable. Multilevel statistical models will be introduced as an alternative
approach to the analysis of within-subjects designs. With missing data, multilevel models
do not require listwise or pairwise deletion of data but will use all available data. In
addition to the above topics, which can be used in the analysis of experimental designs,
very general correlation and regression methods also will be introduced for the analysis
of nonexperimental designs.