The Sensitivity of an Experiment: Effect Size and Power

Keppel, G. Design and Analysis:
Chapter 4: The Sensitivity of an Experiment: Effect Size and Power
• “A researcher also attempts to design a sensitive experiment — one that is sufficiently powerful
to detect any differences that might be present in the population. In general, we can create sensitive
or powerful experiments by using large sample sizes, by choosing treatment conditions that are
expected to produce sizable effects, and by reducing the uncontrolled variability in any study.”
4.1 Estimating relative treatment magnitude
• “...the importance of an experimental manipulation is demonstrated by the degree to which we can
account for the total variability among subjects by isolating the treatment effects.”
• “All too frequently, researchers compare an F test that is significant at p < .00001 with one that is
significant at p < .05 and conclude that the first experiment represents an impressive degree of
prediction...” Because the size of the F ratio is affected by both treatment effects and sample size,
we can’t compare F ratios directly.
• “Consider two experiments, one with a sample size of 5 and the other with a sample size of 20, in
which both experiments produce an F that is significant at p = .05.” Which result is more
impressive? This is the problem posed by Rosenthal and Gaito (1963). Surprisingly, many of the
“sophisticated” researchers that they asked reported that the experiment with the larger sample size
had a larger or stronger effect. Can you explain why it’s more impressive to reject H0 with a small
sample size? If you can’t do so now, you should be able to do so after you’ve learned more about
treatment effect size and power.
• One popular measure of the size of a treatment effect is omega squared (ω²). It is quite a useful
measure because it is unaffected by sample size. For the single-factor independent-groups design,
ω² is defined as:

\omega^2 = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_{S/A}^2}

The numerator represents the treatment effects in the population and the denominator represents the
total variability in the population (variability due to treatment effects plus other variability). When
no treatment effects are present, ω² goes to 0, so treatment effects yield values of ω² between 0 and 1.
(You can get negative estimates of ω² when F < 1.) ω² is often referred to as the proportion of
variance “explained” or “accounted for” by the manipulation.
• Because all the terms are Greek (It’s all Greek to me!), you should recognize that the above
formula is not particularly useful (except in theoretical terms). However, we can estimate ω² in a
couple of ways:

\hat{\omega}^2 = \frac{SS_A - (a-1)\,MS_{S/A}}{SS_T + MS_{S/A}}    (4-1)

\hat{\omega}^2 = \frac{(a-1)(F-1)}{(a-1)(F-1) + (a)(n)}    (4-2)

Both of these formulas should give you identical answers, but they probably obscure the theoretical
relationship illustrated in the first formula for ω². So, keep in mind that the formulas are both
getting at the proportion of the total variance attributable to the treatment effects.
• If we apply the formulas to the data in Keppel p. 57, we get

\hat{\omega}^2 = \frac{3314.25 - (4-1)(150.46)}{5119.75 + 150.46} = \frac{2862.87}{5270.21} = .543    (using 4-1)

\hat{\omega}^2 = \frac{(4-1)(7.34-1)}{(4-1)(7.34-1) + (4)(4)} = \frac{19.02}{19.02 + 16} = .543    (using 4-2)

So, we could say that about 54% of the variability in error scores is due to the levels of sleep
deprivation in the experiment.
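If you want a quick arithmetic check, here is a minimal Python sketch (not from Keppel; it just re-uses the summary numbers above) that computes both estimates:

    # Omega-squared estimated two ways from the Keppel p. 57 summary values.
    ss_A, ss_T, ms_SA = 3314.25, 5119.75, 150.46   # treatment SS, total SS, error mean square
    a, n, F = 4, 4, 7.34                           # treatment levels, n per group, obtained F

    omega2_via_SS = (ss_A - (a - 1) * ms_SA) / (ss_T + ms_SA)            # formula (4-1)
    omega2_via_F  = ((a - 1) * (F - 1)) / ((a - 1) * (F - 1) + a * n)    # formula (4-2)
    print(round(omega2_via_SS, 3), round(omega2_via_F, 3))               # both come out near .543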
• According to Cohen, we can consider three rough levels of effect size:

“Small” → ω̂² ≈ .01
“Medium” → ω̂² ≈ .06
“Large” → ω̂² ≥ .15
• Instead of ω², we could use other measures of effect size. One possible measure is R². R² is
always greater than ω̂², but the difference decreases with increases in sample size. When you look
at the formula, R² should make sense to you as a measure of treatment effect.

R^2 = \frac{SS_A}{SS_T}    (4-3)

When applied to the K57 data, you’d get:

R^2 = \frac{3314.25}{5119.75} = .647
• We could also use epsilon squared (also referred to as the “shrunken” R²).

\hat{\epsilon}_A^2 = \frac{SS_A - (a-1)\,MS_{S/A}}{SS_T}

For K57, we’d get

\hat{\epsilon}_A^2 = \frac{3314.25 - (3)(150.46)}{5119.75} = .559
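For comparison, the same K57 numbers run through R² and ε̂² (again, just an illustrative Python sketch, not part of the original notes):

    # R-squared and "shrunken" R-squared (epsilon-squared) from the same summary values.
    ss_A, ss_T, ms_SA, a = 3314.25, 5119.75, 150.46, 4

    r2   = ss_A / ss_T                          # formula (4-3); about .647
    eps2 = (ss_A - (a - 1) * ms_SA) / ss_T      # epsilon-squared; about .559
    print(round(r2, 3), round(eps2, 3))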
• For this course, we’ll only use ω² for estimating effect size. As Keppel notes, the typical effect
size found in psychological research is about .06...a “medium” treatment effect.
• We can’t allow the tail to wag the dog, however, so we must keep in mind that a “meaningful
result depends on its implications for theory, not specifically on its estimated strength.” In fact,
Keppel argues that as programmatic research continues, it should increasingly address smaller and
smaller treatment effects.
• We must also keep in mind that in experimental research, the designer of the experiment can have
an impact on ω² by selecting appropriate levels of the independent variable (using extreme
manipulations, etc.).
• OK, now we can directly assess the problem posed by Rosenthal & Gaito. Assume the following:
You are comparing two experiments, one with n = 5 and one with n = 20. Your experiments are
single-factor independent-groups experiments with a = 4 (like Keppel 57). Your results are exactly
significant at α = .05 (i.e., p = .05 and F_Obt = F_Crit). Compare the treatment effects and power for the
two experiments (a computational sketch follows the table).

                      n = 5      n = 20
F_Obt                 _____      _____
Treatment Effect      _____      _____
Power                 _____      _____

Can you now see why the results are more impressive with n = 5?
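If you would rather compute than read charts, here is a hedged Python sketch for filling in the table. It assumes a = 4, α = .05, F_Obt exactly equal to F_Crit, and it computes power from the noncentral F distribution (noncentrality λ = a·φ²), so its power values may differ a bit from Pearson-Hartley chart readings:

    # Effect size and power when the result is exactly significant, for n = 5 vs. n = 20.
    from scipy.stats import f, ncf

    a, alpha = 4, .05
    for n in (5, 20):
        df1, df2 = a - 1, a * (n - 1)
        f_crit = f.ppf(1 - alpha, df1, df2)          # F_Obt is assumed to equal this
        omega2 = (df1 * (f_crit - 1)) / (df1 * (f_crit - 1) + a * n)   # formula (4-2)
        phi2 = n * omega2 / (1 - omega2)             # formula (4-7)
        power = ncf.sf(f_crit, df1, df2, a * phi2)   # noncentrality lambda = a * phi^2
        print(f"n = {n:2d}: F_crit = {f_crit:.2f}, omega^2 = {omega2:.3f}, power = {power:.2f}")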
4.2 Controlling Type I and Type II Errors
• Type I error rate (α) is set to .05 (fairly rigidly)
• Type II error rate (β) is not known, nor is power (1 − β)
• “Why should we be concerned with controlling power? The answer is that power reflects the
degree to which we can detect the treatment differences we expect and the chances that others will
be able to duplicate our findings when they attempt to repeat our experiments.”
• “...most researchers appear to pay surprisingly little attention to power...”
• Cohen (1962), Brewer (1972), Sedlmeier & Gigerenzer (1989), and others have found that the power
of typical psychology experiments reported in journals is about .50. What is the implication? Half
of the psychological research undertaken will not yield significant results even when H0 is false.
• So, here’s a thought question. What are the implications of low power for psychology as a
discipline? Think it through...what’s the cost to the discipline of conducting research that doesn’t
find a significant result when it should? What’s the cost of finding significant results that are
erroneous (Type I error)?
4.3 Reducing Error Variance
• Don’t get hung up on increasing power by increasing sample size. Another approach to
increasing power is to reduce error variance. (You could also increase power by increasing the
treatment effect, right?)
• “There are three major sources of error variance: random variation in the actual treatments,
unanalyzed control factors, and individual differences.”
• You can control random variation in lots of ways, including: carefully crafted instructions,
carefully controlled research environments, well-trained experimenters, well-maintained equipment,
and automation of the experiment where practical.
• Control factors are best handled by careful experimental design coupled with appropriate
statistical analyses.
• In an independent groups design, you’re stuck with individual differences. You can reduce
individual differences by selecting particular populations, but that will reduce the generalizability of
your results.
4.4 Using Sample Size to Control Power
• Power is determined by α (which we cannot really vary from .05), by ω² (which we can’t really
know in advance in most research), and by n (which is the only variable readily manipulated in
research).
• In Table 4-1 (for a single-factor experiment with 4 levels), we need only look at a small portion
of the table. To achieve power of .80 (a reasonable target), and with α = .05, we need to have 17
participants per condition (68 total) when doing an experiment with a large treatment effect, 44
participants per condition (176 total) when doing an experiment with a medium treatment effect
(which is fairly common), and 271 participants per condition (1084 total) when doing an
experiment with a small treatment effect. If you really want to shudder, look at the number of
participants you’d need if you were to set α to .01!
• So, in order to achieve sufficient power (.80), you need lots of participants when you’re dealing
with small effect sizes or when you adopt more stringent α levels (.01).
• “Low power is poor science — we waste time, energy, and resources whenever we conduct an
experiment that has a low probability of producing a significant result.”
• Although you cannot really know the level of power in advance, you can make some assumptions
that enable you to make more accurate estimations. One approach is to look at similar research or to
actually conduct pilot studies to provide a rough estimate of the needed parameters. As Keppel
notes, “we do not have to estimate the absolute values of the population means, only the expected
differences among them.”
• “...we should be a bit cautious and underestimate the effect size so that our choice of sample size
will be sure to afford reasonable power for our proposed study.”
• “...methodologists are beginning to agree that a power of about .80 represents a reasonable and
realistic value for research in the behavioral sciences...a 4:1 ratio of Type II to Type I errors is
probably appropriate.”
• How to use the Pearson-Hartley power charts (pp. 509-518) to determine sample size
• First, you need to obtain an estimate of φ (which first requires obtaining φ²). Keppel
suggests three approaches to getting a value for φ.
Approach 1, pp. 76-80 (Estimating means and population variance):

\phi_A^2 = n' \, \frac{\sum_i (\mu_i - \mu_T)^2 / a}{\sigma_{S/A}^2}    (4-4)
Of course, with all the Greek letters, you know that you can’t really use this formula. All
that you can do is attempt to estimate the parameters from sample statistics (or guesses).
OK, what are all the terms of the formula? The potential sample size (n′) and the number of
treatment levels (a) should be fairly obvious. Note that you don’t really have to know the
population grand mean (μ_T) or the particular treatment means (μ_i); all that you really
need to know is the differences between them. If you can make a reasonable guess as to the
likely differences among your condition means, then you are well on your way to estimating
the power of your study. The last piece of the puzzle is the population error variance (σ²_S/A).
Once again, although you cannot know the actual value, you might be able to make a
reasonable estimate.
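A minimal Python sketch of Approach 1; the condition means, error variance, and n′ below are invented for illustration, not values from Keppel:

    # phi^2 from guessed treatment means and a guessed error variance (formula 4-4).
    guessed_means = [10.0, 13.0, 15.0, 18.0]   # hypothetical mu_i for a = 4 conditions
    error_variance = 50.0                      # hypothetical sigma^2 for S/A
    n_prime = 10                               # candidate sample size per condition

    a = len(guessed_means)
    grand_mean = sum(guessed_means) / a
    phi2 = n_prime * sum((m - grand_mean) ** 2 for m in guessed_means) / (a * error_variance)
    print(round(phi2, 2), round(phi2 ** 0.5, 2))   # enter the power charts with phi = sqrt(phi^2)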
Approach 2, pp. 82-84 (Cohen’s f, looser estimation of means):

\phi^2 = n' f^2    (4-5)

To get f, you first have to get d (the standardized range of the population means). (OK, I
never claimed this was easy!) The formula for d is:
d = \frac{\mu_{max} - \mu_{min}}{\sigma_{S/A}}
Note that to compute d, you only need to estimate how far apart the two most distant group
means might be, as well as the population variance/standard deviation. So, this approach
doesn’t require the specificity of the first approach. However, you also need to estimate how
much variability you might have among your group means. Again, this seems like it might
be easier to do than estimating actual differences (or values) among your treatments.
Cohen suggests three options, depending on how the treatment means are spread over that range:

Minimum variation (the extreme means at the two ends, the rest bunched at the center):

f = d \sqrt{\frac{1}{(2)(a)}}

Intermediate variation (the means spaced evenly between the two extremes):

f = d \sqrt{\frac{a+1}{(12)(a-1)}}

Maximum variation (the means split between the two extremes):

f = \frac{d \sqrt{a^2 - 1}}{(2)(a)}
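Here is a small sketch of Approach 2 in Python; the extreme means, error SD, and n′ are hypothetical numbers chosen only to show the arithmetic:

    # Cohen's f from the standardized range d under the three spacing patterns, then phi^2 = n' * f^2.
    import math

    a, n_prime = 4, 10
    mu_max, mu_min, sigma = 18.0, 10.0, 7.0          # hypothetical extreme means and error SD
    d = (mu_max - mu_min) / sigma                    # standardized range of the means

    f_min = d * math.sqrt(1 / (2 * a))                        # minimum variation
    f_mid = d * math.sqrt((a + 1) / (12 * (a - 1)))           # intermediate (evenly spaced) variation
    f_max = d * math.sqrt(a ** 2 - 1) / (2 * a)               # maximum variation
    for label, f_val in (("min", f_min), ("mid", f_mid), ("max", f_max)):
        print(label, round(f_val, 2), "phi^2 =", round(n_prime * f_val ** 2, 2))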
Approach 3, p. 84 (using ω²):

\phi^2 = n' \, \frac{\omega_A^2}{1 - \omega_A^2}    (4-7)
OK, I’ll admit it. This is my personal fave. It’s simple if you are willing to assume that your
research is likely to be typical of psychological research. That is, you can presume that your
effect size will be in the neighborhood of .06. In doing so, you can now get into the power
charts with φ², which will allow you to predict the sample size you need to achieve a
particular level of power (like .80). (You can also get to ω² if you’re willing to estimate the
population variance and the variance due to treatment.)
• Using the power charts to estimate the necessary sample size (n) is something of a trial-and-error
process. The x-axis is ruled in units of φ, so the first step is to take the square root
of φ² (regardless of which of the three approaches you used to get φ²).
• To conserve space, each power chart shows two sets of curves (for α = .01 and α = .05).
The x-axis is also ruled separately for the two different alpha levels. You need a different set
of curves for each different number of levels of the IV (df_num). Thus, if you’re dealing with
an experiment with 4 levels of the IV (as in Keppel 57), you’d be on the chart with df_num =
3 (p. 511).
• If you assume that ω² = .06 and you think that you’d like to use a sample size of n = 20,
you’d come to the rude awakening that power (1 − β) was only about .40 (φ² = 1.28, φ =
1.13, df_denom = 76). OK, so let’s try a larger n, maybe 60. Now you’d find that power was
higher than you’d like (why waste time and energy running too many people through your
study?): φ² = 3.83, φ = 1.96, df_denom = 236, so 1 − β = .91. In fact, as Table 4-1 suggests,
you’d need to use n = 44 to get power of .80 (φ² = 2.81, φ = 1.68, df_denom = 172, so 1 − β =
.80). A computational version of this trial-and-error search is sketched below.
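The same search can be automated. This sketch assumes ω² = .06, a = 4, α = .05, and uses the exact noncentral F distribution (noncentrality λ = a·φ²) instead of the charts, so its answer should land near, but not necessarily exactly on, the chart values above:

    # Search for the smallest n per group that reaches power .80 for a "medium" effect.
    from scipy.stats import f, ncf

    a, alpha, omega2, target = 4, .05, .06, .80
    for n in range(5, 301):
        df1, df2 = a - 1, a * (n - 1)
        f_crit = f.ppf(1 - alpha, df1, df2)
        phi2 = n * omega2 / (1 - omega2)             # formula (4-7)
        power = ncf.sf(f_crit, df1, df2, a * phi2)   # noncentrality lambda = a * phi^2
        if power >= target:
            print(f"n per group = {n}, power = {power:.2f}")   # should come out near Table 4-1's n = 44
            break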
• If the power charts send you the signal that you’d need to run more participants than you can
reasonably expect to do, you have a couple of options. One is to raise your alpha level, but this is
frowned upon by most journal editors. Another approach might be to reduce the number of
conditions that you’re planning to run. That way, you’d need fewer participants. Another approach
would be to make your experiment a repeated measures experiment, if that’s a reasonable alternative
(but we’ll have more to say about this design shortly).
4.5 Estimating the Power of an Experiment
• If you’re taking Keppel’s (and Cohen’s) advice, you’ll be determining the sample size of your
study prior to collecting the first piece of data. However, you can also estimate the power that your
study has achieved after the fact. That is, you can compute an estimate of ω² and then power from
the actual data of your study. The procedure is essentially the same as we’ve just gone through, but
you’d typically compute this after the fact when you’ve completed an experiment and the results
were not significant.
• Imagine, for instance, that you’ve completed an experiment with a = 3 and n = 5 and you’ve
obtained an F(2,12) = 3.20. With F_Crit = 3.89, you’d fail to reject H0. Why? Did you have too little
power? Let’s check.

\hat{\omega}^2 = \frac{(a-1)(F-1)}{(a-1)(F-1) + (a)(n)} = \frac{(2)(2.20)}{(2)(2.20) + (3)(5)} = .227
So, even though we would retain H0, the effect size is large. Using our estimate of effect size, we
would find that

\hat{\phi}_A^2 = n \, \frac{\hat{\omega}_A^2}{1 - \hat{\omega}_A^2} = (5)\left(\frac{.227}{.773}\right) = 1.47

Thus, we would get an estimate of φ = 1.21, which means that power was only about .36, way too low.
In fact, we’d need to increase n to about 12 to achieve power of .80.
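As a check on this after-the-fact calculation, here is a short Python sketch; it uses the exact noncentral F (noncentrality λ = a·φ²), so its power value may not agree perfectly with the chart-based figure of about .36:

    # Post-hoc effect size and power for the a = 3, n = 5, F(2,12) = 3.20 example.
    from scipy.stats import f, ncf

    a, n, F_obt, alpha = 3, 5, 3.20, .05
    df1, df2 = a - 1, a * (n - 1)
    omega2 = (df1 * (F_obt - 1)) / (df1 * (F_obt - 1) + a * n)   # about .227
    phi2 = n * omega2 / (1 - omega2)                             # about 1.47
    power = ncf.sf(f.ppf(1 - alpha, df1, df2), df1, df2, a * phi2)
    print(round(omega2, 3), round(phi2, 2), round(power, 2))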
• For the Keppel 57 example, you should get an estimate of power of about .88, based on φ² = 4.75.
• Cohen also has tables for computing power, but these tables require that you compute f to get into
them. The formula for f² is similar to the formula for φ² using ω², except that n′ is missing. That is:

f^2 = \frac{\hat{\omega}_A^2}{1 - \hat{\omega}_A^2}    (4-9)

• Various computer programs exist for computing power. You might give one of them a try,
although different programs often give different estimates of power (sometimes quite different)
depending on the underlying algorithm. One program to check out is Gpower, which is available on
the Macs in TLC206. You’ll note that this program works in terms of f, rather than ω².
4.6 “Proving” the Null Hypothesis
• As we’ve discussed, you can presume that the null hypothesis is always false, which implies that
with a sufficiently powerful experiment, you could always reject the null. Suppose, however, that
you’re in the uncomfortable position of wanting to show that the null hypothesis is reasonable.
What should you do? First of all, you must do a very powerful experiment (see the Greenwald,
1975 article that Keppel cites). Keppel suggests setting the probability of both types of errors to
.05. (Thus, your power would be .95 instead of .80.) As you can readily imagine, the implications
of setting power to .95 on sample size will be extreme...which is why researchers don’t ever really
set out to “prove” the null hypothesis.
• Another approach might be to do the research and then show that, given your obtained effect size,
the sample size needed to achieve reasonable power would be enormous. You might be able to
convince a journal editor that a more reasonable approach is to presume that the null is “true” or
“true enough.”