Using Statistical Power Analysis to Tune-Up a Research Experiment: A Case Study
Mary B. Evans
[email protected]
Carolyn Wei
[email protected]
Jan H. Spyridakis
[email protected]
University of Washington
Box 352195
Seattle, WA 98195-2195
Abstract

This paper presents a case study of the use of statistical power analysis in a research study. When University of Washington researchers ran a pilot study to investigate the effect of link wording on Web site browsing behavior and comprehension, they obtained results that were not significant on most dependent measures. To analyze the results and discover whether link wording really had no effect, they first turned to statistical power analysis to see whether they might be committing a Type II error (accepting a false null hypothesis). They found that the power of the study was too low and the number of participants too few. This paper explains how they used the results of the power analysis to redesign the study and increase its power and the likelihood of obtaining significant results if true between-group differences did in fact exist.

Keywords: statistical power analysis, usability, research tools

1. Introduction

Tests of statistical significance are important to researchers and to readers of research because they reveal whether we can be confident that the results of research experiments are not simply due to chance. But the concepts of statistical power and effect size are also important, because they can help us design experiments efficiently and interpret results. When the power of an experiment is too low, an effect cannot be detected even if it exists. By predicting effect size in advance of a study, researchers can determine how many participants to include. By estimating effect size from final results, researchers can offer practitioners information they can use to judge how much difference a given design treatment may make to their work [1].

Power and effect size concepts can also be used to troubleshoot quantitative research projects. This paper reviews basic concepts of statistical power and effect size. It then explains how power analysis guided the redesign of a pilot study intended to measure the effect of link wording on the comprehension and browsing behavior of people visiting an informative Web site.

2. Statistical power and effect size

The purpose of many statistical tests is to decide, for a given research hypothesis, whether the researcher should reject the null hypothesis that the treatment being tested has no effect. When making this decision, it is possible to make two types of decision errors. A Type I error involves rejecting the null hypothesis when it is actually true, i.e., concluding that there is a statistically significant difference between the treatment groups in an experiment when no true difference exists. The probability of a Type I error, termed alpha, is set by the experimenter for a given study. The familiar alpha value of .05 (5 percent) was used in the pilot study described in this paper. The second type of decision error, a Type II error, occurs when one accepts the null hypothesis when it is actually false, i.e., when one concludes that there is no difference between the groups tested in an experiment when a true difference exists. The probability of a Type II error is termed beta. The probabilities of Type I and Type II errors are inversely related: as alpha increases, beta decreases, and vice versa.

Statistical power is the ability to detect a difference between treatment groups in an experiment, if a true difference exists. When the statistical power of an experiment is high enough, it is possible to obtain a statistically significant result if there is a true difference. If power is not high enough, a study may not produce a statistically significant result even when there is a true difference.

Power is largely determined by two factors [2]. The first, effect size, is the strength of the association within the population between the effect of interest and the
dependent variable. The larger the effect size, the greater
the power, and vice versa. Effect size is influenced by the
standard deviation of the dependent variable within the
population. The larger the standard deviation, the smaller
the effect size and the lower the power; and vice versa,
the lower the standard deviation, the larger the effect size
and the higher the power. The second important factor in
determining power is the number of participants in the
study. When other factors are held constant, increasing
the number of participants increases the experiment’s
power, and reducing the number of participants reduces
the power. Power is also influenced by the alpha level
selected by the researchers as well as the kind of
statistical test used in the study [2]. Power can vary from
0 to 1 (0 to 100 percent). Cohen ([3], [4]) recommends
that experiments be designed to achieve a power of about
.80 (80 percent). When alpha is .05, a power of .80 is
associated with a 20 percent probability of a Type II error
and 5 percent probability of a Type I error.
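The relationship among these quantities can be restated compactly. The notation below simply re-expresses the preceding sentence and adds the standard identity that power equals one minus beta; it is not part of the authors' original presentation.

    \begin{aligned}
    \alpha &= P(\text{Type I error}) = .05 \\
    \beta  &= P(\text{Type II error}) = .20 \\
    \text{power} &= 1 - \beta = .80
    \end{aligned}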
In power analysis, one uses the relationships between
power, effect size, and number of participants in
designing experiments and interpreting results. For
example, by knowing (or predicting) the effect size, and
choosing the desired power and alpha level, it is possible
to estimate the number of participants needed in a study.
For the study discussed in this paper, the risk of including
many more participants than required was a minor
concern, given the relatively small effect sizes for most
dependent variables. The concern instead was ensuring an
adequate number of participants. For studies for which
effect sizes are expected to be larger, the use of power
analysis would help researchers to avoid including more
participants than needed.
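To make this use of the power-effect size-sample size relationship concrete, the sketch below estimates the total number of participants needed for a one-way ANOVA at the recommended power of .80 and alpha of .05. It is a minimal illustration under stated assumptions, not the procedure used in this study: the eta-squared value, the five-group design, and the choice of the statsmodels library are all assumptions made for the example.

    import numpy as np
    from statsmodels.stats.power import FTestAnovaPower

    # Convert an assumed eta-squared value to Cohen's f, the effect-size
    # metric that statsmodels expects for ANOVA power calculations.
    eta_squared = 0.06  # a "medium" effect by Cohen's conventions
    cohens_f = np.sqrt(eta_squared / (1 - eta_squared))

    # Solve for the total sample size that yields power = .80 at alpha = .05
    # for a hypothetical one-way ANOVA with five groups.
    analysis = FTestAnovaPower()
    total_n = analysis.solve_power(effect_size=cohens_f,
                                   alpha=0.05,
                                   power=0.80,
                                   k_groups=5)
    print(f"Total participants needed: {np.ceil(total_n):.0f}")

Running the same calculation in reverse, with the sample size fixed at a study's actual N and solving for power instead, gives a rough post hoc power estimate comparable in spirit to the "observed power" values discussed later in this paper.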
Because increasing the number of participants in an
experiment increases the chances of obtaining statistically
significant results, even a tiny effect of little practical
importance can be found to be statistically significant if
an experiment includes too many participants [5]. In
technical communication research studies, the practical
significance of effects is important to practitioners. One
way to roughly assess the practical significance of an
effect is to compare an estimate of the effect size to
known conventions. Effect size can be estimated in a
variety of ways, depending on the type of statistical
analysis used in an experiment (Cohen [3] presents a table
of effect size measures for eight common statistical tests).
When one-way analyses of variance (ANOVAs) are used,
the usual measure of effect size is eta-squared (η2),
defined as the proportion of the variance in the dependent
measure accounted for by the independent variable (eta-squared is computed by dividing the sum of squares for the effect of interest by the total sum of squares).
Theoretically, eta-squared can vary from 0 to 1, but in
practice, it rarely reaches 0.5 [2]. Values for eta-squared
approximately correspond to the following effect size
conventions [4]: small (0.01), medium (0.06), and large
(0.14).
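As a concrete illustration of this definition, the following sketch computes eta-squared for a one-way design from raw group scores and compares the result with Cohen's conventions. The scores are invented for the example; only the formula itself (the sum of squares for the effect divided by the total sum of squares) comes from the text.

    import numpy as np

    # Invented scores for three treatment groups (illustrative only).
    groups = [np.array([4.0, 5.0, 6.0, 5.5]),
              np.array([5.0, 6.5, 7.0, 6.0]),
              np.array([6.0, 7.5, 8.0, 7.0])]

    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()

    # Between-group ("effect") sum of squares and total sum of squares.
    ss_effect = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_scores - grand_mean) ** 2).sum()

    eta_squared = ss_effect / ss_total
    print(f"eta-squared = {eta_squared:.3f}")

    # Rough classification against the conventions cited in the text.
    for label, threshold in [("large", 0.14), ("medium", 0.06), ("small", 0.01)]:
        if eta_squared >= threshold:
            print(f"At least a {label} effect by Cohen's conventions")
            break
    else:
        print("Below the conventional threshold for a small effect")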
3. The pilot study
In our pilot study, we expected that links within a Web
site would act as signals to readers, much as advance
organizers and headings signal readers of printed texts. A
review of the literature suggested that explicit,
informative link labels may improve learning, and that
links to tangentially related, “seductive” items may
encourage visitors to explore more widely, but may
interfere with learning. We therefore chose to investigate
two hypotheses:
1. People browsing Web sites will learn more when
links are more informative.
2. People browsing Web sites will visit more pages
when links are more intriguing.
3.1 Methodology
To test these two hypotheses, participants were asked
to browse a test Web site and then answer a questionnaire
concerning their comprehension and perceptions of the
site. Participants included 327 undergraduate engineering
students enrolled in technical communication courses in
2003. The test Web site was modified from a National
Park Service Web site on the natural history of American
Samoa. Each page included a left-hand navigation bar
with links to all pages and four to six embedded links per
page.
For the study, links were worded in one of three ways:
(1) generic links (with one- to two-word titles); (2)
intriguing links (with two- to four-word attention-getting
titles); and (3) informative links (with two- to four-word,
explicit, detailed titles). The links were varied in both the
navigation menu and within the body text, for a total of
five conditions (navigation bar-embedded links): (1) generic-generic, (2) generic-intriguing, (3) generic-informative, (4) intriguing-intriguing, and (5) informative-informative.
Also included in each test Web site was an
introductory section that asked participants to imagine
themselves as the new manager of the Web site, to spend
15 to 20 minutes browsing the site to become familiar
with its contents, and to avoid using the Back button. The
final section of the Web site included a multiple-choice
questionnaire assessing factual and inferential
comprehension of the contents of each page in the Web
site as well as perceptions of the Web site. Participants’
actions were logged as they worked. Some of the
measures computed from the recorded data for each
participant were number of pages browsed, proportion of
embedded and navigation bar links clicked,
comprehension scores (factual and inferential), and total
time spent browsing. Comprehension scores were
measured as the proportions of factual and inferential
comprehension questions that were correctly answered,
computed both for all pages in the site and also for just
those pages visited by each participant.
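The distinction between the two comprehension scores can be made concrete with a small sketch. The data structures below (a set of visited pages and a list of question records with invented page names) are hypothetical stand-ins for the logged data; the study itself used different tooling.

    # Hypothetical log data for one participant: which pages were visited,
    # plus one record per comprehension question (the page it covers, its
    # type, and whether the participant answered it correctly).
    visited_pages = {"climate", "coral_reefs", "rainforest"}
    questions = [
        {"page": "climate",     "kind": "factual",     "correct": True},
        {"page": "coral_reefs", "kind": "inferential", "correct": False},
        {"page": "rainforest",  "kind": "factual",     "correct": True},
        {"page": "fruit_bats",  "kind": "inferential", "correct": False},
    ]

    def proportion_correct(records):
        return sum(r["correct"] for r in records) / len(records) if records else 0.0

    # Comprehension score over all pages in the site...
    overall = proportion_correct(questions)
    # ...and over only those pages the participant actually visited.
    visited_only = proportion_correct(
        [q for q in questions if q["page"] in visited_pages])

    print(f"All pages: {overall:.2f}; visited pages only: {visited_only:.2f}")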
4. Results

Analysis of the recorded data revealed that we needed to remove many participants from the dataset. Participants were disqualified because they had unexpectedly worked around some aspects of the experiment, e.g., by opening multiple browser windows, skipping the comprehension questions, or following a noticeable answer pattern (e.g., choosing answer “a” for all questions). Questionnaire responses from participants suggested that the instructions had put many in the mindset of Web designers rather than consumers of information. Perhaps for that reason, many participants felt they needed to spend less time with the site, and as a result we believe they may have experienced the site differently than was anticipated.

Participants for whom English is a second language also were removed from the dataset because we found, by running t-tests, that their responses to the comprehension questions were significantly different from those of the native English speakers in the study. That is, native and non-native English speakers were deemed to be different populations for purposes of this study.

The final dataset contained many fewer participants than the research team had hoped: just 140 of the original 327 participants. The number of participants in each of the five link conditions ranged from just 23 to 35, and averaged 28.

The final dataset was then analyzed in SPSS 11.5. Analyses of variance showed no significant differences among the five link conditions on factual, inferential, or overall comprehension. Further, there were no significant differences on the total number of links clicked, the number of embedded links clicked, or the number of navigation bar links clicked. Other dependent measures could have been analyzed, but we decided first to examine the power of the study for key dependent variables.

5. A power analysis of the pilot study data

This section explains how we used basic concepts of power analysis and effect size to determine whether our pilot study had adequate statistical power to detect significant effects. First, for all comprehension scores (for visited pages and for all pages), total number of links clicked, total number of embedded links clicked, and total number of navigation bar links clicked, we obtained estimates of power and effect size from SPSS 11.5 (by selecting “Estimates of effect size” and “Observed power” from the options available for univariate ANOVAs). For effect size, SPSS reports values for partial eta-squared rather than eta-squared. These values are equal in the case of the one-way ANOVAs we used, but not for multi-way ANOVAs [6].

For all measures of comprehension and number of links clicked, eta-squared values were in the small or small-to-medium range and power was well below the recommended .80 (Table 1). Following Cohen’s [4] conventions, the estimated effect sizes for the numbers of links clicked were between small and medium; the estimated effect sizes for overall comprehension and for inferential comprehension were in the small range, while the estimated effect size for factual comprehension was between small and medium. For the total number of links clicked and the number of embedded links clicked, power was about half of the recommended level; for the number of navigation bar links clicked, power was less than one-fifth of the recommended level. For overall comprehension scores and for inferential comprehension scores, power was no more than about one-quarter of the recommended value. For factual comprehension scores, power was higher, but still well below the recommended value of .80. These findings indicate that insufficient power may very well have been the reason for the lack of statistically significant results obtained in the pilot study, i.e., we may have made Type II errors and incorrectly retained the null hypotheses.

Table 1. Estimates of power and effect size (eta-squared) for nine dependent variables

Dependent Variable                                                  Eta-squared   Power
Total number of links clicked                                       0.039         0.42
Number of embedded links clicked                                    0.041         0.44
Number of navigation bar links clicked                              0.011         0.13
Proportion of correct questions                                     0.024         0.26
Proportion of correct factual questions                             0.037         0.39
Proportion of correct inferential questions                         0.015         0.17
Proportion of correct questions (visited pages only)                0.025         0.27
Proportion of correct factual questions (visited pages only)        0.055         0.57
Proportion of correct inferential questions (visited pages only)    0.010         0.12

We inferred that including too few participants, given the relatively small effect sizes of interest, probably had resulted in inadequate statistical power. Hence, we used the observed group differences and standard deviations estimated from the pilot study dataset to estimate the number of participants that would have been needed to detect statistically significant differences between the five link conditions, for the dependent variables in this experiment. We used an online sample size estimation tool [7] to make these estimates, which are shown in
Table 2 (this tool automates a standard method of sample
size estimation described by Sokal and Rohlf [8, p. 263]).
According to these estimates, and given the final number
of participants used in data analysis, our pilot study had
about half the number of participants needed to detect a
significant difference between the most widely divergent
means for total number of links clicked, number of
embedded links clicked, and proportion of correctly
answered factual questions about visited pages. The
number of participants was much too small to detect
significant differences between the condition means for
other comprehension measures and for the number of
navigation bar links clicked.
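For readers without access to the online tool, this kind of calculation can be sketched with a simple normal approximation for comparing two means. The means and standard deviation below are invented placeholders, and the z-based formula is a common approximation rather than necessarily the exact Sokal and Rohlf procedure the tool implements, which works with t distributions and iteration.

    import math
    from scipy.stats import norm

    def n_per_group(mean_diff, sd, alpha=0.05, power=0.80):
        """Approximate participants needed per group to detect a difference
        between two means with a two-sided test (normal approximation)."""
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        return 2 * ((z_alpha + z_beta) * sd / mean_diff) ** 2

    # Invented example: the two most divergent condition means differ by
    # 5.0 clicks, with a pooled standard deviation of 8.0 clicks.
    print(math.ceil(n_per_group(mean_diff=5.0, sd=8.0)))

Cross-checking such estimates with a dedicated power routine (for example, tt_ind_solve_power in statsmodels) is a reasonable safeguard, since the normal approximation slightly underestimates the required sample size for small samples.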
Table 2. Estimated sample sizes to obtain power of .80.

Dependent Variable                                                  Estimated sample size needed for adequate power
Total number of links clicked                                       62
Number of embedded links clicked                                    62
Number of navigation bar links clicked                              266
Proportion of correct questions                                     158
Proportion of correct questions (visited pages only)                224
Proportion of correct factual questions (visited pages only)        62
Proportion of correct inferential questions (visited pages only)    190

6. Revising the study

Because the sample size estimates indicated that a relatively small number of additional participants would be required to avoid a Type II error, at least for the number of links clicked and for factual comprehension scores, we revised some aspects of the study for the final study run in order to increase the number of participants in the study dataset. Specifically, we focused on methods that would reduce the number of participants that would need to be disqualified from the study dataset.

We made three key changes to achieve this goal. First, we revised the instructions displayed in the test Web site to more explicitly guide participants and reduce noncompliance. For example, we explicitly asked participants to open only one browser window. We also placed instructions to avoid using the Back button not only on the initial page of instructions, but elsewhere in the Web site as well. Second, we revised the study scenario to make it more likely that participants would read page content rather than assess the design of the site. To do this, we asked participants to imagine themselves as new park rangers needing to become informed about the natural history of American Samoa in order to answer visitors’ questions. Third, to further encourage reading of page content, we added a sentence on the initial page of instructions alerting participants that they would be asked questions about the information in the study Web site. Additionally, since pilot study participants had tended to follow the order of links in the navigation bar when browsing the study Web site, we redesigned the Web site so that the order of links in the navigation bar would be re-randomized each time a new participant began the study. Aggregate analyses of paths through the Web site would thus reveal a more balanced perspective on pages visited, and would make it possible to tell whether pages were visited because links to those pages were placed at the top of the navigation bar or because of the link wording.

7. Preliminary power analysis results from revised study

Data from the revised study are being analyzed, and final results of a power analysis of the revised study will be presented at the conference. Preliminary results indicate a much lower rate of participant disqualification, resulting in a much higher percentage of participants retained in the new study dataset (78 percent of the total number of participants, compared with 43 percent in the initial study), and generally higher power as a result.

8. Conclusions

Completing a power analysis and estimating effect sizes for dependent variables can prove to be useful in several respects. First, it becomes possible to judge whether statistically non-significant results are more likely due to a study’s small size than to the absence of true effects. It is possible to make this judgment by (1) estimating effect sizes, and then (2) estimating the sample sizes required to obtain adequate power, given these effect sizes. Second, power analysis makes it possible to judge whether redesigning and rerunning a study could increase its power and the likelihood of obtaining significant results if true between-group differences exist. It is possible to make this judgment using (1) estimates of required sample sizes, to estimate the number of additional participants needed to attain adequate power for each dependent variable, and (2) knowledge or informed guesswork about the degree to which redesign measures could boost participation.
9. References

[1] Krull, R., “What practitioners need to know to evaluate research,” IEEE Transactions on Professional Communication, 40(3): 168-181, 1997.

[2] Aron, A., and E. N. Aron, Statistics for Psychology, 3rd ed., Prentice-Hall, Upper Saddle River, New Jersey, 2003.

[3] Cohen, J., “A power primer,” Psychological Bulletin, 112(1): 155-159, 1992.

[4] Cohen, J., Statistical Power Analysis for the Behavioral Sciences, Erlbaum, Hillsdale, New Jersey, 1988.

[5] Murray, L. W., and D. A. Dosser, Jr., “How significant is a significant difference? Problems with the measurement of magnitude of effect,” Journal of Counseling Psychology, 34(1): 68-72, 1987.

[6] Levine, T. R., and C. R. Hullett, “Eta squared, partial eta squared, and misreporting of effect size in communication research,” Human Communication Research, 28(4): 612-625, 2002.

[7] Chinese University of Hong Kong, Statistics toolbox. [Online document] [Cited April 10, 2004] Available: http://department.obg.cuhk.edu.hk/ResearchSupport/Sample_size_CompMean.asp

[8] Sokal, R. R., and F. J. Rohlf, Biometry: The Principles and Practice of Statistics in Biological Research, W. H. Freeman, San Francisco, 1981.

About the Authors

Mary B. Evans studies usability engineering and international technical communication. For her M.S. in technical communication from the University of Washington, she developed a set of empirically supported Web design guidelines. She formerly served as a research analyst and Web developer in the National Oceanic and Atmospheric Administration, Seattle.

Carolyn Wei is a PhD student in Technical Communication at the University of Washington. Her research focuses on the adoption of information and communication technologies such as the Internet and mobile devices in diverse settings. Her research interests also include interaction within virtual communities such as distributed work groups and blog networks.

Jan H. Spyridakis is a Professor in the Department of Technical Communication at the University of Washington. Her research focuses on document and screen design variables that affect comprehension and usability, cross-cultural audiences, and the refinement of research methods. She has won numerous teaching and publication awards, and she often teaches seminars in industry.