Using Statistical Power Analysis to Tune-Up a Research Experiment: A Case Study

Mary B. Evans [email protected]
Carolyn Wei [email protected]
Jan H. Spyridakis [email protected]
University of Washington, Box 352195, Seattle, WA 98195-2195

Abstract

This paper presents a case study of the use of statistical power analysis in a research study. When University of Washington researchers ran a pilot study to investigate the effect of link wording on Web site browsing behavior and comprehension, they obtained results that were not significant on most dependent measures. To analyze the results and discover whether link wording really had no effect, they first turned to statistical power analysis to see whether they might be committing a Type II error (accepting a false null hypothesis). They did in fact find that the power of the study was too low and the number of participants too few. This paper explains how they used the results of the power analysis to redesign the study and increase its power, and thus the likelihood of obtaining significant results if true between-group differences did in fact exist.

Keywords: statistical power analysis, usability, research tools

1. Introduction

Tests of statistical significance are important to researchers and to readers of research because they reveal whether we can be confident that the results of research experiments are not simply due to chance. But the concepts of statistical power and effect size are also important, because they can help us to design experiments efficiently and interpret results. When the power of an experiment is too low, an effect cannot be detected even if it exists. By predicting effect size in advance of a study, researchers can determine how many participants to include. By estimating effect size from final results, researchers can offer practitioners information they can use to judge how much difference a given design treatment may make to their work [1]. Power and effect size concepts can also be used to troubleshoot quantitative research projects. This paper reviews basic concepts of statistical power and effect size. It then explains how power analysis guided the redesign of a pilot study intended to measure the effect of link wording on the comprehension and browsing behavior of people visiting an informative Web site.
2. Statistical power and effect size

The purpose of many statistical tests is to decide, for a given research hypothesis, whether the researcher should reject the null hypothesis that the treatment being tested has no effect. When making this decision, it is possible to make two types of decision errors. A Type I error involves rejecting the null hypothesis when it is actually true—i.e., concluding that there is a statistically significant difference between the treatment groups in an experiment when no true difference exists. The probability of a Type I error, termed alpha, is set by the experimenter for a given study. The familiar alpha value of p = .05 (5 percent) was used in the pilot study described in this paper. The second type of decision error, a Type II error, occurs when one accepts the null hypothesis when it is actually false—i.e., when one concludes that there is no difference between the groups tested in an experiment when a true difference exists. The probability of a Type II error is termed beta. The probabilities of Type I and Type II errors are inversely related—as alpha increases, beta decreases, and vice versa.

Statistical power is the ability to detect a difference between treatment groups in an experiment if a true difference exists. When the statistical power of an experiment is high enough, it is possible to obtain a statistically significant result if there is a true difference. If power is not high enough, a study may not produce a statistically significant result even when there is a true difference.

Power is largely determined by two factors [2]. The first, effect size, is the strength of the association within the population between the effect of interest and the dependent variable. The larger the effect size, the greater the power, and vice versa. Effect size is influenced by the standard deviation of the dependent variable within the population: the larger the standard deviation, the smaller the effect size and the lower the power; the smaller the standard deviation, the larger the effect size and the higher the power. The second important factor in determining power is the number of participants in the study. When other factors are held constant, increasing the number of participants increases the experiment's power, and reducing the number of participants reduces the power. Power is also influenced by the alpha level selected by the researchers as well as the kind of statistical test used in the study [2]. Power can vary from 0 to 1 (0 to 100 percent). Cohen ([3], [4]) recommends that experiments be designed to achieve a power of about .80 (80 percent). When alpha is .05, a power of .80 is associated with a 20 percent probability of a Type II error and a 5 percent probability of a Type I error.

In power analysis, one uses the relationships between power, effect size, and number of participants in designing experiments and interpreting results. For example, by knowing (or predicting) the effect size and choosing the desired power and alpha level, it is possible to estimate the number of participants needed in a study. For the study discussed in this paper, the risk of including many more participants than required was a minor concern, given the relatively small effect sizes for most dependent variables; the concern instead was ensuring an adequate number of participants. For studies in which effect sizes are expected to be larger, power analysis would help researchers avoid including more participants than needed. Because increasing the number of participants in an experiment increases the chances of obtaining statistically significant results, even a tiny effect of little practical importance can be found to be statistically significant if an experiment includes too many participants [5].

In technical communication research studies, the practical significance of effects is important to practitioners. One way to roughly assess the practical significance of an effect is to compare an estimate of the effect size to known conventions. Effect size can be estimated in a variety of ways, depending on the type of statistical analysis used in an experiment (Cohen [3] presents a table of effect size measures for eight common statistical tests). When one-way analyses of variance (ANOVAs) are used, the usual measure of effect size is eta-squared (η²), defined as the proportion of the variance in the dependent measure accounted for by the independent variable (eta-squared is computed by dividing the sum of squares for the effect of interest by the total sum of squares). Theoretically, eta-squared can vary from 0 to 1, but in practice it rarely reaches 0.5 [2]. Values for eta-squared approximately correspond to the following effect size conventions [4]: small (0.01), medium (0.06), and large (0.14).
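To make the eta-squared computation just described concrete, the following Python sketch (not part of the original study, which relied on SPSS) computes eta-squared for a hypothetical one-way ANOVA with five groups and converts it to Cohen's f, the effect-size metric that many power routines expect. All scores and group sizes below are invented for illustration.

    import numpy as np

    def eta_squared(groups):
        """Proportion of total variance accounted for by group membership."""
        all_scores = np.concatenate(groups)
        grand_mean = all_scores.mean()
        # Between-group (effect) sum of squares
        ss_effect = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
        # Total sum of squares
        ss_total = ((all_scores - grand_mean) ** 2).sum()
        return ss_effect / ss_total

    # Hypothetical comprehension scores for five link-wording conditions
    rng = np.random.default_rng(0)
    groups = [rng.normal(loc=m, scale=10, size=28) for m in (60, 62, 61, 63, 65)]

    eta2 = eta_squared(groups)
    # Cohen's f can be derived from eta-squared: f = sqrt(eta2 / (1 - eta2))
    cohens_f = np.sqrt(eta2 / (1 - eta2))
    print(f"eta-squared = {eta2:.3f}, Cohen's f = {cohens_f:.3f}")

Comparing the printed eta-squared value with the conventions above (0.01, 0.06, 0.14) gives a rough sense of whether an effect is small, medium, or large.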
3. The pilot study

In our pilot study, we expected that links within a Web site would act as signals to readers, much as advance organizers and headings signal readers of printed texts. A review of the literature suggested that explicit, informative link labels may improve learning, and that links to tangentially related, "seductive" items may encourage visitors to explore more widely but may interfere with learning. We therefore chose to investigate two hypotheses:

1. People browsing Web sites will learn more when links are more informative.
2. People browsing Web sites will visit more pages when links are more intriguing.

3.1 Methodology

To test these two hypotheses, participants were asked to browse a test Web site and then answer a questionnaire concerning their comprehension and perceptions of the site. Participants included 327 undergraduate engineering students enrolled in technical communication courses in 2003.

The test Web site was modified from a National Park Service Web site on the natural history of American Samoa. Each page included a left-hand navigation bar with links to all pages and four to six embedded links per page. For the study, links were worded in one of three ways: (1) generic links (with one- to two-word titles); (2) intriguing links (with two- to four-word, attention-getting titles); and (3) informative links (with two- to four-word, explicit, detailed titles). The links were varied in both the navigation menu and within the body text, for a total of five conditions (navigation bar wording / embedded link wording): (1) generic-generic, (2) generic-intriguing, (3) generic-informative, (4) intriguing-intriguing, and (5) informative-informative.

Also included in each test Web site was an introductory section that asked participants to imagine themselves as the new manager of the Web site, to spend 15 to 20 minutes browsing the site to become familiar with its contents, and to avoid using the Back button. The final section of the Web site included a multiple-choice questionnaire assessing factual and inferential comprehension of the contents of each page in the Web site as well as perceptions of the Web site.

Participants' actions were logged as they worked. Among the measures computed from the recorded data for each participant were the number of pages browsed, the proportion of embedded and navigation bar links clicked, comprehension scores (factual and inferential), and total time spent browsing. Comprehension scores were measured as the proportions of factual and inferential comprehension questions that were correctly answered, computed both for all pages in the site and for just those pages visited by each participant.
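As an illustration of how such per-participant measures might be derived from logged browsing data, here is a small, hypothetical Python sketch. The log fields and the summarize function are invented for this example and do not reflect the study's actual logging format.

    from dataclasses import dataclass

    @dataclass
    class ParticipantLog:
        pages_visited: set          # page ids the participant opened
        embedded_clicks: int        # embedded links clicked
        navbar_clicks: int          # navigation bar links clicked
        seconds_browsing: float     # total time spent on the site
        answers: dict               # question id -> answered correctly (bool)

    def summarize(log, n_embedded_links, n_navbar_links):
        """Compute the kinds of per-participant measures described in Section 3.1."""
        n_correct = sum(1 for correct in log.answers.values() if correct)
        return {
            "pages_browsed": len(log.pages_visited),
            "prop_embedded_clicked": log.embedded_clicks / n_embedded_links,
            "prop_navbar_clicked": log.navbar_clicks / n_navbar_links,
            "prop_correct": n_correct / len(log.answers),
            "minutes_browsing": log.seconds_browsing / 60,
        }

    # Example with invented values
    log = ParticipantLog({"climate", "reefs"}, embedded_clicks=6, navbar_clicks=4,
                         seconds_browsing=900.0, answers={"q1": True, "q2": False})
    print(summarize(log, n_embedded_links=30, n_navbar_links=12))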
4. Results

Analysis of the recorded data revealed that we needed to remove many participants from the dataset. Participants were disqualified because they had unexpectedly worked around some aspects of the experiment, e.g., by opening multiple browser windows, skipping the comprehension questions, or following a noticeable answer pattern (e.g., choosing answer "a" for all questions). Questionnaire responses suggested that the instructions had put many participants in the mindset of Web designers rather than consumers of information. Perhaps for that reason, many participants felt they needed to spend less time with the site, and as a result we believe they may have experienced the site differently than we had anticipated. Participants for whom English is a second language were also removed from the dataset because we found, by running t-tests, that their responses to the comprehension questions were significantly different from those of the native English speakers in the study (a sketch of such a comparison appears at the end of this section); that is, native and non-native English speakers were deemed to be different populations for the purposes of this study.

The final dataset contained many fewer participants than the research team had hoped: just 140 of the original 327. The number of participants in each of the five link conditions ranged from 23 to 35 and averaged 28.

The final dataset was then analyzed in SPSS 11.5. Analyses of variance showed no significant differences among the five link conditions on factual, inferential, or overall comprehension. Further, there were no significant differences in the total number of links clicked, the number of embedded links clicked, or the number of navigation bar links clicked. Other dependent measures could have been analyzed, but we decided first to examine the power of the study for key dependent variables.
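The screening t-test mentioned above can be illustrated with the following sketch. The study itself ran its tests in SPSS; the scipy call and the score vectors here are stand-ins with invented data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Hypothetical proportions of comprehension questions answered correctly
    native_scores = rng.normal(loc=0.70, scale=0.12, size=120)
    non_native_scores = rng.normal(loc=0.58, scale=0.14, size=30)

    t_stat, p_value = stats.ttest_ind(native_scores, non_native_scores, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        # Groups differ significantly, so treat them as separate populations
        print("Exclude the non-native speakers from the main analysis dataset.")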
5. A power analysis of the pilot study data

This section explains how we used basic concepts of power analysis and effect size to determine whether our pilot study had adequate statistical power to detect significant effects. First, for all comprehension scores (for visited pages and for all pages), the total number of links clicked, the total number of embedded links clicked, and the total number of navigation bar links clicked, we obtained estimates of power and effect size from SPSS 11.5 (by selecting "Estimates of effect size" and "Observed power" from the options available for univariate ANOVAs). For effect size, SPSS reports values for partial eta-squared rather than eta-squared; these values are equal for the one-way ANOVAs we used, but not for multi-way ANOVAs [6]. For all measures of comprehension and number of links clicked, eta-squared values were in the small or small-to-medium range and power was well below the recommended .80 (Table 1).

Table 1. Estimates of power and effect size (eta-squared) for nine dependent variables

    Dependent Variable                                                 Eta-squared   Power
    Total number of links clicked                                      0.039         0.42
    Number of embedded links clicked                                   0.041         0.44
    Number of navigation bar links clicked                             0.011         0.13
    Proportion of correct questions                                    0.024         0.26
    Proportion of correct factual questions                            0.037         0.39
    Proportion of correct inferential questions                        0.015         0.17
    Proportion of correct questions (visited pages only)               0.025         0.27
    Proportion of correct factual questions (visited pages only)       0.055         0.57
    Proportion of correct inferential questions (visited pages only)   0.010         0.12

Following Cohen's [4] conventions, the estimated effect sizes for the numbers of links clicked were between small and medium; the estimated effect sizes for overall comprehension and for inferential comprehension were in the small range, while the estimated effect size for factual comprehension was between small and medium. For the total number of links clicked and the number of embedded links clicked, power was about half of the recommended level; for the number of navigation bar links clicked, power was less than one-fifth of the recommended level. For overall comprehension scores and for inferential comprehension scores, power was no more than about one-quarter of the recommended value. For factual comprehension scores, power was higher, but still well below the recommended value of .80. These findings indicate that insufficient power may very well have been the reason for the lack of statistically significant results in the pilot study; i.e., we may have made Type II errors and incorrectly retained the null hypotheses.

We inferred that including too few participants, given the relatively small effect sizes of interest, probably had resulted in inadequate statistical power. Hence, we used the observed group differences and standard deviations estimated from the pilot study dataset to estimate the number of participants that would have been needed to detect statistically significant differences between the five link conditions for the dependent variables in this experiment. We used an online sample size estimation tool [7] to make these estimates, which are shown in Table 2 (this tool automates a standard method of sample size estimation described by Sokal and Rohlf [8, p. 263]). According to these estimates, and given the final number of participants used in data analysis, our pilot study had about half the number of participants needed to detect a significant difference between the most widely divergent means for the total number of links clicked, the number of embedded links clicked, and the proportion of correctly answered factual questions about visited pages. The number of participants was much too small to detect significant differences between the condition means for the other comprehension measures and for the number of navigation bar links clicked.

Table 2. Estimated sample sizes to obtain power of .80

    Dependent Variable                                                 Estimated sample size needed for adequate power
    Total number of links clicked                                      62
    Number of embedded links clicked                                   62
    Number of navigation bar links clicked                             266
    Proportion of correct questions                                    158
    Proportion of correct questions (visited pages only)               224
    Proportion of correct factual questions (visited pages only)       62
    Proportion of correct inferential questions (visited pages only)   190
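For readers who want to reproduce this kind of sample size estimate without the online tool, the following sketch (assuming the statsmodels package is available) solves for the total sample size of a five-group one-way ANOVA from an eta-squared value taken from Table 1. Note that this uses an overall F-test power calculation rather than the comparison of the two most divergent means used for Table 2 (following Sokal and Rohlf [8]), so the resulting numbers will not match Table 2 exactly.

    import math
    from statsmodels.stats.power import FTestAnovaPower

    eta2 = 0.039                              # observed effect size for total links clicked (Table 1)
    cohens_f = math.sqrt(eta2 / (1 - eta2))   # convert eta-squared to Cohen's f

    analysis = FTestAnovaPower()
    n_total = analysis.solve_power(effect_size=cohens_f,
                                   k_groups=5,   # five link-wording conditions
                                   alpha=0.05,
                                   power=0.80)
    print(f"Total participants needed: {math.ceil(n_total)} "
          f"(about {math.ceil(n_total / 5)} per condition)")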
6. Revising the study

Because the sample size estimates indicated that a relatively small number of additional participants would be required to avoid a Type II error, at least for the number of links clicked and for factual comprehension scores, we revised some aspects of the study for the final study run in order to increase the number of participants in the study dataset. Specifically, we focused on methods that would reduce the number of participants who would need to be disqualified from the study dataset.

We made three key changes to achieve this goal. First, we revised the instructions displayed in the test Web site to more explicitly guide participants and reduce noncompliance. For example, we explicitly asked participants to open only one browser window. We also placed instructions to avoid using the Back button not only on the initial page of instructions but elsewhere in the Web site as well. Second, we revised the study scenario to make it more likely that participants would read page content rather than assess the design of the site. To do this, we asked participants to imagine themselves as new park rangers needing to become informed about the natural history of American Samoa in order to answer visitors' questions. Third, to further encourage reading of page content, we added a sentence on the initial page of instructions alerting participants that they would be asked questions about the information in the study Web site.

Additionally, since pilot study participants had tended to follow the order of links in the navigation bar when browsing the study Web site, we redesigned the Web site so that the order of links in the navigation bar would be re-randomized each time a new participant began the study. Aggregate analyses of paths through the Web site would thus reveal a more balanced perspective on pages visited, and would make it possible to tell whether pages were visited because links to those pages were placed at the top of the navigation bar or because of the link wording.
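The per-participant re-randomization of the navigation bar can be as simple as shuffling the menu order when a session starts; the following minimal sketch uses an invented page list to show the idea.

    import random

    # Hypothetical page titles for the navigation bar
    NAV_PAGES = ["Climate", "Coral Reefs", "Rain Forest", "Wildlife", "Geology"]

    def navigation_order_for_new_participant():
        """Return a freshly shuffled copy of the navigation bar links."""
        order = NAV_PAGES[:]      # copy so the master list is never mutated
        random.shuffle(order)
        return order

    print(navigation_order_for_new_participant())

Storing the shuffled order with each participant's log also makes it possible, in the aggregate path analyses, to separate position effects from link-wording effects.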
7. Preliminary power analysis results from the revised study

Data from the revised study are being analyzed, and final results of a power analysis of the revised study will be presented at the conference. Preliminary results indicate a much lower rate of participant disqualification, resulting in a much higher percentage of participants retained in the new study dataset (78 percent of the total number of participants, compared with 43 percent in the initial study) and, as a result, generally higher power.

8. Conclusions

Completing a power analysis and estimating effect sizes for dependent variables can prove useful in several respects. First, it becomes possible to judge whether statistically non-significant results are more likely due to a study's small size than to the absence of true effects. This judgment can be made by (1) estimating effect sizes and then (2) estimating the sample sizes required to obtain adequate power, given these effect sizes. Second, power analysis makes it possible to judge whether redesigning and rerunning a study could increase its power and the likelihood of obtaining significant results if true between-group differences exist. This judgment can be made using (1) estimates of required sample sizes, to determine the number of additional participants needed to attain adequate power for each dependent variable, and (2) knowledge or informed guesswork about the degree to which redesign measures could boost participation.

9. References

[1] Krull, R., "What practitioners need to know to evaluate research," IEEE Transactions on Professional Communication, 40(3): 168-181, 1997.

[2] Aron, A., and E. N. Aron, Statistics for Psychology, 3rd ed., Prentice-Hall, Upper Saddle River, New Jersey, 2003.

[3] Cohen, J., "A power primer," Psychological Bulletin, 112(1): 155-159, 1992.

[4] Cohen, J., Statistical Power Analysis for the Behavioral Sciences, Erlbaum, Hillsdale, New Jersey, 1988.

[5] Murray, L. W., and D. A. Dosser, Jr., "How significant is a significant difference? Problems with the measurement of magnitude of effect," Journal of Counseling Psychology, 34(1): 68-72, 1987.

[6] Levine, T. R., and C. R. Hullett, "Eta squared, partial eta squared, and misreporting of effect size in communication research," Human Communication Research, 28(4): 612-625, 2002.

[7] Chinese University of Hong Kong, Statistics toolbox. [Online document] [Cited April 10, 2004] Available: http://department.obg.cuhk.edu.hk/ResearchSupport/Sample_size_CompMean.asp

[8] Sokal, R. R., and F. J. Rohlf, Biometry: The Principles and Practice of Statistics in Biological Research, W. H. Freeman, San Francisco, 1981.

About the Authors

Mary B. Evans studies usability engineering and international technical communication. For her M.S. in technical communication from the University of Washington, she developed a set of empirically supported Web design guidelines. She formerly served as a research analyst and Web developer in the National Oceanic and Atmospheric Administration, Seattle.

Carolyn Wei is a PhD student in Technical Communication at the University of Washington. Her research focuses on the adoption of information and communication technologies, such as the Internet and mobile devices, in diverse settings. Her research interests also include interaction within virtual communities such as distributed work groups and blog networks.

Jan H. Spyridakis is a Professor in the Department of Technical Communication at the University of Washington. Her research focuses on document and screen design variables that affect comprehension and usability, cross-cultural audiences, and the refinement of research methods. She has won numerous teaching and publication awards, and she often teaches seminars in industry.