Keppel, G. Design and Analysis: Chapter 4: The Sensitivity of an Experiment: Effect Size and Power

• "A researcher also attempts to design a sensitive experiment — one that is sufficiently powerful to detect any differences that might be present in the population. In general, we can create sensitive or powerful experiments by using large sample sizes, by choosing treatment conditions that are expected to produce sizable effects, and by reducing the uncontrolled variability in any study."

4.1 Estimating relative treatment magnitude

• "...the importance of an experimental manipulation is demonstrated by the degree to which we can account for the total variability among subjects by isolating the treatment effects."

• "All too frequently, researchers compare an F test that is significant at p < .00001 with one that is significant at p < .05 and conclude that the first experiment represents an impressive degree of prediction..." Because the size of the F ratio is affected by both treatment effects and sample size, we can't compare F ratios directly.

• "Consider two experiments, one with a sample size of 5 and the other with a sample size of 20, in which both experiments produce an F that is significant at p = .05." Which result is more impressive? This is the problem posed by Rosenthal and Gaito (1963). Surprisingly, many of the "sophisticated" researchers they asked reported that the experiment with the larger sample size had the larger or stronger effect. Can you explain why it's more impressive to reject H0 with a small sample size? If you can't do so now, you should be able to after you've learned more about treatment effect size and power.

• One popular measure of the size of a treatment effect is omega squared (ω²). It is a quite useful measure because it is unaffected by sample size. For the single-factor independent groups design, ω² is defined as:

\omega^2 = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_{S/A}^2}

The numerator represents the treatment effects in the population and the denominator represents the total variability in the population (variability due to treatment effects plus other variability). When no treatment effects are present, ω² is 0, and increasing treatment effects yield values of ω² between 0 and 1. (You can get negative estimates of ω² when F < 1.) ω² is often referred to as the proportion of variance "explained" or "accounted for" by the manipulation.

• Because all the terms are Greek (it's all Greek to me!), you should recognize that the formula above is not particularly useful (except in theoretical terms). However, we can estimate ω² in a couple of ways:

\hat{\omega}^2 = \frac{SS_A - (a - 1)\,MS_{S/A}}{SS_T + MS_{S/A}}   (4-1)

\hat{\omega}^2 = \frac{(a - 1)(F - 1)}{(a - 1)(F - 1) + a\,n}   (4-2)

Both formulas should give you identical answers, but they somewhat obscure the theoretical relationship illustrated in the definition above. So keep in mind that both formulas estimate the proportion of the total variance attributable to the treatment effects.

• If we apply the formulas to the data in Keppel p. 57 (a = 4, n = 4, SS_A = 3314.25, SS_T = 5119.75, MS_S/A = 150.46, F = 7.34), we get:

\hat{\omega}^2 = \frac{3314.25 - (4 - 1)(150.46)}{5119.75 + 150.46} = \frac{2862.87}{5270.21} = .543   (using 4-1)

\hat{\omega}^2 = \frac{(4 - 1)(7.34 - 1)}{(4 - 1)(7.34 - 1) + (4)(4)} = \frac{19.02}{19.02 + 16} = .543   (using 4-2)

So we could say that about 54% of the variability in error scores is due to the levels of sleep deprivation in the experiment. (A short computational version of this calculation appears below.)

• According to Cohen, we can consider three rough levels of effect size: "Small" -> ω̂² ≈ .01, "Medium" -> ω̂² ≈ .06, "Large" -> ω̂² ≥ .15.
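The two estimation formulas translate directly into code. Below is a minimal Python sketch (my own illustration, not part of Keppel's text) that computes ω̂² both ways from the ANOVA summary values quoted above; the function names are mine.

```python
# Minimal sketch: estimating omega-squared with formulas (4-1) and (4-2).

def omega_sq_from_ss(ss_a, ss_t, ms_sa, a):
    """Formula (4-1): uses the sums of squares and MS_S/A from the ANOVA table."""
    return (ss_a - (a - 1) * ms_sa) / (ss_t + ms_sa)

def omega_sq_from_f(f_obt, a, n):
    """Formula (4-2): uses only the obtained F, the number of groups a, and n per group."""
    return ((a - 1) * (f_obt - 1)) / ((a - 1) * (f_obt - 1) + a * n)

# Values from the Keppel p. 57 example (a = 4, n = 4).
print(omega_sq_from_ss(ss_a=3314.25, ss_t=5119.75, ms_sa=150.46, a=4))  # ~ .543
print(omega_sq_from_f(f_obt=7.34, a=4, n=4))                            # ~ .543
```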
• Instead of ω², we could use other measures of effect size. One possible measure is R². R² is always greater than ω², but the difference decreases as sample size increases. When you look at the formula, R² should make sense to you as a measure of treatment effect:

R^2 = \frac{SS_A}{SS_T}   (4-3)

When applied to the Keppel p. 57 data, you'd get:

R^2 = \frac{3314.25}{5119.75} = .647

• We could also use epsilon squared (also referred to as the "shrunken" R²):

\hat{\epsilon}_A^2 = \frac{SS_A - (a - 1)\,MS_{S/A}}{SS_T}

For the Keppel p. 57 data, we'd get:

\hat{\epsilon}_A^2 = \frac{3314.25 - (3)(150.46)}{5119.75} = .559

• For this course, we'll only use ω² for estimating effect size. As Keppel notes, the typical effect size found in psychological research is about .06...a "medium" treatment effect.

• We can't allow the tail to wag the dog, however, so we must keep in mind that a "meaningful result depends on its implications for theory, not specifically on its estimated strength." In fact, Keppel argues that as programmatic research continues, it should increasingly address smaller and smaller treatment effects.

• We must also keep in mind that in experimental research, the designer of the experiment can influence ω² by selecting appropriate levels of the independent variable (using extreme manipulations, etc.).

• OK, now we can directly address the problem posed by Rosenthal & Gaito. Assume the following: You are comparing two experiments, one with n = 5 and one with n = 20. Both are single-factor independent groups experiments with a = 4 (like Keppel p. 57). Both results are exactly significant at α = .05 (i.e., p = .05 and F_Obt = F_Crit). Compare the treatment effects and power for the two experiments:

                      n = 5     n = 20
  F_Obt               _____     _____
  Treatment effect    _____     _____
  Power               _____     _____

Can you now see why the results are more impressive with n = 5?

4.2 Controlling Type I and Type II Errors

• The Type I error rate (α) is set to .05 (fairly rigidly).

• The Type II error rate (β) is not known, nor is power (1 − β).

• "Why should we be concerned with controlling power? The answer is that power reflects the degree to which we can detect the treatment differences we expect and the chances that others will be able to duplicate our findings when they attempt to repeat our experiments."

• "...most researchers appear to pay surprisingly little attention to power..."

• Cohen (1962), Brewer (1972), Sedlmeier & Gigerenzer (1989), and others have found that the power of typical psychology experiments reported in journals is about .50. What is the implication? Half of the psychological research undertaken will not yield significant results even when H0 is false. (The short simulation sketch below illustrates this point.)

• So, here's a thought question. What are the implications of low power for psychology as a discipline? Think it through...what's the cost to the discipline of conducting research that doesn't find a significant result when it should? What's the cost of finding significant results that are erroneous (Type I errors)?
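To make the .50 figure concrete, here is a small Monte Carlo sketch (my own illustration, not from Keppel). The four population means, σ = 1, and n = 24 per group are hypothetical values chosen so that ω² is about .06 and power is roughly one half; with those values, roughly half of the simulated experiments fail to reach p < .05 even though H0 is false.

```python
# Monte Carlo sketch: what "power of about .50" means in practice.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical population: a = 4 groups, sigma = 1, means chosen so that
# omega-squared is about .06 (a "medium" effect); n = 24 per group puts
# power roughly in the neighborhood of .5. These are illustrative values only.
true_means = [-0.36, 0.0, 0.0, 0.36]
n_per_group = 24
n_experiments = 5000

significant = 0
for _ in range(n_experiments):
    groups = [rng.normal(mu, 1.0, n_per_group) for mu in true_means]
    _, p = f_oneway(*groups)
    significant += (p < .05)

# Roughly half of these simulated studies "work", even though the effect is real.
print(significant / n_experiments)
```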
4.3 Reducing Error Variance

• Don't get hung up on increasing power by increasing sample size. Another approach to increasing power is to reduce error variance. (You could also increase power by increasing the treatment effect, right?)

• "There are three major sources of error variance: random variation in the actual treatments, unanalyzed control factors, and individual differences."

• You can control random variation in lots of ways, including carefully crafted instructions, carefully controlled research environments, well-trained experimenters, well-maintained equipment, and automation of the experiment where practical.

• Control factors are best handled by careful experimental design coupled with appropriate statistical analyses.

• In an independent groups design, you're stuck with individual differences. You can reduce individual differences by selecting particular populations, but that will reduce the generalizability of your results.

4.4 Using Sample Size to Control Power

• Power is determined by α (which we cannot really vary from .05), by ω² (which we can't really know in advance in most research), and by n (which is the only variable readily manipulated in research).

• In Table 4-1 (for a single-factor experiment with 4 levels), we need only look at a small portion of the table. To achieve power of .80 (a reasonable target) with α = .05, we need 17 participants per condition (68 total) for an experiment with a large treatment effect, 44 per condition (176 total) for a medium treatment effect (which is fairly common), and 271 per condition (1084 total) for a small treatment effect. If you really want to shudder, look at the number of participants needed if you were to set α to .01!

• So, in order to achieve sufficient power (.80), you'd need lots of participants when you're dealing with small effect sizes or when you adopt a more stringent α level (.01).

• "Low power is poor science — we waste time, energy, and resources whenever we conduct an experiment that has a low probability of producing a significant result."

• Although you cannot really know the level of power in advance, you can make some assumptions that enable more accurate estimates. One approach is to look at similar research or to conduct pilot studies to provide a rough estimate of the needed parameters. As Keppel notes, "we do not have to estimate the absolute values of the population means, only the expected differences among them."

• "...we should be a bit cautious and underestimate the effect size so that our choice of sample size will be sure to afford reasonable power for our proposed study."

• "...methodologists are beginning to agree that a power of about .80 represents a reasonable and realistic value for research in the behavioral sciences...a 4:1 ratio of Type II to Type I errors is probably appropriate."

• How to use the Pearson-Hartley power charts (pp. 509-518) to determine sample size: First, you need to obtain an estimate of φ (which first requires obtaining φ²). Keppel suggests three approaches to getting a value for φ².

Approach 1, pp. 76-80 (estimating means and population variance):

\phi_A^2 = \frac{n' \sum (\mu_i - \mu_T)^2 / a}{\sigma_{S/A}^2}   (4-4)

Of course, with all the Greek letters, you know that you can't really use this formula as it stands. All you can do is estimate the parameters from sample statistics (or educated guesses). OK, what are the terms of the formula? The planned sample size (n′) and the number of treatment levels (a) should be fairly obvious. Note that you don't really have to know the population grand mean (μ_T) or the particular treatment means (μ_i) — all that you really need to know is the differences among them. If you can make a reasonable guess as to the likely differences among your condition means, then you are well on your way to estimating the power of your study. The last piece of the puzzle is the population variance (σ²_S/A). Once again, although you cannot know the actual value, you might be able to make a reasonable estimate. (A small numerical sketch of this approach follows.)
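Here is a minimal Python sketch of Approach 1 (formula 4-4). The guessed condition means, the error variance of about 150 (roughly the MS_S/A from Keppel p. 57), and the planned n′ = 20 are all hypothetical planning values of my own, included only to show the arithmetic.

```python
# Minimal sketch of Approach 1 (formula 4-4): phi^2 from guessed means and variance.

def phi_sq_from_means(guessed_means, error_variance, n_planned):
    """phi^2 = n' * [sum of squared deviations of the means / a] / sigma^2_{S/A}."""
    a = len(guessed_means)
    grand_mean = sum(guessed_means) / a
    mean_sq_dev = sum((m - grand_mean) ** 2 for m in guessed_means) / a
    return n_planned * mean_sq_dev / error_variance

# Hypothetical planning guesses for a 4-group study: we expect the best and worst
# conditions to differ by about 10 points, and we guess the error variance to be
# about 150 (in the spirit of the Keppel p. 57 data).
print(phi_sq_from_means([20, 24, 26, 30], error_variance=150, n_planned=20))  # ~ 1.73
```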
Approach 2, pp. 82-84 (Cohen's f, a looser estimation of means):

\phi^2 = n' f^2   (4-5)

To get f, you first have to get d, the standardized range of the population means. (OK, I never claimed this was easy!) The formula for d is:

d = \frac{\mu_{max} - \mu_{min}}{\sigma_{S/A}}

Note that to compute d, you only need to estimate how far apart the two most distant group means might be, as well as the population variance/standard deviation. So this approach doesn't require the specificity of the first approach. However, to turn d into f you also need to say roughly how the remaining means are spread between those two extremes. Again, this seems easier than estimating actual differences (or values) among all of your treatments. Cohen suggests three patterns:

Minimum variation (one mean at each extreme, the rest clustered at the center):

f = d \sqrt{\frac{1}{2a}}

Intermediate variation (means spaced evenly between the two extremes):

f = d \sqrt{\frac{a + 1}{12(a - 1)}}

Maximum variation (means piled up at the two extremes):

f = d \, \frac{\sqrt{a^2 - 1}}{2a}

Approach 3, p. 84 (using ω²):

\phi^2 = n' \frac{\omega_A^2}{1 - \omega_A^2}   (4-7)

OK, I'll admit it: this is my personal fave. It's simple if you are willing to assume that your research is likely to be typical of psychological research, that is, that your effect size will be in the neighborhood of .06. With φ² in hand, you can get into the power charts, which will allow you to predict the sample size you need to achieve a particular level of power (like .80). (You can also get to ω² directly if you're willing to estimate the population variance and the variance due to treatment.)

• Using the power charts to estimate the necessary sample size (n) is something of a trial-and-error process. The x-axis is ruled in units of φ, so the first step is to take the square root of φ² (regardless of which of the three approaches you used to get φ²).

• To conserve space, each power chart shows two sets of curves (for α = .01 and α = .05). The x-axis is also ruled separately for the two alpha levels. You need a different set of curves for each number of levels of the IV (df_num). Thus, if you're dealing with an experiment with 4 levels of the IV (as in Keppel p. 57), you'd be on the charts with df_num = 3 (p. 511).

• If you assume that ω² = .06 and you think that you'd like to use a sample size of n = 20, you'd come to the rude awakening that power (1 − β) is only about .40 (φ² = 1.28, φ = 1.13, df_denom = 76). OK, so let's try a larger n, maybe 60. Now you'd find that power is higher than you'd like (why waste time and energy running too many people through your study?): φ² = 3.83, φ = 1.96, df_denom = 236, so 1 − β ≈ .91. In fact, as Table 4-1 suggests, you'd need to use n = 44 to get power of .80 (φ² = 2.81, φ = 1.68, df_denom = 172, so 1 − β = .80). (The sketch after this section automates the same search.)

• If the power charts send you the signal that you'd need to run more participants than you can reasonably expect to get, you have a couple of options. One is to raise your alpha level, but this is frowned upon by most journal editors. Another is to reduce the number of conditions you're planning to run; that way, you'd need fewer participants. Still another is to make your experiment a repeated measures experiment, if that's a reasonable alternative (but we'll have more to say about this design shortly).
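Here is a sketch of the same trial-and-error search done numerically rather than with the Pearson-Hartley charts. It is my own illustration: it uses the noncentral F distribution from scipy together with the standard relation between Keppel's φ and the F distribution's noncentrality parameter (λ = a·φ²). Expect the results to track, but not exactly match, the chart readings quoted above.

```python
# Numerical version of the chart-based search: power for a given omega^2, a, and n.
from scipy.stats import f, ncf

def power_from_omega_sq(omega_sq, a, n, alpha=.05):
    phi_sq = n * omega_sq / (1 - omega_sq)     # Approach 3 (formula 4-7)
    lam = a * phi_sq                           # noncentrality parameter: lambda = a * phi^2
    df_num, df_denom = a - 1, a * (n - 1)
    f_crit = f.ppf(1 - alpha, df_num, df_denom)
    return 1 - ncf.cdf(f_crit, df_num, df_denom, lam)

# The handout's example: a = 4 levels, assumed omega^2 = .06.
for n in (20, 44, 60):
    print(n, round(power_from_omega_sq(.06, a=4, n=n), 2))
# Expect roughly .4 for n = 20, about .80 for n = 44, and about .90 for n = 60,
# in line with (though not identical to) the chart readings above.
```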
4.5 Estimating the Power of an Experiment

• If you're taking Keppel's (and Cohen's) advice, you'll be determining the sample size of your study prior to collecting the first piece of data. However, you can also estimate the power that your study achieved after the fact. That is, you can compute an estimate of ω², and then power, from the actual data of your study. The procedure is essentially the same as the one we've just gone through, but you'd typically carry it out after completing an experiment whose results were not significant.

• Imagine, for instance, that you've completed an experiment with a = 3 and n = 5 and you've obtained F(2, 12) = 3.20. With F_Crit = 3.89, you'd fail to reject H0. Why? Did you have too little power? Let's check:

\hat{\omega}^2 = \frac{(a - 1)(F - 1)}{(a - 1)(F - 1) + a\,n} = \frac{(2)(2.20)}{(2)(2.20) + (3)(5)} = .227

So, even though we would retain H0, the effect size is large. Using our estimate of effect size, we would find that

\hat{\phi}_A^2 = n \, \frac{\hat{\omega}_A^2}{1 - \hat{\omega}_A^2} = (5)\left(\frac{.227}{.773}\right) = 1.47

Thus, we would get an estimate of φ = 1.21, which means that power was about .36 — way too low. In fact, we'd need to increase n to about 12 to achieve power of .80. (A computational version of this check appears at the end of these notes.)

• For the Keppel p. 57 example, you should get an estimate of power of about .88, based on φ² = 4.75.

• Cohen also has tables for computing power, but these tables require that you compute f to get into them. The formula for f² is similar to the formula for φ² based on ω², except that n′ is missing. That is:

f^2 = \frac{\hat{\omega}_A^2}{1 - \hat{\omega}_A^2}   (4-9)

• Various computer programs exist for computing power. You might give one of them a try, although different programs often give different estimates of power (sometimes quite different) depending on the underlying algorithm. One program to check out is GPower, which is available on the Macs in TLC206. You'll note that this program works in terms of f, rather than ω².

4.6 "Proving" the Null Hypothesis

• As we've discussed, you can presume that the null hypothesis is always false, which implies that with a sufficiently powerful experiment, you could always reject the null. Suppose, however, that you're in the uncomfortable position of wanting to show that the null hypothesis is reasonable. What should you do? First of all, you must do a very powerful experiment (see the Greenwald, 1975 article that Keppel cites). Keppel suggests setting the probability of both types of errors to .05 (thus, your power would be .95 instead of .80). As you can readily imagine, the implications of setting power to .95 for sample size are extreme...which is why researchers don't ever really set out to "prove" the null hypothesis.

• Another approach might be to do the research and then show that the sample size needed to achieve reasonable power, given the obtained effect, would be huge. You might then be able to convince a journal editor that a more reasonable approach is to presume that the null is "true" or "true enough."
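To close, here is a short Python sketch of the post hoc power check worked above (a = 3, n = 5, F(2, 12) = 3.20). As in the earlier sketch, it is my own illustration and uses the noncentral F distribution (via λ = a·φ²) in place of the power charts, so the result should land close to, but not exactly on, the ≈ .36 chart value.

```python
# Post hoc power sketch for the a = 3, n = 5, F(2, 12) = 3.20 example.
from scipy.stats import f, ncf

def posthoc_power(f_obt, a, n, alpha=.05):
    omega_sq = ((a - 1) * (f_obt - 1)) / ((a - 1) * (f_obt - 1) + a * n)  # (4-2)
    phi_sq = n * omega_sq / (1 - omega_sq)                                # (4-7), with n for n'
    lam = a * phi_sq                                                      # noncentrality parameter
    df_num, df_denom = a - 1, a * (n - 1)
    f_crit = f.ppf(1 - alpha, df_num, df_denom)
    power = 1 - ncf.cdf(f_crit, df_num, df_denom, lam)
    return omega_sq, power

print(posthoc_power(3.20, a=3, n=5))   # omega^2 ~ .23, power roughly in the mid-.30s
```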