Running title: Using Bayes to get the most out of non-significant results

Using Bayes to get the most out of non-significant results

Zoltan Dienes
School of Psychology and Sackler Centre for Consciousness Science, University of Sussex

Keywords: Bayes factor, confidence interval, highest density region, null hypothesis, power, statistical inference, significance testing

Correspondence:
Zoltan Dienes
School of Psychology, University of Sussex, Brighton, BN1 9QH, UK
Phone: +44 1273 877335, Fax: 01273 678058, Email: [email protected]

Abstract

No scientific conclusion follows automatically from a statistically non-significant result, yet people routinely use non-significant results to guide conclusions about the status of theories (or the effectiveness of practices). To know whether a non-significant result counts against a theory, or if it just indicates data insensitivity, researchers must use one of: power, intervals (such as confidence or credibility intervals), or else an indicator of the relative evidence for one theory over another, such as a Bayes factor. I argue Bayes factors allow theory to be linked to data in a way that overcomes the weaknesses of the other approaches. Specifically, Bayes factors use the data themselves to determine their sensitivity in distinguishing theories (unlike power), and they make use of those aspects of a theory's predictions that are often easiest to specify (unlike power and intervals, which require specifying the minimal interesting value in order to address theory). Bayes factors provide a coherent approach to determining whether non-significant results support a null hypothesis over a theory, or whether the data are just insensitive. They allow accepting and rejecting the null hypothesis to be put on an equal footing. Concrete examples are provided to indicate the range of application of a simple online Bayes calculator, which reveal both the strengths and weaknesses of Bayes factors.

Using Bayes to get the most out of non-significant results

Users of statistics, in disciplines from economics to sociology to biology to psychology, have had a problem. The problem is how to interpret a non-significant result. A non-significant result can mean one of two things: either that there is evidence for the null hypothesis and against a theory that predicted a difference (or relationship); or else that the data are insensitive in distinguishing the theories and nothing follows from the data at all. (Indeed, in the latter case, a non-significant result might even somewhat favour the theory, e.g. Dienes, 2008, p. 128, as will be seen in some of the examples that follow.) That is, the data might count in favour of the null and against a theory; or they might count for nothing much. The problem is that people have been choosing one of those two interpretations without a coherent reason for that choice. Thus, non-significant results have been used to count against theories when they did not (e.g. Cohen, 1990; Rosenthal, 1993; Rouder, Morey, Speckman, & Pratte, 2007); or else have been ignored when they were in fact informative (e.g. believing that an apparent failure to replicate with a non-significant result is more likely to indicate noise produced by sloppy experimenters than a true null hypothesis; cf. Greenwald, 1993; Kruschke, 2013a; Pashler & Harris, 2012).
One can only wonder what harm has been done to fields by not systematically determining which interpretation of a non-significant result actually holds. There are three solutions on the table for evaluating a non-significant result for a single study: 1) power; 2) interval estimates; and 3) Bayes factors (and related approaches). In this article, I will discuss the first two briefly (because readers are likely to be most familiar with them) indicating their uses and limitations; then describe how Bayes factors overcome those limitations (and what weaknesses they in turn have). The bulk of the paper will then provide detailed examples of how to interpret non-significant results using Bayes factors, while otherwise making minimal changes to current statistical practice. My aim is to clarify how to interpret non-significant results coherently, using whichever method is most suitable for the situation, in order to effectively link data to theory. I will be concentrating on a method of using Bayes that involves minor changes in adapting current practice. The changes therefore can be understood by reviewers and editors even as they operate under orthodox norms (see, e.g., Allen, Sumner, & Chambers, 2014) while still solving the fundamental problem of distinguishing insensitive data from evidence for a null hypothesis. 68 69 1. Illustration of the problem. 70 71 72 73 74 75 76 77 78 79 80 81 82 83 Imagine there really is an effect in the population, and the power of an experimental procedure is 0.5 (i.e. only 50 out of 100 tests would be significant when there is an actual, real effect; not that in reality we know what the real effect is, nor, therefore, what the power is for the actual population effect). The experiment is repeated exactly many times. Cumming (2011; see associated website) provides software for simulating such a situation; the use of simulation of course ensures that each simulation is identical to the last bar the vagaries of random sampling. A single run (of Cumming’s “ESCI dance p”) generated the sequence of p-values shown in Figure 1. Cumming (2011) calls such sequences the “dance of the p-values”. Notice how p-values can be very high or very low. For example, one could obtain a p-value of .028 (experiment 20) and in the very next attempted replication get a p of 0.817, when nothing had changed in terms of the population effect. There may be a temptation to think the p of .817 represents very strong evidence for the null; it is “very nonsignificant”. In fact, a p-value per se does not provide evidence for the null, no matter “how non-significant” it is (Fisher, 1935; Royall, 1997). A non-significant p-value does not 3 84 85 86 87 88 distinguish evidence for the null from no evidence at all (as we shall see). That is, one cannot use a high p-value in itself to count against a theory that predicted a difference. A high pvalue may simply reflect data insensitivity, a failure to distinguish the null hypothesis from the alternative because, for example, the standard error is high. Given this, how can we tell the meaning of a non-significant result? 89 90 Figure 1. Dance of the p-values 91 92 93 A sequence of p-values (and confidence intervals) for successive simulations of an experiment where there is an actual effect (mean population difference = 10 units) and power is 0.5. Created from software provided by Cumming (2011; associated website). 94 95 96 4 97 2. Solutions on the table. 98 99 100 101 102 103 104 105 2.1 Power. 
A typical orthodox solution to determining whether data are sensitive is power (Cohen, 1988). Power is the property of a decision rule defined by the long run proportion of times it leads to rejection of the null hypothesis given the null is actually false. That is, by fixing power one controls long run Type II error rates (the rate at which one accepts the null given it is false, i.e. the converse of power). Power can be easily calculated with the free software G*Power (Faul et al., 2009): To calculate a priori power (i.e. in advance of collecting data, which is when it should be calculated), one enters the effect size predicted in advance, the number of participants, and the significance level to be used, and the power is returned.

To calculate power, in entering the effect size, one needs to know the minimal difference (or relationship) below which the theory would be rendered false, irrelevant or uninteresting. If one did not use the minimal interesting value to calculate power, one would not be controlling Type II error rates. If one used an arbitrary effect size (e.g. Cohen's d of 0.5), one would be controlling error rates with respect to an arbitrary theory, and thus in no principled way controlling error rates for the theories one was evaluating. If one used a typical effect size, but not the minimal interesting effect size, Type II error rates would not be controlled.

Power is an extremely useful concept. For example, imagine a review of 100 studies that looked at whether meditation improved depression. Fifty studies found statistically significant results in the right direction and 50 were non-significant. What can be concluded about the effectiveness of meditation in treating depression? One intuition is that each statistically significant result trades off against each statistically non-significant result, and nothing much follows from these data: more research is needed. This intuition is wrong because it ignores power. How many experiments out of 100 would be statistically significant at the 5% level if the null were true? Only five - that's the meaning of a 5% significance level. (There may be a "file drawer problem" in that not all null results get published; but if all significant results were reported, and the null were true, one would expect half to be statistically significant in one direction and half in the other.) On the other hand, if an effect really did exist, and the power was 50%, how many would be statistically significant out of 100? Fifty - that's the meaning of a power of 50%.[1] In fact, the obtained pattern of results categorically licenses the assertion that meditation improves depression (for this fictional data set).

Despite power being a very useful concept, there are two problems with power in practice. First, calculating power requires specifying the minimal interesting value, or at least the minimal interesting value that is plausible. This may be one of the hardest aspects of a theory's predictions to specify. Second, power cannot use the data themselves in order to determine how sensitively those very data distinguish the null from the alternative hypothesis. It might seem strange that properties of the data themselves cannot be used to indicate how sensitive those data are.
But post hoc or observed power, based on the effect size obtained in the data, gives no information about the Type II error rate (Baguley, 2004): Such observed power is determined by the p-value, and so provides no information beyond the p-value itself, and in particular no information about the Type II error rate. (To see this, consider that a sensitive non-significant result would have a mean well below the minimally interesting mean; power calculated on the observed mean, i.e. observed power, would indicate the data insensitive, but it is the minimally interesting mean that should be used to determine power.) Intervals, such as confidence intervals, solve the second problem; that is, they indicate how sensitive the data are, based on the very data themselves. But intervals do not solve the first problem, that of specifying minimally interesting values, as we shall see.

Footnote 1: Typical for psychology is no more than 50%: Cohen (1962); Sedlmeier and Gigerenzer (1989); see Button et al. (2013) for a considerably lower recent estimate in some disciplines.

2.2 Interval estimates. Interval estimates include confidence intervals, and their likelihood and Bayesian equivalents, namely, likelihood intervals and credibility intervals (also called highest density regions) (see Dienes, 2008, for a comparison). A confidence interval is the set of possible population values consistent with the data; all other population values may be rejected (see e.g. Cumming, 2011; Smithson, 2003). The APA recommends reporting confidence intervals whenever possible (APA, 2010). However, why should one report a confidence interval, unless it changes what conclusions might be drawn? Many statistics textbooks teach how to calculate confidence intervals but few teach how to use them to draw inferences about theory (for an exception see Lockhart, 1998). Figure 2 shows the four principles of inference by intervals (adapted from Berry, Carlin, Lee, & Muller, 2011; Freedman & Spiegelhalter, 1983; Kruschke, 2010a; Rogers, Howard, & Vessey, 1993; Serlin & Lapsley, 1985). A non-significant result means that the confidence interval (of the difference, correlation, etc.) contains the null value (here assumed to be zero). But if the confidence interval includes zero, it also includes some values either side of zero; so the data are always consistent with some population effect. Thus, in order to accept a null hypothesis, the null hypothesis must specify a null region, not a point value. Specifying the null region means specifying a minimally interesting value (just as for power), which forms either limit of the region. Then, if the interval is contained within the null region, one can accept the null hypothesis: The only allowable population values are null values (rule i) (cf. equivalence testing, Rogers et al., 1993). If the interval does not cover the null region at all, the null hypothesis can be rejected: The only allowable population values are non-null (rule ii). Conversely, if the interval covers both null and non-null regions, the data are insensitive: the data allow both null and non-null values (rule iv; see e.g. Kiyokawa, Dienes, Tanaka, Yamada, & Crowe, 2012, for an application of this rule).
Figure 2. The Four Principles of Inference by Intervals.

i) If the interval is completely contained in the null region, decide that the population value lies in the null region (accept the null region hypothesis);

ii) If the interval is completely outside the null region, decide that the population value lies outside the null region (reject the null region hypothesis);

iii) If the upper limit of the interval is below the minimal interesting value, decide against a theory postulating a positive difference (reject a directional theory);

iv) If the interval includes both null region and theoretically interesting values, the data are insensitive (suspend judgment).

In Figure 1, 95% confidence intervals are plotted. In all cases where the outcome was non-significant, the interval included not just zero but also the true population effect (10). Given that 10 is agreed to be an interesting effect size, the confidence intervals show all the non-significant outcomes to be insensitive. In all those cases the data should not be used to count against the theory of a difference (rule iv). The conclusion follows without determining a scientifically-relevant minimally interesting value. But a confidence interval can only be used to assert the null hypothesis when a null region can be specified (rule i) (cf. Cohen, 1988; Greenwald, 1975; Hodges & Lehmann, 1954; Serlin & Lapsley, 1985). For that assertion to be relevant to a given scientific context, the minimal value must be relevant to that scientific context (i.e. it cannot be determined by properties of the data alone nor can it be a generic default). For example, in clinical studies of depression it has been conventionally decided that a change of 3 units on the Hamilton scale is the minimal meaningful amount (e.g. Kirsch, 2010).[2] Rule ii follows logically from rule i. Yet rule ii is stricter than current practice, because it requires that the whole null region lie outside the interval, not just zero.[3]

While intervals can provide a systematic basis of inference which would advance current typical practice, they have a major problem in providing guidance in interpreting non-significant results. The problem is that asserting the null (and thus rejecting the null with a procedure that could have asserted it) requires in general specifying the minimally interesting value: What facts in the scientific domain indicate a certain value as a reasonable minimum? That may be the hardest part of a theory's predictions to specify.[4] If you find it hard to specify the minimal value, you have just run out of options for interpreting a non-significant result as far as orthodoxy is concerned.

Footnote 2: Rule (i) is especially relevant for testing the assumptions of a statistical model. Take the assumption of normality in statistical tests. The crucial question is not whether the departure from normality is non-significant, but rather whether whatever departure exists is within such bounds that it does not threaten the integrity of the inferential test used. For example, is the skewness of the data within such bounds that when the actual alpha rate (significance) is 5% the estimated rate is between 4.5% and 5.5%? Or is the skewness of the data within such bounds that a Bayes factor calculated assuming normality lies between 2.9 and 3.1 if the Bayes factor calculated on the true model is 3? See Simonsohn, Nelson, and Simmons (forthcoming) for an example of interval logic in model testing. Using intervals inferentially would be a simple yet considerable advance on current practice in checking assumptions and model fit.

Footnote 3: If inferences are drawn using Bayesian intervals (e.g. Kruschke, 2010a), then it is rules i and ii that must be followed. The rule of rejecting the null when a point null lies outside the credibility interval has no basis in Bayesian inference (a point value always has zero probability on a standard credibility interval); correspondingly, such a rule is sensitive to factors that are inferentially irrelevant, such as the stopping rule and intentions of the experimenter (Dienes, 2008, 2011; Mayo, 1996). Rules (i) and (ii) are not sensitive to the stopping rule (given that the interval width is not much more than that of the null region) (cf. Kruschke, 2013b). Indeed, they can be used by orthodox statisticians, who can thereby gain many of the benefits of Bayes (though see Lindley, 1972, and Kruschke, 2010b, for the advantages of going fully Bayesian when it comes to using intervals, and Dienes, 2008, for a discussion of the conceptual commitments made by using the different sorts of intervals).

Footnote 4: Figure 2 represents the decisions as black and white, in order to be consistent with orthodox statistics. For Bayesians, inferences can be graded. Kruschke (2013c) recommends specifying the degree to which the Bayesian credibility interval is contained within null regions of different widths, so people with different null regions can make their own decisions. Nonetheless, pragmatically a conclusion needs to be reached, and so a good reason still needs to be specified for accepting any one of those regions.

2.3 Bayes factors. Bayes factors (B) indicate the relative strength of evidence for two theories (e.g. Berger & Delampady, 1987; Dienes, 2011; Gallistel, 2009; Goodman, 1999;
Kass & Wasserman, 1996; Kruschke, 2011; Lee & Wagenmakers, 2005; Rouder et al., 2009).[5] The Bayes factor B comparing an alternative hypothesis to the null hypothesis means that the data are B times more likely under the alternative than under the null. B varies between 0 and ∞, where 1 indicates that the data do not favour either theory more than the other; values greater than 1 indicate increasing evidence for one theory over the other (e.g. the alternative over a null hypothesis) and values less than 1 the converse (e.g. increasing evidence for the null over the alternative hypothesis). Thus, Bayes factors allow three different types of conclusions: There is strong evidence for the alternative (B much greater than 1); there is strong evidence for the null (B close to 0); and the evidence is insensitive (B close to 1). This is already much more than p-values could give us.

In order to draw these conclusions we need to know how far from 1 counts as strong or substantial evidence, and how close to 1 counts as the data being insensitive. Jeffreys (1939/1961) suggested conventional cut-offs: A Bayes factor greater than 3 or else less than 1/3 represents substantial evidence; conversely, anything between 1/3 and 3 is only weak or "anecdotal" evidence.[6] Are these just numbers pulled out of the air? In the examples considered below, when the obtained effect size is roughly that expected, a p of .05 roughly corresponds to a B of 3.
This is not a necessary conclusion, and nothing guarantees it; but it may be no coincidence that Jeffreys developed his ideas in Cambridge just as Fisher (1935) was developing his methods. That is, Jeffreys' choice of 3 is fortunate in that it roughly corresponds to the standards of evidence that scientists have been used to using in rejecting the null. By symmetry, we automatically get a standard for assessing substantial evidence for the null: If 3 is substantial evidence for the alternative, 1/3 is substantial evidence for the null. (Bear in mind that there is no one-to-one correspondence of B with p-values; it depends on both the obtained effect size and also how precise or vague the predictions of the alternative are: With a sufficiently vague alternative a B of 3 may match a p-value of, for example, .01, or lower; Wetzels et al., 2011.)

The Bayes factor is based on the principle that evidence supports the theory that most strongly predicted it. That is, in order to know how much evidence supports a theory, we need to know what the theory predicts. For the null hypothesis this is easy. Traditionally, the null hypothesis predicts that a single population value (typically 0) is plausible and all other population values are ruled out. But what does our theory, providing the alternative hypothesis, predict? Specifying the predictions of the theory is the difficult part of calculating a Bayes factor. The examples below, drawn partly from published studies, will show some straightforward ways theoretical predictions can be specified in a way suitable for Bayes factors. The goal is to determine a plot of the plausibility of different population values given the theory in its scientific context. Indeed, whatever one's views on Bayes, thinking about justifications for minimum, typical or maximum values of an effect size in a scientific context is a useful exercise. We have to consider at least some aspect of such a plot to evaluate non-significant results; the minimum plausible value has to be specified for using power or confidence (credibility or likelihood) intervals in order to evaluate theories. That is, we should have been specifying theoretically-relevant effect sizes anyway. We could get away without doing so, because specified effect sizes are not needed for calculating p-values; but the price has been incoherence in interpreting non-significant (as well as significant) results. It might be objected that having to specify the minimum was one of the disadvantages of inference by intervals; but as we will see in concrete examples, Bayes is more flexible about what is sufficient to be specified (e.g. a rough expected value, or a maximum), and different scientific problems end up naturally specifying predictions in different ways.

Footnote 5: Bayes factors were independently discovered at about the same time by Harold Jeffreys (1939/1961) and Alan Turing, the latter in order to help decode German messages during World War II (McGrayne, 2012).

Footnote 6: Jeffreys (1961) also recommended labelling Bayes factors greater than 10 or less than 1/10 as 'strong' evidence. However, the terms 'substantial' and 'strong' have little to choose between them. Thus Lee and Wagenmakers (2014) recommend calling Bayes factors greater than 3 or less than 1/3 'moderate' and those greater than 10 or less than 1/10 'strong'.
255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 For the free online Bayes factor calculator associated with Dienes (2008; http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_factor.swf), there are three distributions that can be used to represent the predictions of the theory: Uniform, normal or half-normal (see Figure 3). The properties of each distribution will be briefly described in turn, before considering concrete examples of Bayes factors in practice in the next section. Note these distributions plot the plausibility of different values of the population parameter being tested, for example the plausibility of population means or mean differences. The distributions in Figure 3 in no way reflect the distribution of individual values in the population. The Dienes (2008) Bayes calculator assumes that the sampling distribution of the parameter estimate is normally distributed (just as a t-test does), and this obtains when, for example, the individual observations in the population are normally distributed. Thus, there may be a substantial proportion of people with a score below 0 in the population, and we could correctly believe this to be true while also assigning no plausibility to the population mean being below 0 (e.g. using a uniform from 0 to 10 in Figure 3). That is, using a uniform or half-normal distribution to represent the alternative hypothesis is entirely consistent with the data themselves being normally distributed. The distributions in Figure 3 are about the plausibility of different population parameter values (e.g. population means of a condition or population mean differences or associations etc.). 273 10 274 Figure 3 Representing the Alternative Hypothesis 275 276 277 278 279 (a) A uniform distribution with all population parameter values from the lower to the upper limit equally plausible. Here the lower limit is zero, a typical but not required value. 280 281 (b) A normal distribution, with population parameter values close to the mean being more plausible than others. The SD also needs to be specified; a default of mean/2 is often useful. 282 283 (c) A half-normal distribution. Values close to 0 are most plausible; a useful default for the SD is a typical estimated effect size. Population values less than 0 are ruled out. 284 285 11 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 First, we consider the uniform. A researcher tests whether imagining a sporting move (e.g. a golf swing) for 15 minutes a day for a week improves measured performance on the move. She performs a t-test and obtains a non-significant result. Does that indicate that imagination in this context is not useful for improving the skill? That depends on what size effect is predicted. Just knowing that the effect is non-significant is in itself meaningless. Presumably, whatever size the population effect is, the imagination manipulation is unlikely to produce an effect larger than that produced by actually practicing the move for 15 minutes a day for a week. If the researcher trained people with real performance to estimate that effect, it could be used as the upper limit of a uniform in predicting performance for imagination. 
If a clear minimum level of performance can be specified in a way simple to justify, that could be used as the lower limit of the uniform; otherwise, the lower limit can be set to 0 (in practice, the results are often indistinguishable whether 0 or a realistic minimum is used, though in every particular case this needs to be checked; cf Shang, Fu, Dienes, Shao, & Fu, 2013) . In general, when another condition or constraint within the measurement itself determines an upper limit, the uniform is useful, with a default of 0 as the lower limit. An example of a constraint produced by the measurement itself is a scale with an upper limit; for a 0-7 liking scale, the difference between conditions cannot be more than 7. (The latter constraint is quite vague and we can usually do better; for example, by constraints provided by other data. See Dienes, forthcoming-a, for examples of useful constraints created by types of scales.) 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 Next, we consider the normal distribution. Sometimes what is easily available is not another condition defining an upper limit, but data indicating a roughly likely value for an effect, should it exist. For example, it may be known for an implicit learning paradigm that asking people to search for rules rather than passively memorizing training material reduces classification accuracy in a later test phase by about 5%. A researcher may wish to change the manipulation by having people search for rules or not, not during the training phase as before, but between training and testing (cf. Mealor & Dienes, 2012). In the new experiment we may think 5% a reasonable estimate of the effect size that would be obtained (especially as we think similar mechanisms would be involved in producing the two effects). Further, if there is an effect in the post-training manipulation, we may have no strong reason to think it would be more than or less than the effect in the training manipulation. A natural way of representing this prediction of the theory is a normal distribution with a mean of 5%. What should the standard deviation of the normal distribution be? The theory predicts an effect in a certain direction; namely searching for rules will be harmful for performance. Thus, we require the plausibility of negative differences to be negligible. The height of a normal distribution comes pretty close to zero by about two standard deviations out. Thus, we could set two standard deviations to be equal to the mean; that is, SD = mean/2. This representation uses another data set to provide a most likely value - but otherwise keeps predictions as vague as possible (any larger SD would involve non-negligible predictions in the wrong direction; any smaller SD would indicate more precision in our prediction). Using SD = mean/2 is a useful default to consider in the absence of other considerations; we will consider exceptions below. With this default, the effect is predicted to lie between 0 and twice the estimated likely effect size. Thus, we are in no way committed to the likely effect size; it just sets a ball park estimate, with the true effect being possibly close to the null value or indeed twice the likely value. 330 331 332 333 334 When there is a likely predicted effect size, as in the preceding example, there is another way we can proceed. 
We can use the half-normal distribution, with a mode at 0, and scale the half-normal distribution’s rate of drop by setting the SD equal to the likely predicted value. So, to continue with the preceding example, the SD would be set to 5%. The halfnormal distribution indicates that smaller effect sizes are more likely than larger ones, and 12 335 336 337 338 339 340 341 342 343 whatever the true population effect is, it plausibly lies somewhere between 0 and 10%. Why should one use the half-normal rather than the normal distribution? In fact, Bayes factors calculated either way are typically very similar; but the half-normal distribution considers values close to the null most plausible, and this often makes it hard to distinguish the alternative from the null. Thus, if the Bayes factor does clearly distinguish the theories with a half-normal distribution, we have achieved clear conclusions despite our assumptions. The conclusion is thereby strengthened. For this reason, a useful default for when one has a likely expected value is to use the half-normal distribution. (We will consider examples below to illustrate when a normal distribution is the most natural representation of a particular theory.) 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 To summarize, to know how much evidence supports a theory we need to know what the theory predicts. We have considered three possible ways of representing the predictions of the alternative, namely, the uniform, normal and half-normal distributions. Representing the predictions of the alternative hypothesis is the part of performing Bayes that requires most thought. This thinking about the predictions of scientific theory can apparently be avoided by conventionally deciding on a default representation of the predictions of the alternative to use in all or most cases (e.g. Box & Tiao, 1973; Jeffreys, 1961; Liang, Paulo, Molina et al, 2008; Rouder et al., 2009; Rouder & Morey, 2011; Rouder, Morey, Speckman, & Province, 2012; Wetzels, Grasman, & Wagenmakers, 2012; Wetzels & Wagenmakers, 2012). But as there is no default theory in science, a researcher still has the responsibility of determining whether the default representation used in any Bayes factor reasonably matches the predictions of their theory - and if it does not, that particular Bayes factor is not relevant for assessing the theory; one or more others will be. (See Vanpaemel & Lee, 2012, for the corresponding argument in computational modelling, that distributions over parameters are also best informed by scientific knowledge rather than general defaults.) Thus, Rouder et al. (2009, p 232) recommend using content-specific constraints on specifying the alternative in the Bayes factor where this is possible. Vanpaemel (2010) also urges representing the alternative in a way as sensitive to theory as possible. Fortunately, more than a dozen papers have been published over the last couple of years illustrating the use of simple objectively-specified content-specific (non-default) constraints for Bayes factors in their particular scientific domains (e.g. 
Armstrong & Dienes, 2013; Buhle, Stevens, Friedman, & Wager, 2012; Dienes, Baddeley, & Jansari, 2012; Dienes & Hutton, 2013; Gallistel; 2009; Fu, Bin, Dienes, Fu, & Gao, 2013; Guo et al., 2013; Guo, Li, Yang, & Dienes, 2013; Mealor & Dienes, 2013a,b; Morey & Rouder, 2011; Newell & Rakow, 2011; Parris & Dienes, 2013; SemmensWheeler, Dienes, & Duka, 2013; Shang et al., 2013; Terhune, Wudarczyk, Kochuparampil, & Kadosh, 2013). Some principles are illustrated below. The three possible ways of representing the alternative hypothesis considered here turn out to be sufficient for representing theoretical intuitions in many scientific contexts. While the alternative requires some thought, a default is used for the null in the Dienes (2008) calculator; it assumes the null hypothesis is that the true population value is exactly zero. We will consider in Appendix 1 how to change this according to scientific context as well (cf. Morey & Rouder, 2011, who also allow for flexible non-default nulls). 376 377 378 379 380 381 382 383 To know how much evidence supports a theory, we need to know what the evidence is. For a situation where a t-test or a z-test can be performed, the relevant summary of the data is the parameter estimate and its standard error: For example, the sample mean difference and the standard error of the difference. This is the same as for a t-test (or z-test); a t value just is the parameter estimate divided by its standard error. So if one knows the sample mean difference, the relevant standard error can be obtained by: mean difference/t, where t can be obtained through SPSS, R etc. This formula works for a between-participants, within-participants, mixed or one sample t-test. Thus, any researcher reading a paper 13 384 385 386 involving a t-test (or any F with one degree of freedom; the corresponding t is the square root of the F), or performing their own t-test (or F with one degree of freedom) can readily obtain the two numbers summarizing the data needed for the Bayes factor. 387 388 389 390 391 392 393 The Dienes (2008) calculator asks for the mean (i.e. parameter estimate in general, including mean difference) and the standard error of this estimate. It also asks if the p(population value|theory) (i.e. the plausibility of different population values given the theory) is uniform: If the answer is ‘yes’, boxes appear to enter the lower and upper bounds; if the answer is ‘no’, boxes appear to set the parameters of the normal distribution. In the latter case, to answer the question “Is the distribution one-tailed or two-tailed?” enter “1” for a halfnormal and “2” for a normal7. 394 395 396 397 398 399 400 We can now use the simulations in Figure 1 to illustrate some features of Bayes factors. Table 1 shows the sequence of p-values obtained in the successive replications of the same experiment with a true population raw effect size of 10 units. Let us assume, as we did for interpreting the confidence intervals, that a value of 10 units is regarded as the sort of effect size that would hold if the theory were true. Thus, we can represent the alternative as a half-normal distribution with an SD of 10. Table 1 shows both the ‘dance of the p-values’ and the more graceful ‘tai chi of the Bayes factors’. 401 402 7 Note this is not the same as specifying a 1-tailed or 2-tailed test in orthodox statistics. 
In orthodox statistics, conducting a 1- or 2-tailed test is a matter of whether one would treat a result as significant if the difference had been extreme in just one direction (1-tailed test) or rather in either direction (2-tailed test). Such counterfactuals do not apply to Bayes, and the criticisms of 1-tailed tests in an orthodox sense (criticizing a researcher for using a 1-tailed test because he would have rejected the null if the results had been extreme in the other direction, even though they were not) hence do not apply to Bayes factors (cf. Dienes, 2011; Royall, 1997). Further, as shown in Figure 3, a two-tailed normal distribution can be used to make a directional prediction. So 1- versus 2-tailed for the Bayes factor just refers to the shape of the distribution used to represent the predictions of the alternative.

Tables 1(a) and (b). Bayes Factors Corresponding to the p-values Shown in Figure 1.

The B's are sorted into three categories: support for the null (B less than 1/3), support for neither hypothesis, i.e. data insensitivity (B between 1/3 and 3), and support for the alternative (B greater than 3). Significant p-values are asterisked.

Table 1(a)
p: .081, .034*, .74, .034*, .09, .817, .028*, .001*, .056, .031*, .279, .024*, .083
B, giving support for:
  Null (B < 1/3): none
  Neither (1/3 < B < 3): 2.96, 0.52, 2.70, 0.46, 1.73, 2.96
  Alternative (B > 3): 4.88, 4.88, 4.40, 1024.6, 3.33, 4.88, 4.28

Table 1(b)
p: .002*, .167, .172, .387, .614, .476, .006*, .028*, .002*, .024*, .144, .23
B, giving support for:
  Null (B < 1/3): none
  Neither (1/3 < B < 3): 2.16, 2.12, 1.01, 0.65, 0.75, 2.36, 1.73
  Alternative (B > 3): 28.00, 4.28, 49.86, 5.60, 49.86

In interpreting Table 1, remember that a B above 3 indicates substantial support for the alternative and a B less than 0.33 indicates substantial support for the null. In Table 1, significant results are associated with B's above 3 (because, as it happens, the effect sizes in those cases are around the values we regarded as typical, and not close to 0; in fact, for a fixed p = .05, B will indicate increasing support for the null as N increases and thus the sample mean shrinks; Lindley, 1957). Note also that Bayes factors can be quite large (e.g. 1024); they do not scale according to our intuitions trained on t-values.[8]

Footnote 8: To have them scale in a similar way to a t-value, one could take the log of the Bayes factor to base 3. A log3 B above 1 indicates substantial evidence for the alternative, below -1 substantial evidence for the null, and 0 indicates no evidence either way. The value of 1024 in Table 1 would become 6.31, which maybe seems more 'reasonable'. Using logs to base square root of 3 would mean 2 and -2 are the cut-offs for substantial evidence, similar to a t-test. Compare Edwards (1992), who recommended natural logs for Bayes factors comparing simple hypotheses. Nonetheless, I do not use log B in this paper. Unlike t, B has an intuitive meaning: The data are B times more likely under H1 than under H0.

Crucially, in Table 1, non-significant results correspond to Bayes factors indicating insensitivity, that is, between 3 and 1/3. (It is in no way guaranteed of course that Bayes factors won't sometimes indicate evidence for the null when the null is false. But this example illustrates how such misleading Bayes factors would be fewer than non-significant p-values when the null is false. For analytically-determined general properties of error rates with Bayes factors contrasting simple hypotheses see Royall, 1997.) A researcher obtaining any of the non-significant results in Table 1 would, following a Bayesian analysis, conclude
that the data were simply insensitive and more participants should be run. The same conclusions follow from using confidence intervals, as shown in Figure 1, using the rules of inference in Figure 2. So, if Bayes factors often produce answers consistent with inference by intervals, what does a Bayes factor buy us? It will allow us to assert the null without knowing a minimal interesting effect size, as we now explore. We will see that a non-significant result sometimes just indicates insensitivity, but it is sometimes support for the null.

3.0 Examples using the different ways of representing the predictions of the theory

We will consider the uniform, normal and half-normal in turn. The sequence of examples serves as a tutorial in various aspects of calculating Bayes factors and so is best read in sequence.

3.1 Bayes with a Uniform Distribution.

1. A manipulation is predicted to decrease an effect. Dienes, Baddeley, and Jansari (2012) predicted that in a certain context a negative mood would reduce a certain type of learning. To simplify so as to draw out certain key points, the example will depart from the actual paradigm and results of Dienes et al. Subjects perform a two-alternative forced choice measure of learning, where 50% is chance. If performance in the neutral condition is 70%, then, if the theory is correct, performance in the negative mood condition will be somewhere between 50% and 70%. That is, the effect of mood would be somewhere between 0 and 20%. Thus, the predictions of a theory of a mood effect could be represented as a uniform from 0 to 20 (in fact, there is uncertainty in this estimate which could be represented, but we will discuss that issue in the subsequent example).

Performance in the negative mood condition is (say) 65% and non-significantly different from that in the neutral condition, t(50) = 0.5, p = .62. So the mean difference is 70% – 65% = 5%. The standard error = (mean difference)/t = 10%. Entering these values in the calculator (mean = 5, standard error = 10, uniform from 0 to 20) gives B = 0.89. That is, the data are insensitive and do not count against the theory that predicted negative mood would impair performance (indeed, Dienes et al. obtained a non-significant result which a Bayes factor indicated was insensitive).

If the performance in the negative mood condition had been 70%, the mean difference between mood conditions would be 0. The Bayes factor is then 0.60, still indicating data insensitivity. That is, obtaining identical sample means in two conditions in no way guarantees that there is compelling evidence for the null hypothesis. If the performance in the negative mood condition had been 75%, the mean difference between mood conditions would be 5% in the wrong direction. This is entered into the calculator as -5 (i.e. as a negative number: The calculator assumes positive means are in the direction predicted by the theory if the theory is directional). B is then 0.43, still insensitive. That is, having the means go in the wrong direction does not in itself indicate compelling evidence against the theory (even coupled with a "very non-significant" p of .62).
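For readers who prefer to script such calculations, the following is a minimal sketch in R of a calculator of the kind described above. It assumes, as the online calculator does, a normal likelihood for the sample estimate (mean = obtained effect, SD = its standard error) and a point null at zero; the function and argument names are my own, and the official Dienes (2008) calculator may differ in implementation details.

```r
# Sketch of a Bayes factor by numerical integration: B = P(data | H1) / P(data | H0).
# The alternative is represented by a uniform, a normal, or a half-normal distribution
# over the population effect, as in Figure 3.
bf <- function(mean_obs, se_obs,
               lower = NULL, upper = NULL,   # set these for a uniform alternative
               mode_alt = 0, sd_alt = NULL,  # set these for a (half-)normal alternative
               tails = 2) {                  # 1 = half-normal, 2 = normal
  likelihood <- function(theta) dnorm(mean_obs, mean = theta, sd = se_obs)
  if (!is.null(lower)) {
    prior <- function(theta) dunif(theta, lower, upper)
    from <- lower; to <- upper
  } else if (tails == 1) {
    prior <- function(theta) 2 * dnorm(theta, 0, sd_alt) * (theta >= 0)  # mode at 0
    from <- 0; to <- 10 * sd_alt
  } else {
    prior <- function(theta) dnorm(theta, mode_alt, sd_alt)
    from <- mode_alt - 10 * sd_alt; to <- mode_alt + 10 * sd_alt
  }
  p_d_h1 <- integrate(function(theta) prior(theta) * likelihood(theta), from, to)$value
  p_d_h0 <- likelihood(0)  # point null at zero
  p_d_h1 / p_d_h0
}

# The mood example above (uniform from 0 to 20):
bf( 5, 10, lower = 0, upper = 20)   # approximately 0.89
bf( 0, 10, lower = 0, upper = 20)   # approximately 0.60
bf(-5, 10, lower = 0, upper = 20)   # approximately 0.43
```

Under these assumptions the sketch reproduces the values quoted in the text to within rounding.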
If the standard error were smaller, say 5 instead of 10, then a difference of -5 gives a B of 0.16, which is strong evidence for the null and against the theory. That is, as the standard error shrinks, the data become more sensitive, as would be expected.

The relation between standard error and sensitivity allows an interesting contrast between B and p-values. P-values do not vary monotonically with evidence for the null: A larger p-value may correspond to less evidence for the null. Consider a difference of 1 and a standard error of 10. This gives a t of 0.1, p = .92. If the difference were the same but the standard error were much smaller, e.g. 1, t would be larger, namely 1, and p = .32. The p-value is smaller because the standard error is smaller. To calculate B, assume a uniform from 0 to 20, as before. In the first case, where p = .92, B is 0.64, and in the second case, where p = .32, B = 0.17. B is more sensitive (and more strongly supports the null) in the second case precisely because the standard error is lower in the second case. Thus, a high p-value may just indicate that the standard error is large and the data insensitive. It is a fallacy to say the data are convincing evidence for the null just because the p-value is very high. The high p-value may be indicating just the opposite.

2. Testing additivity versus interaction. Buhle et al. (2012) wished to test whether placebo pain relief works through the same mechanism as distraction-based pain relief. They argued that if the mechanisms were separate, the effect of placebo should be identical in a distraction versus a control condition; if the mechanism were the same, then placebo would have less effect in a distraction condition. Estimating from their Figure 2, the effect of placebo versus no placebo in the no-distraction condition was 0.4 pain units (i.e. placebo reduced pain by 0.4 units). The effect of placebo in the distraction condition was 0.44 units (estimated). The placebo × distraction interaction raw effect (the difference between the two placebo effects) is therefore 0.4 – 0.44 = -0.04 units (note that it is in the wrong direction for the theory that distraction would reduce the placebo effect). The interaction was non-significant, F(1, 31) = 0.109, p = .746. But in itself the non-significant result does not indicate evidence for additivity; perhaps the data were just insensitive. How can a Bayes factor be calculated?

The predicted size of the interaction needs to be specified. Following Gallistel (2009), Buhle et al. reasoned that the maximum size of the interaction effect would occur if distraction completely removed the placebo effect (i.e. placebo effect with distraction = 0). Then, the interaction would be the same size as the placebo effect in the no-distraction condition (that is, interaction = placebo effect with no distraction – placebo effect with distraction = 0.4 – 0 = placebo effect with no distraction = 0.4). The smallest the population interaction could be (on the theory that distraction reduces the placebo effect) is if the placebo effect were very nearly the same in both conditions, that is, an interaction effect as close to 0 as we like. So we can represent the plausible sizes of the interaction as a uniform from 0 to the placebo effect in the no-distraction condition, that is, from 0 to 0.4.
The F of 0.109 corresponds to a t-value of √0.109 = 0.33. Thus the standard error of the interaction is (raw interaction effect)/t = 0.04/0.33 = 0.12. With these values (mean = -0.04, standard error = 0.12, uniform from 0 to 0.4), B is 0.29, that is, substantial evidence for additivity. This conclusion depends on assuming that the maximum the interaction could be was the study's estimate of the placebo effect with no distraction. But this is just an estimate. We could represent our uncertainty in that estimate – and that would always push the upper limit upwards. The higher the upper limit of a uniform, the vaguer the theory is. The vaguer the theory is, the more Bayes punishes the theory, and thus the easier it is to get evidence for the null. As we already have evidence for the null, representing uncertainty will not change the qualitative conclusion, only make it stronger. The next example will consider the rhetorical status of this issue in the converse case, when the Bayes factor is insensitive.

In sum, no more data need to be collected; the data are already sensitive enough to make the point. The non-significant result was rightly published, given that it provided as substantial evidence for the null as a significant result might have provided against it.

3. Interactions that can go in both directions. In the above example, the raw interaction effect was predicted by theory to go in only one direction, that is, to vary from 0 up to some positive maximum. We now consider a similar example, and then generalize to a theory that allows positive and negative interactions. For the latter sort of interaction it may be more difficult to specify limits in a simple way, but in this example we can.

Watson, Gordon, Stermac, Kalogerakos, and Steckley (2003) compared the effectiveness of Process Experiential Therapy (PET; i.e. Emotion Focussed Therapy) with Cognitive Behavioural Therapy (CBT) in treating depression. There were a variety of measures, but for illustration we will look at just the Beck Depression Inventory (BDI) on the entire intent-to-treat sample (n = 93). CBT reduced the BDI from pre- to post-test (from 25.09 to 12.56), a raw simple effect of time of 12.53. Similarly, PET reduced the BDI from 24.50 to 13.05, a raw simple effect of 11.45. The F for the group (CBT vs PET) × time (pre vs post) interaction was 0.18, non-significant. In itself, the non-significant interaction does not mean the treatments are equivalent in effectiveness. To know what it does mean, we need to get a handle on what size interaction effect we would expect.

One theory is based on assuming we know that CBT does work and is bound to be better than any other therapy for depression. Then, the population interaction effect (effect of time for CBT – effect of time for PET) will vary between near 0 (when the effects are near equal) and the effect of time for CBT (when the other therapy has no effect). Thus, assuming 12.53 to be a sensitive estimate of the population effect of CBT, we can use a uniform for the alternative from 0 to 13 (rounding up). The sample raw interaction effect is 12.53 – 11.45 = 1.08. The F for the interaction (0.18) corresponds to a t of √0.18 = 0.42. Thus the standard error of the raw interaction effect is 1.08/0.42 = 2.55. With these values for the calculator (mean = 1.08, SE = 2.55, uniform from 0 to 13), B = 0.36.
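Both interaction analyses can be checked with the bf() sketch introduced after example 1 (the same caveats apply; the values match those reported in the text to within rounding):

```r
# Placebo x distraction interaction (Buhle et al.): mean = -0.04, SE = 0.12, uniform 0 to 0.4
bf(-0.04, 0.12, lower = 0, upper = 0.4)   # approximately 0.29
# CBT vs PET x time interaction (Watson et al.): mean = 1.08, SE = 2.55, uniform 0 to 13
bf(1.08, 2.55, lower = 0, upper = 13)     # approximately 0.36
```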
The B just falls short of a conventional criterion for substantial evidence for equivalence. The evidence is still reasonable, in that with Bayes sharp conventional cut-offs have no absolute meaning (unlike in orthodoxy); evidence is continuous.[9] Further, we assumed we had estimated the population effect of CBT precisely. We could represent some uncertainty in that estimate, e.g. use an upper limit of a credibility or confidence interval of the effect of CBT as the upper limit of our uniform. But we might decide, as a defeasible default, not to take into account uncertainty in estimates of upper limits of uniforms when we have non-significant results. In this way, more non-significant studies come out as inconclusive; that is, we have a tough standard of evidence for the null.

It is also worth considering another alternative hypothesis, namely one which asserts that CBT might be better than PET – or vice versa. We can represent the predictions of this hypothesis as a uniform from -11 to +13. (The same logic is used for the lower limit; i.e. if PET is superior, the most negative the interaction could be is when the effect of CBT is negligible compared to PET, so interaction = -(effect of time for PET).) Then B = 0.29. That is, without a pre-existing bias for CBT, there is substantial evidence for equivalence in the context of this particular study (and it would get even stronger if we adjusted the limits of the uniform to take into account uncertainty in their estimation). Even with a bias for thinking CBT is at least the best, there is modest evidence for equivalence.

Footnote 9: Indeed, the absence of sharp cut-offs applies to Bayesian intervals as much as to Bayes factors: If a 90% Bayesian interval were within the null region, there would be grounds for asserting the null region hypothesis, albeit there would be stronger grounds if a 95% interval were in the null region.

4. Interaction with degrees of freedom more than 1 and simple effects. Raz and colleagues have demonstrated a remarkable effect of hypnotic suggestion in a series of studies: Suggesting that the subject cannot read words, that the stimuli will appear as a meaningless script, substantially reduces the Stroop effect (e.g. Raz, Fan, & Posner, 2005). Table 2 presents imaginary but representative data.

Table 2. (Imaginary) Means for the Effectiveness of a Hypnotic Suggestion to Reduce the Stroop Effect.

              No Suggestion   Suggestion
Incongruent   850             785
Neutral       750             745
Congruent     720             715

The crucial test of the hypothesis that the suggestion will reduce the Stroop effect is the Suggestion (present vs absent) × Word Type (incongruent vs neutral vs congruent) interaction. This is a 2-degree-of-freedom effect. However, it naturally decomposes into two 1-degree-of-freedom effects. The Stroop effect can be thought of as having two components with possibly different underlying mechanisms: An interference effect (incongruent – neutral) and a facilitation effect (neutral – congruent). Thus the interaction decomposes into the effect of suggestion on interference and the effect of suggestion on facilitation. In general, multi-degree-of-freedom effects often decompose into one-degree-of-freedom contrasts addressing specific theoretical questions. Let's consider the effect of suggestion on interference (the same principles of course apply to the effect of suggestion on facilitation).
First, let us say the test for the interaction suggestion (present vs absent) × word type (incongruent vs neutral) is F(1, 40) = 2.25, p = .14. Does this mean the suggestion was ineffective in reducing interference in the context of this study?

The F of 2.25 corresponds to a t of √2.25 = 1.5. The raw interaction effect is (interference effect for no suggestion) – (interference effect for suggestion) = (850 – 750) – (785 – 745) = 100 – 40 = 60. Thus, the standard error for the interaction is 60/1.5 = 40. What range of values could the population effect plausibly take? If the suggestion had removed the interference effect completely, the interaction would be the interference effect for no suggestion (100); conversely, if the suggestion had been completely ineffective, it would leave the interference effect as it is, and thus the interaction would be 0. Thus, we can represent the alternative as a uniform from 0 to 100. This gives a B of 2.39. That is, the data are insensitive, but if anything should increase one's confidence in the effectiveness of the suggestion.

Now, let us say the test of the interaction gave us a significant result, e.g. F(1, 40) = 4.41, p = .04, but the means remain the same as in Table 2. Now the F corresponds to a t of √4.41 = 2.1. The raw size of the interaction is still 60; thus the standard error is 60/2.1 = 29. B is now 5.54, substantial evidence for the effectiveness of the suggestion. In fact, the test of the interference effect with no suggestion is significant, t(40) = 2.80, p = .008; and the test of the interference effect for just the suggestion condition gives t(40) = 1.30, p = .20. Has the suggestion eradicated the Stroop interference effect?

We do not know, on the basis of the last non-significant result, whether the Stroop effect has been eradicated or just reduced. A Bayes factor can be used to find out. The interference effect is 40. So the standard error is 40/1.3 = 31. On the alternative that the interference effect was not eradicated, the effect will vary between 0 and the effect found in the no-suggestion condition; that is, we can use a uniform from 0 to 100. This gives a B of 1.56. That is, the data are insensitive, and while we can conclude that the interference effect was reduced (the interaction indicates that), we cannot yet conclude whether or not it was abolished.
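These three Stroop analyses can likewise be checked with the bf() sketch (an illustrative reimplementation, not the official calculator):

```r
bf(60, 40, lower = 0, upper = 100)   # non-significant interaction: B approximately 2.39
bf(60, 29, lower = 0, upper = 100)   # significant interaction: B approximately 5.54
bf(40, 31, lower = 0, upper = 100)   # interference effect under suggestion: B approximately 1.56
```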
Then, the largest the difference between mindfulness and active control could be is the difference between mindfulness and the no-treatment control (i.e. a difference of 23, which was significant). Thus, the alternative can be represented as a uniform from 0 to 23. The actual difference was 15 – 9 = 6, with a standard error of 6/0.61 = 9.8. This gives a B of 0.89. The data were insensitive. (In fact, with degrees of freedom less than 30, the standard error should be increased slightly by a correction factor given below. Increasing the standard error reduces sensitivity.)

625 626 627 628 629 630 631 632 633 634 635 636 Thus, the following year another group of participants were run. One cannot simply top up participants with orthodox statistics, unless pre-specified as possible by one's stopping rule (Armitage, McPherson, & Rowe, 1969); by contrast, with Bayes, one can always collect more participants until the data are sensitive enough, that is, B < 1/3 or B > 3; see e.g. Berger and Wolpert (1988), Dienes (2008, 2011). Of course, B is susceptible to random fluctuations up and down; why can one not capitalize on these and stop when the fluctuations favour a desired result? For example, Sanborn and Hills (2014) and Yu, Sprenger, Thomas, and Dougherty (2014) show that, when the null is true, stopping as soon as B > 3 (if that ever occurs) increases the proportion of studies for which B > 3. However, as Rouder (2014) shows, it also increases the proportion of cases in which B > 3 when H0 is false, and to exactly the same extent: B retains its meaning as relative strength of evidence, regardless of stopping rule (for more discussion, see Dienes, forthcoming).

637 638 639 640 641 642 643 644 645 646 647 648 With all data together (N = 63), the means for the three groups were 2, 14, and 6, for no-treatment, mindfulness and active control, respectively. By conventional statistics, the difference between the mindfulness group and either of the others was significant; for mindfulness vs active control, t(41) = 2.45, p = .019. The interpretation of the latter result is compromised by the topping-up procedure (one could argue the p is less than .05/2, so legitimately significant at the 5% level; however, had the result still been insensitive, would the researchers have collected more data the following year? If so, the correction should be more like .05/3; see Sagarin, Ambler, & Lee, 2014; Strube, 2006). A Bayes factor indicates strength of evidence independent of stopping rule, however. Assume as before that the most the difference between mindfulness and active control could plausibly be is the difference between mindfulness and no-treatment, that is, 12. We represent the alternative as a uniform from 0 to 12. The mean difference between mindfulness and active control is 14 – 6 = 8, with a standard error of 8/2.45 = 3.3. This gives a B of 11.45, strong evidence for the advantage of mindfulness over an active control. 20 649 650
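The Bayes factors in the last two sections can be checked with a few lines of code. The sketch below, in Python, computes B in the way described above: the sample estimate is treated as normally distributed around the population effect with the given standard error, H0 is a point at zero, and H1 is represented as a uniform. It is a minimal illustration of the calculation, not the Dienes (2008) calculator itself, and the function names are illustrative.

    import math

    def npdf(x, m, s):
        # normal density of x for mean m and standard deviation s
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    def ncdf(x, m, s):
        # normal cumulative probability of x for mean m and standard deviation s
        return 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))

    def bf_uniform(mean, se, lower, upper):
        # likelihood of the data under H1: normal likelihood averaged over a uniform from lower to upper
        like_h1 = (ncdf(mean, lower, se) - ncdf(mean, upper, se)) / (upper - lower)
        # likelihood of the data under H0: a population effect of exactly zero
        like_h0 = npdf(mean, 0.0, se)
        return like_h1 / like_h0

    print(bf_uniform(60, 40, 0, 100))   # Stroop interaction, section 4: about 2.39
    print(bf_uniform(6, 9.8, 0, 23))    # first-year BCI data, section 5: about 0.89
    print(bf_uniform(8, 3.3, 0, 12))    # all BCI data together: about 11.45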
651 3.2. Bayes with a half-normal distribution.

652 653 654 In the previous examples we considered cases where a rough plausible maximum could be determined. Here, we consider an illustrative case where a rough expected value can be determined (and the theory makes a clear directional prediction).

655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 Norming materials. Guo, Li, Yang, and Dienes (2013) constructed lists of nouns that differed in terms of the size of the object that the words represented, as rated on a 1-7 bipolar scale (1 = very small; 7 = very big). Big objects were perceived as bigger than small objects (5.42 vs 2.73), t(38) = 20.05, as was desired. Other dimensions were also rated, including height. Large objects were rated non-significantly taller than small objects on an equivalent scale (5.13 vs 4.5), t(38) = 1.47, p = .15. Are the materials controlled for height? The difference in size was 5.42 – 2.73 = 2.69. This was taken to be a representative amount by which the two lists may differ on another dimension, like height. Thus, a B was calculated assuming a half-normal with SD = 2.69. (The test is directional because it is implausible that big objects would be shorter, unless specially selected to be so.) The "mean" was 5.13 – 4.5 = 0.63 height units, and the standard error = 0.63/1.47 = 0.43 height units. This yields B = 0.83. The data are insensitive; no conclusions follow as to whether height was controlled. (Using the same half-normal, the lists were found to be equivalent on familiarity and valence, Bs < 1/3. These dimensions would be best analysed with full normal distributions (i.e. mean of 0, SD = 2.69), but if the null is supported with half-normal distributions it is certainly supported with full normal distributions, because the latter entail vaguer alternatives.)

671 3.3. Bayes with a Normal Distribution.

672 673 674 675 676 A half-normal distribution is intrinsically associated with a theory that makes directional predictions. A normal distribution may be used for theories that do not predict direction, as well as those that do. In the second case, the normal rather than half-normal is useful where it is clear that a certain effect is predicted and smaller effects are increasingly unlikely. One example is considered.

677 678 679 680 681 682 683 684 685 686 687 688 Theory roughly predicts a certain value and smaller values are not more likely. Fu, Dienes, Shang, and Fu (2013) tested Chinese and UK people on their ability to implicitly learn global or local structures. A culture (Chinese vs British) by level (global vs local) interaction was obtained. Chinese were superior to British people at the global level, RT difference in learning effects = 50 ms, with standard error = 14 ms. The difference at the local level was 15 ms, SE = 13 ms, t(47) = 1.31, p = .20. Fu et al wished to consider the theory that Chinese outperform British people simply because of greater motivation. If this were true, then Chinese people should outperform British people at the local level to the same extent as they outperform British people at the global level. We can represent our knowledge of how big this effect should be by a normal distribution with a mean of 50 and a standard deviation of 14. This gives B = 0.25. That is, relative to this theory, the null is supported. The motivation theory can be rejected based both on the significant interaction and on the Bayes factor.
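The half-normal and normal examples just given can be computed in the same way as the uniform examples above. The sketch below uses the standard closed form for averaging a normal likelihood over a normal or half-normal distribution of population values; again the function names are illustrative rather than part of the Dienes (2008) calculator.

    import math

    def npdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    def ncdf(x, m, s):
        return 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))

    def bf_normal(mean, se, h1_mean, h1_sd):
        # H1: population effect drawn from a normal with mean h1_mean and SD h1_sd
        like_h1 = npdf(mean, h1_mean, math.sqrt(h1_sd ** 2 + se ** 2))
        return like_h1 / npdf(mean, 0.0, se)

    def bf_halfnormal(mean, se, h1_sd):
        # H1: population effect drawn from a half-normal (positive values only) with scale h1_sd
        s = math.sqrt(h1_sd ** 2 + se ** 2)
        post_mean = mean * h1_sd ** 2 / (h1_sd ** 2 + se ** 2)
        post_sd = math.sqrt(h1_sd ** 2 * se ** 2 / (h1_sd ** 2 + se ** 2))
        like_h1 = 2 * npdf(mean, 0.0, s) * ncdf(post_mean / post_sd, 0.0, 1.0)
        return like_h1 / npdf(mean, 0.0, se)

    print(bf_halfnormal(0.63, 0.43, 2.69))  # height norming example: about 0.83
    print(bf_normal(15, 13, 50, 14))        # motivation theory example: about 0.25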
689 690 691 692 693 My typical default for a normal distribution is to use SD = mean/2. This would suggest an SD of 25 rather than 14, the latter being the standard deviation of the sampling distribution of the estimated effect in the global condition. In this case, where the same participants are being run on the same paradigm with a minor change to which motivation should be blind, the sampling distribution of the effect in one condition stands as a good representation of our knowledge of its effect in the other condition, assuming a process blind to the differences between conditions10. Defaults are good to consider, but they are not binding. 21 694 695 696

697 4. Some general considerations in using Bayes factors

698 699 700 701 702 703 Appendix 1 gives further examples, illustrating trend analysis, the use of standardised effect sizes, and why sometimes raw effect sizes are required to address scientific problems; it also indicates how to easily extend the use of the Dienes (2008) Bayes calculator to simple meta-analysis, correlations, and contingency tables. Appendix 2 considers other methods for comparing relative evidence (e.g. likelihood inference and BIC, the Bayesian Information Criterion).

704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 4.1 Assumptions of the Dienes (2008) calculator. The calculator assumes the parameter estimate is normally distributed with known variance. However, in a t-test situation we have only estimated the variance. If the population distribution of observations is normal, and degrees of freedom are above 30, then the assumption of known variance is good enough (see Wilcox, 2010, for problems when the underlying distribution is not normal). For degrees of freedom less than 30, the following correction should be applied: Increase the standard error by a factor of (1 + 20/df²) (adapted from Berry, 1996; it produces a good approximation to t, over-correcting by a small amount). For example, if df = 13 and the standard error of the difference is 5.1, we need to correct by (1 + 20/(13 × 13)) = 1.12. That is, the standard error we will enter is 5.1 × 1.12 = 5.7. If degrees of freedom have been adjusted in a t-test (on the same comparison as the one for which the Bayes factor is to be calculated) to account for unequal variances, the adjusted degrees of freedom should be used in the correction. No adjustment is made when the dependent variable is (Fisher z transformed) correlations or for the contingency table analyses illustrated in the Appendix, as in these cases the standard error is known.

719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 Note that the assumptions of the Dienes (2008) calculator are specific to that calculator and not to Bayes generally. Bayes is a general all-purpose method that can be applied to any specified distribution or to a bootstrapped distribution (e.g. Jackman, 2009; Kruschke, 2010a; Lee & Wagenmakers, 2014; see Kruschke, 2013b, for a Bayesian analysis that allows heavy-tailed distributions). Bayes is also not limited to one degree of freedom contrasts, as the Dienes (2008) calculator is (see Hoijtink, Klugkist & Boelen, 2008, for Bayes factors on complex hypotheses involving a set of inequality constraints). However, pinpoint tests of theoretical predictions are generally one degree of freedom contrasts (e.g. Lewis, 1993; Rosenthal, Rosnow, & Rubin, 2000). Multiple degree of freedom tests usually serve as precursors to one degree of freedom tests simply to control familywise error rates (though see Hoijtink et al). But for a Bayesian analysis one should not correct for what other tests are conducted (only data relevant to a hypothesis should be considered to evaluate that hypothesis), so one can go directly to the theoretically relevant specific contrasts of interest (Dienes, 2011; for more extended discussion see Dienes, forthcoming; and Kruschke, 2010a for the use of hierarchical modelling for dealing with multiple comparisons).
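The small-sample correction described in section 4.1 is easily applied before a standard error is entered into the calculator. The sketch below simply restates the formula given above in code, using the worked figures from that paragraph.

    def corrected_se(se, df):
        # for df < 30, inflate the standard error by (1 + 20/df**2), as described in section 4.1
        return se * (1 + 20 / df ** 2)

    print(corrected_se(5.1, 13))   # about 5.7 (the text rounds the factor of 1.118 to 1.12)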
734 735 736 737 4.2 Power, intervals, and Bayes. No matter how much power one has, a sensitive result is never guaranteed. Sensitivity can be guaranteed with intervals and Bayes factors: One can collect data until the interval is smaller than the null region and is either in or out of the null region, or until the Bayes factor is either greater than 3 or less than a third (see Dienes, 2008 and forthcoming-b for related discussion; in fact, error probabilities are controlled remarkably well with such methods, see Sanborn & Hills, 2014, for limits and exceptions, and Rouder, 2014, for the demonstration that the Bayes factor always retains its meaning as strength of evidence – how much more likely the data are on H1 than H0 – regardless of stopping rule). Because power does not make use of the actual data to assess its sensitivity, one should assess the actual sensitivity of obtained data with either intervals or Bayes factors. Thus, one can run until sensitivity has been reached. Even if one cannot collect more data, there is still no point referring to power once the data are in. The study may be low powered, but the data actually sensitive; or the study may be high powered, and the data actually insensitive. Power is helpful in finding out a rough number of observations needed; intervals and Bayes factors can be used thereafter to assess data (cf. Royall, 1997).

10 Using SD = mean/2 is like using the sampling distribution of the mean when it is only just significantly different from zero. 22 738 739 740 741 742 743 744 745 746 747 748

749 750 751 752 753 754 Consider a study investigating whether imagining kicking a football for 20 minutes every day for a month increased the number of shots into goal out of 100 kicks. Real practice in kicking 20 minutes a day increases performance by 12 goals. We set this as the maximum of a uniform. What should the minimum be? The minimum is needed to interpret a confidence or other interval. It is hard to say for sure what the minimum should be. A score of .01 goals may not be worth it, but maybe 1 goal would be? Let us set 0.5 as the minimum.

755 756 757 758 759 760 761 762 763 764 765 766 767 768 The imagination condition led to an improvement of 0.4 goals with a standard error of 1. Using a uniform from 0.5 to 12, B = 0.11, indicating substantial evidence for the null hypothesis. If we used a uniform from 0 to 12, B = 0.15, hardly changed. However, a confidence (or other) interval would declare the results insensitive, as the interval extends outside the null region (even a 68% interval, provided by 0.4 ± 1 goals). The Bayes factor makes use of the full range of predictions of the alternative hypothesis, and can use this information to most sensitively draw conclusions (conditional on the information it assumes) (Jaynes, 2003). Further, the Bayes factor is most sensitive to the maximum, which could be specified reasonably objectively. Inference by intervals is completely dependent on specification of the minimum, which is often hard to specify objectively. However, if the minimum were easy to objectively specify and the maximum hard in a particular case, it is inference by intervals that would be most sensitive to the known constraints, and would be the preferred solution. In that sense, Bayes factors and intervals complement each other in their strengths and weaknesses (see Table 3)11.
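The contrast between the two approaches in the imaginary kicking study can be made concrete with the same kind of sketch as before (the helpers are repeated from the earlier sketches so that this fragment runs on its own; the numbers are the illustrative ones given above).

    import math
    def npdf(x, m, s): return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    def ncdf(x, m, s): return 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))
    def bf_uniform(mean, se, lo, hi): return ((ncdf(mean, lo, se) - ncdf(mean, hi, se)) / (hi - lo)) / npdf(mean, 0.0, se)

    print(bf_uniform(0.4, 1.0, 0.5, 12))   # about 0.11: substantial evidence for the null
    print(bf_uniform(0.4, 1.0, 0.0, 12))   # about 0.15: hardly changed by the choice of minimum
    # By contrast, even a 68% interval (0.4 - 1 to 0.4 + 1 goals) extends above the minimally
    # interesting value of 0.5, so inference by intervals remains inconclusive here.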
769 770 771 772 In sum, in many cases, Bayes factors may be useful not only when there is little chance of collecting more data, so as to extract the most out of the data; but also in setting stopping rules for grant applications, or schemes like Registered Reports in the journal Cortex, so as to collect data in the most efficient way (for an example see Wagenmakers et al, 2012).

11 When a null region can be specified, Bayesian intervals and Bayes factors can be related (cf. Kruschke, 2013b, Appendix D; Wetzels et al, 2009, for a somewhat different way of making the relationship). Let the alternative hypothesis be the complement of the null region. Define a prior distribution. The prior odds is the area representing the alternative divided by the area in the null region. Update the prior to obtain the posterior distribution (see e.g. Kruschke, 2013b, or tools on the website for Dienes, 2008). The strength of evidence that the data provide for the alternative hypothesis over the null is the degree to which the evidence changes the prior odds: The Bayes factor is the posterior odds divided by the prior odds. Example: A prior of N(0, 1) and data of N(0.5, 0.2) (i.e. mean = 0.5, SE = 0.2) gives a posterior (using the software on the website of Dienes, 2008, to obtain posteriors: http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_normalposterior.swf) of N(0.48, 0.2). For a null region of [-0.1, 0.1], using areas under the normal curve, we get B = 36.17/11.55 = 3.13. (This is not using the Dienes, 2008, Bayes factor calculator, just areas under the normal, with e.g. http://davidmlane.com/hyperstat/z_table.html). Conversely, a prior of N(0, 1) and data of N(0, 0.1) give a posterior of N(0, 0.1) and B = 0.19. (Check these numbers for homework.) 23

773 774 775 776 777 778 (By estimating the relevant standard deviation and likely population effect in advance of running an experiment, you can enter into the calculator different standard errors based on different N, and find the N for which B > 3 or < 1/3. This provides an estimate of the required N. But, unlike with orthodox statistics, it is not an estimate you have committed to; cf. Kruschke, 2010a; Royall, 1997).

Table 3. Comparing Intervals and Bayes for Interpreting a Non-significant Result. 779 780

What does it tell you?
    Intervals: How precisely a parameter has been estimated; a reflection of data rather than theory.
    Bayes factors: The strength of evidence the data provide for one theory over another; specific to the two theories contrasted.
What do you need to link data to theory?
    Intervals: A minimal value below which the theory is refuted.
    Bayes factors: A rough expected value or maximum value consistent with theory.
Amount of data needed to obtain evidence for the null?
    Intervals: Enough to make sure the width of the interval is less than that of the null region; considerable participant numbers will typically be needed in contrast to Bayes factors.
    Bayes factors: Bayes factors ensure maximum efficiency in use of participants, given a Bayes factor measures strength of evidence.
What would be a useful stopping rule to guarantee sensitivity?
    Intervals: Interval width no more than null region width and interval either completely in or completely out of the null region.
    Bayes factors: Bayes factor either greater than three or less than a third.
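The first of the two numbers footnote 11 asks the reader to check can be reproduced with a few lines of code; only areas under normal curves are involved. The sketch below uses the rounded posterior N(0.48, 0.2) quoted in the footnote, so it is an illustration of the footnote's arithmetic rather than of the Dienes (2008) calculator.

    import math
    def ncdf(x, m, s): return 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))

    null_lo, null_hi = -0.1, 0.1
    # prior N(0, 1): odds of the alternative (outside the null region) to the null region
    p_null_prior = ncdf(null_hi, 0.0, 1.0) - ncdf(null_lo, 0.0, 1.0)
    prior_odds = (1 - p_null_prior) / p_null_prior            # about 11.55
    # posterior N(0.48, 0.2), as given in footnote 11 for data N(0.5, 0.2)
    p_null_post = ncdf(null_hi, 0.48, 0.2) - ncdf(null_lo, 0.48, 0.2)
    post_odds = (1 - p_null_post) / p_null_post                # about 36.2
    print(post_odds / prior_odds)                              # about 3.1, footnote 11's B = 3.13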
781 782 783 784 785 786 787 788 4.3 Why choose one distribution rather than another? On a two-alternative recognition test we have determined that a rough likely level of performance, should knowledge exist, is 60%, and we are happy that it is not very plausible for performance to exceed 70%, nor to go below chance baseline (50%). Performance is 55% with a standard error of 2.6%. To determine a Bayes factor, we need first to rescale so that the null hypothesis is 0. So we subtract 50% from all scores. Thus, the "mean" is 5% and the standard error 2.6%. We can use a uniform from 0 to 20 to represent the constraint that the score lies between chance and 20% above chance. This gives B = 2.01. 24 789 790

791 792 793 794 795 796 797 798 799 800 801 802 803 But we might have argued that as 60% (i.e. 10% above baseline) was rather likely on the theory (and background knowledge), we could use a normal distribution with a mean of 10% and a standard deviation of mean/2 = 5%. (Note this representation still entails that the true population value is somewhere between roughly 0 and 20.) This gives B = 1.98, barely changed. Or why not, you ask, use a half-normal distribution with the expected typical value, 10%, as the standard deviation? (Note this representation still entails that the true population value is somewhere between 0 and roughly 20.) This gives B = 2.75, somewhat different but qualitatively the same conclusion. In all cases the conclusion is that more evidence is needed. (Compare also the different ways of specifying the alternative provided by Rouder et al, 2009.) In this case, there may be no clear theoretical reason for preferring one of the distributions over the other for representing the alternative. But as radically shifting the distribution around (from flat, to humped right up against one extreme, to humped in the middle) made no qualitative difference to conclusions, we can trust the conclusions.

804 805 806 807 808 809 Always perform a robustness check: Could theoretical constraints be just as readily represented in a different way, where the real theory is neutral to the different representations? Check that you get the same conclusion with the different representations. If not, run more participants until you do. If you cannot collect more data, acknowledge ambiguity in the conclusion (as was done by e.g. Shang et al, 2013). (See Kruschke, 2013c, for the equivalent robustness check for inference by intervals.) Appendix 3 considers robustness in more detail. A short sketch of such a check for the recognition example above is given below.

810 811 812 Different ways of representing the alternative may clearly correspond to different theories, as was illustrated in example 2 in Appendix 1. Indeed, one of the virtues of Bayes is that it asks one to consider theory, and the relation of theory to data, carefully.
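As promised above, here is a minimal robustness check for the recognition example, comparing the three representations just discussed (helpers repeated from the earlier sketches so the fragment runs on its own; the figures are those given above, after rescaling so that chance is zero).

    import math
    def npdf(x, m, s): return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    def ncdf(x, m, s): return 0.5 * (1 + math.erf((x - m) / (s * math.sqrt(2))))
    def bf_uniform(mean, se, lo, hi): return ((ncdf(mean, lo, se) - ncdf(mean, hi, se)) / (hi - lo)) / npdf(mean, 0.0, se)
    def bf_normal(mean, se, m1, sd1): return npdf(mean, m1, math.sqrt(sd1 ** 2 + se ** 2)) / npdf(mean, 0.0, se)
    def bf_halfnormal(mean, se, sd1):
        post_mean = mean * sd1 ** 2 / (sd1 ** 2 + se ** 2)
        post_sd = math.sqrt(sd1 ** 2 * se ** 2 / (sd1 ** 2 + se ** 2))
        return 2 * npdf(mean, 0.0, math.sqrt(sd1 ** 2 + se ** 2)) * ncdf(post_mean / post_sd, 0.0, 1.0) / npdf(mean, 0.0, se)

    mean, se = 5.0, 2.6                   # performance above chance and its standard error, in %
    print(bf_uniform(mean, se, 0, 20))    # flat between 0 and 20: about 2.01
    print(bf_normal(mean, se, 10, 5))     # humped in the middle: about 1.98
    print(bf_halfnormal(mean, se, 10))    # humped up against zero: about 2.75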
813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 4.4 But how can any constraints be set on what my alternative predicts? The examples in this paper (including the appendix) have illustrated finding constraints by use of data from different conditions, constraints intrinsic to the design and logic of the theory, and those provided by standardized effect sizes and regression slopes to convert different raw units of measurement into each other; and Dienes (in press) provides examples where constraints intrinsic to the measurement scale are very useful. However, there is one thing one cannot do, and that is to use the very effect itself, or its confidence interval, as the basis for predicting that same effect. For example, if the obtained mean difference was 15 ms, one cannot for that reason set the alternative to a half-normal distribution with a standard deviation of 15. One can use other aspects of the same data to provide constraints, but not the very aspect that one is testing (Jaynes, 2003). Double counting the mean difference in both the data summary and as specifying the predictions of the alternative violates the axioms of probability in updating theory probabilities. This problem creates pressure to demand default alternatives for Bayes factors, on the grounds that letting people construct their own alternative is open to abuse (Kievit, 2011).

828 829 830 831 832 833 834 The Bayes factor stands as a good evaluation of a theory to the extent that its assumptions can be justified. And fortunately, all assumptions that go into a Bayes factor are public. So, the type of abuse peculiar to Bayes (that is, representing what a theory predicts in ways favourable with respect to how the data actually turned out) is entirely transparent. It is open to public scrutiny and debate (unlike many of the factors that affect significance testing; see Dienes, 2011). Specifying what predictions a theory makes is also part of the substance of science. It is precisely what we should be trying to engage with, in the context of public debate. 25 835 836 837 838 839 840 Trying to convince people of one's theory with cheap rhetorical tricks is the danger that comes with allowing argument over theories at all. Science already contains the solution to that problem, namely transparency, the right of everyone in principle to voice an argument, and commitment to norms of accountability to evidence and logic. When the theory itself does the work in determining the representation of the alternative, the Bayes factor genuinely informs us about the bearing of evidence on theory.

841 842 843 844 845 846 847 848 849 850 851 852 4.5 Subjective versus objective Bayes. Bayesian approaches are often divided into subjective Bayes and objective Bayes (Stone, 2013). Being Bayesian at all in statistics is defined by allowing probabilities for population parameter values (and for theories more generally), and using such probability distributions for inference. Whenever one of the examples above asked for a specification of what the theory predicts, it meant an assignment of a probability density distribution to different parameter values. How are these probabilities to be interpreted? The subjective Bayesian says that these probabilities can only be obtained by looking deep in one's soul; probabilities are intrinsically subjective (e.g. Howson & Urbach, 2006). That is, it is all a matter of judgment. The objective Bayesian says that rather than relying on personal or subjective judgment, one should describe a problem situation such that the conclusions objectively follow from the stated constraints; every rational person should draw the same conclusions (e.g. Jaynes, 2003).

853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 The approach in this paper respects both camps. The approach is objective in that the examples illustrate rules of thumb that can act as (contextually relevant) defaults, where the probability distributions are specified in simple objective ways by reference to data or logical or mathematical relations inherent in the design.
No example relied on anyone saying, “according to my intuition the mean should be 2 because that’s how I feel” (Cf. Howard, Maxwell, & Fleming, 2000, for such priors). But the approach is subjective in that the examples illustrate that only scientific judgment can determine the right representation of the theory’s predictions given the theory and existing background knowledge; and that scientific judgment entails that all defaults are defeasible - because science is subject to the frame problem, and doing Bayes is doing science. Being a subjectivist means a given Bayes factor can be treated not as THE objective representation of the relation of the theory to evidence, but as the best given the current level of discussion, and it could be improved if e.g. another condition was run which informed what the theory predicted in this context (e.g. defined a maximum value more clearly). A representation of the theory can be provisionally accepted as reflecting current judgment and ignorance about the theory. 868 869 5 Conclusions. 870 871 872 873 874 875 This paper has explored the interpretation of non-significant results. A key aspect of the argument is that non-significant results cannot be interpreted in a theoretical vacuum. What is needed is a way of conducting statistics that more intimately links theory to data, either via inference by intervals or via Bayes factors. Using canned statistical solutions to force theorizing is backwards; statistics are to be the handmaiden of science, not science the butler boy of statistics. 876 877 878 879 880 Bayes has its own coherent logic that makes it radically different from significance testing. To be clear, a two-step decision process is not being advocated in this paper whereby Bayes is only performed after a non-significant result. Bayes can – and should – be performed any time a theory is being tested. At this stage of statistical practice, however, orthodox statistics typically need to be performed for papers to be accepted in journals. In 26 881 882 883 884 885 886 887 888 889 890 891 cases where orthodoxy fails to provide an answer, Bayes can be very useful. The most common case where orthodoxy fails is where non-significant results are obtained. But there are other cases. For example, when a reviewer asks for more subjects to be run, and the original results were already tested at the 5% level, Bayesian statistics are needed (as in example 3.1.5 above; orthodoxy is ruled out by its own logic in this case). Running a Bayes factor when non-significant results are obtained is simply a way that we as a community can come to know Bayes, and to obtain answers where we need answers, and none are forthcoming from orthodoxy. Once there is a communal competence in interpreting Bayes, frequentist orthodox statistics may cease to be conducted at all (or maybe just in cases where no substantial theory exists, and no prior constraints exists – that is, in impoverished prescientific environments). 892 893 894 895 896 897 898 899 One complaint about the approach presented here could be the following: “I am used to producing statistics that are independent of my theory. Traditionally, we could agree on the statistics, whatever our theories. Now the statistical result depends on specifying what my theory predicts. 
That means I might have to think about mechanisms by which the theory works, relate those mechanisms to previous data, argue with peers about how theory relates to different experiments, connect theory to predictions in ways that peers may argue about. All of this may produce considerable discussion before we can determine which if any theory has been supported!” The solution to this problem is the problem itself. 900 901 902 903 904 905 906 907 908 Bayes can be criticized because it relies on “priors”, and it may be difficult to specify what those priors are. Priors figure in Bayesian reasoning in two ways. The basic Bayesian schema can be represented as: Posterior odds in two theories = Bayes factor×prior odds in two theories. The prior odds in two theories are a sort of prior. The approach illustrated in this paper has lifted the Bayes factor out of that context and treated it alone as a measure of strength of evidence (cf Rouder et al., 2009; Royall, 1996). So there is no need to specify that sort of prior. But the Bayes factor itself requires specifying what the theories predict, and this is also called a prior. Hopefully it is obvious that if one wants to know how much evidence supports a theory, one has to know what the theory predicts. 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 The approach in this paper, though Bayesian, involves approaching analysis in a different way in detail than one would if one followed, say, Kruschke (2010a) or Lee and Wagenmakers (2014), who also define themselves as teaching the Bayesian approach. There are many ways of being a Bayesian and they are not exclusive. The philosophy of the approach here, as also illustrated in my papers published to date using Bayes, is to make the minimal changes to current practice in the simplest way that would still bring many of the advantages of Bayesian inference. In many cases, orthodox and Bayesian analyses will actually agree (e.g. Simonsohn, submitted, which is reassuring, even to Bayesians). A key place where they disagree is in the interpretation of non-significant results. In practice, orthodox statistics have not been used in a way that justifies any conclusion. So one strategy, at least initially, is to continue to use orthodox statistics, but also introduce Bayesian statistics simultaneously. Wherever non-significant results arise from which one wants to draw any conclusion, a Bayes factor can be employed to disambiguate, as illustrated in this paper (or a credibility or other interval, if minima can be simply established). In that way reviewers and readers get to see the analyses they know how to interpret, and the Bayes provides extra information where orthodoxy does not provide any answer. Thus, people can get used to Bayes and its proper use gradually debated and established. There need be no sudden wholesale replacement of conventional analyses. Thus, you do not need to wait for the revolution; your very next paper can use Bayes. 27 928 929 930 931 932 933 934 935 936 937 The approach in this paper also misses out on many of the advantages of Bayes. For example, Bayesian hierarchical modelling can be invaluable (Kruschke, 2010a; Lee & Wagenmaakers, 2014). Indeed, it can be readily combined with the approach in this paper. That is, the use of priors in the sense not used here can aid data analysis in important ways. Further, the advantages of Bayes go well beyond the interpretation of non-significant results (e.g. Dienes, 2011). 
This paper is just a small part of a larger research programme. But the problem it addresses has been an Achilles tendon of conventional practice (Oakes, 1986; Wagenmakers, 2007). From now on, all editors, reviewers and authors can decide: whenever a non-significant result is used to draw any conclusion, it must be justified by either inference by intervals or measures of relative evidence. Or else you remain silent. 28 938 References 939 940 941 Allen, C. P. G., Sumner, P., & Chambers, C. D. (2014). The Timing and Neuroanatomy of Conscious Vision as Revealed by TMS-induced Blindsight. Journal of Cognitive Neuroscience, DOI: 10.1162/jocn_a_00557 942 943 American Psychological Association (2010). Publication Manual of the American Psychological Association (5th ed.). Washington, DC: Author. 944 945 Armitage, P., McPherson, C. K., & Rowe, B. C. (1969). Repeated significance tests on accumulating data. Journal of the Royal Statistical Society, Series A, 132, 235-244. 946 947 948 Armstrong, A. M., & Dienes, Z. (2013). Subliminal Understanding of Negation: Unconscious Control by Subliminal Processing of Word Pairs. Consciousness & Cognition, 22, 1022-1040. 949 950 Baguley, T. (2004). Understanding statistical power in the context of applied research. Applied Ergonomics, 35, 73–80 951 952 Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100, 603–617 953 954 Baguley, T. (2012). Serious Stats: A guide to advanced statistics for the behavioral sciences. Palgrave Macmillan. 955 956 Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2, 317-335. 957 958 Berger, J. O., & Wolpert, R. L. (1988). The Likelihood Principle, 2nd edition. Institute of Mathematical Statistic. 959 Berry, D. A. (1996). Statistics: A Bayesian perspective. Duxbury Press: London. 960 961 Berry, S. M., Carlin, B. P., Lee, J. J., & Müller, P. (2011). Bayesian adaptive methods for clinical trials. Chapman & Hall: London. 962 963 964 Box, G. E. P., & Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis (Addison-Wesley series in behavioral science, quantitative methods). Longman Higher Education. 965 966 Buhle, J. T., Stevens, B. L., Friedman, J. J., & Wager, T. D. (2012). Distraction and Placebo: Two Separate Routes to Pain Control. Psychological Science, 23, 246-253. 967 968 Burnham, K. P., & Anderson, D.R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer. 969 970 971 Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376. 972 973 Cohen, J. (1962). The statistical power of abnormal-social psychological research: a review. Journal of Abnormal and Social Psychology, 65, 145-53. 974 975 Cohen, J. (1988). Statistical power analysis for the behavioural sciences (2nd edition).Erlbaum: Hillsdale, NJ. 29 976 977 1312. 978 979 Cumming, G. (2011). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge. 980 981 Dienes, Z. (2008). Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan. 982 983 Dienes, Z. (2011). Bayesian versus Orthodox statistics: Which side are you on? Perspectives on Psychological Sciences, 6, 274-290. 984 985 Dienes, Z. (in preparation). 
Bayesian mediation analysis: How to assert no mediation or full mediation. 986 987 988 Dienes, Z. (in press). How Bayesian statistics are needed to determine whether mental states are unconscious. In M. Overgaard (Ed.), Behavioural Methods in Consciousness Research. Oxford: Oxford University Press. 989 990 Dienes, Z. (forthcoming). How Bayes factors change our science. Journal of Mathematical Psychology (special issue on Bayes factors). 991 992 993 Dienes, Z., Baddeley, R. J., & Jansari, A. (2012). Rapidly Measuring The Speed Of Unconscious Learning: Amnesics Learn Quickly And Happy People Slowly. PLoS ONE 7(3): e33400. doi:10.1371/journal.pone.0033400 994 995 Dienes, Z, & Hutton, S. (2013). Understanding hypnosis metacognitively: rTMS applied to left DLPFC increases hypnotic suggestibility. Cortex, 49, 386-392. 996 997 Edwards, A. W. F. (1992). Likelihood (Expanded Edition). John Hopkins University Press: London. 998 999 1000 Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160. 1001 Cohen, J. (1990). Things I have learnt so far. American Psychologist, 45 (12), 1304- Fisher, R. A. (1935). The design of experiments. Oliver and Boyd. 1002 1003 Freedman, L. S., & Spiegelhalter, D. J. (1983). The assessment of subjective opinion and its use in relation to stopping rules for clinical trials. The Statistician, 32, 153-160. 1004 1005 1006 Fu, Q., Bin, G., Dienes, Z., Fu, X., & Gao, X. (2013). Learning without consciously knowing: Evidence from event-related potentials in sequence learning. Consciousness and Cognition, 22, 22-34. 1007 1008 Fu, Q., Dienes, Z., Shang, J., & Fu, X. (2013). Who learns more? Cultural differences in implicit sequence learning. PLoS ONE 8(8): e71625. doi:10.1371/journal.pone.0071625 1009 1010 Gallistel, C. R. (2009). The Importance of Proving the Null. Psychological Review, 116, 439–453. 1011 1012 Glover, S., & Dixon, P. (2004). Likelihood ratios: A simple and flexible statistic for empirical psychologists. Psychonomic Bulletin & Review, 11, 791-806. 30 1013 1014 Goodman, S. N. (1999). Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. Annals of Internal Medicine, 130, 1005- 1013. 1015 1016 1017 Guo, X., Jiang, S., Wang, H., Zhu, L., Tang, J., Dienes, Z., & Yang, Z. (2013). Unconsciously Learning Task-irrelevant Perceptual Sequences. Consciousness and Cognition, 22, 203–211. 1018 1019 1020 Guo, X., Li, F., Yang, Z., & Dienes, Z. (2013). Bidirectional transfer between metaphorical related domains in Implicit learning of form-meaning connections. PLoS ONE, 8(7): e68100. 1021 1022 Greenwald, A. G. (1975). Consequences of Prejudice against the Null Hypothesis. Psychological Bulletin, 82, 1 -20. 1023 1024 1025 Greenwald, A. G. (1993). Consequences of prejudice against the null hypothesis. In G. Keren & C. Lewis (Eds), A Handbook for data analysis in the behavioural sciences: Methodological issues (pp 419 – 449). Lawrence Erlbaum : Hove. 1026 1027 Hodges, J., & Lehmann, E. (1954). Testing the Approximate Validity of Statistical Hypotheses. Journal of the Royal Statistical Society. Series B (Methodological), 261-268. 1028 1029 Hoijtink, H., Klugkist, I., & Boelen, P.A. (Eds.). (2008).Bayesian evaluation of informative hypotheses. New York: Springer. 1030 1031 1032 Howard, G. S., Maxwell, S. E., & Fleming, K. J. (2000). 
The Proof of the Pudding: An Illustration of the Relative Strengths of Null Hypothesis, Meta-Analysis, and Bayesian Analysis. Psychological Methods, 5, 315-332. 1033 1034 Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach (3rd ed.). Chicago: Open Court. 1035 Jackman, S. (2009). Bayesian Analyses for the Social Sciences. Wiley: Eastbourne. 1036 1037 Jaynes, E.T. (2003). Probability theory: The logic of science. Cambridge, England: Cambridge University Press. 1038 1039 Jeffreys, H. (1939/1961). The theory of probability: First/Third edition. Oxford, England: Oxford University Press. 1040 1041 Kirsch, I. (2010). The emperor’s new drugs: Exploding the antidepressant myth. New York: Basic Books. 1042 1043 Kirsch, I. (2011). Suggestibility and suggestive modulation of the Stroop effect. Consciousness and Cognition, 20, 335–336. 1044 1045 Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343-1370. 1046 1047 Kievit, R. A. (2011). Bayesians Caught Smuggling Priors Into Rotterdam Harbor. Perspectives on Psychological Science, 6, 313. 1048 1049 Kiyokawa, S., Dienes, Z., Tanaka, D., Yamada, A., & Crowe, L. (2012). Cross Cultural Differences in Unconscious Knowledge. Cognition, 124, 16-24. 31 1050 1051 Kruschke, J. K. (2010a). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press. 1052 1053 Kruschke, J.K. (2010b). Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1, 658–676. 1054 1055 Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299-312. 1056 1057 1058 Kruschke, J. K. (2013a). Posterior predictive checks can and should be Bayesian: Comment on Gelman and Shalizi, ‘Philosophy and the practice of Bayesian statistics’. British Journal of Mathematical and Statistical Psychology, 66, 45-56. 1059 1060 Kruschke, J. K. (2013b). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603. 1061 1062 1063 1064 Kruschke, J. K. (2013c). How much of a Bayesian posterior distribution falls inside a region of practical equivalence (ROPE). http://doingbayesiandataanalysis.blogspot.co.uk/2013/08/how-much-of-bayesianposterior.html. Retrieved 9 August 2013, 12:30pm. 1065 1066 Lee, M. D., & Wagenmakkers, E. J. (2005). Bayesian Statistical Inference in Psychology: Comment on Trafimow (2003). Psychological Review, 112, 662– 668. 1067 1068 Lee, M. D., & Wagenmakers, E. J. (2014). Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press: Cambridge. 1069 1070 1071 Lewis, C. (1993). Analyzing means from repeated measures data. In G. Keren & C. Lewis (Eds), A Handbook for Data Analysis for the Behavioural Sciences: Statistical Issues (pp 73 – 94). Erlbaum: Hove. 1072 1073 1074 Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of gpriors for Bayesian variable selection. Journal of the American Statistical Association, 103, 410–423 1075 Lindley, D.V. (1957). A Statistical Paradox. Biometrika, 44, 187–192. 1076 1077 Lindley, D. V. (1972). Bayesian Statistics: A Review. Society for Industrial and Applied Mathematics 1078 1079 Lockhart, R. S. (1998). Introduction to data analysis for the behavioural sciences. W. W. Freeman and Company: New York. 1080 1081 Mayo, D. (1996). Error and the growth of experimental knowledge. Chicago: University of Chicago Press. 
1082 1083 1084 McGrayne, S. B. (2012). The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy . Yale University Press. 1085 1086 1087 Mclatchie, N. (2013). Feeling Impulsive, Thinking Prosocial: The Importance of Distinguishing Guilty Feelings From Guilty Thoughts. Unpublished PhD thesis, University of Kent. 32 1088 1089 Mealor, A. D., & Dienes, Z. (2012). Conscious and unconscious thought in artificial grammar learning. Consciousness & Cognition, 21, 865-874. 1090 1091 Mealor, A. D., & Dienes, Z. (2013a). The speed of metacognition: Taking time to get to know one’s structural knowledge. Consciousness & Cognition, 22, 123-136. 1092 1093 Mealor, A. D., & Dienes, Z. (2013b). Explicit feedback maintains implicit knowledge. Consciousness & Cognition, 22, 822-832. 1094 1095 Morey, R. D., & Rouder J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16, 406-419. 1096 1097 Newell, B. R., & Rakow, T. (2011). Revising beliefs about the merit Of Unconscious Thought: Evidence In favor of the null hypothesis. Social Cognition, 29, 711–726. 1098 1099 Nuijten, M. B., Wetzels, R., Matzke, D., Dolan, C. V., & Wagenmakers, E.-J. (in press). A default Bayesian hypothesis test for mediation. Behavior Research Methods. 1100 1101 Oakes, M. (1986). Statistical Inference: Commentary for the Social and Behavioural Sciences. Wiley-Blackwell. 1102 1103 1104 Parris, B. A., & Dienes, Z. (2013). Hypnotic suggestibility predicts the magnitude of the imaginative word blindness suggestion effect in a non-hypnotic context. Consciousness & Cognition, 22, 868-874. 1105 1106 Pashler, H., & Harris, C. R. (2012). Is the Replicability Crisis Overblown? Three Arguments Examined. Perspectives on Psychological Science, 7, 531-536. 1107 1108 Pitt, M. A., Kim, W., & Myung, I. J. (2003). Flexibility vs generalizability in model selection. Psychonomic Bulletin & Review, 10, 29-44. 1109 1110 Raz, A., Fan, J., & Posner, M. I. (2005). Hypnotic suggestion reduces conflict in the human brain. Proceedings of the National Academy of Sciences, 102, 9978–9983. 1111 1112 Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113, 553-565. 1113 1114 1115 Rosenthal, R. (1993). Cumulating evidence. . In G. Keren & C. Lewis (Eds), A Handbook for data analysis in the behavioural sciences: Methodological issues (pp 519 – 559). Lawrence Erlbaum : Hove. 1116 1117 Rosenthal, R., Rosnow, R. L., & Rubin, D. R. (2000). Contrasts and effect sizes in behavioural research: A correlational approach. Cambridge University Press: Cambridge. 1118 1119 Rouder, J. N. (2014). Optional Stopping: No Problem For Bayesians. Psychonomic Bulletin & Review, 21,301-308. 1120 1121 Rouder, J. N., & Morey R. D. (2011). A Bayes-Factor Meta Analysis of Bem’s ESP Claim. Psychonomic Bulletin & Review. 18, 682-689. 1122 1123 1124 Rouder, J.N., Morey, R.D., Speckman, P.L., & Pratte, M.S. (2007). Detecting chance: A solution to the null sensitivity problem in subliminal priming. Psychonomic Bulletin & Review, 14, 597–605. 33 1125 1126 Rouder, J. N., Morey R. D., Speckman P. L., & Province J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology. 56, 356-374. 1127 1128 1129 Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., & Iverson, G.(2009). 
Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237. 1130 1131 & Hall. 1132 1133 Sagarin B. J., Ambler J. K., Lee E. M. (2014). An ethical approach to peeking at data. Perspectives on Psychological Science, 9, 293–304. 1134 1135 Sanborn, A. N. & Hills, T. T. (2014). The frequentist implications of optional stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review, 21, 283-300 1136 1137 Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on thepower of studies? Psychological Bulletin, 105, 309-316. 1138 1139 Semmens-Wheeler, R., Dienes, Z., & Duka, T. (2013). Alcohol Increases Hypnotic Susceptibility. Consciousness & Cognition, 22, 1082–1091. 1140 1141 Serlin, R., & Lapsley, D. (1985). Rationality in Psychological Research: The GoodEnough Principle. The American Psychologist, 40, 73-83. 1142 1143 1144 Shang, J., Fu, Q., Dienes, Z., Shao, C. &, Fu, X. (2013). Negative affect reduces performance in implicit sequence learning. PLoS ONE 8(1): e54693. doi:10.1371/journal.pone.0054693. 1145 1146 Simonsohn, U (submitted). Posterior-Hacking: Selective Reporting Invalidates Bayesian Results Also. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040 1147 1148 Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A Key To The File Drawer. Journal of Experimental Psychology: General, 143, 534-547 1149 1150 1151 Simonsohn, U., Nelson, L. D., & Simmons, J. P. (forthcoming). P-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2377290 1152 1153 Smith, A. F. M., & Spiegelhalter, D. J. (1980). Bayes factors and choice criteria for linear models. Journal of the Royal Statistical Society, Sries B, 42, 213-220. Royall, R.M. (1997). Statistical evidence: A likelihood paradigm. London: Chapman 1154 Smithson, M. (2003). Confidence intervals. Sage: London. 1155 1156 1157 1158 Song, M., Maniscalco, B., Koizumi, A., & Lau, H. (2013). A new method for manipulating metacognitive awareness while keeping performance constant. Oral presentation at the 17th Meeting of the Association for the Scientific Study of Consciousness, San Diego, 12-15 July, 2013. 1159 1160 Press. Stone, J. V. (2013). Bayes’ Rule: A tutorial introduction to Bayesian analysis. Sebtel 34 1161 1162 1163 Storm, L., Tressoldi, P. E., & Utts, J. (2013). Testing the Storm et al. (2010) MetaAnalysis Using Bayesian and Frequentist Approaches: Reply to Rouder et al. (2013). Psychological Bulletin, 139, 248 –254. 1164 1165 Strube, M. J. (2006). SNOOP: a program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavioural Research Methods, 38, 24-7. 1166 1167 1168 Tan, L. F., Dienes, Z., Jansari, A., & Goh, S. Y. (2014). Effect of Mindfulness Meditation on Brain-Computer Interface Performance. Consciousness and Cognition, 23, 1221. 1169 1170 1171 Tan, L.F., Jansari, A., Keng, S.L., & Goh, S.Y. (2009). Effect of mental training on BCI performance. In J.A. Jacko (Ed.), Human-Computer Interaction, Part II, HCI 2009. LNCS (Vol. 5611, pp. 632–635). Heidelberg: Springer. 1172 1173 1174 Terhune, D. B., Wudarczyk, O. A., Kochuparampil, P., & Kadosh, R. C. (2013). Enhanced dimension-specific visual working memory in grapheme-color synaesthesia. Cognition, 129 , 123-137 1175 1176 Vanpaemel, W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54, 491–498. 
1177 1178 1179 Vanpaemel, W., & Lee, M. D. (2012). Using priors to formalize theory: Optimal attention and the generalized context model. Psychonomic and Bulletin and Review, 19, 1047-56. 1180 1181 Verhagen, A. J., & Wagenmakers, E.-J. (in press). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General. 1182 1183 Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779-804. 1184 1185 1186 Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An Agenda for Purely Confirmatory Research. Perspectives on Psychological Science, 7, 632–638. 1187 1188 1189 1190 Watson, J. C., Gordon, L. B., Stermac, L., Kalogerakos, F., & Steckley, P. (2003). Comparing the Effectiveness of Process–Experiential With Cognitive–Behavioral Psychotherapy in the Treatment of Depression. Journal of Consulting and Clinical Psychology, 71, 773–781. 1191 1192 Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E. J. (2012). A default Bayesian hypothesis test for ANOVA designs. The American Statistician, 66, 104-111. 1193 1194 1195 Wetzels, R., Matzke, D., Lee, M.D., Rouder, J.N., Iverson, G.J., & Wagenmakers, E.J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6, 291–298. 1196 1197 1198 Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E.-J.(2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review, 16, 752–760. 1199 1200 Wetzels, R., & Wagenmakers, E. J. (2012). A default Bayesian hypothesis test for correlations and partial correlations. Psychonomic Bulletin & Review, 19, 1057–1064. 35 1201 1202 Wilcox, R. R. (2010). Fundamentals of modern statistical methods: Substantially improving power and accuracy, 2nd edition. Springer: London. 1203 1204 Yu, E. C., & Sprenger, A. M., Thomas, R. P., & Dougherty, M. R. (2014). When decision heuristics and science collide. Psychonomic Bulletin & Review, 21, 268–282 1205 1206 Yuan, Y., & MacKinnon, D. P. (2009). Bayesian Mediation Analysis. Psychological Methods, 14, 301–322. 1207 36 1208 Appendix 1 Further examples 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1. Trend analysis and other contrasts. Parris and Dienes (2013) investigated whether the strength of a perceptual response (measured in milliseconds) to an imaginative suggestion in a non-hypnotic context varied with hypnotisability (measured on a 0 to 12 scale). Hypnosis researchers often call people scoring 0-3 as lows, 4-7 as mediums, and 8-12 as highs. Parris and Dienes recruited an equal number of lows, mediums and highs. A key question is the shape of the relation between hypnotisability and performance on the non-hypnotic task. On the one hand, it could be simply linear. On the other hand, it may be that highs are an unusual group, with mediums performing the same as lows; or that lows are an unusual group with mediums performing the same as highs (a question posed by Kirsch, 2011). The question can be answered by whether, in addition to a linear effect of hypnotisability on the task, there is a quadratic effect. Parris and Dienes regressed performance (in milliseconds) on hypnotisability (as a 0-12 scale) and obtained a raw slope of b = 6 milliseconds/unit hypnotisability, t(33) = 2.261, p = .030. 
To obtain the quadratic effect, the hypnotisability score was squared and centred on zero. Performance was regressed on both hypnotisability and hypnotisability squared. The slope for the quadratic effect was b = .530 milliseconds/unit hypnotisability2, t(33) = .589, p = 0.56. Does the non-significant quadratic effect indicate that performance varies only linearly with hypnotisability? 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 The maximum plausible score of mediums is that of highs; thus, the minimum plausible quadratic effect would occur when mediums scored the same as highs, with lows scoring differently as a special group. Conversely, the maximum plausible quadratic effect would occur when mediums scored the same as lows, with highs scoring differently as a special group. To determine values for these extremes, first the mean performance for highs was found, and this mean set as the same performance for all mediums and highs; similarly, the mean for lows was set as the performance for all lows. This defined an idealized set of population values that were extreme in degree of quadratic effect. With this set of values treated as real data, the mean quadratic regression slope was - 0.673. Similarly, another hypothetical set of population vales was obtained by setting the mean of the lows as the same score for all lows and mediums; the highs all had the mean of the highs. With these values treated as data, the mean quadratic regression slope was 0.908. Thus, a Bayes factor can be determined assuming a uniform from -0.673 to +0.908. The mean for the data (i.e. in this case, the quadratic slope) is .53. It has a standard error of .53/.589 = 0.9. This gives a Bayes factor of 0.97. That is, the data were insensitive and provide no evidence either for or against the quadratic hypothesis. It would be wrong to use the non-significant quadratic effect to infer that mediums lay mid-way between highs and lows. 1243 1244 1245 1246 1247 The principle for constructing limits for a uniform can be generalized. For a contrast, use values of the dependent variable in each condition that would represent extremes which are plausible given the scientific context. Coefficients found using the hypothetical scores as real data determine the plausible extremes of the contrast and can be used in a uniform to represent the alternative. 1248 1249 1250 1251 1252 1253 1254 2. Equating one variable to test the effect of another. Song, Maniscalco, Koizumi, and Lau (2013) wished to investigate the function of confidence when performance was controlled. Thus, they found a way to create stimuli for which performance (d’) was calibrated to be similar and yet confidence differed. We will call this performance single task d’ (for reasons that will become clear) . (In what follows the numbers have been made up, so as not to jeopardise the publication of the authors’ paper.) On a 1-4 confidence scale, the confidence for the two conditions was 2.21 and 2.41, t(35) = 4.04, p = .0001. The 37 1255 1256 corresponding single task d’s for the two conditions were 1.83 and 1.93, t(35) = 0.96, p = 0.34. The question is, can equivalence be claimed for single task d’ across the two conditions? 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 To allow comparability between confidence and single task d’, standardized effect sizes might seem useful. 
Then, it could be argued, the sort of effect size we might expect to see in single task d’, if there were an effect, would be in the same ball park as the standardized effect size found for confidence. We will use Pearson’s r as a standardized effect size, where r2 = t2/( t2 + df). Thus, for confidence r = √ (4.042/(35 + 4.042)) = .56; and for d’, r = √ (0.962/(35 + 0.962)) = .16. Pearson’s r can be transformed to a normal variable with Fishers z = 0.5*loge((1 + r)/(1 - r)), which has standard error= 1/√ (df -1). Thus, Fisher’s z for confidence = 0.64, and for d’ Fisher’s z = 0.16. The standard error for the Fisher’s z for d’ is 0.17. Thus, a Bayes factor can be determined for a mean of 0.16, SE = 0.17, and with the alternative represented as a half-normal with standard deviation equal to 0.64. In this case, B = 0.64. That is, the data appear insensitive for establishing equivalence as B is between 3 and 1/3. 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 However, the use of standardized effect sizes in this context doesn’t address the question of interest. A standardized effect size reflects the consistency with which an effect is shown, and thus how noisily it is measured (Baguley, 2009). One could make the measurement of performance noisy simply by using few trials. (Theoretically irrelevant differences in noise between measures occur for reasons other than just number of trials. Some measures are simply noisier than others: For example, confidence has a restricted range, d’ an unrestricted range.) By using a sufficiently small number of trials for performance, the population r for the difference between conditions in single task d’ could be made very small. If confidence was measured on a different large number of trials, its population r could be made large. That is, a researcher making the measure of single task d’ as insensitive as possible would help establish equivalence in single task d’ between conditions. This is also true for testing equivalence merely by showing a non-significant result. That is, both significance testing and Bayes using standardized effect sizes, do not address the real issue, in this case, because they are sensitive to factors that are theoretically irrelevant (and could therefore be cynically manipulated by the experimenter). To address the actual theoretical concerns of the researchers, raw effect sizes should be used. 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 But, how can one use a difference in confidence units to set a difference in d’ units? The researchers were interested in the extent to which confidence predicted a third outcome variable, specifically, discrimination on a task that involved integrating over many stimuli. Call this multi-task d’. That is, the most important issue was given that the difference in confidence was associated with a change in multi-task d’, could the difference in single task d’s be large enough to explain the change in multi-task d’? Given this theoretical agenda, we can calibrate confidence and single task d’ in the following way. The difference in confidence was associated with a significant difference in multi-task d’ in their experiment (in fact, this was the main result of the study). Let us say multi-task d’ differed by 0.6 units. So, let us say the researchers run another study in which single task d’ was allowed to vary over a range from say 1.5 to 2.5, and multi-task d’ was measured. Regress single task d’ on multi-task d’. 
However, the use of standardized effect sizes in this context doesn’t address the question of interest. A standardized effect size reflects the consistency with which an effect is shown, and thus how noisily it is measured (Baguley, 2009). One could make the measurement of performance noisy simply by using few trials. (Theoretically irrelevant differences in noise between measures occur for reasons other than just number of trials. Some measures are simply noisier than others: for example, confidence has a restricted range, d’ an unrestricted range.) By using a sufficiently small number of trials for performance, the population r for the difference between conditions in single task d’ could be made very small. If confidence was measured on a different, large number of trials, its population r could be made large. That is, a researcher making the measure of single task d’ as insensitive as possible would help establish equivalence in single task d’ between conditions. This is also true for testing equivalence merely by showing a non-significant result. That is, both significance testing and Bayes using standardized effect sizes fail to address the real issue in this case, because they are sensitive to factors that are theoretically irrelevant (and could therefore be cynically manipulated by the experimenter). To address the actual theoretical concerns of the researchers, raw effect sizes should be used.

But how can one use a difference in confidence units to set a difference in d’ units? The researchers were interested in the extent to which confidence predicted a third outcome variable, specifically, discrimination on a task that involved integrating over many stimuli. Call this multi-task d’. That is, the most important issue was: given that the difference in confidence was associated with a change in multi-task d’, could the difference in single task d’ be large enough to explain the change in multi-task d’? Given this theoretical agenda, we can calibrate confidence and single task d’ in the following way. The difference in confidence was associated with a significant difference in multi-task d’ in their experiment (in fact, this was the main result of the study). Let us say multi-task d’ differed by 0.6 units. So, let us say the researchers run another study in which single task d’ was allowed to vary over a range from, say, 1.5 to 2.5, and multi-task d’ was measured. Regress single task d’ on multi-task d’. Let us say the raw regression slope is 1.5 single d’ units/multi d’ unit. Thus, the change in single task d’ that corresponds to a 0.6 unit change in multi-task d’ is 1.5*0.6 = 0.9. Now, we can use that estimated relevant change in single task d’ as the SD of a half-normal to represent a meaningful alternative to contrast with the null hypothesis of no difference in single task d’. For the Bayes factor calculator, the “mean” is 1.93 - 1.83 = 0.10 single task d’ units, the standard error is 0.10/0.96 = 0.10 single task d’ units, and we can use a half-normal with an SD of 0.90 single task d’ units. In this case, B = 0.31, indicating we can sensitively accept the claim of equivalence.

This method finds the raw difference in d’ in a non-arbitrary way relevant to the scientific context. Further, there is no clear incentive to measure single task d’ insensitively. Making the measurement of single task d’ noisier (in an unbiased way) does not affect the expected raw regression slope of single task d’ against multi-task d’. It just makes it harder to make sensitive claims when the Bayes factor on single task d’ is performed. (And making the measurement of multi-task d’ insensitive will reduce the slope, making equivalence claims harder.) A correct analysis and inference procedure will never be harmed by making data more sensitive.

Now let’s consider another possible method. Without knowledge of the slope of single task d’ against multi-task d’, one could calibrate raw effect sizes by regressing single task d’ onto confidence and finding the raw slope. The difference in confidence was 2.41 - 2.21 = 0.2 confidence units. The raw slope was 0.45 d’ units/confidence unit, so the equivalent of 0.2 confidence units is 0.2*0.45 = 0.09 single task d’ units. Now, unlike with standardised effect sizes, making the measurement of d’ insensitive does not help researchers trying to establish equivalence. As for the second method above, making the measurement of single task d’ more noisy (in an unbiased way) does not affect the raw regression slope. It just makes it harder to make sensitive claims. For the Bayes factor calculator, the “mean” is 1.93 - 1.83 = 0.10 single task d’ units, the standard error is 0.10/0.96 = 0.10, and we can use a half-normal with an SD of 0.09. In this case, B = 1.39, indicating insensitivity. We come to a different conclusion. So which method should be preferred, the second or the third? In fact, the second is most relevant in this scientific context. The third method assumes that the researchers were interested in differences in d’ only to the extent they were predicted by confidence, but the real point is the function of confidence itself in predicting multi-task d’. Note something peculiar about the third method: the smaller the relation between single task d’ and confidence, the smaller the difference in single task d’ we are trying to pick up, and the harder it is to make claims of equivalence. But the extent of equivalence in single task d’ should be the same however confidence and single task d’ were correlated. Thus, the second method gets to the heart of the matter, not the third. The right statistical analysis depends on understanding the scientific problem. Canned statistical solutions are not the solution for statistics.
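Using the same half-normal Bayes factor sketch as above (a normal-likelihood approximation, not the calculator's own code), the two calibrations give:

```python
from math import sqrt
from scipy.stats import norm

def bf_half_normal(obs_mean, se, sd_h1):
    """Half-normal alternative (SD = sd_h1) versus a point null of zero."""
    like_h0 = norm.pdf(obs_mean, 0, se)
    pred_sd = sqrt(sd_h1**2 + se**2)
    post_mean = obs_mean * sd_h1**2 / pred_sd**2
    post_sd = se * sd_h1 / pred_sd
    like_h1 = 2 * norm.pdf(obs_mean, 0, pred_sd) * norm.cdf(post_mean / post_sd)
    return like_h1 / like_h0

# Second method: alternative scaled by the slope of single task d' on multi-task d'
print(bf_half_normal(0.10, 0.10, 0.90))  # about 0.30 (0.31 in the text, to rounding)
# Third method: alternative scaled by the slope of single task d' on confidence
print(bf_half_normal(0.10, 0.10, 0.09))  # about 1.39
```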
Up to now, we have talked about whether the claim of equivalence in single task d’ can be asserted. Whether or not the data are sensitive enough to support that claim, the claim that confidence predicts multi-task d’ without being fully mediated by single task d’ can also be independently tested by mediation analyses. And in general, whether mediation is complete and whether there is no mediation at all are both claims that involve potentially accepting null hypotheses and thus (without specifying minimally interesting values) can only be answered by use of Bayes factors (see Semmens-Wheeler et al., 2013, for how to use the Dienes, 2008, Bayes calculator to give a quick and dirty answer, i.e. one that tests the product of regression slopes as if it satisfied the assumption of normality; Dienes, in preparation, for code that tests the null hypotheses in mediation analyses assuming only that estimates of individual raw regression slopes are normally distributed; and Nuijten, Wetzels, Matzke, Dolan, & Wagenmakers, submitted, for a different solution for obtaining Bayes factors for mediation analyses using standardized slopes). By contrast, significance testing, including significance testing using Bayesian machinery to get the p-values (e.g. Yuan & MacKinnon, 2009), can obtain evidence for partial mediation but cannot in itself (without specification of minimally interesting values) allow assertion of either complete or no mediation. Only a truly Bayesian mediation analysis can achieve the latter (presented in Dienes, in preparation, and Nuijten et al., submitted).

3. Meta-analysis. Bayes factors can be used in meta-analysis to gain all the advantages of their use in single studies. Their use is most straightforward when the same paradigm has been repeatedly used. In this case, find the overall meta-analytic raw mean and overall meta-analytic standard error, and use these in the Bayes factor calculator. (The overall mean and standard error can be obtained using the software on the website for Dienes, 2008, for finding a posterior from a prior and a likelihood: enter the first study summary as the prior and the second study summary as the likelihood, and obtain the posterior mean and standard error; then use that posterior as the prior for the next study, which is entered as the likelihood. Keep feeding the posterior in as the prior until all studies have been entered.) If standardized effect sizes are used to compare across studies with different paradigms and dependent variables, Pearson’s r can be used as the standardized effect size, as r can be normalized with Fisher’s z, with known standard error, and thus the Dienes (2008) calculator used. Bayes factors can in principle be multiplied together to accumulate evidence. However, in practice Bayes factors from single studies should not in general be multiplied together to get an overall Bayes factor, because this procedure does not take into account that initial studies should be used to update the alternative hypotheses of subsequent studies as they are added to the calculation, in order to respect the axioms of probability (Jaynes, 2003). Thus, it is easier to first summarize the data and then determine a single Bayes factor on the summary, unless special precautions are taken (cf. Rouder & Morey, 2011; Storm, Tressoldi, & Utts, 2013). (For point alternative hypotheses as used in the likelihood school of inference, e.g. Royall, 1997, the problem does not arise, because if the alternative says only one point value is possible (e.g. the hypothesis that the population difference in means is exactly 10 seconds), the distribution representing the alternative – which is just a spike at the point value postulated – remains the same whatever the previous studies. Thus, in that case, Bayes factors can be unproblematically multiplied across single studies.)
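The summarizing step is ordinary normal-normal updating: each study's estimate is weighted by its precision. The following sketch combines study summaries into an overall mean and standard error; the study numbers are invented for illustration, and this is not the Dienes (2008) software itself.

```python
from math import sqrt

def combine(prior_mean, prior_se, like_mean, like_se):
    """Combine a normal prior with a normal likelihood by precision weighting,
    returning the posterior mean and standard error."""
    w_prior, w_like = 1 / prior_se**2, 1 / like_se**2
    post_mean = (w_prior * prior_mean + w_like * like_mean) / (w_prior + w_like)
    post_se = sqrt(1 / (w_prior + w_like))
    return post_mean, post_se

# Hypothetical raw means and standard errors from three studies of the same paradigm
studies = [(4.0, 2.0), (6.0, 2.5), (2.0, 3.0)]
mean, se = studies[0]
for m, s in studies[1:]:
    mean, se = combine(mean, se, m, s)
print(mean, se)  # overall meta-analytic mean and SE, to enter into the Bayes factor calculator
```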
4. Correlations and regression slopes. While correlations themselves are not normal, they can be made normal by a transformation, Fisher’s z = 0.5*loge((1 + r)/(1 - r)), which has standard error = 1/√(df - 1). Thus, designs using correlations can make use of the Dienes (2008) Bayes calculator (see also Wetzels & Wagenmakers, 2012, for a default Bayesian analysis of correlations). However, in many cases it is the raw regression slopes that really address theoretical concerns. Correlations may be reduced by a restriction of range or by increased error in measurement of the dependent variable, but these factors do not affect the raw regression slope (which is nonetheless affected by error in measurement of the independent variable, as is a correlation). Raw linear regression slopes can be used with the Dienes (2008) calculator, just bearing in mind they have a standard error given by: raw slope/t.

5. Contingency tables. The following example illustrates these points:

1. One can interpret complex contingency tables by reducing them to 1-df contrasts;

2. Researchers should be weaned off chi-square tests of independence, which serve a purpose only for generating p-values and hence are essentially uninformative for non-significant results. Instead, one wants a measure of strength of association with known distribution and standard error;

3. Ln OR (the natural log of the odds ratio) expresses strength of association and is normally distributed with known standard error;

4. Thus, having obtained Ln OR, one can play. One can compare strengths of association in different studies or conditions, put confidence intervals on the strength of association, and calculate Bayes factors to illuminate non-significant results.

McLatchie (2013) primed people in ways meant to cultivate guilty thoughts or guilty feelings. People subsequently given a sum of money then chose to either use it immediately as is, or to use it after a delay, when it would be increased; further, the use could be either to keep the money oneself, or else donate it to a good cause. One hypothesis was that guilty thoughts should increase pro-social behaviour (donating rather than keeping); another was that guilty feelings would increase impulsivity (using now rather than later). Study 1 yielded the data shown in Table 4(a).

Table 4
Data from Study 1 on Guilt Priming from McLatchie (2013)

(a) Full Data Set
                Guilty Feelings   Guilty Thoughts   Control
Keep Now               6                 3              3
Keep Delay             2                 2             12
Donate Now             7                 1              1
Donate Delay           5                14              4

(b) A One-Degree-of-Freedom Contrast
                Guilty Thoughts   Control
Keep                   5               15
Donate                15                5

The data can be collapsed in a number of ways to answer specific questions.
For example, to test if guilty thoughts increase pro-social action, a collapsed version of the data is shown in Table 4(b). This yields a 2×2 chi-square test of independence of χ²(1) = 7.69, p = .0056. χ² does not directly express a magnitude of effect. One way of expressing the effect size is as an odds ratio: OR = (15*15)/(5*5) = 9.0. The OR is the product of cell counts along one diagonal, divided by the product along the other. If there is no association, the population OR = 1. We will put on top the diagonal which should be larger according to the theory, if the theory makes a directional prediction (as it does here). The natural log of the odds ratio, Ln(OR), is normally distributed with a squared standard error given by 1/A + 1/B + 1/C + 1/D, where A, B, C, D are the cell entries. So in this case, standard error = √(1/5 + 1/15 + 1/15 + 1/5) = 0.73, and Ln OR = Ln 9.0 = 2.20. We can test the association with a z-test: z = 2.20/0.73 = 3.01, p = .0026. (To obtain two-tailed p-values for z, apart from using R, one can use online calculators, such as http://davidmlane.com/hyperstat/z_table.html: click on “outside” and enter -z and +z in the boxes.)

In Study 2, using the same design but different methods of priming, McLatchie obtained the data in Table 5(a), with the question of current interest shown in Table 5(b).

Table 5
Data from Study 2 on Guilt Priming from McLatchie (2013)

(a) Full Data Set
                Guilty Feelings   Guilty Thoughts   Control
Keep Now               6                 1              2
Keep Delay             4                 7             14
Donate Now             6                 3              7
Donate Delay           4                 9             17

(b) A One-Degree-of-Freedom Contrast
                Guilty Thoughts   Control
Keep                   8               16
Donate                12               24

Now the data look rather different: χ² = 0, p = 1.00. More importantly, Ln OR = 0 with a standard error of 0.56 (so z = 0, p = 1.00). Is the pattern significantly different in the second study? The Ln ORs differ by 2.20 - 0 = 2.20. The difference has a standard error given by the variance sum rule, namely the squared standard error of the difference is the sum of the squared standard errors. Thus the standard error of the difference = √(0.73² + 0.56²) = 0.92. Hence, to test the difference, z = 2.20/0.92 = 2.39, p = .017. Thus, the second study is a failure to replicate: the association is significantly smaller in the second study than in the first. Has the association been reduced to zero in the second study? The sort of effect size one might expect in the second study is provided by that in the first. Thus we can represent the alternative as a half-normal with SD = 2.20. The “mean” is 0, and the standard error is 0.56. This yields a B of 0.24, indicating substantial evidence for the null hypothesis of no association in the second study.
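A minimal sketch of these odds-ratio calculations, with the half-normal Bayes factor computed under a normal-likelihood approximation (small numerical differences from the calculator are expected: about 0.25 here versus 0.24 in the text):

```python
from math import log, sqrt
from scipy.stats import norm

def ln_odds_ratio(a, b, c, d):
    """Ln OR and its standard error for a 2x2 table [[a, b], [c, d]],
    with OR = (a*d)/(b*c); squared SE = 1/a + 1/b + 1/c + 1/d."""
    return log(a * d / (b * c)), sqrt(1/a + 1/b + 1/c + 1/d)

def bf_half_normal(obs_mean, se, sd_h1):
    """Half-normal alternative (SD = sd_h1) versus a point null of zero."""
    like_h0 = norm.pdf(obs_mean, 0, se)
    pred_sd = sqrt(sd_h1**2 + se**2)
    post_mean = obs_mean * sd_h1**2 / pred_sd**2
    post_sd = se * sd_h1 / pred_sd
    like_h1 = 2 * norm.pdf(obs_mean, 0, pred_sd) * norm.cdf(post_mean / post_sd)
    return like_h1 / like_h0

# Study 1 contrast (Table 4b), cells ordered so the theory-favoured diagonal is on top
lor1, se1 = ln_odds_ratio(15, 5, 5, 15)      # about 2.20 and 0.73
z1 = lor1 / se1                              # about 3.01
# Study 2 contrast (Table 5b)
lor2, se2 = ln_odds_ratio(12, 24, 8, 16)     # 0.0 and about 0.56
# Difference between studies, by the variance sum rule
se_diff = sqrt(se1**2 + se2**2)              # about 0.92
z_diff = (lor1 - lor2) / se_diff             # about 2.39
# Bayes factor for Study 2, with H1 a half-normal whose SD is Study 1's Ln OR
print(round(bf_half_normal(lor2, se2, lor1), 2))
```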
6. Comparing different theories. A Bayes factor compares theory 1 with theory 2. Call a Bayes factor comparing theory 1 with the null hypothesis B1/0, and a Bayes factor comparing theory 2 with the null hypothesis B2/0. Then the Bayes factor comparing theory 1 with theory 2 is B1/2 = B1/0 / B2/0 (Dienes, 2008). Thus, the Dienes (2008) Bayes factor calculator can be used to compare two substantial theories, or to compare a substantial theory to a null region hypothesis rather than a point null. For example, a uniform or a normal defining a null region could be used (cf. Morey & Rouder, 2011, who provide an interesting discussion of the properties of such Bayes factors).

It is sometimes argued that a point null could never really be true (Cohen, 1994). But it can be true to an astonishing degree of accuracy, given the scale of resolution of the data (Rouder et al., 2009). In this context, Baguley (2012, p. 369) cites a study on ESP with over 27,000 participants; the confidence interval for the proportion correct was [.496, .502], where 0.5 is the chance baseline. Thus, with well controlled and counter-balanced experiments, using a point null hypothesis is not absurd, especially if a null region hypothesis is hard to specify. Nonetheless, null region hypotheses can also be entirely valid, for example where null regions are already decided in a literature (e.g. regarding changes of less than 3 units on the Hamilton depression scale as clinically uninteresting).

A Bayes factor could also be used to compare the hypothesis that a change increased against the hypothesis that it decreased, e.g. by using a half-normal with SD equal to the expected effect size, pointing in each direction in turn. For example, for a mean difference of 5 seconds and an expected effect size, should it exist, of 10 seconds in either direction, first run the calculator with a half-normal with SD = 10 and data mean = 5 (and obtain B1/0); then run it again setting the data mean to -5 (and obtain B2/0). Dividing the two Bayes factors gives a Bayes factor comparing the two theories of change in each direction (B1/2). Bear in mind that if the null hypothesis of zero change is true, B1/2 is not driven in any particular direction as data are collected but performs a random walk. Thus sooner or later it will provide strong evidence for one of the two directional theories.
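As a sketch of that recipe: the standard error of 4 seconds below is an invented value, since the text does not give one, and the function is the same normal-likelihood approximation used in the earlier sketches, not the calculator itself.

```python
from math import sqrt
from scipy.stats import norm

def bf_half_normal(obs_mean, se, sd_h1):
    """Half-normal alternative (SD = sd_h1) versus a point null of zero."""
    like_h0 = norm.pdf(obs_mean, 0, se)
    pred_sd = sqrt(sd_h1**2 + se**2)
    post_mean = obs_mean * sd_h1**2 / pred_sd**2
    post_sd = se * sd_h1 / pred_sd
    like_h1 = 2 * norm.pdf(obs_mean, 0, pred_sd) * norm.cdf(post_mean / post_sd)
    return like_h1 / like_h0

mean_diff, se = 5, 4          # observed difference of 5 s; the SE of 4 s is hypothetical
b_increase = bf_half_normal(mean_diff, se, 10)    # B1/0: theory of an increase vs the null
b_decrease = bf_half_normal(-mean_diff, se, 10)   # B2/0: theory of a decrease vs the null
print(b_increase / b_decrease)                    # B1/2: increase theory vs decrease theory
```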
Appendix 2 Alternative approaches for indicating relative evidence

Likelihood inference: An alternative method to specifying a range of values predicted by the theory is to specify a single value. The likelihood school of inference recommends Bayes factors contrasting hypotheses of point values (Dienes, 2008; Royall, 1997). For example, one could compare the null hypothesis that the difference is 0 to an alternative that the difference is 10. In this way, the need to specify a distribution over population values is obviated. The Bayes factor will, in the limit as increasing observations are collected, come to favour the hypothesis that is closer to the truth. Thus, if the true population value is less than 5, eventually the null hypothesis will be supported over the alternative. If the population value is more than 5, the alternative will come to be supported. But this means that 5 comes to function as a minimally interesting value. Thus, if one were to use point hypotheses in one’s Bayes factor, one should decide on the minimally interesting value, m, and use an alternative point value of 2m.

In likelihood inference, there is no need to restrict oneself to a single Bayes factor. One could consider the relative strength of evidence for the full range of possible population values and thus construct a likelihood interval (Royall, 1997; see the Dienes, 2008, associated website for likelihood interval calculators). In this case, the rules of inference by interval apply, as shown in Figure 2. Thus, in general, likelihood inference contrasts with full Bayesian inference in requiring a specification of the minimally interesting effect size in order to connect data to theory.

BIC, AIC, DIC, etc.: A range of model comparison techniques have been developed, often based on rather different principles, but all comparing the evidence for two models in a way that can be decomposed in terms of each model’s degree of fit to the data, corrected by the number of free parameters in the model if this differs between models (see Baguley, 2012, chapter 11; Burnham & Anderson, 2002; Glover & Dixon, 2004; Wagenmakers, 2007). The main contrast with the approach illustrated in this paper is that such methods, when used for null hypothesis testing (which is not their only use), amount to assuming a vague default alternative hypothesis to compare against the null. This interpretation is clearest for BIC. (AIC is a measure of relative strength of evidence with, in effect, a default alternative for null hypothesis testing, one somewhat less vague than for BIC; Smith & Spiegelhalter, 1980.)

Wagenmakers (2007) showed that BIC approximates a Bayes factor that uses a “unit information prior” for the alternative (see Rouder et al., 2009, for further development of this specification to define vague defaults for Bayes factors). That is, the alternative is specified as a normal with the same mean as the mean of the data and a standard deviation equal to the standard deviation of the data. This can be justified on the grounds that if one had any accurate prior information at all, it would tend to specify the same mean as the data do, even if vaguely. Further, assuming the prior information was worth only one observation of data (hence “unit information”), the standard error of that information would be the standard deviation of the population data divided by the square root of N, the number of observations – and N is just one. (Technically, one cannot use the very aspect of the data one is trying to predict – say its mean – to specify the alternative predicting the same data, as this involves “double counting”, Jaynes, 2003; however, the vagueness of the unit information prior renders the technical point harmless. One can also just set the mean of the alternative to zero to deal strictly with this point.) The BIC – or, correspondingly, the unit information prior used to specify the alternative with the Dienes (2008) Bayes calculator – may be worth considering if one finds it difficult to specify the alternative, or wishes to test a vast number of alternatives in a data mining operation. But the responsibility still rests with the researcher to consider seriously whether the unit information prior matches theoretical expectations. A vague prior makes finding evidence for the null easy (Kruschke, 2013b).

Consider trying to find evidence for unconscious processing by showing null performance on a test of awareness. It would seem prudent not to use a vague default, if only on the grounds that one wants to make the demonstration of unconscious knowledge convincing, and hence not especially easy to obtain. In fact, more importantly, Dienes (in press) shows in this situation that there are informative specifications defining the extent to which one expects conscious knowledge, in raw units, rendering defaults irrelevant in that scientific context.
Every case has to be considered on its own merits as to whether the specification of the alternative is relevant (Kruschke, 2013b; Vanpaemel, 2010). In that sense, true defaults (i.e. those that could be used at any time to avoid consideration of the scientific context) do not actually exist in real scientific contexts. A Bayes factor is a means for comparing two theories. In the absence of theory, placing credibility or likelihood intervals around parameter estimates (without accepting or rejecting any null hypothesis) lets the data speak for themselves (Kruschke, 2013b).

Appendix 3 Robustness checking

The Bayes factor is sensitive to how vague H1 is specified as being; the vaguer H1, the more the data will support H0. The vagueness of H1 depends mainly on the rough maximum value allowed; however, in detail B can depend on the exact shape of the specified distribution as well. This raises a question: there may be a number of ways of specifying essentially the same scientific judgments. For example, the scientific judgment for H1 might simply be that the maximum population difference cannot be more than about 8. The exact shape of the distribution is not further specified; assigning a specific shape then adds more precision than the science in the situation actually provides. If B were sensitive to the shape beyond the specification of the maximum, the extent to which that B was relevant to the scientific theory could be questioned. On the other hand, if the decision based on B were robust to major shape changes in simple distributions given the same maximum, then B shows its relevance to evaluating the theory.

Here we consider several Bayes factors, subscripted to indicate the distributions used to represent H1 (thanks to Wolf Vanpaemel for this notational suggestion). BN(m, sd) means a normal was used with mean m and standard deviation sd; BH(0, sd) indicates a half-normal was used with standard deviation sd; BU[min, max] indicates a uniform was used between a minimum of min and a maximum of max. Letting a rough maximum for a normal correspond to two standard deviations out, BN, BH and BU can then be set to respect the same scientific intuition concerning the maximum population difference.

The Rouder Bayes factors are also shown below. BJZS(r) is the JZS B with scaling factor r, and BUI(r) uses the unit information prior with scaling factor r. BJZS and BUI can be taken to represent theories about Cohen’s d rather than about raw mean differences, so they may produce somewhat different answers than BN, BH or BU.

Consider a study where N = 30, the mean is 5.5 units above a null value of 0, and SD = 13.7, so SE = 2.5. A predicted effect of about 5 units can be argued for, or a maximum of 10 units. The Rouder B’s were scaled with a predicted Cohen’s d of 5/13.7 = 0.36. The first two rows of Table 6 illustrate a significant effect, and the BN and BU just tip over 3, as expected. Notice that the distributions for the predictions of H1 are centred on 0, so the scientific theory in question predicts effects in either direction. In these cases it doesn’t matter if the distribution is peaked in the middle (a normal) or flat (a uniform); the same conclusions follow.
The Rouder B’s, scaled according to the Cohen’s d of the expected effect size, produce somewhat less evidence for H1 than the corresponding BN and BU: BJZS and BUI represent theories about Cohen’s d rather than about raw mean differences. BH is rather higher than the others in the first row; this is because it represents a different scientific theory, namely a theory that predicts effects in only one direction. It is desirable that B may differ when different scientific theories are represented. The second row illustrates how a directional theory is penalized when the mean goes in the wrong direction.

Table 6
Different Specifications of H1

Mean   SE    BH(0, 5)   BN(0, 5)   BU[-10,10]   BJZS(0.36)   BUI(0.36)
 5.5   2.5     6.05       3.10        3.40         1.99         2.77
-5.5   2.5     0.15       3.10        3.40         1.99         2.77
 0     2.5     0.45       0.45        0.31         0.34         0.45

Table 7
B’s for a Directional Theory

Mean   SE    BH(0, 5)   BN(5, 2.5)   BU[0,10]
 5     2.5     4.27        5.22        4.42
-5     2.5     0.16        0.10        0.11
 6     2.5     8.81       12.10       10.46
-6     2.5     0.14        0.10        0.09
 0     2.5     0.45        0.26        0.31

Table 7 presents results for three different ways of representing a directional theory. As can be seen, the different distributions produce similar B’s, whether the distribution is flat, peaked in the middle, or pushed to one end. Thus, conclusions based on these Bs will often be robust. In any given case, one needs to check the robustness of the conclusions. When evidence is near a threshold, different representations of H1 might tip either side of it (though bear in mind that the threshold is itself not meaningful for Bayes; it is just a convention).
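The BH, BN and BU columns of Tables 6 and 7 can be reproduced to rounding with simple normal-likelihood versions of the three alternatives; the Rouder BJZS and BUI require their own software and are not sketched here.

```python
from math import sqrt
from scipy.stats import norm

def bf_half_normal(obs_mean, se, sd_h1):
    """Half-normal alternative (SD = sd_h1) versus a point null of zero."""
    like_h0 = norm.pdf(obs_mean, 0, se)
    pred_sd = sqrt(sd_h1**2 + se**2)
    post_mean = obs_mean * sd_h1**2 / pred_sd**2
    post_sd = se * sd_h1 / pred_sd
    like_h1 = 2 * norm.pdf(obs_mean, 0, pred_sd) * norm.cdf(post_mean / post_sd)
    return like_h1 / like_h0

def bf_normal(obs_mean, se, h1_mean, h1_sd):
    """Normal alternative with mean h1_mean and SD h1_sd versus a point null of zero."""
    like_h0 = norm.pdf(obs_mean, 0, se)
    like_h1 = norm.pdf(obs_mean, h1_mean, sqrt(h1_sd**2 + se**2))
    return like_h1 / like_h0

def bf_uniform(obs_mean, se, lower, upper):
    """Uniform alternative on [lower, upper] versus a point null of zero."""
    like_h0 = norm.pdf(obs_mean, 0, se)
    area = norm.cdf((upper - obs_mean) / se) - norm.cdf((lower - obs_mean) / se)
    return (area / (upper - lower)) / like_h0

# First row of Table 6: mean 5.5, SE 2.5
print(bf_half_normal(5.5, 2.5, 5), bf_normal(5.5, 2.5, 0, 5), bf_uniform(5.5, 2.5, -10, 10))
# about 6.05, 3.10, 3.40
# First row of Table 7: mean 5, SE 2.5
print(bf_half_normal(5, 2.5, 5), bf_normal(5, 2.5, 5, 2.5), bf_uniform(5, 2.5, 0, 10))
# about 4.27, 5.22, 4.42
```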
Verhagen and Wagenmakers (in press) provide a different Bayes factor calculator that takes as its theory that the current study is an exact replication of a previous one, so H1 can be set as predicting the standardized effect size previously obtained, with an uncertainty defined by the previous standard error of the estimate. Their B closely matches the results obtained with the defaults recommended in this paper. For example, Verhagen and Wagenmakers’ example 1 is based on Eliot et al. (2010): the latter’s experiment 2 was designed as a replication of their experiment 1. In experiment 1 the raw effect size was 6.79 - 5.67 = 1.12 units for females. To interpret experiment 2, the 1.12 can be used as the SD of a half-normal. In experiment 2 of Eliot et al., a difference of about 1 raw unit (from the graph in the original paper) can be estimated for females; with a t of 3, this gives a standard error of 0.33. BH(0, 1.12) = 38.57, very close to the 39.73 obtained by Verhagen and Wagenmakers. For males an SE of 0.33 can be estimated from the graph; thus, mean difference = SE*t = -.02. BH(0, 1.12) = 0.27, reasonably close to the 0.13 obtained by Verhagen and Wagenmakers. (Similar Bs for the Dienes calculator and the Verhagen and Wagenmakers calculator hold for their other examples.) In sum, the different Bs, based on somewhat different distributions but the same scientific intuitions, produced reassuringly close answers, indicating the robustness of the conclusions.

In sum, the often close results obtained by different Bs when the same scientific theories are represented validate Bs as a method of evaluating such theories. The arbitrariness of the distributions chosen to represent H1 can be shown to be inconsequential when different simple distributions representing the same theories produce qualitatively the same answers. When different answers are obtained for representations of H1 that are equally plausible as representations of the same scientific judgments, more data need to be collected until the conclusion is robust.