
Running title: Using Bayes to get the most out of non-significant results
Using Bayes to get the most out of non-significant results
Zoltan Dienes
School of Psychology and Sackler Centre for Consciousness Science, University of Sussex
Keywords: Bayes factor, confidence interval, highest density region, null hypothesis, power,
statistical inference, significance testing
Correspondence:
Zoltan Dienes
School of Psychology, University of Sussex, Brighton, BN1 9QH UK
Phone: +44 1273 877335, Fax: 01273 678058, Email: [email protected]
Abstract
No scientific conclusion follows automatically from a statistically non-significant result, yet
people routinely use non-significant results to guide conclusions about the status of theories
(or the effectiveness of practices). To know whether a non-significant result counts against a
theory, or if it just indicates data insensitivity, researchers must use one of: power, intervals
(such as confidence or credibility intervals), or else an indicator of the relative evidence for
one theory over another, such as a Bayes factor. I argue Bayes factors allow theory to be
linked to data in a way that overcomes the weaknesses of the other approaches. Specifically,
Bayes factors use the data themselves to determine their sensitivity in distinguishing theories
(unlike power), and they make use of those aspects of a theory’s predictions that are often
easiest to specify (unlike power and intervals, which require specifying the minimal
interesting value in order to address theory). Bayes factors provide a coherent approach to
determining whether non-significant results support a null hypothesis over a theory, or
whether the data are just insensitive. They allow accepting and rejecting the null hypothesis
to be put on an equal footing. Concrete examples are provided to indicate the range of
application of a simple online Bayes calculator; these examples reveal both the strengths and weaknesses of Bayes factors.
Using Bayes to get the most out of non-significant results
Users of statistics, in disciplines from economics to sociology to biology to psychology, have
had a problem. The problem is how to interpret a non-significant result. A non-significant
result can mean one of two things: either that there is evidence for the null hypothesis and
against a theory that predicted a difference (or relationship); or else that the data are
insensitive in distinguishing the theories and nothing follows from the data at all. (Indeed, in
the latter case, a non-significant result might even somewhat favour the theory, e.g. Dienes,
2008, p 128, as will be seen in some of the examples that follow.) That is, the data might
count in favour of the null and against a theory; or they might count for nothing much. The
problem is that people have been choosing one of those two interpretations without a
coherent reason for that choice. Thus, non-significant results have been used to count against
theories when in fact they did not count against them (e.g. Cohen, 1990; Rosenthal, 1993; Rouder, Morey, Speckman, &
Pratte, 2007); or else have been ignored when they were in fact informative (e.g. believing
that an apparent failure to replicate with a non-significant result is more likely to indicate
noise produced by sloppy experimenters than a true null hypothesis; cf. Greenwald, 1993;
Kruschke, 2013a; Pashler & Harris, 2012). One can only wonder what harm has been done
to fields by not systematically determining which interpretation of a non-significant result
actually holds. There are three solutions on the table for evaluating a non-significant result
for a single study: 1) power; 2) interval estimates; and 3) Bayes factors (and related
approaches). In this article, I will discuss the first two briefly (because readers are likely to
be most familiar with them), indicating their uses and limitations; I will then describe how Bayes
factors overcome those limitations (and what weaknesses they in turn have). The bulk of the
paper will then provide detailed examples of how to interpret non-significant results using
Bayes factors, while otherwise making minimal changes to current statistical practice. My
aim is to clarify how to interpret non-significant results coherently, using whichever method
is most suitable for the situation, in order to effectively link data to theory. I will be
concentrating on a method of using Bayes that involves only minor changes to current practice. The changes can therefore be understood by reviewers and editors even as they
operate under orthodox norms (see, e.g., Allen, Sumner, & Chambers, 2014) while still
solving the fundamental problem of distinguishing insensitive data from evidence for a null
hypothesis.
1. Illustration of the problem.
Imagine there really is an effect in the population, and the power of an experimental
procedure is 0.5 (i.e. only 50 out of 100 tests would be significant when there is an actual,
real effect; not that in reality we know what the real effect is, nor, therefore, what the power
is for the actual population effect). The experiment is then repeated, exactly as before, many times.
Cumming (2011; see associated website) provides software for simulating such a situation;
the use of simulation of course ensures that each simulation is identical to the last bar the
vagaries of random sampling. A single run (of Cumming’s “ESCI dance p”) generated the
sequence of p-values shown in Figure 1. Cumming (2011) calls such sequences the “dance of
the p-values”. Notice how p-values can be very high or very low. For example, one could
obtain a p-value of .028 (experiment 20) and in the very next attempted replication get a p of
0.817, when nothing had changed in terms of the population effect. There may be a
temptation to think the p of .817 represents very strong evidence for the null; it is “very non-significant”. In fact, a p-value per se does not provide evidence for the null, no matter “how
non-significant” it is (Fisher, 1935; Royall, 1997). A non-significant p-value does not
distinguish evidence for the null from no evidence at all (as we shall see). That is, one cannot
use a high p-value in itself to count against a theory that predicted a difference. A high p-value may simply reflect data insensitivity, a failure to distinguish the null hypothesis from
the alternative because, for example, the standard error is high. Given this, how can we tell
the meaning of a non-significant result?
Figure 1. Dance of the p-values
A sequence of p-values (and confidence intervals) for successive simulations of an
experiment where there is an actual effect (mean population difference = 10 units) and power
is 0.5. Created from software provided by Cumming (2011; associated website).
2. Solutions on the table.
2.1 Power. A typical orthodox solution to determining whether data are sensitive is power
(Cohen, 1988). Power is the property of a decision rule defined by the long run proportion of
times it leads to rejection of the null hypothesis given the null is actually false. That is, by
fixing power one controls long run Type II error rates (the rate at which one accepts the null
given it is false, i.e. the converse of power). Power can be easily calculated with the free software G*Power (Faul et al., 2009): To calculate a priori power (i.e. in advance of collecting data, which is when it should be calculated), one enters the effect size predicted in advance, a number of participants, and the significance level to be used, and a power is returned.
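The same a priori calculation can be scripted. A minimal sketch in Python, assuming the statsmodels package is installed (the effect size and sample size below are purely illustrative, not taken from any study discussed here):

```python
# A priori power for a two-sided independent-samples t-test, mirroring what G*Power returns.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power, given the minimal interesting effect size (here Cohen's d = 0.5, illustrative),
# the number of participants per group, and the significance level.
power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05, ratio=1.0)
print(round(power, 2))   # long-run probability of rejecting the null if d really is 0.5

# Conversely, the n per group needed to reach 80% power for the same effect size.
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05, ratio=1.0)
print(round(n_per_group))
```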
To calculate power, the effect size one enters needs to be the minimal
difference (relationship) below which the theory would be rendered false, irrelevant or
uninteresting. If one did not use the minimal interesting value to calculate power one would
not be controlling Type II error rates. If one used an arbitrary effect size (e.g. Cohen’s d of
0.5), one would be controlling error rates with respect to an arbitrary theory, and thus in no
principled way controlling error rates for the theories one was evaluating. If one used a
typical effect size, but not the minimal interesting effect size, Type II error rates would not be
controlled.
Power is an extremely useful concept. For example, imagine a review of 100 studies
that looked at whether meditation improved depression. Fifty studies found statistically
significant results in the right direction and 50 were non-significant. What can be concluded
about the effectiveness of meditation in treating depression? One intuition is that each
statistically significant result trades off against each statistically non-significant result, and
nothing much follows from these data: More research is needed. This intuition is wrong
because it ignores power. How many experiments out of 100 would be statistically
significant at the 5% level if the null were true? Only five - that’s the meaning of a 5%
significance level. (There may be a “file drawer problem” in that not all null results get
published; but if all significant results were reported, and the null were true, one would
expect half to be statistically significant in one direction and half in the other.) On the other
hand, if an effect really did exist, and the power was 50%, how many would be statistically
significant out of 100? Fifty - that’s the meaning of a power of 50%.[1] In fact, the obtained
pattern of results categorically licenses the assertion that meditation improves depression (for
this fictional data set).
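The arithmetic behind this reasoning can be checked directly. A short sketch, assuming scipy is available (the 100-study review is the fictional one described above):

```python
# How surprising would 50 significant results out of 100 studies be if the null were true?
# Under the null, each study has only a 5% chance of coming out significant (alpha = .05).
from scipy.stats import binom

n_studies, n_significant = 100, 50

p_if_null_true = binom.sf(n_significant - 1, n_studies, 0.05)   # P(at least 50 hits | p = .05)
p_if_power_50 = binom.sf(n_significant - 1, n_studies, 0.50)    # P(at least 50 hits | power = .50)

print(p_if_null_true)   # vanishingly small: the null cannot plausibly produce this pattern
print(p_if_power_50)    # around one half: just what a real effect with 50% power predicts
```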
Despite power being a very useful concept, there are two problems with power in
practice. First, calculating power requires specifying the minimal interesting value, or at least
the minimal interesting value that is plausible. This may be one of the hardest aspects of a
theory’s predictions to specify. Second, power cannot use the data themselves in order to
determine how sensitively those very data distinguish the null from the alternative hypothesis.
It might seem strange that properties of the data themselves cannot be used to indicate how
sensitive those data are. But post hoc or observed power, based on the effect size obtained in
the data, gives no information about Type II error rate (Baguley, 2004): Such observed power
is determined by the p-value and thus gives no information in addition to that provided by the
p-value, and thus gives no information about Type II error rate. (To see this, consider that a
sensitive non-significant result would have a mean well below the minimally interesting
mean; power calculated on the observed mean, i.e. observed power, would indicate the data
[1] Typical for psychology is no more than 50%: Cohen (1962); Sedlmeier and Gigerenzer (1989); see Button et al. (2013) for a considerably lower recent estimate in some disciplines.
insensitive, but it is the minimally interesting mean that should be used to determine power.)
Intervals, such as confidence intervals, solve the second problem; that is, they indicate how
sensitive the data are, based on the very data themselves. But intervals do not solve the first
problem, that of specifying minimally interesting values, as we shall see.
2.2 Interval estimates. Interval estimates include confidence intervals, and their likelihood
and Bayesian equivalents, namely, likelihood intervals and credibility intervals (also called
highest density regions) (see Dienes, 2008, for comparison). A confidence interval is the set
of possible population values consistent with the data; all other population values may be
rejected (see e.g. Cumming, 2011; Smithson, 2003). The APA recommends reporting
confidence intervals whenever possible (APA, 2010). However, why should one report a
confidence interval, unless it changes what conclusions might be drawn? Many statistics
textbooks teach how to calculate confidence intervals but few teach how to use them to draw
inferences about theory (for an exception see Lockhart, 1998). Figure 2 shows the four
principles of inference by intervals (adapted from Berry, Carlin, Lee, & Muller, 2011;
Freedman & Spiegelhalter, 1983; Kruschke, 2010a; Rogers, Howard, & Vessey, 1993; Serlin
& Lapsley, 1985). A non-significant result means that the confidence interval (of the
difference, correlation, etc) contains the null value (here assumed to be zero). But if the
confidence interval includes zero, it also includes some values either side of zero; so the data
are always consistent with some population effect. Thus, in order to accept a null hypothesis,
the null hypothesis must specify a null region, not a point value. Specifying the null region
means specifying a minimally interesting value (just as for power), which forms either limit
of the region. Then if the interval is contained within the null region, one can accept the null
hypothesis: The only allowable population values are null values (rule i) (cf. equivalence testing, Rogers et al., 1993). If the interval does not cover the null region at all, the null hypothesis can be rejected: The only allowable population values are non-null (rule ii). Conversely, if the interval covers both null and non-null regions, the data are insensitive: the
data allow both null and non-null values (rule iv; see e.g. Kiyokawa, Dienes, Tanaka,
Yamada, & Crowe, 2012 for an application of this rule).
Figure 2. The Four Principles of Inference by Intervals.
i) If the interval is completely contained in the null region, decide that the population
value lies in the null region (accept the null region hypothesis);
ii) If the interval is completely outside the null region, decide that the population
value lies outside the null region (reject the null region hypothesis);
iii) If the upper limit of the interval is below the minimal interesting value, decide
against a theory postulating a positive difference (reject a directional theory);
iv) If the interval includes both null region and theoretically interesting values, the
data are insensitive (suspend judgment).
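The four principles can be written down as a single decision function. A sketch (the interval and null-region numbers below are hypothetical; the rules are exactly those of Figure 2):

```python
def interval_inference(lower, upper, null_lo, null_hi):
    """Classify a confidence/credibility interval (lower, upper) against a null
    region (null_lo, null_hi), following the four principles in Figure 2."""
    if null_lo <= lower and upper <= null_hi:
        return "rule i: accept the null region hypothesis"
    if upper < null_lo or lower > null_hi:
        return "rule ii: reject the null region hypothesis"
    if upper < null_hi:   # interval falls entirely below the minimal interesting positive value
        return "rule iii: reject a theory postulating a positive difference"
    return "rule iv: data insensitive; suspend judgment"

# Example: a 95% CI of [-1, 8] Hamilton units, with a null region of +/- 3 units.
print(interval_inference(-1.0, 8.0, -3.0, 3.0))   # overlaps both regions -> rule iv
```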
In Figure 1, 95% confidence intervals are plotted. In all cases where the outcome was
non-significant, the interval included not just zero but also the true population effect (10).
Given that 10 is agreed to be an interesting effect size, the confidence intervals show all the
non-significant outcomes to be insensitive. In all those cases the data should not be used to
count against the theory of a difference (rule iv). The conclusion follows without
determining a scientifically-relevant minimally interesting value. But a confidence interval
can only be used to assert the null hypothesis when a null region can be specified (rule i) (cf
Cohen, 1988; Greenwald, 1975; Hodges & Lehmann, 1954; Serlin & Lapsley, 1985). For that
assertion to be relevant to a given scientific context, the minimal value must be relevant to
that scientific context (i.e. it cannot be determined by properties of the data alone nor can it
be a generic default). For example, in clinical studies of depression it has been conventionally
decided that a change of 3 units on the Hamilton scale is the minimal meaningful amount (e.g.
Kirsch, 2010).[2] Rule ii follows logically from rule i. Yet rule ii is stricter than current practice, because it requires that the whole null region lie outside the interval, not just zero.[3]
While intervals can provide a systematic basis of inference which would advance
current typical practice, they have a major problem in providing guidance in interpreting non-significant results. The problem is that asserting the null (and thus rejecting the null with a
procedure that could have asserted it) requires in general specifying the minimally interesting
value: What facts in the scientific domain indicate a certain value as a reasonable minimum?
That may be the hardest part of a theory’s predictions to specify.[4] If you find it hard to
specify the minimal value, you have just run out of options for interpreting a non-significant
result as far as orthodoxy is concerned.
2.3 Bayes factors. Bayes factors (B) indicate the relative strength of evidence for two
theories (e.g. Berger & Delampady, 1987; Dienes, 2011; Gallistel, 2009; Goodman, 1999;
[2] Rule (i) is especially relevant for testing the assumptions of a statistical model. Take the assumption of normality in statistical tests. The crucial question is not whether the departure from normality is non-significant, but rather whether whatever departure exists is within such bounds that it does not threaten the integrity of the inferential test used. For example, is the skewness of the data within such bounds that when the nominal alpha rate (significance level) is 5% the estimated actual rate is between 4.5% and 5.5%? Or is the skewness of the data within such bounds that a Bayes factor calculated assuming normality lies between 2.9 and 3.1 if the Bayes factor calculated on the true model is 3? See Simonsohn, Nelson, and Simmons (forthcoming) for an example of interval logic in model testing. Using intervals inferentially would be a simple yet considerable advance on current practice in checking assumptions and model fit.
[3] If inferences are drawn using Bayesian intervals (e.g. Kruschke, 2010a), then it is rules i and ii that must be followed. The rule of rejecting the null when a point null lies outside the credibility interval has no basis in Bayesian inference (a point value always has zero probability on a standard credibility interval); correspondingly, such a rule is sensitive to factors that are inferentially irrelevant, such as the stopping rule and intentions of the experimenter (Dienes, 2008, 2011; Mayo, 1996). Rules (i) and (ii) are not sensitive to stopping rule (given the interval width is not much more than that of the null region) (cf. Kruschke, 2013b). Indeed, they can be used by orthodox statisticians, who can thereby gain many of the benefits of Bayes (though see Lindley, 1972, and Kruschke, 2010b, for the advantages of going fully Bayesian when it comes to using intervals, and Dienes, 2008, for a discussion of the conceptual commitments made by using the different sorts of intervals).
[4] Figure 2 represents the decisions as black and white, in order to be consistent with orthodox statistics. For Bayesians, inferences can be graded. Kruschke (2013c) recommends specifying the degree to which the Bayesian credibility interval is contained within null regions of different widths, so people with different null regions can make their own decisions. Nonetheless, pragmatically a conclusion needs to be reached and so a good reason still needs to be specified for accepting any one of those regions.
Kass & Wasserman, 1996; Kruschke, 2011; Lee & Wagenmakers, 2005; Rouder et al., 2009).[5]
The Bayes factor B comparing an alternative hypothesis to the null hypothesis means that the
data are B times more likely under the alternative than under the null. B varies between 0
and ∞, where 1 indicates the data do not favour either theory more than the other; values
greater than 1 indicate increasing evidence for one theory over the other (e.g. the alternative
over a null hypothesis) and values less than 1 the converse (e.g. increasing evidence for the
null over the alternative hypothesis). Thus, Bayes factors allow three different types of
conclusions: There is strong evidence for the alternative (B much greater than 1); there is
strong evidence for the null (B close to 0); and the evidence is insensitive (B close to 1). This
is already much more than p-values could give us.
In order to draw these conclusions we need to know how far from 1 counts as strong
or substantial evidence and how close to 1 counts as the data being insensitive. Jeffreys
(1939/1961) suggested conventional cut-offs: A Bayes factor greater than 3 or else less than
1/3 represents substantial evidence; conversely, anything between 1/3 and 3 is only weak or
“anecdotal” evidence.[6] Are these just numbers pulled out of the air? In the examples
considered below, when the obtained effect size is roughly that expected, a p of .05 roughly
corresponds to a B of 3. This is not a necessary conclusion; nothing guarantees it. But it may
be no coincidence that Jeffreys developed his ideas in Cambridge just as Fisher (1935) was
developing his methods. That is, Jeffreys’ choice of 3 is fortunate in that it roughly
corresponds to the standards of evidence that scientists have been used to using in rejecting
the null. By symmetry, we automatically get a standard for assessing substantial evidence for
the null: If 3 is substantial evidence for the alternative, 1/3 is substantial evidence for the null.
(Bear in mind that there is no one-to-one correspondence of B with p-values; the correspondence depends on
both the obtained effect size and also how precise or vague the predictions of the alternative
are: With a sufficiently vague alternative a B of 3 may match a p-value of, for example, .01,
or lower; Wetzels et al., 2011).
The Bayes factor is based on the principle that evidence supports the theory that most
strongly predicted it. That is, in order to know how much evidence supports a theory, we
need to know what the theory predicts. For the null hypothesis this is easy. Traditionally, the
null hypothesis predicts that a single population value (typically 0) is plausible and all other
population values are ruled out. But what does our theory, providing the alternative
hypothesis, predict? Specifying the predictions of the theory is the difficult part of calculating
a Bayes factor. The examples below, drawn partly from published studies, will show some
straightforward ways theoretical predictions can be specified in a way suitable for Bayes
factors. The goal is to determine a plot of the plausibility of different population values given
the theory in its scientific context. Indeed, whatever one’s views on Bayes, thinking about
justifications for minimum, typical or maximum values of an effect size in a scientific context
is a useful exercise. We have to consider at least some aspect of such a plot to evaluate non-significant results; the minimum plausible value has to be specified for using power or
confidence (credibility or likelihood) intervals in order to evaluate theories. That is, we
should have been specifying theoretically-relevant effect sizes anyway. We could get away
without doing so, because specified effect sizes are not needed for calculating p-values; but
[5] Bayes factors were independently discovered at about the same time by Harold Jeffreys (1939/1961) and Alan Turing, the latter in order to help decode German messages during World War II (McGrayne, 2012).
[6] Jeffreys (1961) also recommended labelling Bayes factors greater than 10 or less than 1/10 as ‘strong’ evidence. However, the terms ‘substantial’ and ‘strong’ have little to choose between them. Thus Lee and Wagenmakers (2014) recommend calling Bayes factors greater than 3 or less than 1/3 ‘moderate’ and those greater than 10 or less than 1/10 ‘strong’.
the price has been incoherence in interpreting non-significant (as well as significant) results.
It might be objected that having to specify the minimum was one of the disadvantages of
inference by intervals; but as we will see in concrete examples, Bayes is more flexible about
what is sufficient to be specified (e.g. rough expected value, or maximum), and different
scientific problems end up naturally specifying predictions in different ways.
For the free online Bayes factor calculator associated with Dienes (2008;
http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_factor.swf), there are
three distributions that can be used to represent the predictions of the theory: Uniform,
normal or half-normal (see Figure 3). The properties of each distribution will be briefly
described in turn, before considering concrete examples of Bayes factors in practice in the
next section. Note these distributions plot the plausibility of different values of the population
parameter being tested, for example the plausibility of population means or mean differences.
The distributions in Figure 3 in no way reflect the distribution of individual values in the
population. The Dienes (2008) Bayes calculator assumes that the sampling distribution of the
parameter estimate is normally distributed (just as a t-test does), and this obtains when, for
example, the individual observations in the population are normally distributed. Thus, there
may be a substantial proportion of people with a score below 0 in the population, and we
could correctly believe this to be true while also assigning no plausibility to the population
mean being below 0 (e.g. using a uniform from 0 to 10 in Figure 3). That is, using a uniform
or half-normal distribution to represent the alternative hypothesis is entirely consistent with
the data themselves being normally distributed. The distributions in Figure 3 are about the
plausibility of different population parameter values (e.g. population means of a condition or
population mean differences or associations etc.).
Figure 3. Representing the Alternative Hypothesis
(a) A uniform distribution with all population parameter values from the lower to the upper
limit equally plausible. Here the lower limit is zero, a typical but not required value.
(b) A normal distribution, with population parameter values close to the mean being more
plausible than others. The SD also needs to be specified; a default of mean/2 is often useful.
(c) A half-normal distribution. Values close to 0 are most plausible; a useful default for the
SD is a typical estimated effect size. Population values less than 0 are ruled out.
First, we consider the uniform. A researcher tests whether imagining a sporting move (e.g. a
golf swing) for 15 minutes a day for a week improves measured performance on the move.
She performs a t-test and obtains a non-significant result. Does that indicate that imagination
in this context is not useful for improving the skill? That depends on what size effect is
predicted. Just knowing that the effect is non-significant is in itself meaningless. Presumably,
whatever size the population effect is, the imagination manipulation is unlikely to produce an
effect larger than that produced by actually practicing the move for 15 minutes a day for a
week. If the researcher ran a real-practice condition in order to estimate that effect, the estimate could be used as the upper limit of a uniform for predicting the effect of imagination. If a clear
minimum level of performance can be specified in a way simple to justify, that could be used
as the lower limit of the uniform; otherwise, the lower limit can be set to 0 (in practice, the
results are often indistinguishable whether 0 or a realistic minimum is used, though in every
particular case this needs to be checked; cf. Shang, Fu, Dienes, Shao, & Fu, 2013). In general,
when another condition or constraint within the measurement itself determines an upper limit,
the uniform is useful, with a default of 0 as the lower limit. An example of a constraint
produced by the measurement itself is a scale with an upper limit; for a 0-7 liking scale, the
difference between conditions cannot be more than 7. (The latter constraint is quite vague and
we can usually do better; for example, by constraints provided by other data. See Dienes,
forthcoming-a, for examples of useful constraints created by types of scales.)
Next, we consider the normal distribution. Sometimes what is easily available is not
another condition defining an upper limit, but data indicating a roughly likely value for an
effect, should it exist. For example, it may be known for an implicit learning paradigm that
asking people to search for rules rather than passively memorizing training material reduces
classification accuracy in a later test phase by about 5%. A researcher may wish to change the
manipulation by having people search for rules or not, not during the training phase as before,
but between training and testing (cf. Mealor & Dienes, 2012). In the new experiment we may
think 5% a reasonable estimate of the effect size that would be obtained (especially as we
think similar mechanisms would be involved in producing the two effects). Further, if there is
an effect in the post-training manipulation, we may have no strong reason to think it would be
more than or less than the effect in the training manipulation. A natural way of representing
this prediction of the theory is a normal distribution with a mean of 5%. What should the
standard deviation of the normal distribution be? The theory predicts an effect in a certain
direction; namely searching for rules will be harmful for performance. Thus, we require the
plausibility of negative differences to be negligible. The height of a normal distribution
comes pretty close to zero by about two standard deviations out. Thus, we could set two
standard deviations to be equal to the mean; that is, SD = mean/2. This representation uses
another data set to provide a most likely value - but otherwise keeps predictions as vague as
possible (any larger SD would involve non-negligible predictions in the wrong direction; any
smaller SD would indicate more precision in our prediction). Using SD = mean/2 is a useful
default to consider in the absence of other considerations; we will consider exceptions below.
With this default, the effect is predicted to lie between 0 and twice the estimated likely effect
size. Thus, we are in no way committed to the likely effect size; it just sets a ball park
estimate, with the true effect being possibly close to the null value or indeed twice the likely
value.
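To make the “two standard deviations” reasoning concrete: with the assumed mean of 5% and SD = mean/2 = 2.5%, the prior plausibility the normal distribution assigns to effects in the wrong (negative) direction is

\[
P(\theta < 0 \mid \mathrm{H_1}) \;=\; \Phi\!\left(\frac{0 - 5}{2.5}\right) \;=\; \Phi(-2) \;\approx\; 0.023 ,
\]

that is, only about 2% of the plausibility falls on negative population differences, which is negligible for practical purposes.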
When there is a likely predicted effect size, as in the preceding example, there is
another way we can proceed. We can use the half-normal distribution, with a mode at 0, and
scale the half-normal distribution’s rate of drop by setting the SD equal to the likely predicted
value. So, to continue with the preceding example, the SD would be set to 5%. The half-normal distribution indicates that smaller effect sizes are more likely than larger ones, and
whatever the true population effect is, it plausibly lies somewhere between 0 and 10%. Why
should one use the half-normal rather than the normal distribution? In fact, Bayes factors
calculated either way are typically very similar; but the half-normal distribution considers
values close to the null most plausible, and this often makes it hard to distinguish the
alternative from the null. Thus, if the Bayes factor does clearly distinguish the theories with a
half-normal distribution, we have achieved clear conclusions despite our assumptions. The
conclusion is thereby strengthened. For this reason, a useful default for when one has a likely
expected value is to use the half-normal distribution. (We will consider examples below to
illustrate when a normal distribution is the most natural representation of a particular theory.)
To summarize, to know how much evidence supports a theory we need to know what
the theory predicts. We have considered three possible ways of representing the predictions
of the alternative, namely, the uniform, normal and half-normal distributions. Representing
the predictions of the alternative hypothesis is the part of performing Bayes that requires most
thought. This thinking about the predictions of scientific theory can apparently be avoided by
conventionally deciding on a default representation of the predictions of the alternative to use
in all or most cases (e.g. Box & Tiao, 1973; Jeffreys, 1961; Liang, Paulo, Molina et al, 2008;
Rouder et al., 2009; Rouder & Morey, 2011; Rouder, Morey, Speckman, & Province, 2012;
Wetzels, Grasman, & Wagenmakers, 2012; Wetzels & Wagenmakers, 2012). But as there is
no default theory in science, a researcher still has the responsibility of determining whether
the default representation used in any Bayes factor reasonably matches the predictions of
their theory - and if it does not, that particular Bayes factor is not relevant for assessing the
theory; one or more others will be. (See Vanpaemel & Lee, 2012, for the corresponding
argument in computational modelling, that distributions over parameters are also best
informed by scientific knowledge rather than general defaults.) Thus, Rouder et al. (2009, p
232) recommend using content-specific constraints on specifying the alternative in the Bayes
factor where this is possible. Vanpaemel (2010) also urges representing the alternative in a
way as sensitive to theory as possible. Fortunately, more than a dozen papers have been
published over the last couple of years illustrating the use of simple objectively-specified
content-specific (non-default) constraints for Bayes factors in their particular scientific
domains (e.g. Armstrong & Dienes, 2013; Buhle, Stevens, Friedman, & Wager, 2012;
Dienes, Baddeley, & Jansari, 2012; Dienes & Hutton, 2013; Gallistel; 2009; Fu, Bin, Dienes,
Fu, & Gao, 2013; Guo et al., 2013; Guo, Li, Yang, & Dienes, 2013; Mealor & Dienes,
2013a,b; Morey & Rouder, 2011; Newell & Rakow, 2011; Parris & Dienes, 2013; Semmens-Wheeler, Dienes, & Duka, 2013; Shang et al., 2013; Terhune, Wudarczyk, Kochuparampil,
& Kadosh, 2013). Some principles are illustrated below. The three possible ways of
representing the alternative hypothesis considered here turn out to be sufficient for
representing theoretical intuitions in many scientific contexts. While the alternative requires
some thought, a default is used for the null in the Dienes (2008) calculator; it assumes the
null hypothesis is that the true population value is exactly zero. We will consider in Appendix
1 how to change this according to scientific context as well (cf. Morey & Rouder, 2011, who
also allow for flexible non-default nulls).
To know how much evidence supports a theory, we need to know what the evidence
is. For a situation where a t-test or a z-test can be performed, the relevant summary of the
data is the parameter estimate and its standard error: For example, the sample mean
difference and the standard error of the difference. This is the same as for a t-test (or z-test); a
t value just is the parameter estimate divided by its standard error. So if one knows the
sample mean difference, the relevant standard error can be obtained by: mean difference/t,
where t can be obtained through SPSS, R etc. This formula works for a between-participants,
within-participants, mixed or one sample t-test. Thus, any researcher reading a paper
involving a t-test (or any F with one numerator degree of freedom; the corresponding t is the square root of the F), or performing their own t-test (or F with one numerator degree of freedom) can readily obtain
the two numbers summarizing the data needed for the Bayes factor.
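As a minimal sketch of this bookkeeping (the numbers are hypothetical, chosen only to show the conversions):

```python
# Recover the two data summaries the Bayes factor needs from reported test statistics.
mean_diff = 5.0              # sample mean difference, in raw units (hypothetical)
t = 0.5                      # reported t value (hypothetical)
se = mean_diff / t           # standard error of the difference: 10.0

F = 2.25                     # an F with one numerator degree of freedom (hypothetical)
t_from_F = F ** 0.5          # the corresponding t: 1.5
print(se, t_from_F)
```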
The Dienes (2008) calculator asks for the mean (i.e. parameter estimate in general,
including mean difference) and the standard error of this estimate. It also asks if the
p(population value|theory) (i.e. the plausibility of different population values given the theory)
is uniform: If the answer is ‘yes’, boxes appear to enter the lower and upper bounds; if the
answer is ‘no’, boxes appear to set the parameters of the normal distribution. In the latter case,
to answer the question “Is the distribution one-tailed or two-tailed?” enter “1” for a half-normal and “2” for a normal.[7]
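The logic the calculator implements can be sketched in a few lines of code. The following is a simplified re-implementation for illustration, not the Dienes (2008) calculator itself, and it may differ from the online tool in numerical detail: the parameter estimate is assumed to have a normal sampling distribution, and that likelihood is integrated over the chosen representation of the alternative.

```python
# B = P(data | alternative) / P(data | null: population value exactly 0).
from scipy.stats import norm
from scipy.integrate import quad

def bayes_factor(mean, se, prior, support):
    """mean, se : the parameter estimate (e.g. mean difference) and its standard error.
    prior      : function giving the plausibility of each population value under the
                 alternative (one of the three shapes in Figure 3).
    support    : (lo, hi) range outside which the prior is (effectively) zero."""
    likelihood = lambda theta: norm.pdf(mean, loc=theta, scale=se)
    p_data_given_alt, _ = quad(lambda theta: prior(theta) * likelihood(theta), *support)
    p_data_given_null = likelihood(0.0)
    return p_data_given_alt / p_data_given_null

# The three ways of representing the alternative (Figure 3):
def uniform(lo, hi):
    return lambda t: 1.0 / (hi - lo) if lo <= t <= hi else 0.0

def normal(mode, sd):
    return lambda t: norm.pdf(t, loc=mode, scale=sd)

def half_normal(sd):
    return lambda t: 2.0 * norm.pdf(t, loc=0.0, scale=sd) if t >= 0 else 0.0

# For example, a mean difference of 5 with standard error 10, against an alternative
# represented as a uniform from 0 to 20 (the same numbers as the first example of section 3.1):
print(round(bayes_factor(5.0, 10.0, uniform(0.0, 20.0), support=(0.0, 20.0)), 2))  # ~0.89
```

For the normal and half-normal representations, a support such as (mode - 5 * sd, mode + 5 * sd) or (0, 10 * sd) is wide enough in practice; the half-normal density is doubled on the positive side so that it still integrates to 1.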
We can now use the simulations in Figure 1 to illustrate some features of Bayes
factors. Table 1 shows the sequence of p-values obtained in the successive replications of the
same experiment with a true population raw effect size of 10 units. Let us assume, as we did
for interpreting the confidence intervals, that a value of 10 units is regarded as the sort of
effect size that would hold if the theory were true. Thus, we can represent the alternative as a
half-normal distribution with an SD of 10. Table 1 shows both the ‘dance of the p-values’ and
the more graceful ‘tai chi of the Bayes factors’.
[7] Note this is not the same as specifying a 1-tailed or 2-tailed test in orthodox statistics. In orthodox statistics, conducting a 1- or 2-tailed test is a matter of whether one would treat a result as significant if the difference had been extreme in just one direction (1-tailed test) or rather in either direction (2-tailed test). Such counterfactuals do not apply to Bayes, and the criticisms of 1-tailed tests in an orthodox sense (criticizing a researcher using a 1-tailed test because he would have rejected the null if the results had been extreme in the other direction, even though they were not) hence do not apply to Bayes factors (cf. Dienes, 2011; Royall, 1997). Further, as shown in Figure 3, a two-tailed normal distribution can be used to make a directional prediction. So 1- versus 2-tailed for the Bayes factor just refers to the shape of the distribution used to represent the predictions of the alternative.
Tables 1 (a) and (b). Bayes Factors Corresponding to the p-values Shown in Figure 1.
The B’s are sorted into three categories: The top row is for B’s less than 1/3 (support
for null), the bottom row for B’s more than 3 (support for alternative), and the middle row is
for B’s in the middle, indicating data insensitivity (support for neither hypothesis).
Significant p-values are asterisked.
Table 1(a)
p:                          .081  .034*  .74  .034*  .09  .817  .028*  .001*  .056  .031*  .279  .024*  .083
B, giving support for:
  Null (B < 1/3):           none
  Neither (1/3 to 3):       2.96, 0.52, 2.70, 0.46, 1.73, 2.96
  Alternative (B > 3):      4.88, 4.88, 4.40, 1024.6, 3.33, 4.88, 4.28

Table 1(b)
p:                          .002*  .167  .172  .387  .614  .476  .006*  .028*  .002*  .024*  .144  .23
B, giving support for:
  Null (B < 1/3):           none
  Neither (1/3 to 3):       2.16, 2.12, 1.01, 0.65, 0.75, 2.36, 1.73
  Alternative (B > 3):      28.00, 4.28, 49.86, 5.60, 49.86
In interpreting Table 1, remember that a B above 3 indicates substantial support for
the alternative and a B less than 0.33 indicates substantial support for the null. In Table 1,
significant results are associated with B’s above 3 (because, as it happens, the effect sizes in
those cases are around the values we regarded as typical, and not close to 0; in fact for a fixed
p = .05, B will indicate increasing support for the null as N increases, and thus the sample
mean shrinks, Lindley, 1957). Note also that Bayes factors can be quite large (e.g. 1024);
they do not scale according to our intuitions trained on t-values.[8]
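The Lindley point in the parenthesis above can be seen numerically. A sketch (illustrative only; the half-normal with SD = 10 is the representation used for Table 1):

```python
# Hold p fixed at .05, so the estimate always sits 1.96 standard errors from zero, and
# let the standard error shrink, as it would with increasing N. Against the half-normal
# alternative with SD = 10, B drifts from favouring the alternative towards favouring
# the null (Lindley, 1957).
from scipy.stats import norm
from scipy.integrate import quad

def b_half_normal(mean, se, sd=10.0):
    integrand = lambda theta: 2.0 * norm.pdf(theta, 0.0, sd) * norm.pdf(mean, theta, se)
    p_alt, _ = quad(integrand, 0.0, 10.0 * sd)
    return p_alt / norm.pdf(mean, 0.0, se)

for se in (10.0, 5.0, 1.0, 0.1):
    just_significant_mean = 1.96 * se     # exactly "just significant" at the 5% level
    print(se, round(b_half_normal(just_significant_mean, se), 2))
```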
Crucially, in Table 1, non-significant results correspond to Bayes factors indicating
insensitivity, that is, between 3 and 1/3. (It is in no way guaranteed of course that Bayes
factors won’t sometimes indicate evidence for the null when the null is false. But this
example illustrates how such misleading Bayes factors would be fewer than non-significant
p-values when the null is false. For analytically-determined general properties of error rates
with Bayes factors contrasting simple hypotheses see Royall, 1997.) A researcher obtaining
any of the non-significant results in Table 1 would, following a Bayesian analysis, conclude
[8] To have them scale in a similar way to a t-value, one could take the log of the Bayes factor to base 3. A log3 B above 1 indicates substantial evidence for the alternative, below -1 substantial evidence for the null, and 0 indicates no evidence either way. The value of 1024 in Table 1 would become 6.31, which maybe seems more ‘reasonable’. Using log to base square root of 3 would mean 2 and -2 are the cut-offs for substantial evidence, similar to a t-test. Compare Edwards, 1992, who recommended natural logs for Bayes factors comparing simple hypotheses. Nonetheless, I do not use log B in this paper. Unlike t, B has an intuitive meaning: The data are B times more likely under H1 than under H0.
that the data were simply insensitive and more participants should be run. The same
conclusions follow from using confidence intervals, as shown in Figure 1, using the rules of
inference in Figure 2. So, if Bayes factors often produce answers consistent with inference by
intervals, what does a Bayes factor buy us? It will allow us to assert the null without knowing
a minimal interesting effect size, as we now explore. We will see that a non-significant result
sometimes just indicates insensitivity, but it is sometimes support for the null.
3. Examples using the different ways of representing the predictions of the theory
We will consider the uniform, normal and half-normal in turn. The sequence of examples serves as a tutorial in various aspects of calculating Bayes factors, and so the examples are best read in sequence.
3.1 Bayes with a Uniform Distribution.
1. A manipulation is predicted to decrease an effect. Dienes, Baddeley, and Jansari
(2012) predicted that in a certain context a negative mood would reduce a certain type of
learning. To simplify so as to draw out certain key points, the example will depart from the
actual paradigm and results of Dienes et al. Subjects perform a two-alternative forced choice
measure of learning, where 50% is chance. If performance in the neutral condition is 70%,
then, if the theory is correct, performance in the negative mood condition will be somewhere
between 50% and 70%. That is, the effect of mood would be somewhere between 0 and 20%.
Thus, the predictions of a theory of a mood effect could be represented as a uniform from 0 to
20 (in fact, there is uncertainty in this estimate which could be represented, but we will
discuss that issue in the subsequent example).
Performance in the negative mood condition is (say) 65% and non-significantly
different from that in the neutral condition, t(50) = 0.5, p = .62. So the mean difference is 70%
– 65% = 5%. The standard error = (mean difference)/t = 10%. Entering these values in the
calculator (mean = 5, standard error = 10, uniform from 0 to 20) gives B = 0.89. That is, the
data are insensitive and do not count against the theory that predicted negative mood would
impair performance (indeed, Dienes et al. obtained a non-significant result which a Bayes
factor indicated was insensitive).
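For a uniform representation of the alternative, the integral has a closed form, so this first example can be checked in a few lines (a sketch; the online calculator may differ in rounding):

```python
# Check of the example: mean difference 5, standard error 10, alternative = uniform
# from 0 to 20, null = population difference of exactly 0.
from scipy.stats import norm

mean, se, lo, hi = 5.0, 10.0, 0.0, 20.0

p_data_given_null = norm.pdf(mean, loc=0.0, scale=se)
p_data_given_alt = (norm.cdf((mean - lo) / se) - norm.cdf((mean - hi) / se)) / (hi - lo)

print(round(p_data_given_alt / p_data_given_null, 2))   # ~0.89, as in the text
```

Re-running with a mean of 0 or -5, or with a standard error of 5, reproduces the values 0.60, 0.43 and 0.16 discussed next.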
If the performance in the negative mood condition had been 70%, the mean difference
between mood conditions would be 0. The Bayes factor is then 0.60, still indicating data
insensitivity. That is, obtaining identical sample means in two conditions in no way
guarantees that there is compelling evidence for the null hypothesis. If the performance in the
negative mood condition had been 75%, the mean difference between mood conditions would
be 5% in the wrong direction. This is entered into the calculator as -5 (i.e. as a negative
number: The calculator assumes positive means are in the direction predicted by theory if the
theory is directional). B is then 0.43, still insensitive. That is, having the means go in the
wrong direction does not in itself indicate compelling evidence against the theory (even
coupled with a “very non-significant” p of .62). If the standard error were smaller, say 5
instead of 10, then a difference of -5 gives a B of 0.16, which is substantial evidence for the null
and against the theory. That is, as the standard error shrinks, the data become more sensitive,
as would be expected.
The relation between standard error and sensitivity allows an interesting contrast
between B and p-values. P-values do not monotonically vary with evidence for the null: A
larger p-value may correspond to less evidence for the null. Consider a difference of 1 and a
standard error of 10. This gives a t of 0.1, p = .92. If the difference were the same but the
standard error were much smaller, e.g. 1, then t would be larger, namely 1, and p = .32. The p-value is
smaller because the standard error is smaller. To calculate B, assume a uniform from 0 to 20,
as before. In the first case, where p = .92, B is 0.64 and in the second case, where p = .32, B
= 0.17. B is more sensitive (and more strongly supports the null) in the second case precisely
because the standard error is lower in the second case. Thus, a high p-value may just indicate
the standard error is large and the data insensitive. It is a fallacy to say the data are
convincing evidence of the null just because the p-value is very high. The high p-value may
be indicating just the opposite.
2. Testing additivity versus interaction. Buhle et al. (2012) wished to test whether
placebo pain relief works through the same mechanism as distraction based pain relief. They
argued that if the mechanisms were separate, the effect of placebo should be identical in the distraction and control conditions; if the mechanism were the same, then placebo would have less
effect in a distraction condition. Estimating from their Figure 2, the effect of placebo versus
no placebo in the no distraction condition was 0.4 pain units (i.e. placebo reduced pain by 0.4
units). The effect of placebo in the distraction condition was 0.44 units (estimated). The
placebo × distraction interaction raw effect (the difference between the two placebo effects)
is therefore 0.4 – 0.44 = -0.04 units (note that it is in the wrong direction relative to the theory that
distraction would reduce the placebo effect). The interaction was non-significant, F(1, 31) =
0.109, p = .746. But in itself the non-significant result does not indicate evidence for
additivity; perhaps the data were just insensitive. How can a Bayes factor be calculated?
The predicted size of the interaction needs to be specified. Following Gallistel (2009),
Buhle et al. reasoned that the maximum size of the interaction effect would occur if
distraction completely removed the placebo effect (i.e. placebo effect with distraction = 0).
Then, the interaction would be the same size as the placebo effect in the no distraction
condition (that is, interaction = placebo effect with no distraction – placebo effect with
distraction (i.e. 0.4 – 0) = placebo effect with no distraction = 0.4). The smallest the
population interaction could be (on the theory that distraction reduces the placebo effect) is if
the placebo effect was very nearly the same in both conditions, that is, an interaction effect
as close as we like to 0. So we can represent the plausible sizes of the interaction as a uniform
from 0 to the placebo effect in the no distraction condition, that is from 0 to 0.4.
The F of 0.109 corresponds to a t-value of √0.109 = 0.33. Thus the standard error of
the interaction is (raw interaction effect)/t = 0.04/0.33 = 0.12. With these values (mean = -0.04, entered as negative because the observed interaction went in the direction opposite to prediction; standard error = 0.12; uniform from 0 to 0.4), B is 0.29, that is, substantial evidence for
additivity. This conclusion depends on assuming that the maximum the interaction could be
was the study’s estimate of the placebo effect with no distraction. But this is just an estimate.
We could represent our uncertainty in that estimate – and that would always push the upper
limit upwards. The higher the upper limit of a uniform, the vaguer the theory is. The vaguer
the theory is, the more Bayes punishes the theory, and thus the easier it is to get evidence for
the null. As we already have evidence for the null, representing uncertainty will not change
the qualitative conclusion, only make it stronger. The next example will consider the
rhetorical status of this issue in the converse case, when the Bayes factor is insensitive.
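The same closed form reproduces this example, with the extra steps of converting the F to a t and applying the sign convention for an effect in the unpredicted direction (a sketch; values rounded as in the text):

```python
# Buhle et al. (2012) example: placebo x distraction interaction.
from scipy.stats import norm

F = 0.109
t = F ** 0.5                      # ~0.33: an F with one numerator df is a squared t
raw_interaction = -0.04           # negative: the observed effect went opposite to prediction
se = round(0.04 / t, 2)           # 0.12, rounding as in the text
lo, hi = 0.0, 0.4                 # uniform: from no interaction up to the full placebo effect

p_null = norm.pdf(raw_interaction, loc=0.0, scale=se)
p_alt = (norm.cdf((raw_interaction - lo) / se) - norm.cdf((raw_interaction - hi) / se)) / (hi - lo)
print(round(p_alt / p_null, 2))   # ~0.29: substantial evidence for additivity
```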
In sum, no more data need to be collected; the data are already sensitive enough to
make the point. The non-significant result was rightly published, given that it provided evidence for the null as substantial as a significant result might have provided against it.
3. Interactions that can go in both directions. In the above example, the raw
interaction effect was predicted on theory to go in only one direction, that is, to vary from 0
up to some positive maximum. We now consider a similar example, and then generalize to a
theory that allows positive and negative interactions. For the latter sort of interaction it may be more difficult to specify limits in a simple way, but in this example we can.
Watson, Gordon, Stermac, Kalogerakos, and Steckley (2003) compared the
effectiveness of Process Experiential Therapy (PET; i.e. Emotion Focussed Therapy) with
Cognitive Behavioural Therapy (CBT) in treating depression. There were a variety of
measures, but for illustration we will look at just the Beck Depression Inventory (BDI) on the
entire intent-to-treat sample (n = 93). CBT reduced the BDI from pre- to post-test (from
25.09 to 12.56), a raw simple effect of time of 12.53. Similarly, PET reduced the BDI from
24.50 to 13.05, a raw simple effect of 11.45. The F for the group (CBT vs PET) × time (pre
vs post) interaction was 0.18, non-significant. In itself, the non-significant interaction does
not mean the treatments are equivalent in effectiveness. To know what it does mean, we need
to get a handle on what size interaction effect we would expect.
One theory is based on assuming we know that CBT does work and is bound to be at least as effective as any other therapy for depression. Then, the population interaction effect (effect of
time for CBT - effect of time for PET) will vary between near 0 (when the effects are near
equal) and the effect of time for CBT (when the other therapy has no effect). Thus assuming
12.53 to be a sensitive estimate of the population effect of CBT, we can use a uniform for the
alternative from 0 to 13 (rounding up). The sample raw interaction effect is 12.53 – 11.45 =
1.08. The F for the interaction (0.18) corresponds to a t of √0.18 = 0.42. Thus the standard
error of the raw interaction effect is 1.08/0.42 = 2.55. With these values for the calculator
(mean = 1.08, SE = 2.55, uniform from 0 to 13), B = 0.36.
The B just falls short of a conventional criterion for substantial evidence for
equivalence. The evidence is still reasonable, in that with Bayes sharp conventional cut-offs have no absolute meaning (unlike orthodoxy); evidence is continuous.[9] Further, we assumed
we had estimated the population effect of CBT precisely. We could represent some
uncertainty in that estimate, e.g. use an upper limit of a credibility or confidence interval of
the effect of CBT as the upper limit of our uniform. But we might decide as a defeasible
default not to take into account uncertainty in estimates of upper limits of uniforms when we
have non-significant results. In this way, more non-significant studies come out as
inconclusive; that is, we have a tough standard of evidence for the null.
It is also worth considering another alternative hypothesis, namely one which asserts
that CBT might be better than PET – or vice versa. We can represent the predictions of this
hypothesis as a uniform from -11 to + 13. (The same logic is used for the lower limit; i.e. if
PET is superior, the most negative the interaction could be is when the effect of CBT is
negligible compared to PET, so interaction = - effect of time for PET.) Then B = 0.29. That
is, without a pre-existing bias for CBT, there is substantial evidence for equivalence in the
context of this particular study (and it would get even stronger if we adjusted the limits of the
uniform to take into account uncertainty in their estimation). Even with a bias for thinking
CBT is at least as good as PET, there is modest evidence for equivalence.
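Both uniforms can be checked with the same closed form as before (a sketch; small rounding differences aside):

```python
# Watson et al. (2003) example: CBT vs PET group x time interaction.
from scipy.stats import norm

interaction = 12.53 - 11.45            # 1.08: difference between the two simple effects of time
t = 0.18 ** 0.5                        # ~0.42, from F(1, df) = 0.18
se = round(interaction / t, 2)         # ~2.55

def b_uniform(mean, se, lo, hi):
    p_null = norm.pdf(mean, loc=0.0, scale=se)
    p_alt = (norm.cdf((mean - lo) / se) - norm.cdf((mean - hi) / se)) / (hi - lo)
    return p_alt / p_null

print(round(b_uniform(interaction, se, 0.0, 13.0), 2))    # ~0.36: CBT assumed at least as good
print(round(b_uniform(interaction, se, -11.0, 13.0), 2))  # ~0.29: either therapy could be better
```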
4. Interaction with degrees of freedom more than 1 and simple effects. Raz and
colleagues have demonstrated a remarkable effect of hypnotic suggestion in a series of
studies: Suggesting that the subject cannot read words, that the stimuli will appear as a
[9] Indeed, the absence of sharp cut-offs applies to Bayesian intervals as much as to Bayes factors: If a 90% Bayesian interval were within the null region, there would be grounds for asserting the null region hypothesis, albeit there would be stronger grounds if a 95% interval were in the null region.
meaningless script, substantially reduces the Stroop effect (e.g. Raz, Fan, & Posner, 2005).
Table 2 presents imaginary but representative data.
Table 2. (Imaginary) Means for the Effectiveness of a Hypnotic Suggestion to Reduce the
Stroop Effect.
                 No Suggestion    Suggestion
Incongruent      850              785
Neutral          750              745
Congruent        720              715
The crucial test of the hypothesis that the suggestion will reduce the Stroop effect is
the Suggestion (present vs absent) × Word Type (incongruent vs neutral vs congruent)
interaction. This is a 2-degree of freedom effect. However, it naturally decomposes into two
1-degree of freedom effects. The Stroop effect can be thought of as having two components
with possibly different underlying mechanisms: An interference effect (incongruent – neutral)
and a facilitation effect (neutral – congruent). Thus the interaction decomposes into the
effect of suggestion on interference and the effect of suggestion on facilitation. In general,
multi-degree of freedom effects often decompose into one degree of freedom contrasts addressing specific theoretical questions. Let’s consider the effect of suggestion on interference
(the same principles of course apply to the effect of suggestion on facilitation). First, let us
say the test for the interaction suggestion (present vs absent) × word type (incongruent vs
neutral) is F(1, 40) = 2.25, p = 0.14. Does this mean the suggestion is ineffective in the context of this study for reducing interference?
The F of 2.25 corresponds to a t of √2.25 = 1.5. The raw interaction effect is
(interference effect for no suggestion) – (interference effect for suggestion) = (850 – 750) –
(785 – 745) = 100 – 40 = 60. Thus, the standard error for the interaction is 60/1.5 = 40.
What range of values could the population effect plausibly take? If the suggestion had
removed the interference effect completely, the interaction would be the interference effect
for no suggestion (100); conversely if the suggestion had been completely ineffective, it
would leave the interference effect as it is, and thus the interaction would be 0. Thus, we can
represent the alternative as a uniform from 0 to 100. This gives a B of 2.39. That is, the data
are insensitive but if anything should increase one’s confidence in the effectiveness of the
suggestion.
Now, let us say the test of the interaction gave us a significant result, e.g. F(1,40) =
4.41, p = .04, but the means remain the same as in Table 2. Now the F corresponds to a t of
√4.41 = 2.1. The raw size of interaction is still 60; thus the standard error is 60/2.1 = 29. B is
now 5.54, substantial evidence for the effectiveness of suggestion. In fact, the test of the
interference effect with no suggestion is significant, t(40) = 2.80, p = .008; and the test for the
interference effect for just the suggestion condition gives t(40) = 1.30, p = .20. Has the
suggestion eradicated the Stroop interference effect?
We do not know if the Stroop effect has been eradicated or just reduced on the basis
of the last non-significant result. A Bayes factor can be used to find out. The interference
effect is 40. So the standard error is 40/1.3 = 31. On the alternative that the interference
effect was not eradicated, the effect will vary between 0 and the effect found in the no
suggestion condition; that is, we can use a uniform from 0 to 100. This gives a B of 1.56.
That is, the data are insensitive, and while we can conclude that the interference effect was
reduced (the interaction indicates that), we cannot yet conclude whether or not it was
abolished.
5. Paired comparisons. Tan et al. (2009, 2014) tested people on their ability to use
a Brain Computer Interface (BCI). People were randomly assigned to three groups: no
treatment, 12 weeks of mindfulness meditation, and an active control (12 weeks of learning to
play the guitar). The active control was designed (and shown) to create the same level of
expectations as mindfulness meditation in improving BCI performance. Initially, 8 subjects
were run in each group. Pre-post differences on BCI performance were -8, 15, and 9 for the
no-treatment, mindfulness and active control, respectively. The difference between
mindfulness and active control was not significant, t(14) = 0.61, p = 0.55. So, is mindfulness
any better than an active control?
Assume the active control and mindfulness were equalized on non-specific processes
like expectations and beliefs about change, and mindfulness may or may not contain an
additional active component. Then, the largest plausible difference between mindfulness and the active control is the difference between mindfulness and the no-treatment control (i.e. a difference of 23, which was significant). Thus, the alternative can be represented as a uniform
from 0 to 23. The actual difference was 15 – 9 = 6, with a standard error of 6/0.61 = 9.8. This
gives a B of 0.89. The data were insensitive. (In fact, with degrees of freedom less than 30,
the standard error should be increased slightly by a correction factor given below. Increasing
the standard error reduces sensitivity.)
Thus, the following year another group of participants were run. One cannot simply
top up participants with orthodox statistics, unless pre-specified as possible by one’s stopping
rule (Armitage, McPherson, & Rowe, 1969); by contrast, with Bayes, one can always collect
more participants until the data are sensitive enough, that is, B < 1/3 or B > 3; see e.g. Berger
and Wolpert (1988), Dienes (2008, 2011). Of course, B is susceptible to random fluctuations
up and down; why can one not capitalize on these fluctuations and stop when they favour a desired result? For example, Sanborn and Hills (2014) and Yu, Sprenger, Thomas, and Dougherty (2014) show that if the null is true, stopping as soon as B > 3 (should that ever occur) increases the proportion of null-true cases in which B > 3. However, as Rouder (2014) shows, it also increases the proportion of cases in which B > 3 when H0 is false, and to exactly the same extent: B retains its meaning as relative strength of evidence, regardless of stopping rule
(for more discussion, see Dienes, forthcoming).
With all data together (N = 63), the means for the three groups were 2, 14, and 6, for no-treatment, mindfulness and active control, respectively. By conventional statistics, the
difference between the mindfulness group and either of the others was significant; for
mindfulness vs active control, t(41) = 2.45, p = .019. The interpretation of the latter result is compromised by the topping-up procedure (one could argue the p is less than .05/2, and so legitimately significant at the 5% level; however, had the result still been insensitive, would the researchers have collected more data the following year? If so, the correction should be more like .05/3; see Sagarin, Ambler, & Lee, 2014; Strube, 2006). A
Bayes factor indicates strength of evidence independent of stopping rule, however. Assume
as before that the maximum plausible difference between mindfulness and the active control is the difference between mindfulness and no-treatment, that is, 12. We represent the
alternative as a uniform from 0 to 12. The mean difference between mindfulness and active
control is 14 – 6 = 8, with a standard error of 8/2.45 = 3.3. This gives a B of 11.45, strong
evidence for the advantage of mindfulness over an active control.
3.2. Bayes with a Half-Normal Distribution.
In the previous examples we considered cases where a rough plausible maximum
could be determined. Here, we consider an illustrative case where a rough expected value can
be determined (and the theory makes a clear directional prediction).
Norming materials. Guo, Li, Yang, and Dienes (2013) constructed lists of nouns that
differed in terms of the size of the object that the words represented, as rated on a 1-7 bipolar
scale (1 = very small; 7 = very big). Big objects were perceived as bigger than small objects
(5.42 vs 2.73), t(38) = 20.05, as was desired. Other dimensions were also rated including
height. Large objects were rated non-significantly taller than small objects on an equivalent
scale (5.13 vs 4.5), t(38) = 1.47, p = .15. Are the materials controlled for height? The
difference in size was 5.42 – 2.73 = 2.69. This was taken to be a representative amount by
which the two lists may differ on another dimension, like height. Thus, a B was calculated
assuming a half-normal with SD = 2.69. (The test is directional because it is implausible that
big objects would be shorter, unless specially selected to be so.) The “mean” was 5.13 – 4.5
= 0.63 height units, the standard error = 0.63/1.47 = 0.43 height units. This yields B = 0.83.
The data are insensitive; no conclusions follow as to whether height was controlled. (Using
the same half-normal, the lists were found to be equivalent on familiarity and valence, Bs <
1/3. These dimensions would be best analysed with full normal distributions (i.e. mean of 0,
SD = 2.69), but if the null is supported with half-normal distributions it is certainly supported
with full normal distributions, because the latter entail vaguer alternatives.)
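For a directional prediction of this kind, the half-normal Bayes factor can be sketched with a simple numerical integration over the prior. The following is a rough sketch, not the Dienes (2008) calculator itself, so the last decimal may differ slightly; for a mean of 0.63, a standard error of 0.43, and a half-normal with SD 2.69 it gives a value close to the B = 0.83 reported above.

```python
# Sketch of a Bayes factor with a half-normal H1 (directional prediction; smaller
# effects more plausible than larger ones), integrated numerically on a grid.
from math import exp, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def bf_half_normal(mean, se, prior_sd, steps=4000):
    like_h0 = normal_pdf(mean, 0, se)
    width = prior_sd * 6 / steps          # integrate well into the prior's tail
    like_h1 = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        prior = 2 * normal_pdf(theta, 0, prior_sd)   # half-normal density, theta >= 0
        like_h1 += prior * normal_pdf(mean, theta, se) * width
    return like_h1 / like_h0

# Height difference 0.63 (SE 0.43); half-normal scaled by the size effect, SD = 2.69
print(round(bf_half_normal(0.63, 0.43, 2.69), 2))  # approximately 0.83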
3.3. Bayes with a Normal Distribution.
A half-normal distribution is intrinsically associated with a theory that makes
directional predictions. A normal distribution may be used for theories that do not predict
direction, as well as those that do. In the second case, the normal rather than half-normal is
useful where it is clear that a certain effect is predicted and smaller effects are increasingly
unlikely. One example is considered.
Theory roughly predicts a certain value and smaller values are not more likely. Fu,
Dienes, Shang, and Fu (2013) tested Chinese and UK participants on their ability to implicitly learn global or local structures. A culture (Chinese vs British) by level (global vs local) interaction was obtained. Chinese participants were superior to British participants at the global level, RT difference in learning effects = 50 ms, with standard error = 14 ms. The difference at the local level was 15 ms, SE = 13 ms, t(47) = 1.31, p = .20. Fu et al wished to consider the theory that Chinese participants outperform British participants simply because of greater motivation. If this were true, then Chinese participants should outperform British participants at the local level to the same extent as at the global level. We can represent what the motivation theory predicts for the local level by a normal distribution with a mean of 50 and a standard deviation of 14. This gives B = 0.25.
That is, relative to this theory, the null is supported. The motivation theory can be rejected
based both on the significant interaction and on the Bayes factor.
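When both the prior on the effect and the likelihood are normal, the marginal likelihood under H1 has a closed form, so a sketch of this calculation needs no numerical integration. The following is a minimal sketch under that assumption (it ignores any truncation at zero, which is negligible here because a normal with mean 50 and SD 14 puts almost no mass below zero); it is not the Dienes (2008) calculator itself, but it reproduces approximately the B = 0.25 just given.

```python
# Sketch: Bayes factor with a normal H1 centred on the predicted effect.
from math import exp, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def bf_normal(mean, se, prior_mean, prior_sd):
    like_h0 = normal_pdf(mean, 0, se)
    # For a normal prior and normal likelihood, the marginal likelihood of the
    # observed mean is normal with variance = prior variance + squared SE.
    like_h1 = normal_pdf(mean, prior_mean, sqrt(prior_sd ** 2 + se ** 2))
    return like_h1 / like_h0

# Local-level difference 15 ms (SE 13 ms); motivation theory predicts about 50 ms (SD 14 ms)
print(round(bf_normal(15, 13, 50, 14), 2))  # approximately 0.25
```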
My typical default for a normal distribution is to use SD = mean/2. This would
suggest an SD of 25 rather than 14, the latter being the standard deviation of the sampling
distribution of the estimated effect in the global condition. In this case, where the same
participants are being run on the same paradigm with a minor change to which motivation
should be blind, the sampling distribution of the effect in one condition stands as a good
representation of our knowledge of its effect in the other condition, assuming a process blind
to the differences between conditions10. Defaults are good to consider, but they are not
binding.
4. Some general considerations in using Bayes factors
Appendix 1 gives further examples, illustrating trend analysis, the use of standardised
effect sizes, and why sometimes raw effect sizes are required to address scientific problems;
it also indicates how to easily extend the use of the Dienes (2008) Bayes calculator to simple
meta-analysis, correlations, and contingency tables. Appendix 2 considers other methods for
comparing relative evidence (e.g. likelihood inference and BIC, the Bayesian Information
Criterion).
4.1 Assumptions of the Dienes (2008) calculator. The calculator assumes the parameter
estimate is normally distributed with known variance. However, in a t-test situation we have
only estimated the variance. If the population distribution of observations is normal, and
degrees of freedom are above 30, then the assumption of known variance is good enough (see
Wilcox, 2010, for problems when the underlying distribution is not normal). For degrees of
freedom less than 30, the following correction should be applied to the standard error:
Increase the standard error by a factor of (1 + 20/df²) (adapted from Berry, 1996; this produces a good approximation to the t distribution, over-correcting by a small amount). For example, if df = 13
and the standard error of the difference is 5.1, we need to correct by (1 + 20/(13*13)) = 1.12.
That is, the standard error we will enter is 5.1*1.12 = 5.7. If degrees of freedom have been
adjusted in a t-test (on the same comparison as a Bayes factor is to be calculated for) to
account for unequal variances, the adjusted degrees of freedom should be used in the
correction. No adjustment is made when the dependent variable is (Fisher z transformed)
correlations or for the contingency table analyses illustrated in the Appendix, as in these
cases the standard error is known.
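A minimal sketch of the correction just described, using the worked example above:

```python
# Sketch of the small-df correction: inflate the standard error by (1 + 20/df^2)
# before entering it into the calculator when degrees of freedom are below 30.
def corrected_se(se, df):
    return se * (1 + 20 / df ** 2) if df < 30 else se

print(round(corrected_se(5.1, 13), 2))  # about 5.7, as in the example above
```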
Note that the assumptions of the Dienes (2008) calculator are specific to that
calculator and not to Bayes generally. Bayes is a general all-purpose method that can be
applied to any specified distribution or to a bootstrapped distribution (e.g. Jackman, 2009;
Kruschke, 2010a; Lee & Wagenmakers, 2014; see Kruschke, 2013b, for a Bayesian analysis that allows heavy-tailed distributions). Bayes is also not limited to one degree of freedom contrasts, as the Dienes (2008) calculator is (see Hoijtink, Klugkist, & Boelen, 2008, for Bayes factors on complex hypotheses involving a set of inequality constraints). However, pinpoint
tests of theoretical predictions are generally one degree of freedom contrasts (e.g. Lewis,
1993; Rosenthal, Rosnow, & Rubin, 2000). Multiple degree of freedom tests usually serve as
precursors to one degree of freedom tests simply to control familywise error rates (though see
Hoijtink et al). But for a Bayesian analysis one should not correct for what other tests are
conducted (only data relevant to a hypothesis should be considered to evaluate that
hypothesis), so one can go directly to the theoretically relevant specific contrasts of interest
(Dienes, 2011; for more extended discussion see Dienes, forthcoming; and Kruschke, 2010a
for the use of hierarchical modelling for dealing with multiple comparisons).
4.2 Power, intervals, and Bayes. No matter how much power one has, a sensitive result is
never guaranteed. Sensitivity can be guaranteed with intervals and Bayes factors: One can
collect data until the interval is narrower than the null region and lies either completely inside or completely outside it, or until the Bayes factor is either greater than 3 or less than a third (see Dienes, 2008
10
Using SD = mean/2 is like using the sampling distribution of the mean when it is only just significantly
different from zero.
and forthcoming-b for related discussion; in fact, error probabilities are controlled remarkably
well with such methods; see Sanborn & Hills, 2014, for limits and exceptions, and Rouder,
2014, for the demonstration that the Bayes factor always retains its meaning as strength of
evidence – how much more likely the data are on H1 than H0 – regardless of stopping rule).
Because power does not make use of the actual data to assess its sensitivity, one should
assess the actual sensitivity of obtained data with either intervals or Bayes factors. Thus, one
can run until sensitivity has been reached. Even if one cannot collect more data, there is still
no point referring to power once the data are in. The study may be low powered, but the data
actually sensitive; or the study may be high powered, and the data actually insensitive. Power
is helpful in finding out a rough number of observations needed; intervals and Bayes factors
can be used thereafter to assess data (cf. Royall, 1997).
Consider a study investigating whether imagining kicking a football for 20 minutes
every day for a month increased the number of shots into goal out of 100 kicks. Real practice
in kicking 20 minutes a day increases performance by 12 goals. We set this as the maximum
of a uniform. What should the minimum be? The minimum is needed to interpret a
confidence or other interval. It is hard to say for sure what the minimum should be. An improvement of 0.01 goals may not be worth it, but maybe 1 goal would be? Let us set 0.5 as the minimum.
The imagination condition led to an improvement of 0.4 goals with a standard error
of 1. Using a uniform from 0.5 to 12, B = 0.11, indicating substantial evidence for the null
hypothesis. If we used a uniform from 0 to 12, B = 0.15, hardly changed. However, a
confidence (or other) interval would declare the results insensitive, as the interval extends
outside the null region (even a 68% interval, provided by 0.4 ± 1 goals). The Bayes factor
makes use of the full range of predictions of the alternative hypothesis, and can use this
information to most sensitively draw conclusions (conditional on the information it assumes)
(Jaynes, 2003). Further, the Bayes factor is most sensitive to the maximum, which could be
specified reasonably objectively. Inference by intervals is completely dependent on
specification of the minimum, which is often hard to specify objectively. However, if the
minimum were easy to objectively specify and the maximum hard in a particular case, it is
inference by intervals that would be most sensitive to the known constraints, and would be
the preferred solution. In that sense, Bayes factors and intervals complement each other in
their strengths and weaknesses (see Table 3)11.
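Returning to the (hypothetical) kicking-imagery numbers above, the contrast between the two approaches can be seen directly in a brief sketch; it assumes a normal likelihood and is not the Dienes (2008) calculator itself, but it reproduces approximately the two Bayes factors given above and shows why even a narrow interval is inconclusive here.

```python
# Sketch: imagery improvement of 0.4 goals with SE 1; minimal worthwhile effect 0.5.
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

def bf_uniform(mean, se, lower, upper):
    like_h1 = (normal_cdf(upper, mean, se) - normal_cdf(lower, mean, se)) / (upper - lower)
    return like_h1 / normal_pdf(mean, 0, se)

mean, se = 0.4, 1.0
print(round(bf_uniform(mean, se, 0.5, 12), 2))  # approximately 0.11: evidence for the null
print(round(bf_uniform(mean, se, 0.0, 12), 2))  # approximately 0.15: hardly changed
# Even a 68% interval (mean +/- 1 SE) runs from -0.6 to 1.4 goals, straddling the
# minimal interesting value of 0.5, so inference by intervals remains inconclusive.
print(mean - se, mean + se)
```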
In sum, in many cases, Bayes factors may be useful not only when there is little
chance of collecting more data, so as to extract the most out of the data; but also in setting
stopping rules for grant applications, or schemes like Registered Reports in the journal Cortex,
so as to collect data in the most efficient way (for an example see Wagenmakers et al, 2012).
11
When a null region can be specified, Bayesian intervals and Bayes factors can be related (cf. Kruschke, 2013b,
Appendix D; Wetzels et al, 2009, for a somewhat different way of making the relationship). Let the alternative
hypothesis be the complement of the null region. Define a prior distribution. The prior odds is the area
representing the alternative divided by the area in the null region. Update the prior to obtain the posterior
distribution (see e.g. Kruschke, 2013b, or tools on the website for Dienes, 2008). The strength of evidence
that the data provide for the alternative hypothesis over the null is the degree to which the evidence changes
the prior odds: The Bayes factor is the posterior odds divided by the prior odds. Example: A prior of N(0, 1) and
data of N(0.5, 0.2) (i.e. mean = 0.5, SE = 0.2), gives a posterior (using the software on the website of Dienes,
2008, to obtain posteriors:
http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_normalposterior.swf) of N(0.48, 0.2).
For a null region of [-0.1, 0.1], using areas under the normal curve, we get B = 36.17/11.55 = 3.13. (This is not
using the Dienes, 2008, Bayes factor calculator, just areas under the normal, with e.g.
http://davidmlane.com/hyperstat/z_table.html). Conversely, a prior of N(0, 1) and data of N(0, 0.1) give a
posterior of N(0, 0.1) and B = 0.19. (Check these numbers for homework.)
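The first of these homework checks can be sketched in a few lines, under the assumptions stated in the footnote (normal prior, normal likelihood, null region [-0.1, 0.1]); note that the exact figure depends on whether the posterior is first rounded to N(0.48, 0.2), as the footnote does, before taking areas.

```python
# Sketch checking the footnote's first example: prior N(0, 1), data N(0.5, 0.2),
# null region [-0.1, 0.1]. B = posterior odds (outside vs inside the null region)
# divided by the prior odds.
from math import erf, sqrt

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

def odds_outside(mu, sd, lo, hi):
    inside = normal_cdf(hi, mu, sd) - normal_cdf(lo, mu, sd)
    return (1 - inside) / inside

prior_mu, prior_sd = 0.0, 1.0
data_mu, data_se = 0.5, 0.2
# Normal prior x normal likelihood -> normal posterior (precision-weighted average).
precision = 1 / prior_sd ** 2 + 1 / data_se ** 2
post_mu = (prior_mu / prior_sd ** 2 + data_mu / data_se ** 2) / precision
post_sd = sqrt(1 / precision)
print(round(post_mu, 2), round(post_sd, 2))   # about 0.48 and 0.2, as stated above

B = odds_outside(post_mu, post_sd, -0.1, 0.1) / odds_outside(prior_mu, prior_sd, -0.1, 0.1)
# Roughly 3 to 3.5 depending on rounding; the footnote, working with the rounded
# posterior N(0.48, 0.2), reports 3.13.
print(round(B, 2))
```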
(By estimating the relevant standard deviation and likely population effect in advance of
running an experiment, you can enter into the calculator different standard errors based on
different N, and find the N for which B > 3 or < 1/3. This provides an estimate of the required
N. But, unlike with orthodox statistics, it is not an estimate you have committed to; cf.
Kruschke, 2010a; Royall, 1997).
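One simple way of doing such a scan is sketched below with purely hypothetical planning numbers (an assumed SD of the difference scores of 30 ms, a predicted effect of about 20 ms, and H1 represented as a half-normal with SD equal to the predicted effect); the loop shows roughly how B would grow with N if a result of about the predicted size were obtained.

```python
# Rough planning sketch (hypothetical numbers): scan N to see where an anticipated
# result of the predicted size would give B > 3 against a half-normal H1.
from math import exp, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def bf_half_normal(mean, se, prior_sd, steps=4000):
    width = prior_sd * 6 / steps
    like_h1 = sum(2 * normal_pdf((i + 0.5) * width, 0, prior_sd)
                  * normal_pdf(mean, (i + 0.5) * width, se) * width
                  for i in range(steps))
    return like_h1 / normal_pdf(mean, 0, se)

sd_of_scores, predicted_effect = 30.0, 20.0   # hypothetical planning assumptions
for n in range(5, 41, 5):
    se = sd_of_scores / sqrt(n)               # anticipated standard error at this N
    print(n, round(bf_half_normal(predicted_effect, se, predicted_effect), 2))
```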
Table 3. Comparing Intervals and Bayes for Interpreting a Non-significant Result.
What does it tell you?
  Intervals: How precisely a parameter has been estimated; a reflection of data rather than theory.
  Bayes factors: The strength of evidence the data provide for one theory over another; specific to the two theories contrasted.

What do you need to link data to theory?
  Intervals: A minimal value below which the theory is refuted.
  Bayes factors: A rough expected value or maximum value consistent with theory.

Amount of data needed to obtain evidence for the null?
  Intervals: Enough to make sure the width of the interval is less than that of the null region; considerable participant numbers will typically be needed in contrast to Bayes factors.
  Bayes factors: Bayes factors ensure maximum efficiency in use of participants, given a Bayes factor measures strength of evidence.

What would be a useful stopping rule to guarantee sensitivity?
  Intervals: Interval width no more than null region width and interval either completely in or completely out of the null region.
  Bayes factors: Bayes factor either greater than three or less than a third.
4.3 Why choose one distribution rather than another? On a two-alternative recognition
test we have determined that a rough likely level of performance, should knowledge exist, is
60%, and we are happy that it is not very plausible for performance to exceed 70%, nor to go
below chance baseline (50%). Performance is 55% with a standard error of 2.6%. To
determine a Bayes factor, we need first to rescale so that the null hypothesis is 0. So we
subtract 50% from all scores. Thus, the “mean” is 5% and the standard error is 2.6%. We can
use a uniform from 0 to 20 to represent the constraint that the score lies between chance and
20% above chance. This gives B = 2.01.
But we might have argued that as 60% (i.e. 10% above baseline) was rather likely on
the theory (and background knowledge) we could use a normal distribution with a mean of 10%
and a standard deviation of mean/2 = 5%. (Note this representation still entails that the true
population value is somewhere between roughly 0 and 20.) This gives B = 1.98, barely
changed. Or why not, you ask, use a half-normal distribution with the expected typical value,
10%, as the standard deviation? (Note this representation still entails that the true population
value is somewhere between 0 and roughly 20.) This gives B = 2.75, somewhat different but
qualitatively the same conclusion. In all cases the conclusion is that more evidence is needed.
(Compare also the different ways of specifying the alternative provided by Rouder et al,
2009.) In this case, there may be no clear theoretical reason for preferring one of the
distributions over the others for representing the alternative. But as radically shifting the distribution around (from flat, to humped right up against one extreme, to humped in the middle) made no qualitative difference, we can trust the conclusions.
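The comparison of the three representations can be sketched directly; the following assumes a normal likelihood (and an untruncated normal prior in the second case) and is not the Dienes (2008) calculator itself, so the values may differ slightly from the 2.01, 1.98 and 2.75 quoted above, but the qualitative picture is the same.

```python
# Robustness sketch for the recognition example (mean 5%, SE 2.6% after rescaling):
# the same data evaluated against three representations of H1.
from math import erf, exp, pi, sqrt

def npdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def ncdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

def bf_uniform(mean, se, lo, hi):
    return ((ncdf(hi, mean, se) - ncdf(lo, mean, se)) / (hi - lo)) / npdf(mean, 0, se)

def bf_normal(mean, se, pm, psd):          # untruncated normal prior on the effect
    return npdf(mean, pm, sqrt(psd ** 2 + se ** 2)) / npdf(mean, 0, se)

def bf_half_normal(mean, se, psd, steps=4000):
    w = psd * 6 / steps
    m = sum(2 * npdf((i + 0.5) * w, 0, psd) * npdf(mean, (i + 0.5) * w, se) * w
            for i in range(steps))
    return m / npdf(mean, 0, se)

mean, se = 5.0, 2.6
print(round(bf_uniform(mean, se, 0, 20), 2))   # uniform from 0 to 20: about 2.0
print(round(bf_normal(mean, se, 10, 5), 2))    # normal(10, 5): about 2.0
print(round(bf_half_normal(mean, se, 10), 2))  # half-normal with SD 10: about 2.8
```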
Always perform a robustness check: Could theoretical constraints be just as readily
represented in a different way, where the real theory is neutral to the different representations?
Check that you get the same conclusion with the different representations. If not, run more
participants until you do. If you cannot collect more data, acknowledge ambiguity in the
conclusion (as was done by e.g. Shang et al, 2013). (See Kruschke, 2013c, for the equivalent
robustness check for inference by intervals.) Appendix 3 considers robustness in more detail.
Different ways of representing the alternative may clearly correspond to different
theories, as was illustrated in example 2 in Appendix 1. Indeed, one of the virtues of Bayes is
that it asks one to consider theory, and the relation of theory to data, carefully.
4.4 But how can any constraints be set on what my alternative predicts? The examples
in this paper (including the appendix) have illustrated finding constraints by use of data from
different conditions, constraints intrinsic to the design and logic of the theory, and those
provided by standardized effect sizes and regression slopes to convert different raw units of
measurement into each other; and Dienes (in press) provides examples where constraints
intrinsic to the measurement scale are very useful. However, there is one thing one cannot do,
and that is to use the very effect itself, or its confidence interval, as the basis for predicting
that same effect. For example, if the obtained mean difference was 15 ms, one cannot for that
reason set the alternative to a half-normal distribution with a standard deviation of 15. One
can use other aspects of the same data to provide constraints, but not the very aspect that one
is testing (Jaynes, 2003). Double counting the mean difference, both as the data summary and as the specification of the alternative’s predictions, violates the axioms of probability in updating
theory probabilities. This problem creates pressure to demand default alternatives for Bayes
factors, on the grounds that letting people construct their own alternative is open to abuse
(Kievit, 2011).
The Bayes factor stands as a good evaluation of a theory to the extent that its
assumptions can be justified. And fortunately, all assumptions that go into a Bayes factor are
public. So, the type of abuse peculiar to Bayes (that is, representing what a theory predicts in
ways favourable with respect to how the data actually turned out) is entirely transparent. It is
open to public scrutiny and debate (unlike many of the factors that affect significance testing;
see Dienes, 2011). Specifying what predictions a theory makes is also part of the substance of
science. It is precisely what we should be trying to engage with, in the context of public
debate. Trying to convince people of one’s theory with cheap rhetorical tricks is the danger
that comes with allowing argument over theories at all. Science already contains the solution
to that problem, namely transparency, the right of everyone in principle to voice an argument,
and commitment to norms of accountability to evidence and logic. When the theory itself
does work in determining the representation of the alternative, the Bayes factor genuinely
informs us about the bearing of evidence on theory.
4.5 Subjective versus objective Bayes. Bayesian approaches are often divided into
subjective Bayes and objective Bayes (Stone, 2013). Being Bayesian at all in statistics is
defined by allowing probabilities for population parameter values (and for theories more
generally), and using such probability distributions for inference. Whenever one of the
examples above asked for a specification of what the theory predicts, it meant an assignment of
a probability density distribution to different parameter values. How are these probabilities to
be interpreted? The subjective Bayesian says that these probabilities can only be obtained by
looking deep in one’s soul; probabilities are intrinsically subjective (e.g. Howson & Urbach,
2006). That is, it is all a matter of judgment. The objective Bayesian says that rather than
relying on personal or subjective judgment, one should describe a problem situation such that
the conclusions objectively follow from the stated constraints; every rational person should
draw the same conclusions (e.g. Jaynes, 2003).
The approach in this paper respects both camps. The approach is objective in that the
examples illustrate rules of thumb that can act as (contextually relevant) defaults, where the
probability distributions are specified in simple objective ways by reference to data or logical
or mathematical relations inherent in the design. No example relied on anyone saying,
“according to my intuition the mean should be 2 because that’s how I feel” (Cf. Howard,
Maxwell, & Fleming, 2000, for such priors). But the approach is subjective in that the
examples illustrate that only scientific judgment can determine the right representation of the
theory’s predictions given the theory and existing background knowledge; and that scientific
judgment entails that all defaults are defeasible, because science is subject to the frame
problem, and doing Bayes is doing science. Being a subjectivist means a given Bayes factor
can be treated not as THE objective representation of the relation of the theory to evidence,
but as the best given the current level of discussion, and it could be improved if e.g. another
condition was run which informed what the theory predicted in this context (e.g. defined a
maximum value more clearly). A representation of the theory can be provisionally accepted
as reflecting current judgment and ignorance about the theory.
5. Conclusions
This paper has explored the interpretation of non-significant results. A key aspect of
the argument is that non-significant results cannot be interpreted in a theoretical vacuum.
What is needed is a way of conducting statistics that more intimately links theory to data,
either via inference by intervals or via Bayes factors. Using canned statistical solutions to
force theorizing is backwards; statistics are to be the handmaiden of science, not science the
butler boy of statistics.
Bayes has its own coherent logic that makes it radically different from significance
testing. To be clear, a two-step decision process is not being advocated in this paper whereby
Bayes is only performed after a non-significant result. Bayes can – and should – be
performed any time a theory is being tested. At this stage of statistical practice, however,
orthodox statistics typically need to be performed for papers to be accepted in journals. In
cases where orthodoxy fails to provide an answer, Bayes can be very useful. The most
common case where orthodoxy fails is where non-significant results are obtained. But there
are other cases. For example, when a reviewer asks for more subjects to be run, and the
original results were already tested at the 5% level, Bayesian statistics are needed (as in
example 3.1.5 above; orthodoxy is ruled out by its own logic in this case). Running a Bayes
factor when non-significant results are obtained is simply a way that we as a community can
come to know Bayes, and to obtain answers where we need answers, and none are
forthcoming from orthodoxy. Once there is a communal competence in interpreting Bayes,
frequentist orthodox statistics may cease to be conducted at all (or maybe just in cases where
no substantial theory exists and no prior constraints exist – that is, in impoverished pre-scientific environments).
One complaint about the approach presented here could be the following: “I am used
to producing statistics that are independent of my theory. Traditionally, we could agree on the
statistics, whatever our theories. Now the statistical result depends on specifying what my
theory predicts. That means I might have to think about mechanisms by which the theory
works, relate those mechanisms to previous data, argue with peers about how theory relates to
different experiments, connect theory to predictions in ways that peers may argue about. All
of this may produce considerable discussion before we can determine which if any theory has
been supported!” The solution to this problem is the problem itself.
Bayes can be criticized because it relies on “priors”, and it may be difficult to specify
what those priors are. Priors figure in Bayesian reasoning in two ways. The basic Bayesian
schema can be represented as: Posterior odds in two theories = Bayes factor × prior odds in
two theories. The prior odds in two theories are a sort of prior. The approach illustrated in
this paper has lifted the Bayes factor out of that context and treated it alone as a measure of
strength of evidence (cf. Rouder et al., 2009; Royall, 1997). So there is no need to specify that
sort of prior. But the Bayes factor itself requires specifying what the theories predict, and this
is also called a prior. Hopefully it is obvious that if one wants to know how much evidence
supports a theory, one has to know what the theory predicts.
The approach in this paper, though Bayesian, treats analysis in a way that differs in its details from that of, say, Kruschke (2010a) or Lee and Wagenmakers (2014), who also define themselves as teaching the Bayesian approach. There
are many ways of being a Bayesian and they are not exclusive. The philosophy of the
approach here, as also illustrated in my papers published to date using Bayes, is to make the
minimal changes to current practice in the simplest way that would still bring many of the
advantages of Bayesian inference. In many cases, orthodox and Bayesian analyses will
actually agree (e.g. Simonsohn, submitted), which is reassuring, even to Bayesians. A key place where they disagree is in the interpretation of non-significant results. In practice,
orthodox statistics have not been used in a way that justifies any conclusion. So one strategy,
at least initially, is to continue to use orthodox statistics, but also introduce Bayesian statistics
simultaneously. Wherever non-significant results arise from which one wants to draw any
conclusion, a Bayes factor can be employed to disambiguate, as illustrated in this paper (or a
credibility or other interval, if minima can be simply established). In that way reviewers and
readers get to see the analyses they know how to interpret, and the Bayes provides extra
information where orthodoxy does not provide any answer. Thus, people can get used to Bayes, and its proper use can gradually be debated and established. There need be no sudden
wholesale replacement of conventional analyses. Thus, you do not need to wait for the
revolution; your very next paper can use Bayes.
The approach in this paper also misses out on many of the advantages of Bayes. For
example, Bayesian hierarchical modelling can be invaluable (Kruschke, 2010a; Lee &
Wagenmakers, 2014). Indeed, it can be readily combined with the approach in this paper.
That is, the use of priors in the sense not used here can aid data analysis in important ways.
Further, the advantages of Bayes go well beyond the interpretation of non-significant results
(e.g. Dienes, 2011). This paper is just a small part of a larger research programme. But the
problem it addresses has been an Achilles heel of conventional practice (Oakes, 1986;
Wagenmakers, 2007). From now on, all editors, reviewers and authors can decide: whenever
a non-significant result is used to draw any conclusion, it must be justified by either inference
by intervals or measures of relative evidence. Or else you remain silent.
References
Allen, C. P. G., Sumner, P., & Chambers, C. D. (2014). The Timing and
Neuroanatomy of Conscious Vision as Revealed by TMS-induced Blindsight. Journal of
Cognitive Neuroscience, DOI: 10.1162/jocn_a_00557
American Psychological Association (2010). Publication Manual of the American
Psychological Association (5th ed.). Washington, DC: Author.
Armitage, P., McPherson, C. K., & Rowe, B. C. (1969). Repeated significance tests
on accumulating data. Journal of the Royal Statistical Society, Series A, 132, 235-244.
Armstrong, A. M., & Dienes, Z. (2013). Subliminal Understanding of Negation:
Unconscious Control by Subliminal Processing of Word Pairs. Consciousness & Cognition,
22, 1022-1040.
Baguley, T. (2004). Understanding statistical power in the context of applied research.
Applied Ergonomics, 35, 73–80
Baguley, T. (2009). Standardized or simple effect size: What should be reported?
British Journal of Psychology, 100, 603–617
Baguley, T. (2012). Serious Stats: A guide to advanced statistics for the behavioral
sciences. Palgrave Macmillan.
Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical
Science, 2, 317-335.
Berger, J. O., & Wolpert, R. L. (1988). The Likelihood Principle, 2nd edition.
Institute of Mathematical Statistic.
Berry, D. A. (1996). Statistics: A Bayesian perspective. Duxbury Press: London.
Berry, S. M., Carlin, B. P., Lee, J. J., & Müller, P. (2011). Bayesian adaptive methods
for clinical trials. Chapman & Hall: London.
Box, G. E. P., & Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis
(Addison-Wesley series in behavioral science, quantitative methods). Longman Higher
Education.
Buhle, J. T., Stevens, B. L., Friedman, J. J., & Wager, T. D. (2012). Distraction and
Placebo: Two Separate Routes to Pain Control. Psychological Science, 23, 246-253.
Burnham, K. P., & Anderson, D.R. (2002). Model selection and multimodel inference:
A practical information-theoretic approach (2nd ed.). New York: Springer.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S.
J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability
of neuroscience. Nature Reviews Neuroscience, 14, 365-376.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: a
review. Journal of Abnormal and Social Psychology, 65, 145-53.
Cohen, J. (1988). Statistical power analysis for the behavioural sciences (2nd
edition).Erlbaum: Hillsdale, NJ.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Cumming, G. (2011). Understanding The New Statistics: Effect Sizes, Confidence
Intervals, and Meta-Analysis. Routledge.
Dienes, Z. (2008). Understanding Psychology as a Science: An Introduction to
Scientific and Statistical Inference. Palgrave Macmillan.
Dienes, Z. (2011). Bayesian versus Orthodox statistics: Which side are you on?
Perspectives on Psychological Sciences, 6, 274-290.
Dienes, Z. (in preparation). Bayesian mediation analysis: How to assert no mediation
or full mediation.
Dienes, Z. (in press). How Bayesian statistics are needed to determine whether mental
states are unconscious. In M. Overgaard (Ed.), Behavioural Methods in Consciousness
Research. Oxford: Oxford University Press.
Dienes, Z. (forthcoming). How Bayes factors change our science. Journal of
Mathematical Psychology (special issue on Bayes factors).
Dienes, Z., Baddeley, R. J., & Jansari, A. (2012). Rapidly Measuring The Speed Of
Unconscious Learning: Amnesics Learn Quickly And Happy People Slowly. PLoS ONE 7(3):
e33400. doi:10.1371/journal.pone.0033400
Dienes, Z, & Hutton, S. (2013). Understanding hypnosis metacognitively: rTMS
applied to left DLPFC increases hypnotic suggestibility. Cortex, 49, 386-392.
Edwards, A. W. F. (1992). Likelihood (Expanded Edition). John Hopkins University
Press: London.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses
using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research
Methods, 41, 1149-1160.
Fisher, R. A. (1935). The design of experiments. Oliver and Boyd.
Freedman, L. S., & Spiegelhalter, D. J. (1983). The assessment of subjective opinion
and its use in relation to stopping rules for clinical trials. The Statistician, 32, 153-160.
Fu, Q., Bin, G., Dienes, Z., Fu, X., & Gao, X. (2013). Learning without consciously
knowing: Evidence from event-related potentials in sequence learning. Consciousness and
Cognition, 22, 22-34.
Fu, Q., Dienes, Z., Shang, J., & Fu, X. (2013). Who learns more? Cultural differences
in implicit sequence learning. PLoS ONE 8(8): e71625. doi:10.1371/journal.pone.0071625
Gallistel, C. R. (2009). The Importance of Proving the Null. Psychological Review,
116, 439–453.
Glover, S., & Dixon, P. (2004). Likelihood ratios: A simple and flexible statistic for
empirical psychologists. Psychonomic Bulletin & Review, 11, 791-806.
Goodman, S. N. (1999). Toward Evidence-Based Medical Statistics. 2: The Bayes
Factor. Annals of Internal Medicine, 130, 1005- 1013.
Guo, X., Jiang, S., Wang, H., Zhu, L., Tang, J., Dienes, Z., & Yang, Z. (2013).
Unconsciously Learning Task-irrelevant Perceptual Sequences. Consciousness and Cognition,
22, 203–211.
Guo, X., Li, F., Yang, Z., & Dienes, Z. (2013). Bidirectional transfer between
metaphorical related domains in Implicit learning of form-meaning connections. PLoS ONE,
8(7): e68100.
Greenwald, A. G. (1975). Consequences of Prejudice against the Null Hypothesis.
Psychological Bulletin, 82, 1 -20.
Greenwald, A. G. (1993). Consequences of prejudice against the null hypothesis. In
G. Keren & C. Lewis (Eds), A Handbook for data analysis in the behavioural sciences:
Methodological issues (pp 419 – 449). Lawrence Erlbaum : Hove.
Hodges, J., & Lehmann, E. (1954). Testing the Approximate Validity of Statistical
Hypotheses. Journal of the Royal Statistical Society. Series B (Methodological), 261-268.
Hoijtink, H., Klugkist, I., & Boelen, P.A. (Eds.). (2008).Bayesian evaluation of
informative hypotheses. New York: Springer.
Howard, G. S., Maxwell, S. E., & Fleming, K. J. (2000). The Proof of the Pudding:
An Illustration of the Relative Strengths of Null Hypothesis, Meta-Analysis, and Bayesian
Analysis. Psychological Methods, 5, 315-332.
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach (3rd
ed.). Chicago: Open Court.
Jackman, S. (2009). Bayesian Analyses for the Social Sciences. Wiley: Eastbourne.
Jaynes, E.T. (2003). Probability theory: The logic of science. Cambridge, England:
Cambridge University Press.
Jeffreys, H. (1939/1961). The theory of probability: First/Third edition. Oxford,
England: Oxford University Press.
Kirsch, I. (2010). The emperor’s new drugs: Exploding the antidepressant myth. New
York: Basic Books.
Kirsch, I. (2011). Suggestibility and suggestive modulation of the Stroop effect.
Consciousness and Cognition, 20, 335–336.
Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91, 1343-1370.
Kievit, R. A. (2011). Bayesians Caught Smuggling Priors Into Rotterdam Harbor.
Perspectives on Psychological Science, 6, 313.
Kiyokawa, S., Dienes, Z., Tanaka, D., Yamada, A., & Crowe, L. (2012). Cross
Cultural Differences in Unconscious Knowledge. Cognition, 124, 16-24.
Kruschke, J. K. (2010a). Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
Academic Press.
Kruschke, J.K. (2010b). Bayesian data analysis. Wiley Interdisciplinary Reviews:
Cognitive Science, 1, 658–676.
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation
and model comparison. Perspectives on Psychological Science, 6, 299-312.
Kruschke, J. K. (2013a). Posterior predictive checks can and should be Bayesian:
Comment on Gelman and Shalizi, ‘Philosophy and the practice of Bayesian statistics’. British
Journal of Mathematical and Statistical Psychology, 66, 45-56.
Kruschke, J. K. (2013b). Bayesian estimation supersedes the t test. Journal of
Experimental Psychology: General, 142, 573–603.
Kruschke, J. K. (2013c). How much of a Bayesian posterior distribution falls inside a
region of practical equivalence (ROPE).
http://doingbayesiandataanalysis.blogspot.co.uk/2013/08/how-much-of-bayesianposterior.html. Retrieved 9 August 2013, 12:30pm.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian Statistical Inference in
Psychology: Comment on Trafimow (2003). Psychological Review, 112, 662– 668.
Lee, M. D., & Wagenmakers, E. J. (2014). Bayesian Cognitive Modeling: A Practical
Course. Cambridge University Press: Cambridge.
Lewis, C. (1993). Analyzing means from repeated measures data. In G. Keren & C.
Lewis (Eds), A Handbook for Data Analysis for the Behavioural Sciences: Statistical Issues
(pp 73 – 94). Erlbaum: Hove.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103,
410–423
Lindley, D.V. (1957). A Statistical Paradox. Biometrika, 44, 187–192.
Lindley, D. V. (1972). Bayesian Statistics: A Review. Society for Industrial and
Applied Mathematics
Lockhart, R. S. (1998). Introduction to data analysis for the behavioural sciences. W.
W. Freeman and Company: New York.
Mayo, D. (1996). Error and the growth of experimental knowledge. Chicago:
University of Chicago Press.
McGrayne, S. B. (2012). The Theory That Would Not Die: How Bayes' Rule Cracked
the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two
Centuries of Controversy . Yale University Press.
Mclatchie, N. (2013). Feeling Impulsive, Thinking Prosocial: The Importance of
Distinguishing Guilty Feelings From Guilty Thoughts. Unpublished PhD thesis, University of
Kent.
Mealor, A. D., & Dienes, Z. (2012). Conscious and unconscious thought in artificial
grammar learning. Consciousness & Cognition, 21, 865-874.
Mealor, A. D., & Dienes, Z. (2013a). The speed of metacognition: Taking time to get
to know one’s structural knowledge. Consciousness & Cognition, 22, 123-136.
Mealor, A. D., & Dienes, Z. (2013b). Explicit feedback maintains implicit knowledge.
Consciousness & Cognition, 22, 822-832.
Morey, R. D., & Rouder J. N. (2011). Bayes factor approaches for testing interval
null hypotheses. Psychological Methods, 16, 406-419.
Newell, B. R., & Rakow, T. (2011). Revising beliefs about the merit Of Unconscious
Thought: Evidence In favor of the null hypothesis. Social Cognition, 29, 711–726.
Nuijten, M. B., Wetzels, R., Matzke, D., Dolan, C. V., & Wagenmakers, E.-J. (in
press). A default Bayesian hypothesis test for mediation. Behavior Research Methods.
Oakes, M. (1986). Statistical Inference: Commentary for the Social and Behavioural
Sciences. Wiley-Blackwell.
Parris, B. A., & Dienes, Z. (2013). Hypnotic suggestibility predicts the magnitude of
the imaginative word blindness suggestion effect in a non-hypnotic context. Consciousness &
Cognition, 22, 868-874.
Pashler, H., & Harris, C. R. (2012). Is the Replicability Crisis Overblown? Three
Arguments Examined. Perspectives on Psychological Science, 7, 531-536.
Pitt, M. A., Kim, W., & Myung, I. J. (2003). Flexibility vs generalizability in model
selection. Psychonomic Bulletin & Review, 10, 29-44.
Raz, A., Fan, J., & Posner, M. I. (2005). Hypnotic suggestion reduces conflict in the
human brain. Proceedings of the National Academy of Sciences, 102, 9978–9983.
Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to
evaluate equivalence between two experimental groups. Psychological Bulletin, 113, 553-565.
Rosenthal, R. (1993). Cumulating evidence. In G. Keren & C. Lewis (Eds), A
Handbook for data analysis in the behavioural sciences: Methodological issues (pp 519 –
559). Lawrence Erlbaum : Hove.
Rosenthal, R., Rosnow, R. L., & Rubin, D. R. (2000). Contrasts and effect sizes in
behavioural research: A correlational approach. Cambridge University Press: Cambridge.
Rouder, J. N. (2014). Optional Stopping: No Problem For Bayesians. Psychonomic
Bulletin & Review, 21,301-308.
Rouder, J. N., & Morey R. D. (2011). A Bayes-Factor Meta Analysis of Bem’s ESP
Claim. Psychonomic Bulletin & Review. 18, 682-689.
Rouder, J.N., Morey, R.D., Speckman, P.L., & Pratte, M.S. (2007). Detecting chance:
A solution to the null sensitivity problem in subliminal priming. Psychonomic Bulletin &
Review, 14, 597–605.
Rouder, J. N., Morey R. D., Speckman P. L., & Province J. M. (2012). Default Bayes
factors for ANOVA designs. Journal of Mathematical Psychology. 56, 356-374.
Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., & Iverson, G.(2009). Bayesian t
tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16,
225–237.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. London: Chapman & Hall.
Sagarin B. J., Ambler J. K., Lee E. M. (2014). An ethical approach to peeking at data.
Perspectives on Psychological Science, 9, 293–304.
Sanborn, A. N. & Hills, T. T. (2014). The frequentist implications of optional
stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review, 21, 283-300
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect
on the power of studies? Psychological Bulletin, 105, 309-316.
Semmens-Wheeler, R., Dienes, Z., & Duka, T. (2013). Alcohol Increases Hypnotic
Susceptibility. Consciousness & Cognition, 22, 1082–1091.
Serlin, R., & Lapsley, D. (1985). Rationality in Psychological Research: The Good-Enough Principle. The American Psychologist, 40, 73-83.
Shang, J., Fu, Q., Dienes, Z., Shao, C. &, Fu, X. (2013). Negative affect reduces
performance in implicit sequence learning. PLoS ONE 8(1): e54693.
doi:10.1371/journal.pone.0054693.
Simonsohn, U (submitted). Posterior-Hacking: Selective Reporting Invalidates
Bayesian Results Also. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2374040
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A Key To The File
Drawer. Journal of Experimental Psychology: General, 143, 534-547
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (forthcoming). P-Curve and Effect
Size: Correcting for Publication Bias Using Only Significant Results.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2377290
Smith, A. F. M., & Spiegelhalter, D. J. (1980). Bayes factors and choice criteria for
linear models. Journal of the Royal Statistical Society, Series B, 42, 213-220.
Smithson, M. (2003). Confidence intervals. Sage: London.
Song, M., Maniscalco, B., Koizumi, A., & Lau, H. (2013). A new method for
manipulating metacognitive awareness while keeping performance constant. Oral
presentation at the 17th Meeting of the Association for the Scientific Study of Consciousness,
San Diego, 12-15 July, 2013.
Stone, J. V. (2013). Bayes’ Rule: A tutorial introduction to Bayesian analysis. Sebtel Press.
Storm, L., Tressoldi, P. E., & Utts, J. (2013). Testing the Storm et al. (2010) Meta-Analysis Using Bayesian and Frequentist Approaches: Reply to Rouder et al. (2013).
Psychological Bulletin, 139, 248 –254.
Strube, M. J. (2006). SNOOP: a program for demonstrating the consequences of
premature and repeated null hypothesis testing. Behavioural Research Methods, 38, 24-7.
Tan, L. F., Dienes, Z., Jansari, A., & Goh, S. Y. (2014). Effect of Mindfulness Meditation on Brain-Computer Interface Performance. Consciousness and Cognition, 23, 12-21.
Tan, L.F., Jansari, A., Keng, S.L., & Goh, S.Y. (2009). Effect of mental training on
BCI performance. In J.A. Jacko (Ed.), Human-Computer Interaction, Part II, HCI 2009.
LNCS (Vol. 5611, pp. 632–635). Heidelberg: Springer.
Terhune, D. B., Wudarczyk, O. A., Kochuparampil, P., & Kadosh, R. C. (2013).
Enhanced dimension-specific visual working memory in grapheme-color synaesthesia.
Cognition, 129 , 123-137
Vanpaemel, W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes
factor. Journal of Mathematical Psychology, 54, 491–498.
Vanpaemel, W., & Lee, M. D. (2012). Using priors to formalize theory: Optimal
attention and the generalized context model. Psychonomic Bulletin & Review, 19,
1047-56.
Verhagen, A. J., & Wagenmakers, E.-J. (in press). Bayesian tests to quantify the result
of a replication attempt. Journal of Experimental Psychology: General.
Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p
values. Psychonomic Bulletin & Review, 14, 779-804.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R.
A. (2012). An Agenda for Purely Confirmatory Research. Perspectives on Psychological
Science, 7, 632–638.
Watson, J. C., Gordon, L. B., Stermac, L., Kalogerakos, F., & Steckley, P. (2003).
Comparing the Effectiveness of Process–Experiential With Cognitive–Behavioral
Psychotherapy in the Treatment of Depression. Journal of Consulting and Clinical
Psychology, 71, 773–781.
Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E. J. (2012). A default Bayesian
hypothesis test for ANOVA designs. The American Statistician, 66, 104-111.
Wetzels, R., Matzke, D., Lee, M.D., Rouder, J.N., Iverson, G.J., & Wagenmakers, E.J.
(2011). Statistical evidence in experimental psychology: An empirical comparison using 855
t tests. Perspectives on Psychological Science, 6, 291–298.
Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E.-J.(2009). How to
quantify support for and against the null hypothesis: A flexible WinBUGS implementation of
a default Bayesian t test. Psychonomic Bulletin & Review, 16, 752–760.
Wetzels, R., & Wagenmakers, E. J. (2012). A default Bayesian hypothesis test for
correlations and partial correlations. Psychonomic Bulletin & Review, 19, 1057–1064.
Wilcox, R. R. (2010). Fundamentals of modern statistical methods: Substantially
improving power and accuracy, 2nd edition. Springer: London.
Yu, E. C., Sprenger, A. M., Thomas, R. P., & Dougherty, M. R. (2014). When
decision heuristics and science collide. Psychonomic Bulletin & Review, 21, 268–282
Yuan, Y., & MacKinnon, D. P. (2009). Bayesian Mediation Analysis. Psychological Methods,
14, 301–322.
Appendix 1 Further examples
1. Trend analysis and other contrasts. Parris and Dienes (2013) investigated whether
the strength of a perceptual response (measured in milliseconds) to an imaginative suggestion
in a non-hypnotic context varied with hypnotisability (measured on a 0 to 12 scale). Hypnosis
researchers often classify people scoring 0-3 as lows, 4-7 as mediums, and 8-12 as highs. Parris
and Dienes recruited an equal number of lows, mediums and highs. A key question is the
shape of the relation between hypnotisability and performance on the non-hypnotic task. On
the one hand, it could be simply linear. On the other hand, it may be that highs are an unusual
group, with mediums performing the same as lows; or that lows are an unusual group with
mediums performing the same as highs (a question posed by Kirsch, 2011). The question can
be answered by whether, in addition to a linear effect of hypnotisability on the task, there is a
quadratic effect. Parris and Dienes regressed performance (in milliseconds) on
hypnotisability (as a 0-12 scale) and obtained a raw slope of b = 6 milliseconds/unit
hypnotisability, t(33) = 2.261, p = .030. To obtain the quadratic effect, the hypnotisability
score was squared and centred on zero. Performance was regressed on both hypnotisability
and hypnotisability squared. The slope for the quadratic effect was b = .530 milliseconds/unit
hypnotisability², t(33) = .589, p = 0.56. Does the non-significant quadratic effect indicate
that performance varies only linearly with hypnotisability?
The maximum plausible score of mediums is that of highs; thus, the minimum
plausible quadratic effect would occur when mediums scored the same as highs, with lows
scoring differently as a special group. Conversely, the maximum plausible quadratic effect
would occur when mediums scored the same as lows, with highs scoring differently as a
special group. To determine values for these extremes, first the mean performance for highs
was found, and this mean set as the same performance for all mediums and highs; similarly,
the mean for lows was set as the performance for all lows. This defined an idealized set of
population values that were extreme in degree of quadratic effect. With this set of values
treated as real data, the quadratic regression slope was -0.673. Similarly, another hypothetical set of population values was obtained by setting the mean of the lows as the same
score for all lows and mediums; the highs all had the mean of the highs. With these values
treated as data, the mean quadratic regression slope was 0.908. Thus, a Bayes factor can be
determined assuming a uniform from -0.673 to +0.908. The mean for the data (i.e. in this case,
the quadratic slope) is .53. It has a standard error of .53/.589 = 0.9. This gives a Bayes factor
of 0.97. That is, the data were insensitive and provide no evidence either for or against the
quadratic hypothesis. It would be wrong to use the non-significant quadratic effect to infer
that mediums lay mid-way between highs and lows.
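Given the two extreme slopes, the Bayes factor itself is computed in the same way as in the earlier examples, except that the uniform now runs from a negative to a positive value; a minimal sketch (assuming a normal likelihood, not the Dienes, 2008, calculator itself) gives approximately the B = 0.97 just quoted:

```python
# Sketch: quadratic contrast of 0.53 (SE 0.9), H1 a uniform from -0.673 to 0.908.
from math import erf, exp, pi, sqrt

def npdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def ncdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

def bf_uniform(mean, se, lo, hi):
    like_h1 = (ncdf(hi, mean, se) - ncdf(lo, mean, se)) / (hi - lo)
    return like_h1 / npdf(mean, 0, se)

print(round(bf_uniform(0.53, 0.9, -0.673, 0.908), 2))  # approximately 0.97
```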
The principle for constructing limits for a uniform can be generalized. For a contrast,
use values of the dependent variable in each condition that would represent extremes which
are plausible given the scientific context. Coefficients found using the hypothetical scores as
real data determine the plausible extremes of the contrast and can be used in a uniform to
represent the alternative.
2. Equating one variable to test the effect of another. Song, Maniscalco, Koizumi,
and Lau (2013) wished to investigate the function of confidence when performance was
controlled. Thus, they found a way to create stimuli for which performance (d’) was
calibrated to be similar and yet confidence differed. We will call this performance single task
d’ (for reasons that will become clear). (In what follows the numbers have been made up, so
as not to jeopardise the publication of the authors’ paper.) On a 1-4 confidence scale, the
confidence for the two conditions was 2.21 and 2.41, t(35) = 4.04, p = .0001. The
corresponding single task d’s for the two conditions were 1.83 and 1.93, t(35) = 0.96, p =
0.34. The question is, can equivalence be claimed for single task d’ across the two conditions?
To allow comparability between confidence and single task d’, standardized effect
sizes might seem useful. Then, it could be argued, the sort of effect size we might expect to
see in single task d’, if there were an effect, would be in the same ballpark as the standardized effect size found for confidence. We will use Pearson’s r as the standardized effect size, where r² = t²/(t² + df). Thus, for confidence r = √(4.04²/(35 + 4.04²)) = .56, and for d’, r = √(0.96²/(35 + 0.96²)) = .16. Pearson’s r can be transformed to a normal variable with Fisher’s z = 0.5*ln((1 + r)/(1 - r)), which has standard error 1/√(df - 1). Thus, Fisher’s z for confidence = 0.64, and for d’ Fisher’s z = 0.16. The standard error for the
Fisher’s z for d’ is 0.17. Thus, a Bayes factor can be determined for a mean of 0.16, SE =
0.17, and with the alternative represented as a half-normal with standard deviation equal to
0.64. In this case, B = 0.64. That is, the data appear insensitive for establishing equivalence
as B is between 3 and 1/3.
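For readers who want to check such numbers themselves, here is a hedged sketch of the Fisher's z transformation and the half-normal Bayes factor just described, again assuming Python with numpy and scipy rather than the online calculator.

```python
import math
import numpy as np
from scipy.stats import norm

def bf_halfnormal(mean, se, sd, n=20000):
    """B for a point null at 0 vs H1 = half-normal (mode 0, scale sd), with a normal likelihood."""
    theta = np.linspace(0, 10 * sd + 10 * se, n)
    dx = theta[1] - theta[0]
    p_h1 = np.sum(2 * norm.pdf(theta, 0, sd) * norm.pdf(mean, theta, se)) * dx
    return p_h1 / norm.pdf(mean, 0, se)

def fisher_z(t, df):
    """Pearson's r implied by t and df, transformed to Fisher's z, with its standard error."""
    r = math.sqrt(t ** 2 / (t ** 2 + df))
    return 0.5 * math.log((1 + r) / (1 - r)), 1 / math.sqrt(df - 1)

z_conf, _ = fisher_z(4.04, 35)                   # ≈ 0.64, used as the SD of the half-normal
z_dprime, se_z = fisher_z(0.96, 35)              # ≈ 0.16, with SE ≈ 0.17
print(bf_halfnormal(z_dprime, se_z, z_conf))     # ≈ 0.64: the data are insensitive
```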
However, the use of standardized effect sizes in this context doesn’t address the
question of interest. A standardized effect size reflects the consistency with which an effect is
shown, and thus how noisily it is measured (Baguley, 2009). One could make the
measurement of performance noisy simply by using few trials. (Theoretically irrelevant
differences in noise between measures occur for reasons other than just number of trials.
Some measures are simply noisier than others: For example, confidence has a restricted range,
d’ an unrestricted range.) By using a sufficiently small number of trials for performance, the
population r for the difference between conditions in single task d’ could be made very small.
If confidence was measured on a different large number of trials, its population r could be
made large. That is, a researcher making the measure of single task d’ as insensitive as
possible would help establish equivalence in single task d’ between conditions. This is also
true for testing equivalence merely by showing a non-significant result. That is, in this case neither significance testing nor a Bayes factor on standardized effect sizes addresses the real issue, because they are sensitive to factors that are theoretically irrelevant (and could
therefore be cynically manipulated by the experimenter). To address the actual theoretical
concerns of the researchers, raw effect sizes should be used.
But, how can one use a difference in confidence units to set a difference in d’ units?
The researchers were interested in the extent to which confidence predicted a third outcome
variable, specifically, discrimination on a task that involved integrating over many stimuli.
Call this multi-task d’. That is, the most important issue was given that the difference in
confidence was associated with a change in multi-task d’, could the difference in single task
d’s be large enough to explain the change in multi-task d’? Given this theoretical agenda, we
can calibrate confidence and single task d’ in the following way. The difference in
confidence was associated with a significant difference in multi-task d’ in their experiment
(in fact, this was the main result of the study). Let us say multi-task d’ differed by 0.6 units.
So, let us say the researchers run another study in which single task d’ was allowed to vary
over a range from say 1.5 to 2.5, and multi-task d’ was measured. Regress single task d’ on
multi-task d’. Let us say the raw regression slope is 1.5 single d’ units/multi d’ units. Thus,
the change in single task d’ that corresponds to 0.6 units change in multi-task d’ is 1.5*0.6 =
0.9. Now, we can use that estimated relevant change in single task d’ as the SD of a half-normal to represent a meaningful alternative to contrast with the null hypothesis of no
difference in single task d’. For the Bayes factor calculator, the “mean” is 1.93-1.83 = 0.10
single task d’ units, the standard error is 0.10/0.96 = 0.10 single task d’ units, and we can use
a half-normal with an SD of 0.90 single task d’ units. In this case, B = 0.31, indicating we
can sensitively accept the claim of equivalence.
This method finds the raw difference in d’ in a non-arbitrary way relevant to the
scientific context. Further, there is no clear incentive to measure single task d’ insensitively.
Making the measurement of single task d’ noisier (in an unbiased way) does not affect the
expected raw regression slope of single task d’ against multi task d’. It just makes it harder to
make sensitive claims when the Bayes factor on single task d’ is performed. (And making the
measurement of multi-task d’ insensitive will reduce the slope, making equivalence claims
harder.) A correct analysis and inference procedure will never be harmed by making data
more sensitive.
Now let’s consider another possible method. Without knowledge of the slope of single task d’ against multi-task d’, one could calibrate raw effect sizes by regressing single task d’ onto confidence and finding the raw slope. The difference in confidence was 2.41 - 2.21 = 0.2 confidence units. The raw slope was 0.45 d’ units/confidence unit, so the equivalent of 0.2 confidence units is 0.2*0.45 = 0.09 single task d’ units. Now, unlike with
standardised effect sizes, making the measurement of d’ insensitive also does not help
researchers trying to establish equivalence. As for the second method above, making the
measurement of single task d’ more noisy (in an unbiased way) does not affect the raw
regression slope. It just makes it harder to make sensitive claims. For the Bayes factor
calculator, the “mean” is 1.93-1.83 = 0.10 single task d’ units, the standard error is 0.10/0.96
= 0.10, and we can use a half-normal with an SD of 0.09. In this case, B = 1.39, indicating
insensitivity. We come to a different conclusion. So which method should be preferred, the
second or the third? In fact, the second is most relevant in this scientific context. The third
method assumes that the researchers were interested in differences in d’ only to the extent
they were predicted by confidence - but the real point is the function of confidence itself in
predicting multi-task d’. Note something peculiar about the third method: The smaller the
relation between single task d’ and confidence, the smaller the estimate of single task d’ we
are trying to pick up, and the harder it is to make claims of equivalence. But the extent of
equivalence in single task d’ should be the same however confidence and single task d’ were
correlated. Thus, the second method gets to the heart of the matter, not the third. The right
statistical analysis depends on understanding the scientific problem. Canned statistical
solutions are not the solution for statistics.
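The contrast between the second and third methods can be reproduced with the following sketch (Python with numpy and scipy, standing in for the online calculator; the slopes of 1.5 and 0.45 are the made-up values used above).

```python
import numpy as np
from scipy.stats import norm

def bf_halfnormal(mean, se, sd, n=20000):
    """B for a point null at 0 vs H1 = half-normal (mode 0, scale sd), normal likelihood."""
    theta = np.linspace(0, 10 * sd + 10 * se, n)
    dx = theta[1] - theta[0]
    p_h1 = np.sum(2 * norm.pdf(theta, 0, sd) * norm.pdf(mean, theta, se)) * dx
    return p_h1 / norm.pdf(mean, 0, se)

mean_d = 1.93 - 1.83        # raw difference in single task d'
se_d = 0.10                 # 0.10 / 0.96, rounded as in the text

sd_method2 = 1.5 * 0.6      # calibrate via the (made-up) slope of single on multi-task d': 0.9
sd_method3 = 0.45 * 0.2     # calibrate via the slope of single task d' on confidence: 0.09

print(bf_halfnormal(mean_d, se_d, sd_method2))   # ≈ 0.3, close to the 0.31 above: equivalence supported
print(bf_halfnormal(mean_d, se_d, sd_method3))   # ≈ 1.39: insensitive
```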
Up to now, we have talked about whether the claim of equivalence in single task d’
can be asserted. Whether or not the data are sensitive enough to support that claim, the claim
that confidence predicts multi-task d’ without being fully mediated by single task d’ can also be independently tested by mediation analyses. In general, the questions of whether mediation is complete and of whether there is no mediation at all are both claims that involve potentially accepting null hypotheses and thus (without specifying minimally interesting
values) can only be answered by use of Bayes factors (see Semmens-Wheeler et al., 2013, for
how to use the Dienes, 2008, Bayes calculator to give a quick and dirty answer (i.e. one that
tests the product of regression slopes as if it satisfied the assumption of normality); Dienes, in
preparation, for code that tests the null hypotheses in mediation analyses assuming only that
estimates of individual raw regression slopes are normally distributed; and Nuijten, Wetzels,
Matzke, Dolan, & Wagenmakers, submitted, for a different solution for obtaining Bayes
factors for mediation analyses using standardized slopes.) By contrast, significance testing,
including significance testing using Bayesian machinery to get the p-values (e.g. Yuan &
MacKinnon, 2009), can obtain evidence for partial mediation but cannot in itself (without
specification of minimally interesting values) allow assertion of either complete or no
mediation. Only a truly Bayesian mediation analysis can achieve the latter (presented in
Dienes, in preparation, and Nuijten et al., submitted).
3. Meta-analysis. Bayes factors can be used in meta-analysis to gain all the
advantages of their use in single studies. Their use is most straightforward when the same
paradigm has been repeatedly used. In this case, find the overall meta-analytic raw mean and overall meta-analytic standard error, and use these in the Bayes factor calculator. (The overall mean and standard error can be obtained with the software on the website for Dienes, 2008, for finding a posterior from a prior and a likelihood: enter the first study summary as the prior and the second study summary as the likelihood to obtain a posterior mean and standard error; then use that posterior as the prior for the next study, entered as the likelihood, and continue until all studies have been entered.) If
standardized effect sizes are used to compare across studies with different paradigms and
dependent variables, Pearson’s r can be used as the standardized effect size, as r can be
normalized with Fisher’s z, with known standard error, and thus the Dienes (2008) calculator
used. Bayes factors can in principle be multiplied together to accumulate evidence. However,
in practice Bayes factors from single studies should not in general be multiplied together to
get an overall Bayes factor, because this procedure does not take into account that initial
studies should be used to update the alternative hypotheses of subsequent studies as they are
added to the calculation in order to respect the axioms of probability (Jaynes, 2003). Thus, it
is easier to first summarize the data and then determine the single Bayes factor on the
summary, unless special precautions are taken (cf Rouder & Morey, 2011; Storm, Tressoldi,
& Utts, 2013). (For point alternative hypotheses as used in the likelihood school of inference,
e.g. Royall, 1997, the problem does not arise because if the alternative says only one point
value is possible (e.g. the hypothesis that the population difference in means is exactly 10
seconds), the distribution representing the alternative – which is just a spike at the point value
postulated - remains the same whatever the previous studies. Thus, in that case, Bayes
factors can be un-problematically multiplied across single studies.)
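If the prior and likelihood are both treated as normal, the updating procedure described above for obtaining an overall mean and standard error amounts to inverse-variance weighting of the study estimates. A minimal sketch, with made-up study summaries, follows.

```python
def pool(prior_mean, prior_se, like_mean, like_se):
    """Posterior mean and SE from a normal prior and a normal likelihood."""
    w1, w2 = 1 / prior_se ** 2, 1 / like_se ** 2          # precisions
    return (w1 * prior_mean + w2 * like_mean) / (w1 + w2), (w1 + w2) ** -0.5

# Made-up study summaries (raw mean difference, standard error), purely for illustration
studies = [(5.0, 2.0), (3.0, 2.5), (6.0, 3.0)]
mean, se = studies[0]
for m, s in studies[1:]:                                  # posterior of one step = prior of the next
    mean, se = pool(mean, se, m, s)
print(mean, se)   # overall meta-analytic mean and SE, ready for the Bayes factor calculator
```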
4. Correlations and regression slopes. While correlations themselves are not normally distributed, they can be made normal by a transformation, Fisher’s z = 0.5*ln((1 + r)/(1 - r)), which has standard error 1/√(df - 1). Thus, designs using correlations can make use of the Dienes
(2008) Bayes calculator (see also Wetzels & Wagenmakers, 2012, for a default Bayesian
analyses of correlations). However, in many cases, it is the raw regression slopes that really
address theoretical concerns. Correlations may become reduced by a restriction of range or
increased error in measurement of the dependent variable, but these factors do not affect the
raw regression slope (which is nonetheless affected by error in measurement of the
independent variable, as is a correlation). Raw linear regression slopes can be used with the
Dienes (2008) calculator, just bearing in mind they have a standard error given by: raw
slope/t.
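As a small illustration (the numbers are made up), the quantities needed for the calculator can be obtained as follows.

```python
import math

# Hypothetical values, assumed purely for illustration: a correlation of r = .30 based on
# df = 40, and a raw regression slope of 0.8 units with t = 2.0.
r, df = 0.30, 40
fisher_z = 0.5 * math.log((1 + r) / (1 - r))   # ≈ 0.31, approximately normally distributed
se_z = 1 / math.sqrt(df - 1)                   # ≈ 0.16
slope, t = 0.8, 2.0
se_slope = slope / t                           # = 0.4; the slope and its SE go into the calculator
print(fisher_z, se_z, slope, se_slope)
```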
5. Contingency tables. The following example illustrates the points:
1. One can interpret complex contingency tables by reducing to 1-df contrasts;
2. Researchers should be weaned off chi square tests of independence, which serve a
purpose only for generating p-values and hence are essentially uninformative for non-significant results. Instead, one wants a measure of strength of association with known
distribution and standard error;
3. Ln OR (the natural log of the odds ratio) expresses strength of association and is
normally distributed with known standard error;
4. Thus, having obtained Ln OR, one can play. One can compare strengths of
association in different studies or conditions, put confidence intervals on the strength of
association, and calculate Bayes factors to illuminate non-significant results.
McLatchie (2013) primed people in ways meant to cultivate guilty thoughts or guilty
feelings. People subsequently given a sum of money then chose to either use it immediately
as is, or to use it after a delay, when it would be increased; further, the use could be either to
keep the money oneself, or else donate it to a good cause. One hypothesis was that guilty
thoughts should increase pro-social behaviour (donating rather than keeping); another was
that guilty feelings would increase impulsivity (using now rather than later). In Study 1, the data shown in Table 4(a) were obtained.
Table 4
Data from Study 1 on Guilt Priming from McLatchie (2013)
(a) Full Data Set
                 Guilty Feelings    Guilty Thoughts    Control
Keep Now                6                  3              3
Keep Delay              2                  2             12
Donate Now              7                  1              1
Donate Delay            5                 14              4
(b) A One-degree of Freedom Contrast
           Guilty Thoughts    Control
Keep              5              15
Donate           15               5
The data can be collapsed in a number of ways to answer specific questions. For
example, to test if guilty thoughts increase pro-social action, a collapsed version of the data is
shown in Table 4(b). This yields a 2×2 Chi square test of independence of χ²(1) = 7.69, p = .0056. χ² does not directly express a magnitude of effect. One way of expressing the effect
size is as an odds ratio: OR = (15*15)/(5*5) = 9.0. The OR is the product of cell counts
along one diagonal, divided by the product along the other. If there is no association, the population OR = 1. We put on top the diagonal that should be larger according to the theory, if the theory makes a directional prediction (as it does here). The natural log of the
odds ratio, Ln (OR), is normally distributed with a squared standard error given by 1/A + 1/B
+ 1/C + 1/D, where A, B, C, D are the four cell entries. So in this case, standard error = √(1/5 + 1/15 + 1/15 + 1/5) = 0.73, and Ln OR = Ln 9.0 = 2.20. We can test the association with a z-test: z = 2.20/0.73 = 3.01, p = .0026. (To obtain two-tailed p-values for z, apart from using R, one can use online calculators, such as http://davidmlane.com/hyperstat/z_table.html: click on “outside” and enter -z and +z in the boxes.)
In Study 2 using the same design but different methods of priming, McLatchie
obtained the data in Table 5(a), with the question of current interest shown in Table 5(b).
Table 5
Data from Study 2 on Guilt Priming from McLatchie (2013)
(a) Full Data Set
                    Keep Now    Keep Delay    Donate Now    Donate Delay
Guilty Feelings         6            4             6              4
Guilty Thoughts         1            7             3              9
Control                 2           14             7             17
(b) A One-degree of Freedom Contrast
           Guilty Thoughts    Control
Keep              8              16
Donate           12              24
Now the data look rather different: χ²(1) = 0, p = 1.00. More importantly, Ln OR = 0 with a standard error of 0.56. (So z = 0, p = 1.00.) Is the pattern significantly different in the second study? The Ln ORs differ by 2.20 – 0 = 2.20. The difference has a standard error given by the variance sum rule, namely the squared standard error of the difference is the sum of the squared standard errors. Thus, standard error of difference = √(0.73² + 0.56²) = 0.92. Hence, to test the difference, z = 2.20/0.92 = 2.39, p = .017. Thus, the second study is a failure to replicate: the association is significantly weaker in the second study than in the first. Has the
association been reduced to zero in the second study? The sort of effect size one might expect
in the second study is provided by that in the first. Thus we can represent the alternative as a
half-normal with SD = 2.20. The “mean” is 0, and the standard error is 0.56. This yields a B
of 0.24, indicating substantial evidence for the null hypothesis of no association in the second
study.
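The whole contingency-table workflow, from Ln OR through the difference test to the Bayes factor for Study 2, can be sketched as follows (Python with numpy and scipy; the half-normal integration stands in for the online calculator).

```python
import numpy as np
from scipy.stats import norm

# Study 1 contrast (Table 4b): the diagonal predicted to be larger goes on top
lor1 = np.log((15 * 15) / (5 * 5))                    # Ln OR ≈ 2.20
se1 = np.sqrt(1/5 + 1/15 + 1/15 + 1/5)                # ≈ 0.73
# Study 2 contrast (Table 5b)
lor2 = np.log((12 * 16) / (8 * 24))                   # Ln OR = 0
se2 = np.sqrt(1/8 + 1/16 + 1/12 + 1/24)               # ≈ 0.56

z_diff = (lor1 - lor2) / np.sqrt(se1**2 + se2**2)     # ≈ 2.39: the two studies differ

# B for Study 2, with H1 a half-normal whose SD is the Study 1 effect (≈ 2.20)
theta = np.linspace(0, 25, 20000)
dx = theta[1] - theta[0]
p_h1 = np.sum(2 * norm.pdf(theta, 0, lor1) * norm.pdf(lor2, theta, se2)) * dx
print(p_h1 / norm.pdf(lor2, 0, se2))                  # ≈ 0.25, close to the 0.24 reported above
```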
6. Comparing different theories. A Bayes factor compares theory 1 with theory 2.
Call a Bayes factor comparing theory 1 with the null hypothesis B1/0. Call a Bayes factor
comparing theory 2 with the null hypothesis B2/0. Then, the Bayes factor comparing theory 1
with theory 2, B1/2 = B1/0 / B2/0 (Dienes, 2008). Thus, the Dienes (2008) Bayes factor
calculator can be used to compare two substantial theories, or to compare a substantial theory
to a null region hypothesis rather than a point null. For example, a uniform or a normal
defining a null region could be used (cf Morey & Rouder, 2011, who provide interesting
discussion of the properties of such Bayes factors).
It is sometimes argued that a point null could never be really true (Cohen, 1994). But,
it can be true to an astonishing degree of accuracy, given the scale of resolution of the data
(Rouder et al, 2009). In this context, Baguley (2012, p. 369) cites a study on ESP with over
27,000 participants; the confidence interval for the proportion correct was [.496, .502], where
0.5 is the chance baseline. Thus, with well-controlled and counterbalanced experiments, using a point null hypothesis is not absurd, especially if a null region hypothesis is hard to
specify. Nonetheless, null region hypotheses can also be entirely valid, for example where
null regions are already decided in a literature (e.g. regarding changes of less than 3 units on
the Hamilton depression scale as clinically uninteresting).
A Bayes factor could also be used to compare the hypothesis that the change was an increase against the hypothesis that it was a decrease, e.g. by using a half-normal with SD = the expected effect size in each direction. For example, for a mean difference of 5 seconds and an expected effect size, should it exist, of 10 seconds in either direction, first run the calculator with a half-normal with SD = 10 and data mean = 5 (to obtain B1/0); then run it again with the data mean set to -5 (to obtain B2/0). Dividing the two Bayes factors gives a Bayes factor comparing the two theories of change in each direction (B1/2). Bear in mind that if the null hypothesis of zero change is true, B1/2 is not driven in any particular direction as data are collected but performs a random walk. Thus, sooner or later it will provide strong evidence for one of the two directional theories.
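A sketch of this directional comparison follows; the standard error of 4 seconds is assumed purely for illustration, since the example above does not specify one.

```python
import numpy as np
from scipy.stats import norm

def bf_halfnormal(mean, se, sd, n=20000):
    """B for a point null at 0 vs H1 = half-normal (mode 0, scale sd), normal likelihood."""
    theta = np.linspace(0, 10 * sd + 10 * se, n)
    dx = theta[1] - theta[0]
    p_h1 = np.sum(2 * norm.pdf(theta, 0, sd) * norm.pdf(mean, theta, se)) * dx
    return p_h1 / norm.pdf(mean, 0, se)

mean_diff, se = 5.0, 4.0       # 5-second difference; the SE of 4 s is assumed for illustration
b_increase = bf_halfnormal(mean_diff, se, 10)    # B1/0: evidence for an increase over the null
b_decrease = bf_halfnormal(-mean_diff, se, 10)   # B2/0: evidence for a decrease over the null
print(b_increase / b_decrease)                   # B1/2, comparing the two directional theories
```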
Appendix 2 Alternative approaches for indicating relative evidence.
Likelihood inference: An alternative method to specifying a range of values predicted
by the theory is to specify a single value. The likelihood school of inference recommends
Bayes factors contrasting hypotheses of point values (Dienes, 2008; Royall, 1997). For
example, one could compare the null hypothesis that the difference is 0 to an alternative that
the difference is 10. In this way, the need to specify a distribution over population values is
obviated. The Bayes factor will, in the limit as increasing observations are collected, come to
favour the hypothesis that is closer to the truth. Thus, if the true population value is less than
5, eventually the null hypothesis will be supported over the alternative. If the population
value is more than 5, the alternative will come to be supported. But, this means that 5 comes
to function as a minimally interesting value. Thus, if one were to use point hypotheses in
one’s Bayes factor, one should decide on the minimally interesting value, m, and use an
alternative point value of 2m.
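With point hypotheses the Bayes factor reduces to a ratio of two likelihood heights; a minimal sketch with made-up numbers:

```python
from scipy.stats import norm

# Made-up numbers for illustration: observed difference of 4 s with a standard error of 3 s.
mean, se = 4.0, 3.0
b = norm.pdf(mean, 10, se) / norm.pdf(mean, 0, se)   # H1: exactly 10 s, vs H0: exactly 0 s
print(b)   # < 1 here, because the observed 4 s lies closer to 0 than to 10
```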
In likelihood inference, there is no need to restrict oneself to a single Bayes factor.
One could consider the relative strength of evidence for the full range of possible population
values and thus construct a likelihood interval (Royall, 1997; see the Dienes, 2008, associated
website for likelihood interval calculators). In this case, the rules of inference by interval
apply, as shown in Figure 2. Thus, in general, likelihood inference contrasts with full
Bayesian inference in requiring a specification of the minimally interesting effect size in
order to connect data to theory.
BIC, AIC, DIC, etc. A range of model comparison techniques have been developed, often based on rather different principles, but all comparing the evidence for two models in a way that can be decomposed in terms of each model’s degree of fit to the data, corrected by the
number of free parameters in the model if this differs between models (see Baguley, 2012,
chapter 11; Burnham & Anderson, 2002; Glover & Dixon, 2004; Wagenmakers, 2007). The
main contrast with the approach illustrated in this paper is that such methods, when used for
null hypothesis testing (which is not their only use), amount to assuming a vague default
alternative hypothesis to compare against the null. This interpretation is most clear for BIC.
(AIC is a measure of relative strength of evidence with in effect a default alternative for null
hypothesis testing, one somewhat less vague than for BIC; Smith & Spiegelhalter, 1980.)
Wagenmakers (2007) showed that BIC approximates a Bayes factor that uses a “unit
information prior” for the alternative (see Rouder et al., 2009, for further development of this
specification to define vague defaults for Bayes factors). That is, the alternative is specified
as a normal with the same mean as the mean of the data and a standard deviation equal to the
standard deviation of the data. This can be justified on the grounds that if one had any
accurate prior information at all, it would tend to specify the same mean as the data do, even
if vaguely. Further, assuming the prior information was worth only one observation of data
(hence “unit information”), the standard error of that information would be the standard
deviation of the population data divided by square root of N, the number of observations –
and N is just one. (Technically, one cannot use that very aspect of the data one is trying to
predict – say its mean – to specify the alternative predicting the same data, as this involves
“double counting”, Jaynes 2003; however, the vagueness of the unit information prior renders
the technical point harmless. One can also just set the mean of the alternative to zero to
strictly deal with this point.) The BIC – or, correspondingly, the unit information prior used
to specify the alternative with the Dienes (2008) Bayes calculator – may be worth considering
if one finds it difficult to specify the alternative, or wishes to test a vast number of
alternatives in a data mining operation. But, the responsibility still rests with the researcher
to consider seriously whether the unit information prior matches theoretical expectations. A
vague prior makes finding evidence for the null easy (Kruschke, 2013b).
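As a rough illustration of what a unit-information-style alternative amounts to when working in raw units (made-up numbers; this simple version is neither Rouder et al.'s scaled BUI reported in Appendix 3 nor an exact BIC approximation):

```python
import numpy as np
from scipy.stats import norm

def bf_normal(mean, se, h1_mean, h1_sd, n=40000):
    """B for a point null at 0 vs H1 = normal(h1_mean, h1_sd), with a normal likelihood."""
    theta = np.linspace(h1_mean - 8 * h1_sd, h1_mean + 8 * h1_sd, n)
    dx = theta[1] - theta[0]
    p_h1 = np.sum(norm.pdf(theta, h1_mean, h1_sd) * norm.pdf(mean, theta, se)) * dx
    return p_h1 / norm.pdf(mean, 0, se)

# Made-up study: sample mean 6, sample SD 15, N = 36, so SE = 2.5
m, sd, n_obs = 6.0, 15.0, 36
se = sd / np.sqrt(n_obs)
print(bf_normal(m, se, h1_mean=m, h1_sd=sd))     # alternative centred on the sample mean
print(bf_normal(m, se, h1_mean=0.0, h1_sd=sd))   # variant with the alternative centred on zero
```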
Consider trying to find evidence for unconscious processing by showing null
performance on a test of awareness. It would seem prudent not to use a vague default, if only
on the grounds that one wants to make the demonstration of unconscious knowledge
convincing, and hence not especially easy to obtain. In fact, more importantly, Dienes (in
press) shows in this situation that there are informative specifications defining the extent to
which one expects conscious knowledge, in raw units, rendering defaults irrelevant in that
scientific context.
Every case has to be considered on its own merits as to whether the specification of
the alternative is relevant (Kruschke, 2013b; Vanpaemel, 2010). In that sense, true defaults
(i.e. those that could be used at any time to avoid consideration of scientific context) do not
actually exist in real scientific contexts. A Bayes factor is a means for comparing two theories.
In the absence of theory, placing credibility or likelihood intervals around parameter
estimates (without accepting or rejecting any null hypothesis) lets the data speak for
themselves (Kruschke, 2013b).
Appendix 3 Robustness checking
The Bayes factor is sensitive to how vague H1 is specified as being; the vaguer H1,
the more the data will support H0. The vagueness of H1 depends mainly on the rough
maximum value allowed; however, in detail B can depend on the exact shape of the specified
distribution as well. This raises a question: There may be a number of ways of specifying
essentially the same scientific judgments. For example, the scientific judgment for H1 might
simply be that the maximum population difference cannot be more than about 8. The exact
shape for the distribution is not further specified; assigning a specific shape then adds more
precision than the science in the situation actually provides. If B were sensitive to the shape
beyond the specification of the maximum, the extent to which that B was relevant to the
scientific theory could be questioned. On the other hand, if the decision based on B were
robust to major shape changes in simple distributions given the same maximum, then B
shows its relevance to evaluating the theory.
Here we consider several Bayes factors, subscripted to indicate the distributions used
to represent H1 (thanks to Wolf Vanpaemel for this notational suggestion). BN(m,sd) means a
normal was used with mean m and standard deviation sd; BH(0, sd) indicates a half-normal was
used with standard deviation sd; BU[min,max] indicates a uniform was used between a minimum
of min and a maximum of max. Letting a rough maximum for a normal correspond to two
standard deviations out, BN, BH and BU can then be set to respect the same scientific
intuition concerning the maximum population difference.
The Rouder Bayes factors are also shown below. BJZS(r) is the JZS B with scaling
factor r, and BUI(r) uses the unit information prior with scaling factor r. BJZS and BUI can be
taken to represent theories about Cohen’s d rather than about raw mean differences so may
produce somewhat different answers than BN, BH or BU.
Consider a study where N = 30, the mean is 5.5 units above a null value of 0, SD =
13.7, so SE = 2.5. A predicted effect of about 5 units can be argued for; or a maximum of 10
units. The Rouder B’s were scaled with a predicted Cohen’s d of 5/13.7 = 0.36. The first two
rows of Table 6 illustrate a significant effect, and the BN and BU just tip over 3, as expected.
Notice the distributions for the predictions of H1 are centred on 0, so the scientific theory in
question predicts effects in either direction. In these cases it doesn’t matter if the distribution
is peaked in the middle (a normal) or flat (a uniform); the same conclusions follow. The
Rouder B’s, scaled according to the Cohen’s d of the expected effect size, produce somewhat
less evidence for H1 than the corresponding BN and BU: BJZS and BUI represent theories about
Cohen’s d rather than about raw mean differences. BH is rather higher than the others in the
first row; this is because it represents a different scientific theory, namely a theory that
predicts effects in only one direction. It is desirable that B may be different when different
scientific theories are represented. The second row illustrates how a directional theory is
penalized when the mean goes in the wrong direction.
Table 6
Different Specifications of H1

Mean    SE     BH(0, 5)   BN(0, 5)   BU[-10,10]   BJZS(0.36)   BUI(0.36)
 5.5    2.5      6.05       3.10        3.40         1.99         2.77
-5.5    2.5      0.15       3.10        3.40         1.99         2.77
 0      2.5      0.45       0.45        0.31         0.34         0.45
Table 7
B’s for a Directional Theory

Mean    SE     BH(0, 5)   BN(5, 2.5)   BU[0,10]
 5      2.5      4.27        5.22         4.42
-5      2.5      0.16        0.10         0.11
 6      2.5      8.81       12.10        10.46
-6      2.5      0.14        0.10         0.09
 0      2.5      0.45        0.26         0.31
Table 7 presents results for three different ways of representing a directional theory.
As can be seen, the different distributions produce similar B’s, no matter whether the
distribution is flat, peaked in the middle or pushed to one end. Thus, conclusions based on
these Bs will often be robust. In any given case, one needs to check the robustness of the
conclusions. When evidence is near a threshold, different representations of H1 might tip
either side of it (though bear in mind that the threshold is itself not meaningful for Bayes, it is
just a convention).
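A robustness check of this kind is easy to script. The sketch below (Python with numpy and scipy, standing in for the calculators) reproduces the BH, BN and BU values in the first row of Table 6.

```python
import numpy as np
from scipy.stats import norm

def bayes_factor(mean, se, prior_pdf, lo, hi, n=40000):
    """B with a point null at 0, a normal likelihood, and H1 given by prior_pdf on [lo, hi]."""
    theta = np.linspace(lo, hi, n)
    dx = theta[1] - theta[0]
    p_h1 = np.sum(prior_pdf(theta) * norm.pdf(mean, theta, se)) * dx
    return p_h1 / norm.pdf(mean, 0.0, se)

mean, se = 5.5, 2.5   # first row of Table 6
b_half = bayes_factor(mean, se, lambda t: 2 * norm.pdf(t, 0, 5), 0, 60)        # BH(0, 5)
b_norm = bayes_factor(mean, se, lambda t: norm.pdf(t, 0, 5), -60, 60)          # BN(0, 5)
b_unif = bayes_factor(mean, se, lambda t: np.full_like(t, 1 / 20), -10, 10)    # BU[-10, 10]
print(b_half, b_norm, b_unif)   # ≈ 6.05, 3.10, 3.40, matching the first row of Table 6
```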
Verhagen and Wagenmakers (in press) provide a different Bayes factor calculator that
takes as its theory that the current study is an exact replication of a previous one, so H1 can
be set as predicting the standardized effect size previously obtained, with an uncertainty
defined by the previous standard error in the estimate. Their B closely matches the results obtained with the defaults recommended in this paper. For example, Verhagen and Wagenmakers’ example 1 is based on Eliot et al (2010): the latter’s experiment 2 was designed as a replication of their experiment 1. In experiment 1 the raw effect size was 6.79 - 5.67 = 1.12 units for females. To interpret experiment 2, the 1.12 can be used as the SD of a
half-normal. In experiment 2 of Eliot et al, a difference of about 1 raw unit (from the graph in
the original paper) can be estimated for females; with a t of 3, this gives a standard error of
0.33. BH(0, 1.12) = 38.57, very close to the 39.73 obtained by Verhagen and Wagenmakers. For
males an SE of 0.33 can be estimated from the graph; thus, mean difference = SE*t = - .02.
BH(0, 1.12) = 0.27, reasonably close to the 0.13 obtained by Verhagen and Wagenmakers.
(Similar Bs for the Dienes calculator and the Verhagen and Wagenmakers calculator hold for their other examples.) In sum, the different Bs, based on somewhat different distributions but the same scientific intuitions, produced reassuringly close answers, indicating the
robustness of the conclusions.
In sum, the often close results obtained by different Bs when the same scientific theories are represented validate Bs as a method of evaluating such theories. The
arbitrariness of the distributions chosen to represent H1 can be shown to be inconsequential
when different simple distributions representing the same theories produce qualitatively the
same answers. When different answers are obtained for representations of H1 that are equally
plausible as representations of the same scientific judgments, more data need to be collected
until the conclusion is robust.