The case for conceptual replication and strong inference

Recent findings from a massive collaborative project that attempted to replicate many of the findings published in top psychology journals (Open Science Collaboration, 2015) suggest that roughly half of them fail to replicate. These findings deservedly made headlines and have generated a great deal of reaction amongst psychologists. While some have argued that the rate of non-replicability was overestimated (Gilbert, King, Pettigrew, & Wilson, 2016), a far more skeptical viewpoint has begun to emerge in water-cooler conversations: that studying replicability is not useful at all.
To summarize, the skeptics hold that we shouldn't care so much about whether any individual study replicates; what we should care about instead is whether the theories those studies were meant to investigate are true. On this view, non-replications tell us little, if anything, about the theories we ultimately care about.
This view goes overboard because, at the end of the day, experiments are data points that help paint a theoretical picture. If there are 100 published studies pointing to the existence of automatic priming effects (e.g., reading a crossword about ageing makes people walk more slowly), and there is no systematic bias in the published articles, then this is a pretty good reason to believe in automatic priming effects. If, however, we randomly pick 10 of those studies to replicate and 9 fail, this will justifiably undermine our confidence in the underlying theory.
With that being said, the critics of the obsession with replicability also have a point. As scientists, we
should care more about theories than about individual findings. There is far too much emphasis on
how individual findings might have been produced by various forms of bias or experimental
malpractice, and not nearly enough clear thinking about how experimental evidence should be
translated into beliefs about how the mind works.
The computer scientist Chris Drummond (2009) has made a distinction between “replicability” and
“reproducibility” that I think is very useful. A “replication” in Drummond’s terminology is the
production of a nearly identical set of results in a nearly identical experimental setting to a
previously run experiment. So this would involve running the replication attempt with the same
stimulus set, from the same pool of participants, with the same instructions, same experimental
protocol, etc. On the other hand, a reproduction would be the production of a set of results which is
qualitatively similar to the original set, but in an experimental context where the precise
experimental parameters may have changed slightly or even dramatically. "Qualitatively similar" here can be defined fairly precisely: it refers to parameter settings under which the theory thought to explain the original results predicts that the effect should appear in the reproduction.
One can think about differences in parameter settings between two experiments as being plotted on
a similarity scale (see Figure 1A below). Imagine that all experiments which can be plotted on the
scale are capable of testing all theories that could potentially explain the originally observed effect.
If a new experiment is very close to the point of reference on the scale, then we should be more
likely to call that a replication attempt. The further away it is, the more likely we should be to call it
a reproduction attempt. Following popular terminology from social psychology, we could also call
this an attempt at a “conceptual replication” (Stroebe & Strack, 2014).
Figure 1: (A) "Replications" in Drummond's sense are productions of nearly the same set of results as those observed in an original experiment, obtained under nearly identical experimental settings to those employed in the original experiment. We say "nearly" because no two sequentially run experiments can ever be perfectly identical, given that they are run at different times. "Reproductions" (also known as "conceptual replications") provide evidence for the mechanism postulated to explain the originally observed effect, but in a very different experimental setting. (B) The method of "strong inference" involves targeted reproductions in specifically those experimental parameter settings that can help us discriminate between competing theories.
Talking about things in this way can help us think more clearly about various forms of theoretical
biases and malpractice that may occur in psychological science. To illustrate this, consider the
following hypothetical scenario. Researchers are interested in testing whether adults reading text written in difficult-to-read fonts (the disfluent condition) will learn more than adults reading text in easy-to-read fonts (the fluent condition), or vice versa. They start out with the counterintuitive hypothesis that disfluency will improve learning.
Now (as is more and more common), imagine also that this experiment can be run on Amazon's Mechanical Turk, an online platform that allows for quick and cheap data collection. In advance, the researcher identifies 10 difficult-to-read fonts and 10 easy-to-read fonts. The researcher is technologically savvy and figures out a way to automate internet-based data collection. They end up running 100 different experiments, one for each possible pairing of an easy and a difficult font. Each experiment thus compares one easy-to-read font with one hard-to-read font. Of all these experiments, only one yields the desired significant difference, whereby the hard-to-read font produces higher levels of learning. The experimenter, having read all about the replicability problem, wants to make sure their experiment is replicable. So they replicate the one experiment that worked, not once but twice. They then publish these findings and conclude in the paper that font difficulty has a general influence on learnability.
Clearly something has gone wrong here, but the problem has nothing to do with replicability since
the original results fully replicate. The problem is reproducibility. What has actually happened is
that the researcher has oversampled the experimental parameter space, and has chosen to focus
only on the very specific parameters where the experiment replicates while neglecting those
parameters that yield theoretically inconvenient results. This produces a bias not at the level of the
individual published results, but at the level of the theory that is being drawn from the results. By
ignoring the theoretically inconvenient findings, the researcher is negligently leading us to draw a
general conclusion about the influence of fonts on learnability that is, in this case, highly unlikely to
be true.
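To see why this kind of selective reporting is so misleading, consider a small simulation. The sketch below assumes, purely for illustration, that font difficulty has no true effect on learning and that each of the 100 font pairings is analysed with a two-sample t-test; the sample sizes and significance threshold are my own assumptions, not details from the scenario above.

```python
# A minimal sketch (not the author's study) of the selection bias in the
# hypothetical scenario, assuming font difficulty has NO true effect on learning.
# All numbers below are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments = 100    # one per pairing of an easy and a difficult font
n_per_condition = 30   # hypothetical sample size per condition
alpha = 0.05

significant = 0
for _ in range(n_experiments):
    fluent = rng.normal(0.0, 1.0, n_per_condition)      # learning scores, easy font
    disfluent = rng.normal(0.0, 1.0, n_per_condition)   # same distribution: no true effect
    _, p = ttest_ind(disfluent, fluent)
    significant += p < alpha

print(f"Experiments that 'worked' by chance: {significant} of {n_experiments}")
# On the order of 5 of the 100 pairings cross p < .05 under the null, so publishing
# (and even replicating) only the winning pairing says little about the theory.
```

Even with no effect anywhere in the parameter space, a handful of pairings will come out significant by chance; picking one of them, replicating it, and reporting only that pairing produces exactly the paper described above.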
My experience is that something approximating this is not entirely uncommon in psychology, though
experimenters might run, say, 5 versions of an experiment before finding one that works instead of
the 100 illustrated here. Experimental settings that produce theoretically inconvenient results are
sometimes dismissed, and this dismissal can be justified with post-hoc reasoning: "those fonts weren't fair tests of the hypothesis after all," because one was too easy or too hard, too familiar, too large, and so on. "My-side" bias, defined as the tendency to seek out information that supports one's pre-existing beliefs or desires while ignoring or discounting valid contradictory evidence, has been argued to influence the quality, and likely the replicability, of individual findings (e.g., Paap, 2014;
Nuzzo, 2015). What the above case illustrates is that the same type of bias can operate at the level of
theoretical inference.
Of course, not all cases of non-reproducibility are due to scientific malpractice or bias, just as not all
cases of non-replicability are. In some instances, a researcher draws a perfectly reasonable
conclusion about a pattern of data that just happens to be wrong. There are also cases in which the experimental parameter settings genuinely do not offer a fair test of the hypothesis. For example, imagine that there actually is a general effect of font difficulty on
learning. But also imagine that the material the participant is being asked to learn is extraordinarily
difficult, say a series of 20-digit numbers. In a case like this, any true effect will be masked because the task is simply too difficult: in both the fluent and disfluent conditions, participants would be at 0% accuracy. Had the experiment instead involved a series of 4- or 5-digit numbers, it might have worked.
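A toy simulation makes the masking concrete. The response model below is an illustrative assumption (a logistic function of participant ability, an assumed true "disfluency boost," and task difficulty), not a model taken from the literature; it simply shows how a genuine effect disappears at the floor.

```python
# A minimal sketch of how a floor effect can mask a true difference. The response
# model and all numbers are illustrative assumptions, not estimates from real data.
import numpy as np

def mean_accuracy(task_difficulty, disfluency_boost, n=10_000, seed=1):
    """Simulated proportion of items recalled, using a simple logistic response model."""
    rng = np.random.default_rng(seed)
    ability = rng.normal(0.0, 1.0, n)                      # participant ability
    logit = ability + disfluency_boost - task_difficulty   # assumed true boost for disfluent fonts
    return float(np.mean(1 / (1 + np.exp(-logit))))

for difficulty, label in [(12.0, "20-digit numbers (far too hard)"),
                          (1.0, "4- or 5-digit numbers")]:
    fluent = mean_accuracy(difficulty, disfluency_boost=0.0)
    disfluent = mean_accuracy(difficulty, disfluency_boost=0.5)
    print(f"{label}: fluent = {fluent:.3f}, disfluent = {disfluent:.3f}")
# With the very hard task, both conditions sit at essentially 0% accuracy and the
# true advantage is invisible; with the easier task, the same boost is detectable.
```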
What is needed are theories of (1) underlying mental processes of interest and (2) task performance
that are capable, at least in principle, of making clear predictions about the range of parameter
settings over which a previously observed effect should reproduce. Thinking about things in this way
helps highlight the value of "strong inference" (Platt, 1964), a de-biasing technique that I think is underappreciated in everyday practice. The idea of strong inference is that researchers
should not, as is too often the case, simply be content with having a theory and finding a few
significant results that were predicted by that theory. Instead, researchers should force themselves to articulate one or more alternative hypotheses. In conjunction with a theory of task
performance, one can then identify experimental parameters where the various theories make
competing predictions, and focus in on those (schematically depicted in Figure 1B).
This method, which has been widely adopted in biology since the 1960s with great success (Fudge, 2014), forces scientists to explore the experimental parameter space by focusing on a series of targeted reproductions that can advance theory. In doing so, one would likely avoid
the type of bias described in the hypothetical disfluency situation above. For example, if a
researcher (by chance or unconscious malpractice) found a significant boost in learning for disfluent
fonts, they might then articulate two competing theories which could explain the effect. Theory 1
may be that disfluent fonts improve learning because specific types of letter shapes require more
attention to process. Increased attention “spills over” onto the learning task. Theory 2 may be that
disfluent fonts improve learning not because of the shapes of individual letters, but because
disfluent fonts involve hard to process holistic word forms (which in turn increase attention and
improve learning). The researchers would then devise a minimally contrasting set of reproduction
attempts which could differentiate between the two alternative hypotheses. In doing so, they would
quickly realize that the original effect fails to extend in any of the expected directions, and that the original interpretation of the results is therefore in doubt.
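As a rough illustration of what this step might look like on paper, the sketch below lays out a hypothetical prediction matrix for the two theories; the conditions and predicted outcomes are my own assumptions, chosen only to show how one identifies the settings where the theories disagree.

```python
# A minimal sketch of the strong-inference step: write down each theory's predictions
# and keep only the reproduction attempts on which the theories disagree. The
# conditions and predicted outcomes here are illustrative assumptions.

# Theory 1: the learning boost comes from hard-to-process letter shapes.
# Theory 2: the learning boost comes from hard-to-process holistic word forms.
conditions = {
    # condition: (Theory 1 prediction, Theory 2 prediction)
    "original disfluent font": ("boost", "boost"),
    "unusual letter shapes, normal word shape": ("boost", "no boost"),
    "normal letters, disrupted word shape (aLtErNaTiNg case)": ("no boost", "boost"),
    "fluent font (control)": ("no boost", "no boost"),
}

for name, (t1, t2) in conditions.items():
    verdict = "DISCRIMINATES" if t1 != t2 else "uninformative"
    print(f"{name:55s}  T1: {t1:8s}  T2: {t2:8s}  -> {verdict}")
# Only the two middle conditions can adjudicate between the theories, so those are
# the parameter settings worth targeting with reproduction attempts (Figure 1B).
```

The point of the exercise is not the code itself but the discipline it enforces: the uninformative cells drop out, and the reproduction attempts that remain are precisely the ones that can move the theory forward.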
The emphasis of this view is somewhat different from that of those who argue that a concentration on direct replications is the primary way forward for psychological science (e.g., Simons, 2014; Chambers, 2012). Many who hold this view dismiss "conceptual replications" (i.e., "reproduction attempts" in
Drummond’s terminology) as being overly limited in the type of information they provide. But I think
there is a larger point here that is not being expressed. Yes, direct replication is important because
we want to know that individual data points are reliable. But if you are interested in theoretical
progress via experimentation, targeted conceptual replication is also required.
References
Chambers, C. (2012). You can’t replicate a concept.
http://neurochambers.blogspot.fr/2012/03/you-cant-replicate-concept.html
Drummond, C. (2009). Replicability is not reproducibility: Nor is it good science. In Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th International Conference on Machine Learning, Montreal, Quebec, Canada.
Fudge, D. S. (2014). Fifty years of JR Platt’s strong inference. Journal of Experimental Biology,
217(8), 1202-1204.
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science." Science, 351(6277), 1037.
Nuzzo, R. (2015). How scientists fool themselves – and how they can stop. Nature, 526, 182–185.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science.
Science, 349(6251), aac4716.
Paap, K. R. (2014). The role of componential analysis, categorical hypothesising, replicability and
confirmation bias in testing for bilingual advantages in executive functioning. Journal of Cognitive
Psychology, 26, 242–255.
Platt, J. R. (1964). Strong Inference. Science, 146, 347–353.
Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science, 9(1),
76-80.
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives
on Psychological Science, 9, 59–71.