Syst. Biol. 48(3):675–679, 1999

Reviews
Error and the Growth of Experimental Knowledge.—Deborah G. Mayo. 1996. Science and its Conceptual Foundations Series, David L. Hull (ed.). The University of Chicago Press, Chicago, Illinois. xvi + 494 pp. $70.00 (hardbound), ISBN 0-226-51197-9; $29.95 (paperback), ISBN 0-226-51198-7. www.press.uchicago.edu.
We can say that, if we take e as a test of h, then the severity of this test, interpreted as supporting evidence, will be greater the less probable is e given b alone (without h); that is to say, the smaller is p(e,b), the probability of e given b. (Popper, 1963:390)
. . . hypothesis H passes a severe test with e if e fits H, and the test procedure had a high probability of producing a result that accords less well with H than e does, if H were false or incorrect. (Mayo, 1996:445)
The two quotes above express, in much the same way,
the intuition that we should not be very impressed with
some purported evidence, e, for hypothesis, h, if in fact
we don’t really need h at all to have had a good chance
of observing e.
For Popper, severity formed part of a counterargument to the inductivist pursuit of high probabilities of hypotheses—indeed, seeking an "improbability" instead could not be any more contrary. When evidence e is improbable given only background knowledge (his "b" in the quote above), then one has a severe test and, equivalently, a degree of corroboration of h, indicated by just how improbable e was given only b. Popper's formulation of severity and corroboration is the natural companion to his more familiar statement of the relationship between evidence and hypothesis—that observations can refute or falsify h but cannot prove h to be true. Failure of e to falsify h clearly does not verify h—but it does provide more or less support for h, depending on the severity of the test. Unfortunately, common usage of Popper (as in introductions to research papers that begin along the lines, "Popper said we should try to falsify hypotheses, and here we go") calls upon falsification without reference to corroboration/severity, or simply equates corroboration logically with nonfalsification. The improbability criterion itself has been misinterpreted as simply implying an epistemological preference for hypotheses that are likely to be proven wrong—and so improbability is subsumed into a role of providing grist for the falsification mill.
For Mayo, a professor in the Philosophy Department at the Virginia Polytechnic Institute and State University, severity is also important. She describes it as the "centerpiece" of her book, Error and the Growth of Experimental Knowledge, which earned the Lakatos Prize in philosophy of science in 1999. The starting point of Mayo's account of the growth of experimental knowledge is an agreement with Popper that seeking high probabilities for hypotheses yields "an unsatisfactory account of scientific inference." Mayo states at the outset that it is the pinning down of the probabilities of particular outcomes from experiments that can be the basis for gaining knowledge. Again, it is an improbability of our observations or evidence that is of interest, suggesting strong links with Popper. Putting aside for the moment Mayo's presentation of the philosophical issues that provide context and justification, it is worth reviewing first her presentation of the mechanics of her severe tests.
Mechanics
Mayo's book is all about reconciling our philosophical notions about the relationship between evidence and hypothesis with actual statistical practice. In Chapter 1, Mayo makes it clear that this is to be a largely statistical account, acknowledging "as the norm the need to deal with the rejection of statistical hypotheses." Here, standard statistical tests are to be the basis for talking about experiments, errors, and probabilities. Early on, Mayo distinguishes between "evidential-relationship" (E-R) and testing approaches (Chapter 3). The quantities in E-R approaches are described as probabilities "or other measures (of support or credibility)" assigned to hypotheses, while the quantities in testing are quantities related to "errors" that describe properties of the test itself. Mayo describes a simple scenario, in which a null hypothesis is rejected based on a small tail probability and the alternative hypothesis, H, therefore has passed what she calls a "severe test." That small tail probability means that obtaining our evidence e is improbable, given only the null model, without H. The link between severity and an indication of the correctness of H is therefore clear: "[T]he experimental inference that is licensed, in other words, is what has passed a severe test" (p. 11). Mayo uses "reliability" in this context to reflect, not an automatic inference that H is true, but rather that H has gained some grounds for our supposing it to be correct. Achieving severity is described as the pathway to making inferences, although there is to be no strictly logical relationship here between evidence and hypothesis.
How do "errors" enter into this? Mayo describes a "severe test" also as "a testing procedure with a good chance of revealing the presence of a specific error if it exists—but not otherwise" (p. 7; see also p. 460). A potential "error" in this context corresponds to some alternative conceivable way to account for evidence e, with H absent. This definition can be equated with that in the quote above—a "good chance of revealing the presence of a specific error" is equivalent to a "high probability of producing a result that accords less well with H than e does." This error-based description, at least up to the phrase "but not otherwise," therefore corresponds to the version of severity in the quote above, taken from later in the book. Finding an error when an error exists, and finding H false when it is false, are equivalent bases for severity. I will return to the "but not otherwise" condition later in this review.
In Chapter 6, Mayo presents a simple example. In testing hypothesis H, a null hypothesis (H0) is rejected with an observed P value of 0.03. Mayo emphasizes that this value is the probability that the test would pass H when in fact H0, the null hypothesis, was true—in other words, the familiar type I error equal to the probability of wrongly rejecting the null hypothesis. Severity is then calculated as 1 minus this probability, or 0.97—equal to the probability of not passing H when indeed H is false (H0 is true). Mayo notes (p. 193):

. . . by rejecting the null hypothesis H0 only when the significance level is low, we automatically ensure that any such rejection constitutes a case where the nonchance hypothesis H passes a severe test.
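Mayo's arithmetic here is simple enough to check directly. A minimal sketch (my own illustration, not code from the book), using the standard fact that under the null hypothesis a P value is uniformly distributed on [0, 1]:

```python
import random

def severity_of_rejection(alpha):
    """Mayo's Chapter 6 arithmetic: the severity of rejecting H0 at
    significance level alpha is 1 minus the type I error probability."""
    return 1.0 - alpha

def simulated_pass_rate(alpha, n_trials=200_000, seed=1):
    """Monte Carlo check: draw P values under H0 (uniform on [0, 1])
    and count how often the non-null hypothesis H is wrongly passed."""
    rng = random.Random(seed)
    passes = sum(1 for _ in range(n_trials) if rng.random() <= alpha)
    return passes / n_trials

print(severity_of_rejection(0.03))   # 0.97, as in Mayo's example
print(simulated_pass_rate(0.03))     # close to 0.03
```

The simulation simply confirms that the chance of an erroneous rejection is the significance level itself, which is all the "automatically ensure" in the quote amounts to.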
All that would seem to amount to nothing more
than another word (“severe”) for the low tail probabilities leading to conventional rejection of hypotheses. But Mayo promotes a search (“hunting and snooping,” Chapter 9) for those hypotheses that, given the
experimental outcome, would be judged to have been
tested severely. Mayo cautions against treating “predesignated and postdesignated tests alike,” capturing the
necessary distinction between a priori and a posteriori
tests.
In a later example, Mayo describes a test of a hypothesis that an observed value of a variable is different from values associated with some assumed statistical distribution, represented by a null model. When
the null hypothesis is not rejected, the hypothesis that
there was a real difference has not been tested severely;
the observed value and corresponding difference are
not improbable given only the null model. However, an
hypothesis that the difference is not greater than some
quantity, x, has been tested severely as a result of the observed value for the initial test. Now, the same observed
value is improbable, because it lies in the tail of a distribution representing a difference of x units. Mayo’s
various well-presented examples clarify the link of tail
probabilities to severity that is apparent when observations are taken as evidence for some hypothesis H, and
that evidence is improbable given only our null model.
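The geometry of that example can be sketched numerically. Assuming, purely for illustration, a unit-normal sampling distribution, an observed difference of 0.5, and a bound x = 2 (my numbers, not Mayo's):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

observed = 0.5   # illustrative observed difference
x = 2.0          # illustrative bound for the second hypothesis

# Against the null model (difference = 0), the observation is not in
# the tail, so "there is a real difference" is not severely tested.
p_null = 1.0 - normal_cdf(observed)    # about 0.31, far from significance

# But under a distribution representing a difference of x = 2, a value
# this small is improbable, so the hypothesis "the difference is not
# greater than x" has been tested severely.
p_given_x = normal_cdf(observed - x)   # about 0.07
severity = 1.0 - p_given_x             # about 0.93
```

The same observation thus fails to test one hypothesis severely while severely testing another, which is the point of Mayo's example.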
The Mechanics of Popperian Severity/Corroboration
Mayo's severity criterion appears to be identical to Popper's, though her own improbability criterion is based on tail probabilities. In considering Popperian severity/corroboration, this amounts to substituting a null or other statistical model as a special case of background knowledge, b, and as the basis for calculating the probability of the evidence given only the background knowledge, p(e,b). In the introductory quote from Mayo above, note that e refers to the data, whereas Popper uses e for the evidence, corresponding to what Mayo refers to as the "fit" of data to hypothesis. Mayo is calculating the probability of the evidence given background knowledge just as Popper is.
The special case represented by the substitution of a statistical model for background knowledge is nothing new. The statistical basis for Mayo's severity criterion may recall for readers of Systematic Biology the explicit Popperian severity/corroboration framework underlying the permutation tail probability (PTP) test and related tests. The stated degree of corroboration/severity for a PTP test of a phylogenetic hypothesis is equated with the degree of improbability, equal to the P value in evaluating a null model based on random character covariation (Faith, 1990, 1992; Faith and Cranston, 1991, 1992). The hypothesis, h, is that the most-parsimonious tree has revealed true phylogenetic pattern and the evidence, e, is taken to be the length (degree of parsimony) of that cladogram. The most-parsimonious tree has the highest corroboration among all competing tree hypotheses, but the degree of corroboration may be quite small (for example, on some occasions when the parsimony value is based on a single character). To make the above quote from Popper specific to this context, "if we take the degree of parsimony as a test of the hypothesis that the most-parsimonious tree is the true tree, then the severity of this test, interpreted as supporting evidence, will be greater the less probable is that degree of parsimony given the null model of random covariation alone (without h); that is to say, the smaller is p(e,b), the probability of a tree that short given the null model." So, PTP as an assessment of severity/corroboration appropriately addresses the question as originally posed, "[C]ould a cladogram this short have arisen by chance alone?" (Faith and Cranston, 1991).
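The PTP calculation itself is easy to skeletonize. In the sketch below (my own construction, not code from the cited papers), `fit` stands in for the minimum tree length found by a parsimony analysis, and the null model b permutes each character independently across taxa; the returned tail probability is the p(e,b) of the text:

```python
import random

def ptp_tail_probability(matrix, fit, n_perm=99, seed=0):
    """Fraction of null (column-permuted) data sets whose fit is at
    least as good as the observed fit; lower fit = better, as with
    tree length under parsimony. The observed data set is counted as
    one member of the null set, in the usual permutation-test way."""
    rng = random.Random(seed)
    observed = fit(matrix)
    n_taxa = len(matrix)
    count = 0
    for _ in range(n_perm):
        # Permute each character (column) independently across taxa
        # (rows), destroying joint covariation but keeping state counts.
        columns = list(zip(*matrix))
        shuffled = [rng.sample(col, n_taxa) for col in columns]
        permuted = [list(row) for row in zip(*shuffled)]
        if fit(permuted) <= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Toy usage: two congruent binary characters on four taxa, with the
# number of distinct row patterns as a crude stand-in for tree length.
matrix = [[0, 0], [0, 0], [1, 1], [1, 1]]
p = ptp_tail_probability(matrix, lambda m: len(set(map(tuple, m))))
```

With real data, `fit` would be the length of the most-parsimonious tree, and a small p would license the conclusion that a cladogram this short is improbable under chance covariation alone.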
Mayo’s presentation of the mechanics of severity
therefore corresponds well not only with Popperian
severity/corroboration in general, but also, in the statistical context, with the use in systematics of null models
and tail probabilities as an explicit assessment of Popperian severity/corroboration. However, the linkages
that are actually made in this book are another matter. Mayo’s own characterization of the philosophical
context for severity is the most challenging part of the
book.
Mayo’s Caricature of Popper
Mayo's characterization of Popper stands in sharp contrast to the one reviewed above. No equivalence is noted between her severity criterion and Popper's. This is not to say that Popper isn't discussed. At the outset, Mayo claims novelty for her approach with reference to Popper (p. 4):

The result, let me be clear, is not a filling-in of the Popperian (or the Lakatosian) framework, but a wholly different picture of learning from error, and with it a different program for explaining the growth of scientific knowledge.

This apparent contrast arises because Mayo presents Popperian philosophy, in the early chapters, purely as a tale about falsification. Popper here is repeatedly characterized as a sloganeer (pp. 7, 8, 207), with nothing to say regarding positive support for hypotheses. That portrayal is to stand in apparent contrast to
Mayo’s achievement in this book: “[E]ach of these slogans, however, is turned into a position where something positive is extracted from the severe criticism”
(p. 8). Mayo sums this up with a phrase that suggests
she doesn’t think slogans are too bad after all, stating
that it is time to “dispel the ghosts of Popper’s negativism” (p. 8).
Throughout Chapter 1, Popper's apparent negativism is linked to limitations that falsification imposes on how much we can learn from error: "Popper says little about what positive information is acquired through error other than just that we learn an error has been made someplace" (p. 1). For Popper, "learning is a matter of deductive falsification—what is learned is that h is false" (p. 2). "Popper says that passing a severe test (i.e., corroboration) counts in favor of a hypothesis simply because it may be true, while those that failed the tests are false" (p. 10).
Popper's severity/corroboration is not ignored in Chapter 1, but is vaguely characterized as a failure (p. 8):

The most devastating criticism of Popper's approach is this. . . . Popper seems to lack any meaningful way of saying why passing severe tests counts in favor of a hypothesis.

And later (p. 10):

I agree with Popper's critics that Popper fails to explain why corroboration counts in favor of a hypothesis.
If those are indeed accurate characterizations of Popper's severity criterion, then it is plausible that Mayo's severity criterion, in searching for positive indications that hypothesis h is correct, is really something totally different from Popperian severity/corroboration. But Mayo's caricature is easily countered by citing Popper's own references to a positive interpretation of degree of corroboration/severity. These and many other references make it clear that severity for Popper embraces the same thing that Mayo is claiming as a novel result:
If e should be probable, in the presence of b alone ("probable" in the sense of the probability calculus), then its occurrence can hardly be considered as significant support of h. (Popper, 1982:237)

Now we can express our demand that the empirical evidence, if it is to support h, should not be probable (or expected) on the background knowledge b alone by p(e,b) < 1/2. This leads us at once to realize that the smaller p(e,b), the stronger will be the support which e renders to h—provided our first demand is satisfied, that is, provided e follows from h and b, or from h in the presence of b. (Popper, 1982:238)
In Realism and the Aim of Science, Popper (1982) reviews his rationale for corroboration as something that
must go beyond just a failure to falsify. He argues that
evidence e can support h in a way that goes beyond
simply not being a counter-instance—but to do so, it
must be improbable without h. That notion of a degree
of positive support, not as a probability assigned to the
hypothesis but as a characterization of the test, is the
basis for his severity criterion:
. . . we demand intuitively that only severe tests should
count, and that the more severe they are, the more
they should count. But this is the same as to demand that e should be improbable on our background
knowledge. (Popper, 1982:238)
In the midst of slogans, Mayo appears to have overlooked the key positive role played by Popperian severity. Part of the difficulty may be that the severity/corroboration definitions are in Popper's appendices and formulae. The reader who expects these to be carefully examined in the context of Mayo's own discussion of severity will be disappointed, as Mayo states that "the particular mathematical formulas Popper offered for measuring the degree of severity are even more problematic and they will not be specifically considered here" (p. 207).
I have some sympathy with Mayo's portrayal. Popular Popper is basically the falsification bits—a caricature that is not so much a product of Popper's own slogans as the sloganeering of subsequent interpreters. Mayo's book simply reinforces that falsification-centred perspective. One is left with the feeling that, if Popper has been so consistently sloganized that his philosophy is to be only falsification, then it is time to acknowledge another system, a Popper*: inferential analyses using falsification (and severity/corroboration, sometimes based on statistical tail probabilities).
Mayo is perhaps correct in noting that Popper did not take the "error probability turn" (p. 207) in a statistical sense. However, given that Popper did define severity/corroboration in terms of that fundamental low probability of evidence e in the absence of hypothesis h, it is frustrating that the links are not discussed. Mayo does consider Popperian severity again later in the book (Chapter 6), and it will be difficult for the reader to reconcile this later section with the earlier part of the book. In the early dismissive caricature of Popper, Mayo argues (p. 41):

a major flaw in Popper's account (recall Chapter 1) arises because he supplies no grounds for thinking that a hypothesis h very probably would not have been corroborated if it were false.
In Chapter 6, Mayo now puts forward a different characterization (p. 208):

Popper plainly states that the reason he thinks that hypothesis h can be expected to fail if false is that background and alternative hypotheses predict not-e . . .

It now appears that Popper's severity is to be acknowledged as having some similarity to Mayo's severity criterion. However, Mayo goes on to argue that a criterion based on having alternative hypotheses predict not-e does not work. Here, a different kind of misrepresentation arises, and a reading of Popper once again seems to directly counter Mayo's characterization. According to Popper, the only term used to calculate an improbability of our evidence is b, described as "background knowledge consisting of theories not under test, and also of initial conditions" (Popper, 1982:252; emphasis mine). Also, Popper states:
by our background knowledge b we mean any knowledge (relevant to the situation) which we accept—perhaps only tentatively—while we are testing h. Thus b may, for example, include initial conditions. It is important to realize that b must be consistent with h. . . . (Popper, 1982:236)
Mayo's characterization of Popperian severity as requiring evaluation of the evidence in light of alternative hypotheses therefore appears misplaced. Further, the idea that there is a "prediction" of not-e has no correspondence with Popper's assertion that it is a low probability of e given b that is the reason to expect that h will fail when false. Why does Mayo saddle Popper with a prediction and yet refer to low probability of evidence in her own treatment of severity?

The reader therefore may judge Mayo's "ghosts of Popper" to be an apparition, with clearer links between her own severity account and Popper*, particularly in the calculation of statistical probabilities of obtaining evidence, e, when h is absent or "false." The mechanics for all that are exciting. Whether the book breaks new ground philosophically is open to question. If it is judged not-so-improbable to have obtained all that severity-as-tail-probability mechanics based only on our standard epistemological recipes (lots of Popper, hold the Mayo), then Mayo's "wholly different picture of learning from error, and with it a different program for explaining the growth of scientific knowledge" certainly has not yet passed any severe test.
Errors and Systematics
Those are Mayo's key confusions regarding Popperian severity, but confusions in describing her own severity criterion will be equally frustrating for the reader. In Chapter 12, Mayo provides the reader a quick review of severity (p. 424; emphasis mine):

the lower the significance level required before rejecting H0 and accepting the non-null hypothesis—call it H—the more improbable such an acceptance of H is, when in fact H0 is true. And the more probable such an erroneous acceptance of H is, the higher the severity is of a result taken to pass H. This just rehearses what we already know.

But it is of course the improbability of such an erroneous acceptance that is to yield higher severity. These lapses are easy to catch when one keeps in mind Popperian severity, but must be difficult for the uninitiated.
Further on, in Chapter 13, another possible confusion for the reader arises in the description of the link of severity to an error-statistical framework: "It is learned that an error is absent to the extent that a procedure of inquiry with a high probability of detecting the error if and only if it is present nevertheless detects no error" (p. 445, emphasis mine). Shouldn't that be simply "if," in order to be consistent with Mayo's central definition of severity? The "if" condition corresponds to a statement that there must be a high probability of obtaining a fit that is not as good as that observed, if h is false. That is the notion of severity that Mayo thoroughly discusses, and is all that is needed to obtain a requirement for a low p(e,b). The "only if" condition means that there must be a high probability of a fit not so good as that observed, only if h is false. That suggests that if h is true, we can have only a low-ish probability of a not-so-good fit, or a high-ish probability of as good a fit. That can be expressed as a high p(e,hb)—the probability of e given h and b—a likelihood term also used in Popper's equations for severity/corroboration (but usually set equal to 1). That inconsistency in Mayo's definition of severity matches that in the earlier quote taken from Chapter 1 (see also p. 460) that used the additional condition "but not otherwise." Does Mayo intend something like p(e,hb) to be an essential part of the calculation of severity?
The role of those two terms, p(e,hb) and p(e,b), in corroboration/severity highlights the link between this book and the ongoing debates in systematics about how corroboration/severity may, or may not, justify cladistic parsimony and other methods of phylogenetic inference. In the PTP-inspired framework referred to above, there is no automatic justification for any method; the measure of "fit," to use Mayo's term, could be that provided by any phylogenetic method and severity would still be indicated by low p(e,b). In contrast, other subsequent recastings of Popperian corroboration, while explicitly considering the terms p(e,hb) and p(e,b), actually assign p(e,b) no role. For example,

the best supported hypotheses are those that assign highest probability to the evidence. Only p(e,hb) can perform this role; the other term p(e,b) does not involve h. (Carpenter et al., 1998:107)

Similarly, the early link of severity to improbable hypotheses (Kluge, 1984) has been freshly recast in a recent "review" (Kluge, 1997) as an improbability of evidence in light of the background knowledge. However, "evidence," e, here reflects only properties of the data itself, implying that all tree hypotheses have the same value for p(e,b). So this term again plays no role in determining relative corroboration/severity for different tree hypotheses. Mayo's book surely is a must-read for systematists just for the opportunity it offers to explore the mismatch between these recent justifications for cladistics and the severity-as-improbability-of-evidence of Popper*.
But Mayo's book also may have a broader appeal to systematists. Currently, philosophical justifications and distinctions in systematics seem to be the domain of cladistics but not other methods (see Siddall and Kluge, 1997, and reference within to Huelsenbeck, 1996; Wenzel, 1997, and reference within to Swofford et al., 1996). Mayo's book shows that severity is a powerful, general criterion with wide application. Her broad framework includes "methodological rules" and "error repertoires," where experience with severity applications leads to lessons about how methods and models perform in different contexts. At this level, the book may point to a more inclusive philosophy for systematics. Various methods in systematics may produce severe tests, and the methods themselves may be severely tested. Whereas a sloganized Popper has provided an exclusive philosophy, twisting and turning to uniquely justify cladistic parsimony, Popper* is relevant to phylogenetic analyses using parsimony and other methods.
REFERENCES
CARPENTER, J. M., P. A. GOLOBOFF, AND J. S. FARRIS. 1998. PTP is meaningless, T-PTP is contradictory: A reply to Trueman. Cladistics 14:105–116.
FAITH, D. P. 1990. Chance marsupial relationships. Nature 345:393–394.
FAITH, D. P. 1992. On corroboration: A reply to Carpenter. Cladistics 8:265–273.
FAITH, D. P., AND P. S. CRANSTON. 1991. Could a cladogram this short have arisen by chance alone? On permutation tests for cladistic structure. Cladistics 7:1–28.
FAITH, D. P., AND P. S. CRANSTON. 1992. Probability, parsimony, and Popper. Syst. Biol. 41:252–257.
HUELSENBECK, J. P. 1996. Phylogenetic methods. Pages 1–8. http://mw511.biol.berkeley.edu/john/lecture.html.
KLUGE, A. G. 1984. The relevance of parsimony to phylogenetic inference. Pages 24–38 in Cladistics: Perspectives on the reconstruction of evolutionary history (T. Duncan and T. F. Stuessy, eds.). Columbia University Press, New York.
KLUGE, A. G. 1997. Testability and the refutation and corroboration of cladistic hypotheses. Cladistics 13:81–96.
POPPER, K. 1963. Conjectures and refutations: The growth of scientific knowledge. Harper and Row, New York.
POPPER, K. 1982. Realism and the aim of science. Hutchinson, London.
SIDDALL, M. E., AND A. G. KLUGE. 1997. Probabilism and phylogenetic inference. Cladistics 13:313–336.
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.
WENZEL, J. W. 1997. When is a phylogenetic test good enough? Mem. Mus. Natl. Hist. Nat. 173:31–45.
Daniel P. Faith, Australian Museum, Sydney 2010, Australia