Reviews

Syst. Biol. 48(3):675–679, 1999

Error and the Growth of Experimental Knowledge.—Deborah G. Mayo. 1996. The University of Chicago Press, Chicago, Illinois. Science and its Conceptual Foundations Series, David L. Hull (ed.). xvi + 494 pp. $70.00 (hardbound), ISBN 0-226-51197-9; $29.95 (paperback), ISBN 0-226-51198-7. www.press.uchicago.edu.

We can say that, if we take e as a test of h, then the severity of this test, interpreted as supporting evidence, will be greater the less probable is e given b alone (without h); that is to say, the smaller is p(e,b), the probability of e given b. (Popper, 1963:390)

. . . hypothesis H passes a severe test with e if e fits H, and the test procedure had a high probability of producing a result that accords less well with H than e does, if H were false or incorrect. (Mayo, 1996:445)

The two quotes above express, in much the same way, the intuition that we should not be very impressed with some purported evidence, e, for hypothesis h if in fact we don't really need h at all to have had a good chance of observing e. For Popper, severity formed part of a counterargument to the inductivist pursuit of high probabilities of hypotheses—indeed, seeking an "improbability" instead could not be any more contrary. When evidence e is improbable given only background knowledge (his "b" in the quote above), then one has a severe test and, equivalently, a degree of corroboration of h, indicated by just how improbable e was given only b. Popper's formulation of severity and corroboration is the natural companion to his more familiar statement of the relationship between evidence and hypothesis—that observations can refute or falsify h but cannot prove h to be true. Failure of e to falsify h clearly does not verify h—but it does provide more or less support for h, depending on the severity of the test.
Unfortunately, common usage of Popper (as in introductions to research papers that begin along the lines, "Popper said we should try to falsify hypotheses, and here we go") calls upon falsification without reference to corroboration/severity, or simply equates corroboration logically with nonfalsification. The improbability criterion itself has been misinterpreted as simply implying an epistemological preference for hypotheses that are likely to be proven wrong—and so improbability is subsumed into a role of providing grist for the falsification mill. For Mayo, a professor in the Philosophy Department at the Virginia Polytechnic Institute and State University, severity is important also. She describes it as the "centerpiece" of her book, Error and the Growth of Experimental Knowledge, which earned the Lakatos Prize in philosophy of science in 1999. The starting point of Mayo's account of the growth of experimental knowledge is an agreement with Popper that seeking high probabilities for hypotheses yields "an unsatisfactory account of scientific inference." Mayo states at the outset that it is the pinning down of the probabilities of particular outcomes from experiments that can be the basis for gaining knowledge. Again, it is an improbability of our observations or evidence that is of interest, suggesting strong links with Popper. Putting aside for the moment Mayo's presentation of the philosophical issues that provide context and justification, it is worth reviewing first her presentation of the mechanics of her severe tests.

Mechanics

Mayo's book is all about reconciling our philosophical notions about the relationship between evidence and hypothesis with actual statistical practice. In Chapter 1, Mayo makes it clear that this is to be largely a statistical account, acknowledging "as the norm the need to deal with the rejection of statistical hypotheses." Here, standard statistical tests are to be the basis for talking about experiments, errors, and probabilities.
Early on, Mayo distinguishes between "evidential-relationship" (E-R) and testing approaches (Chapter 3). The quantities in E-R approaches are described as probabilities "or other measures (of support or credibility)" assigned to hypotheses, while the quantities in testing are quantities related to "errors" that describe properties of the test itself. Mayo describes a simple scenario in which a null hypothesis is rejected based on a small tail probability and the alternative hypothesis, H, therefore has passed what she calls a "severe test." That small tail probability means that obtaining our evidence e is improbable, given only the null model, without H. The link between severity and an indication of the correctness of h is therefore clear: "[T]he experimental inference that is licensed, in other words, is what has passed a severe test" (p. 11). Mayo uses "reliability" in this context to reflect, not an automatic inference that H is true, but rather that H has gained some grounds for our supposing it to be correct. Achieving severity is described as the pathway to making inferences, although there is to be no strictly logical relationship here between evidence and hypothesis. How do "errors" enter into this? Mayo describes a "severe test" also as "a testing procedure with a good chance of revealing the presence of a specific error if it exists—but not otherwise" (p. 7; see also p. 460). A potential "error" in this context corresponds to some alternative conceivable way to account for evidence e, with H absent. This definition can be equated with that in the quote above—a "good chance of revealing the presence of a specific error" is equivalent to a "high probability of producing a result that accords less well with H than e does." This error-based description, at least up to the phrase "but not otherwise," therefore corresponds to the version of severity in the quote above, taken from later in the book.
Finding an error when an error exists, and finding H false when it is false, are equivalent bases for severity. I will return to the "but not otherwise" condition later in this review. In Chapter 6, Mayo presents a simple example. In testing hypothesis H, a null hypothesis (H0) is rejected with an observed P value of 0.03. Mayo emphasizes that this value is the probability that the test would pass H when in fact H0, the null hypothesis, was true—in other words, the familiar type I error equal to the probability of wrongly rejecting the null hypothesis. Severity is then calculated as 1 minus this probability, or 0.97—equal to the probability of not passing H when indeed H is false (H0 is true). Mayo notes (p. 193):

. . . by rejecting the null hypothesis H0 only when the significance level is low, we automatically ensure that any such rejection constitutes a case where the nonchance hypothesis H passes a severe test.

All that would seem to amount to nothing more than another word ("severe") for the low tail probabilities leading to conventional rejection of hypotheses. But Mayo promotes a search ("hunting and snooping," Chapter 9) for those hypotheses that, given the experimental outcome, would be judged to have been tested severely. Mayo cautions against treating "predesignated and postdesignated tests alike," capturing the necessary distinction between a priori and a posteriori tests. In a later example, Mayo describes a test of a hypothesis that an observed value of a variable is different from values associated with some assumed statistical distribution, represented by a null model. When the null hypothesis is not rejected, the hypothesis that there was a real difference has not been tested severely; the observed value and corresponding difference are not improbable given only the null model. However, a hypothesis that the difference is not greater than some quantity, x, has been tested severely as a result of the observed value for the initial test.
Now, the same observed value is improbable, because it lies in the tail of a distribution representing a difference of x units. Mayo's various well-presented examples clarify the link of tail probabilities to severity that is apparent when observations are taken as evidence for some hypothesis H, and that evidence is improbable given only our null model.

The Mechanics of Popperian Severity/Corroboration

Mayo's severity criterion appears to be identical to Popper's, though her own improbability criterion is based on tail probabilities. In considering Popperian severity/corroboration, this amounts to substituting a null or other statistical model as a special case of background knowledge, b, and as the basis for calculating the probability of the evidence given only the background knowledge, p(e,b). In the introductory quote from Mayo above, note that e refers to the data, whereas Popper uses e for the evidence, corresponding to what Mayo refers to as the "fit" of data to hypothesis. Mayo is calculating the probability of the evidence given background knowledge just as Popper is. The special case represented by the substitution of a statistical model for background knowledge is nothing new. The statistical basis for Mayo's severity criterion may recall for readers of Systematic Biology the explicit Popperian severity/corroboration framework underlying the permutation tail probability (PTP) test and related tests. The stated degree of corroboration/severity for a PTP test of a phylogenetic hypothesis is equated with the degree of improbability, equal to the P value in evaluating a null model based on random character covariation (Faith, 1990, 1992; Faith and Cranston, 1991, 1992). The hypothesis, h, is that the most-parsimonious tree has revealed true phylogenetic pattern, and the evidence, e, is taken to be the length (degree of parsimony) of that cladogram.
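The tail-probability mechanics described above can be sketched as a small permutation test in the spirit of PTP. This is only an illustrative sketch under stated assumptions: the statistic below is a toy measure of pairwise character covariation standing in for the minimum cladogram length that the real PTP test uses, and the data matrix is hypothetical. The P value is the proportion of data sets, with character states permuted independently across taxa, whose statistic is at least as extreme as the observed one (counting the observed data set itself); severity is then read off as one minus that tail probability.

```python
# Toy PTP-style permutation tail probability (illustrative only; the real
# PTP of Faith and Cranston uses minimum cladogram length as the statistic).
import random
from itertools import combinations

def covariation(matrix):
    """Toy stand-in statistic: total pairwise agreement between binary
    characters (columns), counting agreement in either direction."""
    n_taxa = len(matrix)
    score = 0
    for i, j in combinations(range(len(matrix[0])), 2):
        matches = sum(1 for row in matrix if row[i] == row[j])
        score += max(matches, n_taxa - matches)
    return score

def ptp_tail_probability(matrix, n_perm=999, seed=0):
    """p(e,b): proportion of permuted data sets whose statistic is at
    least as extreme as the observed one (observed data set included)."""
    rng = random.Random(seed)
    observed = covariation(matrix)
    n_taxa, n_chars = len(matrix), len(matrix[0])
    hits = 1  # count the observed data set itself
    for _ in range(n_perm):
        cols = [[row[c] for row in matrix] for c in range(n_chars)]
        for col in cols:
            rng.shuffle(col)  # permute each character independently across taxa
        permuted = [[cols[c][t] for c in range(n_chars)] for t in range(n_taxa)]
        if covariation(permuted) >= observed:
            hits += 1
    return hits / (n_perm + 1)

# Hypothetical data: six taxa, three perfectly covarying binary characters.
data = [[0, 0, 0], [0, 0, 0], [0, 0, 0],
        [1, 1, 1], [1, 1, 1], [1, 1, 1]]
p = ptp_tail_probability(data)
severity = 1.0 - p
print(f"p(e,b) = {p:.3f}; severity = {severity:.3f}")
```

Strongly covarying characters give a small p(e,b), so the "real cladistic structure" hypothesis passes a relatively severe test; with randomly scrambled data, p would drift toward 1 and severity toward 0.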
The most-parsimonious tree has the highest corroboration among all competing tree hypotheses, but the degree of corroboration may be quite small (for example, on some occasions when the parsimony value is based on a single character). To make the above quote from Popper specific to this context: "if we take the degree of parsimony as a test of the hypothesis that the most-parsimonious tree is the true tree, then the severity of this test, interpreted as supporting evidence, will be greater the less probable is that degree of parsimony given the null model of random covariation alone (without h); that is to say, the smaller is p(e,b), the probability of a tree that short given the null model." So, PTP as an assessment of severity/corroboration appropriately addresses the question as originally posed: "[C]ould a cladogram this short have arisen by chance alone?" (Faith and Cranston, 1991). Mayo's presentation of the mechanics of severity therefore corresponds well not only with Popperian severity/corroboration in general, but also, in the statistical context, with the use in systematics of null models and tail probabilities as an explicit assessment of Popperian severity/corroboration. However, the linkages that are actually made in this book are another matter. Mayo's own characterization of the philosophical context for severity is the most challenging part of the book.

Mayo's Caricature of Popper

Mayo's characterization of Popper stands in sharp contrast to the one reviewed above. No equivalence is noted between her severity criterion and Popper's. This is not to say that Popper isn't discussed. At the outset, Mayo claims novelty for her approach with reference to Popper (p. 4):

The result, let me be clear, is not a filling-in of the Popperian (or the Lakatosian) framework, but a wholly different picture of learning from error, and with it a different program for explaining the growth of scientific knowledge.
This apparent contrast arises because Mayo presents Popperian philosophy, in the early chapters, purely as a tale about falsification. Popper here is repeatedly characterized as a sloganeer (pp. 7, 8, 207), with nothing to say regarding positive support for hypotheses. That portrayal is to stand in apparent contrast to Mayo's achievement in this book: "[E]ach of these slogans, however, is turned into a position where something positive is extracted from the severe criticism" (p. 8). Mayo sums this up with a phrase that suggests she doesn't think slogans are too bad after all, stating that it is time to "dispel the ghosts of Popper's negativism" (p. 8). Throughout Chapter 1, Popper's apparent negativism is linked to limitations that falsification imposes on how much we can learn from error: "Popper says little about what positive information is acquired through error other than just that we learn an error has been made someplace" (p. 1). For Popper, "learning is a matter of deductive falsification—what is learned is that h is false" (p. 2). "Popper says that passing a severe test (i.e., corroboration) counts in favor of a hypothesis simply because it may be true, while those that failed the tests are false" (p. 10). Popper's severity/corroboration is not ignored in Chapter 1, but is vaguely characterized as a failure (p. 8):

The most devastating criticism of Popper's approach is this. . . . Popper seems to lack any meaningful way of saying why passing severe tests counts in favor of a hypothesis.

And later (p. 10):

I agree with Popper's critics that Popper fails to explain why corroboration counts in favor of a hypothesis.

If those are indeed accurate characterizations of Popper's severity criterion, then it is plausible that Mayo's severity criterion, in searching for positive indications that hypothesis h is correct, is really something totally different from Popperian severity/corroboration.
But Mayo's caricature is easily countered by citing Popper's own references to a positive interpretation of degree of corroboration/severity. These and many other references make it clear that severity for Popper embraces the same thing that Mayo is claiming as a novel result:

If e should be probable, in the presence of b alone ("probable" in the sense of the probability calculus), then its occurrence can hardly be considered as significant support of h. (Popper, 1982:237)

Now we can express our demand that the empirical evidence, if it is to support h, should not be probable (or expected) on the background knowledge b alone by p(e,b) < 1/2. This leads us at once to realize that the smaller p(e,b), the stronger will be the support which e renders to h—provided our first demand is satisfied, that is, provided e follows from h and b, or from h in the presence of b. (Popper, 1982:238)

In Realism and the Aim of Science, Popper (1982) reviews his rationale for corroboration as something that must go beyond just a failure to falsify. He argues that evidence e can support h in a way that goes beyond simply not being a counter-instance—but to do so, it must be improbable without h. That notion of a degree of positive support, not as a probability assigned to the hypothesis but as a characterization of the test, is the basis for his severity criterion:

. . . we demand intuitively that only severe tests should count, and that the more severe they are, the more they should count. But this is the same as to demand that e should be improbable on our background knowledge. (Popper, 1982:238)

In the midst of slogans, Mayo appears to have overlooked the key positive role played by Popperian severity. Part of the difficulty may be that the severity/corroboration definitions are in Popper's appendices and formulae.
The reader who expects these to be carefully examined in the context of Mayo's own discussion of severity will be disappointed, as Mayo states that "the particular mathematical formulas Popper offered for measuring the degree of severity are even more problematic and they will not be specifically considered here" (p. 207). I have some sympathy with Mayo's portrayal. Popular Popper is basically the falsification bits—a caricature that is not so much a product of Popper's own slogans as the sloganeering of subsequent interpreters. Mayo's book simply reinforces that falsification-centred perspective. One is left with the feeling that, if Popper has been so consistently sloganized that his philosophy is to be only falsification, then it is time to acknowledge another system, a Popper*: inferential analyses using falsification (and severity/corroboration, sometimes based on statistical tail probabilities). Mayo is perhaps correct in noting that Popper did not take the "error probability turn" (p. 207) in a statistical sense. However, given that Popper did define severity/corroboration in terms of that fundamental low probability of evidence e in the absence of hypothesis h, it is frustrating that the links are not discussed. Mayo does consider Popperian severity again later in the book (Chapter 6), and it will be difficult for the reader to reconcile this later section with the earlier part of the book. In the early dismissive caricature of Popper, Mayo argues (p. 41):

a major flaw in Popper's account (recall Chapter 1) arises because he supplies no grounds for thinking that a hypothesis h very probably would not have been corroborated if it were false.

In Chapter 6, Mayo now puts forward a different characterization (p. 208):

Popper plainly states that the reason he thinks that hypothesis h can be expected to fail if false is that background and alternative hypotheses predict not-e . . .
It now appears that Popper's severity is to be acknowledged as having some similarity to Mayo's severity criterion. However, Mayo goes on to argue that a criterion based on having alternative hypotheses predict not-e does not work. Here, a different kind of misrepresentation arises, and a reading of Popper once again seems to directly counter Mayo's characterization. According to Popper, the only term used to calculate an improbability of our evidence is b, described as "background knowledge consisting of theories not under test, and also of initial conditions" (Popper, 1982:252; emphasis mine). Also, Popper states:

by our background knowledge b we mean any knowledge (relevant to the situation) which we accept—perhaps only tentatively—while we are testing h. Thus b may, for example, include initial conditions. It is important to realize that b must be consistent with h. . . . (Popper, 1982:236)

Mayo's characterization of Popperian severity as requiring evaluation of the evidence in light of alternative hypotheses therefore appears misplaced. Further, the idea that there is a "prediction" of not-e has no correspondence with Popper's assertion that it is a low probability of e given b that is the reason to expect that h will fail when false. Why does Mayo saddle Popper with a prediction and yet refer to low probability of evidence in her own treatment of severity? The reader therefore may judge Mayo's "ghosts of Popper" to be an apparition, with clearer links between her own severity account and Popper*, particularly in the calculation of statistical probabilities of obtaining evidence, e, when h is absent or "false." The mechanics for all that are exciting. Whether the book breaks new ground philosophically is open to question.
If it is judged not-so-improbable to have obtained all that severity-as-tail-probability mechanics based only on our standard epistemological recipes (lots of Popper, hold the Mayo), then Mayo's "wholly different picture of learning from error, and with it a different program for explaining the growth of scientific knowledge" certainly has not yet passed any severe test.

Errors and Systematics

Those are Mayo's key confusions regarding Popperian severity, but confusions in describing her own severity criterion will be equally frustrating for the reader. In Chapter 12, Mayo provides the reader a quick review of severity (p. 424; emphasis mine):

the lower the significance level required before rejecting H0 and accepting the non-null hypothesis—call it H—the more improbable such an acceptance of H is, when in fact H0 is true. And the more probable such an erroneous acceptance of H is, the higher the severity is of a result taken to pass H.

The first sentence just rehearses what we already know. But it is of course the improbability of such an erroneous acceptance that is to yield higher severity. These lapses are easy to catch when one keeps in mind Popperian severity, but must be difficult for the uninitiated. Further on, in Chapter 13, another possible confusion for the reader arises in the description of the link of severity to an error-statistical framework: "It is learned that an error is absent to the extent that a procedure of inquiry with a high probability of detecting the error if and only if it is present nevertheless detects no error" (p. 445, emphasis mine). Shouldn't that be simply "if," in order to be consistent with Mayo's central definition of severity? The "if" condition corresponds to a statement that there must be a high probability of obtaining a fit that is not as good as that observed, if h is false. That is the notion of severity that Mayo thoroughly discusses, and is all that is needed to obtain a requirement for a low p(e,b). The "only if" condition means that there must be a high probability of a fit not so good as that observed, only if h is false. That suggests that if h is true, we can have only a low-ish probability of a not-so-good fit, or a high-ish probability of as good a fit. That can be expressed as a high p(e,hb)—the probability of e given h and b—a likelihood term also used in Popper's equations for severity/corroboration (but usually set equal to 1). That inconsistency in Mayo's definition of severity matches that in the earlier quote taken from Chapter 1 (see also p. 460) that used the additional condition "but not otherwise." Does Mayo intend something like p(e,hb) to be an essential part of the calculation of severity? The role of those two terms, p(e,hb) and p(e,b), in corroboration/severity highlights the link between this book and the ongoing debates in systematics about how corroboration/severity may, or may not, justify cladistic parsimony and other methods of phylogenetic inference. In the PTP-inspired framework referred to above, there is no automatic justification for any method; the measure of "fit," to use Mayo's term, could be that provided by any phylogenetic method, and severity would still be indicated by a low p(e,b). In contrast, other subsequent recastings of Popperian corroboration, while explicitly considering the terms p(e,hb) and p(e,b), actually assign p(e,b) no role. For example:

the best supported hypotheses are those that assign highest probability to the evidence. Only p(e,hb) can perform this role; the other term p(e,b) does not involve h. (Carpenter et al., 1998:107)

Similarly, the early link of severity to improbable hypotheses (Kluge, 1984) has been freshly recast in a recent "review" (Kluge, 1997) as an improbability of evidence in light of the background knowledge. However, "evidence," e, here reflects only properties of the data itself, implying that all tree hypotheses have the same value for p(e,b).
So this term again plays no role in determining relative corroboration/severity for different tree hypotheses. Mayo's book surely is a must-read for systematists just for the opportunity it offers to explore the mismatch between these recent justifications for cladistics and the severity-as-improbability-of-evidence of Popper*. But Mayo's book also may have a broader appeal to systematists. Currently, philosophical justifications and distinctions in systematics seem to be the domain of cladistics but not other methods (see Siddall and Kluge, 1997, and reference within to Huelsenbeck, 1996; Wenzel, 1997, and reference within to Swofford et al., 1996). Mayo's book shows that severity is a powerful, general criterion with wide application. Her broad framework includes "methodological rules" and "error repertoires," where experience with severity applications leads to lessons about how methods and models perform in different contexts. At this level, the book may point to a more inclusive philosophy for systematics. Various methods in systematics may produce severe tests, and the methods themselves may be severely tested. Whereas a sloganized Popper has provided an exclusive philosophy, twisting and turning to uniquely justify cladistic parsimony, Popper* is relevant to phylogenetic analyses using parsimony and other methods.

REFERENCES

Carpenter, J. M., P. A. Goloboff, and J. S. Farris. 1998. PTP is meaningless, T-PTP is contradictory: A reply to Trueman. Cladistics 14:105–116.
Faith, D. P. 1990. Chance marsupial relationships. Nature 345:393–394.
Faith, D. P. 1992. On corroboration: A reply to Carpenter. Cladistics 8:265–273.
Faith, D. P., and P. S. Cranston. 1991. Could a cladogram this short have arisen by chance alone? On permutation tests for cladistic structure. Cladistics 7:1–28.
Faith, D. P., and P. S. Cranston. 1992. Probability, parsimony, and Popper. Syst. Biol. 41:252–257.
Huelsenbeck, J. P. 1996. Phylogenetic methods. Pages 1–8.
http://mw511.biol.berkeley.edu/john/lecture.html.
Kluge, A. G. 1984. The relevance of parsimony to phylogenetic inference. Pages 24–38 in Cladistics: Perspectives on the reconstruction of evolutionary history (T. Duncan and T. F. Stuessy, eds.). Columbia University Press, New York.
Kluge, A. G. 1997. Testability and the refutation and corroboration of cladistic hypotheses. Cladistics 13:81–96.
Popper, K. 1963. Conjectures and refutations: The growth of scientific knowledge. Harper and Row, New York.
Popper, K. 1982. Realism and the aim of science. Hutchinson, London.
Siddall, M. E., and A. G. Kluge. 1997. Probabilism and phylogenetic inference. Cladistics 13:313–336.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.
Wenzel, J. W. 1997. When is a phylogenetic test good enough? Mem. Mus. Natl. Hist. Nat. 173:31–45.

Daniel P. Faith, Australian Museum, Sydney 2010, Australia