LSE: PH500: Research Seminar:
Contemporary Problems in Philosophy of
Statistical Science
Deborah G Mayo
Summer 2012
Debates over the philosophical foundations of
statistical science have a long and fascinating
history marked by deep and passionate
controversies that intertwine with fundamental
notions of the nature of statistical inference and
the role of probabilistic concepts in inductive
learning. Progress in resolving decades-old
controversies, which still shake the foundations
of statistics, demands both philosophical and
technical acumen, but gaining entry into the
current state of play requires a road map that
zeroes in on core themes and current
standpoints. That is what this seminar will
provide.
PH 500. Contemporary Problems in
Philosophy of Statistical Science
1. Outline of 3 whirlwinds: Key themes,
concepts and problems
One tentative suggestion for the 5 seminars:
1. Overview: Contemporary controversies in
philosophy of statistics: Bayesian –
frequentist statistics, error statistics
2. Extension/continuation of #1
July –Oct/Nov/Dec: self-study of statistical
methods background
3. Error statistics and new reforms: confidence
intervals, significance tests, severity and
possible new tools (e.g., meta-analysis)
4. Contemporary issues in Bayesian
foundations
5. Testing assumptions of statistical models
and/or philosophy of statistics in evidence-based policy
Note: Not everything I say will be
uncontroversial, but I think the work will put
you in a position to take these ideas in a
different direction and go further.
2. What is the Philosophy of Statistics?
Study of the epistemological, conceptual and
logical problems revolving around the use and
interpretation of statistical methodology
(statistical science: systematic methods for
collecting, modeling, and making inferences
about aspects of data generating mechanisms)
One year ago, I had a conversation with Sir
David Cox*
COX: Deborah, in some fields foundations do
not seem very important, but we both think
foundations of statistical inference are
important; why do you think that is?
MAYO: I think because they ask about
fundamental questions of evidence, inference, and probability. …in statistics we’re
so intimately connected to the scientific
interest in learning about the world, we
invariably cross into philosophical questions
about empirical knowledge and inductive
inference.
*A Statistical Scientist Meets a Philosopher of Science: A
Conversation between Sir David Cox and Deborah Mayo
(as recorded, June, 2011) RMM Vol. 2, 2011, 103–114;
Special Topic: Statistical Science and Philosophy of
Science
While statisticians and philosophers of science
go about their work in very different ways, at
one level, they ask many similar questions:
What should be observed, what inferences
are (are not) warranted from the data?
How well do data confirm or fit a model?
What is a good test? Does failure to reject
hypothesis H yield evidence “confirming” H?
How can we tell if an anomaly is genuine?
How can blame for a data-model misfit be
attributed correctly?
Is it relevant to inference if data have been
used to construct or select the hypothesis for
testing? (novelty, double-counting)
How can we learn about/test causal claims?
How can the gaps between available data
and theoretical claims be bridged reliably?
No wonder statistics tends to readily cross over
into philosophical territory.
Central job of philosophers of science:
to help resolve the conceptual, logical, and
methodological discomforts of scientists,
especially in fields that deal with issues of
scientific knowledge, evidence, inference, and
learning despite uncertainties and errors?
The risk of error enters because we want to
move beyond the evidence or data
3. Induction as evidence-transcending inference
The conclusion is "evidence transcending":
The premises can be true while the conclusion
inferred may be false without a logical
contradiction.
This frees us to talk about “induction”
without presupposing certain forms of
ampliative inference
…in particular, without presupposing there’s
just one role for probability (however it’s
interpreted)
While probability naturally arises in capturing
induction, there are two key traditions as to its role:
4. Two roles for probability in inductive
inference: to measure:
1. How much confidence, belief, support to
assign a hypothesis (evidential-relation)
2. How reliably a hypothesis was tested, how
well-tested, probed, or corroborated it is
(severe testing, Popper, Peirce)
This contrast is at the heart of a
philosophical scrutiny of statistical accounts.
Philosophers have generally held the former.
Conceding that all attempts to solve the problem
of induction fail, philosophers turned to
constructing logics of induction or confirmation
theories (e.g., Carnap 1962).
The thinking was/is:
Deductive logic: rules to compute whether a
conclusion is true, given the truth of a set of
premises (True)
Inductive logic or confirmation theory: would
provide rules to compute the probability of a
conclusion, given the truth of certain evidence
statements (?)
I suggest rejecting this parallel (Peirce)
But following the idea that an inductive logic
should supply a degree of evidential relationship
between given evidence statements e and a
hypothesis H, it is common to look to conditional
probability or Bayes’s Theorem:
P(H|e) = P(e|H)P(H)/P(e)
where P(e) = P(e|H)P(H) + P(e|not-H) P(not-H).
How to obtain these probabilities is outside of
what the “logic” provides
Computing P(H|e), the posterior probability,
requires a probability assignment to all of the
members of “not-H” (Bayesian catchall).
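A minimal numerical sketch of this computation (the prior, the likelihoods, and the lumped catchall alternative are purely hypothetical placeholders; the logic itself does not say where they come from):

```python
# Minimal sketch of the "logic": posterior from a prior, likelihoods, and a lumped catchall.
def posterior(prior_H, lik_e_given_H, lik_e_given_notH):
    """P(H|e) = P(e|H)P(H) / [P(e|H)P(H) + P(e|not-H)P(not-H)]"""
    prior_notH = 1 - prior_H                      # the "catchall" not-H gets the rest of the prior
    p_e = lik_e_given_H * prior_H + lik_e_given_notH * prior_notH
    return lik_e_given_H * prior_H / p_e

# Hypothetical inputs: how to obtain them is outside what the logic provides.
print(posterior(prior_H=0.2, lik_e_given_H=0.9, lik_e_given_notH=0.3))   # ~0.43
```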
Major source of difficulty: how to obtain and
interpret these prior probabilities.
a. If analytic and a priori, relevance for
predicting and learning about empirical
phenomena is problematic
b. If they measure subjective degrees of belief,
their relevance for giving objective
guarantees of reliable inference is unclear.
In statistics,
(a) is analogous to default, reference, or
“objective” Bayesianism (e.g.,
Jeffreys, Berger);
(b) to subjective Bayesianism
(other types may exist)
“Error probability” methods
• Designed to reach statistical conclusions
without invoking prior probabilities in
hypotheses.
• Probability is used to quantify how frequently
methods are capable of discriminating
between alternative hypotheses and how
reliably they facilitate the detection of error.
Error frequencies or error probabilities (e.g.,
significance levels, confidence levels).
In addition to a difference in form, it is part of the
account to provide and check the components
5. Statistical Test Ingredients
(A) Hypotheses. A statistical hypothesis H,
generally couched in terms of an unknown
parameter θ, is a claim about some aspect of the
process that generated the data, x0 = (x1, …, xn),
given in some model of the process
(approximate, partial).
Statistical hypotheses assign probabilities to
various outcomes x0 “computed under the
supposition that Hi is correct (about the
generating mechanism, as modeled)”.
That is how one should read f(x; Hi) or P(x; Hi).
Note that this is not a conditional
probability, since that would assume that there is
a prior probability for Hi.
(B) Distance function. A function of the data
d(X), the test statistic, reflects how well or
poorly the data x0 = (x1, …,xn) fit the hypothesis
H0—the larger the value of d(x0) the farther the
outcome is from what is expected under H0 in
the direction of alternative H1, with respect to
the particular question being asked.
Though no observed difference is strictly
inconsistent with the truth of H0, some are
deemed so improbable under H0, and so much
more probable under alternatives, that we take
the data as evidence against H0 (if not for
rejecting it).
(D) Test rule T. One type: infer that x0 is
evidence of a discrepancy from a null
hypothesis H0 just in case d(x0) > c.
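A minimal sketch of how these ingredients fit together for a one-sided test of a Normal mean; the particular numbers (mu0, sigma, n, the cut-off c, and the observed mean) are illustrative choices, not part of the account:

```python
import math

# (A) Hypotheses: H0: mu <= mu0 vs H1: mu > mu0, under a Normal model N(mu, sigma^2).
mu0, sigma, n = 0.0, 1.0, 25
sigma_x = sigma / math.sqrt(n)

# (B) Distance function: standardized difference between the sample mean and what H0 expects.
def d(sample_mean):
    return (sample_mean - mu0) / sigma_x

# (D) Test rule T: infer a discrepancy from H0 just in case d(x0) > c.
c = 1.96
sample_mean = 0.45                           # hypothetical observed mean
print(d(sample_mean), d(sample_mean) > c)    # ~2.25, True: taken as evidence against H0
```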
By computing the probability of different values
of d(X) under a test or null hypothesis H0, we
can calculate P{d(X) > c; H0}
and thereby calculate the probability of
erroneously inferring evidence for H: an error
probability.
Such an error probability is given by the
probability distribution of d(X)—called its
sampling distribution—computed under one or
another hypothesis.
For example, in significance tests we may
calculate the p-value, the probability of a
difference larger than d(x0), under the
assumption that H0 is true:
p(x0) = P(d(X) > d(x0); H0).
If the p-value is small, e.g., .01, .05, it may be
taken as evidence of an inconsistency with H0 at
a given level.
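A small sketch of this p-value computation for the Normal-mean example above (the observed statistic is an illustrative value):

```python
from scipy.stats import norm

# p(x0) = P(d(X) > d(x0); H0): under H0 the standardized statistic is ~ N(0, 1).
d_obs = 2.25                      # illustrative observed value of d(x0)
p_value = norm.sf(d_obs)          # survival function: 1 - Phi(d_obs)
print(round(p_value, 4))          # ~0.0122: small, so taken as inconsistent with H0
```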
6. Unifying Fisherian and Neyman-Pearsonian Tests
It might seem that I’m blurring Fisherian and
N-P tests (the N-P test is usually formulated as
specifying in advance the cut-off point for
rejection, whereas the Fisherian test reports
observed p-values).
A close look finds this does not lead to the
“inconsistent hybrid” of tests (Gigerenzer) many
fear
The disputes are almost entirely the outgrowth
of personality and professional hostilities that
grew years after N-P tests were erected to place
Fisherian tests on firmer footing—and Fisher
agreed (until the break-up for other reasons)
In both approaches, probability enters in
scientific inference to provide error probabilities
of test (or inference rules), given by the
sampling distribution of statistics.
The sampling distribution may be used to
characterize the capability of the inferential rule
to unearth flaws and distinguish hypotheses.
At any rate, that is the thesis of the account I
aim to develop.
What makes an account “error statistical” is
its consideration of these error probabilities.
Fisherian “pure” significance tests and N-P
tests just refer to different questions or stages of
inquiry (in sync with Cox’s taxonomy of tests)
But that’s not enough to make it adequate:
these computations must be directed to assessing
how stringently or severely tested claims are (by
dint of the test and data).
7. Statistical Hypotheses and Occurrence of
Events
A confusion often lurking in the background
of many foundational discussions stems from
conflating the goal of assigning probabilities to
the occurrence of events with that of assigning
probabilities to the hypotheses themselves.
To infer a statistical hypothesis H is to infer
a claim that assigns probabilities to the various
experimental outcomes and to the events
described in terms of them.
This is very different from assigning a
probability to H itself (as in a Bayesian analysis)
Interestingly, all accounts seem to agree that
probability is assigned to events (or
corresponding statements)
How then could probabilities be assigned to
hypotheses?
The truth of a hypothesis would need to be
like the occurrence of an event (but it’s not clear
how this works in scientific contexts)
(Are universes “as plentiful as blackberries”?
(Peirce) Even if they were, it’s not clear this
gives the probability of a particular hypothesis.)
8. Scrutinizing a statistical account at a
philosophical or foundational level…we ask:
Does it provide an adequate characterization of
scientific reasoning, evidence, inference,
testing?
o it should be ascertainable (must be able to
apply it)
o it should be relevant to the tasks required of
the inference tools
o it should not be in conflict with intuitions
about inference (science, evidence)
—it must have a principled way to avoid
these conflicts (not merely a reconstruction
game)
Right away we are confronted with issues that
appeal to contrasting “philosophical theories”
Moreover, these questions require asking:
what can various methods be used for? (how
else to know if they’re relevant?)
• This is distinct from what a method’s
founders may have had in mind, or textbook
accounts
• Demands standing “one level removed”
from common interpretations/applications of
methods
I do not want to rehash the “statistics wars” of
the 60s, 70s, 80s, 90s, up to the present:
• even though the significance test
controversy is still hotly debated among
practitioners (in psychology, epidemiology,
ecology, economics)
• even though it seems each generation fights
the ‘statistics wars’ anew, with journalistic
reforms, and task forces set up to stem
automatic, recipe-like uses of statistics that
have long been deplored (#5 seminar?).
I do want to set the stage for making progress in
these decades-old controversies, as well as tackle
current ones:
(i) in statistical practice, and
(ii) philosophical problems of evidence,
inference, induction
9. Relevance for Statistical Science Practice?
Only in certain moments is there a need for a
philosophical or self-reflective standpoint in
practice.
Increasingly, over the past decade some issues
seem to cry out for philosophical ministrations
Eclecticism:
Granted, many statisticians suggest that
throwing different and competing methods at a
problem is all to the good: it increases the
chance at least one will be right.
This may be so, but one needs to understand
how to interpret and relate competing
answers….which goes back to philosophical
underpinnings…
Unificationism:
Others think that foundational conflicts are bad
for the profession and seek some kind of
[Bayesian-Frequentist] “unification” or
“reconciliation”
“We [statisticians] are not blameless….we
have not made a concerted professional effort
to provide the scientific world with unified
testing methodology…and so are tacit
accomplices in the unfortunate statistical
situation” (J. Berger, 2003)
• Not waiting for philosophers….“we focus
less on what is correct philosophically than on
‘what is correct methodologically’
…professional agreement on statistical
philosophy is not on the immediate horizon, but
this should not stop us from agreeing on
methodology.”
But what is correct methodologically
depends on what is correct philosophically
If we look deeply at
(a) what’s put forward as a Bayesian-frequentist
unification that everyone should love, and
(b) reactions to them,
we unearth several gems for understanding
and getting beyond current impasses.
The unificationists (e.g., “default”
Bayesians) think Bayesians and frequentists
should be content with their offerings, but
neither seems to be…why? (seminar #4)
Finding ways to equate posterior
probabilities with frequentist error
probabilities (however clever) masks
underlying conflicts: we get agreements on
numbers that fail both as degrees of belief
and as relevant error probabilities….
As a result, subjective Bayesians don’t seem
to like the unifications much:
Dennis Lindley: They focus too much on
technique at the expense of the “Bayesian
standpoint” (i.e., updating degrees of belief)
(1997)
But there is growing confusion as to what
the Bayesian or frequentist standpoints are (I
hope to clarify this in our seminar)
Jay Kadane, venerable subjective Bayesian:
“The growth in use and popularity of Bayesian
methods has stunned many of us who were
involved in exploring their implications
decades ago. The result, …is that there are
users of these methods who do not understand
the philosophical basis of the methods they are
using, and hence may misinterpret or badly
use the results. ….. No doubt helping people
to use Bayesian methods more appropriately is
an important task of our time”
But many question whether the classic
subjective philosophy (that Kadane, Lindley
and others have in mind) is the only or the
best way to be helped…
The predominant uses of Bayesianism these
days are not subjective
Andrew Gelman, a practicing Bayesian
statistician, recently said, surprisingly:
“The main point where we disagree with
many Bayesians is that we do not think that
Bayesian methods are useful for giving the
posterior probability that a hypothesis is
true... Bayesian inference is good for
deductive inference within a model, …for
evaluating a model, we prefer …what Cox
and Hinkley call ‘pure significance testing’.”
In a recent paper (Gelman and Shalizi 2012),
they even call for error statistical
foundations for their brand of Bayesianism
I’d like to read this and/or the Gelman piece
in the special topic in RMM.
10. My Diagnosis: What do we want? What
can our methods do?
The current situation is a result of never having
been clear on contrasting views on
(a) the roles of probability in ampliative
inference and
(b) the nature and goals of
inductive/statistical inference in relation to
scientific inquiry
 What is correct methodologically turns on
philosophy
 Methodology without philosophy is shallow
(as is philosophy of statistics without
statistical methodology)
To begin we might probe a familiar philosophy-statistics analogy:
11. Popper is to Frequentists as Carnap is
to Bayesians
“In opposition to [the] inductivist attitude, I
assert that C(H,x) must not be interpreted as
the degree of corroboration of H by x, unless
x reports the results of our sincere efforts to
overthrow H. The requirement of sincerity
cannot be formalized—no more than the
inductivist requirement that e must represent
our total observational knowledge.” (Popper
1959, p. 418)
“Observations or experiments can be
accepted as supporting a theory (or a
hypothesis, or a scientific assertion) only if
these observations or experiments are
severe tests of the theory, or in other words,
only if they result from serious attempts to
refute the theory….” (Popper, 1994, p. 89)
Did BP representatives have good evidence that
H: the blowout preventer is satisfactory (April,
2010)?
Not if they kept decreasing the pressure until H
passed, rather than performing a more severe
test (as they were supposed to).
Or if they ignored the conflicting readings,
failed to check if the battery was out, etc.
Passing this overall “test” made it too easy for
H to pass, even if false.
(Popper always used GTR as an example of a
good test, but the initial studies weren’t very
severe)
When we reason this way, we are insisting on the
Weakest Requirement for a Genuine Test:
Agreement between data x and H fails to
count in support of a hypothesis or claim H if
so good an agreement was (virtually) assured
even if H is false: no test at all!
12. The Severe Tester Standpoint
This standpoint is admittedly skeptical and
demanding; it places the burden of proof
differently than accounts built on positive
instances or agreement or the like.
Perhaps it’s just a very different job it seeks
(even if you wanted to retain the idea of giving
overall degrees of confirmation, you might
admit this distinct job)
While my recommendation is in the Popperian
spirit, his computations never gave him a way to
characterize severity adequately.
Aside: Popper wrote to me “I regret never
having learned statistics”
13. Neymanian Power Analysis: Neyman to
Carnap
In a little-known article (one of 3) Neyman
responds to Carnap’s criticism of “Neyman’s
frequentism”:
“When Professor Carnap criticizes some attitudes
which he represents as consistent with my
(“frequentist”) point of view, I readily join him in
his criticism without, however, accepting the
responsibility for the criticized paragraphs”. (p. 13)
To Carnap, Neyman is an “inductive straight
ruler”:
If all you know is p% of H’s have been C’s then
the frequentist infers there is a high probability
that p% of H’s will be C in a long series of trials.
This overlooks, Neyman retorts, “that applying
any theory of inductive inference requires a
theoretical model of some phenomena, not merely
the phenomena themselves”.
“I am concerned with the term ‘degree of
confirmation’ introduced by Carnap. …We have
seen that the application of the locally best
one-sided test to the data…failed to reject the
hypothesis [that the 26 observations come from a
source in which the null hypothesis is true]. The
question is: does this result ‘confirm’ the
hypothesis [that H0 is true of the particular data set]?”
Locally best one-sided Test T+
A sample X = (X1, …, Xn), each Xi Normal,
N(μ, σ²) (NIID), σ assumed known; M is the
sample mean.
H0: μ ≤ μ0 against H1: μ > μ0.
The test statistic d(X) is the sample mean M
standardized: d*(X) = (M − μ0)/σx,
where σx = σ/√n.
The test fails to reject the null if d*(x0) ≤ some
cut-off cα, or the p-value is not very small.
Carnap says yes…
“….the attitude described is dangerous.
…the chance of detecting the presence [of a
discrepancy γ from the null], when only [this
number] of observations are available, is
extremely slim, even if [γ] is present.”
“One may be confident in the absence of that
discrepancy only if the power to detect it were
high.”
(1) P(d(X) > c;  = 0 + )
Power to detect 
 Just missing the cut-off c is the worst case
 It is more informative to look at the probability
of getting a worse fit than you did
(2) P(d(X) > d(x0);  = 0 + ) “attained power”
a measure of the severity (or degree of
corroboration) the inference  < 0 + 
Not the same as something called “retrospective
power” or “ad hoc” power! (Shpower)
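A sketch contrasting (1) and (2); the cut-off, the observed statistic, and the discrepancy γ (expressed here in σx units) are illustrative values:

```python
from scipy.stats import norm

c, d_obs, gamma = 1.96, 1.02, 3.0   # cut-off, observed d(x0), discrepancy gamma in sigma_x units

power    = norm.sf(c - gamma)       # (1) P(d(X) > c;     mu = mu0 + gamma)
attained = norm.sf(d_obs - gamma)   # (2) P(d(X) > d(x0); mu = mu0 + gamma)
print(round(power, 2), round(attained, 2))   # ~0.85, ~0.98
```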
My current blogpost:
Neymanian Power Analysis (Detectable
Discrepancy Size): If data x are not statistically
significantly different from H0, and the power to
detect discrepancy γ is high (low), then x
constitutes good (poor) evidence that the actual
effect is no greater than γ.
By taking into account the actual x0, a more
nuanced post-data reasoning may be obtained.
This may be captured in the Severity
Interpretation of Acceptance SIA
SIA: (a): If there is a very high probability that
[the observed difference] would have been
larger than it is, were μ > μ1, then μ < μ1 passes
the test with high severity, and not otherwise…
(Mayo and Spanos 2006); or in what we call a
frequentist principle of evidence FEV(ii) (Mayo
and Cox 2005, 2010)
While data-dependent, the reasoning using (2)
is still in sync with Neyman’s argument here...
Is it part of the N-P school?
I recommend moving away from the idea
that we are to “sign up” for N-P or Fisherian
“paradigms”
Quick glance at applying the Severity
Interpretation
With reference to the one-sided test T+:
H0: μ ≤ μ0 vs H1: μ > μ0, where it is assumed
that σ is known.
(SIA): If d(x0) is not statistically
significant, then test T+ passes
μ < μ0 + γ
We can generalize in this case: if the
results are statistically insignificant,
then we can pass
μ < M0 + kε σx
with severity (1 – ε),
where P(d(X) > kε) = ε
and σx = σ/√n.
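A sketch of this generalization: for a few choices of ε, compute kε from the standard Normal and report the bound that passes with severity 1 − ε (M0, σ, and n are illustrative values):

```python
import math
from scipy.stats import norm

sigma, n, M0 = 1.0, 26, 0.2                    # illustrative values
sigma_x = sigma / math.sqrt(n)

for eps in (0.5, 0.16, 0.05, 0.025):
    k_eps = norm.isf(eps)                      # k_eps satisfying P(Z > k_eps) = eps
    bound = M0 + k_eps * sigma_x
    print(f"mu < {bound:.3f} passes with severity {1 - eps:.3f}")
```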
If one wants a post-data measure, one can write:
SEV(μ < M0 + γσx) to abbreviate:
The severity with which test T with data x
passes the claim:
(μ < M0 + γσx).
One can consider a series of upper discrepancy
bounds…
SEV(μ < M0 + 0σx) = .5
SEV(μ < M0 + .5σx) = .7
SEV(μ < M0 + 1σx) = .84
SEV(μ < M0 + 1.5σx) = .93
SEV(μ < M0 + 1.98σx) = .975
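These values can be reproduced directly, since for the statistically insignificant result SEV(μ < M0 + kσx) equals the standard Normal probability Φ(k); a quick sketch:

```python
from scipy.stats import norm

# SEV(mu < M0 + k*sigma_x) = P(M > M0; mu = M0 + k*sigma_x) = Phi(k)
for k in (0.0, 0.5, 1.0, 1.5, 1.98):
    print(f"SEV(mu < M0 + {k}*sigma_x) = {norm.cdf(k):.3f}")
# 0.500, 0.691, 0.841, 0.933, 0.976 -- agreeing (to rounding) with the bounds above
```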
But aren’t I just using this as another way to say
how probable each claim is?
No. This would lead to inconsistencies (if we
mean mathematical probability)
By indicating which discrepancies are
and are not indicated at various levels,
we avoid:
• “fallacies of insignificant results”
• the charge that tests do not supply
measures of effect or discrepancy size.
Aside: relation to confidence interval
estimation
It will be noted these upper bounds correspond
to upper limits of confidence intervals at various
levels; however, the confidence interval
corresponding to our one-sided test is the lower
.975 bound:
(μ > M0 – 1.96σx)
Does the SEV construal go beyond standard
frequentist methods?
Maybe so: but as a philosopher of statistics I am
supplying the tools with an interpretation and an
associated philosophy of inference….
Nor do my comments turn on whether one
replaces frequencies with “propensities”
(however they are to be defined).
(Aside from people already mentioned, this may
relate also to some work on “confidence
distributions”)
14. Error (Probability) Statistics: two keys
What is key on the statistics side: The
probabilities refer to the distribution of
statistic d(x) (sampling distribution)—
applied to events
What is key on the philosophical side: error
probabilities may* be used to quantify
probativeness or severity of tests (for a
given inference)
*they do not always or automatically give this
The standpoint of the severe prober, or the
severity principle, directs us to obtain error
probabilities that are relevant to determining
well-testedness
15. Menu of Criticisms of Frequentist (error
statistical) Tests
(Mayo and Spanos 2011)
While we list 13, they really boil down to
fallacies of acceptance, fallacies of rejection,
and the behavioristic interpretation.
(#1) Error statistical tools forbid using any
background knowledge.
(#2) All statistically significant results are
treated the same.
(#3) The p-value does not tell us how large a
discrepancy is found.
(#4) With large enough sample size even a
trivially small discrepancy from the null can be
detected.
(#5) Whether there is a statistically significant
difference from the null depends on which is the
null and which is the alternative.
(#6) Statistically insignificant results are taken
as evidence that the null hypothesis is true.
(#7) Error probabilities are invariably
misinterpreted as posterior probabilities.
(#8) Error statistical tests are justified only in
cases where there is a very long (if not infinite)
series of repetitions of the same experiment.
(#9) Specifying statistical tests is too arbitrary.
(#10) We should be doing confidence interval
estimation rather than significance tests.
(#11) Error statistical methods take into account
the intentions of the scientists analyzing the
data.
(#12) All models are false anyway.
(#13) Testing assumptions involves illicit data-mining.
16. Two Tenets Behind the Criticisms
What we need
1. Probabilism: the role of probability in
inference is to assign a degree of belief,
support, confirmation (given by mathematical
probability)
(an adequate statistical account requires a
posterior probability assignment to hypotheses)
What we get from “frequentist” methods
2. (Radical) Behavioristic Construal: the
frequentist’s central aim is assuring low
long-run error (of some sort):
It fails to give an evidential account, and is
useless for inference
17. What Would a “Logic” of Severe
Testing Be?
If H has passed a severe test T with x, there is no
problem in saying x warrants H, or, if one likes,
x warrants believing in H….
If SEV(H) is high, the severity of its denial is
low, i.e., SEV(~H) is low.
But it does not follow that a severity assessment
should obey the probability calculus, or be a
posterior probability….
Both SEV(H) and SEV(~H) can be low if
there fails to be good evidence for either
……in conflict with a probability assignment
(there may be a confusion with the ordinary-language
use of “probability”)
Consider
H(GTR): the General Theory of Relativity is an
adequate account of gravity
We have severely passed individual hypotheses
from GTR, e.g.,
J: the deflection effect is approximately 1.75 ± ε,
but not the full theory.
Avoid evidence for irrelevant conjunctions
Other points lead me to deny probability yields a
logic we want for well-testedness
Irrelevant conjunctions (the tacking paradox) for
Bayesians and hypothetico-deductivists:
Data x from test T may well warrant
J: the deflection effect is λ ± ε
But if I conjoin J with a hypothesis H not probed in
the least by x, then the conjunction gets no warrant:
If SEV for H is low, then SEV for (H & J) is low, even
if SEV for J is high.
18. Severity vs “Rubbing Off”
My recommendation differs from what I call the
rubbing-off construal: the procedure is rarely
wrong; therefore, the probability it is wrong in
this case is low (Birnbaum’s “Confidence
Concept”?).
Still too behavioristic and too pre-data
The long-run reliability of the rule is a necessary
but not a sufficient condition to infer H (with
severity)
The reasoning instead is counterfactual:
H: μ < M0 + 1.96σx
passes severely because, were this inference false
and the true mean μ > M0 + 1.96σx,
then, very probably, we would have observed a
larger sample mean.
It does not suffice that the inference was
generated by a rule with good performance
characteristics in the long run of applications….
Supposing it does, i.e., supposing a “radical
behavioristic” position, leads to well-known
howlers.
19. Criticism Based on Mixture Tests
Rather than give the technical statistical example
(Cox 1958) I jump to an informal example of the
kind of howler that the frequentist account is alleged
to permit:
Oil Exec: Our inference that H: the blowout
preventer is capable of doing its job is highly
reliable
Senator: But didn’t you just say no cement bond
log was performed, when it initially failed…?
Oil Exec: That’s what we did on April 20, but
usually we do—I’m giving the average.
We use a randomizer that most of the time
directs us to run the gold-standard check on
pressure, but with small probability tells us to
assume the pressure is fine, or keep decreasing
the pressure of the test until it passes….
20. Weak Conditionality Principle and Strong
Likelihood Principle
A report of the average error might be relevant
for some contexts (insurance?), but the oil rep
gives a highly misleading account of the stringency
of the actual test H managed to “pass” with the
data on April 20.
This violates the severity criterion: it would be
easy to ensure evidence that H is true, even
when H is false.
D. R. Cox’s Weak Conditionality Principle:
Cox is famous for this kind of example, and
gave a way to solve the problem years ago: “if it
is known which experiment produced the data,
inferences about μ are appropriately drawn in
terms of the sampling distribution of the
experiment known to have been performed.”
An actual example is with two measuring
instruments with different precisions
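A small sketch of the two-instruments idea with made-up numbers (the precisions and the fixed rejection rule are illustrative, not Cox’s own figures): the unconditional “average” error rate describes neither instrument actually used, which is why conditioning on the instrument is the relevant move.

```python
from scipy.stats import norm

# Two instruments measure the same mean mu, chosen by a fair coin flip:
# E1 is precise (sigma = 1), E2 is imprecise (sigma = 10).
# Consider the fixed rule "reject H0: mu = 0 when |x| > 1.96".
sigma1, sigma2, cut = 1.0, 10.0, 1.96

err1 = 2 * norm.sf(cut / sigma1)        # type I error if E1 was actually used: 0.05
err2 = 2 * norm.sf(cut / sigma2)        # type I error if E2 was actually used: ~0.84
avg  = 0.5 * err1 + 0.5 * err2          # unconditional average over the coin flip: ~0.45
print(round(err1, 2), round(err2, 2), round(avg, 2))
```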
Nevertheless, we see in numerous books and
papers that the frequentist cannot really avail
herself of ways out (of reporting the irrelevant
average error rate)….
There is a famous “breakthrough” argument
purporting to show that error statistical
principles entail the Strong Likelihood Principle
(SLP) (Birnbaum, 1962), but I claim the argument
is flawed: invalid or unsound (Mayo 2010).
The SLP is fundamental to statistical
foundations.
Error statisticians violate the SLP
One way to illustrate its violation in frequentist
statistics is via the “Optional Stopping Effect”.
We have a random sample from a Normal
distribution with mean µ and standard deviation σ,
i.e., Xi ~ N(µ, σ), and we test H0: µ = 0 vs. H1: µ ≠ 0.
Stopping rule:
Keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |M| ≥ 1.96σ/√n).
With this stopping rule the actual significance level
differs from, and will be greater than, .05.
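A quick simulation sketch of this effect (the maximum sample size, the starting point for checking, and the number of trials are arbitrary choices): sampling “until significance” with a true null rejects far more often than .05.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, max_n, trials = 1.0, 200, 2000
rejections = 0
for _ in range(trials):
    x = rng.normal(0.0, sigma, size=max_n)                 # H0 true: mu = 0
    n_vals = np.arange(1, max_n + 1)
    running_mean = np.cumsum(x) / n_vals                   # sample mean after each observation
    # apply the rule from n = 10 onward: stop (and "reject") once |M| >= 1.96*sigma/sqrt(n)
    if np.any(np.abs(running_mean[9:]) >= 1.96 * sigma / np.sqrt(n_vals[9:])):
        rejections += 1
print(rejections / trials)    # well above .05, and it keeps growing as max_n increases
```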
Fifty years ago, almost to the day, (subjective)
Bayesian Savage announced to a group of
eminent statisticians that “optional stopping is
no sin”.
The likelihood principle emphasized in Bayesian
statistics implies, … that the rules governing when
data collection stops are irrelevant to data
interpretation. (Edwards, Lindman, Savage 1963, p.
239).
They are quite relevant to the error statistician:
ignoring them ensures high or maximal probability
of error, violating weak severity.
The SLP violates the weak repeated sampling principle
(Cox and Hinkley).
It has long been thought that the Bayesian
foundational superiority over error statisticians
stems from Bayesians upholding, while
frequentists violate, the likelihood principle (LP).
Frequentists concede that they (the
Bayesians) are coherent, while we are not…
What then to say about leading default
Bayesians (Bernardo 2005, Berger 2004)?
They admit that “violation of principles such as
the likelihood principle is the price that has to be
paid for objectivity” (Berger, 2004).
Although they are using very different
conceptions of “objectivity”, there
seems to be an odd sort of agreement between
them and the error statistician.
21. Do the concessions of reference Bayesians
bring them into the error statistical fold?
I think not (though I will leave this as an open
question). I base my (tentative) no answer on:
• While they may still have some ability to
ensure low error probabilities in the long run,
and some might therefore say they may be
regarded as “frequentists”*, this would not make
them error statisticians….
• While using error probabilities relevant for
assessing well-testedness in the case at hand
entails violating the LP, the converse is not true.
We would ask: What are its (Bayesian)
foundations?
Impersonal priors may not be construed as
measuring beliefs or even probabilities—they
are often improper.
If prior probabilities represent background
information, why do they differ according to
the experimental model?
They are mere “reference points” for getting
posteriors, but how shall they be interpreted?
22. Overview: Putting My Cards on the Table
Statistical methods, as I see them, provide
techniques for modeling, checking for, avoiding,
and learning from mistakes.
This conception of statistics is sufficiently
general to embrace any of the philosophies of
statistics now on offer, even though each
requires its own interpretation
It does not readily lend itself to a single
overarching “logic” of the sort to which it has
been traditional to look in seeking a
philosophical foundation for statistics.
We want error control, but it shouldn’t be so
extreme that little of informative significance is
learned: we want both reliability of tests and
informativeness of the claims inferred.
We seek not an idealized approach but one
that captures how we actually set sail to find
things out, and how to do it better.
Although the overarching goal of inquiry is
to find out what is (truly) the case, the
hypotheses that serve this purpose are generally
approximations and may even be deliberately
false.
An inductive inference, in this conception,
takes the form of inferring hypotheses or claims
to the extent that they have been well tested.
It also requires reporting claims that have
not passed severely, or have passed with low
severity.
While there are “behavioristic” contexts, in
scientific contexts we use these methods to distinguish
correct and incorrect interpretations of this data
set in relation to hypotheses about this
phenomenon (in this world).
How?
Answer: by enabling the assessment of
how well probed or how severely tested H is
with data x (along with a background or a
“repertoire of errors”).
While the degree of severity (with which H
has passed a test T with data x) allows
determining if it is warranted to infer H, it is not
assigned to H itself: it is an attribute of the test
procedure as a whole (including the inference
under consideration).
In the most interesting cases of scientific
learning, the ingredients for the exhaustive set of
hypotheses required for a logic of probability or
inductive logic are absent.
Moreover, the logic of probability fails as a
logic for well-testedness; I propose to replace it
with a probabilistic concept that succeeds.
New Foundations?
The classic error statistical procedures, long
the brunt of philosophical criticism,
should be given a greater run for their money.
While the error statistical paradigm builds on
these statistical methods, it will also supply a
general framework for a richer understanding of
the philosophical problems of evidence,
including strands of Bayesian epistemology and
Bayesian statistics.
The idea that non-Bayesian ideas might
afford a foundation for the many strands of
Bayesianism is not as preposterous as it first
seems.
But the idea of severe testing is sufficiently
general to apply to any other methods on offer:
Any inference, (even to a posterior
probability), can be said to be warranted just to
the extent that the inference has withstood
severe testing: one with a high probability of
having found flaws were they present.
(There is evidence that some Bayesians are
looking to error statistics to form a non-Bayesian
foundation for their methods.)
You needn’t concur with this statistical
philosophy; it suffices to recognize that the
current situation is in need of philosophical
illumination, and I hope to galvanize you to take
up the general project