https://secure.hosting.vt.edu/www.econ.vt.edu/directory/spanos/spanos10.pdf

Why the Decision-Theoretic Perspective
Misrepresents Frequentist Inference
Aris Spanos
Department of Economics,
Virginia Tech, USA
October 2014
Abstract
The primary objective of this paper is to revisit a widely held view that
decision theory provides a unifying framework for comparing the frequentist
and Bayesian approaches. The paper calls into question this viewpoint and argues that the decision theoretic perspective misrepresents both the underlying
reasoning and the primary objective of frequentist inference to learn from data
about the true parameter θ∗. This is primarily because of its reliance on loss
functions in conjunction with the universal quantifier ‘for all values of θ’. For
the same reasons, the paper calls into question the appropriateness and judiciousness of admissibility and the James-Stein risk ‘optimality’ for frequentist
estimation. These findings largely substantiate Fisher’s (1935; 1955) claims
concerning the impertinence of loss functions in scientific inference and their
appropriateness for ‘acceptance sampling’.
Key words: decision theoretic framework; Bayesian vs. Frequentist inference; James-Stein estimator; loss functions; admissibility; error probabilities;
risk functions; acceptance sampling
1 Introduction
A widely held view in statistics is that the decision-theoretic framework proposed by
Wald (1950) provides a broad enough perspective that can accommodate both the
frequentist and Bayesian approaches to inference, despite their well-known differences.
Indeed, it is often regarded as a unifying framework for comparing these approaches
by bringing into focus their common features and neutralizing their differences using
a common terminology based on decision rules, action spaces, loss and risk functions,
admissibility, etc.; see Berger (1985), Robert (2007), O’Hagan (1994).
Historically, Wald (1939) proposed the original decision-theoretic framework as a
way to unify frequentist estimation and testing as framed by Neyman (1937):
“The problem in this formulation is very general. It contains the problems of testing
hypotheses and of statistical estimation treated in the literature.” (p. 340)
It is important to emphasize that the original Wald (1939) framing was much narrower
in the sense that: (i) a decision (action) space was defined in terms of subsets of the
parameter space Θ, and (ii) the loss (weight) function was a zero-one loss function,
and thus much closer to the frequentist approach; see Ferguson (1976). As argued
below, the more general framing, introduced by Wald (1947; 1950) and extended by
Le Cam (1955), is considerably less pertinent for frequentist inference.
Among the frequentist pioneers, Jerzy Neyman enthusiastically accepted this
broader perspective in the early 1950s, primarily because it appeared to provide
a formalization for his behavioristic interpretation of Neyman-Pearson (N-P) testing
based on the accept/reject rules; see Neyman (1952). Neyman’s attitude towards
Wald’s (1950) framing was also adopted wholeheartedly by some of his most influential students/colleagues at Berkeley, including Lehmann (1959) and LeCam (1986).
In the foreword to the collection of Neyman’s early papers published in 1967, Neyman’s
students involved in selecting his papers to be reprinted write:
“The concepts of confidence intervals and of the Neyman-Pearson theory have proved
immensely fruitful. A natural but far reaching extension of their scope can be found
in Abraham Wald’s theory of statistical decision functions.” (Neyman, 1967, p. vii)
In contrast, R. A. Fisher (1955) rejected the decision-theoretic perspective, claiming that it seriously distorts his rendering of frequentist statistics:
“The attempt to reinterpret the common tests of significance used in scientific research
as though they constituted some kind of acceptance procedure and led to ‘decisions’
in Wald’s sense, originated in several misapprehensions and has led, apparently,
to several more.” (p.69)
The primary aim of this paper is to revisit Fisher’s minority view by taking a
closer look at the decision-theoretic framing in order to reevaluate the extent to which
it provides an appropriate framework for comparing the frequentist and Bayesian
approaches. It is argued that Fisher’s minority viewpoint, with a few exceptions
(Cox, 1958; Tukey, 1960; Birnbaum, 1977), has been inadequately appreciated by the
statistics literature. The paper makes a case that the decision-theoretic perspective
misrepresents the frequentist viewpoint in at least two important respects:
(a) There are important differences between an inference pertaining to evidence
for or against a hypothesis and a decision to do something as a result of an inference:
“... to conclude that an hypothesis is best supported is, apparently, to decide that the
hypothesis in question is best supported. Hence it is a decision like any other. But
this inference is fallacious. Deciding that something is the case differs from deciding
to do something. ... Hence deciding to do something falls squarely in the province of
decision theory, but deciding that something is the case does not.” (Hacking, 1965,
p. 31)
(b) The decision-theoretic terminology glosses over the fundamental differences in
(i) the underlying reasoning, and (ii) the primary objectives of the frequentist and
Bayesian approaches. In particular, the notions of a risk function and admissibility
are incongruous with frequentist inference primarily because they rely on the universal
quantifier:
‘for all θ∈Θ’, denoted by ‘∀θ∈Θ’.
In contrast, the relevant quantifier for frequentist notions such as ‘unbiasedness’ and
the ‘Mean Square Error’ (MSE) is the existential quantifier:
‘there exists a θ∗∈Θ such that’, denoted by ‘∃θ∗∈Θ’,
since these notions are defined at the point θ=θ∗, where θ∗ denotes the true value of θ in
Θ, whatever that value happens to be, and not ‘for all θ∈Θ’.
In addition, it is argued that Fisher (1955) was correct in claiming that the decision
theoretic framing is germane to "acceptance sampling" or "decision-making under
uncertainty" more generally. This is because in the latter case a loss function does
reflect the ‘cost’ associated with different decisions and it stems from information
‘other than the data’. In such a context the traditional error probabilities associated
with the inference procedures, including the type I and II error probabilities and the p-value, play no role
in the decision making. Instead, the expected loss dominates the decision making by
providing a ranking of the different decisions.
2 The decision-theoretic set-up
This section introduces the decision-theoretic framework with a view to bringing out its
close affinity to the reasoning and objectives of the Bayesian approach, which revolve
around the quantifier ‘∀θ∈Θ’, in direct contrast to the underlying reasoning and
objectives of frequentist inference, which rely on the quantifier ‘there exists a θ∗ (the true
value of θ) in Θ’.
2.1 Basic elements
The decision-theoretic framework has three basic elements.
1. A prespecified (parametric) statistical model M_θ(x), generically specified by:
M_θ(x) = {f(x; θ), θ∈Θ}, x∈Rⁿ_X, θ∈Θ⊂Rᵐ, m < n,   (1)
where f(x; θ) denotes the (joint) distribution of the sample X:=(X₁, X₂, ..., Xₙ), Rⁿ_X
denotes the sample space and Θ the parameter space.
2. A decision space D containing all mappings d(·): Rⁿ_X → A, where A denotes
the set of all actions available to the statistician.
3. A loss function L(·, ·): [A × Θ] → R, representing the numerical loss if the
statistician takes action a∈A when the state of nature is θ∈Θ; see Berger (1985),
Bickel and Doksum (2001).
The frequentist, Bayesian and decision-theoretic approaches share the notion
of a statistical model by viewing data x₀:=(x₁, x₂, ..., xₙ) as a realization of a sample
X:=(X₁, X₂, ..., Xₙ). In a decision-theoretic framework the loss function L(θ, θ̂(X)) can
take several functional forms (table 1); see Wasserman (2004), p. 193.
The key difference between the three approaches is that:
(a) the frequentist approach relies exclusively on M_θ(x),
(b) the Bayesian approach adds a prior distribution π(θ), ∀θ∈Θ, and
(c) the decision-theoretic framing revolves around a loss (or utility) function:
L(d(x), θ), ∀θ∈Θ, ∀x∈Rⁿ_X.
Table 1: Decision-theoretic loss functions
Square loss: L₂(θ̂(X); θ) = (θ̂(X) − θ)²
Absolute loss: L₁(θ̂(X); θ) = |θ̂(X) − θ|
Lₚ loss: Lₚ(θ̂(X); θ) = |θ̂(X) − θ|ᵖ
Zero-one loss: L₀₋₁(θ, θ̂(X)) = 0 if θ̂(X) = θ, 1 if θ̂(X) ≠ θ
Kullback-Leibler loss: L_KL(θ̂(X); θ) = ∫_{x∈Rⁿ_X} ln[f(x; θ)/f(x; θ̂)] f(x; θ) dx
The claim that the decision-theoretic perspective provides neutral ground stems
largely from the fact that the loss function depends on both the sample and parameter
spaces via the two universal quantifiers:
‘∀x∈Rⁿ_X’, associated with the distribution of the sample:
frequentist: f(x; θ), ∀x∈Rⁿ_X,
‘∀θ∈Θ’, associated with the posterior distribution:
Bayesian: π(θ|x₀) ∝ π(θ) · f(x₀|θ), ∀θ∈Θ.
As argued below, the source of many confusions and misunderstandings is that the
‘∀θ∈Θ’ component of the double quantifier:
f(x; θ), ∀x∈Rⁿ_X, ∀θ∈Θ,
is at odds with the reasoning underlying frequentist inference. Instead, frequentist
inference relies on two different types of reasoning:
factual (estimation, prediction): f(x; θ∗), ∀x∈Rⁿ_X,
hypothetical (hypothesis testing): f(x; θ₀), f(x; θ₁), ∀x∈Rⁿ_X,
where θ∗ denotes the true value of θ in Θ, and θᵢ, i=0,1, denote prespecified values of θ
associated with the hypotheses H₀: θ₀∈Θ₀, H₁: θ₁∈Θ₁, with Θ₀ and Θ₁ constituting
a partition of Θ.
Frequentist reasoning, in general, has nothing to do with the quantifier ‘∀θ∈Θ’,
despite claims made by Bayesian textbooks (Robert, 2007, p. 61):
“The frequentist paradigm relies on this criterion [risk function] to compare estimators
and, if possible, to select the best estimator, the reasoning being that estimators are
evaluated on their long-run performance for all possible values of the parameter θ.”
2.2 The nature of Bayesian inference
According to O’Hagan (1994):
“Having obtained the posterior density π(θ|x₀), the final step of the Bayesian method
is to derive from it suitable inference statements. The most usual inference question is
this: After seeing the data x₀, what do we now know about the parameter θ? The only
answer to this question is to present the entire posterior distribution.” (p. 6)
O’Hagan, echoing earlier views by Lindley (1965) and Tiao and Box (1975) in
contrasting frequentist (classical) inferences with Bayesian inferences, argues that:
“Classical inference theory is very concerned with constructing good inference rules.
The primary concern of Bayesian inference, ..., is entirely different. The objective is
to extract information concerning θ from the posterior distribution, and to present it
helpfully via effective summaries. There are two criteria in this process. The first is
to identify interesting features of the posterior distribution. ... The second criterion is
good communication. Summaries should be chosen to convey clearly and succinctly all
the features of interest. ... In Bayesian terms, therefore, a good inference is one which
contributes effectively to appropriating the information about θ which is conveyed by the
posterior distribution.” (p. 14)
O’Hagan’s key argument is that criteria for ‘optimal’ inference are only parasitical
on Bayes’ theorem and enter the picture via the decision theoretic perspective:
“... a study of decision theory has two potential benefits. First, it provides a link to
classical inference. It thereby shows to what extent classical estimators, confidence intervals and hypotheses tests can be given a Bayesian interpretation or motivation. Second, it
helps identify suitable summaries to give Bayesian answers to stylized inference questions
which classical theory addresses.” (p. 14)
As a result of adopting this decision theoretic perspective, Bayesian inferences
often rely on additional information from sources other than the data x0 . For example,
in the case of point estimation, the additional information comes in the form of a loss
(or utility) function L(θ̂(X), θ).
An appropriate Bayes estimate is often selected by minimizing the posterior risk:
R(θ̂, θ) = ∫_{θ∈Θ} L(θ̂(X), θ) π(θ|x₀) dθ.
The loss function L(θ̂(X), θ) plays a crucial role in selecting an optimal estimate (Schervish, 1995):
(i) when L₂(θ̂, θ) = (θ̂ − θ)², the Bayes estimate θ̂ is the mean of π(θ|x₀),
(ii) when L₁(θ̃, θ) = |θ̃ − θ|, the Bayes estimate θ̃ is the median of π(θ|x₀),
(iii) when L₀₋₁(θ̄, θ) is the zero-one loss, the Bayes estimate θ̄ is the mode of π(θ|x₀).
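To make (i)-(iii) concrete, the following minimal sketch (an assumed Beta-Binomial setting, with an arbitrary prior and arbitrary data, used purely for illustration) computes the three Bayes estimates from the posterior: the posterior mean, median and mode, which minimize the expected square, absolute and zero-one loss respectively.

```python
# Illustrative sketch (assumed Beta(2,2) prior, 7 successes in 20 Bernoulli trials):
# the Bayes estimate under square, absolute and zero-one loss is the posterior
# mean, median and mode, respectively.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a0, b0, y, n = 2, 2, 7, 20
post = stats.beta(a0 + y, b0 + n - y)            # posterior pi(theta | x0)
draws = post.rvs(200_000, random_state=rng)      # Monte Carlo draws from the posterior

theta_square   = draws.mean()                    # minimizes expected square loss
theta_absolute = np.median(draws)                # minimizes expected absolute loss
grid = np.linspace(0.001, 0.999, 2001)
theta_zero_one = grid[np.argmax(post.pdf(grid))] # zero-one loss: posterior mode (MAP)

print(f"posterior mean   (square loss):   {theta_square:.3f}")
print(f"posterior median (absolute loss): {theta_absolute:.3f}")
print(f"posterior mode   (zero-one loss): {theta_zero_one:.3f}")
```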
To render the notion of a loss function operational one needs to deal with the
two quantifiers ‘∀θ∈Θ’ and ‘∀x∈Rⁿ_X’. To eliminate the latter quantifier the decision-theoretic approach takes expectations with respect to f(x; θ), ∀x∈Rⁿ_X, to define the
risk function:
R(θ̂, θ) = E_X[L(θ, θ̂(X))] = ∫_{x∈Rⁿ_X} L(θ, θ̂(x)) f(x; θ) dx, ∀θ∈Θ,   (2)
which is now a function of all θ∈Θ. In practice, the most widely used loss function
is the square, whose risk function is known as the Mean Square Error (MSE):
R(θ̂, θ) = MSE(θ̂(X); θ) = E(θ̂(X) − θ)², ∀θ∈Θ.   (3)
From a decision-theoretic perspective a minimal property (necessary but not sufficient) for an ‘optimal’ estimator is considered to be admissibility. An estimator
θ̃(X) is inadmissible if there exists another estimator θ̂(X) such that:
R(θ̂, θ) ≤ R(θ̃, θ), ∀θ∈Θ,   (4)
and the strict inequality (<) holds for at least one value of θ. Otherwise, θ̃(X) is said
to be admissible with respect to the loss function L(θ̂, θ).
It is interesting to note that the original notion of admissibility introduced by
Wald (1939) was not relative to a loss (weight) function.
In practice, risk functions often intersect, rendering one estimator better than
another for certain values θ∈Θ₁⊂Θ, but worse for other values θ∈Θ−Θ₁. Hence, one
needs to deal with the quantifier ‘∀θ∈Θ’ to reduce the class of possible estimators.
Two such reductions are:
Maximum risk: R_max(θ̂) = sup_{θ∈Θ} R(θ̂, θ),
Bayes risk: r_π(θ̂) = ∫_{θ∈Θ} R(θ̂, θ) π(θ) dθ,
where π(θ) denotes the prior distribution of θ.
Having reduced the risk function from ‘all θ∈Θ’ down to a scalar, the obvious
way to choose among different estimators is to find the ones that minimize this scalar
with respect to all possible estimators θ̃(·): Rⁿ_X → Θ.
Such a minimization gives rise to the two widely used decision rules:
Minimax rule: inf_{θ̃(X)} R_max(θ̂) = inf_{θ̃(X)} [sup_{θ∈Θ} R(θ̂, θ)],
Bayes rule: inf_{θ̃(X)} r_π(θ̂) = inf_{θ̃(X)} ∫_{θ∈Θ} R(θ̂, θ) π(θ) dθ,
where the infimum is taken over all possible estimators θ̃(X).
Taking admissibility as the minimal criterion for selecting an estimator, the main
result concerns a Bayes rule θ̂_π(X) based on a prior π(θ). The result is that, under
certain regularity conditions, the Bayes rule θ̂_π(X) based on a prior π(θ) is
admissible. Moreover, when r_π(θ̂_π) = sup_{θ∈Θ} R(θ̂_π, θ) < ∞, a Bayes rule θ̂_π(X) is also minimax;
see Wasserman (2004). Taken together the above results have led to:
The complete class theorem — initiated by Lehmann (1947), Wald (1947).
Under certain regularity conditions, any admissible decision rule or estimator must
be either Bayes or the limit of a sequence of Bayes decision rules; see Berger (1985).
This theorem has led people to suggest that an effective way to generate optimal
statistical procedures is to find the Bayes solution using a reasonable prior and then
examine its frequentist properties to see whether it is satisfactory from the latter
viewpoint; see Rubin (1984), Gelman et al. (2004).
As argued below, these claims depend crucially on a notion of admissibility which
is shown to be incompatible with frequentist inference. This is primarily because the
objective and the underlying reasoning of frequentist inference are at odds with loss
function minimization revolving around the quantifier ‘∀θ∈Θ’.
2.3 Stein’s paradox
The quintessential example that has bolstered the appeal of the above Bayesian claims
is the James-Stein estimator (Efron and Morris, 1973), which gave rise to a sizeable
literature on shrinkage estimators; see Saleh (2006).
Consider the case of an independent sample X:=(X₁, X₂, ..., Xₙ) from a Normal
distribution:
Xₖ ∼ NI(μₖ, σ²), k=1, 2, ..., n,
where σ² is known. Using the notation θ:=(μ₁, μ₂, ..., μₙ) and Iₙ:=diag(1, 1, ..., 1),
this can be denoted by:
X ∼ N(θ, σ²Iₙ).
The primary aim is to find a good estimator θ̂(X) of θ, where its ‘optimality’ is
assessed in terms of the square (Euclidean) loss function:
L₂(θ, θ̂(X)) = E(‖θ̂(X) − θ‖²) = Σ_{k=1}^n E(θ̂ₖ(X) − μₖ)².   (5)
Stein (1956) astounded the statistical world by showing that for n ≤ 2 the Least-Squares estimator θ̂_LS(X)=X is admissible, but for n > 2 it is inadmissible. Indeed,
James and Stein (1961) were able to come up with a nonlinear estimator:
θ̂_JS(X) = (1 − (n−2)σ²/‖X‖²) X,
referred to as the James-Stein estimator, that dominates θ̂_LS(X)=X in MSE terms
by demonstrating that:
MSE(θ̂_JS(X); θ) < MSE(θ̂_LS(X); θ), ∀θ∈Rⁿ.   (6)
It turns out that θ̂_JS(X) is itself inadmissible, being dominated by the modified (positive-part) James-Stein estimator:
θ̂⁺_JS(X) = (1 − (n−2)σ²/‖X‖²)⁺ X,
where (y)⁺ = max(0, y); see Wasserman (2004).
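A small simulation sketch (with σ²=1, an arbitrary true mean vector and an arbitrary n > 2, all assumed for illustration) reproduces the dominance claim in (6): the overall MSE of the positive-part James-Stein estimator falls below that of the Least-Squares estimator X.

```python
# Illustrative sketch: overall MSE of the Least-Squares estimator X versus the
# positive-part James-Stein estimator, for X ~ N(theta, I_n), sigma^2 = 1, n > 2.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10, 100_000
theta_star = rng.uniform(-2, 2, size=n)          # an arbitrary 'true' mean vector

X = rng.normal(theta_star, 1.0, size=(reps, n))
shrink = np.maximum(0.0, 1.0 - (n - 2) / np.sum(X ** 2, axis=1, keepdims=True))
JS = shrink * X                                  # positive-part James-Stein estimate

mse_ls = np.mean(np.sum((X  - theta_star) ** 2, axis=1))
mse_js = np.mean(np.sum((JS - theta_star) ** 2, axis=1))
print(f"overall MSE, Least-Squares: {mse_ls:.3f}")   # close to n = 10
print(f"overall MSE, James-Stein:   {mse_js:.3f}")   # strictly smaller for n > 2
```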
The traditional interpretation of this result is that when the means θ:=(μ₁, μ₂, ..., μₙ),
for n > 2, of a Normal, independent sample X are the unknown parameters of
interest, the James-Stein estimator reduces their overall MSE by using a combined
nonlinear estimator as opposed to the linear Least-Squares estimator, which is inadmissible with respect to this particular loss function. In contrast, when each parameter is estimated separately, the Least-Squares (LS) estimator is admissible with
respect to square loss functions.
This result seems to imply that one will ‘do better’ (in overall MSE terms) by
using a combined nonlinear (shrinkage) estimator, instead of estimating these means
separately. What is surprising about this result is that there is no statistical reason
(due to independence) to connect the inferences pertaining to the different individual
means, and yet the obvious estimator (LS) is inadmissible.
As argued next, contrary to the conventional wisdom, this calls into question the
appropriateness of the notion of admissibility with respect to a particular loss function,
as well as the trading of a certain degree of bias against the overall MSE, and not the
judiciousness of frequentist estimation.
3 Frequentist inference and learning from data
The objectives and underlying reasoning of frequentist inference are inadequately
discussed in the statistics literature. As a result some of its key differences with
Bayesian inference remain obscure.
All forms of parametric frequentist inference begin with a prespecified statistical model M_θ(x), which is assumed to have given rise to data x₀:=(x₁, ..., xₙ).
This model is chosen from the set of all possible models that could have given
rise to data x₀ by selecting a probabilistic structure for the stochastic process
{Xₜ, t∈N:=(1, 2, ...)} in such a way as to render the observed data x₀ a ‘typical’ realization thereof; see Spanos (2013). In light of the fact that each value of
θ∈Θ represents a different element of the set of models comprising M_θ(x), the primary
objective of frequentist inference is to learn from data about the ‘true’ model:
M∗(x) = {f(x; θ∗)}, x∈Rⁿ_X,   (7)
where θ∗ denotes the true value of θ in Θ, whatever that happens to be. The expression
‘θ∗ denotes the true value of θ’ is a shorthand for saying that ‘data x₀ constitute a
typical realization of the sample X with distribution f(x; θ∗)’. It is important to
emphasize that this ‘typicality’ is testable vis-a-vis the data x₀.
3.1 Frequentist estimation
Frequentist reasoning for estimation is factual, in the sense that the optimality of an
estimator is appraised in terms of its sampling distribution evaluated under:
θ=θ∗, whatever value θ∗ in Θ happens to be.
Point estimators contribute to this objective by effectively pin-pointing θ∗ for all
possible sample realizations: the generic capacity of θ̂ₙ(X) to zero-in on θ∗. Optimal
properties like consistency, unbiasedness, full efficiency, sufficiency, etc. evaluate this
generic capacity using its sampling distribution f(θ̂ₙ(x); θ∗) for x∈Rⁿ_X. A key feature
of frequentist inference is that the sampling distribution of any statistic Yₙ = g(X)
(estimator, test, predictor) is derived via:
Fₙ(y; θ) := P(Yₙ ≤ y; θ) = ∫···∫_{x: g(x)≤y, x∈Rⁿ_X} f(x; θ) dx.   (8)
For instance, strong consistency asserts that θ̂ₙ(X) will zero-in on θ∗ with probability one as n → ∞: P(lim_{n→∞} θ̂ₙ(X)=θ∗)=1. Similarly, unbiasedness asserts that
the sampling distribution of θ̂ₙ(X) has a mean equal to θ∗:
E(θ̂ₙ(X))=θ∗.
In this sense both of these optimal properties (consistency and unbiasedness) are defined at the point θ=θ∗. In
contrast, the decision-theoretic definition of unbiasedness:
E(θ̂ₙ(X))=θ, ∀θ∈Θ,
makes no sense in frequentist estimation because what is of interest is whether the
sampling distribution of θ̂ₙ(X) has a mean equal to the true θ∗ or not. Similarly, the
appropriate frequentist definition of the MSE for an estimator is defined at the point
θ=θ∗:
MSE(θ̂ₙ(X); θ∗) = E(θ̂ₙ(X) − θ∗)², for a particular θ∗ in Θ,   (9)
because the Var(θ̂ₙ(X)) as well as the bias:
B(θ̂ₙ(X); θ∗) = E(θ̂ₙ(X)) − θ∗, for a particular θ∗ in Θ,   (10)
make sense only when defined at θ=θ∗ to yield:
MSE(θ̂ₙ(X); θ∗) = Var(θ̂ₙ(X)) + [E(θ̂ₙ(X)) − θ∗]², for a particular θ∗ in Θ.   (11)
In contrast, the decision-theoretic definition (3) makes no sense in terms of the reasoning or the primary objective of frequentist inference. Hence, the apparent affinity
between a square loss function and the dispersion of an estimator from arbitrary
points is illusory because the only relevant dispersion from the frequentist perspective is around the true value θ∗. This is clearly exemplified by the two different
definitions of the MSE:
Decision-theoretic: ∀θ∈Θ: MSE(θ̂ₙ(X); θ) = E(θ̂ₙ(X) − θ)²,
Frequentist: ∃θ∗∈Θ: MSE(θ̂ₙ(X); θ∗) = E(θ̂ₙ(X) − θ∗)².   (12)
Unfortunately, statistics textbooks adopt one of the two definitions of unbiasedness and the MSE in (12) and ignore (or seem unaware of) the other. The discussion
that follows calls into question the pertinence of admissibility, when defined in terms
of the universal quantifier, as a minimal property for a good estimator, as well as the
relevance of James-Stein estimators for frequentist inference.
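The factual evaluation at θ=θ∗ that underlies (8)-(12) can be mimicked numerically (a sketch under an assumed true value θ∗, used purely for illustration): repeated samples drawn from f(x; θ∗) trace out the sampling distribution of the sample mean, taken here as a concrete θ̂ₙ(X), whose mean sits at θ∗ (unbiasedness) and which concentrates around θ∗ as n grows (consistency).

```python
# Illustrative sketch: the sampling distribution of the sample mean evaluated factually
# at theta = theta*, exhibiting unbiasedness (mean equal to theta*) and consistency
# (variance ~ 1/n shrinking toward zero as n grows).
import numpy as np

rng = np.random.default_rng(3)
theta_star, reps = 1.5, 10_000                   # assumed 'true' value, for illustration

for n in (10, 100, 1000):
    xbar = rng.normal(theta_star, 1.0, size=(reps, n)).mean(axis=1)
    print(f"n={n:5d}  mean of xbar ~ {xbar.mean():.4f}  variance of xbar ~ {xbar.var():.5f}")
```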
3.2 Admissibility as a minimal property
The impertinence of the notion of admissibility as a minimal property (necessary
but not sufficient) for a good estimator can be illustrated using the following example.
Example. In the context of the simple Normal model in (13), let us consider an
MSE comparison between two estimators of μ:
(i) the Maximum Likelihood Estimator (MLE): X̄ₙ = (1/n)Σ_{k=1}^n Xₖ,
(ii) the ‘crystal-ball’ estimator: μ̆ = 7405926, ∀x∈Rⁿ_X.
When compared on admissibility grounds, these two estimators are admissible and
thus equally acceptable. Common sense, however, suggests that if a particular criterion of optimality cannot distinguish between X̄ₙ [a strongly consistent, unbiased,
fully efficient and sufficient estimator] and μ̆, an arbitrarily chosen real number that
ignores the data altogether, it is not much of a minimal property. A moment’s reflection suggests that its impertinence stems from its reliance on the quantifier ‘∀μ∈Θ’.
The admissibility of μ̆ stems from the fact that for certain values of μ close to μ̆,
say μ∈(μ̆ ± ε/√n) for 0 < ε < 1, on MSE grounds μ̆ is ‘better’ than X̄ₙ:
MSE(X̄ₙ; μ) = 1/n ≥ (μ̆ − μ)² = MSE(μ̆; μ), for μ∈(μ̆ ± ε/√n).
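A short numerical sketch (with an arbitrary stand-in value for μ̆ and an assumed sample size) reproduces the point of the example: on MSE grounds the crystal-ball estimator ‘beats’ X̄ₙ only when the true μ happens to lie within roughly 1/√n of the guess, even though it ignores the data entirely.

```python
# Illustrative sketch: MSE of the sample mean versus a 'crystal-ball' estimator that
# ignores the data, for X_1,...,X_n ~ N(mu, 1). The guessed value and n are arbitrary.
import numpy as np

n = 100
mu_guess = 7.4                                   # stand-in for the arbitrary guessed value
mu_grid = mu_guess + np.linspace(-0.5, 0.5, 11)  # true values near and away from the guess

mse_xbar  = np.full_like(mu_grid, 1.0 / n)       # MSE(X_bar; mu) = 1/n for every mu
mse_guess = (mu_guess - mu_grid) ** 2            # MSE(guess; mu) = (guess - mu)^2

for mu, m1, m2 in zip(mu_grid, mse_xbar, mse_guess):
    better = "crystal ball" if m2 < m1 else "sample mean"
    print(f"mu={mu:6.2f}  MSE(xbar)={m1:.4f}  MSE(guess)={m2:.4f}  better: {better}")
# The guess wins only for mu within about 1/sqrt(n) of it, which is enough to make it
# admissible, yet it is useless for learning about the true mu from the data.
```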
This example brings out two major weaknesses of the notion of admissibility as a
minimal property for an estimator. The first is that admissibility is totally ineffective
as a minimal property because it does not filter out ‘bad’ estimators such as μ̆, the
worst possible estimator, but it does exclude potentially good estimators like the
sample median as inadmissible; see Cox and Hinkley (1974). This foregrounds the
second weakness, which is the extreme relativism of admissibility to the particular loss
function, i.e. L₂(θ̂(X); θ). As mentioned above, the sample median is inadmissible
with respect to L₂(θ̂(X); θ), but it is the optimal estimator in the case of the absolute
loss function L₁(θ̂(X); θ)=|θ̂(X) − θ|. That is, whether an estimator is good or bad
does not depend on the underlying statistical model, but on the loss function, whose
selection stems from information other than the data. This raises another source of
potential conflict.
It is well-known that within each statistical model M_θ(x) in (1) there exists
an inherent statistical distance function, often relating to the log-likelihood and the
score function, and which relates to information contained in the data; see Casella
and Berger (2002). For instance, when the distribution underlying M_θ(x) is Normal,
the inherent distance function for comparing estimators of the mean (μ) is the square:
E(μ̂ₙ(X) − μ∗)²,
evaluated at μ=μ∗, the ‘true’ μ in Θ. On the other hand, when the distribution is
Laplace (see Shao, 2003) the relevant statistical distance function is the absolute
distance:
E|μ̂ₙ(X) − μ∗|.
Similarly, when the distribution underlying M_θ(x) is Uniform, the inherent distance
function is:
sup_{x∈Rⁿ_X} |μ̂ₙ(x) − μ∗|.
The question that naturally arises is when it might make sense to ignore these
inherent distance functions and compare estimators using an externally given loss
function. The key difference between the two is that the assumptions defining the
likelihood function are testable vis-a-vis the data, but those underlying the loss function are not. Moreover, the likelihood function gives rise to a ‘global’ notion of
optimality, like full efficiency, as opposed to the ‘local’ one stemming from a loss
function; e.g. minimum variance relative to this particular loss function.
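The link between the distribution and its inherent distance can be checked numerically (an illustrative sketch with simulated data): under the Normal the log-likelihood is maximized by minimizing the sum of squared deviations, whose minimizer is the sample mean, while under the Laplace it is maximized by minimizing the sum of absolute deviations, whose minimizer is the sample median.

```python
# Illustrative sketch: the 'inherent' distance implied by the log-likelihood. For Normal
# data the MLE of mu minimizes sum((x_i - m)^2) (the sample mean); for Laplace data it
# minimizes sum(|x_i - m|) (the sample median).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.0, size=200)               # any data set serves the algebraic point
grid = np.linspace(x.min(), x.max(), 4001)

sum_sq  = np.array([np.sum((x - m) ** 2) for m in grid])   # Normal inherent distance
sum_abs = np.array([np.sum(np.abs(x - m)) for m in grid])  # Laplace inherent distance

print("argmin sum of squares :", grid[np.argmin(sum_sq)],  "  sample mean  :", x.mean())
print("argmin sum of abs devs:", grid[np.argmin(sum_abs)], "  sample median:", np.median(x))
```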
In light of the above discussion of admissibility, a strong case can be made that
the real minimal property for frequentist estimation is consistency, which stems from
information contained in the statistical model. It is interesting to note that consistency would instantly eliminate μ̆. Consistency is such a minimal property because
if an estimator θ̂(X) cannot pinpoint θ∗ when an infinite amount of data information
is available, it should be considered impertinent simply because it conflicts with the
primary objective of frequentist inference. In comparison, there is nothing in the
notion of admissibility that advances learning from data about θ∗.
3.3 James-Stein estimator: a frequentist perspective
For a proper frequentist evaluation of the above James-Stein result, it is important
to bring out the conflict between the overall MSE (5) and the reasoning underlying
frequentist estimation. The James-Stein estimator, when viewed from this frequentist
perspective, raises several issues of concern.
First, both the OLS θ̂_LS(X) and the James-Stein θ̂_JS(X) estimators are inconsistent estimators of θ, since the underlying model suffers from the incidental parameter
problem: there is essentially one observation (Xₖ) for each unknown parameter (μₖ),
and as n → ∞ the number of unknown parameters increases at the same rate. To
bring out the futility of comparing these two estimators more markedly, consider the
following simpler example.
Example. Let X:=(X₁, X₂, ..., Xₙ) be a sample from the simple Normal model:
Xₖ ∼ NIID(μ, 1), k=1, 2, ..., n, for n > 2.   (13)
Comparing the two estimators μ̂₁ = Xₙ and μ̂₂ = ½(X₁ + Xₙ), and inferring that μ̂₂ is relatively more efficient than μ̂₁ relative to a square loss function, i.e.:
MSE(μ̂₂(X); μ) = ½ < MSE(μ̂₁(X); μ) = 1, ∀μ∈R,
is totally uninteresting because both estimators are inconsistent! This foregrounds a
related second issue which has to do with the optimality of the James-Stein result on
admissibility grounds with respect to a particular loss function. This was called into
question above because in frequentist estimation the minimal property for estimators
is not admissibility but consistency, on the basis of which both of these estimators
will be filtered out.
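A quick simulation sketch (with an assumed true mean, for illustration) makes the consistency point concrete: the MSE of μ̂₂ = ½(X₁ + Xₙ) stays at ½ no matter how large n becomes, whereas the MSE of the sample mean shrinks to zero.

```python
# Illustrative sketch: fixed-n efficiency versus consistency. mu_hat2 = (X_1 + X_n)/2
# keeps MSE = 1/2 at every sample size, while the sample mean has MSE = 1/n -> 0.
import numpy as np

rng = np.random.default_rng(5)
mu_star, reps = 0.7, 10_000                      # assumed 'true' mean, for illustration

for n in (10, 100, 1000):
    x = rng.normal(mu_star, 1.0, size=(reps, n))
    mu_hat2 = 0.5 * (x[:, 0] + x[:, -1])         # uses only the first and last observations
    xbar = x.mean(axis=1)
    print(f"n={n:5d}  MSE(mu_hat2) ~ {np.mean((mu_hat2 - mu_star) ** 2):.3f}"
          f"  MSE(xbar) ~ {np.mean((xbar - mu_star) ** 2):.4f}")
```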
A consistent James-Stein estimator. In light of that, a way to render the
above Stein paradox potentially interesting from the frequentist perspective is to use
panel (longitudinal) data, where the sample takes the form:
Xₜ := (X₁ₜ, X₂ₜ, ..., Xₙₜ), t=1, 2, ..., T.
In this case the Least-Squares and James-Stein estimators take the form:
θ̂_LS(X) = (X̄₁, X̄₂, ..., X̄ₙ), where X̄ₖ = (1/T)Σ_{t=1}^T Xₖₜ, k=1, 2, ..., n,
θ̂⁺_JS(X) = (1 − (n−2)σ²/‖X̄‖²)⁺ X̄, where X̄ := (X̄₁, X̄₂, ..., X̄ₙ).
Third, the notion of ‘better’ in the James-Stein result needs to be evaluated more
critically. It is clear that the loss function in (5) introduces a trade-off between the
accuracy of the estimators of the individual parameters (μ₁, μ₂, ..., μₙ) and the overall
expected loss, in the sense that the reduction in the latter comes at the expense of the former.
Hence, the James-Stein result raises a key question: ‘in what sense does the overall MSE
among a group of estimated means based on statistically independent processes provide a better measure of ‘error’ in learning about the true means?’ The short answer
is that it doesn’t. Indeed, the overall MSE will be irrelevant as a statistical error when
the primary objective of estimation is to learn from data about θ∗, the true value of
θ. This is because it penalizes the estimator’s capacity to pin-point θ∗ by trading an
increase in bias for a decrease in the overall MSE.
In summary, the above example raises serious practical questions about how the
loss function machinery can be implemented in practice to render the expected loss
associated with β̂(z₀) for different values of β meaningful. In particular:
(i) where does the extraneous information concerning costs associated with parameter values come from?
(ii) why is an inconsistent estimator that is efficient relative to a particular loss
function an optimal estimator?
(iii) why is the overall MSE more important than learning from data about the
true value θ∗?
3.4 Confidence Interval estimation
To bring out the frequentist reasoning underlying Confidence Interval (CI) estimation,
let us return to the simple Normal model in (13) and have a closer look at the sampling
distribution of a good estimator, X̄ₙ [consistent, unbiased, fully efficient, sufficient],
often stated as:
X̄ₙ ∼ N(μ, 1/n).
What is not usually explicitly revealed is that the evaluation of that distribution is
factual, i.e. under μ=μ∗, and denoted by:
X̄ₙ ∼ N(μ∗, 1/n), under μ=μ∗.
What is remarkable about this result is that when X̄ₙ is standardized to define the
pivotal function:
d(X; μ∗) := √n(X̄ₙ − μ∗) ∼ N(0, 1), under μ=μ∗,   (14)
one is certain that (14) holds only for the true μ∗ and no other value. For any other
value of μ, say μ₁ ≠ μ∗, the same evaluation will yield:
d(X; μ∗) ∼ N(δ₁, 1), under μ=μ₁, where δ₁ = √n(μ₁ − μ∗).
The factual reasoning result in (14) provides the basis for constructing the (1−α)
Confidence Interval (CI):
P(X̄ₙ − c_{α/2}(1/√n) ≤ μ ≤ X̄ₙ + c_{α/2}(1/√n); μ=μ∗) = 1−α,   (15)
which asserts that the random interval [X̄ₙ − c_{α/2}(1/√n), X̄ₙ + c_{α/2}(1/√n)] will cover (overlay) the true mean μ∗, whatever that happens to be, with probability (1−α), or equivalently, the error of coverage is α. Hence, in frequentist estimation the coverage error
probability depends only on the sampling distribution of X̄ₙ and is attached to the random interval itself, whatever the value μ∗ happens to be, without knowing μ∗.
The factual reasoning underlying estimation renders the post-data coverage error
probability degenerate, since the factual scenario has played out and the observed CI
[x̄ₙ − c_{α/2}(1/√n), x̄ₙ + c_{α/2}(1/√n)] either includes or excludes μ∗, but there is no way to know.
The same factual reasoning undermines any attempt to use μ̂(x₀)=μ∗ as a legitimate
inference result.
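The pre-data coverage claim in (15) can be checked by simulation (an illustrative sketch with an assumed μ∗, n and α): over repeated samples roughly (1−α) of the random intervals overlay μ∗, while any single observed interval either contains μ∗ or it does not.

```python
# Illustrative sketch: pre-data coverage of the (1-alpha) CI for the mean of a N(mu, 1)
# sample. Over repeated samples about 95% of the random intervals cover the true mu*.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu_star, n, alpha, reps = 0.3, 25, 0.05, 100_000   # mu* assumed for the simulation
c = stats.norm.ppf(1 - alpha / 2)                  # c_{alpha/2}

xbar = rng.normal(mu_star, 1.0, size=(reps, n)).mean(axis=1)
lower, upper = xbar - c / np.sqrt(n), xbar + c / np.sqrt(n)
coverage = np.mean((lower <= mu_star) & (mu_star <= upper))

print(f"empirical coverage: {coverage:.3f}   (nominal 1 - alpha = {1 - alpha})")
print(f"one observed interval: [{lower[0]:.3f}, {upper[0]:.3f}] -- it either covers mu* or not")
```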
3.5 Frequentist hypothesis testing
Another frequentist inference procedure one can employ to learn from data about μ∗
is hypothesis testing, where the question posed is whether μ∗ is ‘close enough’ to a
hypothesized value μ₀.
3.5.1 Legitimate frequentist error probabilities
In contrast to estimation, the reasoning underlying frequentist testing is hypothetical
in nature. For testing the hypotheses:
H₀: μ = μ₀ vs. H₁: μ > μ₀, where μ₀ is a prespecified value,
one returns to the sampling distribution underlying (14), but transforms the pivotal
quantity into the test statistic by replacing μ∗ with the prespecified value μ₀,
yielding d(X) := √n(X̄ₙ − μ₀). However, instead of evaluating it under the factual
μ = μ∗, it is now evaluated under various hypothetical scenarios associated with H₀
and H₁ to yield two types of (hypothetical) sampling distributions:
(I) d(X) := √n(X̄ₙ − μ₀) ∼ N(0, 1), under μ=μ₀,
(II) d(X) := √n(X̄ₙ − μ₀) ∼ N(δ₁, 1), under μ=μ₁, where δ₁ = √n(μ₁ − μ₀), for μ₁ > μ₀.
In both cases (I)-(II) the underlying reasoning is hypothetical in the sense that the
factual μ∗ in (14) is replaced by hypothesized values of μ, and the test statistic d(X)
provides a standardized distance between the hypothesized values (μ₀ or μ₁) and μ∗,
the true μ assumed to underlie the generation of the data x₀; note that X̄ₙ is an
excellent estimator of μ∗. Using the sampling distribution in (I) one can define the
following error probabilities:
significance level: P(d(X) > c_α; H₀) = α,
p-value: P(d(X) > d(x₀); H₀) = p(x₀).   (16)
Using the sampling distribution in (II) one can define:
type II error probability: P(d(X) ≤ c_α; μ=μ₁) = β(μ₁), for μ₁ > μ₀,
power: P(d(X) > c_α; μ=μ₁) = π(μ₁), for μ₁ > μ₀.   (17)
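The error probabilities in (16)-(17) follow directly from the two hypothetical sampling distributions (I)-(II); the sketch below (with illustrative values for μ₀, μ₁, n and α) computes the type I error probability, the type II error probability and the power of the one-sided Normal test.

```python
# Illustrative sketch: type I error, type II error and power of the one-sided test of
# H0: mu = mu0 against H1: mu > mu0, using d(X) = sqrt(n)*(xbar - mu0), which is N(0,1)
# under H0 and N(delta1, 1) under mu = mu1 with delta1 = sqrt(n)*(mu1 - mu0).
import numpy as np
from scipy import stats

mu0, mu1, n, alpha = 0.0, 0.2, 100, 0.05         # illustrative values
c_alpha = stats.norm.ppf(1 - alpha)              # rejection threshold for d(X)
delta1 = np.sqrt(n) * (mu1 - mu0)

type_I  = 1 - stats.norm.cdf(c_alpha)            # P(d(X) > c_alpha; H0) = alpha
type_II = stats.norm.cdf(c_alpha - delta1)       # P(d(X) <= c_alpha; mu = mu1)
power   = 1 - type_II                            # P(d(X) > c_alpha; mu = mu1)

print(f"type I error: {type_I:.3f}   type II error: {type_II:.3f}   power: {power:.3f}")
```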
It can be shown that the test T_α defined by the test statistic d(X) and the rejection
region C₁(α) = {x: d(x) > c_α} constitutes a Uniformly Most Powerful (UMP) test for
significance level α; see Lehmann (1959). The type I [II] error probability is associated
with test T_α erroneously rejecting [accepting] H₀. The type I and II error probabilities
evaluate the generic capacity [whatever the sample realization x∈Rⁿ_X] of a test to
reach correct inferences. Contrary to Bayesian claims, these error probabilities have
nothing to do with the temporal or the physical dimension of the long-run metaphor
associated with repeated samples; conflating the relevant error probabilities with the
relative empirical frequencies associated with the long-run metaphor constitutes a
category mistake. The relevant feature of the long-run metaphor is the repeatability
(in principle) of the DGM represented by M_θ(x), a feature of the relevant sampling
distributions that can be trivially operationalized using computer simulation; see
Spanos (2013).
The key difference between the significance level α and the p-value is that the former is a pre-data and the latter a post-data error probability. Indeed, the p-value can
be viewed as the smallest significance level α at which H₀ would have been rejected
with data x₀. The legitimacy of post-data error probabilities underlying the hypothetical reasoning can be used to go beyond the N-P accept/reject rules and provide
an evidential interpretation pertaining to the discrepancy γ from the null warranted
by data x₀. Its key difference from the Bayesian and likelihoodist approaches is that
it takes into account the generic capacity of the test in establishing γ. The underlying intuition is that detecting a discrepancy γ using a very sensitive (insensitive) test
provides less (more) strong evidence that γ is present. This warranted discrepancy γ
can then be used in conjunction with substantive subject matter information to address the issue of statistical vs. substantive significance; see Mayo and Spanos (2006;
2011).
Despite the fact that frequentist testing uses hypothetical reasoning, its main
objective is also to learn from data about the true model M∗(x)={f(x; θ∗)}, x∈Rⁿ_X.
This is because a test statistic like d(X) := √n(X̄ₙ − μ₀) constitutes nothing more
than a scaled distance between μ∗ [the value behind the generation of X̄ₙ] and a
hypothesized value μ₀, with μ∗ being replaced by its ‘best’ estimator X̄ₙ.
4 Risk functions and acceptance sampling
This section discusses the nature of loss and risk functions and their inherent conflict
with the aims and underlying reasoning of the frequentist approach. It is argued
that loss (utility) functions are appropriate for ‘acceptance sampling’ and decision
making under uncertainty, more generally, where non-data information about losses
and regrets constitutes an integral part of the problem. Moreover, in contrast to the
type I-II and coverage error probabilities, ‘expected loss’ is not a legitimate frequentist error, primarily because the former error probabilities are attached to the inference procedure
itself and not to θ, as the expected loss is. Despite the apparent affinity between the
decision-theoretic set up and the Neyman-Pearson (N-P) ‘accept/reject’ rules, a closer
look reveals that it is actually at odds with the primary objective and the inductive
reasoning underlying frequentist inference in general and N-P testing in particular.
4.1 Where do loss functions come from?
A closer scrutiny of the decision-theoretic set up reveals that the loss function needs
to invoke ‘information from sources other than the data’, which is usually not readily
available. Indeed, such information is available in very restrictive situations, such
as acceptance sampling in quality control. In light of that, a proper understanding
of the intended scope of statistical inference calls for distinguishing the special cases
where the loss function is part and parcel of the available substantive information
from those where no such information is either relevant or available.
Tiao and Box (1975), p. 624, reiterated Fisher’s (1935) distinction:
“Now it is undoubtedly true on the one hand that situations exist where the loss
function is at least approximately known (for example certain problems in business) and
sampling inspection are of this sort. ... On the other hand, a vast number of inferential
problems occur, particularly in the analysis of scientific data, where there is no way of
knowing in advance to what use the results of research will subsequently be put.”
Cox (1978), p. 45, went further and questioned this framing even in cases where
the inference might involve a decision:
“The reasons that the detailed techniques [decision-theoretic] seem of fairly limited applicability, even when a fairly clearcut decision element is involved, may be
(i) that, except in such fields as control theory and acceptance sampling, a major
contribution of statistical technique is in presenting the evidence in incisive form for
discussion, rather than in providing mechanical presentation for the final decision. This is
especially the case when a single major decision is involved.
(ii) The central difficulty may be in formulating the elements required for the quantitative analysis, rather than in combining these elements via a decision rule.”
Lehmann (1984) warns us about the misleading implications of arbitrary loss
functions:
“It is argued that the choice of a loss function, while less crucial than that of the
model, exerts an important influence on the nature of the solution of a statistical decision
problem, and that an arbitrary choice such as squared error may be badly misleading as
to the relative desirability of the competing procedures.” (p. 425)
4.2 Acceptance sampling vs. learning from data
Let us bring out the key features of a situation where the above decision-theoretic set
up makes perfectly good sense. This is the situation Fisher (1955) called acceptance
sampling, such as an industrial production process where the objective is quality
control, i.e. to make a decision pertaining to shipping sub-standard products (e.g.
nuts and bolts) to a buyer using the expected loss/gain as the ultimate criterion.
This issue was highlighted by Birnbaum (1977):
“Two contrasting interpretations of the decision concept are formulated: behavioral,
applicable to ‘decisions’ in a concrete literal sense as in acceptance sampling; and
evidential, applicable to ‘decisions’ such as ‘reject H₀’ in a research context, where the
pattern and strength of statistical evidence concerning statistical hypotheses is of central
interest.” (p. 19)
In an acceptance sampling context, the MSE(θ̂(X); θ), or some other risk function,
is relevant because it evaluates genuine losses associated with a decision related to
the choice of an estimate θ̂(x₀), say the cost of the observed percentage of defective
products, but that has nothing to do with type I and II error probabilities.
Acceptance sampling differs from a scientific enquiry in two crucial respects:
[a] The primary aim is to use statistical rules to guide actions astutely, e.g. use
θ̂(x₀) in order to minimize the expected loss associated with “a decision”, and
[b] The sagacity of all actions is determined by the respective ‘losses’ stemming
from “relevant information other than the data” (Cox and Hinkley, 1974, p. 251).
The key difference between acceptance sampling and a scientific inquiry is that
the primary objective of the latter is not to minimize the expected loss (costs, utility)
associated with different values of θ∈Θ, but to use data x₀ to learn about the ‘true’
model (7). The two situations are drastically different mainly because the key notion
of a ‘true θ’ calls into question the above acceptance sampling set-up. Indeed, the
loss function, being defined ‘∀θ∈Θ’, will penalize θ∗, since there is no reason to believe
that the θ ranked lowest when minimizing the expected loss would coincide with θ∗,
unless by accident.
Consider the case where acceptance sampling resembles hypothesis testing in so far
as final products are randomly selected for inspection during the production process.
In such a situation the main objective can be viewed as operationalizing the probabilities of false acceptance/rejection with a view to minimizing the expected losses.
The conventional wisdom has been that this situation is similar enough to Neyman-Pearson (N-P) testing to render the latter the appropriate framing for the decision
to ship this particular batch or not. However, a closer look at some of the examples
used to illustrate such a situation (Silvey, 1975) reveals that the decisions are driven
exclusively by the risk function and not by any quest to learn from data about the
true θ∗. For instance, the N-P way of addressing the trade-off between the two types of
error probabilities, fixing α at a small value and seeking a test that minimizes the type
II error probability, seems utterly irrelevant in such a context. One can easily think
of a loss function where the ‘optimal’ trade-off calls for a much larger type I than
type II error probability. Indeed, as argued by Tukey (1960):
“Wald’s decision theory ... has given up fixed probability of errors of the first kind,
and has focused on gains, losses or regrets.” (p. 433)
Hence, in acceptance sampling:
[c] The trade-off between the two types of error probabilities is determined by the
risk function itself, and not by any attempt to learn from data about θ∗. Indeed, such
learning is deliberately undermined by certain loss functions, such as the overall MSE
(5), that favor biased estimators of the James-Stein type.
In light of the crucial differences [a]-[c], one can make a strong case that the objectives and the underlying reasoning of acceptance sampling are drastically different
from those pertaining to a scientific context.
4.3 Is expected loss a legitimate frequentist error?
The question that naturally arises at this stage is whether expected loss is a legitimate
error probability like the type I-II, p-value and coverage probabilities. ‘What do the
latter frequentist error probabilities have in common?’
First, these error probabilities stem directly from the statistical model M_θ(x),
since the underlying sampling distributions of estimators, test statistics and predictors
are derived exclusively from the distribution of the sample f(x; θ) via (8). In this
sense, the relevant error probabilities are directly related to statistical information
pertaining to the data as summarized by the statistical model M_θ(x) itself. Hence,
they have nothing to do with extraneous information ‘other than the data’.
Second, all these error probabilities are attached to a particular frequentist inference procedure, as they relate to a relevant inferential claim. These error probabilities
calibrate the effectiveness of inference procedures in learning from data about the true
statistical model M∗(x)={f(x; θ∗)}, x∈Rⁿ_X.
In light of these features, the question is: ‘in what sense could a risk function
potentially represent relevant frequentist errors?’ According to some Bayesians, the
risk function does represent a legitimate frequentist error because it is derived by
taking expectations with respect to f(x; θ), x∈Rⁿ_X; see Robert (2007). This argument
is misleading for several reasons.
(a) This claim stems from the confusion between the universal and existential
quantifiers. The relevant errors in estimation, including E(θ̂ₙ(X))=θ∗ and E(θ̂ₙ(X)−θ∗)²,
are with respect to f(x; θ∗), stemming from factual reasoning which is based on the
existential quantifier.
(b) The expected losses stemming from the risk function R(θ̂, θ) are attached to
particular values of θ in Θ. Such an assignment is in direct conflict with all the above
legitimate error probabilities, which are attached to the inference procedure itself, and
never to particular values of θ in Θ. The expected loss assigned to each value of
θ in Θ has nothing to do with learning from data about θ∗. Indeed, the risk function
will often penalize a procedure for pin-pointing θ∗, in sync with ‘acceptance sampling’,
where the objective of the inference has nothing to do with θ∗.
(c) Another crucial issue with loss functions such as L₂(θ, θ̂(X)) is that in a
decision-theoretic context they are treated as unitless numerical measures of how costly
the various consequences of potential decisions associated with θ̂(x₀) are. The trouble
is that statistical parameters are rarely unitless quantities. To bring out the difficulties,
let us take an example from economics, a field where loss functions supposedly arise
naturally. Consider the simple linear regression model:
yₜ = β₀ + β₁x₁ₜ + β₂x₂ₜ + β₃x₃ₜ + uₜ, uₜ ∼ NIID(0, σ²), t=1, 2, ..., T,
where the unknown parameters of interest are θ:=(β, σ²), β:=(β₀, β₁, β₂, β₃). A
moment’s reflection suggests that serious practical difficulties are raised by the mathematical structure of a loss function such as that of Stein:
L₂(β, β̂(Z)) = E(‖β̂(Z) − β‖²) = Σ_{i=0}^3 E(β̂ᵢ(Z) − βᵢ)²,   (18)
where Z:=(y, X), and the James-Stein estimator takes the form:
β̂_JS(Z) = (1 − c σ̂²/(β̂⊤(X⊤X)β̂)) β̂, for some constant c > 0, where β̂=(X⊤X)⁻¹X⊤y.
It is well-known that the regression coefficients are not unitless, since βᵢ depends
crucially on the units of measurement of both yₜ and xᵢₜ, i=1, 2, 3. Indeed, substantive
subject matter information often comes in the form of the sign and magnitude of
these coefficients, and in practice their magnitude varies greatly, say β̂₁(z₀)=1.8
and β̂₃(z₀)= −.004. This, however, implies that the loss function renders the smaller
coefficient estimates more or less irrelevant because their relative contribution to (18)
will be minuscule. Moreover, one can change the cost associated with any coefficient
by changing the units of measurement of any of the variables involved, which is often
trivial to do. Such changes in the units of measurement will change drastically the
ranking of different potential decisions.
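The units-of-measurement point can be made concrete with a small sketch (hypothetical coefficient values, chosen only for illustration): rescaling a single regressor rescales its coefficient and, with it, that coefficient's contribution to the overall squared loss in (18), which can reverse the ranking of two candidate ‘decisions’.

```python
# Illustrative sketch: the overall squared loss in (18) is not invariant to the units of
# measurement. Measuring x3 in thousands of its original units multiplies beta3 by 1000
# and can reverse which candidate estimate has the smaller overall loss. Values hypothetical.
import numpy as np

beta_true = np.array([1.0, 1.8, 0.5, -0.004])    # hypothetical 'true' coefficients
cand_a    = np.array([1.0, 1.7, 0.5, -0.004])    # misses beta1 by 0.1, gets beta3 right
cand_b    = np.array([1.0, 1.8, 0.5, -0.002])    # gets beta1 right, misses beta3 by 0.002

def overall_loss(est, true):
    """Sum of squared coefficient errors, as in the Stein-type loss (18)."""
    return float(np.sum((est - true) ** 2))

rescale = np.array([1.0, 1.0, 1.0, 1000.0])      # change the units of x3 only
print("original units: loss(a) =", overall_loss(cand_a, beta_true),
      " loss(b) =", overall_loss(cand_b, beta_true))
print("rescaled x3   : loss(a) =", overall_loss(cand_a * rescale, beta_true * rescale),
      " loss(b) =", overall_loss(cand_b * rescale, beta_true * rescale))
# In the original units candidate b has the far smaller loss; after rescaling x3 the ranking flips.
```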
Having said that, expected losses can be the relevant measure for ‘acceptance sampling’ because the objective of the inference is driven by the risk function. However,
the expected loss assigned to each value of θ in Θ has nothing to do with learning
from data about θ∗.
4.4 Loss functions as panacea?
The extreme relativism of the optimality in Bayesian estimation with respect to the
particular loss function renders the latter highly vulnerable to abuse. In practice,
one can justify any estimator, however lame by other criteria based on information
relating to the data, as optimal by selecting the “appropriate” loss function. Worse,
like the goddess of universal remedy, loss functions are often invoked by Bayesians
as a panacea for any serious foundational problem. The strategy is: invoke a loss
function to sidestep the real issue by transforming it into a question of selecting the
‘appropriate’ loss function.
A recent example of that concerns the appropriateness of Jeffreys’s spiked prior
as it relates to Bayesian testing; see Berger (1985). In an obvious attempt to sidestep
foundational issues raised by the traditional Bayes factor, Bernardo (2011) proposes
a new decision-theoretic set-up for Bayesian testing that relies on particular loss
functions he calls intrinsic. The end result invokes a number of presuppositions
about framing the null and alternative hypotheses and certain arbitrary thresholds,
like a context-dependent utility constant set equal to 5, in an obvious attempt to mimic the
likelihood ratio test. What is missing is the relevant sampling distribution associated
with the likelihood ratio statistic that would determine both the proper frequentist
threshold as well as the power of the resulting test to detect different discrepancies
from the null. Instead, the end result depends on averaging the intrinsic discrepancy
function over all values of θ in Θ, with a posterior providing the weights.
Similarly, in an obvious attempt to sidestep the inability of the Bayes factor to
distinguish between statistically and substantively significant discrepancies, Robert
(2014) invoked the need for a vague loss function: “the “substantive sense” can only
be gathered from a loss function.” (p. 224). The only possible response to such an
off-the-wall claim is that ‘if all you have is a hammer, everything looks like a nail’.
Even before Wald introduced the decision-theoretic perspective, Fisher (1935)
perceptively argued:
“In the field of pure research no assessment of the cost of wrong conclusions, or of
delay in arriving at more correct conclusions can conceivably be more than a pretence,
and in any case such an assessment would be inadmissible and irrelevant in judging the
state of the scientific evidence.” (pp. 25-26)
Tukey (1960) echoed Fisher’s view by contrasting decisions vs. inferences:
“Like any other human endeavor, science involves many decisions, but it progresses
by the building up of a fairly well established body of knowledge. This body grows by
the reaching of conclusions — by acts whose essential characteristics differ widely from
the making of decisions. Conclusions are established with careful regard to evidence, but
without regard to consequences of specific actions in specific circumstances.” (p. 425)
5 Summary and conclusions
The above discussion called into question the claim that the decision-theoretic setup
provides a unifying framework for comparing the frequentist and Bayesian approaches
to inference. It is argued that a closer look reveals that this perspective enhances
the Bayesian but distorts frequentist inference. This validates Fisher’s (1935; 1955)
claims concerning the impertinence of loss functions in scientific inference and their
appropriateness for ‘acceptance sampling’. The impertinence of this perspective for
frequentist inference stems from the fact that its underlying reasoning and primary
objective are at odds with the quantifier ‘∀θ∈Θ’ that permeates both decision-theoretic and Bayesian inference.
The primary objective of frequentist inference is to learn from data about the true
value θ∗. What matters for an optimal frequentist procedure is not its behavior for
all possible values θ∈Θ, but how well it does in enabling the modeler to learn about
the true value θ∗ in Θ. The underlying reasoning of frequentist inference comes
in two modes: factual, based on f(x; θ∗), ∀x∈Rⁿ_X, for estimation and prediction,
and hypothetical, based on f(x; θ₀), f(x; θ₁), ∀x∈Rⁿ_X, where (θ₀, θ₁) are prespecified
values, for hypothesis testing. Legitimate frequentist error probabilities calibrate an
inference procedure’s capacity to learn from data about θ∗. They do that by assigning
the relevant error probabilities to the inference procedures themselves. In contrast,
a loss function ignores the capacity of the procedures and assigns the losses to all
the different values of θ in Θ, rendering it appropriate for evaluating expected losses in
situations like acceptance sampling.
The inappropriateness of the universal quantifier ‘∀θ∈Θ’ in learning from data
about θ∗ calls into question the relevance of admissibility as a minimal property.
Instead, the relevant minimal property for frequentist estimation is consistency. Similarly, the superiority of James-Stein type estimators is questionable because it runs
afoul of the very objective of frequentist inference.
In conclusion, the inappropriateness of loss functions when the primary objective
of frequentist inference is learning from data x₀ about θ∗ questions:
(i) the appropriateness of the decision-theoretic set-up as a neutral framework for
comparing the frequentist and Bayesian approaches,
(ii) the relevance of the complete class theorem for frequentist inference,
(iii) the relevance and judiciousness of admissibility for frequentist estimation,
(iv) the relevance of the James-Stein risk ‘optimality’ for frequentist estimation.
References
[1] Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, 2nd
edition, Springer, NY.
[2] Berger, J. O. and R.W. Wolpert (1988), The Likelihood Principle, Institute of
Mathematical Statistics, Lecture Notes - Monograph series, 2nd edition, vol. 6,
California, Hayward.
[3] Bernardo, J. M. (2011), “Integrated Objective Bayesian Estimation and Hypothesis Testing”, in J. Bernardo et al. (eds.): Bayesian Statistics 9: Proceedings of
the Ninth Valencia Meeting, 1-68, Oxford University Press, Oxford.
[4] Bickel, P. J. and K. A. Doksum (2001), Mathematical Statistics, vol. 1, 2nd ed.,
Prentice Hall, NJ.
[5] Birnbaum, A. (1977), “The Neyman-Pearson Theory as Decision Theory, and as
Inference Theory; with a Criticism of the Lindley-Savage argument for Bayesian
Theory,” Synthese, 36: 19-49.
[6] Casella, G. and R. L. Berger (2002), Statistical Inference, 2nd ed., Duxbury, CA.
[7] Cox, D. R. (1958), “Some Problems Connected with Statistical Inference,” Annals of Mathematical Statistics, 29: 357-372.
[8] Cox, D. R. (1978), “Foundations of Statistical Inference: the Case for Eclecticism,” Australian Journal of Statistics, 20: 43-59.
[9] Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, Chapman & Hall,
London.
[10] Efron, B. and C. N. Morris (1973), “Stein’s estimation rule and its competitors—
an empirical Bayes approach,” Journal of the American Statistical Association,
68: 117—130.
[11] Ferguson, T. S. (1976), “Development of the Decision Model,” ch. 16 in On the
History of Statistics and Probability, edited by D. B. Owen, Marcel Dekker, NY.
[12] Fisher, R. A. (1935), The Design of Experiments, Oliver and Boyd, Edinburgh.
[13] Fisher, R. A. (1955), “Statistical methods and scientific induction,” Journal of
the Royal Statistical Society, B, 17: 69-78.
[14] Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin (2004), Bayesian Data Analysis, 2nd
edition, Chapman & Hall, London.
[15] Ghosh, J. K., M. Delampady and T. Samanta (2006), An Introduction to
Bayesian Analysis: Theory and Methods, Springer, NY.
[16] Hacking, I. (1965), Logic of Statistical Inference, Cambridge University Press,
Cambridge.
[17] James, W. and C. Stein (1961), “Estimation with quadratic loss”, Proceedings
of the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
1: 361—379.
[18] LeCam, L. (1955), “An extension of Wald’s theory of statistical decision functions,” Annals of Mathematical Statistics, 26: 69-81.
[19] LeCam, L. (1986), Asymptotic Methods in Statistical Decision Theory, Springer,
NY.
[20] Lehmann, E.L. (1947), “On families of admissible sets,” Annals of Mathematical
Statistics, 18: 97-104.
[21] Lehmann, E. L. (1959), Testing Statistical Hypotheses, Wiley, NY.
[22] Lehmann, E. L. (1984), “Specification Problems in the Neyman-Pearson-Wald
Theory,” pp. 425-436 in Statistics: An Appraisal, edited by H.A. David and H.T.
David, The Iowa State University Press, Ames, IA.
[23] Lindley, D. V. (1965), Introduction to Probability and Statistics from a Bayesian
Viewpoint, Part 2: Inference, Cambridge University Press, Cambridge.
[24] Mayo, D. G. and A. Spanos. (2006), “Severe Testing as a Basic Concept in a
Neyman-Pearson Philosophy of Induction,” The British Journal for the Philosophy of Science, 57: 323-357.
[25] Mayo, D. G. and A. Spanos (2011), “Error Statistics,” pp. 151-196 in the Handbook
of Philosophy of Science, vol. 7: Philosophy of Statistics, D. Gabbay, P. Thagard,
and J. Woods (editors), Elsevier.
[26] Neyman, J. (1937), “Outline of a theory of statistical estimation based on the
classical theory of probability,” Philosophical Transactions of the Royal Society of London, Series A, 236: 333-380.
[27] Neyman, J. (1952), Lectures and Conferences on Mathematical Statistics and
Probability, 2nd ed. U.S. Department of Agriculture, Washington.
[28] Neyman, J. (1967), A Selection of Early Statistical Papers by J. Neyman, University of California Press, CA.
[29] O’Hagan, A. (1994), Bayesian Inference, Edward Arnold, London.
[30] Robert, C.P. (2007), The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, 2nd ed., Springer, NY.
[31] Robert, C.P. (2014), “On the Jeffreys-Lindley Paradox,” Philosophy of Science,
81: 216-232.
[32] Rubin, D. B. (1984), “Bayesianly justifiable and relevant frequency calculation
for the applied statistician,” Annals of Statistics, 12: 1151-1172.
[33] Saleh, A. K. Md. E. (2006), Theory of Preliminary Test and Stein-Type Estimation with Applications, Wiley-Interscience, NY.
[34] Schervish, M. J. (1995), Theory of Statistics, Springer-Verlag, NY.
[35] Silvey, S. D. (1975), Statistical Inference, Chapman & Hall, London.
[36] Shao, J. (2003), Mathematical Statistics, 2nd ed., Springer, NY.
[37] Spanos, A. (2013), “A Frequentist Interpretation of Probability for Model-based
Inductive Inference,” Synthese, 190: 1555-1585.
[38] Stein, C. (1956), “Inadmissibility of the usual estimator for the mean of a multivariate distribution”, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1: 197—206.
[39] Tiao, G. C. and G. E. P. Box (1975), “Some comments on "Bayes" estimators,” pp. 619-626 in Studies in Bayesian Econometrics and Statistics, In Honor
of Leonard J. Savage, edited by S. E Fienberg and A. Zellner, North-Holland,
Amsterdam.
[40] Tukey, J.W. (1960), “Conclusions vs Decisions,” Technometrics, 2: 423-433.
[41] Wald, A. (1939), “Contributions to the Theory of Statistical Estimation and
Testing Hypotheses”, Annals of Mathematical Statistics, 10: 299-326.
[42] Wald, A. (1947), “An essentially complete class of admissible decision
functions,” Annals of Mathematical Statistics, 18: 549-555.
[43] Wald, A. (1950), Statistical Decision Functions, Wiley, NY.
[44] Wasserman, L. (2004), All of Statistics, Springer, NY.