
Revisiting the ‘Mixed Experiment’ Example:
Fallacious Frequentist Reasoning
or an Improper Statistical Model? ∗
Aris Spanos†
Department of Economics,
Virginia Tech, Blacksburg, VA 24061
[email protected]
January 2013
Abstract
The primary aim of this paper is to revisit the ‘mixed experiment’ example
in order to elucidate some of its veiled features with a view to demonstrating
that the real source of the fallacious results associated with this example is
the problematic nature of the example itself, and not the nature of frequentist
reasoning. In particular, the mixed experiment constitutes an improper (misspecified) statistical model in the sense that, by assumption, it could not have
generated the data in question. What is often neglected in these discussions is
that in frequentist inference learning from data about the underlying mechanism can only occur when the relevant error probabilities relate directly to a
statistically adequate model, in the sense that its probabilistic assumptions are
valid for the data in question. Hence, using the mixed experiment as a basis
of frequentist inference will invariably lead to unreliable inferences because the
actual (true) error probabilities will be different from the nominal (assumed)
ones. This discrepancy offers a very different explanation for the fallacious
results discussed in the literature.
Key terms: mixed experiment, two measuring instruments example, conditionality principle, sufficiency principle, likelihood principle, statistical vs.
substantive information, statistical misspecification, specification uncertainty.
∗
Published in Advances and Applications in Statistical Science, 8: 29-47, 2013.
Thanks are due to Clark Glymour and Deborah Mayo for stimulating discussions on the topic of this paper.
1 Introduction
The ‘two measuring instruments’ example was introduced by Cox (1958) to illustrate
the cogency of Fisher’s Conditionality Principle (CP) as it relates to the irrelevance
of component experiments that could, but were not actually performed. The CP is
implemented by conditioning on ancillary statistics to define the relevant distribution for the sample. Historically this example has generated a number of different
controversies that linger on to this day.¹ In addition to raising issues about the role
of pre-data error probabilities and their long-run interpretation in the context of
Neyman-Pearson (N-P) testing, it has been used by critics of the frequentist approach
to question its underlying reasoning more generally. In particular, this example gave
Birnbaum (1962) the idea for the notion of a mixed experiment that constitutes a
key premise in his derivation of the Likelihood Principle (LP); the latter results by
combining the CP with the Sufficiency Principle (SP). It also features prominently
among the examples used by Bayesians to highlight the fallacious nature of frequentist reasoning more generally due to its supposed insistence on taking into account
all possible outcomes, irrespective of relevance; Cornfield (1969), Berger and Wolpert
(1988), DeGroot and Schervish (2002), Young and Smith (2005).
The main aim of this paper is to revisit this particular example, nowadays called
the ‘mixed experiment’, in order to elucidate some of its veiled features with a view to
demonstrating that the real source of the fallacious results is the problematic nature of
the example itself, and not the fallaciousness of frequentist reasoning. In particular,
the mixed experiment constitutes an improper (misspecified) statistical model in the
sense that it could not have generated the data in question; the data were generated by
a process with a constant variance, but the model assumes a heterogeneous variance
— a convex combination of two different variances. What causes confusion in these
discussions is that in frequentist inference learning from data can only occur when the
relevant error probabilities relate directly to an adequate description of the underlying
mechanism; a statistically adequate model in the sense that it could have generated the
particular data, i.e. its probabilistic assumptions are valid for the data in question; see
Mayo and Spanos (2004). Hence, using the mixed experiment as a basis of frequentist
inference will invariably lead to unreliable inferences because, by assumption, one
of the two assumed mechanisms did not contribute to the generation of the
data in question. As a result, the actual (true) error probabilities will be different
from the nominal (assumed) ones. This discrepancy between actual and nominal error
probabilities offers a very different explanation for the fallacious results discussed in
the literature.
It is argued that the Bayesian charges against frequentist reasoning based on this
particular example are misplaced and misleading. Moreover, the application of the
CP to address the seemingly fallacious results is both artificial and unnecessary when
statistical adequacy is viewed as a precondition for securing the reliability of inference.
¹ It is interesting to note that Cox (2006) refers to this example as ‘notorious’.
In addition, the paper calls into question a key premise in Birnbaum’s derivation of
the LP.
In section 2, the probabilistic structure of the mixed experiment example is clarified as a prelude to the discussion in section 3 that focuses on the key issues and
problems it raises when viewed as a possible mechanism that could have given rise
to the data in question. The key problem is that the mixed experiment model is
misspecified and the associated distribution of the sample is erroneous, giving rise to
erroneous error probabilities and thus unreliable inferences. In section 4 the problematic nature of the mixed experiment is traced to the imposition of illegitimate (from
the frequentist perspective) substantive information in the form of a prior distribution. This form of a priori information is shown to be incompatible with frequentist
inference because it pertains to substantive adequacy, not statistical adequacy, which is handled
very differently in the frequentist approach than in the Bayesian approach.
2 The ‘mixed experiment’ example
Cox (1958) described the ‘two measuring instruments’ example as follows:
“Suppose that we are interested in the mean $\mu$ of a normal population
and that, by an objective randomization device, we draw either (i) with
probability $\frac{1}{2}$, one observation, $x$, from a normal population of mean $\mu$
and variance $\sigma_1^2$; or (ii) with probability $\frac{1}{2}$, one observation $x$, from a
normal population of mean $\mu$ and variance $\sigma_2^2$, where $\sigma_1^2,\ \sigma_2^2$ are known,
$\sigma_2^2 \gg \sigma_1^2$, and where we know in any particular instance which population
has been sampled.” (p. 360)
To make the model more realistic, consider a sample of size $n$; i.e. data $\mathbf{x}_0:=(x_1, x_2, \ldots, x_n)$ is assumed to have arisen from one of two simple Normal models:
$$\mathcal{M}_1(\mathbf{x}): X_t \sim \text{N}(\mu, \sigma_1^2), \qquad \mathcal{M}_2(\mathbf{x}): X_t \sim \text{N}(\mu, \sigma_2^2), \qquad t=1, \ldots, n. \tag{1}$$
In addition, it is assumed that the following substantive information is available:
[i] the variances $(\sigma_1^2, \sigma_2^2)$ are known a priori, with $\sigma_2^2 > \sigma_1^2$,
[ii] a priori each model could have given rise to data $\mathbf{x}_0$ with probability $\frac{1}{2}$.
The primary hypotheses of interest are:
0 :  = 0 vs. 1 :  6= 0 
(2)
and the key question is ‘what is the relevant distribution of the sample upon which
frequentist inferences should be based?’
The conventional wisdom pertaining to the mixed experiment example can be
summarized as follows.
In light of [ii], the joint density of $(X_t, Z_t)$, where $Z_t$ is a Bernoulli random variable:
$$Z_t = 0 \text{ for } \mathcal{M}_1(\mathbf{x}),\ \ \mathbb{P}(Z_t=0)=\tfrac{1}{2}; \qquad Z_t = 1 \text{ for } \mathcal{M}_2(\mathbf{x}),\ \ \mathbb{P}(Z_t=1)=\tfrac{1}{2},$$
takes the form:
$$f(x_t, z_t; \mu) = f(x_t | z_t; \mu)\cdot f(z_t) = \tfrac{1}{2}\,\phi(x_t; \mu, \sigma_1^2)^{1-z_t}\,\phi(x_t; \mu, \sigma_2^2)^{z_t}, \quad z_t = 0, 1,$$
$$= \tfrac{1}{2}\,\mathbb{I}_{z_t=0}\cdot\phi(x_t; \mu, \sigma_1^2) + \tfrac{1}{2}\,\mathbb{I}_{z_t=1}\cdot\phi(x_t; \mu, \sigma_2^2), \tag{3}$$
where (;   2 ) denotes the Normal density N( 2 ) and I= the indicator function
for = Hence, the relevant model for Neyman-Pearson (N-P) testing is the ‘mixed
experiment’ model :
M (x) = 12 I=0 M1 (x) + 12 I=1 M2 (x)
whose distribution of the sample takes the form:
¡ ¢ Y
( ;  21 )1− ( ;  22 ) 
(x z; ) = 12
=1
(4)
(5)
where Z:=(1  2    ).
The statistics literature often uses a simplified version of the above distribution
of the sample that stems from the marginal density $f(x_t; \mu)$, derived by summing out $z_t$:
$$f(x_t; \mu) = \sum\nolimits_{z_t=0}^{1} f(x_t, z_t; \mu) = f(x_t | z_t=0; \mu)\cdot\mathbb{P}(z_t=0) + f(x_t | z_t=1; \mu)\cdot\mathbb{P}(z_t=1) = \tfrac{1}{2}\,\phi(x_t; \mu, \sigma_1^2) + \tfrac{1}{2}\,\phi(x_t; \mu, \sigma_2^2). \tag{6}$$
Under certain simplifying assumptions concerning the equal contributions of the two
sub-populations one can derive an approximate distribution of the sample (Lehmann,
1993):
$$f(\mathbf{x}; \mu) = \tfrac{1}{2}\prod_{t=1}^{n}\phi(x_t; \mu, \sigma_1^2) + \tfrac{1}{2}\prod_{t=1}^{n}\phi(x_t; \mu, \sigma_2^2), \ \text{ for all } \mathbf{x}\in\mathbb{R}^{n}. \tag{7}$$
Hence, the N-P test for the hypotheses in (2) that suggests itself is:
$$\tau(\mathbf{X}) = \frac{\sqrt{n}\,(\bar{X}_n - \mu_0)}{\sigma_*} \overset{H_0}{\sim} \text{N}(0, 1), \qquad C_1(\alpha) = \{\mathbf{x}: |\tau(\mathbf{x})| > c_{\alpha}\}, \tag{8}$$
where $\sigma_*^2 = \frac{\sigma_1^2 + \sigma_2^2}{2}$, $\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t$, and $C_1(\alpha)$ denotes the $\alpha$-level rejection region.
It turns out that when (7) is used as a basis of frequentist inference, the N-P t-test
(8) gives rise to several fallacious results. Berger and Wolpert (1988), p. 7, argued:
“... we can phrase this example as "mixed experiment" in which with
probability $\frac{1}{2}$ (independent of $\mu$) either experiment 1 or experiment 2
(both pertaining to $\mu$) will be performed. Should the analysis depend
only on the experiment actually performed, or should the possibility of
having done the other experiment be taken into account? The obvious
intuitive answer to the questions in the above example is that only the
experiment actually performed should matter. But this is counter to the
pre-experimental frequentist reasoning, which says that one should average over all possible outcomes (here, including the coin flip).”
Hence, the Bayesian explanation of why frequentist reasoning is fallacious in this
example is that it insists on taking into account experiments that could have happened
but didn’t, blaming the quantifier ‘for all $(\mathbf{x}, z)\in\mathbb{R}^{n}\times\{0, 1\}$’. This is usually contrasted
with the Bayesian approach, which adheres to the Likelihood Principle, insisting that
only the observed data $\mathbf{x}_0$ are relevant for inference purposes; see Robert (2007).
In light of such criticisms several suggestions have been proposed in the literature
on different ways to avoid them; see Lehmann (1993), Cox and Hinkley (1974). In
particular, Cox (1958) used this example to argue in favour of using Fisher’s Conditionality Principle (CP), implemented by conditioning on $Z=z$; $Z$ is an ancillary
statistic, since $\mathbb{P}(Z=0)=\mathbb{P}(Z=1)=\frac{1}{2}$ is independent of $\mu$. This yields the conditional
densities:
$$f(x_t | z_t=0) = \phi(x_t; \mu, \sigma_1^2), \qquad f(x_t | z_t=1) = \phi(x_t; \mu, \sigma_2^2),$$
suggesting that the relevant distribution of the sample is not (7) but:
$$f(\mathbf{x} | z; \mu) = \prod_{t=1}^{n} f(x_t | z_t = z; \mu), \ \text{ for all } \mathbf{x}\in\mathbb{R}^{n}. \tag{9}$$
Hence, the proper way to test (2) is based on two separate N-P tests, depending on $Z=z$:
$$\tau_i(\mathbf{X}) = \frac{\sqrt{n}\,(\bar{X}_n - \mu_0)}{\sigma_i} \overset{H_0}{\sim} \text{N}(0, 1), \qquad C_1^{i}(\alpha) = \{\mathbf{x}: |\tau_i(\mathbf{x})| > c_{\alpha}\}, \quad i=1, 2. \tag{10}$$
It is clear that the relevant error probabilities associated with the two N-P tests, (8)
and (10) will be different!
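To see the disagreement on actual data, the following sketch computes both test statistics for the same (simulated) sample known to have come from the first instrument; the variances, sample size and mean shift are purely illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu0, s1, s2 = 50, 0.0, 1.0, 5.0               # illustrative: sigma1 = 1, sigma2 = 5
x = rng.normal(mu0 + 0.3, s1, size=n)            # data from instrument 1, small shift in the mean

sigma_star = np.sqrt(0.5 * (s1**2 + s2**2))      # scale used by the unconditional test (8)
tau_mixed = np.sqrt(n) * (x.mean() - mu0) / sigma_star
tau_cond = np.sqrt(n) * (x.mean() - mu0) / s1    # conditional test (10) with z = 0

for name, tau in [("unconditional (8)", tau_mixed), ("conditional (10)", tau_cond)]:
    p = 2 * (1 - stats.norm.cdf(abs(tau)))
    print(f"{name}: tau = {tau:.2f}, p-value = {p:.4f}")
# the conditional test detects the small discrepancy from mu0; the unconditional
# test, scaled by the much larger sigma_*, typically does not
```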
3 Fallacious reasoning or an improper model?
The fact that – by assumption – the mixed experiment model $\mathcal{M}_{\varphi}(\mathbf{x})$ in (4) could
not have generated data $\mathbf{x}_0:=(x_1, x_2, \ldots, x_n)$ raises serious doubts about the above
Bayesian charges. It should be obvious that one could not learn from data $\mathbf{x}_0$ about
the generating mechanism by using error probabilities from a model that could not
have generated $\mathbf{x}_0$. To shed further light on this issue let us return to the basics of
the frequentist approach.
It is well known that the cornerstone of frequentist inference is provided by the
distribution of the sample, say:
$$f(x_1, x_2, \ldots, x_n; \boldsymbol{\theta}) := f(\mathbf{x}; \boldsymbol{\theta}), \ \text{ for all } \mathbf{x}\in\mathbb{R}^{n},$$
defined as the joint distribution of the random variables making up the sample of the
postulated statistical model:
$$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x}) = \{f(\mathbf{x}; \boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\}, \ \mathbf{x}\in\mathbb{R}^{n}.$$
The quantifier ‘for all $\mathbf{x}\in\mathbb{R}^{n}$’ refers to all possible outcomes associated with this particular statistical model. That is, the frequentist approach assumes that data $\mathbf{x}_0$
were generated by a single member of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$: the ‘true’ data-generating mechanism
$\mathcal{M}^{*}(\mathbf{x}) = \{f(\mathbf{x}; \boldsymbol{\theta}^{*})\}$, $\mathbf{x}\in\mathbb{R}^{n}$, where $\boldsymbol{\theta}^{*}$ denotes the ‘true’ value of $\boldsymbol{\theta}$. The family of models denoted by $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ for $\boldsymbol{\theta}\in\Theta$ aims to facilitate learning from data about $\mathcal{M}^{*}(\mathbf{x})$.
Hence, the quantifier ‘for all possible outcomes’ ($\mathbf{x}\in\mathbb{R}^{n}$) refers to all outcomes associated with $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, and not to those of other experiments that did not contribute to
generating $\mathbf{x}_0$.
The crucial role of $f(\mathbf{x}; \boldsymbol{\theta})$ stems from the fact that it defines the sampling distribution of all relevant statistics, such as estimators, tests or predictors. In particular,
the sampling distribution of any statistic $Y = g(\mathbf{X})$ is derived via:
$$F_{Y}(y; \boldsymbol{\theta}) := \mathbb{P}(Y \leq y; \boldsymbol{\theta}) = \underbrace{\int\!\cdots\!\int}_{\{\mathbf{x}:\, g(\mathbf{x}) \leq y;\ \mathbf{x}\in\mathbb{R}^{n}\}} f(\mathbf{x}; \boldsymbol{\theta})\, d\mathbf{x}. \tag{11}$$
In relation to this result it is important to emphasize that learning from data $\mathbf{x}_0$
about $\mathcal{M}^{*}(\mathbf{x})$ is achieved when the relevant error probabilities relate directly to a
statistically adequate model, in the sense that its probabilistic assumptions are valid for
data $\mathbf{x}_0$; see Mayo and Spanos (2004). In this sense, frequentist inference presupposes
a well-defined statistical model whose statistical adequacy has already been secured.
Anything less would not guarantee the reliability of frequentist inferences.
Let us take a closer look at the above claim that N-P testing, based on the pre-data
type I and II error probabilities, requires $f(\mathbf{x}; \mu)$ in (7) to be the relevant distribution
of the sample. This claim is clearly false because (7) will yield the wrong error
probabilities. This stems from the fact that the ‘mixed experiment’ model $\mathcal{M}_{\varphi}(\mathbf{x})$ in
(4), associated with (7), is misspecified because we know that the data $\mathbf{x}_0$ could not
have been generated by (4). By assumption, at least one of the two models ($\mathcal{M}_1(\mathbf{x})$
or $\mathcal{M}_2(\mathbf{x})$) had nothing to do with generating data $\mathbf{x}_0$. This has certain serious
consequences:
[a] it gives rise to the wrong likelihood function,
[b] it yields erroneous sampling distributions, and thus
[c] it generates erroneous error probabilities.
That is, using (7) as the relevant $f(\mathbf{x}; \boldsymbol{\theta})$ will give rise to the wrong sampling distribution $F_{Y}(y; \boldsymbol{\theta})$, and as a result the nominal [assumed] and actual
error probabilities will differ, giving rise to unreliable inferences. The surest
way to lead an inference astray is to fix the significance level at .05 when the actual
type I error, due to misspecification, is closer to .90; see Spanos and McGuirk (2001).
This suggests that the inevitable discrepancy between the actual and nominal error
probabilities, due to the statistical misspecification of $\mathcal{M}_{\varphi}(\mathbf{x})$, can easily explain the
fallacious results discussed in the literature. In particular, it explains the counterintuitive results
based on the test in (8) as due to the incorrect error probabilities stemming from (7);
see Berger and Wolpert (1988), Lehmann (1993).
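In the present example the discrepancy can be computed exactly: under test (8) the statistic is scaled by $\sigma_*$, whereas the data were generated by a single instrument, so under $H_0$ the actual rejection probability is $2(1 - \Phi(c_{\alpha}\sigma_*/\sigma_{\text{true}}))$ rather than the nominal $\alpha$. A short numerical sketch, with illustrative variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 25$:

```python
import numpy as np
from scipy import stats

alpha, s1, s2 = 0.05, 1.0, 5.0                      # nominal level; illustrative sigmas
sigma_star = np.sqrt(0.5 * (s1**2 + s2**2))         # scale assumed by the mixed model, eqs. (7)-(8)
c_alpha = stats.norm.ppf(1 - alpha / 2)

for s_true in (s1, s2):                             # data generated by one instrument only
    actual = 2 * (1 - stats.norm.cdf(c_alpha * sigma_star / s_true))
    print(f"true sigma = {s_true}: nominal size = {alpha}, actual type I error = {actual:.4f}")
# with sigma = 1 the actual size is essentially zero (the test is severely undersized);
# with sigma = 5 it is roughly three times the nominal level
```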
Are there any circumstances under which the distribution of the sample in (7)
is the relevant one? The answer is yes, in the case where both sub-models in (1)
contributed to the generation of data $\mathbf{x}_0$. Consider the experiment where, before
each observation, a fair coin is tossed to decide which measuring instrument is to be
used. In this case $\mathcal{M}_{\varphi}(\mathbf{x})$ constitutes a proper substantive model from the frequentist
perspective and (7) is the relevant distribution of the sample to be used as a basis of
frequentist inference. In this case no fallacious results will arise.
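A quick simulation sketch (illustrative values again: $\sigma_1^2=1$, $\sigma_2^2=25$, $n=100$) confirms that when the coin is actually tossed before every observation, so that both instruments contribute to every sample, the test in (8) approximately attains its nominal size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu0, s1, s2, alpha, reps = 100, 0.0, 1.0, 5.0, 0.05, 20_000
sigma_star = np.sqrt(0.5 * (s1**2 + s2**2))
c_alpha = stats.norm.ppf(1 - alpha / 2)

rejections = 0
for _ in range(reps):
    z = rng.integers(0, 2, size=n)                  # fair coin tossed before each observation
    x = rng.normal(mu0, np.where(z == 0, s1, s2))   # each x_t drawn from the chosen instrument
    tau = np.sqrt(n) * (x.mean() - mu0) / sigma_star
    rejections += abs(tau) > c_alpha
print(rejections / reps)                            # close to the nominal 0.05
```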
How did this misspecification come about? A first clue is given by the fact that
assumption [ii] is equivalent to assuming a prior $\pi(\sigma^2)$ of the form:
$$\begin{array}{c|cc} \sigma^2 & \sigma_1^2 & \sigma_2^2 \\ \hline \pi(\sigma^2) & \tfrac{1}{2} & \tfrac{1}{2} \end{array} \tag{12}$$
A second clue is that combining (12) with (1) yields a posterior that coincides
with the distribution of the mixed experiment model $\mathcal{M}_{\varphi}(\mathbf{x})$ in (7), but re-interpreted
as a function of the unknown parameter $\mu$, not $\mathbf{x}$. That is, the form of misspecification
of $\mathcal{M}_{\varphi}(\mathbf{x})$ fits nicely into the Bayesian approach as a consequence of using a prior
distribution to represent the substantive specification uncertainty pertaining to $\sigma^2$,
indicating ignorance as to which of the two measuring devices has been used. Hence,
$\mathcal{M}_{\varphi}(\mathbf{x})$ makes perfectly good sense in a Bayesian context where a uniform prior is
attached to $\sigma^2$.
More generally, the mixed experiment model $\mathcal{M}_{\varphi}(\mathbf{x})$ makes no sense in the context
of the frequentist approach, simply because it could not have generated data $\mathbf{x}_0$. The
focus of frequentist inference is on the ‘true’ model $\mathcal{M}^{*}(\mathbf{x})$, and whenever a prior is
introduced as part of the available substantive information it will invariably give rise
to the wrong distribution of the sample. Intuitively, what goes wrong is that the prior
probabilities in (12) will invariably be false, since the actual distribution associated
with $\mathcal{M}^{*}(\mathbf{x})$ will always be degenerate, i.e. it will take the value 1 for $\boldsymbol{\theta} = \boldsymbol{\theta}^{*}$ and 0
for every other value of $\boldsymbol{\theta}$.
The question that naturally arises is why frequentist statisticians have not rejected
(4) as an illegitimate premise for discussing frequentist inference in the context of
the mixed experiment model. The answer is that shedding light on the inconspicuous
features of the mixed experiment model calls for the elucidation of two fundamental
foundational issues pertaining to frequentist inference:
(a) what constitutes a legitimate statistical model, and
(b) how should substantive information like [i]-[ii] above be handled.
A key problem of frequentist modeling and inference relates to how the statistical
and substantive information can be blended without compromising the credibility of
either source of information. Disentangling the two sources of information has been
a long standing problem in frequentist statistics.
4 Substantive information and the reliability of inferences
Empirical models in most applied fields constitute a blend of statistical and substantive information, ranging from a solely data-driven formulation like an ARIMA(p,d,q)
model, to an entirely theory-driven formulation like a structural (simultaneous) equation
model; see Spanos (2006). The majority of empirical models lie someplace in between these two extremes, but the role of the two sources of information and how
they could be properly combined to learn from data are issues that have not been
clearly delineated; see Lehmann (1990), Cox (1990).
In relation to the blending of the statistical and substantive information, the
imposition of the latter on the data at the outset by estimating the structural model
M (x) directly is invariably a bad strategy because the presence of any statistical
misspecification is likely to undermine the reliability of any inferences pertaining to
substantive information.
One way to avert this problem is to distinguish clearly between statistical and
substantive information and adopt a purely probabilistic perspective on the notion of
a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, based exclusively on statistical information; the latter is
reflected in the chance regularity patterns exhibited by data $\mathbf{x}_0$; see Spanos (1999).
One can then relate it to the substantive information once the statistical adequacy
of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ has been secured.
From this perspective, a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is viewed as a parameterization
of a generic stochastic process $\{X_t,\ t\in\mathbb{N}\}$ assumed to underlie data $\mathbf{x}_0$. The specification of the statistical model takes the form of selecting a probabilistic structure of
$\{X_t,\ t\in\mathbb{N}\}$ with a view to:
(i) render $\mathbf{x}_0$ a ‘truly typical’ realization thereof, and
(ii) choose a parameterization of $\{X_t,\ t\in\mathbb{N}\}$ that enables one to pose the substantive questions of interest.
In this sense, a statistical model comprises a set of probabilistic assumptions that
define the statistical (inductive) premises of inference, built exclusively on statistical
information. Statistical adequacy refers to the model assumptions being valid vis-à-vis data $\mathbf{x}_0$. The particular parameterization of $\{X_t,\ t\in\mathbb{N}\}$ that defines the statistical
model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is chosen with a view to potentially accommodate the substantive information by rendering the substantive model $\mathcal{M}_{\varphi}(\mathbf{x})$ a reparametrization/restriction
thereof; see Spanos (2006).
In the case of the ‘two measuring instruments’ example, a truly typical realization
of the data $\mathbf{x}_0:=(x_1, x_2, \ldots, x_n)$ – assumed to have been generated by one of the two
simple Normal models in (1) – would look like figure 1. Hence, the relevant statistical
model is:
$$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x}): X_t \sim \text{NIID}(\mu, \sigma^2), \qquad \boldsymbol{\theta}:=(\mu, \sigma^2)\in\mathbb{R}\times\mathbb{R}_{+}, \qquad t=1, \ldots, n, \tag{13}$$
where both parameters $(\mu, \sigma^2)$ are assumed unknown. The underlying distribution of
the sample is:
$$f(\mathbf{x}; \boldsymbol{\theta}) = \left(\tfrac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left\{-\tfrac{1}{2\sigma^2}\sum\nolimits_{t=1}^{n}(x_t - \mu)^2\right\}, \tag{14}$$
which is very different from (7). Indeed, in the context of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ (table 1) the optimal
(Uniformly Most Powerful Unbiased; see Lehmann, 1986) N-P test for assessing the
primary hypotheses of interest (2) is:
$$\tau_{*}(\mathbf{X}) = \frac{\sqrt{n}\,(\bar{X}_n - \mu_0)}{s} \overset{H_0}{\sim} \text{St}(n-1), \quad s^2 = \tfrac{1}{n-1}\sum\nolimits_{t=1}^{n}(X_t - \bar{X}_n)^2, \quad C_1(\alpha) = \{\mathbf{x}: |\tau_{*}(\mathbf{x})| > c_{\alpha}\}. \tag{15}$$
It is easy to see that the relevant error probabilities associated with this t-test will
be different from those of the N-P tests in (8) and (10).
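A one-line sketch of the t-test (15) using standard library routines; the data are simulated here purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=1.0, size=50)     # hypothetical sample from the simple Normal model

# test (15): tau_*(X) = sqrt(n)(xbar - mu0)/s, distributed St(n-1) under H0
res = stats.ttest_1samp(x, popmean=0.0)
print(f"tau_* = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```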
The main differences between the two distributions of the sample in (7) and (14)
stem from the fact that the substantive information [i]-[ii] should be considered irrelevant for the specification of the statistical model. This is because such information
does not pertain directly to the probabilistic structure of the process $\{X_t,\ t\in\mathbb{N}\}$,
as reflected in the chance regularities exhibited by data $\mathbf{x}_0$ in figure 1; it concerns
the particular value of $\sigma^2$. Whether the ‘true’ substantive model behind $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is
$\mathcal{M}_1(\mathbf{x})$ or $\mathcal{M}_2(\mathbf{x})$ is irrelevant for selecting and validating the statistical premises,
as given in table 1, because $\sigma^2$ is allowed to take any positive value.
[Fig. 1: t-plot of NIID(0,1) data]
Table 1 - Simple Normal Model
Statistical GM: $X_t = \mu + u_t$, $t\in\mathbb{N}=\{1, 2, \ldots\}$
[1] Normality: $X_t \sim \text{N}(\cdot, \cdot)$, $x_t\in\mathbb{R}$,
[2] Constant mean: $E(X_t) = \mu$,          for all $t\in\mathbb{N}$.
[3] Constant variance: $Var(X_t) = \sigma^2$,
[4] Independence: $\{X_t,\ t\in\mathbb{N}\}$ is an independent process.
For securing statistical adequacy one needs a complete list of testable probabilistic
assumptions as in table 1. To assess the validity of assumptions [1]-[4] one needs to
apply thorough Mis-Specification (M-S) testing; see Mayo and Spanos (2004). Hence,
a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is deemed appropriate when its assumptions are valid for
data $\mathbf{x}_0$. It is important, however, to emphasize that statistical adequacy does not
imply substantive adequacy, but it is a precondition for ensuring the reliability of any
inference pertaining to probing the validity of the substantive information. Because
of its crucial importance, this strategy is elevated to a modeling/inference principle.
Statistical Adequacy Principle. The statistical adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is a precondition for ensuring the reliability of appraising substantive hypotheses or claims,
including the substantive adequacy of $\mathcal{M}_{\varphi}(\mathbf{x})$.
The acceptance of this principle addresses the issue of what is the relevant distribution of the sample and secures the reliability of inference for assessing the empirical
validity of substantive claims of interest.
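As a rough indication of what probing assumptions [1]-[4] of table 1 might involve, the sketch below applies a few standard diagnostic checks; it is only suggestive of the kind of M-S tests involved, not the particular battery advocated here, and the splitting of the sample in half is an assumption of the sketch.

```python
import numpy as np
from scipy import stats

def ms_checks(x):
    """Simple mis-specification checks for assumptions [1]-[4] of the simple Normal model."""
    x = np.asarray(x)
    first, second = x[: len(x) // 2], x[len(x) // 2:]
    return {
        "[1] Normality (Shapiro-Wilk p-value)": stats.shapiro(x).pvalue,
        "[2] Constant mean (two-sample t p-value)": stats.ttest_ind(first, second).pvalue,
        "[3] Constant variance (Levene p-value)": stats.levene(first, second).pvalue,
        "[4] Independence (lag-1 autocorrelation)": float(np.corrcoef(x[:-1], x[1:])[0, 1]),
    }

rng = np.random.default_rng(4)
for name, value in ms_checks(rng.normal(0.0, 1.0, size=100)).items():
    print(f"{name}: {value:.3f}")
```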
4.1 Appraising the prior information in [i]: $\sigma_2^2 > \sigma_1^2$
To illustrate this, let us return to the statistical model in (13) and assume that its
statistical adequacy has been secured. One can then proceed to assess the substantive
information in [i] using N-P testing.
Since the two models $\mathcal{M}_1(\mathbf{x})$ and $\mathcal{M}_2(\mathbf{x})$ are nested within [constitute reparameterizations/restrictions of] the statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, the validity of the substantive
information in [i] can easily be assessed by testing which one, if either, of the two
substantive models $\mathcal{M}_1(\mathbf{x})$ or $\mathcal{M}_2(\mathbf{x})$ the data indicate as the generating process.
An appropriate way to frame this is in terms of the partitioning hypotheses:
0 :  2 ≤  21 vs. 1 :  2   21 
(16)
where the substantive discrepancy = 22 −  21  0 from the null is of particular
interest; see Mayo and Spanos (2006). Equivalently, one can specify the hypotheses
as: 0 :  2 ≤  22 vs. 1 :  2   22  with the same substantive discrepancy of interest
= 22 −  21  0. For testing the hypotheses in (16), the test based on:
¢2
P ¡

(17)
1 ()={x: (x)   }
(X)=[ =1  −  21 ] v0 2 (−1)
is Uniformly Most Powerful (UMP); see Lehmann (1986).
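A short sketch of the UMP test (17); $\sigma_1^2 = 1$, $\alpha = 0.05$ and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def variance_test(x, sigma1_sq, alpha=0.05):
    """UMP test (17) of H0: sigma^2 <= sigma1^2 vs. H1: sigma^2 > sigma1^2."""
    x = np.asarray(x)
    n = len(x)
    v = np.sum((x - x.mean()) ** 2) / sigma1_sq    # v(X) ~ chi^2(n-1) at the boundary sigma^2 = sigma1^2
    c_alpha = stats.chi2.ppf(1 - alpha, df=n - 1)
    p_value = 1 - stats.chi2.cdf(v, df=n - 1)
    return v, p_value, v > c_alpha                 # (statistic, p-value, reject H0?)

rng = np.random.default_rng(5)
print(variance_test(rng.normal(0.0, 5.0, size=50), sigma1_sq=1.0))   # large variance: H0 rejected
```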
This way of dealing with point hypotheses is in sharp contrast with the way Berger
and Wolpert (1988), p. 7, deal with their example 3, based on testing the two simple
hypotheses $\theta = -1$ vs. $\theta = 1$, claiming that their treatment stems from Neyman-Pearson (N-P) testing reasoning. N-P testing has its fair share of problems, but this
is not one of them, because the two hypotheses, when properly defined, need to be
specified in such a way that they constitute a partition of the relevant parameter
space. Hence, what a proper N-P tester would not do is to specify the null and
alternative hypotheses as:
$$H_0: \sigma^2 = \sigma_1^2 \ \text{ vs. } \ H_1: \sigma^2 = \sigma_2^2, \tag{18}$$
and apply the Neyman-Pearson lemma. This is because (18) does not constitute
a partition of the statistically relevant parameter space. The fact of the matter
is that although there might be only two values $(\sigma_1^2, \sigma_2^2)$ which are of interest from a
substantive viewpoint, all possible values of the parameter $\sigma^2\in\mathbb{R}_{+}$ are of interest from
the statistical perspective.
In this sense this test will be used to assess the data-cogency of the substantive
information in [i] and only impose it if validated. Note that all three models $\mathcal{M}_1(\mathbf{x})$,
$\mathcal{M}_2(\mathbf{x})$ and $\mathcal{M}_{\varphi}(\mathbf{x})$ incorporate untested substantive information. Assuming that
one of the two models $\mathcal{M}_1(\mathbf{x})$ or $\mathcal{M}_2(\mathbf{x})$ is data-cogent, say $\mathcal{M}_1(\mathbf{x})$, one could
then impose the substantive information $\sigma^2 = \sigma_1^2$ and proceed to test the primary
hypothesis of interest (2) in the context of the latter. In certain cases imposing the
substantive information makes sense because the inference based on $\mathcal{M}_1(\mathbf{x})$ can be
shown to be more precise than the one based on $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, which ignores the information
$\sigma^2 = \sigma_1^2$. In particular, the relevant test will not be (15) but:
$$\tau_1(\mathbf{X}) = \frac{\sqrt{n}\,(\bar{X}_n - \mu_0)}{\sigma_1} \overset{H_0}{\sim} \text{N}(0, 1), \qquad C_1^{1}(\alpha) = \{\mathbf{x}: |\tau_1(\mathbf{x})| > c_{\alpha}\}. \tag{19}$$
4.2 The mixed experiment as a proper model
It is important to note that when the mixed experiment model $\mathcal{M}_{\varphi}(\mathbf{x})$ is a proper
model from the frequentist perspective, i.e. both sub-models in (1) did contribute
to the generation of $\mathbf{x}_0$, it can be viewed as a special case of the statistical model
$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ in (13). In this sense, the substantive information incorporated in $\mathcal{M}_{\varphi}(\mathbf{x})$
can be appraised by testing the hypotheses:
$$H_0: \sigma^2 = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) \ \text{ vs. } \ H_1: \sigma^2 \neq \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2).$$
From the statistical adequacy perspective, one can use the t-plots of the data
to distinguish between the three models $\mathcal{M}_1(\mathbf{x})$, $\mathcal{M}_2(\mathbf{x})$ and $\mathcal{M}_{\varphi}(\mathbf{x})$. It is easy to
distinguish between $\mathcal{M}_1(\mathbf{x})$ and $\mathcal{M}_2(\mathbf{x})$ by simply noting that a typical realization
of the latter is likely to have much bigger variation around the mean. Figures 1-2 show the
t-plots of typical realizations of NIID(0,1) and NIID(0,25) processes; the ranges
of the y-axis values of $x$ and $z$ are very different, $(-3, 3)$ and $(-15, 15)$ respectively.
[Fig. 2: t-plot of NIID(0,25) data]
[Fig. 3: t-plot of $\tfrac{1}{2}$N(0,1) + $\tfrac{1}{2}$N(0,25) data]
[Fig. 4: t-plot of all three data series]
Also notice that the typical realization of the mixed experiment, when it is implemented as envisaged, exhibits leptokurticity, as shown in fig. 3; this is often measured
using the kurtosis coefficient:
$$\alpha_4 = \frac{E[X - E(X)]^4}{\left[\sqrt{E[X - E(X)]^2}\right]^{4}}.$$
The data in figs. 1-2 have an estimated kurtosis coefficient $\hat{\alpha}_4 \simeq 3$, but for the data in
fig. 3, $\hat{\alpha}_4 \simeq 6.5$. That is, one can distinguish between $\mathcal{M}_1(\mathbf{x})$, $\mathcal{M}_2(\mathbf{x})$ and $\mathcal{M}_{\varphi}(\mathbf{x})$
on statistical adequacy grounds, because the typical realizations of the respective
underlying processes are easily distinguishable by a simple M-S test for Normality; in
the case of figure 4, eyeballing will be sufficient to reject Normality.
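The leptokurtosis claim is easy to check by simulation. The sketch below generates NIID(0,1), NIID(0,25) and the 50-50 mixture, and reports the sample kurtosis together with a standard Normality test; the sample size is an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 10_000
x = rng.normal(0.0, 1.0, size=n)                 # NIID(0,1), as in fig. 1
z = rng.normal(0.0, 5.0, size=n)                 # NIID(0,25), as in fig. 2
coin = rng.integers(0, 2, size=n)
y = np.where(coin == 0, x, z)                    # the 0.5*N(0,1) + 0.5*N(0,25) mixture, as in fig. 3

for name, data in [("N(0,1)", x), ("N(0,25)", z), ("mixture", y)]:
    a4 = stats.kurtosis(data, fisher=False)      # alpha_4; equal to 3 for a Normal distribution
    p_norm = stats.normaltest(data).pvalue       # D'Agostino-Pearson Normality test
    print(f"{name}: kurtosis = {a4:.2f}, Normality-test p-value = {p_norm:.3f}")
# the two Normal samples give kurtosis near 3; the mixture is markedly leptokurtic
# and the Normality test rejects decisively
```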
4.3 Dealing with the prior information [ii]
In contrast to [i], the a priori information in [ii] – ‘each model could have given rise
to data $\mathbf{x}_0$ with probability $\frac{1}{2}$’ – is of a very different nature. It turns out that its
imposition at the outset is inappropriate for frequentist inference. Any information
in the form of a prior distribution for $\sigma^2$ that pertains to substantive specification
uncertainty has no place in frequentist modeling and/or inference.
In light of the fact that the primary aim of the frequentist approach is to learn from
data $\mathbf{x}_0$ about the ‘true’ underlying data-generating mechanism $\mathcal{M}^{*}(\mathbf{x}) = \{f(\mathbf{x}; \boldsymbol{\theta}^{*})\}$,
$\mathbf{x}\in\mathbb{R}^{n}$ – whatever value $\boldsymbol{\theta}^{*}$ happens to be – any form of substantive uncertainty
is addressed within the context of a statistically adequate $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ using N-P testing.
This is in sharp contrast with the Bayesian approach, whose primary inferential aim is
to change the initial ranking of all the possible data-generating mechanisms [i.e. different
models $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x}_0)$, $\boldsymbol{\theta}\in\Theta$], based on the prior distribution $\pi(\boldsymbol{\theta})$, into a revised ranking in
light of the data $\mathbf{x}_0$, based on the posterior distribution:
$$\pi(\boldsymbol{\theta}\,|\,\mathbf{x}_0) \propto L(\boldsymbol{\theta}\,|\,\mathbf{x}_0)\cdot\pi(\boldsymbol{\theta}), \qquad \boldsymbol{\theta}\in\Theta, \tag{20}$$
where $L(\boldsymbol{\theta}\,|\,\mathbf{x}_0) \propto f(\mathbf{x}_0\,|\,\boldsymbol{\theta})$ denotes the likelihood function. Moreover, beyond the
inferential component, the Bayesian approach often has another component, that of decision making. Which one of these models, if any, might be chosen will depend not only on the
revised ranking stemming from $\pi(\boldsymbol{\theta}\,|\,\mathbf{x}_0)$, $\boldsymbol{\theta}\in\Theta$, but also on the relevant loss (utility)
function adopted by the modeler.
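For concreteness, a sketch of the updating in (20) for the two-point prior (12) follows; treating $\sigma^2$ as the unknown parameter and taking $\mu$ as known are simplifying assumptions of the sketch, and the data and variances are illustrative.

```python
import numpy as np
from scipy import stats

def posterior_over_sigma(x, mu, s1_sq, s2_sq, prior=(0.5, 0.5)):
    """Posterior over {sigma1^2, sigma2^2} as in (20), using the two-point prior (12)."""
    x = np.asarray(x)
    log_liks = np.array([
        stats.norm.logpdf(x, mu, np.sqrt(s1_sq)).sum(),
        stats.norm.logpdf(x, mu, np.sqrt(s2_sq)).sum(),
    ])
    w = np.log(prior) + log_liks              # log prior + log likelihood
    w = np.exp(w - w.max())                   # normalize in a numerically stable way
    return w / w.sum()

rng = np.random.default_rng(7)
x0 = rng.normal(0.0, 1.0, size=20)            # data actually generated with sigma^2 = 1
print(posterior_over_sigma(x0, mu=0.0, s1_sq=1.0, s2_sq=25.0))   # mass piles on sigma1^2
```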
For frequentist modeling/inference it is the statistical specification uncertainty
– the appropriate selection of the statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ – that is of concern. It is
addressed using thorough Mis-Specification (M-S) testing and respecification (Spanos,
2006). The crucial difference between the statistical and substantive specification
uncertainties is that the former calls for testing beyond the boundaries of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$,
but the latter can easily be dealt with by N-P testing within $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$; see Spanos
(1999).
The frequentist perspective is, once again, in sharp contrast to the Bayesian one
that often ignores the statistical specification uncertainty altogether, by treating the
choice of the model (or the likelihood function) on par with that of the prior distribution. According to Kadane (2011):
“... likelihoods are just as subjective as priors, and there is no reason to expect
scientists to agree on them in the context of an applied problem.” (p. 445)
From the frequentist perspective likelihoods are defined by the probabilistic assumptions of a statistical model, like [1]-[4] in table 1. Hence, there is nothing subjective or arbitrary about the choice of a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, and the selection is
not based on any agreement amongst scientists, since the validity of its probabilistic
assumptions is independently testable vis-à-vis data $\mathbf{x}_0$.
In summary, the framing of substantive specification uncertainty in the form of
prior distributions on parameters is incompatible with frequentist modeling and inference. When such information is imposed at the outset, as in the case of the mixed
experiment model, the result is always a misspecified statistical model, in the sense
that it could not have given rise to the particular data $\mathbf{x}_0$ because, by definition, it
includes component models which had nothing to do with generating $\mathbf{x}_0$. In a certain sense this is the key difference between what constitutes legitimate substantive
information in the frequentist vs. the Bayesian approaches.
Bayesians often blur the distinction between a priori substantive subject matter
information and a prior distribution over the unknown parameters, $\pi(\boldsymbol{\theta})$, $\boldsymbol{\theta}\in\Theta$, and
proceed to admonish frequentists that they ignore substantive information. In most
scientific fields a priori substantive information does not come in the form of a prior
distribution, but as deterministic restrictions on the parameters, including the sign,
magnitude and possible range of values. Such substantive information is easily accommodated in frequentist statistics because it is required that the relationship between
the statistical ($\boldsymbol{\theta}$) and substantive ($\boldsymbol{\varphi}$) parameters, say $G(\boldsymbol{\theta}, \boldsymbol{\varphi}) = \mathbf{0}$, $\boldsymbol{\theta}\in\Theta$, $\boldsymbol{\varphi}\in\Phi$, can
be solved uniquely for the latter. This is known in econometrics as the identification condition; see Spanos (2006). On the other hand, replacing $f(\mathbf{x}; \boldsymbol{\theta})$ in (11) with
the product $f(\mathbf{x}; \boldsymbol{\theta})\cdot\pi(\boldsymbol{\theta})$ will totally undermine the very notion of the relevant error
probabilities. But that’s exactly what the adoption of the mixed experiment model
amounts to!
4.4 Whither the Likelihood Principle?
An important implication of the resulting statistical misspecification associated with
the mixed experiment model $\mathcal{M}_{\varphi}(\mathbf{x}) = \tfrac{1}{2}\mathcal{M}_1(\mathbf{x}) + \tfrac{1}{2}\mathcal{M}_2(\mathbf{x})$ is that a key premise
of Birnbaum’s (1962) derivation of the Likelihood Principle (LP) is called into question. This is the premise which asserts that $\mathcal{M}_{\varphi}(\mathbf{x})$ provides the same evidence
about $\mu$ as the model that actually gave rise to the data, by invoking the Sufficiency
Principle; see Berger and Wolpert (1988), p. 27.
When evidence is evaluated in terms of the relevant error probabilities, this claim
is patently false. The error probabilities for testing (2) stemming from the mixed
model $\mathcal{M}_{\varphi}(\mathbf{x})$ will be very different from those associated with a statistically adequate
$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, or even the $\mathcal{M}_1(\mathbf{x})$ that will ensue from Fisher’s CP. Indeed, Mayo (2010) goes
even further by demonstrating that Birnbaum’s proof of the Strong LP is erroneous.
5 Conclusion
The two measuring instruments (mixed experiment) example raises a number of different issues that boil down to:
(a) what constitutes a proper statistical model, and
(b) how to handle the a priori information:
[i] the variances $(\sigma_1^2, \sigma_2^2)$ are known a priori, with $\sigma_2^2 > \sigma_1^2$,
[ii] a priori each model [with $\sigma_1^2$ or $\sigma_2^2$] could have given rise to data $\mathbf{x}_0$
with probability $\frac{1}{2}$.
It was argued above that the traditional approach of imposing all substantive
information at the outset often gives rise to unreliable inferences. A way to avoid
that is to distinguish between statistical and substantive information, ab initio, and
adopt a purely probabilistic construal of the notion of a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$.
Once the statistical adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is established, one can then proceed to test
the validity of the substantive information of interest before imposing it. From this
frequentist perspective the validity of the substantive information in [i] can be tested
using the UMP test in (17), after the statistical adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ (table 1) has been
secured.
On the other hand, the substantive information in [ii], which amounts to assuming
a prior distribution $\pi(\sigma^2)$, is impertinent from the frequentist perspective because it
will give rise to an improper (misspecified) model. Hence, the ‘mixed experiment’
model $\mathcal{M}_{\varphi}(\mathbf{x})$, although legitimate from the Bayesian perspective, is improper because it could not have generated data $\mathbf{x}_0$. That is, one does not learn from data by
deriving the relevant error probabilities from a misspecified model. This explains the
fallacious results discussed in the literature as stemming from the inevitable discrepancy between actual and nominal error probabilities when (7) is used as the relevant
distribution of the sample.
References
[1] Berger, J. O. and R.W. Wolpert (1988), The Likelihood Principle, Institute of
Mathematical Statistics, Lecture Notes - Monograph series, 2nd edition, vol. 6,
California, Hayward.
[2] Birnbaum, A. (1962), “On the Foundations of Statistical Inference,” Journal of
the American Statistical Association, 57: 269-306.
[3] Cornfield, J. (1969), “Bayesian Outlook and its Applications,” Biometrics, 25:
617-657.
[4] Cox, D. R. (1958), “Some Problems Connected with Statistical Inference,” The
Annals of Mathematical Statistics, 29: 357-372.
[5] Cox, D. R. (1990), “Role of Models in Statistical Analysis,” Statistical Science,
5: 169-174.
[6] Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University
Press, Cambridge.
[7] Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, Chapman & Hall,
London.
[8] DeGroot, M. H. and M. J. Schervish (2002), Probability and Statistics, 3rd ed.,
Addison-Wesley, NY.
[9] Kadane, J. B. (2011), Principles of Uncertainty, Chapman & Hall, NY.
[10] Lehmann, E. L. (1986), Testing statistical hypotheses, 2nd edition, Wiley, NY.
[11] Lehmann, E. L. (1990) “Model specification: the views of Fisher and Neyman,
and later developments”, Statistical Science, 5, 160-168.
[12] Lehmann, E. L. (1993), The Fisher and Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?, Journal of the American Statistical Association,
88: 1242-9.
[13] Mayo, D. G. (2010), “An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle,” pp. 305-314 in Mayo and Spanos (2010).
[14] Mayo, D. G. and A. Spanos (2004), “Methodology in Practice: Statistical Misspecification Testing”, Philosophy of Science, 71: 1007-1025.
[15] Mayo, D. G. and Spanos, A. (2006), “Severe testing as a basic concept in a
Neyman-Pearson philosophy of induction,” British Journal for the Philosophy
of Science, 57: 323-357.
[16] Mayo, D. and Spanos, A. (eds.) (2010), Error and Inference, Recent Exchanges
on Experimental Reasoning, Reliability, and the Objectivity and Rationality of
Science, Cambridge University Press, Cambridge.
[17] Pratt, J. W. (1961), “Book Review: Testing Statistical Hypotheses, by E. L.
Lehmann,” Journal of the American Statistical Association, 56: 163-167.
[18] Mayo, D. G. and A. Spanos (2011), “Error Statistics,” pp. 153-198 in Philosophy
of Statistics, Handbook of Philosophy of Science, Elsevier, (eds.) D. Gabbay, P.
Thagard, and J. Woods.
[19] Robert, C. (2007), The Bayesian Choice, 2nd ed., Springer, NY.
[20] Spanos, A. (1999), Probability Theory and Statistical Inference: econometric
modeling with observational data, Cambridge University Press, Cambridge.
[21] Spanos, A. (2006), “Where Do Statistical Models Come From? Revisiting
the Problem of Specification,” pp. 98-119 in Optimality: The Second Erich L.
Lehmann Symposium, edited by J. Rojo, Lecture Notes-Monograph Series, vol.
49, Institute of Mathematical Statistics.
[22] Spanos, A. (2011), “Revisiting the Welch Uniform Model: A case for Conditional
Inference?” Advances and Applications in Statistical Sciences, 5: 33-52.
[23] Spanos, A. (2013), “Who should be afraid of the Jeffreys-Lindley paradox?”,
Philosophy of Science, 80: 73-93.
[24] Spanos, A. and A. McGuirk (2001), “The Model Specification Problem from
a Probabilistic Reduction Perspective,” American Journal of Agricultural
Economics, 83: 1168-1176.
[25] Young, G. A. and R. L. Smith (2005), Essentials of Statistical Inference, Cambridge University Press, Cambridge.