Revisiting the ‘Mixed Experiment’ Example: Fallacious Frequentist Reasoning or an Improper Statistical Model?∗

Aris Spanos†
Department of Economics, Virginia Tech, Blacksburg, VA 24061
[email protected]

January 2013

Abstract

The primary aim of this paper is to revisit the ‘mixed experiment’ example in order to elucidate some of its veiled features, with a view to demonstrating that the real source of the fallacious results associated with this example is the problematic nature of the example itself, and not the nature of frequentist reasoning. In particular, the mixed experiment constitutes an improper (misspecified) statistical model in the sense that, by assumption, it could not have generated the data in question. What is often neglected in these discussions is that in frequentist inference learning from data about the underlying mechanism can only occur when the relevant error probabilities relate directly to a statistically adequate model, in the sense that its probabilistic assumptions are valid for the data in question. Hence, using the mixed experiment as a basis of frequentist inference will invariably lead to unreliable inferences because the actual (true) error probabilities will be different from the nominal (assumed) ones. This discrepancy offers a very different explanation for the fallacious results discussed in the literature.

Key terms: mixed experiment, two measuring instruments example, conditionality principle, sufficiency principle, likelihood principle, statistical vs. substantive information, statistical misspecification, specification uncertainty.

∗ Published in Advances and Applications in Statistical Science, 8: 29-47, 2013. Thanks are due to Clark Glymour and Deborah Mayo for stimulating discussions on the topic of this paper.
1 Introduction

The ‘two measuring instruments’ example was introduced by Cox (1958) to illustrate the cogency of Fisher’s Conditionality Principle (CP) as it relates to the irrelevance of component experiments that could have been, but were not actually, performed. The CP is implemented by conditioning on ancillary statistics to define the relevant distribution of the sample. Historically this example has generated a number of different controversies that linger on to this day.¹ In addition to raising issues about the role of pre-data error probabilities and their long-run interpretation in the context of Neyman-Pearson (N-P) testing, it has been used by critics of the frequentist approach to question its underlying reasoning more generally. In particular, this example gave Birnbaum (1962) the idea for the notion of a mixed experiment, which constitutes a key premise in his derivation of the Likelihood Principle (LP); the latter results from combining the CP with the Sufficiency Principle (SP). It also features prominently among the examples used by Bayesians to highlight the allegedly fallacious nature of frequentist reasoning, due to its supposed insistence on taking into account all possible outcomes, irrespective of relevance; see Cornfield (1969), Berger and Wolpert (1988), DeGroot and Schervish (2002), Young and Smith (2005).

The main aim of this paper is to revisit this particular example, nowadays called the ‘mixed experiment’, in order to elucidate some of its veiled features with a view to demonstrating that the real source of the fallacious results is the problematic nature of the example itself, and not the fallaciousness of frequentist reasoning.
In particular, the mixed experiment constitutes an improper (misspecified) statistical model in the sense that it could not have generated the data in question: the data were generated by a process with a constant variance, but the model assumes a heterogeneous variance, a convex combination of two different variances. What causes confusion in these discussions is that in frequentist inference learning from data can only occur when the relevant error probabilities relate directly to an adequate description of the underlying mechanism: a statistically adequate model in the sense that it could have generated the particular data, i.e. its probabilistic assumptions are valid for the data in question; see Mayo and Spanos (2004). Hence, using the mixed experiment as a basis of frequentist inference will invariably lead to unreliable inferences because, by assumption, one of the two mechanisms assumed did not contribute to the generation of the data in question. As a result, the actual (true) error probabilities will be different from the nominal (assumed) ones. This discrepancy between actual and nominal error probabilities offers a very different explanation for the fallacious results discussed in the literature. It is argued that the Bayesian charges against frequentist reasoning based on this particular example are misplaced and misleading. Moreover, the application of the CP to address the seemingly fallacious results is both artificial and unnecessary when statistical adequacy is viewed as a precondition for securing the reliability of inference. In addition, the paper calls into question a key premise in Birnbaum’s derivation of the LP.

¹ It is interesting to note that Cox (2006) refers to this example as ‘notorious’.
In section 2, the probabilistic structure of the mixed experiment example is clarified as a prelude to the discussion in section 3, which focuses on the key issues and problems it raises when viewed as a possible mechanism that could have given rise to the data in question. The key problem is that the mixed experiment model is misspecified and the associated distribution of the sample is erroneous, giving rise to erroneous error probabilities and thus unreliable inferences. In section 4 the problematic nature of the mixed experiment is traced to the imposition of illegitimate (from the frequentist perspective) substantive information in the form of a prior distribution. This form of a priori information is shown to be incompatible with frequentist inference because it pertains to substantive adequacy, not statistical adequacy, which is handled very differently in the frequentist approach, as opposed to the Bayesian approach.

2 The ‘mixed experiment’ example

Cox (1958) described the ‘two measuring instruments’ example as follows:

“Suppose that we are interested in the mean μ of a normal population and that, by an objective randomization device, we draw either (i) with probability 1/2, one observation, x, from a normal population of mean μ and variance σ₁², or (ii) with probability 1/2, one observation x, from a normal population of mean μ and variance σ₂², where σ₁², σ₂² are known, σ₂² ≫ σ₁², and where we know in any particular instance which population has been sampled.” (p. 360)

To make the model more realistic, consider a sample of size n, i.e. data x₀ := (x₁, x₂, ..., xₙ) assumed to have arisen from one of two simple Normal models:

  M₁(x): Xₖ ∼ N(μ, σ₁²),  M₂(x): Xₖ ∼ N(μ, σ₂²),  k = 1, ..., n.   (1)

In addition, it is assumed that the following substantive information is available:
[i] the variances (σ₁², σ₂²) are known a priori, with σ₂² > σ₁²;
[ii] a priori each model could have given rise to data x₀ with probability 1/2.

The primary hypotheses of interest are:

  H₀: μ = μ₀ vs.
H₁: μ ≠ μ₀,   (2)

and the key question is ‘what is the relevant distribution of the sample upon which frequentist inferences should be based?’

The conventional wisdom pertaining to the mixed experiment example can be summarized as follows. In light of [ii] the joint density of (X, Z), where Z is a Bernoulli random variable:

  Z = 0 for M₁(x), with P(Z=0) = 1/2;  Z = 1 for M₂(x), with P(Z=1) = 1/2,

takes the form:

  f(x, z; μ) = f(x|z; μ)·f(z) = (1/2)[f(x; μ, σ₁²)]^(1−z) [f(x; μ, σ₂²)]^z,  z = 0, 1,
  = (1/2)·I(z=0)·f(x; μ, σ₁²) + (1/2)·I(z=1)·f(x; μ, σ₂²),   (3)

where f(x; μ, σ²) denotes the Normal density N(μ, σ²) and I(z=i) the indicator function for z = i. Hence, the relevant model for Neyman-Pearson (N-P) testing is the ‘mixed experiment’ model:

  M_φ(x) = (1/2)·I(z=0)·M₁(x) + (1/2)·I(z=1)·M₂(x),   (4)

whose distribution of the sample takes the form:

  f(x, z; μ) = (1/2) ∏ₖ₌₁ⁿ [f(xₖ; μ, σ₁²)]^(1−z) [f(xₖ; μ, σ₂²)]^z,   (5)

where Z := (Z₁, Z₂, ..., Zₙ). The statistics literature often uses a simplified version of the above distribution of the sample that stems from the marginal density f(x; μ), derived by summing out z:

  f(x; μ) = Σ_z f(x, z; μ) = f(x|z=0; μ)·P(z=0) + f(x|z=1; μ)·P(z=1)
  = (1/2) f(x; μ, σ₁²) + (1/2) f(x; μ, σ₂²).   (6)

Under certain simplifying assumptions concerning the equal contributions of the two sub-populations one can derive an approximate distribution of the sample (Lehmann, 1993):

  f(x; μ) = (1/2) ∏ₖ₌₁ⁿ f(xₖ; μ, σ₁²) + (1/2) ∏ₖ₌₁ⁿ f(xₖ; μ, σ₂²), for all x ∈ Rⁿ.   (7)

Hence, the N-P test for the hypotheses in (2) that suggests itself is:

  τ(X) = √n(X̄ − μ₀)/σ* ∼ N(0, 1) (under μ = μ₀),  C₁(α) = {x : |τ(x)| > c_α},   (8)

where σ*² = (σ₁² + σ₂²)/2, X̄ = (1/n)Σₖ₌₁ⁿ Xₖ, and C₁(α) denotes the α-level rejection region.

It turns out that when (7) is used as a basis of frequentist inference, the N-P test (8) gives rise to several fallacious results. Berger and Wolpert (1988), p. 7, argued:

“... we can phrase this example as a "mixed experiment" in which with probability 1/2 (independent of θ) either experiment 1 or experiment 2 (both pertaining to θ) will be performed. Should the analysis depend only on the experiment actually performed, or should the possibility of having done the other experiment be taken into account?
The obvious intuitive answer to the questions in the above example is that only the experiment actually performed should matter. But this is counter to the pre-experimental frequentist reasoning, which says that one should average over all possible outcomes (here, including the coin flip).”

Hence, the Bayesian explanation of why frequentist reasoning is fallacious in this example is that it insists on taking into account experiments that could have happened but didn’t, blaming the quantifier ‘for all (x, z) ∈ Rⁿ × {0, 1}’. This is usually contrasted with the Bayesian approach, which adheres to the Likelihood Principle, insisting that only the observed data x₀ are relevant for inference purposes; see Robert (2007).

In light of such criticisms several suggestions have been proposed in the literature on different ways to avoid them; see Lehmann (1993), Cox and Hinkley (1974). In particular, Cox (1958) used this example to argue in favour of using Fisher’s Conditionality Principle (CP), implemented by conditioning on Z = z, since Z is an ancillary statistic: P(Z=0) = P(Z=1) = 1/2 is independent of μ. This yields the conditional densities:

  f(x|z=0) = f(x; μ, σ₁²),  f(x|z=1) = f(x; μ, σ₂²),

suggesting that the relevant distribution of the sample is not (7) but:

  f(x|z; μ) = ∏ₖ₌₁ⁿ f(xₖ|z; μ), for all x ∈ Rⁿ.   (9)

Hence, the proper way to test (2) is based on two separate N-P tests, depending on Z = z:

  τᵢ(X) = √n(X̄ − μ₀)/σᵢ ∼ N(0, 1) (under μ = μ₀),  C₁ᵢ(α) = {x : |τᵢ(x)| > c_α},  i = 1, 2.   (10)

It is clear that the relevant error probabilities associated with the two N-P tests, (8) and (10), will be different!

3 Fallacious reasoning or an improper model?

The fact that, by assumption, the mixed experiment model M_φ(x) in (4) could not have generated data x₀ := (x₁, x₂, ..., xₙ) raises serious doubts about the above Bayesian charges.
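The claim that the error probabilities of (8) and (10) differ can be checked by simulation. The sketch below uses illustrative values (σ₁ = 1, σ₂ = 5, n = 10, α = .05; these particular numbers are assumptions, not computations from the paper): when the data actually come from a single instrument, the conditional test (10) attains its nominal 5% size, while the unconditional test (8), scaled by σ*, is far too conservative under M₁ and far too liberal under M₂.

```python
import math, random

def rejection_rate(test_sd, true_sd, mu0=0.0, n=10, c_alpha=1.96,
                   reps=20_000, seed=1):
    """Monte Carlo rejection frequency of the test |sqrt(n)(xbar - mu0)/test_sd| > c_alpha
    when the data are NIID(mu0, true_sd^2), i.e. the null mu = mu0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu0, true_sd) for _ in range(n)) / n
        rejections += abs(math.sqrt(n) * (xbar - mu0) / test_sd) > c_alpha
    return rejections / reps

sigma1, sigma2 = 1.0, 5.0                              # illustrative: variances 1 and 25
sigma_star = math.sqrt(0.5 * (sigma1**2 + sigma2**2))  # pooled sd used by test (8)

size_conditional = rejection_rate(test_sd=sigma1, true_sd=sigma1)     # test (10), z = 0
size_mixed = rejection_rate(test_sd=sigma_star, true_sd=sigma1)       # test (8), data from M1
size_mixed_m2 = rejection_rate(test_sd=sigma_star, true_sd=sigma2)    # test (8), data from M2
```

The conditional test's actual size is close to .05, while the actual size of (8) collapses toward zero under M₁ and rises well above .05 under M₂, which is exactly the nominal-versus-actual discrepancy discussed below.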
It should be obvious that one could not learn from data x₀ about the generating mechanism by using error probabilities from a model that could not have generated x₀. To shed further light on this issue let us return to the basics of the frequentist approach.

It is well known that the cornerstone of frequentist inference is provided by the distribution of the sample, say:

  f(x₁, x₂, ..., xₙ; θ) := f(x; θ), for all x ∈ Rⁿ,

defined as the joint distribution of the random variables making up the sample of the postulated statistical model:

  M_θ(x) = {f(x; θ), θ ∈ Θ}, x ∈ Rⁿ.

The quantifier ‘for all x ∈ Rⁿ’ refers to all possible outcomes associated with this particular statistical model. That is, the frequentist approach assumes that data x₀ were generated by a single member of M_θ(x): the ‘true’ data-generating mechanism M*(x) = {f(x; θ*)}, x ∈ Rⁿ, where θ* denotes the ‘true’ value of θ. The family of models denoted by M_θ(x) for θ ∈ Θ aims to facilitate learning from data about M*(x). Hence, the quantifier ‘for all possible outcomes’ (x ∈ Rⁿ) refers to all outcomes associated with M_θ(x), and not those of other experiments that did not contribute to generating x₀.

The crucial role of f(x; θ) stems from the fact that it defines the sampling distribution of all relevant statistics, such as estimators, tests or predictors. In particular, the sampling distribution of any statistic Y = g(X) is derived via:

  F(y; θ) := P(Y ≤ y; θ) = ∫···∫_{x: g(x)≤y, x∈Rⁿ} f(x; θ) dx.   (11)

In relation to this result it is important to emphasize that learning from data x₀ about M*(x) is achieved when the relevant error probabilities relate directly to a statistically adequate model, in the sense that its probabilistic assumptions are valid for data x₀; see Mayo and Spanos (2004). In this sense, frequentist inference presupposes a well-defined statistical model whose statistical adequacy has already been secured. Anything less would not guarantee the reliability of frequentist inferences.
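The derivation in (11) can be mimicked by simulation: draw many samples x from f(x; θ) and record the proportion with g(x) ≤ y. A minimal sketch for the statistic Y = X̄ under NIID(μ, σ²) data, where the exact answer X̄ ∼ N(μ, σ²/n) provides a check (the particular values of μ, σ, n, y below are illustrative assumptions):

```python
import math, random

def mc_cdf_xbar(y, mu, sigma, n, reps=20_000, seed=0):
    """Monte Carlo approximation of the sampling distribution (11): P(Xbar <= y)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        hits += xbar <= y
    return hits / reps

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n, y = 0.0, 1.0, 10, 0.2
approx = mc_cdf_xbar(y, mu, sigma, n)
exact = normal_cdf(math.sqrt(n) * (y - mu) / sigma)  # Xbar ~ N(mu, sigma^2/n)
```

The point of the check is that the simulated frequencies agree with (11) only because every replication is drawn from the postulated f(x; θ); had the draws come from a different mechanism, the agreement would fail, which is the misspecification problem at issue.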
Let us take a closer look at the above claim that N-P testing, based on the pre-data type I and II error probabilities, requires f(x; μ) in (7) to be the relevant distribution of the sample. This claim is clearly false because f(x; μ) will yield the wrong error probabilities. This stems from the fact that the ‘mixed experiment’ model M_φ(x) in (4), associated with (7), is misspecified: we know that the data x₀ could not have been generated by (4). By assumption, at least one of the two models (M₁(x) or M₂(x)) had nothing to do with generating data x₀. This has certain serious consequences:
[a] it gives rise to the wrong likelihood function,
[b] it yields erroneous sampling distributions, and thus
[c] it generates erroneous error probabilities.
That is, using (7) as the relevant f(x; θ) will give rise to the wrong sampling distribution F(y; θ), and as a result the nominal [assumed] and actual error probabilities will be different, giving rise to unreliable inferences. The surest way to lead an inference astray is to fix the significance level at .05 when the actual type I error, due to misspecification, is closer to .90; see Spanos and McGuirk (2001).

This suggests that the inevitable discrepancy between the actual and nominal error probabilities, due to the statistical misspecification of M_φ(x), can easily explain the fallacious results discussed in the literature. It explains the counterintuitive results based on the test in (8) as due to the incorrect error probabilities stemming from (7); see Berger and Wolpert (1988), Lehmann (1993).

Are there any circumstances under which the distribution of the sample in (7) is the relevant one? The answer is yes, in the case where both sub-models in (1) contributed to the generation of data x₀. Consider the experiment where before each observation a fair coin is tossed to decide which measuring instrument is to be used.
In this case M_φ(x) constitutes a proper substantive model from the frequentist perspective, and (7) is the relevant distribution of the sample to be used as a basis of frequentist inference. In this case no fallacious results will arise.

How did this misspecification come about? A first clue is given by the fact that assumption [ii] is equivalent to assuming a prior π(σ²) of the form:

  σ²:     σ₁²   σ₂²
  π(σ²):  1/2   1/2    (12)

A second clue is that when (12) is combined with (1) it yields a posterior that coincides with the distribution of the mixed experiment model M_φ(x) in (7), but re-interpreted as a function of the unknown parameter μ, not x. That is, the form of misspecification of M_φ(x) fits nicely into the Bayesian approach as a consequence of using a prior distribution to represent the substantive specification uncertainty pertaining to σ², indicating ignorance as to which of the two measuring devices has been used. Hence, M_φ(x) makes perfectly good sense in a Bayesian context where a uniform prior is attached to σ². More generally, the mixed experiment model M_φ(x) makes no sense in the context of the frequentist approach, simply because it could not have generated data x₀. The focus of frequentist inference is on the ‘true’ model M*(x), and whenever a prior is introduced as part of the available substantive information it will invariably give rise to the wrong distribution of the sample. Intuitively, what goes wrong is that the prior probabilities in (12) will invariably be false, since the actual distribution associated with M*(x) will always be degenerate, i.e. it will take the value 1 for θ = θ* and 0 for every other value of θ. The question that naturally arises is why frequentist statisticians have not rejected (4) as an illegitimate premise for discussing frequentist inference in the context of the mixed experiment model.
The answer is that shedding light on the inconspicuous features of the mixed experiment model calls for the elucidation of two fundamental foundational issues pertaining to frequentist inference: (a) what constitutes a legitimate statistical model, and (b) how substantive information like [i]-[ii] above should be handled. A key problem of frequentist modeling and inference relates to how the statistical and substantive information can be blended without compromising the credibility of either source of information. Disentangling the two sources of information has been a long-standing problem in frequentist statistics.

4 Substantive information and the reliability of inferences

Empirical models in most applied fields constitute a blend of statistical and substantive information, ranging from a solely data-driven formulation like an ARIMA(p,d,q) model, to an entirely theory-driven formulation like a structural (simultaneous) equation model; see Spanos (2006). The majority of empirical models lie someplace in between these two extremes, but the role of the two sources of information, and how they could be properly combined to learn from data, are issues that have not been clearly delineated; see Lehmann (1990), Cox (1990).

In relation to the blending of the statistical and substantive information, the imposition of the latter on the data at the outset, by estimating the structural model M_φ(x) directly, is invariably a bad strategy, because the presence of any statistical misspecification is likely to undermine the reliability of any inferences pertaining to substantive information. One way to avert this problem is to distinguish clearly between statistical and substantive information and adopt a purely probabilistic perspective on the notion of a statistical model M_θ(x), based exclusively on statistical information; the latter is reflected in the chance regularity patterns exhibited by data x₀; see Spanos (1999).
One can then relate it to the substantive information once the statistical adequacy of M_θ(x) has been secured. From this perspective, a statistical model M_θ(x) is viewed as a parameterization of a generic stochastic process {Xₖ, k ∈ N} assumed to underlie data x₀. The specification of the statistical model takes the form of selecting a probabilistic structure of {Xₖ, k ∈ N} with a view to: (i) rendering x₀ a ‘truly typical’ realization thereof, and (ii) choosing a parameterization of {Xₖ, k ∈ N} that enables one to pose the substantive questions of interest. In this sense, a statistical model comprises a set of probabilistic assumptions that define the statistical (inductive) premises of inference, built exclusively on statistical information. Statistical adequacy refers to the model assumptions being valid vis-à-vis data x₀. The particular parameterization of {Xₖ, k ∈ N} that defines the statistical model M_θ(x) is chosen with a view to potentially accommodating the substantive information by rendering the substantive model M_φ(x) a reparameterization/restriction thereof; see Spanos (2006).

In the case of the ‘two measuring instruments’ example, a truly typical realization of the data x₀ := (x₁, x₂, ..., xₙ), assumed to have been generated by one of the two simple Normal models in (1), would look like figure 1. Hence, the relevant statistical model is:

  M_θ(x): Xₖ ∼ NIID(μ, σ²),  θ := (μ, σ²) ∈ R×R₊,  k = 1, ..., n,   (13)

where both parameters (μ, σ²) are assumed unknown. The underlying distribution of the sample is:

  f(x; θ) = (1/(σ√(2π)))ⁿ exp{−(1/(2σ²)) Σₖ₌₁ⁿ (xₖ − μ)²},   (14)

which is very different from (7). Indeed, in the context of M_θ(x) (table 1) the optimal (Uniformly Most Powerful Unbiased; see Lehmann, 1986) N-P test for assessing the primary hypotheses of interest (2) is:

  τ*(X) = √n(X̄ − μ₀)/s ∼ St(n−1) (under μ = μ₀),  s² = (1/(n−1)) Σₖ₌₁ⁿ (Xₖ − X̄)²,  C₁(α) = {x : |τ*(x)| > c_α}.   (15)

It is easy to see that the relevant error probabilities associated with this t-test will be different from those of the N-P tests in (8) and (10).
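For a concrete illustration, τ*(X) and s² in (15) can be computed directly from their definitions; the data values below are hypothetical, and the resulting statistic would be compared against St(n−1) critical values:

```python
import math

def t_statistic(x, mu0):
    """tau*(X) in (15): studentized mean with s^2 = sum((x_k - xbar)^2)/(n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)

# Hypothetical data for illustration: xbar = 3, s^2 = 2.5,
# so tau* = sqrt(5) * 3 / sqrt(2.5) = 3 * sqrt(2).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
tau_star = t_statistic(x, mu0=0.0)
```

Unlike (8) and (10), this statistic uses no a priori value for the variance; s² is estimated from the data themselves, which is why its sampling distribution is Student's t rather than Normal.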
The main differences between the two distributions of the sample in (7) and (14) stem from the fact that the substantive information [i]-[ii] should be considered irrelevant for the specification of the statistical model. This is because such information does not pertain directly to the probabilistic structure of the process {Xₖ, k ∈ N} as reflected in the chance regularities exhibited by data x₀ in figure 1; it concerns the particular value of σ². Whether the ‘true’ substantive model behind M_θ(x) is M₁(x) or M₂(x) is irrelevant for selecting and validating the statistical premises, as given in table 1, because σ² is allowed to take any positive value.

Fig. 1: t-plot of NIID(0,1) data

Table 1 - Simple Normal Model
Statistical GM:        Xₖ = μ + uₖ, k ∈ N = {1, 2, ..., n}
[1] Normality:         Xₖ ∼ N(·, ·), xₖ ∈ R,
[2] Constant mean:     E(Xₖ) = μ,
[3] Constant variance: Var(Xₖ) = σ²,   for all k ∈ N.
[4] Independence:      {Xₖ, k ∈ N} is an independent process.

For securing statistical adequacy one needs a complete list of testable probabilistic assumptions, as in table 1. To assess the validity of assumptions [1]-[4] one needs to apply thorough Mis-Specification (M-S) testing; see Mayo and Spanos (2004). Hence, a statistical model M_θ(x) is deemed appropriate when its assumptions are valid for data x₀. It is important, however, to emphasize that statistical adequacy does not imply substantive adequacy, but it is a precondition for ensuring the reliability of any inferences pertaining to probing the validity of the substantive information. Because of its crucial importance, this strategy is elevated to a modeling/inference principle.

Statistical Adequacy Principle. The statistical adequacy of M_θ(x) is a precondition for ensuring the reliability of appraising substantive hypotheses or claims, including the substantive adequacy of M_φ(x).
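One simple M-S check of the Normality assumption [1] compares the sample kurtosis with the Normal value of 3. The sketch below uses simulated data (the variances 1 and 25 are the illustrative values used for the figures in this paper, but the simulation itself is this author's assumption): a per-observation mixture of N(0,1) and N(0,25) has population kurtosis (0.5·3·1 + 0.5·3·625)/(0.5·1 + 0.5·25)² = 939/169 ≈ 5.56, clearly leptokurtic.

```python
import random

def kurtosis(xs):
    """Sample kurtosis coefficient m4/m2^2; approximately 3 for Normal data."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2

rng = random.Random(0)
normal = [rng.gauss(0, 1) for _ in range(100_000)]
# Per-observation mixture: a fair coin picks N(0,1) or N(0,25) for each draw.
mixed = [rng.gauss(0, 1 if rng.random() < 0.5 else 5) for _ in range(100_000)]
k_normal, k_mixed = kurtosis(normal), kurtosis(mixed)
```

A kurtosis estimate far above 3 is the kind of departure from [1] that a simple M-S test would pick up, which is how the three models M₁(x), M₂(x) and M_φ(x) are distinguished on statistical adequacy grounds in section 4.2.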
The acceptance of this principle addresses the issue of what is the relevant distribution of the sample, and secures the reliability of inference for assessing the empirical validity of substantive claims of interest.

4.1 Appraising the prior information in [i]: σ₂² > σ₁²

To illustrate this, let us return to the statistical model in (13) and assume that its statistical adequacy has been secured. One can then proceed to assess the substantive information in [i] using N-P testing. Since the two models M₁(x) and M₂(x) are nested within [constitute reparameterizations/restrictions of] the statistical model M_θ(x), the validity of the substantive information in [i] can be easily assessed by testing which one, if either, of the two substantive models M₁(x) or M₂(x) the data indicate as the generating process. An appropriate way to frame this is in terms of the partitioning hypotheses:

  H₀: σ² ≤ σ₁² vs. H₁: σ² > σ₁²,   (16)

where the substantive discrepancy γ = σ₂² − σ₁² > 0 from the null is of particular interest; see Mayo and Spanos (2006). Equivalently, one can specify the hypotheses as H₀: σ² ≤ σ₂² vs. H₁: σ² > σ₂², with the same substantive discrepancy of interest γ = σ₂² − σ₁² > 0. For testing the hypotheses in (16), the test based on:

  v(X) = [Σₖ₌₁ⁿ (Xₖ − X̄)²]/σ₁² ∼ χ²(n−1) (under σ² = σ₁²),  C₁(α) = {x : v(x) > c_α},   (17)

is Uniformly Most Powerful (UMP); see Lehmann (1986).

This way of dealing with point hypotheses is in sharp contrast with the way Berger and Wolpert (1988), p. 7, deal with their example 3, based on testing the two simple hypotheses μ = −1 vs. μ = 1, claiming that their treatment stems from Neyman-Pearson (N-P) testing reasoning. N-P testing has its fair share of problems, but this is not one of them, because the two hypotheses, when properly defined, need to be specified in such a way that they constitute a partition of the relevant parameter space. Hence, what a proper N-P tester would not do is to specify the null and alternative hypotheses as:

  H₀: σ² = σ₁² vs. H₁: σ² = σ₂²,   (18)

and apply the Neyman-Pearson lemma.
This is because (18) does not constitute a partition of the statistically relevant parameter space. The fact of the matter is that although there might be only two values (σ₁², σ₂²) which are of interest from a substantive viewpoint, all possible values of the parameter σ² ∈ R₊ are of interest from the statistical perspective. In this sense this test will be used to assess the data-cogency of the substantive information in [i], and only impose it if validated. Note that all three models M₁(x), M₂(x) and M_φ(x) incorporate untested substantive information.

Assuming that one of the two models M₁(x) or M₂(x) is data-cogent, say M₁(x), one could then impose the substantive information σ² = σ₁² and proceed to test the primary hypothesis of interest (2) in the context of the latter. In certain cases imposing the substantive information makes sense because the inference based on M₁(x) can be shown to be more precise than the one based on M_θ(x), which ignores the information σ² = σ₁². In particular, the relevant test will not be (15) but:

  τ₁(X) = √n(X̄ − μ₀)/σ₁ ∼ N(0, 1) (under μ = μ₀),  C₁₁(α) = {x : |τ₁(x)| > c_α}.   (19)

4.2 The mixed experiment as a proper model

It is important to note that when the mixed experiment model M_φ(x) is a proper model from the frequentist perspective, i.e. both sub-models in (1) did contribute to the generation of x₀, it can be viewed as a special case of the statistical model M_θ(x) in (13). In this sense, the substantive information incorporated in M_φ(x) can be appraised by testing the hypotheses:

  H₀: σ² = (σ₁² + σ₂²)/2 vs. H₁: σ² ≠ (σ₁² + σ₂²)/2.

From the statistical adequacy perspective, one can use the t-plots of the data to distinguish between the three models M₁(x), M₂(x) and M_φ(x). It is easy to distinguish between M₁(x) and M₂(x) by simply realizing that a typical realization of the latter is likely to have much bigger variation around the mean. Figs.
1-2 show the t-plots of typical realizations of NIID(0, 1) and NIID(0, 25) processes; the ranges of their y-axis values are very different, (−3, 3) and (−15, 15), respectively.

Fig. 2: t-plot of NIID(0,25) data
Fig. 3: t-plot of ½N(0,1) + ½N(0,25) mixture data
Fig. 4: all three data series

Also notice that the typical realization of the mixed experiment, when it is implemented as envisaged, exhibits leptokurticity, as shown in fig. 3; this is often measured using the kurtosis coefficient:

  α₄ = E(X − E(X))⁴ / [E(X − E(X))²]².

The data in figs. 1-2 have an estimated kurtosis coefficient α̂₄ ≈ 3, but for the data in fig. 3, α̂₄ ≈ 6.5. That is, one can distinguish between M₁(x), M₂(x) and M_φ(x) on statistical adequacy grounds, because the typical realizations of the respective underlying processes are easily distinguishable by a simple M-S test for Normality; in the case of figure 4, eyeballing will be sufficient to reject Normality.

4.3 Dealing with the prior information [ii]

In contrast to [i], the a priori information in [ii], ‘each model could have given rise to data x₀ with probability 1/2’, is of a very different nature. It turns out that its imposition at the outset is inappropriate for frequentist inference. Any information in the form of a prior distribution for σ² that pertains to substantive specification uncertainty has no place in frequentist modeling and/or inference. In light of the fact that the primary aim of the frequentist approach is to learn from data x₀ about the ‘true’ underlying data-generating mechanism M*(x) = {f(x; θ*)}, x ∈ Rⁿ, whatever value θ* happens to be, any form of substantive uncertainty is addressed within the context of a statistically adequate M_θ(x) using N-P testing.
This is in sharp contrast with the Bayesian approach, whose primary inferential aim is to revise the initial ranking of all possible data-generating mechanisms [i.e. the different models M_θ(x), θ ∈ Θ], based on the prior distribution π(θ), into a revised ranking in light of the data x₀, based on the posterior distribution:

  π(θ | x₀) ∝ L(θ | x₀)·π(θ), θ ∈ Θ,   (20)

where L(θ | x₀) ∝ f(x₀ | θ) denotes the likelihood function. Moreover, beyond the inferential, the Bayesian approach often has another component, that of decision making. Which one of these models, if any, might be chosen will depend not only on the revised ranking stemming from π(θ | x₀), θ ∈ Θ, but also on the relevant loss (utility) function adopted by the modeler.

For frequentist modeling/inference it is the statistical specification uncertainty, the appropriate selection of the statistical model M_θ(x), that is of concern. It is addressed using thorough Mis-Specification (M-S) testing and respecification (Spanos, 2006). The crucial difference between the statistical and substantive specification uncertainties is that the former calls for testing beyond the boundaries of M_θ(x), but the latter can be easily dealt with by N-P testing within M_θ(x); see Spanos (1999). The frequentist perspective is, once again, in sharp contrast to the Bayesian one, which often ignores the statistical specification uncertainty altogether by treating the choice of the model (or the likelihood function) on a par with that of the prior distribution. According to Kadane (2011):

“... likelihoods are just as subjective as priors, and there is no reason to expect scientists to agree on them in the context of an applied problem.” (p. 445)

From the frequentist perspective likelihoods are defined by the probabilistic assumptions of a statistical model, like [1]-[4] in table 1.
Hence, there is nothing subjective or arbitrary about the choice of a statistical model M_θ(x), and the selection is not based on any agreement amongst scientists, since the validity of its probabilistic assumptions is independently testable vis-à-vis data x₀.

In summary, the framing of substantive specification uncertainty in the form of prior distributions on parameters is incompatible with frequentist modeling and inference. When such information is imposed at the outset, as in the case of the mixed experiment model, the result is always a misspecified statistical model, in the sense that it could not have given rise to the particular data x₀ because, by definition, it includes component models which had nothing to do with generating x₀.

In a certain sense this is the key difference between what constitutes legitimate substantive information in the frequentist vs. the Bayesian approaches. Bayesians often blur the distinction between a priori substantive subject-matter information and a prior distribution over the unknown parameters, π(θ), θ ∈ Θ, and proceed to admonish frequentists for ignoring substantive information. In most scientific fields a priori substantive information does not come in the form of a prior distribution, but as deterministic restrictions on the parameters, including the sign, magnitude and possible range of values. Such substantive information is easily accommodated in frequentist statistics because it is required that the relationship between the statistical (θ) and substantive (ϕ) parameters, say G(θ, ϕ) = 0, θ ∈ Θ, ϕ ∈ Φ, can be solved uniquely for the latter. This is known in econometrics as the identification condition; see Spanos (2006). On the other hand, replacing f(x; θ) in (11) with the product f(x; θ)·π(θ) will totally undermine the very notion of the relevant error probabilities. But that is exactly what the adoption of the mixed experiment model amounts to!

4.4 Whither the Likelihood Principle?
An important implication of the statistical misspecification associated with the mixed experiment model M_φ(x) = ½M₁(x) + ½M₂(x) is that a key premise of Birnbaum’s (1962) derivation of the Likelihood Principle (LP) is called into question. This is the premise which asserts, by invoking the Sufficiency Principle, that M_φ(x) provides the same evidence about μ as the model that actually gave rise to the data; see Berger and Wolpert (1988), p. 27. When evidence is evaluated in terms of the relevant error probabilities, this claim is patently false. The error probabilities for testing (2) stemming from the mixed model M_φ(x) will be very different from those associated with a statistically adequate M_θ(x), or even the M₁(x) that will ensue from Fisher’s CP. Indeed, Mayo (2010) goes even further by demonstrating that Birnbaum’s proof of the Strong LP is erroneous.

5 Conclusion

The two measuring instruments (mixed experiment) example raises a number of different issues that boil down to: (a) what constitutes a proper statistical model, and (b) how to handle the a priori information:
[i] the variances (σ₁², σ₂²) are known a priori, with σ₂² > σ₁²;
[ii] a priori each model [with σ₁² or σ₂²] could have given rise to data x₀ with probability 1/2.

It was argued above that the traditional approach of imposing all substantive information at the outset often gives rise to unreliable inferences. A way to avoid that is to distinguish between statistical and substantive information, ab initio, and adopt a purely probabilistic construal of the notion of a statistical model M_θ(x). Once the statistical adequacy of M_θ(x) is established, one can then proceed to test the validity of the substantive information of interest before imposing it. From this frequentist perspective the validity of the substantive information in [i] can be tested using the UMP test in (17) after the statistical adequacy of M_θ(x) (table 1) has been secured.
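For concreteness, the UMP test in (17) is straightforward to compute. The sketch below uses simulated data from the high-variance instrument (the values σ₁² = 1, σ₂² = 25, n = 30 are illustrative assumptions, not the paper's computations) and approximates the χ²(n−1) critical value by Monte Carlo to stay self-contained:

```python
import random

def v_statistic(x, var1):
    """v(X) in (17): sum((x_k - xbar)^2)/sigma1^2, ~ chi^2(n-1) under sigma^2 = sigma1^2."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x) / var1

def chi2_critical(df, alpha=0.05, reps=50_000, seed=2):
    """Monte Carlo upper-alpha quantile of chi^2(df): sums of df squared N(0,1) draws."""
    rng = random.Random(seed)
    draws = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(reps))
    return draws[int((1 - alpha) * reps)]

# Hypothetical data generated by the high-variance instrument (sigma = 5):
rng = random.Random(3)
x = [rng.gauss(0, 5) for _ in range(30)]
v = v_statistic(x, var1=1.0)           # testing H0: sigma^2 <= sigma1^2 = 1
c_alpha = chi2_critical(df=len(x) - 1)
```

With data this far from σ₁² = 1, v(x) lands deep in the rejection region, so the test correctly rules out imposing σ² = σ₁²; the same machinery applied to low-variance data would license that restriction instead.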
On the other hand, the substantive information [ii], which amounts to assuming a prior distribution π(σ²), is impertinent from the frequentist perspective because it will give rise to an improper (misspecified) model. Hence, the 'mixed experiment' model M(x), although legitimate from the Bayesian perspective, is improper from the frequentist perspective because it could not have generated data x0. That is, one does not learn from data by deriving the relevant error probabilities from a misspecified model. This explains the fallacious results discussed in the literature as stemming from the inevitable discrepancy between actual and nominal error probabilities when (7) is used as the relevant distribution of the sample.

References

[1] Berger, J. O. and R. W. Wolpert (1988), The Likelihood Principle, 2nd edition, Lecture Notes-Monograph Series, vol. 6, Institute of Mathematical Statistics, Hayward, CA.
[2] Birnbaum, A. (1962), "On the Foundations of Statistical Inference," Journal of the American Statistical Association, 57: 269-306.
[3] Cornfield, J. (1969), "Bayesian Outlook and its Applications," Biometrics, 25: 617-657.
[4] Cox, D. R. (1958), "Some Problems Connected with Statistical Inference," The Annals of Mathematical Statistics, 29: 357-372.
[5] Cox, D. R. (1990), "Role of Models in Statistical Analysis," Statistical Science, 5: 169-174.
[6] Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press, Cambridge.
[7] Cox, D. R. and D. V. Hinkley (1974), Theoretical Statistics, Chapman & Hall, London.
[8] DeGroot, M. H. and M. J. Schervish (2002), Probability and Statistics, 3rd ed., Addison-Wesley, NY.
[9] Kadane, J. B. (2011), Principles of Uncertainty, Chapman & Hall, NY.
[10] Lehmann, E. L. (1986), Testing Statistical Hypotheses, 2nd edition, Wiley, NY.
[11] Lehmann, E. L. (1990), "Model Specification: the Views of Fisher and Neyman, and Later Developments," Statistical Science, 5: 160-168.
[12] Lehmann, E. L. (1993), "The Fisher and Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?," Journal of the American Statistical Association, 88: 1242-1249.
[13] Mayo, D. G. (2010), "An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle," pp. 305-314 in Mayo and Spanos (2010).
[14] Mayo, D. G. and A. Spanos (2004), "Methodology in Practice: Statistical Misspecification Testing," Philosophy of Science, 71: 1007-1025.
[15] Mayo, D. G. and A. Spanos (2006), "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction," British Journal for the Philosophy of Science, 57: 323-357.
[16] Mayo, D. G. and A. Spanos (eds.) (2010), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge University Press, Cambridge.
[17] Pratt, J. W. (1961), "Book Review: Testing Statistical Hypotheses, by E. L. Lehmann," Journal of the American Statistical Association, 56: 163-167.
[18] Mayo, D. G. and A. Spanos (2011), "Error Statistics," pp. 153-198 in Philosophy of Statistics, Handbook of Philosophy of Science, edited by D. Gabbay, P. Thagard, and J. Woods, Elsevier.
[19] Robert, C. (2007), The Bayesian Choice, 2nd ed., Springer, NY.
[20] Spanos, A. (1999), Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge.
[21] Spanos, A. (2006), "Where Do Statistical Models Come From? Revisiting the Problem of Specification," pp. 98-119 in Optimality: The Second Erich L. Lehmann Symposium, edited by J. Rojo, Lecture Notes-Monograph Series, vol. 49, Institute of Mathematical Statistics.
[22] Spanos, A. (2011), "Revisiting the Welch Uniform Model: A Case for Conditional Inference?," Advances and Applications in Statistical Sciences, 5: 33-52.
[23] Spanos, A. (2013), "Who Should Be Afraid of the Jeffreys-Lindley Paradox?," Philosophy of Science, 80: 73-93.
[24] Spanos, A. and A. McGuirk (2001), "The Model Specification Problem from a Probabilistic Reduction Perspective," American Journal of Agricultural Economics, 83: 1168-1176.
[25] Young, G. A. and R. L. Smith (2005), Essentials of Statistical Inference, Cambridge University Press, Cambridge.