On using Sufficient Statistics in (Power) Expected
Posterior Prior for Bayesian Model Comparison
Dimitris Fouskakis,
Department of Mathematics, School of Applied Mathematical and Physical Sciences, National
Technical University of Athens, Athens, Greece; e-mail: [email protected].
Joint work with:
Ioannis Ntzoufras
Department of Statistics
Athens University of Economics and Business
Athens, Greece; e-mail: [email protected]
&
Luis Pericchi
Department of Mathematics
University of Puerto Rico
San Juan, PR, USA; e-mail: [email protected]
Presentation is available at:
www.math.ntua.gr/~fouskakis/Presentations/Cancun/Presentation_Cancun.pdf
Twelfth World Meeting of ISBA, Cancun, Mexico, July 2014
Synopsis
1. Bayesian Model Comparison
2. Expected-Posterior Prior (EPP)
3. Power-Expected-Posterior Prior (PEP)
4. EPP & PEP using Sufficient Statistics
5. Example 1: Normal Mean Hypothesis Testing
6. Example 2: Normal Linear Regression
7. An Alternative Approach
8. Discussion
Bayesian Model Comparison
Within the Bayesian framework the comparison between models M0 and M1 is evaluated via the Posterior Odds (PO)

\[ PO_{M_0,M_1} \equiv \frac{f(M_0 \mid y)}{f(M_1 \mid y)} = \frac{f(y \mid M_0)}{f(y \mid M_1)} \times \frac{\pi(M_0)}{\pi(M_1)} = BF_{M_0,M_1} \times O_{M_0,M_1}, \quad (1) \]

which is a function of the Bayes Factor BF_{M0,M1} and the Prior Odds O_{M0,M1}.
In the above, f(y|M) is the marginal likelihood under model M and π(M) is the prior probability of model M. The marginal likelihood is given by

\[ f(y \mid M) = \int f(y \mid \theta, M)\, \pi(\theta \mid M)\, d\theta, \quad (2) \]

where f(y|θ, M) is the likelihood under model M with parameters θ and π(θ|M) is the prior distribution of the model parameters given model M.
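As an illustration, a minimal numerical sketch of (1)-(2) for a hypothetical normal-mean comparison; the toy data, the prior Normal(0, 1) on µ under M1, and the known unit variance are all assumptions for illustration, not taken from the talk.

```python
# Sketch of (2): marginal likelihoods by quadrature, giving the Bayes factor in (1).
# Hypothetical setup: y_i ~ Normal(mu, 1); M0: mu = 0 vs. M1: mu ~ Normal(0, 1) a priori.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

y = np.array([0.3, -0.1, 0.8, 0.4, 0.2])  # toy data (assumed)

def lik(mu):
    return np.prod(norm.pdf(y, loc=mu, scale=1.0))

# f(y|M0): no free parameters, so the "integral" is the likelihood at mu = 0
m0 = lik(0.0)

# f(y|M1) = integral of f(y|mu, M1) * pi(mu|M1) dmu, computed numerically
m1, _ = quad(lambda mu: lik(mu) * norm.pdf(mu, 0.0, 1.0), -10, 10)

print("BF_{01} =", m0 / m1)  # posterior odds = BF x prior odds, as in (1)
```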
Expected-Posterior Priors (EPP)
Pérez & Berger (2002, Biometrika) developed priors for use in model comparison through the device of imaginary training samples. They defined the expected-posterior prior (EPP) as the posterior distribution of the parameter vector of the model under consideration, averaged over all possible imaginary samples y* = (y*_1, ..., y*_{n*})^T coming from a "suitable" predictive distribution m*(y*). Hence the EPP for the parameters of any model Mℓ is

\[ \pi_\ell^{EPP}(\theta_\ell) = \int \pi_\ell^N(\theta_\ell \mid y^*)\, m^*(y^*)\, dy^*, \quad (3) \]

where πℓN(θℓ|y*) is the posterior of θℓ under model Mℓ using a baseline prior πℓN(θℓ) and data y*.
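A minimal Monte Carlo sketch of (3) for a hypothetical normal-mean example with known variance 1; the baseline prior π^N(µ) ∝ 1, the choice of m* as the prior predictive of the null model M0: µ = 0, and all numerical values are assumptions for illustration only.

```python
# Monte Carlo approximation of the EPP in (3): average the baseline posteriors
# pi^N(mu | y*) = Normal(ybar*, 1/n*) over imaginary samples y* ~ m*(y*),
# with m* taken here (an assumption) as i.i.d. Normal(0, 1) draws under M0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_star, S = 5, 20_000  # minimal training sample size and number of MC draws

ystar = rng.normal(0.0, 1.0, size=(S, n_star))  # imaginary samples from m*
ybar = ystar.mean(axis=1)

def epp_density(mu):
    # EPP(mu) ~ (1/S) sum_s pi^N(mu | y*_s)
    return norm.pdf(mu, loc=ybar, scale=1.0 / np.sqrt(n_star)).mean()

print(epp_density(0.0), epp_density(1.0))  # the EPP is centred at the null
```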
Features of EPP
• The EPP naturally arises as the posterior distribution averaged over all possible samples coming from a predictive measure, which is usually the prior predictive of a baseline model. In nested cases the baseline model is usually the simplest model; in this case the EPP coincides with the Intrinsic Prior.
• EPP is a method to make priors compatible across models, through their
dependence on a common marginal data distribution.
• One of the advantages of using EPPs is that impropriety of baseline priors
causes no indeterminacy. Impropriety in m∗ also does not cause indeterminacy,
because m∗ is common to the EPPs for all models.
• Usually we choose the smallest n∗ for which the posterior is proper; this is the
minimal training sample size.
• Main issue: in variable selection problems, the specification of the imaginary design matrices X*ℓ.
Power-Expected-Posterior (PEP) Priors
The PEP prior is obtained from the EPP by replacing each likelihood term with its density-normalized power version:

\[ \pi_\ell^{EPP}(\theta_\ell) = \int \pi_\ell^N(\theta_\ell \mid y^*)\, m^*(y^*)\, dy^* \;\Longrightarrow\; \pi_\ell^{PEP}(\theta_\ell; \delta) = \int \pi_\ell^N(\theta_\ell \mid y^*, \delta)\, m^*(y^*; \delta)\, dy^*. \]

That is, we substitute the likelihood terms with density-normalized power likelihoods (as in Fouskakis, Ntzoufras & Draper, 2014):

\[ f(y^* \mid \beta_\ell, \sigma^2, M_\ell; X^*_\ell, \delta) \propto f(y^* \mid \beta_\ell, \sigma^2, M_\ell; X^*_\ell)^{1/\delta} = f_{N_{n^*}}(y^*; X^*_\ell \beta_\ell, \delta \sigma^2 I_{n^*}). \]
We can set δ = n* and n* = n, and therefore X*ℓ = Xℓ; in this way we also dispense with the selection of the training samples (see the sketch below).
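A quick numerical check (with hypothetical values of m, s and δ) that the density-normalized power likelihood of a normal is again normal, with the variance inflated by the factor δ, as in the display above.

```python
# Raising a Normal(m, s^2) density to the power 1/delta and renormalizing in y
# yields a Normal(m, delta * s^2) density -- the key identity behind the PEP.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m, s, delta = 1.0, 2.0, 5.0  # assumed values for illustration

kernel = lambda y: norm.pdf(y, m, s) ** (1.0 / delta)
Z, _ = quad(kernel, -60, 60)  # normalizing constant over y

for y in [0.0, 1.0, 3.0]:
    lhs = kernel(y) / Z                            # normalized power likelihood
    rhs = norm.pdf(y, m, np.sqrt(delta) * s)       # Normal(m, delta * s^2)
    print(y, lhs, rhs)                             # the two columns agree
```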
Features of PEP
The PEP prior methodology amalgamates ideas from Intrinsic Priors, EPPs, Unit Information Priors and Power Priors, to unify ideas of non-data objective priors.

Problems that PEP aims to solve (while keeping the advantages of Intrinsic Priors and EPPs):

• Dependence on the training sample size.
• Lack of robustness with respect to sample irregularities.
• Excessive weight of the prior when the number of parameters is close to the number of observations.

PEP solves these while remaining a fully objective method.

• Choose δ = n and X*ℓ = Xℓ.
Sensitivity analysis on imaginary sample size
[Figure 1: Posterior marginal inclusion probabilities, for n* values from 17 to n = 50, with the PEP prior methodology (simulated example of a variable selection problem in a normal linear model). Curves are shown for covariates X1, X5, X6, X7, X10, X11 and X13; x-axis: imaginary/training data sample size (n*); y-axis: posterior inclusion probabilities.]
Sensitivity analysis on imaginary sample size (cont.)
[Figure 2: Boxplots of the posterior distributions of the regression coefficients β1, ..., β15. For each coefficient, the left-hand boxplot summarizes the EPP results and the right-hand boxplot displays the PEP posteriors; solid lines in both posteriors identify the MLEs. We used the first 20 observations from the simulated data-set and a randomly selected training sample of size n* = 17.]
Features of PEP (cont.)
• In Fouskakis, Ntzoufras & Draper, 2014 (Bayesian Analysis, forthcoming) we illustrated the robustness of PEP with respect to the training sample size, and also that PEP is not overly informative when the number of parameters is close to the number of observations.
• However, the Intrinsic priors implied by the PEP method and, more generally, the theoretical properties of PEP were until now unexplored.
Features of PEP (cont.)
In this talk we:

1. Derive the PEP priors for important models, including General Linear Gaussian Models.
   – For normal models the PEP prior can be expressed as a mixture of g-priors.
   – The PEP priors have attractive characteristics, such as being centered around the null model (Savage's continuity condition) and not concentrating (point masses) around the null as the training sample size is allowed to grow.
Features of PEP (cont.)
2. Prove the equivalence between the EPP & PEP based on individual training samples and those based on sufficient statistics.
   – Practical advantage: sufficient statistics are far cheaper to generate than large training samples.
   – Tentative research idea: use different versions of the sampling distribution of the sufficient statistics to construct the EPP & PEP priors.
Features of PEP (cont.)
3. Prove that PEP induces a model selection method that:
   i) for fixed p and fixed n, is free of information inconsistency (this is NOT the case for g-priors);
   ii) for fixed p and growing n, is consistent.
EPP & PEP using Sufficient Statistics
• The idea is to express the EPP and PEP priors as an average over all possible sets of sufficient statistics based on imaginary data coming from the baseline (simplest) model, instead of over all possible sets of imaginary data.
• This may result in a great reduction of the problem's dimensionality. This can be beneficial especially in PEP, where the dimension is n, and also in cases where the prior and the posterior are not available in closed form.
• Assumptions:
  – n > dℓ for all ℓ.
  – M0 ⊂ Mℓ for all ℓ. Hence the minimal sufficient statistic of each model under consideration is always a sufficient statistic of M0.
For the Regular Exponential Family
Suppose that the likelihood (under model Mℓ) has the following form:

\[ f(y^* \mid \theta_\ell, M_\ell) = h_\ell(y^*_1) \cdots h_\ell(y^*_\delta)\, c_\ell[\theta_\ell]^{\delta} \exp\left( \sum_{i=1}^{k_\ell} w_{\ell i}(\theta_\ell) \sum_{j=1}^{\delta} t_{\ell i}(y^*_j) \right). \]

Then T*ℓ = (T*_{ℓ1}, ..., T*_{ℓk_ℓ})^T, with T*_{ℓi} = T_{ℓi}(y*) = Σ_{j=1}^{δ} t_{ℓi}(y*_j), i = 1, ..., k_ℓ, is a sufficient statistic (of dimension k_ℓ) for θℓ under model Mℓ. If the set {(w_{ℓ1}(θℓ), ..., w_{ℓk_ℓ}(θℓ)) : θℓ}, for all ℓ, contains an open set in R^{k_ℓ}, the distribution of T*ℓ has the form

\[ f_{T^*_\ell}(u^*_1, \dots, u^*_{k_\ell} \mid \theta_\ell, M^*) = H_\ell(u^*_1, \dots, u^*_{k_\ell})\, c_\ell[\theta_\ell]^{\delta} \exp\left( \sum_{i=1}^{k_\ell} w_{\ell i}(\theta_\ell)\, u^*_i \right). \quad (4) \]

In the above we assume that the imaginary data have been generated from a model M* (that is why in the formula we condition on M*).
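A small sketch of this factorization for the normal case; the choice Normal(1, 2²), δ = 8 and the seed are assumptions for illustration. With t₁(y) = y and t₂(y) = y², the likelihood depends on the imaginary data only through T* = (Σ y*_j, Σ (y*_j)²).

```python
# Exponential-family form of the normal likelihood: the log-likelihood equals
# delta * log c(theta) + w(theta) . T*, so T* = (sum y, sum y^2) is sufficient.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
ystar = rng.normal(1.0, 2.0, size=8)  # imaginary sample, delta = 8 (assumed)

T = np.array([ystar.sum(), (ystar ** 2).sum()])  # sufficient statistic

def loglik(mu, sigma, T, delta):
    # natural parameters w and log normalizing constant log c, per observation
    w = np.array([mu / sigma**2, -0.5 / sigma**2])
    logc = -0.5 * mu**2 / sigma**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return delta * logc + w @ T  # depends on the data only through T

# agrees with the log-likelihood computed from the full sample
print(loglik(1.0, 2.0, T, len(ystar)), norm.logpdf(ystar, 1.0, 2.0).sum())
```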
Details (cont.)
\[ \pi_\ell^{EPPSS}(\theta_\ell) = \int \pi_\ell^N(\theta_\ell \mid T^*_\ell)\, G_\ell(T^*_\ell \mid M^*)\, dT^*_\ell \]
\[ \pi_\ell^{PEPSS}(\theta_\ell \mid \delta) = \int \pi_\ell^N(\theta_\ell \mid T^*_\ell, \delta)\, G_\ell(T^*_\ell \mid M^*, \delta)\, dT^*_\ell \]

With the above definition we still have π0^{EPPSS}(θ0) = π0^{PEPSS}(θ0|δ) = π0^N(θ0).

Specification of the hyper-prior Gℓ:

\[ G_\ell(T^*_\ell \mid M^* = M_0) = \int g_\ell(T^*_\ell \mid \theta_0, M_0)\, \pi_0^N(\theta_0)\, d\theta_0. \]

We take gℓ to be the actual sampling distribution of T*ℓ for model Mℓ, given that the imaginary data have been generated from model M0 (i.e. M* = M0), i.e.

\[ g_\ell(u^*_1, \dots, u^*_{k_\ell} \mid \theta_0, M_0) = f_{T^*_\ell}(u^*_1, \dots, u^*_{k_\ell} \mid \theta_0, M_0). \]
Theorem (omitting details)
Let Y* = (Y*_1, ..., Y*_δ)^T be independent and identically distributed random variables. Let us further consider T*ℓ = (T_{ℓ1}(Y*), ..., T_{ℓk_ℓ}(Y*))^T to be a (minimal) sufficient statistic of dimension k_ℓ < n for the parameter vector θℓ under model Mℓ. We assume that the reference model M0 is nested in Mℓ, and therefore T*ℓ is also a sufficient statistic for θ0 under model M0. If g0(T*ℓ|θ0) is the sampling distribution of T*ℓ under the reference model M0, then

\[ \pi_\ell^{EPPSS}(\theta_\ell) = \int \pi_\ell^N(\theta_\ell \mid T^*_\ell)\, m_0(T^*_\ell)\, dT^*_\ell = \int \pi_\ell^N(\theta_\ell \mid T^*_\ell) \int g_0(T^*_\ell \mid \theta_0)\, \pi_0^N(\theta_0)\, d\theta_0\, dT^*_\ell = \int \pi_\ell^N(\theta_\ell \mid Y^*)\, m_0(Y^*)\, dY^* = \pi_\ell^{EPP}(\theta_\ell). \]

Remark: The proof of the above Theorem is general and holds for any likelihood function; therefore it holds for PEP as well.
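A Monte Carlo illustration of the Theorem for a hypothetical known-variance normal-mean example (baseline π^N(µ) ∝ 1, reference model M0: µ = 0; all numerical choices are assumptions): the EPP built from full imaginary samples and the EPP built from the sufficient statistic coincide.

```python
# Two routes to the same EPP: averaging pi^N(mu | y*) over full imaginary
# samples y* ~ M0, versus over the sufficient statistic T* = ybar* ~ g0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
delta, S = 6, 50_000

# route 1: y* ~ M0 i.i.d. Normal(0,1); posterior pi^N(mu|y*) = N(ybar*, 1/delta)
ybar_full = rng.normal(0.0, 1.0, size=(S, delta)).mean(axis=1)

# route 2: draw the sufficient statistic directly, T* ~ g0 = N(0, 1/delta)
T_ss = rng.normal(0.0, 1.0 / np.sqrt(delta), size=S)

for mu in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    epp_full = norm.pdf(mu, ybar_full, 1.0 / np.sqrt(delta)).mean()
    epp_ss = norm.pdf(mu, T_ss, 1.0 / np.sqrt(delta)).mean()
    print(f"{mu:5.1f}  {epp_full:.4f}  {epp_ss:.4f}")  # the two columns agree
```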
Example 1: Normal Mean Hypothesis Testing
Let y = (y1, ..., yn)^T be a random sample from Normal(µ, σ²). We would like to test the hypothesis H0: µ = 0 versus H1: µ ≠ 0. Hence the two competing models are M0: Normal(0, σ²) versus M1: Normal(µ, σ²) (µ ≠ 0). The baseline (reference) priors under the two models are given by

\[ \pi_0^N(\sigma) \propto \sigma^{-1} \quad \text{and} \quad \pi_1^N(\mu, \sigma) \propto \sigma^{-2}. \]

Let y* = (y*_1, ..., y*_δ)^T be a training (imaginary) sample of size δ, δ ≤ n. It is well known that the sufficient statistic for (µ, σ) under M1 is

\[ T = (T_1, T_2) = \left( \bar{Y}^*, \; \sum_{i=1}^{\delta} (Y^*_i - \bar{Y}^*)^2 \right). \]
For the following analysis we use M0 as the reference model.
Example 1: EPP using sufficient statistics
The distribution of T is

\[ T_1 \sim \text{Normal}\left(\mu, \frac{\sigma^2}{\delta}\right) \quad \text{and} \quad T_2 \sim \text{Gamma}\left(\frac{\delta-1}{2}, \frac{1}{2\sigma^2}\right), \quad (5) \]

and T1, T2 are independent.

The EPP using the distribution of the sufficient statistics is given by

\[ \pi_1^{EPPSS}(\mu, \sigma) \overset{\text{IPrior Eq.}}{=} \pi_1^N(\mu, \sigma)\, E_{T^*_1, T^*_2 \mid \mu, \sigma}\left[ \frac{m_0^N(T^*_1, T^*_2)}{m_1^N(T^*_1, T^*_2)} \right] = \sigma^{-1} \int_0^1 f_N\left(\mu; 0, \frac{\sigma^2}{\delta t}\right) f_B\left(t; \frac{\delta-1}{2}, \frac{\delta}{2}\right) dt. \]
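A minimal sketch evaluating the integral above, which is exactly the conditional EPP of µ given σ (a Beta mixture of normals), by one-dimensional quadrature; the values of σ and δ are assumptions for illustration.

```python
# Conditional EPP of mu given sigma: a Beta((delta-1)/2, delta/2) mixture of
# Normal(0, sigma^2/(delta*t)) densities, computed by 1-D quadrature over t.
import numpy as np
from scipy.stats import norm, beta
from scipy.integrate import quad

def epp_ss(mu, sigma, delta):
    f = lambda t: norm.pdf(mu, 0.0, sigma / np.sqrt(delta * t)) \
                  * beta.pdf(t, (delta - 1) / 2, delta / 2)
    val, _ = quad(f, 0.0, 1.0)
    return val

# heavier tails than the plain Normal(0, sigma^2/delta) baseline posterior
print(epp_ss(0.0, 1.0, 10), epp_ss(1.0, 1.0, 10))
```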
Example 1: PEP using sufficient statistics
Under the PEP approach we consider the normalized version of the power likelihood, resulting in

\[ f(y^* \mid \mu, \sigma, \delta) = f_N(y^*; \mu \mathbf{1}_\delta, \delta \sigma^2 I_\delta). \]

The distribution of T now is

\[ T_1 \sim \text{Normal}(\mu, \sigma^2) \quad \text{and} \quad T_2 \sim \text{Gamma}\left(\frac{\delta-1}{2}, \frac{1}{2\delta\sigma^2}\right), \quad (6) \]

and T1, T2 are independent.
Then, the PEP using the distribution of the sufficient statistics is given by

\[ \pi_1^{PEPSS}(\mu, \sigma) \overset{\text{IPrior Eq.}}{=} \pi_1^N(\mu, \sigma)\, E_{T^*_1, T^*_2 \mid \mu, \sigma}\left[ \frac{m_0^N(T^*_1, T^*_2)}{m_1^N(T^*_1, T^*_2)} \right] = \sigma^{-1} \int_0^1 f_N\left(\mu; 0, \frac{\sigma^2}{t}\right) f_B\left(t; \frac{\delta-1}{2}, \frac{\delta}{2}\right) dt. \]
Example 1: Comparison
Method | π1(µ, σ)                                           | E[µ|σ] | V[µ|σ]
EPP    | σ⁻¹ ∫₀¹ fN(µ; 0, σ²/(δt)) fB(t; (δ−1)/2, δ/2) dt   | 0      | (σ²/δ)(2δ−3)/(δ−3) → 0 as δ → ∞
PEP    | σ⁻¹ ∫₀¹ fN(µ; 0, σ²/t) fB(t; (δ−1)/2, δ/2) dt      | 0      | σ²(2δ−3)/(δ−3) → 2σ² as δ → ∞
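A quick Monte Carlo check of the V[µ|σ] column (with hypothetical σ and δ): since the mixture components are mean-zero normals, V[µ|σ] is E over t of the component variance, and E[1/t] = (2δ−3)/(δ−3) for t ~ Beta((δ−1)/2, δ/2).

```python
# MC verification of the closed-form prior variances in the comparison table.
import numpy as np

rng = np.random.default_rng(4)
sigma, delta, S = 1.5, 20, 1_000_000  # assumed values for illustration

t = rng.beta((delta - 1) / 2, delta / 2, size=S)
v_epp = (sigma**2 / delta) * (1 / t).mean()  # EPP: E[sigma^2/(delta*t)]
v_pep = sigma**2 * (1 / t).mean()            # PEP: E[sigma^2/t]

closed = (2 * delta - 3) / (delta - 3)
print(v_epp, sigma**2 / delta * closed)  # agree; vanishes as delta grows
print(v_pep, sigma**2 * closed)          # agree; tends to 2*sigma^2
```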
Example 2: Normal Linear Regression
Let y = (y1, ..., yn)^T be a random sample. We would like to compare the models

\[ M_0: \text{Normal}(y \mid X_0 \beta_0, \sigma_0^2), \quad \pi_0^N(\beta_0, \sigma_0) \propto \sigma_0^{-(1+k_0)} \]
versus
\[ M_1: \text{Normal}(y \mid X_1 \beta_1, \sigma_1^2), \quad \pi_1^N(\beta_1, \sigma_1) \propto \sigma_1^{-(1+k_1)}, \]

where X0 is an (n × k0) matrix, X1 is an (n × k1) matrix, k0 < k1 and M0 is nested in M1.

Let y* = (y*_1, ..., y*_δ)^T be a training (imaginary) sample of size δ, δ ≤ n, and let X*0 and X*1 denote the corresponding design matrices.

It is well known that the sufficient statistic for (β1, σ1) under M1 is

\[ T_1 = \left( \hat{\beta}_1, R_1^2 \right), \]

the maximum likelihood estimate and the residual sum of squares.
For the following analysis we use M0 as the reference model.
Example 2: EPP & PEP using sufficient statistics
Method | Prior
EPP    | σ1^−(k0+1) ∫₀¹ fN(βe; 0, (σ1²/t) V) fB(t; (δ+k0−k1)/2, (δ−k0)/2) dt
PEP    | σ1^−(k0+1) ∫₀¹ fN(βe; 0, (δσ1²/t) V) fB(t; (δ+k0−k1)/2, (δ−k0)/2) dt

In the above we have β1 = (β0ᵀ, βeᵀ)ᵀ,

\[ V^{-1} = X_e^{*T}\left(I_\delta - X_0^{*}(X_0^{*T}X_0^{*})^{-1}X_0^{*T}\right)X_e^{*} \quad \text{and} \quad X_1^{*} = [X_0^{*} \mid X_e^{*}]. \]
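A small sketch computing V from its definition above, for hypothetical imaginary design matrices (an intercept-only X0* plus two extra covariates; the dimensions and random covariates are assumptions for illustration).

```python
# V^{-1} = Xe*' (I - X0*(X0*'X0*)^{-1} X0*') Xe*: the prior scale matrix
# for beta_e in the EPP/PEP normal mixtures above.
import numpy as np

rng = np.random.default_rng(5)
delta, k0, ke = 15, 1, 2

X0 = np.ones((delta, k0))            # X0*: intercept column (assumed)
Xe = rng.normal(size=(delta, ke))    # Xe*: extra covariates, X1* = [X0*|Xe*]

# projection onto the orthogonal complement of span(X0*)
P0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)
V_inv = Xe.T @ (np.eye(delta) - P0) @ Xe
V = np.linalg.inv(V_inv)
print(V)
```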
Example 2: EPP & PEP using sufficient statistics (cont.)
Method | E[βe|β0, σ1] | V[βe|β0, σ1]
EPP    | 0            | (2δ−3)/(δ+k0−3) σ1² V → 0 as δ → ∞
PEP    | 0            | δ(2δ−3)/(δ+k0−3) σ1² V → 2σ1² A⁻¹ as δ → ∞

where we assume that A = lim_{δ→∞} δ⁻¹ V⁻¹ is a positive semi-definite matrix.
PEP & Information Consistency
For any model Mi, if {y_m, m = 1, 2, ...} is a sequence of data vectors of fixed size such that, as m → ∞,

\[ \Lambda_{i0}(y_m) = \frac{\sup_{a, \beta_i} f_i(y_m \mid a, \beta_i)}{\sup_a f_0(y_m \mid a)} \to \infty, \]

then BF_{i0}(y_m) → ∞.
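For contrast, a small illustration of why this property fails for fixed-g g-priors, using the standard closed-form g-prior Bayes factor (e.g. Liang et al., 2008; the formula and the numbers below are external illustrations, not results from this talk): as R² → 1 the likelihood ratio Λ diverges, yet BF₁₀ remains bounded.

```python
# Zellner g-prior Bayes factor (Liang et al., 2008):
#   BF_{10} = (1+g)^{(n-k-1)/2} * [1 + g(1-R^2)]^{-(n-1)/2}.
# As R^2 -> 1 it converges to the finite bound (1+g)^{(n-k-1)/2}:
# information inconsistency for fixed g.
n, k, g = 30, 3, 30.0  # sample size, extra covariates, g = n (assumed)

for R2 in [0.9, 0.99, 0.9999, 1 - 1e-8]:
    bf = (1 + g) ** ((n - k - 1) / 2) * (1 + g * (1 - R2)) ** (-(n - 1) / 2)
    print(R2, bf)  # increases, but only up to the bound below

print("bound:", (1 + g) ** ((n - k - 1) / 2))
```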
PEP & Model Selection Consistency
For fixed p and growing n, the PEP method is consistent; i.e. BF^{PEP}_{jT} → 0 as n → ∞, where Mj ≠ MT and MT is the true data-generating regression model.
An Alternative Approach
In this case we consider the part of the likelihood (or of the power likelihood, under the PEP approach) of y*_1, ..., y*_δ that depends on both T*ℓ and θℓ, and we normalize it with respect to T*ℓ in order to form a pdf of T*ℓ. Therefore

\[ g_\ell(u^*_1, \dots, u^*_{k_\ell} \mid \theta_\ell, M_0) = \frac{\exp\left( \sum_{i=1}^{k_\ell} w_{\ell i}(\theta_\ell)\, u^*_i \right)}{\displaystyle\int \exp\left( \sum_{i=1}^{k_\ell} w_{\ell i}(\theta_\ell)\, u^*_i \right) du^*_1 \cdots du^*_{k_\ell}}. \]

Again, in the above we assume that the imaginary data have been generated from model M0 (i.e. M* = M0). Furthermore, again under this approach, the EPP (or PEP) under model M0 is equal to the reference prior π0^N(θ0).
Example 1 (cont.)
Method | π1(µ, σ)                                           | E[µ|σ] | V[µ|σ]
EPP 1  | σ⁻¹ ∫₀¹ fN(µ; 0, σ²/(δt)) fB(t; (δ−1)/2, δ/2) dt   | 0      | (σ²/δ)(2δ−3)/(δ−3)
EPP 2  | σ⁻¹ ∫₀¹ fN(µ; 0, σ²/(δt)) fB(t; 1, 3/2) dt         | 0      | ∞ (for δ fixed); 0 (for δ → ∞)
PEP 1  | σ⁻¹ ∫₀¹ fN(µ; 0, σ²/t) fB(t; (δ−1)/2, δ/2) dt      | 0      | σ²(2δ−3)/(δ−3)
PEP 2  | σ⁻¹ ∫₀¹ fN(µ; 0, σ²/t) fB(t; 1, 3/2) dt            | 0      | ∞ (δ free)

(Methods labelled 1 use the sampling distribution of the sufficient statistics; methods labelled 2 use the alternative approach.)
Example 2 (cont.)
Method | π1(β1, σ1)
EPP 1  | σ1^−(k0+1) ∫₀¹ fN(βe; 0, (σ1²/t) V) fB(t; (δ+k0−k1)/2, (δ−k0)/2) dt
EPP 2  | σ1^−(k0+1) ∫₀¹ fN(βe; 0, (σ1²/t) V) fB(t; (k0+2)/2, (k1−k0+2)/2) dt
PEP 1  | σ1^−(k0+1) ∫₀¹ fN(βe; 0, (δσ1²/t) V) fB(t; (δ+k0−k1)/2, (δ−k0)/2) dt
PEP 2  | σ1^−(k0+1) ∫₀¹ fN(βe; 0, (δσ1²/t) V) fB(t; (k0+2)/2, (k1−k0+2)/2) dt

In the above we have β1 = (β0ᵀ, βeᵀ)ᵀ,

\[ V^{-1} = X_e^{*T}\left(I_\delta - X_0^{*}(X_0^{*T}X_0^{*})^{-1}X_0^{*T}\right)X_e^{*} \quad \text{and} \quad X_1^{*} = [X_0^{*} \mid X_e^{*}]. \]
Example 2 (cont.)
Method | E[βe|β0, σ1] | V[βe|β0, σ1]
EPP 1  | 0            | (2δ−3)/(δ+k0−3) σ1² V → 0 as δ → ∞
EPP 2  | 0            | ((k1+2)/k0) σ1² V → 0 as δ → ∞ (k0 ≠ 0)
PEP 1  | 0            | δ(2δ−3)/(δ+k0−3) σ1² V → 2σ1² A⁻¹ as δ → ∞
PEP 2  | 0            | δ((k1+2)/k0) σ1² V → ((k1+2)/k0) σ1² A⁻¹ as δ → ∞ (k0 ≠ 0)

where we assume that A = lim_{δ→∞} δ⁻¹ V⁻¹ is a positive semi-definite matrix.
Discussion
• PEP priors have attractive characteristics, such as being centered around the null model and not concentrating (point masses) around the null as the training sample size is allowed to grow.
• We proved the equivalence between the EPP & PEP based on individual training samples and those based on sufficient statistics.
• Alternative approach: use of a different sampling distribution of the sufficient statistics to construct the PEP priors. Results are promising, but a full justification is still missing.
• PEP induces a model selection method that: i) for fixed p and fixed n, is free of information inconsistency; ii) for fixed p and growing n, is consistent; iii) for fixed p, is robust with respect to the training sample size.
Thank You Mexico!